Special tokens like [CLS] and [SEP] in Sentence Transformer models serve structural and functional roles inherited from their underlying transformer architectures (e.g., BERT). These tokens help the model process input sequences effectively and generate meaningful sentence embeddings. Here's how they work:
1. Structural Separation and Contextual Signals
The [SEP] token acts as a separator between distinct segments of text, such as two sentences in a pair. For example, in training tasks like semantic similarity, inputs might be structured as [CLS] Sentence A [SEP] Sentence B [SEP], allowing the model to differentiate between the two sentences. This separation is critical for tasks requiring cross-sentence interaction, such as sentence-pair classification. Meanwhile, [CLS] (short for "classification") is typically the first token in the input and is designed to aggregate global information. In vanilla BERT, the [CLS] token's hidden state is used for classification tasks. However, in Sentence Transformers its role can vary: some models use it directly as the sentence embedding, while others apply pooling strategies (e.g., mean pooling of all token embeddings) to derive the final representation.
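To make this concrete, the sketch below uses a Hugging Face tokenizer to show where [CLS] and [SEP] land when a sentence pair is encoded. The checkpoint name bert-base-uncased is only an illustrative assumption; the actual tokenizer and vocabulary depend on the specific Sentence Transformer model you use.

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; substitute the tokenizer of your own model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair: the tokenizer inserts [CLS] and [SEP] itself.
encoded = tokenizer("The cat sat on the mat.", "A feline rested on a rug.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'cat', ..., '[SEP]', 'a', 'feline', ..., '[SEP]']
```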
2. Handling Input Formatting and Padding
Special tokens like [PAD] are used to standardize input lengths by padding shorter sequences, enabling efficient batch processing. Models ignore these tokens during computation using attention masks. Proper use of [SEP] and [CLS] ensures the model adheres to the expected input structure. For instance, when encoding a single sentence, the input might still include [CLS] at the start and [SEP] at the end (e.g., [CLS] This is a sentence [SEP]), maintaining consistency with pretraining formats. This consistency is crucial because Sentence Transformers are often fine-tuned versions of BERT-like models, which rely on these tokens during pretraining.
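As a rough illustration of padding and attention masks, the sketch below batches two sentences of different lengths. Again, bert-base-uncased is just a stand-in for whatever tokenizer your model ships with.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

batch = tokenizer(
    ["This is a sentence", "A noticeably longer sentence that needs no padding"],
    padding=True,            # shorter sequences are padded with [PAD]
    return_tensors="pt",
)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))
# ['[CLS]', 'this', 'is', 'a', 'sentence', '[SEP]', '[PAD]', '[PAD]', ...]
print(batch["attention_mask"][0])
# tensor([1, 1, 1, 1, 1, 1, 0, 0, ...])  -> zeros mark [PAD] positions to ignore
```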
3. Impact on Embedding Quality
While [CLS] embeddings theoretically capture global context, Sentence Transformers often improve performance by pooling token embeddings instead. For example, models like Sentence-BERT (SBERT) use mean pooling over all non-special tokens (excluding [CLS] and [SEP]) to create sentence embeddings, as this approach empirically produces more robust representations. Developers must follow the tokenization and formatting rules specific to their chosen model (e.g., adding [SEP] between pairs during training) to ensure compatibility and optimal performance. Mismanaging these tokens, such as omitting [SEP] in a sentence pair, can lead to degraded embeddings or incorrect results.
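The snippet below sketches one way to implement such pooling by averaging only non-special, non-padding token embeddings. It assumes a BERT-style checkpoint (bert-base-uncased here, purely as an example) and is a simplified illustration, not the exact pooling code of any particular Sentence Transformer.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # example checkpoint, not a recommendation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

enc = tokenizer(
    ["This is a sentence", "Another, somewhat longer sentence"],
    padding=True,
    return_tensors="pt",
    return_special_tokens_mask=True,
)
special_mask = enc.pop("special_tokens_mask")          # 1 at [CLS]/[SEP]/[PAD]
with torch.no_grad():
    token_embeddings = model(**enc).last_hidden_state  # (batch, seq_len, hidden)

# Keep only attended, non-special positions, then mean-pool over them.
keep = (enc["attention_mask"] * (1 - special_mask)).unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * keep).sum(dim=1) / keep.sum(dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 768]) for this checkpoint
```

In practice, most Sentence Transformer checkpoints bundle their own pooling configuration, so calling the library's encode method handles this step for you; the sketch is only meant to show what happens under the hood.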