Using the [CLS] token embedding directly from models like BERT often underperforms the pooling strategies used in Sentence Transformers because the [CLS] token isn’t necessarily optimized for semantic similarity tasks. While the [CLS] token is designed to capture aggregated sentence-level information during pretraining (e.g., for classification tasks), its quality depends heavily on the pretraining objective. BERT’s [CLS] token, for example, is trained to predict a single label for the whole input (next-sentence prediction), a coarse objective that doesn’t require the dense vector representations needed to capture fine-grained semantic relationships between sentences. In contrast, pooling strategies like mean or max pooling aggregate information across all token embeddings, which can better represent the full context of the sentence.
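The difference is easy to see on raw BERT outputs. The sketch below is a minimal illustration, assuming the Hugging Face `transformers` library and the public `bert-base-uncased` checkpoint (neither is specified above); it extracts both the [CLS] vector and an attention-mask-aware mean-pooled vector from the same forward pass:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The product is great, but the delivery was slow."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

token_embeddings = outputs.last_hidden_state          # (batch, seq_len, hidden)

# Option 1: take the [CLS] token (position 0) as the sentence embedding.
cls_embedding = token_embeddings[:, 0]                # (batch, hidden)

# Option 2: mean-pool over real tokens only, using the attention mask
# so padding positions do not dilute the average.
mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
mean_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

print(cls_embedding.shape, mean_embedding.shape)      # both (1, 768)
```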
Another key factor is how models are fine-tuned. Sentence Transformers often use contrastive or triplet loss objectives during training, which explicitly optimize the distance between sentence embeddings. These losses work best when applied to pooled embeddings because they encourage the model to distribute semantic information across all tokens rather than relying on a single token like [CLS]. For instance, averaging token embeddings can mitigate the risk of overfitting to noise in a single token’s representation. Additionally, during fine-tuning, the pooling layer itself can be part of the trained architecture, allowing the model to learn how to weight or combine token embeddings effectively. The [CLS] token, however, isn’t always updated in a way that maximizes its utility for downstream tasks like retrieval or clustering.
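As a rough illustration of how the pooling layer becomes part of the trained architecture, the following sketch uses the `sentence-transformers` training API; the backbone checkpoint, the toy sentence pairs, and the choice of `MultipleNegativesRankingLoss` are illustrative assumptions, not the only setup the library supports:

```python
# Hedged sketch (v2-style model.fit API): a transformer backbone followed by an
# explicit pooling module, trained with a contrastive-style loss that operates
# on the pooled sentence embeddings.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=128)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",  # "cls" or "max" are also available
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Toy positive pairs; in practice these come from NLI, duplicate-question,
# or query-passage datasets.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."]),
    InputExample(texts=["The delivery was slow.", "Shipping took a long time."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss is a contrastive objective: it pulls pooled
# embeddings of paired sentences together and pushes apart in-batch negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```

Because the pooling module is explicit, swapping `pooling_mode="mean"` for `"cls"` leaves the rest of the pipeline unchanged, which is how ablations between the two strategies are typically run.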
Finally, practical implementation details matter. The [CLS] token occupies a fixed position, and the representation it accumulates through attention doesn’t always align with the semantic focus of a variable-length input. For example, in a sentence like “The product is great, but the delivery was slow,” a [CLS] embedding that hasn’t been fine-tuned for similarity might disproportionately emphasize the first few tokens, while mean pooling balances contributions from both the positive (“great”) and negative (“slow”) aspects. Pooling also reduces sensitivity to positional biases introduced during pretraining. In practice, frameworks like Sentence Transformers default to pooling in most released models because it consistently outperforms raw [CLS] embeddings in benchmarks, especially when combined with techniques like layer-wise pooling or normalization.
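For completeness, here is a minimal usage sketch with a released Sentence Transformers checkpoint; `all-MiniLM-L6-v2` is an assumed example (it ships with a mean-pooling module), and the sentences are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The product is great, but the delivery was slow.",
    "Shipping took forever, although the item itself is excellent.",
    "The weather was sunny all week.",
]
# normalize_embeddings=True makes cosine similarity a simple dot product.
embeddings = model.encode(sentences, normalize_embeddings=True)

scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)  # the paraphrase should score well above the unrelated sentence
```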