Sentence Transformers generate fixed-length sentence embeddings from models like BERT or RoBERTa by combining transformer layers with pooling operations and fine-tuning for semantic similarity. Here’s how it works:
First, transformer models like BERT process input text into contextualized token embeddings. The number of these embeddings varies with the input sentence, since each token (e.g., word or subword) receives its own vector. To convert this variable-length output into a fixed-length representation, Sentence Transformers apply a pooling layer. The most common method is mean pooling, which averages the token embeddings across the sequence: if BERT outputs 10 token vectors of 768 dimensions each, mean pooling averages them into a single 768-dimensional vector. Other pooling strategies include taking the maximum value per dimension (max pooling) or using the [CLS] token embedding, though the latter typically requires fine-tuning to work well.
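Here is a minimal sketch of mean pooling over raw BERT outputs using the Hugging Face `transformers` library (the model name and sentence are just illustrative; any encoder works the same way):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

encoded = tokenizer("Sentence embeddings are useful.", return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (1, seq_len, 768)

# Average only over real tokens, using the attention mask to ignore padding.
mask = encoded["attention_mask"].unsqueeze(-1).float()     # (1, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```

Masking before averaging matters once sentences are batched, since padded positions would otherwise drag the mean toward zero.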
Second, Sentence Transformers fine-tune the base transformer model using contrastive learning objectives. Unlike BERT or RoBERTa, which are pretrained for masked language modeling, Sentence Transformers are trained to optimize semantic similarity. For instance, during training, the model might process sentence pairs (e.g., a query and a relevant answer) and minimize the distance between their embeddings while pushing unrelated sentences apart. Loss functions like triplet loss or cosine similarity loss enforce this behavior. This step is critical because raw BERT embeddings without fine-tuning often perform poorly for semantic tasks like clustering or retrieval.
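As a rough sketch of such contrastive fine-tuning with the `sentence-transformers` library (the query/answer pairs below are made-up placeholders, and the tiny batch is for illustration only):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from a plain BERT checkpoint; the library adds mean pooling on top.
model = SentenceTransformer("bert-base-uncased")

train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "Click 'Forgot password' on the login page."]),
    InputExample(texts=["What is the refund policy?",
                        "Refunds are issued within 30 days of purchase."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss pulls each (query, answer) pair together and
# treats the other answers in the batch as negatives to push apart.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
```

In practice this is run over millions of pairs; the choice of loss (triplet, cosine similarity, in-batch negatives) depends on how the training data is labeled.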
Finally, the architecture combines these components. A typical setup uses a siamese network: two identical transformer encoders (sharing weights) process input pairs, each followed by pooling and, in some models, a dense projection layer. For example, the all-mpnet-base-v2 model applies mean pooling to the outputs of MPNet (a BERT-style encoder pretrained with a combination of masked and permuted language modeling) and then normalizes the resulting vector. The result is a fixed-length vector (e.g., 768 dimensions) that captures semantic meaning, enabling tasks like semantic search or text classification without per-task architectural changes. By combining pooling with contrastive fine-tuning, Sentence Transformers overcome the limitations of raw transformer outputs and produce embeddings optimized for downstream applications.
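To illustrate the end result (a minimal sketch; the example sentences are invented), encoding a few sentences with all-mpnet-base-v2 yields fixed-length vectors whose cosine similarities track semantic relatedness:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode([
    "A man is playing a guitar.",
    "Someone is performing music on a stringed instrument.",
    "The stock market fell sharply today.",
])
print(embeddings.shape)                             # (3, 768) fixed-length vectors
print(util.cos_sim(embeddings[0], embeddings[1]))   # related pair: high score
print(util.cos_sim(embeddings[0], embeddings[2]))   # unrelated pair: low score
```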