A typical Sentence Transformer model, such as Sentence-BERT (SBERT), is designed to generate dense, fixed-length vector representations (embeddings) for sentences or short texts. The architecture combines a pre-trained transformer model like BERT with a pooling layer and fine-tuning strategies tailored for sentence-level semantics. Unlike standard BERT, which processes sentence pairs jointly for tasks like classification, SBERT uses a Siamese or triplet network structure. This means input sentences are processed independently through the same transformer encoder, and their outputs are aggregated into sentence embeddings using pooling. The model is trained to optimize the similarity between embeddings of semantically related sentences.
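A minimal sketch of this Siamese usage pattern is shown below, using the sentence-transformers library (the reference implementation of SBERT). The checkpoint name "all-MiniLM-L6-v2" is just one publicly available model chosen for illustration; any sentence-transformer checkpoint would work the same way.

```python
from sentence_transformers import SentenceTransformer, util

# One shared encoder; both sentences pass through it independently.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A man is playing a guitar.", "Someone performs music on a guitar."]
embeddings = model.encode(sentences, convert_to_tensor=True)  # shape: (2, dim)

# Semantic relatedness is read off as cosine similarity between the embeddings.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(float(similarity))
```

Because each sentence is embedded once and comparisons happen in vector space, this structure scales to large collections far better than scoring every sentence pair jointly.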
The core components include a base transformer model (e.g., BERT, RoBERTa), a pooling layer, and a loss function. The base transformer processes input tokens, producing contextualized token embeddings. The pooling layer then aggregates these token-level embeddings into a single sentence vector. Common pooling methods include taking the mean of all token embeddings (mean pooling), using the [CLS] token's embedding, or applying max pooling. SBERT most often uses mean pooling, which averages the output embeddings of all tokens in the sentence. This captures the overall semantic content better than relying solely on the [CLS] token, which was shown to underperform on sentence-level similarity tasks in early BERT models. After pooling, the sentence embeddings are often L2-normalized, so that cosine similarity reduces to a dot product and similarity scores are directly comparable across sentences.
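The sketch below shows mask-aware mean pooling and L2 normalization with plain PyTorch and Hugging Face transformers. The choice of "bert-base-uncased" as the base encoder is purely illustrative; the pooling logic itself is the standard approach of averaging only over non-padding tokens.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool(token_embeddings, attention_mask):
    # Zero out padding positions, then average over the real tokens only.
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

batch = tokenizer(["The cat sits on the mat."], padding=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden)

sentence_embedding = mean_pool(token_embeddings, batch["attention_mask"])
sentence_embedding = F.normalize(sentence_embedding, p=2, dim=1)  # L2 normalization
print(sentence_embedding.shape)  # torch.Size([1, 768])
```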
Training involves fine-tuning the model with objectives that enforce semantic similarity. For instance, SBERT might use a triplet loss, where the model learns to minimize the distance between an anchor sentence and a positive example (semantically similar) while maximizing the distance to a negative example (dissimilar). Alternatively, a contrastive loss or cosine similarity loss can be used, depending on the task. In the original SBERT paper, the model was trained on Natural Language Inference (NLI) datasets, where sentence pairs are labeled as entailment, contradiction, or neutral. The two sentence embeddings u and v are concatenated with their element-wise difference as (u, v, |u − v|) and fed to a softmax classifier over these labels, so the model indirectly learns embeddings that cluster similar sentences. This architecture's effectiveness stems from its ability to produce embeddings that preserve semantic relationships, enabling efficient similarity comparisons via cosine distance, which is a key advantage over computationally expensive methods like cross-encoder re-ranking.
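The following is a hedged sketch of that NLI classification objective in PyTorch. The embeddings u and v here are random stand-ins for the encoder-plus-pooling outputs shown earlier, and the hidden size of 768 and batch size of 8 are arbitrary choices for illustration; only the (u, v, |u − v|) feature construction and the 3-way softmax head follow the setup described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 768
classifier = nn.Linear(3 * hidden, 3)   # 3 NLI labels: entailment / contradiction / neutral

u = torch.randn(8, hidden)              # stand-in embeddings of the first sentences
v = torch.randn(8, hidden)              # stand-in embeddings of the second sentences
labels = torch.randint(0, 3, (8,))      # stand-in NLI labels

# Concatenate u, v, and their element-wise absolute difference, then classify.
features = torch.cat([u, v, torch.abs(u - v)], dim=1)   # (8, 3 * hidden)
logits = classifier(features)
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()
print(float(loss))
```

In full training the gradient would also flow back through the shared encoder, so the sentence embeddings themselves are shaped by the NLI signal even though the classifier head is discarded at inference time.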
