Sentence Transformers are typically trained on datasets that emphasize semantic relationships between sentences. The most common include SNLI (Stanford Natural Language Inference), the STS (Semantic Textual Similarity) benchmarks, and combined corpora such as AllNLI (SNLI merged with MultiNLI). These datasets teach models to map sentences with similar meanings close together in the embedding space while pushing unrelated ones apart.
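"Closer in the embedding space" is usually measured with cosine similarity. A minimal NumPy sketch with toy 4-dimensional vectors (real sentence embeddings are typically 384 to 1024 dimensions; the vectors here are illustrative, not model outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" standing in for encoded sentences.
similar_a = np.array([0.9, 0.1, 0.0, 0.2])
similar_b = np.array([0.8, 0.2, 0.1, 0.3])   # nearby direction: similar meaning
unrelated = np.array([0.0, 0.0, 1.0, -0.5])  # distant direction: unrelated meaning

print(cosine_similarity(similar_a, similar_b))  # high (close to 1)
print(cosine_similarity(similar_a, unrelated))  # low
```

A well-trained model produces embeddings with exactly this property: paraphrases score high, unrelated sentences score low.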
SNLI contains about 570,000 sentence pairs labeled as entailment, contradiction, or neutral. It trains models to recognize logical relationships between sentences, which is useful for general-purpose embeddings. For example, if a premise ("a man plays guitar") entails a hypothesis ("a musician is performing"), the model learns to align their embeddings. The STS benchmarks (e.g., STS-B) provide sentence pairs annotated with human similarity scores from 0 to 5. Training on STS data teaches the model to quantify similarity, which is critical for tasks like retrieval or clustering. Models typically refine embeddings on these datasets with a similarity-based objective such as cosine similarity loss, which regresses the predicted cosine similarity onto the gold score.
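The cosine-similarity objective on STS data can be sketched as a squared error between the predicted cosine similarity of the two embeddings and the gold score rescaled from the 0–5 annotation range to [0, 1]. This NumPy version mirrors the idea (the rescaling convention is an assumption; frameworks differ in the exact scaling):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_similarity_loss(emb1, emb2, gold_score, max_score=5.0):
    """Squared error between predicted cosine similarity and the gold
    STS score rescaled from [0, max_score] to [0, 1]."""
    target = gold_score / max_score
    pred = cosine(emb1, emb2)
    return (pred - target) ** 2

u = np.array([1.0, 0.0, 0.0])
# Identical embeddings with gold score 5: cosine 1.0, target 1.0, loss 0.
print(cosine_similarity_loss(u, u, gold_score=5.0))  # 0.0
```

Minimizing this loss over many annotated pairs is what calibrates embedding distances to human similarity judgments.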
Other datasets include AllNLI, which combines SNLI and MultiNLI (the latter covering diverse genres such as fiction and government reports), and Quora Question Pairs (roughly 400,000 question pairs labeled as duplicates or not). These expand the model’s ability to handle paraphrasing and domain variation. MS MARCO, a large-scale retrieval dataset, trains models to match queries to relevant passages. Some approaches also benefit from Wikipedia or BooksCorpus indirectly, since base models like BERT are pretrained on them before fine-tuning on task-specific data.
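The retrieval setting that MS MARCO trains for reduces, at inference time, to ranking passage embeddings by similarity to a query embedding. A sketch with toy vectors standing in for encoded text (the vectors and the relevance pattern are illustrative assumptions):

```python
import numpy as np

def rank_passages(query_emb: np.ndarray, passage_embs: np.ndarray):
    """Return passage indices sorted by descending cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q                      # cosine similarity via normalized dot products
    return np.argsort(-scores), scores

query = np.array([1.0, 0.2, 0.0])
passages = np.array([
    [0.1, 0.0, 1.0],   # off-topic passage
    [0.9, 0.3, 0.1],   # relevant passage
    [0.0, 1.0, 0.0],   # partially related passage
])
order, scores = rank_passages(query, passages)
print(order)  # relevant passage (index 1) ranked first
```

Training on query–passage pairs pushes relevant passages toward their queries so that this ranking surfaces them first.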
Developers often combine these datasets to improve generalization. For example, Sentence-BERT is first trained on SNLI+MultiNLI to learn sentence relations, then fine-tuned on STS data for similarity scoring. Training may use triplet loss (with anchor, positive, and negative examples) or multiple objectives across datasets. This multi-dataset approach makes embeddings useful across diverse applications such as semantic search, text classification, and retrieval. By leveraging varied data sources, models capture broad semantic patterns while avoiding overfitting to narrow tasks.
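The triplet objective mentioned above can be written as a hinge loss: the anchor–positive distance must beat the anchor–negative distance by at least a margin. A minimal sketch using the Euclidean-distance variant (the margin value and toy vectors are assumptions):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss: zero once the positive is closer to the anchor than the
    negative by at least `margin`, otherwise proportional to the violation."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([1.0, 0.0])   # e.g. embedding of "a man plays guitar"
positive = np.array([0.9, 0.1])   # paraphrase: "a musician is performing"
negative = np.array([-1.0, 0.5])  # unrelated sentence
print(triplet_loss(anchor, positive, negative))  # 0.0 (already well separated)
```

During training, gradients flow only through violating triplets, so the model spends its capacity on pairs it has not yet separated.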