Sentence Transformers enable zero-shot and few-shot scenarios by generating dense semantic embeddings that capture the meaning of text in a way that generalizes across tasks. These models are pre-trained on large datasets with objectives such as semantic similarity, which teach them to map sentences with similar meanings to nearby points in a high-dimensional vector space. For example, a model trained on Natural Language Inference (NLI) data learns to judge whether one sentence entails, contradicts, or is neutral toward another. This pre-training lets the embeddings act as a reusable semantic representation, even for unseen tasks. When applied to a new task with no training data (zero-shot), developers can compare the embeddings of a query and candidate texts using cosine similarity, effectively ranking candidates by semantic relevance without any task-specific fine-tuning.
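The zero-shot ranking step can be sketched as follows. The vectors here are toy stand-ins for real embeddings, which in practice would come from something like `model.encode(query)` and `model.encode(candidates)` with a pre-trained checkpoint; only the cosine-ranking mechanics are illustrated.

```python
import numpy as np

def cosine_scores(query_vec, candidate_vecs):
    """Cosine similarity between a query vector and each row of a matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return c @ q

# Toy stand-ins for embeddings produced by a Sentence Transformer.
query_vec = np.array([0.9, 0.1, 0.0])
candidate_vecs = np.array([
    [0.8, 0.2, 0.1],   # semantically close to the query
    [0.0, 0.1, 0.95],  # unrelated
    [0.7, 0.3, 0.0],   # also close
])

scores = cosine_scores(query_vec, candidate_vecs)
ranking = np.argsort(-scores)  # candidate indices, best match first
print(ranking)                 # → [0 2 1]
```

No labels or training are involved: relevance falls out of where the pre-trained model placed each sentence in the embedding space.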
The key to this capability lies in contrastive learning objectives such as triplet loss or Multiple Negatives Ranking (MNR) loss, which explicitly train the model to distinguish relevant from irrelevant sentence pairs. In triplet loss, for instance, the model learns to keep an anchor sentence closer to a positive example than to a negative example by at least a fixed margin. This creates embeddings where semantically related texts cluster together, making similarity comparisons effective even in new domains. In few-shot scenarios, developers can provide a small number of labeled examples (e.g., 5-10 query-result pairs) to create a reference set. These examples are embedded once, and new inputs are compared against them using similarity metrics or lightweight classifiers such as k-nearest neighbors. This works because the embeddings already encode generalized semantic relationships from pre-training, so they need minimal adjustment for specific tasks.
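A minimal sketch of the few-shot reference-set approach: a handful of labeled embeddings are stored, and a new input is classified by majority vote among its nearest neighbors. The vectors and labels below are invented for illustration; real ones would come from encoding labeled example texts with the model.

```python
import numpy as np

def knn_label(query_vec, ref_vecs, ref_labels, k=3):
    """Classify by majority vote among the k nearest reference embeddings."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    r = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    sims = r @ q
    top_k = np.argsort(-sims)[:k]
    votes = [ref_labels[i] for i in top_k]
    return max(set(votes), key=votes.count)

# Small labeled reference set (toy vectors standing in for model output).
ref_vecs = np.array([
    [0.9, 0.1, 0.0],  # "billing"
    [0.8, 0.2, 0.1],  # "billing"
    [0.1, 0.9, 0.2],  # "shipping"
    [0.0, 0.8, 0.3],  # "shipping"
    [0.2, 0.7, 0.1],  # "shipping"
])
ref_labels = ["billing", "billing", "shipping", "shipping", "shipping"]

print(knn_label(np.array([0.85, 0.15, 0.05]), ref_vecs, ref_labels))
# → billing
```

Nothing in the model is updated here; the few labeled examples only anchor the decision, which is why so few of them suffice.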
A practical example is building a document retrieval system without labeled data. Using a pre-trained Sentence Transformer, you encode all documents into vectors up front. For a search query, you encode it into the same space and retrieve the top-k nearest neighbors by cosine similarity. In a few-shot case, if users provide three examples of "relevant" and "irrelevant" results for their specific domain, you can fine-tune the model on those examples with a contrastive loss for 1-2 epochs, adapting the embeddings to domain nuances. Retrieval stays fast because of the bi-encoder architecture used in Sentence Transformers: queries and documents are encoded separately, so document vectors can be pre-computed and similarity scores calculated in real time. The model's ability to generalize from pre-training while remaining adaptable with minimal data makes it particularly useful when collecting labeled data is expensive or impractical.
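The bi-encoder retrieval flow can be sketched as follows: the document matrix is normalized once, offline, and each incoming query then costs a single matrix-vector product plus a partial sort. As before, the vectors are toy stand-ins for `model.encode()` output.

```python
import numpy as np

# Offline pre-computation: encode and L2-normalize all documents once.
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
])
doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)

def top_k(query_vec, k=2):
    """Online query path: one matrix-vector product plus a partial sort."""
    q = query_vec / np.linalg.norm(query_vec)
    sims = doc_vecs @ q                      # cosine score for every document
    idx = np.argpartition(-sims, k - 1)[:k]  # top-k candidates, unordered
    return idx[np.argsort(-sims[idx])]       # order them best-first

print(top_k(np.array([0.85, 0.2, 0.05])))   # → [0 3]
```

Because only the cheap query path runs at search time, the same pattern scales to large corpora, typically by swapping the brute-force scoring for an approximate nearest-neighbor index.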