Sentence Transformer embeddings are context-dependent at the word level: they handle polysemy by using the surrounding sentence to adjust each word's representation dynamically. Here’s how it works:
Contextual Embeddings via Transformers: Sentence Transformers are built on transformer architectures like BERT, which generate token-level embeddings that depend on the entire input sentence. For example, the word "bank" in "I deposited money at the bank" and "The river flows near the bank" gets different token embeddings because the model uses self-attention to weigh relationships between words; the attention mechanism picks up the relevant context (e.g., "money" vs. "river") to disambiguate the meaning. When producing sentence embeddings, models like Sentence-BERT pool these contextual token embeddings (e.g., with mean pooling), so the resulting sentence vector reflects the collective context of all words and the disambiguated word senses carry through to the sentence level.
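To make this concrete, here is a minimal sketch (assuming the Hugging Face `transformers` and `torch` packages, with `bert-base-uncased` chosen arbitrarily as the backbone) that extracts the contextual token embedding of "bank" from the two sentences above and compares them:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_embedding(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]  # embedding at the position of `word`

v1 = token_embedding("I deposited money at the bank", "bank")
v2 = token_embedding("The river flows near the bank", "bank")

# The two "bank" vectors are not identical: each reflects its own sentence context.
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())
```

A static embedding method (e.g., word2vec) would return the same "bank" vector in both cases; here the similarity is below 1.0 because the surrounding words shift each occurrence.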
Handling Polysemy Through Training Objectives: Sentence Transformers are often fine-tuned on tasks like semantic textual similarity (STS) or paraphrase detection. During training, the model learns to map sentences with similar meanings closer together in the embedding space, even if they contain polysemous words. For instance, if two sentences use "bat" (e.g., "He swung the baseball bat" vs. "The bat flew at night"), the model adjusts the embeddings based on co-occurring words ("baseball" vs. "flew") to distinguish the sports equipment from the animal. Contrastive or triplet loss objectives explicitly push the model to separate such senses by comparing positive and negative sentence pairs.
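The following is a toy sketch of that kind of training setup, not a real STS run; it assumes the `sentence-transformers` package and uses `all-MiniLM-L6-v2` as an arbitrary base model. The "bat" pair that mixes two senses gets a low similarity label, so the loss pushes those sentence embeddings apart:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy labeled pairs: same-sense pairs get a high score, cross-sense pairs a low one.
train_examples = [
    InputExample(texts=["He swung the baseball bat",
                        "She gripped the wooden bat at home plate"], label=0.9),
    InputExample(texts=["He swung the baseball bat",
                        "The bat flew at night"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# Regression-style similarity objective; TripletLoss or a contrastive loss
# would be wired up the same way.
train_loss = losses.CosineSimilarityLoss(model)

# One short pass, purely to show how the training loop is assembled.
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=0)
```

In a real setup the examples would come from a large labeled corpus (e.g., STS benchmark or NLI-derived pairs); the point is only that the supervision signal, not any explicit sense inventory, is what teaches the model to separate the two uses of "bat".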
Practical Example and Limitations: Consider the word "light" in "The room is light" (bright) versus "The suitcase is light" (not heavy). The transformer’s attention mechanism focuses on "room" or "suitcase" to influence the embedding of "light." However, since Sentence Transformers produce fixed-length sentence embeddings (not word-level), polysemy resolution depends on how well the context is captured at the sentence level. While effective, this approach isn’t perfect—if context clues are weak (e.g., ambiguous sentences like "She saw the bat"), the model might struggle. But in most cases, the combination of transformer architecture and task-specific training enables robust handling of multiple meanings.
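A quick way to observe this at the sentence level (again assuming `sentence-transformers` and the arbitrary `all-MiniLM-L6-v2` checkpoint) is to check which of the two "light" sentences lands closer to an unambiguous paraphrase of one sense:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The room is light",      # light = bright
    "The suitcase is light",  # light = not heavy
    "The hallway is bright",  # unambiguous paraphrase of the first sense
]
emb = model.encode(sentences, convert_to_tensor=True)

# If the context is captured well, the "bright" paraphrase should score
# higher against the room sentence than against the suitcase sentence.
print("room vs. bright:    ", util.cos_sim(emb[0], emb[2]).item())
print("suitcase vs. bright:", util.cos_sim(emb[1], emb[2]).item())
```

Running the same check on a genuinely ambiguous sentence like "She saw the bat" illustrates the limitation: with no disambiguating context, the embedding ends up somewhere between the two senses rather than committing to either.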
