A Sentence Transformer model might produce unexpectedly low similarity scores for semantically similar sentences due to three primary factors: a mismatched training domain or objective, preprocessing inconsistencies, and suboptimal model configuration.
First, the model's training data and objective significantly impact its performance. If the model wasn't fine-tuned on domain-specific data, it might struggle with specialized terminology or phrasing. For example, a general-purpose model might fail to recognize that "myocardial infarction" and "heart attack" are medically equivalent. Similarly, if the training used a contrastive loss function with poorly chosen negative examples, the model might not learn nuanced semantic relationships effectively. A model trained primarily on short news headlines might also perform poorly on technical documentation or conversational language.
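A quick way to sanity-check domain coverage is to score a few known-equivalent pairs directly. Below is a minimal sketch using the sentence-transformers library; the model name and the expectation that a general-purpose checkpoint underrates the medical paraphrase are assumptions for illustration, not guarantees:

```python
from sentence_transformers import SentenceTransformer, util

# A general-purpose checkpoint (assumed here); swap in your own model.
model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("myocardial infarction", "heart attack"),    # domain paraphrase
    ("the car is red", "the automobile is red"),  # everyday paraphrase
]

for a, b in pairs:
    emb = model.encode([a, b], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"{a!r} vs {b!r}: cosine similarity = {score:.3f}")
```

If the everyday paraphrase scores well but the domain pair does not, that points to a training-domain mismatch rather than a configuration problem.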
Second, preprocessing inconsistencies can distort embeddings. Tokenization mismatches—such as differing treatments of hyphens ("state-of-the-art" vs. "state of the art"), capitalization ("US" vs. "us"), or numerical formats ("1,000" vs. "1000")—create divergent token sequences. Truncation of long sentences might remove critical context, while improper handling of negations ("good" vs. "not good") or multiword expressions ("hot dog") could alter meaning. For example, the model might process "fast food restaurant" and "quick meal establishment" differently if compound terms aren't recognized during tokenization.
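To check whether such surface differences actually change the model's input, inspect the tokenizer output directly. A small sketch (the model name is an assumption; SentenceTransformer exposes the underlying Hugging Face tokenizer via its `.tokenizer` attribute):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

variants = ["state-of-the-art", "state of the art", "US", "us", "1,000", "1000"]
for text in variants:
    # Divergent token sequences here mean the model sees different inputs.
    print(f"{text!r:22} -> {model.tokenizer.tokenize(text)}")

# Inputs longer than this limit are silently truncated.
print("max_seq_length:", model.max_seq_length)
```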
Third, configuration choices affect similarity measurements. Scoring with raw dot products on unnormalized embeddings (cosine similarity itself is scale-invariant, so magnitude issues only bite with dot-product scoring), selecting an inappropriate pooling strategy (mean vs. max pooling), or employing an undersized model architecture (e.g., a DistilBERT-based checkpoint instead of all-mpnet-base-v2) can all reduce sensitivity. Language mismatches (such as applying an English model to non-English text) or comparisons across dissimilar granularities (short phrases vs. paragraphs) also contribute. For instance, "Python list comprehension" (programming) and "reading list comprehension" (education) should score low despite shared vocabulary; a model lacking context awareness may instead be driven by lexical overlap, inflating scores for such mismatched pairs while underrating genuine paraphrases.
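The difference between scale-sensitive and scale-invariant scoring is easy to demonstrate, as is checking which pooling strategy a checkpoint actually uses. A sketch (model name and sample sentences assumed; `util.dot_score` and the `normalize_embeddings` flag are part of the sentence-transformers API):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
sentences = ["a short phrase", "a considerably longer sentence about the same topic"]
emb = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity ignores vector magnitude; the dot product does not.
print("cosine:", util.cos_sim(emb[0], emb[1]).item())
print("dot:   ", util.dot_score(emb[0], emb[1]).item())

# With normalized embeddings, dot product and cosine similarity agree.
emb_norm = model.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)
print("normalized dot:", util.dot_score(emb_norm[0], emb_norm[1]).item())

# Printing the model reveals its module stack, including the pooling mode.
print(model)
```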
To troubleshoot, verify that the model's training domain matches your use case, inspect tokenization outputs for consistency, and experiment with alternative similarity metrics or larger models. For domain-specific tasks, consider fine-tuning on labeled sentence pairs to improve performance, as sketched below.
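If a domain gap is confirmed, a small fine-tuning run on labeled pairs often helps. A minimal sketch using the classic sentence-transformers fit API; the model name, similarity labels, and hyperparameters are placeholders, not recommendations:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Labeled pairs: a target similarity in [0, 1] per sentence pair.
train_examples = [
    InputExample(texts=["myocardial infarction", "heart attack"], label=0.95),
    InputExample(texts=["myocardial infarction", "broken arm"], label=0.10),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# CosineSimilarityLoss pushes the pair's cosine similarity toward its label.
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("finetuned-domain-model")
```

In practice you would want hundreds to thousands of labeled pairs and a held-out evaluation set before trusting the fine-tuned model.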
