1. Choosing the Wrong Model or Ignoring Domain Specificity
A common mistake is using a generic pre-trained Sentence Transformer model without considering the domain of the data. For example, a model like all-MiniLM-L6-v2 works well for general text but can fail in specialized domains such as legal or medical text, where terminology and phrasing differ significantly. If your task involves technical jargon, a domain-specific model (e.g., one trained on scientific papers) will capture nuances better. Similarly, using a multilingual model for a monolingual task can introduce noise, since these models trade per-language performance for broad language coverage. Always verify that your use case aligns with the model's training data and objectives.
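The selection advice above can be sketched as a simple domain-to-checkpoint lookup. The checkpoint names below are real models on the Hugging Face Hub, but the mapping itself is an illustrative assumption, not an official recommendation; substitute whatever models your own evaluation favors.

```python
# Illustrative mapping from domain to a plausibly better-suited checkpoint.
# The generic default is all-MiniLM-L6-v2; the other entries are examples
# of specialized models, not an exhaustive or authoritative list.
DOMAIN_MODELS = {
    "general": "sentence-transformers/all-MiniLM-L6-v2",
    "scientific": "allenai/specter2_base",  # trained on scientific papers
    "multilingual": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
}

def pick_model(domain: str) -> str:
    """Return a checkpoint name for the given domain, falling back to general."""
    return DOMAIN_MODELS.get(domain, DOMAIN_MODELS["general"])

print(pick_model("scientific"))  # allenai/specter2_base
print(pick_model("finance"))     # no entry, falls back to the general model
```

The chosen name can then be passed to `SentenceTransformer(...)` as usual; the point is to make the domain decision explicit rather than hard-coding one generic model everywhere.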
2. Poor Input Preprocessing and Normalization
Sentence Transformers are sensitive to input formatting. Failing to clean text (e.g., leaving HTML tags, special characters, or inconsistent casing) can degrade embedding quality. For instance, a model might treat "COVID-19" and "covid 19" as unrelated terms if the text is not normalized. Another oversight is truncating long texts without considering context loss. A model like all-mpnet-base-v2 has a 384-token limit, so truncating a 500-word document could remove critical information. Additionally, skipping unit-vector normalization can skew results when similarity is computed as a raw dot product (as many vector indexes do): the dot product equals cosine similarity only when both vectors are scaled to unit length.
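Both preprocessing steps above can be sketched in plain Python. This is a minimal example assuming regex-based cleanup is acceptable for your corpus; production pipelines often need more careful handling (HTML parsers, Unicode normalization, language-aware tokenization).

```python
import math
import re

def clean_text(text: str) -> str:
    """Strip HTML tags, unify casing, and collapse punctuation/whitespace
    so surface variants like "COVID-19" and "covid 19" normalize alike."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = text.lower()                       # consistent casing
    text = re.sub(r"[-_/]", " ", text)        # treat hyphens/slashes as spaces
    text = re.sub(r"[^\w\s]", " ", text)      # drop remaining special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def to_unit_vector(vec: list[float]) -> list[float]:
    """Scale an embedding to unit length, so its dot product with another
    unit vector equals their cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

assert clean_text("<b>COVID-19</b>") == clean_text("covid 19")  # both -> "covid 19"
```

With the sentence-transformers library itself, `model.encode(..., normalize_embeddings=True)` performs the unit-length step for you.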
3. Misalignment Between Task and Similarity Metric
Using an inappropriate similarity metric or misinterpreting scores is another pitfall. For example, relying solely on cosine similarity for short texts (e.g., product titles) might not work if the embeddings lack semantic density. Conversely, using Euclidean distance for long documents might overemphasize irrelevant features. Another issue arises when the model's training objective doesn't match the task. A model trained for paraphrase detection (e.g., paraphrase-MiniLM-L3-v2) might struggle with topic-based similarity, as it is optimized to identify rephrased sentences rather than broader thematic connections. Always validate embeddings with task-specific evaluation (e.g., clustering metrics or downstream classifiers) rather than assuming generic similarity scores suffice.
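One concrete relationship worth knowing when weighing these metrics: on unit-normalized embeddings, squared Euclidean distance equals 2·(1 − cosine similarity), so the two metrics rank neighbors identically and the choice only matters for unnormalized vectors. The toy vectors below are illustrative, not model outputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Toy "embeddings" (illustrative values, not real model outputs).
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 3.0])

# On unit vectors: squared Euclidean distance == 2 * (1 - cosine similarity),
# so both metrics induce the same nearest-neighbor ranking.
assert abs(euclidean(a, b) ** 2 - 2 * (1 - cosine(a, b))) < 1e-9
```

The practical takeaway: normalize first, then the metric debate largely disappears; the harder question, as noted above, is whether the model's training objective matches your notion of similarity at all.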