The choice of distance metric (cosine vs. L2) is tightly coupled with how the embedding model was trained and the properties of its output vectors. Cosine similarity measures the angle between two vectors, ignoring their magnitudes, while L2 (Euclidean) distance accounts for both direction and magnitude. If the embedding model is optimized for one metric but evaluated with another, retrieval performance can degrade because the geometric relationships the model learned during training won’t align with the metric’s assumptions.
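To make the contrast concrete, here is a small NumPy sketch (not from the original text) comparing the two metrics on toy vectors that point in the same direction but differ in magnitude:

```python
# Minimal sketch: cosine ignores magnitude, L2 does not.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Angle-based similarity; invariant to vector magnitude."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance; sensitive to both direction and magnitude."""
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])   # same direction as a, ten times the magnitude

print(cosine_similarity(a, b))  # 1.0 -- identical direction, magnitude ignored
print(l2_distance(a, b))        # 9.0 -- large distance driven purely by magnitude
```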
For example, models trained with objectives like triplet loss or contrastive learning often assume a specific distance metric. If a model is designed to minimize cosine distance between similar pairs, its embeddings are typically normalized to unit length during training. For unit-length vectors, squared L2 distance is simply 2 minus twice the cosine similarity, so the two metrics are monotonically related and produce identical nearest-neighbor rankings; the choice only starts to matter if normalization is skipped somewhere in the pipeline and magnitudes creep back in. Conversely, a model trained without normalization to optimize L2 distance might encode meaningful information in vector magnitudes (e.g., confidence or feature intensity). Using cosine similarity here would discard that magnitude information, leading to mismatched comparisons.
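The unit-length equivalence is easy to verify numerically. The sketch below uses random NumPy vectors as stand-ins for real embeddings and checks both the identity and the resulting ranking agreement:

```python
# Sketch: for unit-length vectors, squared L2 distance and cosine similarity
# satisfy ||a - b||^2 = 2 - 2 * cos(a, b), so they rank neighbors identically.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize, as cosine-trained models do

query, docs = emb[0], emb[1:]
cos = docs @ query                           # cosine similarity (vectors are already unit length)
l2_sq = np.sum((docs - query) ** 2, axis=1)  # squared Euclidean distance

print(np.allclose(l2_sq, 2 - 2 * cos))   # True: the metrics are monotonically related
print(np.argsort(-cos), np.argsort(l2_sq))  # identical neighbor ordering
```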
A practical mismatch scenario occurs when using pre-trained models. For instance, Sentence-BERT models are often fine-tuned with cosine similarity objectives, and many of them output unit-length embeddings. If a developer uses L2 distance with these embeddings, retrieval rankings still match the cosine rankings because of the monotonic relationship noted above, though any logic that interprets absolute scores (such as similarity thresholds) will need adjusting. Conversely, computing "cosine" via a raw dot product on non-normalized embeddings (e.g., from older word2vec models) requires explicit L2 normalization beforehand to avoid magnitude skew. Without that step, the scores overemphasize high-magnitude vectors even when their directional similarity is low, because the dot product only equals cosine similarity once the vectors are unit length.
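As an illustration of that second pitfall, here is a hedged sketch assuming FAISS is available: calling faiss.normalize_L2 before indexing makes the inner-product search return true cosine similarities rather than magnitude-skewed dot products. The random arrays are placeholders for real embeddings.

```python
# Sketch (assumes FAISS): normalize explicitly before an inner-product index
# so returned scores are cosine similarities, not raw dot products.
import faiss
import numpy as np

dim = 384
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, dim)).astype("float32")   # stand-in for non-normalized embeddings
queries = rng.normal(size=(5, dim)).astype("float32")

faiss.normalize_L2(corpus)    # in-place unit normalization
faiss.normalize_L2(queries)

index = faiss.IndexFlatIP(dim)  # inner product over unit vectors == cosine similarity
index.add(corpus)
scores, ids = index.search(queries, 10)
```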
To avoid suboptimal results, developers should align the retrieval metric with the model’s training objective and preprocessing steps. Always check the model’s documentation: if it mentions normalization or recommends a specific metric, follow that guidance. When in doubt, test both metrics empirically—for some models, the difference might be negligible, but for others, the mismatch could significantly harm recall or ranking accuracy.
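One way to run that empirical comparison is a small recall@k harness like the sketch below. Here corpus, queries, and relevant_ids (the gold document index for each query) are hypothetical placeholders for your own labeled data:

```python
# Sketch: compare recall@k under cosine and L2 on a small labeled set.
import numpy as np

def recall_at_k(corpus: np.ndarray, queries: np.ndarray,
                relevant_ids: np.ndarray, k: int, metric: str) -> float:
    if metric == "cosine":
        c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        scores = q @ c.T                                   # higher is better
        topk = np.argsort(-scores, axis=1)[:, :k]
    else:  # "l2"
        dists = np.linalg.norm(queries[:, None, :] - corpus[None, :, :], axis=-1)
        topk = np.argsort(dists, axis=1)[:, :k]            # lower is better
    hits = (topk == relevant_ids[:, None]).any(axis=1)     # did the gold doc appear in the top k?
    return float(hits.mean())

# Usage (with real embeddings and labels in place of the placeholders):
# print(recall_at_k(corpus, queries, relevant_ids, k=10, metric="cosine"))
# print(recall_at_k(corpus, queries, relevant_ids, k=10, metric="l2"))
```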