One might choose the dot product as a similarity metric for embeddings that are not normalized because it inherently incorporates both the direction and magnitude of the vectors. This is useful in applications where the magnitude of the embedding carries meaningful information. For example, in recommendation systems, embeddings might represent user preferences or item characteristics. A larger vector magnitude could indicate stronger user preferences or more prominent item features. The dot product rewards vectors that are both aligned in direction and large in magnitude, which might correlate with higher relevance in such contexts. In contrast, cosine similarity normalizes vectors to unit length, discarding magnitude information and focusing purely on angular similarity. If the application benefits from retaining magnitude (e.g., distinguishing between "weak" and "strong" signals), the dot product is more appropriate.
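To make this concrete, here is a minimal sketch (using NumPy, with made-up user and item vectors) of how a dot-product score rewards both alignment and magnitude, while a cosine score ranks purely by direction:

```python
import numpy as np

# Hypothetical example: a user embedding and two item embeddings that point in
# the same direction but differ in magnitude (a "weak" vs. a "strong" signal).
user = np.array([1.0, 2.0, 0.5])
item_weak = np.array([0.5, 1.0, 0.25])   # same direction, small magnitude
item_strong = np.array([2.0, 4.0, 1.0])  # same direction, large magnitude

def dot_score(a, b):
    return float(np.dot(a, b))

def cosine_score(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The dot product distinguishes the two items by magnitude...
print(dot_score(user, item_weak), dot_score(user, item_strong))        # 2.625 vs. 10.5
# ...while cosine similarity treats them as identical (both perfectly aligned, ~1.0).
print(cosine_score(user, item_weak), cosine_score(user, item_strong))
```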
Mathematically, the dot product and cosine similarity are directly related. The dot product of two vectors A and B is defined as A · B = ||A|| ||B|| cos(θ), where θ is the angle between them. Cosine similarity is obtained by dividing the dot product by the product of the vectors' L2 norms: cos(θ) = (A · B) / (||A|| ||B||). This normalization bounds cosine similarity between -1 and 1 and makes it invariant to vector magnitudes. For unit-normalized vectors, the dot product equals cosine similarity. When vectors are not normalized, however, the dot product scales with the magnitudes of both vectors: if A doubles in magnitude, its dot product with B also doubles, while its cosine similarity with B is unchanged, since rescaling either vector affects only magnitude, not direction.
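A quick numerical check of these relationships, sketched with NumPy and arbitrary example vectors:

```python
import numpy as np

def cosine(a, b):
    # cos(theta) = (A . B) / (||A|| ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([3.0, 1.0])
B = np.array([2.0, 2.0])

print(np.dot(A, B), cosine(A, B))                   # baseline: 8.0 and ~0.894
print(np.dot(2 * A, B), cosine(2 * A, B))           # doubling A doubles the dot product; cosine unchanged
print(np.dot(2 * A, 3 * B), cosine(2 * A, 3 * B))   # rescaling either vector never changes cosine

# For unit-normalized vectors, the two metrics coincide:
A_hat, B_hat = A / np.linalg.norm(A), B / np.linalg.norm(B)
print(np.dot(A_hat, B_hat), cosine(A_hat, B_hat))
```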
The choice between the two metrics depends on the problem’s requirements. In neural networks, dot products are often used in final layers (e.g., to produce logits) because they preserve magnitude information learned during training. In contrast, cosine similarity is preferred for tasks like semantic text similarity, where direction (semantic meaning) matters more than magnitude. A practical example is word embeddings: a system working with unnormalized embeddings might score with dot products, which tends to prioritize frequently occurring words (these often have larger-magnitude embeddings), while a system working with normalized embeddings would use cosine similarity to focus on semantic alignment. The computational simplicity of the dot product (no need to compute norms) can also be a factor in high-throughput scenarios.
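As an illustration of the word-embedding point, here is a small hypothetical sketch (the vocabulary and vectors are invented for the example) in which a frequent, large-norm token ranks first under the dot product, while a directionally aligned token ranks first under cosine similarity:

```python
import numpy as np

# Hypothetical toy vocabulary: "the" is frequent and has a large-norm embedding,
# "feline" is rarer but points in almost the same direction as the query "cat".
embeddings = {
    "the":    np.array([4.0, 4.0, 0.0]),   # large magnitude, weakly related direction
    "feline": np.array([0.9, 0.1, 1.2]),   # small magnitude, closely aligned direction
    "car":    np.array([1.0, -1.0, 0.2]),
}
query = np.array([1.0, 0.0, 1.3])  # made-up embedding for "cat"

def rank(scores):
    # Return words ordered from highest to lowest score.
    return sorted(scores, key=scores.get, reverse=True)

dot_scores = {w: float(np.dot(query, v)) for w, v in embeddings.items()}
cos_scores = {w: float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v)))
              for w, v in embeddings.items()}

print(rank(dot_scores))  # ['the', 'feline', 'car']  -- dot product favors the large-norm "the"
print(rank(cos_scores))  # ['feline', 'car', 'the']  -- cosine favors the aligned "feline"
```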