To compute cosine similarity between two sentence embeddings in Python, you typically use NumPy or scikit-learn. Here are two straightforward approaches:
## Using NumPy
Cosine similarity measures the cosine of the angle between two vectors: the dot product of the vectors divided by the product of their magnitudes. For two embeddings `emb1` and `emb2` (1D arrays), the code is:
```python
import numpy as np

def cosine_similarity(emb1, emb2):
    # Dot product of the two vectors...
    dot_product = np.dot(emb1, emb2)
    # ...divided by the product of their magnitudes
    norm_emb1 = np.linalg.norm(emb1)
    norm_emb2 = np.linalg.norm(emb2)
    return dot_product / (norm_emb1 * norm_emb2)
```
This works for single vectors. If your embeddings are 2D (e.g., from a batch process), flatten them first with `emb1.flatten()`.
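A minimal sketch of that case, assuming a hypothetical 2D `batch` array with one embedding per row:

```python
# Hypothetical batch output: one embedding per row, shape (2, 3)
batch = np.array([[0.2, 0.5, 0.8],
                  [0.3, 0.4, 0.7]])

emb1 = batch[0].flatten()  # shape (3,), ready for the function above
emb2 = batch[1].flatten()
```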
## Using scikit-learn
For batch operations or pairwise comparisons, `sklearn.metrics.pairwise.cosine_similarity` is efficient. For two embeddings stored as 2D arrays of shape `(1, embedding_size)`:
```python
# Note: this import shares a name with the NumPy function defined above
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(emb1.reshape(1, -1), emb2.reshape(1, -1))[0][0]
```
The `reshape` calls convert 1D vectors to the 2D shape scikit-learn expects, and the `[0][0]` indexing extracts a scalar from the resulting 1×1 matrix.
## Key Considerations
- **Normalization**: If embeddings are pre-normalized to unit length (common in libraries like Sentence Transformers), cosine similarity simplifies to a dot product: `np.dot(emb1, emb2.T)`.
- **Batch processing**: For multiple embeddings, pass a matrix to `cosine_similarity` to get pairwise results (see the sketch after this list).
- **Performance**: NumPy is sufficient for small-scale tasks; scikit-learn optimizes batch operations.
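As a sketch of the first two points, assume a hypothetical `embeddings` matrix with one vector per row (the values are made up for illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical batch: three embeddings, one per row
embeddings = np.array([[0.2, 0.5, 0.8],
                       [0.3, 0.4, 0.7],
                       [0.9, 0.1, 0.2]])

# Pairwise similarities: entry (i, j) compares row i with row j
pairwise = cosine_similarity(embeddings)  # shape (3, 3)

# After unit-normalizing the rows, a plain dot product gives the same matrix
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
pairwise_via_dot = normalized @ normalized.T
```

This dot-product shortcut is why pre-normalized embeddings are convenient: similarity over a batch reduces to a single matrix multiplication.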
Example with actual values, using the NumPy function defined above:
```python
emb1 = np.array([0.2, 0.5, 0.8])
emb2 = np.array([0.3, 0.4, 0.7])
print(cosine_similarity(emb1, emb2))  # Output: ~0.988
```
This approach works with embeddings from any framework (PyTorch, TensorFlow) by converting the tensors to NumPy arrays first.
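For instance, a minimal sketch of the conversions, assuming hypothetical tensors `pt_emb` (PyTorch) and `tf_emb` (TensorFlow):

```python
# pt_emb and tf_emb are placeholder tensors, not defined here
emb1 = pt_emb.detach().cpu().numpy()  # PyTorch: detach from the graph, move to CPU, convert
emb2 = tf_emb.numpy()                 # TensorFlow (eager mode): convert directly
```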