Similarity Metrics for Vector Search
You can’t compare apples and oranges. Or can you? Vector databases like Milvus allow you to compare any data point you can vectorize. You can even do it right in your Jupyter Notebook. But how does a vector database similarity search work?
The ability to calculate the similarity between two vectors underpins a wide range of applications in mathematics, artificial intelligence, natural language processing, and machine learning. It is a prerequisite for nearest neighbor search and clustering in high dimensions, for many machine learning and data mining algorithms, and for measuring how well dimensionality reduction preserves the structure of the data. A suitable similarity metric directly impacts the quality of vector search and clustering results. However, many domain-specific search and clustering methods are built around one particular similarity measure and cannot handle other measures that are equally important in different domains.
Vector search has two critical conceptual components: indexes and distance metrics. Popular vector indexes include HNSW, IVF, and ScaNN. There are three primary distance metrics: L2 or Euclidean distance, cosine similarity, and inner product. Other metrics for binary vectors include the Hamming distance and the Jaccard index.
In this article, we’ll cover:
Vector Similarity Metrics
L2 or Euclidean
How Does L2 or Euclidean Distance Work?
When Should You Use Euclidean Distance?
Cosine Similarity
How Does Cosine Similarity Work?
When Should You Use Cosine Similarity?
Inner Product
How Does Inner Product Work?
When Should You Use Inner Product?
Other Interesting Vector Distance or Similarity Metrics
Hamming Distance
Jaccard Index and Jaccard Distance
Summary of Vector Similarity Search Metrics
Vector Similarity Metrics
Once we have vectorized our data, we need a way to compare two vectors (for example, a query vector and a stored data point) to see whether they are similar. But what do we mean by “similar”? We can define vector search as the search for vectors that are close to a query vector: vectors that are close in distance are similar in their features. A vector can be represented either as a list of numbers or as an orientation and a magnitude. The easiest way to picture this is to imagine vectors as line segments pointing in specific directions in space.
The L2 or Euclidean metric is the “hypotenuse” metric of two vectors. It measures the length of the straight line between the points where your vectors end.
The cosine similarity is based on the angle between your vectors where they meet (strictly, the cosine of that angle).
The inner product is the “projection” of one vector onto the other. Intuitively, it measures both the distance and angle between the vectors.
So, how do you choose which metric to use? The choice of similarity measure depends on the type and dimensionality of the data and the specific problem being solved. For example, when working with continuous data such as gene expression data, Euclidean distance or cosine similarity may be appropriate. For embedding models, a good rule of thumb is to use the metric the model was trained with; you can usually find that information on the model card on sites like Hugging Face.
L2 or Euclidean
The most common and intuitive distance metric is L2 or Euclidean distance. We can imagine this as the amount of space between two data objects. For example, how far your screen is from your face.
How Does L2 or Euclidean Distance Work?
So, we’ve imagined how L2 distance works in space; how does it work in math? Let’s begin by imagining both vectors as a list of numbers. Line the lists up on top of each other and subtract downwards. Then, square all of the results and add them up. Finally, take a square root.
Milvus skips the square root because the rank order is the same whether or not you take it. This way, we can skip an operation and get the same results, lowering latency and cost and increasing throughput. Below is an example of how Euclidean (L2) distance works.
d(Queen, King) = √((0.3 − 0.5)² + (0.9 − 0.7)²)
= √((−0.2)² + (0.2)²)
= √(0.04 + 0.04)
= √0.08 ≈ 0.28
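Here is a minimal sketch of that calculation in Python with NumPy, using the two-dimensional “Queen” and “King” vectors from the example above (the values are purely illustrative); it also shows the squared variant that skips the square root:

```python
import numpy as np

queen = np.array([0.3, 0.9])
king = np.array([0.5, 0.7])

# Subtract entry by entry, square, sum, then take the square root.
l2_distance = np.sqrt(np.sum((queen - king) ** 2))   # ~0.283

# Skipping the square root preserves the ranking, with one less operation.
squared_l2 = np.sum((queen - king) ** 2)             # 0.08

print(l2_distance, squared_l2)
```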
When Should You Use L2 or Euclidean Distance?
Whether to use Euclidean distance on vectors with different magnitudes is a nuanced question in data analysis and machine learning. Because Euclidean distance is sensitive to both the direction and the magnitude of vectors, it suits scenarios where absolute positions in vector space carry meaning; that sensitivity can be an advantage or a drawback depending on the application. In natural language processing, where semantic meaning is usually encoded in a vector's direction rather than its magnitude, alternatives like cosine similarity are often preferred. Euclidean distance can still measure semantic distance between word vectors, but it is not always the best choice. Base the decision on whether magnitude carries meaningful information in your dataset. If you do use Euclidean distance with word vectors, normalizing the vectors first is a common way to mitigate the impact of magnitude differences. Ultimately, the right similarity measure depends on your data's characteristics and your analysis goals, and it often takes some experimentation to find the most effective one.
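If you do go the Euclidean route with word vectors, the normalization step mentioned above is a one-liner. Here is a small sketch (the vectors are made up for illustration):

```python
import numpy as np

# Two hypothetical vectors that point the same way but have different magnitudes.
a = np.array([2.0, 6.0, 4.0])
b = np.array([1.0, 3.0, 2.0])

# L2-normalize each vector to unit length so magnitude no longer dominates.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# After normalization, L2 distance ranks neighbors the same way cosine similarity does.
print(np.linalg.norm(a - b))            # raw distance: ~3.74
print(np.linalg.norm(a_unit - b_unit))  # normalized distance: ~0.0
```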
Cosine Similarity
We use the term “cosine similarity” or “cosine distance” to denote the difference in orientation between two vectors. For example, how far would you have to turn from facing the front door to face in another direction? Cosine similarity is a good place to start when thinking about the similarity between two vectors.
Fun and applicable fact: although “similarity” and “distance” mean different things on their own, putting “cosine” in front of each makes them nearly interchangeable in practice; strictly speaking, cosine distance is 1 minus cosine similarity. This is another example of semantic similarity at play.
How Does Cosine Similarity Work?
So, we know that cosine similarity measures the angle between two vectors. Once again, we imagine our vectors as a list of numbers. The process to compare vectors is a bit more complex this time, though.
We begin by lining the vectors up on top of each other again. Multiply the numbers downward, entry by entry, and add all of the results up. Save that number and call it “x”; this is the dot product. Next, square each entry and add the squares up horizontally for each vector, giving one sum per vector.
Take the square root of each sum, multiply the two roots together, and call the result “y.” The cosine similarity is then “x” divided by “y” (and the cosine distance is 1 minus that).
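Here is the same walkthrough as a short NumPy sketch, with made-up two-dimensional vectors:

```python
import numpy as np

a = np.array([0.3, 0.9])
b = np.array([0.5, 0.7])

# "x": multiply downward entry by entry and add the results (the dot product).
x = np.sum(a * b)

# "y": square each entry, sum per vector, take the square roots, multiply the roots.
y = np.sqrt(np.sum(a ** 2)) * np.sqrt(np.sum(b ** 2))

cosine_similarity = x / y
print(cosine_similarity)  # ~0.956 for these vectors
```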
When Should You Use Cosine Similarity?
Cosine similarity is primarily used to measure similarity in NLP applications. The main thing that cosine similarity measures is the difference in semantic orientation. If you work with normalized vectors, cosine similarity is equivalent to the inner product. An important property of cosine similarity is that it's invariant to scaling, meaning it only considers the angle between vectors and not their magnitudes, which is particularly useful when comparing documents or word embeddings of different lengths.
Inner Product
The inner product is the projection of one vector onto the other. Its value is the length of that projection multiplied by the length of the vector being projected onto. The bigger the angle between the two vectors, the smaller the inner product; it also scales with the vectors' magnitudes. So, we use the inner product when we care about both orientation and magnitude. For example, it is like the straight line you would have to run through the walls to reach your refrigerator.
How Does Inner Product Work?
The inner product should look familiar; it's just the first third of the cosine calculation. Line those vectors up in your mind and go down the row, multiplying downward. Then, sum the results. That sum, the dot product, is the inner product of the two vectors.
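As a sketch, that is a single multiply-and-sum in NumPy (same made-up vectors as before):

```python
import numpy as np

a = np.array([0.3, 0.9])
b = np.array([0.5, 0.7])

# Multiply entry by entry, then sum: the inner (dot) product.
inner_product = np.sum(a * b)  # equivalent to np.dot(a, b)
print(inner_product)           # 0.78
```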
When Should You Use Inner Product?
The inner product is like a cross between Euclidean distance and cosine similarity. On normalized datasets it is identical to cosine similarity, so IP works well for normalized data, and on non-normalized data it is still useful when magnitude matters. It is also faster to compute than cosine similarity and a more flexible option.
One thing to keep in mind with inner product is that it doesn't follow the triangle inequality, and larger magnitudes are prioritized. This means we should be careful when using IP with an Inverted File (IVF) index or a graph index like HNSW.
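For a rough idea of where this choice shows up in practice, here is a sketch of building an index with the pymilvus client and selecting a metric; the collection name and field name are hypothetical, and the exact index parameters may differ by Milvus version, so check the docs before reusing this:

```python
from pymilvus import Collection  # assumes a running Milvus instance and an open connection

# Hypothetical existing collection with a float vector field named "embedding".
collection = Collection("my_collection")

# Build an HNSW index that ranks results by inner product.
# Swap "IP" for "L2" or "COSINE" depending on how your embedding model was trained.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "IP",
        "params": {"M": 8, "efConstruction": 64},
    },
)
```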
Other Interesting Vector Distance or Similarity Metrics
The three vector metrics above are the most useful for measuring similarity between vector embeddings. However, they're not the only ways to measure distance or similarity between two vectors. Here are two other options.
Hamming Distance
Hamming distance can be applied to vectors or strings. For our use cases, let’s stick to vectors. Hamming distance measures the “difference” in the entries of two vectors. For example, “1011” and “0111” have a Hamming distance of 2.
In terms of vector embeddings, Hamming distance only really makes sense for binary vectors. Float vector embeddings, typically taken from the second-to-last layer of a neural network, are made up of continuous floating-point values, for example [0.24, 0.111, 0.21, 0.51235] and [0.33, 0.664, 0.125152, 0.1].
As you can see, the Hamming distance between two float vector embeddings will almost always come out to the length of the vector itself; there are simply too many possible values for each entry for two floats to match exactly. That's why Hamming distance only applies to binary or sparse vectors, the kind produced by a process like TF-IDF, BM25, or SPLADE.
Hamming distance is good to measure something like the difference in wording between two texts, the difference in the spelling of words, or the difference between any two binary vectors. But it’s not good for measuring the difference between vector embeddings.
Here’s a fun fact. Hamming distance is equivalent to summing the result of an XOR operation on two vectors.
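Here is that XOR trick as a sketch, using the binary vectors from the example above:

```python
import numpy as np

a = np.array([1, 0, 1, 1])
b = np.array([0, 1, 1, 1])

# XOR is 1 exactly where the entries differ, so summing the result counts the differences.
hamming_distance = np.sum(np.bitwise_xor(a, b))
print(hamming_distance)  # 2
```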
Jaccard Index and Jaccard Distance
The Jaccard index (also known as the Jaccard similarity coefficient) is another way to measure the similarity of two vectors. The interesting thing about Jaccard is that there are both a Jaccard index and a Jaccard distance: the Jaccard distance is 1 minus the Jaccard index, and it is the distance metric that Milvus implements.
Calculating the Jaccard distance or index is interesting because it doesn't quite make sense at first glance. Like Hamming distance, Jaccard only works on binary data. I find the traditional framing in terms of “unions” and “intersections” confusing, so I think of it with logic instead: the Jaccard distance is essentially (A “OR” B minus A “AND” B) divided by A “OR” B.
To compute it, we count the number of entries where either A or B is 1 as the “union” and the number of entries where both A and B are 1 as the “intersection.” So the Jaccard index for A (01100111) and B (01010110) is 3/6 = ½, and the Jaccard distance, 1 minus the Jaccard index, is also ½.
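Here is the same calculation as a sketch, using the A and B from the example and the OR/AND framing:

```python
import numpy as np

a = np.array([0, 1, 1, 0, 0, 1, 1, 1])
b = np.array([0, 1, 0, 1, 0, 1, 1, 0])

union = np.sum(np.bitwise_or(a, b))          # entries where A OR B is 1 -> 6
intersection = np.sum(np.bitwise_and(a, b))  # entries where A AND B is 1 -> 3

jaccard_index = intersection / union   # 0.5
jaccard_distance = 1 - jaccard_index   # 0.5
print(jaccard_index, jaccard_distance)
```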
Summary of Vector Similarity Search Metrics
In this post, we learned about the three most useful vector similarity search metrics: L2 (also known as Euclidean) distance, cosine similarity, and inner (dot) product. Each of these has different use cases. Euclidean is for when we care about differences in magnitude. Cosine is for when we care about differences in orientation. The inner product is for when we care about both magnitude and orientation.
Check these videos to learn more about Vector Similarity Metrics, or read the docs to learn how to configure these metrics in Milvus.
Vector Similarity Metrics: Cosine Similarity
Vector Similarity Metrics: Inner Product
Vector Similarity Metrics: L2 or Euclidean