Approximate Nearest Neighbor (ANN) benchmark datasets and evaluations often focus on Euclidean distance as a default, but many frameworks and studies explicitly evaluate algorithms under multiple distance metrics. This approach ensures that algorithms are tested for versatility and robustness across different use cases. Here’s how benchmarks typically handle distance metrics:
1. Default Assumption and Common Metrics
Most ANN benchmarks start with Euclidean distance (L2) as a baseline because it is widely used in applications like image retrieval, clustering, and regression. For example, datasets like MNIST, SIFT-1M, or Deep Image are often evaluated using L2. However, benchmarks like ANN-Benchmarks (Aumüller et al., 2020) also include cosine (angular) similarity and inner product (dot product), which are critical for text embeddings (e.g., GloVe) and for recommendation systems. Some benchmarks even test Manhattan (L1) or Hamming distance for binary data. These evaluations ensure that algorithms are not overly optimized for a single metric, which could limit real-world applicability.
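As a rough illustration of how these metric families differ, here is a minimal NumPy sketch of the distances named above; the helper names are ours, not part of any benchmark API:

```python
import numpy as np

def l2_distance(a, b):
    # Euclidean (L2) distance, the usual default for SIFT/MNIST-style benchmarks.
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    # Angular/cosine distance, common for text embeddings such as GloVe.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def inner_product_score(a, b):
    # Maximum inner product is a similarity rather than a distance: larger means closer.
    return np.dot(a, b)

def hamming_distance(a, b):
    # Hamming distance for binary vectors (boolean or {0, 1} arrays).
    return np.count_nonzero(a != b)
```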
2. Dataset-Specific Metric Pairings
Certain datasets are inherently tied to specific metrics. For instance, GloVe word vectors are typically paired with cosine (angular) similarity to measure semantic similarity, the NYTimes dataset (news articles) is usually evaluated with angular distance as well, and recommendation-style embeddings are paired with inner product. Benchmarks like the Big-ANN Challenge or LAION-AI's evaluations explicitly pair datasets with their natural metrics. Libraries and algorithms like FAISS or HNSW are then tested across these pairings to measure consistency, as in the sketch below. This highlights how algorithm performance can vary: tree-based methods (e.g., KD-trees) degrade on high-dimensional data such as cosine-similarity embeddings, while graph-based methods (NSG, HNSW) adapt better because distance computation during graph traversal is largely metric-agnostic.
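To make such a pairing concrete, the sketch below indexes the same synthetic data under both L2 and cosine similarity with FAISS. The normalize-then-inner-product trick for cosine is standard practice, but the data, dimensionality, and parameters here are placeholders rather than a benchmark recipe:

```python
import numpy as np
import faiss  # assumes faiss-cpu (or faiss-gpu) is installed

d = 128                                              # vector dimensionality (e.g., SIFT descriptors)
xb = np.random.rand(10_000, d).astype('float32')     # placeholder database vectors
xq = np.random.rand(10, d).astype('float32')         # placeholder query vectors

# Euclidean pairing: exact (brute-force) L2 index as a baseline.
index_l2 = faiss.IndexFlatL2(d)
index_l2.add(xb)
D_l2, I_l2 = index_l2.search(xq, 10)                 # distances and neighbor IDs

# Cosine pairing: normalize vectors, then inner product equals cosine similarity.
xb_n, xq_n = xb.copy(), xq.copy()
faiss.normalize_L2(xb_n)
faiss.normalize_L2(xq_n)
index_cos = faiss.IndexFlatIP(d)
index_cos.add(xb_n)
D_cos, I_cos = index_cos.search(xq_n, 10)
```

Normalizing both database and query vectors reduces cosine similarity to a plain inner product, which is why many libraries expose only L2 and inner-product index types.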
3. Evaluation Frameworks and Customization
Tools like ANN-Benchmarks let users configure evaluations for multiple distance metrics, reporting recall, query time (queries per second), and index build time for each. Researchers often test algorithms across several metrics to identify trade-offs. For example, a quantization-based method like IVF-PQ might perform well under L2 but require extra preprocessing (vector normalization) to handle cosine similarity correctly. Open-source frameworks also let users add custom datasets and metrics, encouraging community-driven benchmarking. However, not all studies prioritize multi-metric evaluations; some focus solely on L2 for simplicity, risking biased conclusions about an algorithm's general effectiveness.
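Recall is the headline quality measure in these reports. The helper below shows one common way to compute recall@k from the IDs returned by an approximate index versus exact brute-force ground truth; it mirrors what frameworks like ANN-Benchmarks report, but the function itself is our own sketch, not their API:

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true k nearest neighbors recovered by the ANN index,
    averaged over all queries. Both arrays have shape (n_queries, >= k)."""
    hits = 0
    for approx, exact in zip(approx_ids, exact_ids):
        hits += len(set(approx[:k]) & set(exact[:k]))
    return hits / (len(approx_ids) * k)

# Toy example (in practice these IDs come from an ANN index and from a
# brute-force ground-truth search, respectively).
approx = np.array([[1, 2, 3], [4, 5, 9]])
exact  = np.array([[1, 2, 7], [4, 5, 6]])
print(recall_at_k(approx, exact, k=3))  # (2 + 2) / 6 ≈ 0.667
```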
In summary, while Euclidean distance remains a common default, modern ANN benchmarks increasingly stress-test algorithms under diverse metrics. This reflects real-world needs, where data type (text, images, binary) dictates the optimal metric. Developers should verify evaluation setups in benchmarks to ensure alignment with their application’s requirements.