What is mean average precision (mAP) in similarity search, and how is it used to evaluate vector database retrieval results?
Mean Average Precision (mAP) is a metric used to assess the quality of ranked retrieval results in tasks like similarity search. It quantifies how well a system retrieves relevant items and ranks them higher than non-relevant ones. For a single query, Average Precision (AP) is calculated by averaging precision values at each position where a relevant item appears in the ranked list. mAP extends this by averaging AP across all queries, providing an aggregate measure of performance.
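As a rough sketch of that calculation (the function name and input format are illustrative, not taken from any particular library), AP for one query can be computed from a boolean relevance flag per ranked result:

```python
def average_precision(relevance_flags):
    """AP for one query: average the precision at each rank where a relevant item appears.

    relevance_flags: one boolean per ranked result, top-ranked result first.
    """
    hits = 0
    precisions_at_hits = []
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precisions_at_hits.append(hits / rank)  # precision over the top `rank` results
    # If no relevant item was retrieved, there is nothing to average.
    return sum(precisions_at_hits) / len(precisions_at_hits) if precisions_at_hits else 0.0
```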
In similarity search, when a query is made to a vector database (e.g., finding images similar to an input), the system returns a ranked list of nearest neighbors. AP for one query is computed by first identifying the positions of relevant items in this list. For example, if a query has three relevant items returned at positions 1, 3, and 5, precision is calculated at each of these positions (1/1, 2/3, 3/5). AP averages these values: (1 + 2/3 + 3/5)/3 ≈ 0.76. mAP is the mean of the AP values across all queries; because each query contributes a single AP score, queries with many relevant items do not dominate queries with few.
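Worked out in plain arithmetic, the same numbers look like this (the two extra AP values are invented placeholders standing in for other queries):

```python
# Precision at each rank where a relevant item appears (ranks 1, 3 and 5).
precisions_at_hits = [1/1, 2/3, 3/5]
ap = sum(precisions_at_hits) / len(precisions_at_hits)   # ≈ 0.756, i.e. the 0.76 above

# mAP is simply the mean of the AP values obtained this way for every query.
ap_per_query = [ap, 0.64, 0.41]   # the last two APs are made up for illustration
map_score = sum(ap_per_query) / len(ap_per_query)
```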
How does mAP apply to vector databases?
mAP captures both whether relevant items are retrieved at all (recall) and how highly they are ranked (precision). For instance, in a product image database, a query for a "black backpack" might return its relevant items at positions 1, 4, and 7, while another query for "blue sneakers" returns its relevant items at positions 2, 5, and 9; mAP combines the AP scores of both queries into a single performance figure. This lets developers compare algorithms (e.g., HNSW vs. IVF indexes) or adjust embedding models: higher mAP indicates better ranking of relevant results.
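One way this plays out in practice is sketched below. The product IDs, ground-truth sets, and result lists are invented, but the relevant hits for the two example queries land at the positions mentioned above (1, 4, 7 and 2, 5, 9), and the same harness could score results coming from an HNSW index versus an IVF index:

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query, given ranked result IDs and the ground-truth relevant IDs."""
    hits = 0
    precisions_at_hits = []
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant_ids:
            hits += 1
            precisions_at_hits.append(hits / rank)
    return sum(precisions_at_hits) / len(precisions_at_hits) if precisions_at_hits else 0.0

def mean_average_precision(results_per_query, ground_truth):
    """mAP across queries; queries with no relevant items are excluded from the mean."""
    aps = [
        average_precision(results_per_query[query], relevant_ids)
        for query, relevant_ids in ground_truth.items()
        if relevant_ids  # skip queries with empty ground truth
    ]
    return sum(aps) / len(aps)

# Hypothetical ground truth and ranked results from two index configurations.
ground_truth = {
    "black backpack": {"p12", "p43", "p77"},
    "blue sneakers": {"p05", "p19", "p88"},
}
hnsw_results = {  # relevant hits at ranks 1, 4, 7 and 2, 5, 9
    "black backpack": ["p12", "p02", "p31", "p43", "p09", "p66", "p77"],
    "blue sneakers":  ["p50", "p05", "p11", "p70", "p19", "p33", "p41", "p60", "p88"],
}
ivf_results = {   # same items, slightly worse ranking
    "black backpack": ["p02", "p12", "p31", "p09", "p43", "p66", "p77"],
    "blue sneakers":  ["p50", "p11", "p05", "p70", "p33", "p19", "p41", "p60", "p88"],
}

print("HNSW mAP:", mean_average_precision(hnsw_results, ground_truth))
print("IVF  mAP:", mean_average_precision(ivf_results, ground_truth))
```

In a real benchmark, only the inputs change: the ranked IDs come from the system under test and the relevance sets from labeled ground truth.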
Practical considerations and limitations
mAP requires ground truth relevance labels for each query, which can be labor-intensive to create. It also assumes binary relevance (items are either relevant or not), which might not capture graded relevance. For example, in a music recommendation system, some songs might be more relevant than others, but mAP treats all relevant items equally. Additionally, if a query has no relevant items in the database, its AP is typically excluded from the mAP calculation to avoid skewing results. Despite these limitations, mAP remains a standard metric for benchmarking vector databases because it directly reflects user expectations: relevant results should appear early in the ranked list.