Lexical search quality and relevance are typically evaluated using metrics that measure how well retrieved documents match user expectations or labeled ground-truth data. The most common metrics are Precision, Recall, and F1-score. Precision measures the proportion of retrieved documents that are actually relevant, while Recall measures how many of all relevant documents were actually retrieved. The F1-score is the harmonic mean of the two, giving a single performance number. These metrics are essential when tuning ranking parameters like BM25’s k1 and b, where you are trading off returning only relevant results against returning all relevant results.
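As a minimal sketch of how these three metrics are computed for a single query, assuming you already have a ranked result list and a labeled set of relevant document IDs (the names and data below are illustrative):

```python
# Precision, recall, and F1 for one query.
# `retrieved` is the list returned by the search engine;
# `relevant` is the ground-truth set of relevant document IDs.

def precision_recall_f1(retrieved, relevant):
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)          # relevant docs actually returned
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)             # harmonic mean of precision and recall
    return precision, recall, f1

# Example: 3 of the 5 returned docs are relevant; 4 relevant docs exist in total.
print(precision_recall_f1(["d1", "d2", "d3", "d4", "d5"], ["d1", "d3", "d5", "d9"]))
# -> (0.6, 0.75, ~0.667)
```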
For ranked search systems, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) are more informative metrics. MAP averages the precision measured at the rank of each relevant document and then averages that value across queries, rewarding systems that rank relevant documents consistently high. NDCG, on the other hand, discounts relevant documents by how far down the list they appear, so results near the top count for more, mirroring user behavior where top-ranked results matter most. Both metrics help developers fine-tune ranking algorithms and evaluate improvements like query expansion, stemming, or hybrid scoring methods.
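The sketch below shows per-query Average Precision (the quantity MAP averages over queries) and NDCG@k under binary relevance labels; the function names and sample data are illustrative assumptions, not part of any particular library:

```python
import math

def average_precision(ranked_ids, relevant):
    hits, score = 0, 0.0
    for i, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / i                      # precision at the rank of each relevant hit
    return score / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked_ids, relevant, k=10):
    dcg = sum(1.0 / math.log2(i + 1)               # gain discounted by rank position
              for i, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(i + 1)             # DCG of the best possible ordering
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d3", "d9"}
print(average_precision(ranked, relevant))   # ~0.806; MAP is the mean of this over all queries
print(ndcg_at_k(ranked, relevant, k=5))      # ~0.906
```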
When lexical search is part of a hybrid retrieval system using Milvus, additional metrics like Recall@K and MRR (Mean Reciprocal Rank) become useful for assessing joint performance. For example, Recall@10 measures how often a correct document appears in the top 10 results after combining lexical and vector signals. This helps developers balance the contribution of BM25 scores and Milvus’s semantic similarity. By monitoring these metrics together, teams can systematically evaluate whether lexical precision, semantic recall, or hybrid scoring provides the best real-world relevance for users.
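A simple way to track both metrics over a batch of queries is sketched below. It assumes `results` holds the final hybrid-ranked lists (for example, fused BM25 and Milvus vector scores) and `ground_truth` maps each query to its known relevant document IDs; both structures and all data are illustrative:

```python
def recall_at_k(results, ground_truth, k=10):
    # Fraction of queries with at least one relevant document in the top k.
    hits = sum(1 for q, ranked in results.items()
               if ground_truth[q] & set(ranked[:k]))
    return hits / len(results)

def mean_reciprocal_rank(results, ground_truth):
    # Average of 1 / (rank of the first relevant document) across queries.
    total = 0.0
    for q, ranked in results.items():
        for i, doc_id in enumerate(ranked, start=1):
            if doc_id in ground_truth[q]:
                total += 1.0 / i
                break
    return total / len(results)

results = {"q1": ["d4", "d1", "d8"], "q2": ["d2", "d5", "d7"]}
ground_truth = {"q1": {"d1"}, "q2": {"d9"}}
print(recall_at_k(results, ground_truth, k=3))      # 0.5: only q1 surfaces a relevant doc
print(mean_reciprocal_rank(results, ground_truth))  # 0.25: q1 contributes 1/2, q2 contributes 0
```

Running the same evaluation on BM25-only, vector-only, and fused rankings makes it clear which component (or weighting) is driving relevance for your workload.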
