nDCG (Normalized Discounted Cumulative Gain)

nDCG measures the quality of a ranking in search results by accounting for both the position and the relevance of items. It assigns higher scores when relevant items appear earlier in the ranked list and supports graded relevance (e.g., items labeled as “highly relevant” vs. “somewhat relevant”). The “discounted” aspect reduces the weight of relevant items appearing later, reflecting real-world user behavior where top results matter most. For example, in an e-commerce search, nDCG would penalize a system that places a highly relevant product on page 2 compared to one that surfaces it immediately. Normalization ensures scores are comparable across queries by dividing by the ideal DCG (the DCG of the best possible ranking). This metric is particularly useful when the order of results impacts user satisfaction, such as in recommendation systems or document retrieval.
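A minimal sketch of the computation, assuming the common rel / log2(rank + 1) formulation of DCG (other variants, such as 2^rel − 1 gains, exist); the relevance grades in the example are hypothetical labels:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each graded relevance is divided by
    the log of its 1-based rank, so later positions contribute less."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(ranked_relevances):
    """nDCG = DCG of the system's ranking divided by the DCG of the
    ideal (descending-relevance) ordering of the same items."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical grades for the top 4 results as ranked by the system
# (3 = highly relevant, 2/1 = somewhat relevant, 0 = irrelevant).
print(ndcg([3, 2, 0, 1]))  # ~0.99: relevant items mostly appear early
```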
MRR (Mean Reciprocal Rank)

MRR evaluates how quickly a system retrieves the first relevant result. It takes the reciprocal of the rank of the first correct answer (i.e., 1/rank) and averages this across all queries. For instance, if the first relevant result for a query appears in position 3, the reciprocal rank is 1/3. MRR emphasizes scenarios where users expect a single correct answer quickly, like voice assistants or FAQ retrieval. It doesn’t credit systems for additional relevant results beyond the first, making it less suitable for tasks requiring diverse results. However, it’s valuable for applications where speed to the first correct answer is critical, such as technical support chatbots or fact-based question answering.
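A small sketch of MRR over a batch of queries; the input here is assumed to be the 1-based position of the first relevant result per query (with None meaning no relevant result was retrieved, conventionally scored as 0):

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank across queries; rank is the 1-based position
    of the first relevant result, or None if none was retrieved."""
    return sum(0.0 if rank is None else 1.0 / rank
               for rank in first_relevant_ranks) / len(first_relevant_ranks)

# Hypothetical outcomes for three queries: first relevant hit at
# position 1, position 3, and not retrieved at all.
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 ≈ 0.44
```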
F1-Score

F1-score balances precision (the fraction of retrieved items that are relevant) and recall (the fraction of all relevant items that are retrieved) into a single metric. It’s the harmonic mean of the two, making it useful when there’s an uneven class distribution (e.g., few relevant items in a large dataset). For example, in a legal document search, F1 would highlight systems that retrieve most relevant cases (high recall) without overwhelming users with irrelevant ones (high precision). However, F1 ignores ranking order, treating all retrieved items equally. It’s best suited for binary relevance tasks (relevant/not relevant) where the focus is on balancing false positives and false negatives, such as spam detection or medical record retrieval. Unlike nDCG or MRR, it doesn’t reflect user experience with ranked lists, but it provides a clear tradeoff between two foundational metrics.
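A minimal sketch of set-based F1 under binary relevance; the document IDs in the example are hypothetical:

```python
def f1_score(retrieved, relevant):
    """F1 from sets of retrieved and relevant document IDs (binary relevance)."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)  # share of retrieved docs that are relevant
    recall = true_positives / len(relevant)      # share of relevant docs that were retrieved
    return 2 * precision * recall / (precision + recall)  # harmonic mean

# Hypothetical query: the system returns 4 documents, 3 of which are
# among the 5 truly relevant ones.
print(f1_score(retrieved={"d1", "d2", "d3", "d7"},
               relevant={"d1", "d2", "d3", "d4", "d5"}))  # precision 0.75, recall 0.6 -> ~0.67
```

Note that the order of the retrieved documents never enters this calculation, which is exactly the limitation mentioned above.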