To measure the accuracy of the retrieval component in a RAG system, precision@K and recall@K are commonly used metrics. These metrics evaluate how well the retriever fetches relevant documents from a corpus. Here’s how they work:
Precision@K measures the proportion of relevant documents in the top K retrieved results. For example, if a query retrieves 10 documents (K=10) and 7 are relevant, precision@10 is 70%. This metric emphasizes the retriever’s ability to avoid irrelevant results, but it doesn’t penalize the system for missing relevant documents outside the top K.

Recall@K, on the other hand, measures how many of the total relevant documents for a query are captured in the top K. If there are 15 relevant documents in the corpus and the top 10 retrieved results include 5 of them, recall@10 is ~33%. Recall highlights coverage but doesn’t account for irrelevant results in the retrieved set. Computing either metric requires a ground-truth set of relevant documents for each query.
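Both metrics reduce to counting hits in the top K. A minimal sketch in Python (document IDs here are hypothetical placeholders, chosen to mirror the recall example above):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(relevant)

# Mirrors the recall example in the text: 15 relevant documents total,
# 5 of which appear in the top 10 retrieved results.
retrieved = ["d0", "d1", "d2", "d3", "d4", "x0", "x1", "x2", "x3", "x4"]
relevant = {"d0", "d1", "d2", "d3", "d4"} | {f"r{i}" for i in range(10)}

print(precision_at_k(retrieved, relevant, 10))  # 0.5
print(recall_at_k(retrieved, relevant, 10))     # ≈ 0.33
```

Note that `relevant` is a set keyed by document ID; in practice the IDs would come from your ground-truth labels rather than being hard-coded.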
Other metrics like Mean Reciprocal Rank (MRR) or Normalized Discounted Cumulative Gain (NDCG) can complement these. For example, MRR focuses on the rank of the first relevant document, which matters in scenarios where users care most about the top result. NDCG accounts for the graded relevance of documents (e.g., highly relevant vs. marginally relevant) and their positions in the ranked list. These metrics are useful when document ranking matters, such as in systems where higher-ranked results are more likely to influence the generator’s output.
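A minimal sketch of both rank-aware metrics, assuming the common linear-gain DCG formulation (function names are illustrative; the NDCG helper also assumes all relevant documents were retrieved, so the ideal ordering is just the observed gains sorted descending):

```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(ranked_lists)

def ndcg_at_k(gains, k):
    """NDCG@k with linear gains. `gains` holds the graded relevance of each
    retrieved document, in ranked order."""
    def dcg(scores):
        return sum(g / math.log2(i + 2) for i, g in enumerate(scores[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# First relevant doc at rank 1 for query 1, rank 2 for query 2 -> (1 + 0.5) / 2
print(mrr([["a", "b", "c"], ["x", "y", "z"]], [{"a"}, {"y"}]))  # 0.75

print(ndcg_at_k([3, 2, 1, 0], k=4))  # 1.0 (already in ideal order)
print(ndcg_at_k([0, 1, 2, 3], k=4))  # < 1.0 (relevant docs ranked too low)
```

Some formulations use exponential gains (2^relevance − 1) instead of linear gains to weight highly relevant documents more heavily; either choice is defensible as long as it is applied consistently.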
To implement these metrics, you need a labeled dataset where each query has known relevant documents. For example, you could use benchmarks like MS MARCO or create a custom evaluation set. After running the retriever on the queries, compare the top K results against the ground truth to calculate precision, recall, or other scores. Tools like Python’s sklearn.metrics can compute some of these programmatically (e.g., ndcg_score for NDCG).

However, limitations exist: precision@K and recall@K assume binary relevance (documents are either relevant or not), which may oversimplify real-world scenarios where relevance is nuanced. Additionally, these metrics don’t directly measure how the retrieved documents affect the generator’s output quality, so they should be paired with end-to-end evaluation of the full RAG pipeline.
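Putting the pieces together, an evaluation run macro-averages the per-query scores over the labeled set. A sketch, where the ground-truth labels and the retriever's output are hypothetical stand-ins for your own data and retrieval call:

```python
# Hypothetical ground truth: query -> set of relevant document IDs
qrels = {
    "q1": {"d1", "d3"},
    "q2": {"d2", "d4", "d7"},
}

# Stand-in for a real retriever call; returns a ranked list of doc IDs.
def retrieve(query, k=5):
    runs = {
        "q1": ["d1", "d9", "d3", "d8", "d5"],
        "q2": ["d6", "d2", "d4", "d0", "d1"],
    }
    return runs[query][:k]

def evaluate(qrels, k=5):
    """Macro-averaged precision@k and recall@k over all labeled queries."""
    precisions, recalls = [], []
    for query, relevant in qrels.items():
        top_k = retrieve(query, k)
        hits = sum(1 for doc in top_k if doc in relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant))
    n = len(qrels)
    return sum(precisions) / n, sum(recalls) / n

mean_p, mean_r = evaluate(qrels, k=5)
print(f"precision@5: {mean_p:.3f}, recall@5: {mean_r:.3f}")
```

The same loop extends naturally to rank-aware metrics: replace the hit count with an MRR or NDCG computation per query and average in the same way.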