To evaluate retrieval performance in Retrieval-Augmented Generation (RAG) systems, several standardized benchmarks and datasets are widely used. These typically focus on open-domain question answering (QA) tasks, where the system retrieves relevant documents or passages from a large corpus and generates accurate answers. Common benchmarks include Natural Questions (NQ), WebQuestions (WebQ), TriviaQA, and HotpotQA, each designed to test different aspects of retrieval and generation.
Natural Questions (NQ) is a prominent benchmark derived from real Google search queries paired with answers from Wikipedia. It tests a system’s ability to retrieve precise passages and generate answers that match human-annotated short or long responses. NQ emphasizes real-world complexity: the questions are naturally phrased and diverse, and answering them often means locating a specific passage inside a long Wikipedia article. For example, a query like “When did the Titanic sink?” requires retrieving the correct date and its surrounding context from Wikipedia. WebQuestions (WebQ) is a smaller dataset of factoid questions (e.g., “Who founded Microsoft?”) originally answered against Freebase, a knowledge graph; in RAG settings it tests direct retrieval of short factual answers. On both NQ and WebQ, answer accuracy is typically measured with exact match (EM) and token-level F1 against the reference answers.
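As an illustration, the snippet below is a minimal sketch of how EM and token-level F1 are commonly computed for short answers, using simple lowercasing, punctuation stripping, and article removal. Official evaluation scripts (e.g., the SQuAD-style scorers typically reused for NQ and WebQ) apply their own normalization rules, so treat this as an approximation rather than the canonical scorer.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized prediction equals the normalized gold answer, else 0."""
    return int(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the predicted and gold answers."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("April 15, 1912", "15 April 1912"))            # 0: word order differs
print(round(f1_score("April 15, 1912", "15 April 1912"), 2))     # 1.0: same tokens
```

The toy example shows how the two metrics diverge: EM penalizes any surface difference, while token-level F1 gives credit for overlapping tokens.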
TriviaQA is another key dataset, built from trivia-style questions written by trivia enthusiasts and paired with distantly supervised evidence documents. A question like “Which river flows through Paris?” probes broad factual knowledge, and unlike NQ, TriviaQA includes both Wikipedia and web-sourced documents, testing retrieval across diverse sources. HotpotQA extends this by requiring multi-hop reasoning, where answering a question (e.g., “Which actor starred in the first movie directed by Christopher Nolan?”) demands retrieving and connecting information from multiple documents. HotpotQA evaluates both retrieval (via supporting-evidence identification) and answer accuracy. These datasets commonly report recall@k, i.e., whether a correct document appears in the top-k retrieved results, alongside answer-based metrics such as EM.
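Retrieval recall@k is usually computed as a hit rate: the fraction of questions for which at least one gold evidence passage appears among the top-k retrieved candidates. The sketch below assumes each query comes with a set of gold document IDs; the IDs and values are illustrative, not taken from any real dataset.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of queries with at least one gold document in the top-k results.

    retrieved_ids: per-query lists of document IDs in ranked order.
    relevant_ids:  per-query sets of gold (relevant) document IDs.
    """
    hits = 0
    for ranked, gold in zip(retrieved_ids, relevant_ids):
        if any(doc_id in gold for doc_id in ranked[:k]):
            hits += 1
    return hits / len(retrieved_ids)

# Hypothetical toy run: two queries, top-3 retrieved IDs vs. gold evidence IDs.
retrieved = [["d12", "d7", "d3"], ["d5", "d9", "d1"]]
gold = [{"d7"}, {"d2"}]
print(recall_at_k(retrieved, gold, k=3))  # 0.5 -- only the first query hits
```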
Beyond QA-specific datasets, MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale benchmark for passage retrieval and answer generation. It includes real user queries from Bing and requires systems to rank passages by relevance. For example, a query like “How to fix a leaky faucet” requires retrieving step-by-step guides from a web corpus. KILT (Knowledge-Intensive Language Tasks) is a unified benchmark covering multiple tasks, including fact-checking and entity linking, all grounded in a single Wikipedia snapshot as the knowledge source. These datasets test retrieval robustness in diverse scenarios, such as handling ambiguous queries or out-of-domain topics. BEIR (Benchmarking Information Retrieval) provides a standardized framework for evaluating retrieval models zero-shot across 18 datasets, including NQ and HotpotQA, to assess generalization. Ranking quality is typically quantified with MRR (Mean Reciprocal Rank) and NDCG (Normalized Discounted Cumulative Gain). Together, these benchmarks enable developers to rigorously test a RAG system’s ability to retrieve and use relevant information effectively.
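For reference, the following sketch shows textbook implementations of MRR and NDCG@k over per-query relevance judgments. The inputs are toy values, and production evaluations would normally rely on established tooling (for example, the official MS MARCO or BEIR evaluation scripts) rather than hand-rolled metrics.

```python
import math

def mrr(ranked_relevance):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant result per query.

    ranked_relevance: per-query lists of 0/1 relevance flags in ranked order.
    """
    total = 0.0
    for flags in ranked_relevance:
        rr = 0.0
        for rank, rel in enumerate(flags, start=1):
            if rel:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_relevance)

def ndcg_at_k(ranked_gains, k):
    """NDCG@k for one query: DCG of the ranking divided by DCG of the ideal ranking.

    ranked_gains: graded relevance scores of the retrieved items, in ranked order.
    """
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal_dcg = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical toy example: two queries with binary relevance for MRR,
# and one query with graded relevance for NDCG@3.
print(round(mrr([[0, 1, 0], [1, 0, 0]]), 3))   # (1/2 + 1/1) / 2 = 0.75
print(round(ndcg_at_k([0, 2, 1], k=3), 3))     # ~0.67: imperfect ranking, so below 1.0
```

MRR rewards placing the first relevant passage near the top, while NDCG also accounts for graded relevance and the quality of the full ranking, which is why both appear alongside recall@k in retrieval reports.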