To create a test set for a RAG (Retrieval-Augmented Generation) system, start by leveraging existing question-answering (QA) datasets and modifying them to include explicit context references. For example, datasets like SQuAD, Natural Questions, or HotpotQA already provide questions paired with answers and source passages. Extract the relevant context documents from these sources, ensuring each question is mapped to the specific text snippets or documents that contain the answer. This establishes a direct link between the question, the ground-truth answer, and the context a RAG system should retrieve. If the original dataset lacks granular context references (e.g., only providing Wikipedia article titles), augment it by splitting source documents into smaller chunks (e.g., paragraphs or sections) and associating each answer with the exact chunk(s) needed to answer it.
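As a concrete illustration of this mapping step, here is a minimal sketch using the Hugging Face `datasets` library and the standard SQuAD schema (question, context, answers). The chunk size, the record format, and the `chunk_text`/`build_test_records` helpers are illustrative assumptions rather than part of any standard tooling; SQuAD contexts are already roughly paragraph-sized, so the chunking logic matters more when you start from full articles.

```python
# Sketch: build question -> gold-chunk test records from SQuAD.
# Assumes the Hugging Face `datasets` library; chunk size and record
# format are illustrative choices, not a fixed standard.
from datasets import load_dataset


def chunk_text(text, max_chars=500):
    """Split a context document into roughly paragraph-sized chunks."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs or [text]:
        if len(current) + len(p) <= max_chars:
            current = f"{current} {p}".strip()
        else:
            if current:
                chunks.append(current)
            current = p
    if current:
        chunks.append(current)
    return chunks


def build_test_records(split="validation", limit=1000):
    squad = load_dataset("squad", split=split).select(range(limit))
    records = []
    for row in squad:
        answer = row["answers"]["text"][0]
        chunks = chunk_text(row["context"])
        # Keep only chunks that actually contain the ground-truth answer,
        # so each question maps to the exact snippet(s) needed to answer it.
        gold_chunks = [c for c in chunks if answer in c]
        if gold_chunks:
            records.append({
                "question": row["question"],
                "answer": answer,
                "gold_chunks": gold_chunks,
            })
    return records
```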
Next, simulate real-world retrieval scenarios by introducing noise or irrelevant context. For instance, take a question from SQuAD and pair its correct context passage with 4-5 distractor passages drawn from unrelated articles, as in the sketch below. This tests the RAG system’s ability to prioritize relevant context. For diversity, include questions requiring multi-hop reasoning (e.g., HotpotQA’s linked questions) and make sure the context documents span multiple sources. For ground-truth answers, use the original dataset’s answers but verify they are directly derivable from the provided context; if an answer depends on external knowledge not in the context, either revise the answer or expand the context to include the missing information.
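One way to set up the distractor injection is sketched below. It assumes the `records` built in the previous step and samples distractors from the gold chunks of other questions; in practice you might instead pull them from entirely unrelated articles. The `add_distractors` name and the default distractor count are illustrative choices.

```python
# Sketch: add distractor passages to each test record.
# Assumes `records` as produced by build_test_records above.
import random


def add_distractors(records, num_distractors=5, seed=0):
    rng = random.Random(seed)
    # Deduplicated pool of all gold chunks across the test set.
    pool = list(dict.fromkeys(c for r in records for c in r["gold_chunks"]))
    for record in records:
        # Distractors are any chunks that do not answer this question.
        candidates = [c for c in pool if c not in record["gold_chunks"]]
        distractors = rng.sample(candidates, min(num_distractors, len(candidates)))
        # Mix gold and distractor chunks and shuffle so the gold chunk
        # does not always sit in a predictable position.
        record["candidate_chunks"] = record["gold_chunks"] + distractors
        rng.shuffle(record["candidate_chunks"])
    return records
```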
Finally, split the data rigorously to avoid leakage. Use existing train/test splits from the original dataset, but ensure no overlap in context documents between splits. For evaluation, define metrics like retrieval accuracy (e.g., whether the correct context is in the top-k retrieved documents) and answer correctness (e.g., exact match or F1 score against the ground truth). Tools like BEIR or LlamaIndex’s evaluation modules can automate parts of this process. For example, using the WikiQA dataset, you could map each question to its Wikipedia paragraph and add distractors from other articles, then measure if the RAG system retrieves the correct paragraph and generates the expected answer. This approach balances reuse of existing data with controlled modifications to stress-test the system’s retrieval and generation capabilities.
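The two metric families can be computed with a sketch like the one below: retrieval hit@k plus SQuAD-style exact match and token-level F1. It assumes a hypothetical `rag_system` object exposing `retrieve(question, k)` and `generate(question, chunks)`; substitute whatever interface your pipeline actually provides, or lean on BEIR or LlamaIndex evaluators where they fit.

```python
# Sketch: evaluation metrics for retrieval and answer correctness.
# `rag_system.retrieve` / `rag_system.generate` are assumed interfaces.
import re
from collections import Counter


def normalize(text):
    return re.sub(r"\W+", " ", text.lower()).strip()


def retrieval_hit_at_k(retrieved_chunks, gold_chunks):
    """1.0 if any gold chunk appears among the top-k retrieved chunks.

    Uses exact string matching, which is a fair proxy when retrieval
    happens over the same candidate chunks built for the test set.
    """
    return float(any(c in gold_chunks for c in retrieved_chunks))


def exact_match(prediction, ground_truth):
    return float(normalize(prediction) == normalize(ground_truth))


def f1_score(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(rag_system, records, k=5):
    hits, ems, f1s = [], [], []
    for r in records:
        retrieved = rag_system.retrieve(r["question"], k=k)
        answer = rag_system.generate(r["question"], retrieved)
        hits.append(retrieval_hit_at_k(retrieved, r["gold_chunks"]))
        ems.append(exact_match(answer, r["answer"]))
        f1s.append(f1_score(answer, r["answer"]))
    n = len(records)
    return {"hit@k": sum(hits) / n, "exact_match": sum(ems) / n, "f1": sum(f1s) / n}
```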
