The relevance of retrieved documents in a Retrieval-Augmented Generation (RAG) system directly determines the accuracy of the final answer. If the retrieved documents are irrelevant or low-quality, the generator model lacks the necessary context to produce a correct response. For example, if a user asks about quantum computing and the retrieval system returns articles about classical physics, the generator may either fabricate an answer ("hallucinate") or produce a generic, incomplete response. Conversely, high-quality documents provide factual grounding, enabling the model to synthesize accurate, context-aware answers. The retrieval step acts as a filter—only information present in the retrieved documents can be used by the generator, making relevance a prerequisite for correctness.
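To make the "retrieval acts as a filter" point concrete, the sketch below shows how a RAG prompt is typically assembled from the retrieved documents alone; the `retrieve` and `generate` calls in the usage comment are hypothetical placeholders, not any specific library's API.

```python
def build_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Assemble the generator prompt from the retrieved documents only.

    Anything the retriever fails to return never enters the context,
    so the generator cannot ground its answer in it.
    """
    context = "\n\n".join(
        f"[Doc {i + 1}] {doc}" for i, doc in enumerate(retrieved_docs)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Usage (hypothetical retriever and generator):
# docs = retrieve(question, k=5)          # relevance is decided here
# answer = generate(build_prompt(question, docs))
```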
Several metrics quantify the impact of document quality on answer accuracy. For retrieval, precision@k (the proportion of the top-k retrieved documents that are relevant) measures how well the system filters noise, while recall@k (the proportion of all relevant documents that appear in the top k) assesses whether critical documents are included at all. For generation, exact match (EM) and token-level F1 compare the generated answer to a ground-truth reference, directly reflecting factual accuracy. Additionally, answer relevance scoring (e.g., using a trained model to judge whether the answer addresses the query) helps separate retrieval failures from generation errors. For example, if EM falls as precision@k is deliberately degraded, this indicates that retrieval quality is constraining answer accuracy. Tools such as RAGAS (Retrieval-Augmented Generation Assessment) combine retrieval and generation metrics to evaluate end-to-end performance.
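A minimal sketch of these metrics in Python, assuming retrieval results are document IDs and answers are free text. The normalization here (lowercasing, word tokens) is a simplification of the usual SQuAD-style scoring, kept deliberately small for illustration.

```python
import re
from collections import Counter


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def _tokens(text: str) -> list[str]:
    """Crude normalization: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(_tokens(prediction) == _tokens(ground_truth))


def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between the generated answer and the reference."""
    pred, gold = _tokens(prediction), _tokens(ground_truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Running the retrieval metrics and the generation metrics on the same evaluation set makes it possible to plot one against the other, which is exactly the kind of analysis needed to attribute accuracy drops to retrieval quality.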
To isolate the effect of document relevance, controlled experiments can be conducted. For instance, replacing a portion of retrieved documents with irrelevant ones and measuring the decline in answer quality (e.g., using EM or F1) quantifies the sensitivity of the system to retrieval errors. Similarly, analyzing coverage metrics—such as the percentage of key facts from the ground truth present in the retrieved documents—can reveal gaps in retrieval that correlate with answer inaccuracies. These approaches help developers identify whether poor performance stems from retrieval limitations or generator shortcomings, guiding targeted improvements in the RAG pipeline.
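One way to run such a perturbation and coverage study is sketched below. The `distractor_docs` and `key_facts` inputs are assumptions for illustration (e.g., documents sampled from unrelated queries and manually annotated facts from the ground truth), not part of any standard toolkit.

```python
import random


def perturb_retrieval(
    retrieved_docs: list[str],
    distractor_docs: list[str],
    fraction: float,
    seed: int = 0,
) -> list[str]:
    """Replace a fraction of the retrieved documents with irrelevant ones.

    Comparing downstream EM/F1 before and after the swap estimates how
    sensitive the end-to-end system is to retrieval errors.
    """
    rng = random.Random(seed)
    docs = list(retrieved_docs)
    n_replace = int(len(docs) * fraction)
    for idx in rng.sample(range(len(docs)), n_replace):
        docs[idx] = rng.choice(distractor_docs)
    return docs


def fact_coverage(key_facts: list[str], retrieved_docs: list[str]) -> float:
    """Fraction of ground-truth key facts found verbatim in the retrieved
    context. A crude signal, but low coverage that coincides with wrong
    answers points to a retrieval gap rather than a generator failure."""
    context = " ".join(retrieved_docs).lower()
    hits = sum(1 for fact in key_facts if fact.lower() in context)
    return hits / len(key_facts) if key_facts else 0.0
```

Sweeping `fraction` from 0.0 to 1.0 and recording EM/F1 at each step yields a sensitivity curve: a steep drop indicates the generator depends heavily on retrieval quality, while a flat curve suggests the bottleneck lies elsewhere in the pipeline.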