To evaluate ranked retrieval outputs in a RAG system using nDCG, start by defining graded relevance scores for documents relative to a query. In RAG, document order can influence the generator’s output, so relevance should reflect both the inherent usefulness of a document and its impact on the final answer. For example, a document that directly answers a factual question might receive a higher relevance score than a tangentially related one. These scores can be determined through human annotation, synthetic judgments (e.g., similarity scores between a document and the ideal answer), or proxy metrics like click-through rates in user interactions. The key is to ensure relevance grades align with the generator’s ability to leverage the document’s content.
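One lightweight way to produce synthetic graded judgments is to score each retrieved document by its overlap with an ideal (reference) answer and bucket the result into grades. The sketch below uses a simple token-overlap proxy with illustrative thresholds; in practice you might substitute an embedding-based similarity or human annotations, and the function name and cutoffs here are assumptions, not a standard API:

```python
def grade_relevance(document: str, ideal_answer: str) -> int:
    """Assign a graded relevance score (0-3) via token overlap with the ideal answer.

    A cheap synthetic-judgment proxy; thresholds are illustrative and should be
    calibrated against human judgments when available.
    """
    doc_tokens = set(document.lower().split())
    answer_tokens = set(ideal_answer.lower().split())
    if not answer_tokens:
        return 0
    overlap = len(doc_tokens & answer_tokens) / len(answer_tokens)
    if overlap >= 0.75:
        return 3  # directly answers the question
    if overlap >= 0.50:
        return 2  # substantial support
    if overlap >= 0.25:
        return 1  # tangentially related
    return 0      # not useful to the generator
```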
Next, compute the Discounted Cumulative Gain (DCG) for the retrieved document list. DCG sums each retrieved document's relevance score with a logarithmic discount based on its rank: the document at position i contributes relevance / log2(i + 1). This penalizes relevant documents appearing lower in the list, reflecting the intuition that users (and generators) prioritize top results. To normalize the score, divide the DCG by the Ideal DCG (IDCG), which is the maximum possible DCG if the documents were perfectly ordered by relevance. The resulting nDCG (a value between 0 and 1) quantifies how close the retrieval order is to optimal. For RAG, this helps identify whether the retriever's ranking aligns with the information the generator needs to produce high-quality outputs.
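A minimal sketch of the computation, assuming graded relevance scores are already available for the retrieved documents in rank order:

```python
import math
from typing import Optional, Sequence

def dcg(relevances: Sequence[float]) -> float:
    # Position i (1-indexed) contributes rel_i / log2(i + 1).
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    rels = list(relevances[:k] if k else relevances)
    # IDCG: the same judgments sorted into the best possible order.
    ideal = sorted(relevances, reverse=True)[:len(rels)]
    idcg = dcg(ideal)
    return dcg(rels) / idcg if idcg > 0 else 0.0

# Same graded judgments, different orderings:
print(ndcg([3, 2, 0, 1]))  # relevant docs near the top -> ~0.99
print(ndcg([0, 1, 2, 3]))  # relevant docs buried       -> ~0.61
```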
However, nDCG alone may not capture the full impact of document order on the generator. For instance, a document that is only moderately relevant but critical for disambiguation can affect the output far more than its relevance grade suggests, depending on where it lands in the list. To address this, pair nDCG with generator-specific metrics such as answer correctness (e.g., BLEU, ROUGE) or downstream task performance (e.g., accuracy in QA). If two retrieval lists have identical nDCG but different answer quality scores, it suggests the generator is sensitive to nuances beyond basic relevance rankings. This combination allows developers to tune the retriever's ranking strategy (e.g., adjusting the discount factor in DCG or reranking based on generator feedback) to optimize both retrieval quality and end-task performance.
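As a rough sketch of pairing the two signals, the snippet below reports nDCG alongside a token-level F1 score, a common answer-correctness proxy in QA that stands in here for BLEU/ROUGE or task accuracy. The `generate` callable is a placeholder for whatever RAG generator you use, and `evaluate_ranking` is a hypothetical helper, not part of any library:

```python
from typing import Callable, Dict, Sequence

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate_ranking(
    query: str,
    ranked_docs: Sequence[str],
    ndcg_score: float,                              # from the retrieval-side sketch above
    reference_answer: str,
    generate: Callable[[str, Sequence[str]], str],  # placeholder for your RAG generator
) -> Dict[str, float]:
    """Report retrieval quality and downstream answer quality side by side."""
    answer = generate(query, ranked_docs)
    return {"ndcg": ndcg_score, "answer_f1": token_f1(answer, reference_answer)}
```

Comparing these two numbers across candidate rankings (or reranking strategies) makes it visible when retrieval order looks equally good by nDCG but the generator clearly prefers one ordering.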