To determine if a RAG system’s answer is hallucinated or grounded, human judges can focus on three core criteria: factual consistency with source documents, traceability of claims, and handling of ambiguity or gaps in knowledge. Each criterion requires systematic comparison between the generated answer and the retrieved sources, alongside an evaluation of the answer’s coherence and specificity.
First, factual consistency involves verifying whether the answer aligns with the information in the retrieved documents. Judges should cross-check specific claims (e.g., dates, statistics, events) in the response against the source material. For example, if the RAG system states, "The company reported $5M revenue in 2023," but the source document says "$4.8M in Q1 2023," this discrepancy suggests a hallucination. Judges might also assess whether the answer introduces unsupported details. For instance, if the answer claims a "30% increase in user engagement" without mentioning this figure in the sources, it’s likely fabricated. To streamline this, judges could use a binary or scaled score for each claim (e.g., "fully supported," "partially supported," or "unsupported").
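The claim-level scoring idea above can be sketched mechanically. The function below is a simplified, hypothetical helper (not a production fact-checker): it extracts numeric figures from a claim and checks whether each appears verbatim in the retrieved sources, mapping the hit rate onto the "fully / partially / unsupported" rubric. Real judging would still require human reading; this only pre-screens checkable figures.

```python
import re

def score_claim(claim: str, sources: list[str]) -> str:
    """Pre-screen a claim against source snippets (hypothetical rubric).

    "fully supported"     - every number/figure in the claim appears in a source
    "partially supported" - some but not all figures appear
    "unsupported"         - no figure appears, or the claim has no checkable
                            figures at all (route to manual review)
    """
    # Pull out dollar amounts, percentages, years, and other numbers.
    figures = re.findall(r"\$?\d[\d,.]*%?", claim)
    if not figures:
        return "unsupported"  # nothing machine-checkable; needs a human judge
    combined = " ".join(sources)
    hits = sum(1 for fig in figures if fig in combined)
    if hits == len(figures):
        return "fully supported"
    if hits > 0:
        return "partially supported"
    return "unsupported"
```

For the example in the text, `score_claim("The company reported $5M revenue in 2023", ["The company reported $4.8M in Q1 2023"])` returns `"partially supported"`: the year matches but the revenue figure does not, flagging the claim for closer human inspection.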
Second, traceability evaluates whether the system provides clear references to specific sections of the source material. Judges should check if citations (e.g., document IDs, page numbers) accurately map to the claims they support. For example, if an answer cites "Document A, Section 2" but that section discusses an unrelated topic, the citation is invalid. Judges should also flag answers whose critical claims lack citations altogether. Additionally, granularity matters: vague references like "according to our data" are less trustworthy than precise pointers. Tools like highlighting text in source documents or using annotation overlays can help judges quickly validate traceability.
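A first-pass citation check can also be automated before a judge looks at it. The sketch below assumes a hypothetical `corpus` dict mapping citation keys (e.g. "Document A, Section 2") to section text, and flags citations that are dangling or whose cited section shares no content words with the claim, a rough proxy for the "unrelated topic" failure described above.

```python
def validate_citation(claim: str, citation: str, corpus: dict[str, str]) -> bool:
    """Rough check that a cited section plausibly supports a claim.

    Returns False for dangling citations (key not in corpus) or sections
    with zero content-word overlap with the claim. A True result only
    means "not obviously invalid"; a human judge still verifies support.
    """
    section = corpus.get(citation)
    if section is None:
        return False  # dangling citation: the reference maps to nothing
    # Compare longer words only, ignoring case and trailing punctuation.
    claim_words = {w.lower().strip(".,") for w in claim.split() if len(w) > 3}
    section_words = {w.lower().strip(".,") for w in section.split()}
    return bool(claim_words & section_words)
```

Word-overlap is deliberately crude: it catches the grossly mismatched citation cheaply, so judge time is spent on the subtler cases where the section is topically related but does not actually entail the claim.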
Third, handling of ambiguity or knowledge gaps examines whether the system acknowledges uncertainty or avoids inventing information when sources are incomplete or conflicting. For example, if sources provide conflicting dates for an event, a grounded answer might say, "Sources conflict, with dates ranging from 2020 to 2022," whereas a hallucinated answer might arbitrarily pick one date without justification. Judges should also assess if the system oversteps by answering questions outside the scope of the retrieved documents. For instance, if asked about a niche topic not covered in the sources, a grounded response would admit uncertainty, while a hallucinated one might fabricate an answer. Judges could score this by checking for hedging phrases (e.g., "likely," "according to some sources") or explicit disclaimers.
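Checking for hedging phrases and disclaimers, as suggested above, is easy to script as a screening step. The phrase list below is an illustrative assumption, not a standard lexicon; judges would tune it to their domain, and a match only signals that the answer acknowledges uncertainty, not that the hedge is warranted.

```python
import re

# Illustrative hedge/disclaimer patterns; word boundaries avoid false
# hits inside longer words (e.g. "may" in "dismay").
HEDGE_PATTERNS = [
    r"\blikely\b",
    r"\bmay\b",
    r"\buncertain\b",
    r"sources conflict",
    r"according to some sources",
    r"not covered",
    r"cannot determine",
]

def acknowledges_uncertainty(answer: str) -> bool:
    """Return True if the answer contains hedging or disclaimer language."""
    lowered = answer.lower()
    return any(re.search(pattern, lowered) for pattern in HEDGE_PATTERNS)
```

On the example from the text, `acknowledges_uncertainty("Sources conflict, with dates ranging from 2020 to 2022")` returns `True`, while a flat, unhedged date claim returns `False`, which surfaces exactly the answers a judge should scrutinize for arbitrary fact-picking.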
By combining these criteria, judges can systematically identify hallucinations. This approach balances rigor with practicality, ensuring evaluations are both thorough and scalable for developers iterating on RAG systems.