## Answer Correctness in RAG

In Retrieval-Augmented Generation (RAG), "answer correctness" refers to whether the generated answer accurately addresses the query and aligns with the facts or context provided by the retrieved documents. Unlike generic text similarity, which measures surface-level overlap between texts (e.g., shared keywords or phrases), correctness focuses on factual consistency, logical coherence, and relevance to the specific question. For example, a RAG system might retrieve a document stating "Paris is France's capital" and generate "The capital of France is Paris." Text similarity metrics like cosine similarity might score this highly, but correctness would also require validating that the answer isn't contradicted by other sources and doesn't introduce unsupported claims (e.g., adding "and its population is 10 million" if the source doesn't mention this).
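The gap between surface overlap and correctness can be seen with a minimal sketch. Here, a token-level Jaccard score (a crude stand-in for similarity metrics) gives a solid score to an answer that smuggles in the unsupported population claim; the example strings are illustrative:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity: a crude surface-overlap metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

reference = "the capital of france is paris"
faithful = "the capital of france is paris"
embellished = "the capital of france is paris and its population is 10 million"

# Surface overlap cannot distinguish a faithful answer from one that adds
# an unsupported population claim: both score well above a naive threshold.
print(jaccard(reference, faithful))               # 1.0
print(round(jaccard(reference, embellished), 2))  # 0.55
```

A correctness check, by contrast, would have to flag the population claim as unsupported regardless of how much vocabulary the two strings share.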
## How Correctness Differs from Text Similarity

Text similarity metrics (e.g., BLEU, ROUGE) compare generated text to a reference answer, prioritizing lexical or structural alignment. In contrast, correctness in RAG requires verifying that the answer is faithful to the retrieved context. For instance, if a user asks, "What causes earthquakes?" and the system retrieves a document explaining tectonic plate movement, a correct answer must reflect that explanation without inventing unscientific details. A similarity-based metric might reward a fluent but incorrect answer like "Earthquakes are caused by weather changes" if it shares words with the query, but correctness would penalize it for factual inaccuracy. Tools like entailment verification (e.g., using models like BERT to check if the answer logically follows from the source) or fact-checking pipelines (comparing claims in the answer against the retrieved data) are better suited for measuring correctness.
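As a toy stand-in for entailment verification, the sketch below scores how much of an answer's content is actually grounded in the retrieved context; a real system would run an NLI model rather than word overlap, and the stopword list and example strings are assumptions for illustration:

```python
# Toy grounding check: fraction of the answer's content words that appear
# in the retrieved context. A real pipeline would use an NLI model instead.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were",
             "by", "of", "in", "and", "to", "causes", "caused"}

def grounding_score(answer: str, context: str) -> float:
    """Return the share of content words in `answer` found in `context`."""
    ctx = set(context.lower().split())
    words = [w for w in answer.lower().split() if w not in STOPWORDS]
    return sum(w in ctx for w in words) / len(words) if words else 0.0

context = "earthquakes are caused by the movement of tectonic plates"
good = "the movement of tectonic plates causes earthquakes"
bad = "earthquakes are caused by weather changes"

print(grounding_score(good, context))            # 1.0
print(round(grounding_score(bad, context), 2))   # 0.33
```

The weather-based answer scores low because its key content words ("weather", "changes") never appear in the retrieved evidence, mirroring what an entailment check would conclude more robustly.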
## Practical Measurement Approaches

To measure correctness in RAG, developers can use:
- Human evaluation: Experts assess if answers are factually consistent with the retrieved context.
- Automated claim extraction: Tools like FactScore decompose answers into individual claims and verify each against the source documents.
- Contradiction detection: NLI (Natural Language Inference) classifiers, often built on models like DeBERTa, flag answers that conflict with retrieved evidence.
- Retrieval-grounded metrics: Frameworks like RAGAS compute correctness by checking the answer's claims against the retrieved passages rather than rewarding surface overlap. For example, if the source states "The Eiffel Tower was completed in 1889," an answer saying "Built in 1889" would score high on correctness, while "Built in 1890" would fail, even if the rest of the text is similar.
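The claim-extraction approach above can be sketched with a toy FactScore-style scorer. Splitting on sentence boundaries and verifying only numeric tokens are deliberate simplifications; real implementations extract claims with an LLM and verify them with a trained model:

```python
import re

def extract_claims(answer: str) -> list[str]:
    """Naive claim extraction: treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"[.!?]", answer) if s.strip()]

def supported(claim: str, source: str) -> bool:
    """Toy verifier: every number the claim asserts must appear in the source."""
    source_nums = set(re.findall(r"\d+", source))
    return all(n in source_nums for n in re.findall(r"\d+", claim))

def correctness(answer: str, source: str) -> float:
    """Fraction of the answer's claims supported by the source."""
    claims = extract_claims(answer)
    return sum(supported(c, source) for c in claims) / len(claims)

source = "The Eiffel Tower was completed in 1889."
print(correctness("The tower was built in 1889.", source))  # 1.0
print(correctness("The tower was built in 1890.", source))  # 0.0
```

Even this crude decomposition captures the key behavior: the 1890 answer fails correctness outright despite being nearly identical to the 1889 answer by any similarity metric.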
By prioritizing factual alignment over surface-level overlap, these methods ensure RAG systems produce trustworthy, contextually grounded answers.