The RAG (Retrieval-Augmented Generation) triad of metrics—answer relevance, support relevance, and correctness—evaluates different aspects of a system’s ability to retrieve accurate information and generate coherent responses. These metrics work together to assess the quality of both the retrieval and generation stages, ensuring the system meets user expectations for accuracy, reliability, and contextual alignment.
Answer relevance measures how directly the generated response addresses the user’s query. For example, if a user asks, “What causes auroras?” a response explaining solar wind interacting with Earth’s magnetic field is relevant, while a tangent about general space weather phenomena is not. This metric ensures the system stays on-topic and avoids unnecessary or unrelated information. It focuses on the output’s alignment with the intent of the question, acting as a first-layer check for usefulness.
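In practice, answer relevance is usually scored automatically rather than judged by hand. The sketch below is a minimal illustration that approximates relevance with TF-IDF cosine similarity between the query and the answer; production evaluators more often use an embedding model or an LLM judge, and the function name, example strings, and interpretation comments here are assumptions for illustration only.

```python
# Minimal sketch: approximate answer relevance as lexical similarity
# between the user's query and the generated answer. Real evaluators
# usually rely on embeddings or an LLM judge; names and examples are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def answer_relevance(query: str, answer: str) -> float:
    """Score how directly the answer addresses the query (0 to 1)."""
    vectors = TfidfVectorizer().fit_transform([query, answer])
    return float(cosine_similarity(vectors)[0, 1])


query = "What causes auroras?"
on_topic = ("Auroras occur when charged particles from the solar wind "
            "interact with Earth's magnetic field.")
tangent = "Space weather covers many phenomena, including solar flares and cosmic rays."

print(answer_relevance(query, on_topic))  # higher: stays on the question
print(answer_relevance(query, tangent))   # near zero: drifts off-topic
```

Even with this crude lexical proxy, the on-topic answer shares key terms with the query and scores higher than the space-weather tangent, which is the behavior the metric is meant to capture.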
Support relevance evaluates whether the documents or data retrieved by the system are logically connected to the generated answer. Suppose the system answers the aurora question correctly but cites a document about volcanic activity instead of solar physics; support relevance would flag that mismatch even though the answer is right. This metric ensures the system’s retrieval component reliably surfaces contextually appropriate evidence, which is critical for traceability and trust. Without it, answers may appear correct by coincidence rather than being grounded in valid sources.
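One simple way to operationalize this check is to score each retrieved document against the generated answer and flag sources that score near zero. The sketch below reuses the lexical-similarity proxy from above; the documents, threshold, and function name are illustrative assumptions, and production systems typically use an embedding model or an LLM judge instead.

```python
# Sketch: score each retrieved document against the generated answer and
# flag sources that appear unrelated. Threshold and examples are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def support_relevance(answer: str, retrieved_docs: list[str]) -> list[float]:
    """Score each retrieved document against the generated answer (0 to 1)."""
    vectors = TfidfVectorizer().fit_transform([answer] + retrieved_docs)
    return cosine_similarity(vectors)[0, 1:].tolist()  # answer vs. each document


answer = ("Auroras are caused by solar wind particles colliding with gases "
          "guided along Earth's magnetic field lines.")
retrieved = [
    "Charged particles from the solar wind follow Earth's magnetic field lines toward the poles.",
    "Volcanic eruptions inject ash and sulfur dioxide into the stratosphere.",
]

for score, doc in zip(support_relevance(answer, retrieved), retrieved):
    verdict = "supports the answer" if score >= 0.2 else "unrelated source"  # threshold illustrative
    print(f"{score:.2f}  {verdict}: {doc}")
```

Here the solar-wind passage scores well above the volcanic one, mirroring the mismatch described above: the answer may be right, but the cited evidence does not back it.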
Correctness verifies factual accuracy. For instance, if the system states, “Auroras are caused by lunar radiation,” this is factually incorrect, regardless of relevance. Correctness is often validated against ground-truth data or expert-reviewed answers. It acts as the final gatekeeper, ensuring the system doesn’t propagate misinformation.

Together, these three metrics address distinct failure points: answer relevance prevents off-topic responses, support relevance ensures proper sourcing, and correctness guarantees factual integrity. By measuring all three, developers can identify whether issues stem from retrieval (poor support relevance), generation (low answer relevance), or factual gaps (incorrectness), enabling targeted improvements to the RAG pipeline.
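A correctness check and the combined triage can be sketched in the same spirit. The snippet below assumes a ground-truth reference answer is available and uses a deliberately crude normalized exact-match score (real pipelines more often use semantic comparison, NLI, or expert-graded rubrics); the triage helper, thresholds, and example scores are all illustrative assumptions rather than a fixed recipe.

```python
# Sketch: crude correctness check against a reference answer, plus a triage
# helper that maps the weakest triad score to a likely pipeline stage.
# Thresholds and example scores are illustrative.
def _normalize(text: str) -> str:
    return " ".join(text.lower().split()).rstrip(".")


def correctness_exact(generated: str, reference: str) -> float:
    """Crude correctness score: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if _normalize(generated) == _normalize(reference) else 0.0


def triage(answer_relevance: float, support_relevance: float, correctness: float,
           threshold: float = 0.5) -> str:
    """Point to the pipeline stage most likely at fault (threshold illustrative)."""
    if support_relevance < threshold:
        return "retrieval issue: evidence does not support the answer"
    if answer_relevance < threshold:
        return "generation issue: answer drifts from the question"
    if correctness < threshold:
        return "factual gap: answer disagrees with ground truth"
    return "no obvious failure detected"


reference = "Auroras are caused by solar wind interacting with Earth's magnetic field."
generated = "Auroras are caused by lunar radiation."

score = correctness_exact(generated, reference)  # 0.0: wrong mechanism
print(triage(answer_relevance=0.9, support_relevance=0.8, correctness=score))
# prints: factual gap: answer disagrees with ground truth
```

Keeping the three scores separate is what makes this kind of triage possible; returning graded scores rather than pass/fail booleans also lets teams tune thresholds per domain.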