Here are key metrics and tools used to evaluate how well generated answers align with source documents:
1. Faithfulness/Consistency Metrics (RAGAS, TruLens): Tools like RAGAS measure faithfulness by using an LLM to check whether the claims in an answer are fully supported by the provided context. RAGAS breaks the answer into atomic statements, then prompts an LLM to verify each statement against the source documents. Similarly, TruLens provides a groundedness feedback function that scores how well each statement in the answer is backed by the retrieved source text. These methods focus on detecting hallucinations, i.e., claims unsupported by the documents. For instance, if a document states "Company X revenue grew 5% in 2023," an answer claiming "10% growth" would receive a low faithfulness score. These metrics are largely automated but depend on the judging LLM's ability to parse the context accurately.
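The statement-level verification described above can be reproduced outside any particular framework. The sketch below is illustrative rather than RAGAS's actual API: call_llm is a placeholder for whatever chat-completion client you use, and the prompts are deliberately simplified.

```python
# Minimal sketch of an LLM-based faithfulness check in the spirit of RAGAS.
# `call_llm`, `split_into_statements`, and `faithfulness_score` are illustrative
# names, not part of any library's API.

from typing import Callable


def split_into_statements(answer: str, call_llm: Callable[[str], str]) -> list[str]:
    """Ask the LLM to break the answer into short, self-contained claims."""
    prompt = (
        "Break the following answer into a list of atomic factual statements, "
        "one per line:\n\n" + answer
    )
    lines = call_llm(prompt).splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]


def faithfulness_score(answer: str, context: str, call_llm: Callable[[str], str]) -> float:
    """Fraction of answer statements the LLM judges as supported by the context."""
    statements = split_into_statements(answer, call_llm)
    if not statements:
        return 0.0
    supported = 0
    for statement in statements:
        verdict = call_llm(
            "Context:\n" + context
            + "\n\nStatement:\n" + statement
            + "\n\nIs the statement fully supported by the context? Answer Yes or No."
        )
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(statements)
```

Production implementations add retries, few-shot examples, and stricter output parsing, but the core loop (decompose, verify, aggregate) is the same.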
2. Answer Entailment and Fact Verification: Entailment-style metrics use natural language inference (NLI) models to check logical consistency between the answer and the source. For example, a BERT-based NLI model can score whether the source text entails (supports) each claim in the answer. Precision (the fraction of answer claims supported by the documents) and recall (the fraction of relevant document facts reflected in the answer) are also used. Related hallucination-detection work trains classifiers to identify unsupported claims. These methods are more granular but typically require labeled data or predefined rules. Tools like FActScore and benchmarks like FEVER (Fact Extraction and Verification) apply similar principles for fact-checking.
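A lightweight way to approximate this is an off-the-shelf MNLI model from Hugging Face. The sketch below assumes the roberta-large-mnli checkpoint and its label order (contradiction, neutral, entailment); check model.config.id2label for other checkpoints. In practice the answer is split into claims first, as in the faithfulness sketch above.

```python
# Sketch of an entailment-based check plus claim-level precision.
# Model name and label order are assumptions; verify against model.config.id2label.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)


def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise (source text) entails the hypothesis (claim)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # For roberta-large-mnli the labels are: 0 contradiction, 1 neutral, 2 entailment.
    return probs[2].item()


def claim_precision(claims: list[str], source: str, threshold: float = 0.5) -> float:
    """Fraction of answer claims the NLI model judges as entailed by the source."""
    if not claims:
        return 0.0
    return sum(entailment_prob(source, c) >= threshold for c in claims) / len(claims)


# Reusing the revenue example from point 1: the unsupported figure scores low.
source = "Company X revenue grew 5% in 2023."
print(entailment_prob(source, "Company X revenue grew 10% in 2023."))  # low
print(entailment_prob(source, "Company X revenue grew 5% in 2023."))   # high
```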
3. Hybrid and Custom Approaches: Libraries like LangChain’s evaluation module or custom pipelines combine metrics. For example:
- BERTScore: Compares semantic overlap between answer and source using contextual embeddings.
- Rule-based checks: Flag numerical mismatches or unsupported entities via NER (Named Entity Recognition); a combined BERTScore and NER sketch appears after this list.
- Human evaluations: Still a gold standard for nuanced cases, though costly.
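A rough combination of the first two checks in the list might look like the following sketch. It assumes the bert-score package and spaCy's en_core_web_sm model are installed; the helper names are illustrative.

```python
# Sketch combining BERTScore with a spaCy-based entity/number check.

import spacy
from bert_score import score as bertscore

nlp = spacy.load("en_core_web_sm")


def semantic_overlap(answer: str, source: str) -> float:
    """BERTScore F1 between the answer and the source passage."""
    _, _, f1 = bertscore([answer], [source], lang="en")
    return f1.item()


def unsupported_entities(answer: str, source: str) -> list[str]:
    """Named entities and numbers in the answer that never appear in the source."""
    source_text = source.lower()
    return [ent.text for ent in nlp(answer).ents if ent.text.lower() not in source_text]


source = "Company X revenue grew 5% in 2023."
answer = "Company X revenue grew 10% in 2023."
print(semantic_overlap(answer, source))      # high: the sentences are nearly identical
print(unsupported_entities(answer, source))  # should flag the mismatched "10%"
```

Note how the two signals complement each other: BERTScore stays high because the answer is semantically close to the source, while the NER check catches the incorrect figure.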
Developers should consider trade-offs: Automated metrics are scalable but may miss context-specific nuances. Combining methods (e.g., RAGAS for LLM-based checks + BERTScore for semantic similarity) often yields more reliable results. Tools like Arize Phoenix or Galileo provide visualization dashboards to analyze these metrics in practice.
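As one way to combine methods, the hypothetical aggregator below reuses the helper functions from the earlier sketches in this section to produce a single per-answer report; the field names and the decision to report raw scores rather than a weighted total are arbitrary choices.

```python
# Hypothetical aggregation of the checks sketched above into one report per answer.
# Depends on split_into_statements, faithfulness_score, claim_precision,
# semantic_overlap, and unsupported_entities defined in the earlier sketches.

def evaluate_answer(answer: str, context: str, call_llm) -> dict:
    claims = split_into_statements(answer, call_llm)
    return {
        "faithfulness": faithfulness_score(answer, context, call_llm),  # LLM judge
        "claim_precision": claim_precision(claims, context),            # NLI model
        "semantic_overlap": semantic_overlap(answer, context),          # BERTScore
        "unsupported_entities": unsupported_entities(answer, context),  # NER check
    }
```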