What are custom retrieval-based metrics for RAG evaluation?
Custom retrieval-based metrics for RAG (Retrieval-Augmented Generation) systems quantify how well generated answers align with the retrieved source documents. They help verify that answers are grounded in the provided context rather than hallucinated. Below are three practical approaches:
1. Sentence Coverage via Embedding Similarity
This metric checks whether each sentence in the generated answer is supported by the retrieved sources. Instead of relying on exact string matches (which are rare), it uses sentence embeddings (e.g., from models like Sentence-BERT) to compute semantic similarity. For each answer sentence, the maximum similarity score against all source sentences is calculated, and a threshold (e.g., 0.8 cosine similarity) determines whether the sentence counts as "covered." The final score is the percentage of answer sentences above the threshold: if 4 out of 5 answer sentences match the sources, the score is 80%. This handles paraphrasing but requires embedding computation, which can be sped up for large source sets with approximate nearest-neighbor libraries like FAISS.
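As a rough sketch of this approach, the snippet below uses the sentence-transformers library; the all-MiniLM-L6-v2 model, the 0.8 threshold, and the assumption that the answer and sources are already split into sentences are illustrative choices rather than requirements.

```python
# Minimal sketch of sentence-coverage scoring with sentence-transformers.
# Model name, threshold, and pre-split sentences are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-embedding model works

def sentence_coverage(answer_sentences, source_sentences, threshold=0.8):
    """Fraction of answer sentences whose best cosine similarity
    against any source sentence meets the threshold."""
    answer_emb = model.encode(answer_sentences, convert_to_tensor=True)
    source_emb = model.encode(source_sentences, convert_to_tensor=True)
    # Similarity matrix: rows = answer sentences, columns = source sentences.
    sims = util.cos_sim(answer_emb, source_emb)
    best_per_answer = sims.max(dim=1).values  # best-matching source per answer sentence
    covered = (best_per_answer >= threshold).sum().item()
    return covered / len(answer_sentences)

# Example usage: a paraphrased answer sentence still counts as covered.
score = sentence_coverage(
    ["Einstein published special relativity in 1905."],
    ["In 1905, Einstein published his theory of special relativity.",
     "He received the Nobel Prize in 1921."],
)
```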
2. Fact/Entity Consistency
This metric extracts key facts (e.g., entities, dates, relationships) from the generated answer and verifies their presence in the retrieved documents. Tools like spaCy or Stanza can identify entities, while relation extraction models (e.g., OpenNRE) map relationships. For example, if the answer states "Einstein developed relativity in 1905," the metric checks whether "relativity" and "1905" appear in the sources with a valid connection. The score is the ratio of verified facts to total facts. This works well for factual accuracy but may miss abstract claims or require domain-specific extraction rules.
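A minimal sketch of the entity half of this check, using spaCy's named entity recognizer; relation verification (e.g., with OpenNRE) is omitted, and the en_core_web_sm model is just one possible choice.

```python
# Hedged sketch of entity-level consistency checking with spaCy.
# Only checks that each entity in the answer also appears in the sources;
# relation/connection verification is left out.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def entity_consistency(answer: str, sources: list[str]) -> float:
    """Ratio of answer entities that also occur (case-insensitively) in the sources."""
    source_text = " ".join(sources).lower()
    entities = [ent.text for ent in nlp(answer).ents]
    if not entities:
        return 1.0  # nothing to verify
    verified = sum(1 for ent in entities if ent.lower() in source_text)
    return verified / len(entities)

# "Einstein" (PERSON) and "1905" (DATE) are extracted and checked against the source.
score = entity_consistency(
    "Einstein developed relativity in 1905.",
    ["Albert Einstein published the theory of special relativity in 1905."],
)
```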
3. Token Overlap with Source Weighting
This metric measures lexical overlap between the answer and sources using measures like ROUGE or TF-IDF-weighted token matches. ROUGE-L, for example, computes the longest common subsequence between the answer and the sources, capturing content overlap while respecting word order. Alternatively, TF-IDF weights rare tokens (e.g., technical terms) more heavily than common ones, checking that the answer's critical terms appear in the sources. A hybrid approach can combine both: ROUGE for structure and TF-IDF for specificity. While simpler than embedding-based methods, token overlap struggles with paraphrased or reordered content.
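The hybrid idea might look roughly like the following, assuming the rouge-score and scikit-learn packages; the 50/50 blend of the two signals is arbitrary and would normally be tuned.

```python
# Illustrative sketch combining ROUGE-L (rouge-score package) with a
# TF-IDF-weighted token check (scikit-learn). The equal weighting is arbitrary.
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer

def overlap_score(answer: str, sources: list[str]) -> float:
    source_text = " ".join(sources)

    # ROUGE-L: longest common subsequence between sources and answer.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(source_text, answer)["rougeL"].fmeasure

    # TF-IDF-weighted recall: how much of the answer's high-weight (rare)
    # vocabulary also appears in the sources.
    vectorizer = TfidfVectorizer()
    vectorizer.fit([answer] + sources)
    weights = dict(zip(vectorizer.get_feature_names_out(),
                       vectorizer.transform([answer]).toarray()[0]))
    source_tokens = set(vectorizer.build_analyzer()(source_text))
    total = sum(weights.values()) or 1.0
    matched = sum(w for tok, w in weights.items() if tok in source_tokens)
    tfidf_recall = matched / total

    return 0.5 * rouge_l + 0.5 * tfidf_recall
```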
Implementation Considerations
- Thresholds: Embedding similarity and fact verification require tuning thresholds (e.g., what constitutes a "match").
- Efficiency: Embedding-based metrics can be slow for large source sets; chunking or pre-indexing sources helps.
- Trade-offs: Token overlap is fast but less nuanced, while entailment models (e.g., DeBERTa fine-tuned for NLI) offer deeper validation at higher computational cost; a sketch follows this list.
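To make the entailment option concrete, here is a hedged sketch using a DeBERTa NLI checkpoint from Hugging Face transformers; the model name and its label scheme are assumptions, so check the labels of whichever checkpoint you actually load.

```python
# Sketch of entailment-based validation with a DeBERTa NLI model via transformers.
# The checkpoint name and label names are assumptions; verify them for your model.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def entailment_coverage(answer_sentences, source_text, min_score=0.5):
    """Fraction of answer sentences the NLI model judges as entailed by the sources.
    Very long source_text may exceed the model's context window and need chunking."""
    entailed = 0
    for sentence in answer_sentences:
        # Premise = retrieved sources, hypothesis = answer sentence.
        result = nli([{"text": source_text, "text_pair": sentence}])[0]
        if result["label"].lower().startswith("entail") and result["score"] >= min_score:
            entailed += 1
    return entailed / len(answer_sentences)
```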
These metrics can be combined (e.g., 70% weight on fact consistency, 30% on sentence coverage) to create a composite score tailored to specific use cases.
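A composite can be as simple as a weighted average; the 0.7/0.3 split below merely mirrors the example above and should be tuned per use case.

```python
# Hypothetical composite score: weights are use-case-specific, not prescribed.
def composite_score(fact_consistency: float, sentence_coverage: float,
                    w_facts: float = 0.7, w_coverage: float = 0.3) -> float:
    return w_facts * fact_consistency + w_coverage * sentence_coverage
```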