To design a metric that penalizes ungrounded content in an answer, the goal is to measure how much of the answer aligns with supporting documents. A precision-like approach can be used, where the metric calculates the proportion of the answer’s content that is directly supported by the provided documents. Here’s a structured approach:
1. Segment the Answer into Verifiable Units

First, break the answer into smaller units (e.g., sentences, clauses, or explicit factual claims). For example, the answer “Paris is the capital of France, which has a population of 67 million” contains two claims: (1) Paris is France’s capital, and (2) France’s population is 67 million. Each unit is treated as an individual item to validate. NLP libraries such as spaCy can split sentences and extract entities or claims; alternatively, semantic parsing can identify distinct propositions. The granularity depends on the desired balance between rigor and computational complexity.
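As a concrete starting point, here is a minimal sketch using spaCy’s sentence segmenter as the unit splitter. The model name and the helper function are illustrative choices, and claim-level splitting (as in the Paris example above) would require an additional extraction step.

```python
# Minimal sketch of sentence-level segmentation, assuming spaCy and its
# small English model are installed (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def split_into_units(answer: str) -> list[str]:
    """Split an answer into sentence-level units to validate individually."""
    doc = nlp(answer)
    return [sent.text.strip() for sent in doc.sents]

units = split_into_units(
    "Paris is the capital of France, which has a population of 67 million."
)
# Sentence-level splitting returns a single unit here; separating the two
# claims would require clause- or proposition-level parsing.
```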
2. Validate Each Unit Against Supporting Documents

For each unit, check if it is supported by the documents. This involves:
- Semantic Matching: Use embeddings (e.g., SBERT, OpenAI embeddings) to compute similarity between the answer unit and document passages. Set a threshold (e.g., cosine similarity ≥ 0.8) to determine support.
- Entailment Checking: Use models like T5 or BART fine-tuned on textual entailment tasks to verify if the document text logically supports the answer unit.
- Exact or Paraphrased Matches: For simpler cases, check for keyword overlap or paraphrased equivalents (e.g., “67 million” vs. “67,000,000”).
A unit is marked as “grounded” if any document passage supports it. For example, if the documents confirm Paris is France’s capital but lack population data, the first claim is supported, while the second is not.
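A rough sketch of the semantic-matching variant of this check is shown below. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model; the 0.8 threshold is the illustrative value from above and would need tuning. An entailment model could replace the similarity test without changing the surrounding logic.

```python
# Hedged sketch: a unit counts as grounded if any passage clears the
# similarity threshold. The model choice and threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def is_grounded(unit: str, passages: list[str], threshold: float = 0.8) -> bool:
    """Return True if at least one document passage supports the unit."""
    unit_emb = model.encode(unit, convert_to_tensor=True)
    passage_embs = model.encode(passages, convert_to_tensor=True)
    sims = util.cos_sim(unit_emb, passage_embs)  # shape: (1, num_passages)
    return bool(sims.max() >= threshold)
```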
3. Compute the Metric

The metric is the ratio of supported units to total units. For example, if 3 out of 5 answer claims are supported, the score is 0.6. Optional refinements include:
- Weighting: Assign higher weights to critical claims (e.g., central facts vs. minor details).
- Partial Credit: Use similarity scores (e.g., 0.7 similarity = 0.7 contribution to the score) instead of binary thresholds.
- Error Analysis: Track common failure modes (e.g., unsupported numerical data, unverified entities) to improve the metric’s design.
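The scoring step itself reduces to a weighted average. Below is a sketch under the assumption that each unit already has a support score in [0, 1] (1.0 for grounded, 0.0 for ungrounded, or a similarity value for partial credit) and an optional weight; both the function name and the input format are hypothetical.

```python
# Sketch of the metric: weighted proportion of supported content.
# `support` holds per-unit scores; `weights` defaults to uniform weighting.
def groundedness_score(support: list[float], weights: list[float] | None = None) -> float:
    if not support:
        return 0.0
    if weights is None:
        weights = [1.0] * len(support)
    return sum(s * w for s, w in zip(support, weights)) / sum(weights)
```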
Example Implementation

Suppose an answer is split into 4 sentences. After validation, 3 are supported by documents, so the metric returns 0.75. If the fourth sentence is partially supported (e.g., similarity score 0.6) rather than unsupported, the score adjusts to (3 + 0.6)/4 = 0.9, depending on the scoring rules. This approach balances simplicity with flexibility, allowing customization based on use-case requirements like document quality or answer complexity.
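Plugging the example numbers into the hypothetical groundedness_score sketch from step 3 reproduces both figures:

```python
groundedness_score([1.0, 1.0, 1.0, 0.0])  # 3 of 4 supported -> 0.75
groundedness_score([1.0, 1.0, 1.0, 0.6])  # partial credit for the 4th -> 0.9
```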
