To explicitly measure "supporting evidence coverage," you need a systematic way to verify that every claim or assertion in an answer is grounded in at least one retrieved document. This involves breaking down the answer into smaller units (e.g., sentences, claims, or propositions) and mapping each unit to specific sections of the retrieved documents. Automated methods can leverage semantic similarity, natural language inference (NLI), or model attribution techniques, while hybrid approaches might combine automated checks with human validation. The goal is to quantify the percentage of answer segments that have clear, traceable support in the source material.
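The coverage metric itself is straightforward once the answer is segmented. The sketch below is a minimal illustration: the segment splitter is a naive regex (real pipelines often use spaCy or nltk), and `is_supported` is a hypothetical caller-supplied predicate that decides whether a single segment is grounded — any checker (similarity, NLI, or human labels) can plug in there.

```python
import re

def split_into_segments(answer: str) -> list[str]:
    """Naive sentence splitter; production pipelines typically use
    a proper sentence segmenter (e.g., spaCy or nltk)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]

def coverage_score(segments: list[str], is_supported) -> float:
    """Fraction of answer segments with at least one supporting passage.

    `is_supported` is any callable mapping a segment to True/False;
    it stands in for whatever verification method you choose.
    """
    if not segments:
        return 0.0
    return sum(1 for s in segments if is_supported(s)) / len(segments)
```

The value of keeping the metric decoupled from the checker is that you can swap in stricter or looser verification later without changing how coverage is reported.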
One practical approach is to use semantic similarity metrics and NLI models. For example, split the answer into sentences or claims and encode them (along with the retrieved documents) into vector embeddings using models like SBERT. Compute cosine similarity between answer segments and document passages to identify overlaps. For deeper validation, apply NLI models (e.g., models trained on datasets like SNLI) to determine whether a retrieved passage entails a given answer segment. For instance, if an answer states, "Climate change increases extreme weather events," an NLI model can check if a retrieved document passage logically supports this claim, even if phrased differently. Tools like FAISS or Elasticsearch can accelerate similarity searches across large document sets, while thresholds (e.g., similarity scores > 0.8 or entailment confidence > 90%) help standardize what qualifies as "supported."
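The similarity step can be sketched without heavy dependencies. Below, a toy bag-of-words vector stands in for a real sentence embedding (in practice you would call something like `SentenceTransformer("all-MiniLM-L6-v2").encode(...)` from the sentence-transformers library), and the threshold value is illustrative, not calibrated:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. Stand-in for a real sentence
    encoder such as SBERT; only the cosine-similarity logic carries over."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_support(segment: str, passages: list[str], threshold: float = 0.5):
    """Return (score, passage) for the best-matching passage,
    or (score, None) if nothing clears the threshold."""
    scored = [(cosine(embed(segment), embed(p)), p) for p in passages]
    score, passage = max(scored, key=lambda t: t[0])
    return (score, passage) if score >= threshold else (score, None)
```

With real embeddings the same structure applies; you would then pass segment–passage pairs that clear the similarity threshold to an NLI model for an entailment check.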
Challenges include handling implicit reasoning (e.g., answers that synthesize information from multiple documents) and avoiding false negatives where support exists but isn't obvious at the surface level. To address these, consider hybrid methods: use automated metrics for initial scoring, then sample low-confidence cases for human review. For example, if an answer segment has low similarity scores but the document discusses related concepts (e.g., "rising temperatures" instead of "climate change"), annotators can manually validate the connection. Additionally, model attribution techniques—like analyzing attention weights in transformer-based models—can highlight which document passages influenced specific answer segments during generation. However, this requires access to the model's internals, which isn't always feasible. Ultimately, a combination of automated checks (NLI, similarity) and targeted human evaluation provides a balanced measure of coverage while accounting for nuance.
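The hybrid routing described above can be sketched as a simple confidence-band triage. The thresholds here are hypothetical placeholders; in practice you would tune them against a hand-labeled sample:

```python
def triage(segments: list[str], score_fn,
           auto_hi: float = 0.8, review_lo: float = 0.4):
    """Route each segment by its automated support score:
      >= auto_hi            -> accepted as supported automatically
      [review_lo, auto_hi)  -> queued for human review
      <  review_lo          -> flagged as unsupported automatically

    `score_fn` is any callable mapping a segment to a score in [0, 1],
    e.g., a cosine similarity or NLI entailment probability.
    """
    accepted, needs_review, flagged = [], [], []
    for seg in segments:
        score = score_fn(seg)
        if score >= auto_hi:
            accepted.append(seg)
        elif score >= review_lo:
            needs_review.append(seg)
        else:
            flagged.append(seg)
    return accepted, needs_review, flagged
```

This keeps human effort focused on the middle band, where automated scores are least reliable — typically a small fraction of segments.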