To evaluate multi-hop QA systems effectively, metrics must assess both the correctness of the final answer and the system’s ability to integrate information from multiple sources. Traditional QA metrics such as exact match (EM) and F1, which measure surface-level overlap between predicted and reference answers, are insufficient on their own because they don’t verify whether the reasoning process actually drew on multiple documents. For example, a model might guess the correct answer from a single document, bypassing the required multi-hop reasoning entirely. A combination of answer accuracy and intermediate reasoning validation is therefore necessary.
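As a concrete baseline, the sketch below computes exact match and token-level F1 with SQuAD-style answer normalization (lowercasing, stripping punctuation and articles); the function names and normalization details are illustrative rather than taken from any official scoring script.

```python
from collections import Counter
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace
    (mirrors the common SQuAD-style answer normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris, France", "paris france"))  # 1.0
print(token_f1("John lives in France", "France"))    # 0.4
```

Note how a correct-but-verbose prediction already loses F1 credit, and neither metric says anything about which documents the answer came from.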
One approach is to use evidence retrieval metrics alongside answer correctness. For instance, datasets like HotpotQA provide annotated supporting facts, allowing evaluation of whether the model retrieves all required documents. Precision and recall can measure how well the model identifies relevant passages. If the correct answer relies on documents A and B, the model must retrieve both. Additionally, structured explanations can be evaluated: if the model outputs intermediate reasoning steps (e.g., “Document A states X, and Document B states Y, so the answer is Z”), automated checks can verify if both sources are cited and logically connected. This ensures the model isn’t shortcutting the process.
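Below is a minimal sketch of supporting-fact evaluation, assuming gold evidence is given as (document title, sentence id) pairs in the style of HotpotQA annotations; `cites_all_sources` is a deliberately crude, hypothetical stand-in for a real citation checker over structured explanations.

```python
def supporting_fact_scores(predicted, gold):
    """Precision/recall/F1 over supporting facts, treated as sets of
    (doc_title, sentence_id) pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def cites_all_sources(explanation, required_titles):
    """Crude check that a free-text reasoning chain mentions every
    required document title."""
    return all(title.lower() in explanation.lower() for title in required_titles)

gold_sp = [("Document A", 0), ("Document B", 2)]
pred_sp = [("Document A", 0)]
print(supporting_fact_scores(pred_sp, gold_sp))  # (1.0, 0.5, 0.666...)

explanation = "Document A states X, and Document B states Y, so the answer is Z."
print(cites_all_sources(explanation, ["Document A", "Document B"]))  # True
```

Here a model that retrieves only Document A gets perfect precision but only 0.5 recall, which is exactly the shortcut this metric is designed to surface.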
Another key metric is semantic answer similarity, which addresses phrasing variations. Tools like BERTScore compare embeddings of the predicted and reference answers to measure semantic equivalence, avoiding over-reliance on exact wording. For example, if the answer requires combining “Document A: John lives in Paris” and “Document B: Paris is in France,” a correct prediction like “John resides in France” should score highly despite sharing few keywords with the reference. Finally, adversarial testing can expose shortcuts: remove one critical document from the input and check whether accuracy drops; if it doesn’t, the model was likely answering from a single source rather than performing genuine multi-hop reasoning. Combined, these metrics provide a robust evaluation of multi-hop reasoning.
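The sketch below combines both ideas, assuming the `bert-score` package and a hypothetical `answer_fn(question, documents)` interface to the system under test; the 0.9 similarity threshold is illustrative, not a standard value.

```python
from bert_score import score  # pip install bert-score

def semantic_match(prediction, reference, threshold=0.9):
    """Count a prediction as correct if its BERTScore F1 against the
    reference exceeds a threshold (threshold chosen for illustration)."""
    _, _, f1 = score([prediction], [reference], lang="en", verbose=False)
    return f1.item() >= threshold

def passes_ablation_test(answer_fn, question, documents, reference):
    """Adversarial check: re-ask the question with each document removed.
    `answer_fn(question, docs) -> str` is a hypothetical interface to the
    QA system under test. A genuine multi-hop system should answer
    correctly with the full context but fail when any critical document
    is missing."""
    full_correct = semantic_match(answer_fn(question, documents), reference)
    ablated_correct = [
        semantic_match(answer_fn(question, documents[:i] + documents[i + 1:]), reference)
        for i in range(len(documents))
    ]
    # Suspicious if the model still answers correctly after removing a document.
    return full_correct and not any(ablated_correct)
```

Using a soft semantic threshold inside the ablation test keeps the two ideas consistent: the model is credited for paraphrased answers, but only when the full evidence set is actually required to produce them.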