When evaluating multi-step retrieval systems, the dataset must be designed to test a system's ability to combine information from multiple sources. Here are the key considerations:
1. Explicit Multi-Hop Queries and Document Relationships

The dataset must include questions that cannot be answered using a single document. For example, a query like "What was the revenue of Company A in 2023, and how does it compare to its acquisition target, Company B?" requires retrieving Company A’s financial report and Company B’s pre-acquisition disclosures. Each question should be paired with a ground-truth set of documents explicitly marked as necessary to answer it. This ensures the evaluation measures whether the system can identify and connect relevant documents. Additionally, documents should contain shared identifiers (e.g., unique entity IDs, project names, or dates) or implicit relationships (e.g., a merger announcement in one doc and financial results in another) to enable validation of cross-document reasoning.
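A minimal sketch of what one such dataset entry might look like. The field names and document IDs here are illustrative assumptions, not a standard benchmark format:

```python
# Hypothetical schema for one multi-hop evaluation example.
# All field names and doc IDs are illustrative, not a standard format.
example = {
    "query": (
        "What was the revenue of Company A in 2023, and how does it "
        "compare to its acquisition target, Company B?"
    ),
    # Ground-truth documents explicitly marked as jointly necessary.
    "required_docs": ["companyA_10K_2023", "companyB_preacq_disclosure"],
    # Shared identifiers that link the documents, for validating
    # cross-document reasoning.
    "link_keys": {"entity_ids": ["CMP-A", "CMP-B"], "fiscal_year": 2023},
}


def is_answerable(retrieved_ids, entry):
    """True only if every required document was retrieved."""
    return set(entry["required_docs"]).issubset(set(retrieved_ids))
```

Marking the full required set (rather than any single relevant doc) is what lets an evaluator distinguish a complete evidence chain from a partial one.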
2. Noise and Distractors

To avoid testing trivial retrieval, the dataset should include:
- Irrelevant documents: Unrelated to the query but sharing keywords (e.g., a document mentioning "Company B" in a different context).
- Partial matches: Documents that address one part of the query but lack critical information (e.g., Company A’s 2022 revenue report instead of 2023).
- Ambiguous links: Documents that require disambiguation (e.g., two companies with similar names or overlapping timelines).

This tests whether the system can prioritize correct connections over plausible-but-incorrect ones. For instance, a query about a product launch might require linking a technical spec document with a marketing timeline, while ignoring a separate engineering blog post about unrelated R&D.
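One way to validate these distractors during dataset construction is to tag each corpus document by its role relative to a query, then measure how often distractors or partial matches outrank the required documents. A hedged sketch (roles and doc IDs are hypothetical):

```python
# Illustrative: role labels for documents relative to one query.
# IDs and roles are hypothetical examples, not a real corpus.
CORPUS_ROLES = {
    "companyA_10K_2023": "required",
    "companyB_preacq_disclosure": "required",
    "companyA_10K_2022": "partial",           # right entity, wrong year
    "companyB_blog_unrelated": "distractor",  # shares the "Company B" keyword
}


def distractors_above_gold(ranked_ids, roles):
    """Count non-required docs ranked above the last required doc.

    Returns None when the chain is incomplete (not all required docs
    appear in the ranking), since no meaningful cutoff exists.
    """
    required_positions = [
        i for i, doc in enumerate(ranked_ids) if roles.get(doc) == "required"
    ]
    n_required = sum(1 for role in roles.values() if role == "required")
    if len(required_positions) < n_required:
        return None
    cutoff = max(required_positions)
    return sum(
        1 for doc in ranked_ids[:cutoff] if roles.get(doc) != "required"
    )
```

A score of zero means the retriever ranked the full evidence chain above every keyword-sharing distractor; higher scores quantify how much noise the downstream reader must filter out.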
3. Granular Ground-Truth Annotation

Beyond marking which documents are relevant, the dataset should specify:
- Dependencies: The order in which documents must be retrieved (e.g., identifying a patent document first might help resolve ambiguity in a technical manual).
- Conflict resolution: Cases where documents contradict each other (e.g., differing revenue figures in a press release vs. an SEC filing).
- Answer synthesis requirements: Whether the final answer requires arithmetic (e.g., calculating growth rates from two reports) or logical inference (e.g., deducing a timeline from event mentions across docs).

For reproducibility, annotations might include “reasoning chains” that map how the ground-truth answer derives from specific document sections.
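The annotation fields above can be sketched as a single record per example. Every field name, section label, and figure here is a hypothetical illustration of the structure, not real data:

```python
# Hypothetical granular annotation for one example. All values are
# invented for illustration.
annotation = {
    "query": "How did Company A's revenue grow from 2022 to 2023?",
    # Suggested retrieval order (dependencies between documents).
    "dependencies": ["companyA_10K_2022", "companyA_10K_2023"],
    # Contradictions the system must resolve (empty for this example).
    "conflicts": [],
    # Synthesis type: the answer must be computed, not quoted.
    "synthesis": "arithmetic",
    # Reasoning chain: maps the answer back to specific doc sections.
    "reasoning_chain": [
        {"doc": "companyA_10K_2022", "section": "Item 8", "value": 10.0},
        {"doc": "companyA_10K_2023", "section": "Item 8", "value": 12.5},
        {"step": "growth = (12.5 - 10.0) / 10.0"},
    ],
}


def growth_rate(prev, curr):
    """The arithmetic step the annotation's reasoning chain describes."""
    return (curr - prev) / prev
```

Storing the chain alongside the answer lets evaluators check not just whether a system's final answer matches, but whether each intermediate fact was grounded in the marked section.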
Example: A dataset for legal contract analysis might include a query like “Does Party X owe penalties if Project Y is delayed due to force majeure?” with two marked documents: a force majeure clause in a master agreement and a project-specific SLA. The system must retrieve both and recognize that the SLA overrides the master agreement’s general terms.
Without these considerations, evaluations risk measuring single-step lookup performance rather than true multi-document reasoning.