To obtain ground truth data for identifying which document or passage answers a question, several methods exist, each with distinct trade-offs in accuracy, cost, and scalability. Here are three primary approaches:
1. Annotated Datasets (e.g., SQuAD, Natural Questions)
Pre-built datasets like SQuAD provide structured question-answer pairs with explicit links to source passages. For example, SQuAD uses Wikipedia articles, where crowdworkers write questions and highlight the exact text span containing the answer. Similarly, Google’s Natural Questions dataset pairs real search queries with human-annotated answers from Wikipedia. These datasets are reliable for training and benchmarking because they explicitly map each question to its supporting passage. However, they are limited to specific domains (e.g., Wikipedia) and require significant effort to create. Tools like Prodigy or Amazon Mechanical Turk are often used to scale manual annotation, but cost and turnaround time grow with dataset size.
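As a concrete illustration of this explicit question-to-passage mapping, here is a minimal sketch that flattens a SQuAD-style JSON file into (question, passage, answer span) triples. It assumes the standard SQuAD v1.1 layout (data → paragraphs → context/qas); the file path in the usage note is a placeholder.

```python
import json

def load_squad_triples(path):
    """Flatten a SQuAD v1.1-style JSON file into question/passage/answer-span records."""
    with open(path, encoding="utf-8") as f:
        squad = json.load(f)

    triples = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]          # the source passage
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:        # gold spans marked by crowdworkers
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    triples.append({
                        "question": qa["question"],
                        "passage": context,
                        "answer_text": answer["text"],
                        "answer_span": (start, end),
                    })
    return triples

# Example usage (path is a placeholder):
# triples = load_squad_triples("train-v1.1.json")
```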
2. Distant Supervision and Heuristics
Distant supervision uses existing structured data (e.g., knowledge bases like Wikidata) to automatically link answers to passages. For instance, if a fact like “Einstein born in 1879” exists in a knowledge base, a system might search documents for sentences mentioning both “Einstein” and “1879” and label those as relevant. Heuristics like keyword matching (e.g., overlapping terms between a question and passage) can also generate weak labels. While cost-effective, this method risks inaccuracies due to paraphrasing, missing context, or indirect references. For example, a passage discussing “the physicist’s birth year” without explicitly naming Einstein might be overlooked.
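A minimal sketch of both ideas, assuming you already have (entity, answer) facts from a knowledge base and a list of raw passage strings; the matching is deliberately naive string containment, which is exactly where the paraphrasing failure described above creeps in.

```python
def distant_supervision_labels(facts, passages):
    """Weakly label a passage as relevant to a fact when it mentions both the
    entity and the answer string, e.g. ("Einstein", "1879")."""
    labels = []
    for entity, answer in facts:
        for idx, passage in enumerate(passages):
            text = passage.lower()
            if entity.lower() in text and answer.lower() in text:
                labels.append({"fact": (entity, answer),
                               "passage_id": idx,
                               "label": "relevant"})
    return labels

def keyword_overlap(question, passage):
    """Heuristic weak signal: fraction of question terms that appear in the passage."""
    q_terms = set(question.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

# Hypothetical example: the second passage is missed because it never names "Einstein".
# facts = [("Einstein", "1879")]
# passages = ["Albert Einstein was born in 1879 in Ulm.",
#             "The physicist's birth year is often misquoted."]
# print(distant_supervision_labels(facts, passages))
```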
3. User Interaction Data and Synthetic Labeling
Implicit signals from user behavior, such as click-through rates on search results, can indicate which documents users found relevant. Platforms like Stack Overflow or FAQ pages also provide question-answer pairs that can be mapped to source passages, though this requires manual curation. Synthetic labeling involves training a model on a small annotated dataset and using it to label a much larger one. While scalable, this method propagates errors from the initial model. For example, a model trained on SQuAD might mislabel technical documents outside Wikipedia’s domain without fine-tuning.
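The sketch below turns a raw click log into weak relevance labels. The (query, doc_id, clicked) tuple format and the click threshold are illustrative assumptions, not a standard API; real logs need de-duplication, position-bias correction, and similar cleanup.

```python
from collections import defaultdict

def clickthrough_labels(click_log, min_clicks=3):
    """Aggregate click events into weak (query, doc_id) -> 'relevant' labels.

    click_log: iterable of (query, doc_id, clicked) tuples, e.g. parsed from
    search logs. A document clicked at least `min_clicks` times for a query is
    treated as a noisy positive; the threshold is an illustrative assumption.
    """
    counts = defaultdict(int)
    for query, doc_id, clicked in click_log:
        if clicked:
            counts[(query, doc_id)] += 1
    return {pair: "relevant" for pair, n in counts.items() if n >= min_clicks}

# Hypothetical example:
# log = [("einstein birth year", "doc_42", True),
#        ("einstein birth year", "doc_42", True),
#        ("einstein birth year", "doc_42", True),
#        ("einstein birth year", "doc_7", False)]
# print(clickthrough_labels(log))  # {('einstein birth year', 'doc_42'): 'relevant'}
```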
Trade-offs and Use Cases
Annotated datasets offer high precision but are domain-specific and expensive. Distant supervision and heuristics are cheaper but less reliable. User data and synthetic methods balance scalability and accuracy but depend on existing infrastructure (search logs, a seed model, or a curated Q&A platform). The choice depends on the problem’s scope: SQuAD-style data suits general QA, while domain-specific applications may require custom annotation or hybrid approaches combining multiple methods.
