A dataset for evaluating hallucination should be designed to test a system’s ability to distinguish between answerable and unanswerable questions, detect factual inaccuracies, and handle ambiguous or incomplete information. Here’s how to structure it effectively:
1. Controlled Knowledge Base and Question Categorization

The dataset must include a clearly defined knowledge base (KB) and questions labeled as either answerable (the answer exists in the KB) or unanswerable (the answer is missing or ambiguous). For example, in a medical QA system, an answerable question might be "What is the treatment for influenza?" (if the KB includes flu treatments), while an unanswerable one could be "What is the cure for fictional disease X?" (if X isn't in the KB). Each question should map to specific KB entries or explicitly state that no answer exists. To avoid ambiguity, unanswerable questions should not be phrased as trick questions but as realistic queries a user might plausibly ask, such as "When did event Y occur?" when Y is not documented in the KB.
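This mapping between questions and KB entries can be captured in a simple record format. The sketch below assumes illustrative field names (`question`, `label`, `kb_refs`, `gold_answer`) rather than any standard schema:

```python
# Minimal sketch of dataset entries linking questions to KB coverage.
# Field names are illustrative assumptions, not an established format.

dataset = [
    {
        "question": "What is the treatment for influenza?",
        "label": "answerable",
        "kb_refs": ["kb/influenza_treatment"],  # KB entries the answer maps to
        "gold_answer": "Antiviral drugs such as oseltamivir; rest and fluids.",
    },
    {
        "question": "What is the cure for fictional disease X?",
        "label": "unanswerable",
        "kb_refs": [],        # no supporting KB entry exists
        "gold_answer": None,  # correct behavior is to abstain
    },
]

def is_answerable(entry):
    """An entry counts as answerable only if it maps to at least one KB entry."""
    return entry["label"] == "answerable" and len(entry["kb_refs"]) > 0

print([is_answerable(e) for e in dataset])  # → [True, False]
```

Requiring a non-empty `kb_refs` list for answerable items makes the KB alignment checkable mechanically, so mislabeled entries are caught before evaluation.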
2. Diversity in Question Types and Contexts

Include a mix of question types (factual, hypothetical, opinion-based) and domains (e.g., science, history, current events) to test generalization. For instance:
- Factual questions with missing data: “What is the GDP of Country Z in 2023?” (if Z’s 2023 data isn’t in the KB).
- Ambiguous queries: “How do I fix error code 123?” (if the KB only covers error codes up to 100).
- Adversarial examples: Questions that resemble answerable ones but contain subtle inaccuracies, like "What year did the Titanic sink in the Pacific Ocean?" (the Titanic sank in the Atlantic).

This diversity ensures the model isn't overfitting to a specific pattern and can handle edge cases, such as partial matches (e.g., a question about "AI advancements in 2024" when the KB only covers up to 2023).
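One way to enforce this diversity is a coverage check that flags question types missing from the dataset before it is used for evaluation. A minimal sketch, assuming each question record carries `type` and `domain` tags (both names are hypothetical):

```python
from collections import Counter

# Illustrative question records tagged by type and domain (fields assumed).
questions = [
    {"id": 1, "type": "factual", "domain": "economics"},
    {"id": 2, "type": "ambiguous", "domain": "software"},
    {"id": 3, "type": "adversarial", "domain": "history"},
    {"id": 4, "type": "factual", "domain": "science"},
]

def missing_types(questions, required_types):
    """Return the required question types with zero coverage in the dataset."""
    present = Counter(q["type"] for q in questions)
    return [t for t in required_types if present[t] == 0]

missing = missing_types(
    questions, ["factual", "ambiguous", "adversarial", "hypothetical"]
)
print(missing)  # → ['hypothetical']
```

The same check can be run per domain, or per (type, domain) pair, to confirm the dataset is not dominated by one pattern the model could overfit to.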
3. Annotations and Evaluation Metrics

Each question should have ground-truth annotations indicating:
- The correct answer (if answerable).
- A flag for “unanswerable” or “requires abstention.”
- Contextual metadata (e.g., KB coverage, question source).

For evaluation, track metrics like:
- Correct abstention rate: How often the system refrains from answering unanswerable questions.
- Hallucination rate: Instances where the system invents false answers for unanswerable questions.
- Answer accuracy: Precision/recall for answerable questions.

For example, a system that answers "I don't know" to "What is the capital of Mars?" (unanswerable) but correctly answers "What is the capital of France?" (answerable) demonstrates proper behavior. Including confidence scores (if the system provides them) can further measure calibration—e.g., low confidence for unanswerable questions.
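These metrics follow directly from per-question evaluation records. A sketch of the computation, assuming each record holds `answerable` (ground truth), `abstained` (system behavior), and `correct` (graded only when the system answered):

```python
def hallucination_metrics(records):
    """Compute abstention, hallucination, and accuracy rates.

    Record fields are assumed: 'answerable' (bool ground truth),
    'abstained' (bool, did the system refuse to answer),
    'correct' (bool, meaningful only when the system answered).
    """
    unanswerable = [r for r in records if not r["answerable"]]
    answerable = [r for r in records if r["answerable"]]

    # How often the system properly refused unanswerable questions.
    correct_abstention = sum(r["abstained"] for r in unanswerable) / len(unanswerable)
    # How often it invented an answer instead.
    hallucination = sum(not r["abstained"] for r in unanswerable) / len(unanswerable)
    # Accuracy over answerable questions the system actually attempted.
    answered = [r for r in answerable if not r["abstained"]]
    accuracy = sum(r["correct"] for r in answered) / len(answered) if answered else 0.0

    return {
        "correct_abstention_rate": correct_abstention,
        "hallucination_rate": hallucination,
        "answer_accuracy": accuracy,
    }

records = [
    {"answerable": False, "abstained": True,  "correct": False},  # proper abstention
    {"answerable": False, "abstained": False, "correct": False},  # hallucination
    {"answerable": True,  "abstained": False, "correct": True},   # correct answer
]
print(hallucination_metrics(records))
# → {'correct_abstention_rate': 0.5, 'hallucination_rate': 0.5, 'answer_accuracy': 1.0}
```

Note that on unanswerable questions, abstention and hallucination rates sum to 1, so a system cannot trade one off without affecting the other.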
By combining structured KB alignment, diverse question types, and rigorous annotations, the dataset will reliably test a system’s ability to avoid hallucination while maintaining accuracy on valid queries.