Open-Book QA and Its Relation to RAG

Open-book question answering (QA) refers to a scenario where a model can access external information sources, such as documents or databases, to answer questions. Unlike closed-book QA, which relies solely on a model's pre-trained knowledge, open-book QA mimics real-world tasks where consulting reference materials is allowed. For example, a system answering medical questions might retrieve research papers before generating a response.

Retrieval-Augmented Generation (RAG) is a specific implementation of open-book QA. In RAG, the model first retrieves relevant documents from a knowledge source and then uses that context to generate an answer. This two-step process combines retrieval (finding useful information) with generation (synthesizing a coherent response). RAG is not the only open-book approach, but it is widely used because it explicitly separates retrieval from generation, making it easy to update the knowledge source without retraining the model.
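The two-step retrieve-then-generate flow can be sketched with a toy keyword retriever and a placeholder generator. This is a minimal illustration, not a real RAG library: the word-overlap scoring and the `generate` stub are stand-ins for a vector retriever and an LLM call.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Retrieval step: rank documents by word overlap with the question, keep top-k."""
    return sorted(corpus,
                  key=lambda doc: len(tokens(question) & tokens(doc)),
                  reverse=True)[:k]

def generate(question: str, context: list[str]) -> str:
    """Generation step: a stand-in for an LLM call that conditions on the context."""
    return f"Based on {len(context)} retrieved document(s): " + " ".join(context)

corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is located in Paris.",
]
docs = retrieve("What is the capital of France?", corpus)
print(generate("What is the capital of France?", docs))
```

Because retrieval and generation are separate functions, the corpus can be swapped or extended without touching the generator, which is exactly the maintenance advantage described above.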
Evaluating Closed-Book QA

In closed-book settings, evaluation focuses on the model's ability to recall factual knowledge from its training data. Metrics like accuracy, precision, and recall are applied to standardized benchmarks (e.g., TriviaQA or Natural Questions) to measure how well the model answers questions without external help. The emphasis is on memorization: whether the model learned specific facts during training. For instance, if a closed-book model incorrectly states that Paris is the capital of Germany, this reflects a gap in its training data or learning process. Evaluators also test generalization: can the model answer nuanced or paraphrased questions without access to new information? Limitations arise when questions require up-to-date or domain-specific knowledge not present in the training data, since closed-book models cannot adapt to new information after training.
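A common closed-book scoring procedure is exact-match accuracy over a benchmark's gold answers. The sketch below is illustrative: the normalization (lowercasing and stripping whitespace) follows common practice for QA benchmarks, and the tiny question set, including the Paris/Germany error from the example above, is made up for demonstration.

```python
def exact_match_accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of questions whose predicted answer exactly matches the gold answer
    after light normalization (lowercase, strip surrounding whitespace)."""
    normalize = lambda s: s.strip().lower()
    correct = sum(normalize(predictions[q]) == normalize(gold[q]) for q in gold)
    return correct / len(gold)

gold = {
    "What is the capital of France?": "Paris",
    "What is the capital of Germany?": "Berlin",
}
predictions = {  # hypothetical model outputs; the second reflects a memorization gap
    "What is the capital of France?": "Paris",
    "What is the capital of Germany?": "Paris",
}
print(exact_match_accuracy(predictions, gold))  # → 0.5
```

Real benchmarks typically pair exact match with softer token-overlap scores, since a strict string comparison penalizes valid paraphrases.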
Evaluating Open-Book QA

In open-book settings, evaluation must account for both retrieval quality and generation accuracy. Metrics include retrieval precision (how many retrieved documents are relevant) and answer correctness relative to the provided context. For example, if a model retrieves a document about climate change but generates an answer contradicting it, the error lies in synthesis, not retrieval. Evaluators also test robustness to noisy or incomplete data: can the model avoid hallucination when retrieved content is irrelevant? Tools like Recall@k (measuring whether correct documents appear in the top-k results) and context-aware accuracy scores (checking whether answers align with the context) are critical. Additionally, efficiency metrics, such as retrieval latency, matter in practical applications. Unlike closed-book evaluation, open-book testing requires datasets with explicit grounding in external sources (e.g., HotpotQA), where answers depend on specific documents. This setup helps isolate whether errors stem from poor retrieval, flawed synthesis, or insufficient knowledge in the source material.
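Recall@k itself is straightforward to compute: for each question, check whether any gold-relevant document appears among the top-k retrieved results, then average over questions. The document IDs and retrieval rankings below are illustrative placeholders, assuming retrieval results are stored as ranked ID lists.

```python
def recall_at_k(retrieved: dict[str, list[str]],
                relevant: dict[str, set[str]],
                k: int) -> float:
    """Fraction of questions for which at least one relevant document
    appears in the top-k retrieved results."""
    hits = sum(bool(set(retrieved[q][:k]) & relevant[q]) for q in relevant)
    return hits / len(relevant)

retrieved = {"q1": ["d3", "d1", "d7"],   # ranked retrieval results per question
             "q2": ["d5", "d2", "d9"]}
relevant = {"q1": {"d1"}, "q2": {"d9"}}  # gold-relevant document IDs

print(recall_at_k(retrieved, relevant, k=2))  # → 0.5 (q1 hits at rank 2, q2 misses)
print(recall_at_k(retrieved, relevant, k=3))  # → 1.0 (q2's d9 enters the top-3)
```

A low Recall@k points to a retrieval failure; high Recall@k combined with wrong answers points to flawed synthesis, which is the error-isolation logic described above.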