To test whether a RAG system properly handles queries that require multiple pieces of evidence, start by designing test cases where each query explicitly depends on combining information from distinct documents. For example, a question such as "What caused Event X, and what were its long-term environmental impacts?" requires retrieving one document explaining the cause and another detailing the effects. The test should verify that omitting either document produces an incomplete or incorrect answer. Create a controlled knowledge base containing only the documents needed for each test case, and include adversarial examples (e.g., irrelevant or partially related documents) to assess the system’s ability to filter noise.
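As a sketch of how such fixtures might be encoded (the dataclass schema, document IDs, and document texts below are illustrative assumptions, not a required format):

```python
from dataclasses import dataclass, field


@dataclass
class MultiEvidenceCase:
    """A test case whose answer requires evidence from several documents."""
    query: str
    # Maps each required claim to the ID of the document that must support it.
    required_claims: dict[str, str]
    # Adversarial documents used to test noise filtering.
    distractor_doc_ids: list[str] = field(default_factory=list)


# Controlled knowledge base: only the documents this case needs, plus distractors.
KNOWLEDGE_BASE = {
    "doc_cause": "Event X was caused by Factor A, a prolonged regional drought.",
    "doc_impact": "Long-term environmental impacts of Event X included Effect B, "
                  "a sustained decline in soil fertility.",
    "doc_noise": "An unrelated report on regional tourism statistics.",  # distractor
}

CASE = MultiEvidenceCase(
    query="What caused Event X, and what were its long-term environmental impacts?",
    required_claims={
        "Event X was caused by Factor A": "doc_cause",
        "Long-term impacts included Effect B": "doc_impact",
    },
    distractor_doc_ids=["doc_noise"],
)
```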
Next, implement evaluation metrics that check for the presence of all required evidence. Break the expected answer into individual claims (e.g., "Event X was caused by Factor A" and "Long-term impacts included Effect B") and map each claim to the specific document it should derive from. Use automated fact-checking tools or structured validation (e.g., regular expressions or keyword checks) to confirm that every expected claim is present. For example, if the system answers the environmental-impact portion of the question but skips the cause, the test fails. Additionally, run ablation tests: remove one critical document from the retrieval step and check whether the generated answer becomes incomplete or incorrect. If the answer remains correct despite the missing document, the system is likely drawing on its own pre-trained (parametric) or external knowledge rather than the retrieved evidence, which defeats the purpose of the test.
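A minimal version of the claim check and ablation test could look like this; `rag_answer_fn` is a placeholder for whatever query interface your pipeline exposes, and simple keyword matching stands in for a stronger fact-checking step (an NLI model or LLM judge):

```python
import re


def claims_present(answer: str, claims: list[str]) -> dict[str, bool]:
    """Crude keyword check: is each claim's content reflected in the answer?"""
    results = {}
    for claim in claims:
        # Require every substantive word of the claim to appear (case-insensitive).
        keywords = [w for w in re.findall(r"\w+", claim) if len(w) > 3]
        results[claim] = all(k.lower() in answer.lower() for k in keywords)
    return results


def run_ablation(rag_answer_fn, case, knowledge_base) -> list[str]:
    """Drop each required document in turn; flag claims that survive anyway."""
    suspicious = []
    for claim, doc_id in case.required_claims.items():
        ablated_kb = {k: v for k, v in knowledge_base.items() if k != doc_id}
        answer = rag_answer_fn(case.query, ablated_kb)
        if claims_present(answer, [claim])[claim]:
            # The claim should be impossible to support without its source document.
            suspicious.append(
                f"'{claim}' still asserted without {doc_id}: "
                "possible reliance on parametric or external knowledge."
            )
    return suspicious
```

A test runner would then fail the case whenever run_ablation returns any entries.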
Finally, validate the system’s robustness by testing retrieval and generation together. For instance, use a medical query like "What are the symptoms of Condition Y, and what is its first-line treatment?" where symptoms and treatment are described in separate documents. If the system retrieves only the symptoms document, the generated answer should lack treatment details. To automate this, create a checklist of required facts tied to each document and programmatically verify their presence in the output. Source-attribution features, such as LangChain chains configured to return their source documents, or custom scripts can trace generated answers back to the documents they came from. This approach ensures the system isn’t hallucinating or paraphrasing in a way that masks missing evidence. Regularly iterating on these tests with edge cases (e.g., overlapping but incomplete evidence) helps confirm that the system genuinely depends on multiple sources.
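Putting the two stages together, a checklist-style end-to-end check might look like the sketch below, which reuses claims_present from above; `retrieve_fn` and `generate_fn` are assumed interfaces (for example, a retrieval chain that returns both its answer and the IDs of the documents it used would supply both in one call):

```python
def verify_end_to_end(retrieve_fn, generate_fn, case, knowledge_base) -> bool:
    """Check that every required claim is both retrieved and stated in the answer.

    retrieve_fn(query, kb) -> list of retrieved doc IDs
    generate_fn(query, docs) -> generated answer string
    Both are placeholders for your pipeline's actual calls.
    """
    retrieved_ids = retrieve_fn(case.query, knowledge_base)
    answer = generate_fn(case.query, [knowledge_base[i] for i in retrieved_ids])

    passed = True
    for claim, doc_id in case.required_claims.items():
        stated = claims_present(answer, [claim])[claim]
        retrieved = doc_id in retrieved_ids
        if stated and not retrieved:
            print(f"FAIL: '{claim}' appears in the answer but {doc_id} was never "
                  "retrieved (possible hallucination or parametric knowledge).")
            passed = False
        elif retrieved and not stated:
            print(f"FAIL: {doc_id} was retrieved but the answer omits '{claim}'.")
            passed = False
        elif not retrieved and not stated:
            print(f"INFO: {doc_id} not retrieved and '{claim}' absent; expected "
                  "in an ablation run, a retrieval gap otherwise.")
    return passed
```

The same function covers the edge cases mentioned above: overlapping but incomplete evidence shows up as a claim that is stated without its supporting document ever being retrieved.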