To evaluate a RAG (Retrieval-Augmented Generation) system in domains without standard datasets, such as internal company documents, you must build a custom test set that reflects real-world use cases. Start by identifying the specific tasks the RAG system is expected to perform—for example, answering employee questions about internal policies or retrieving technical documentation. Collaborate with domain experts to collect a representative sample of user queries and their corresponding ideal answers. These queries should cover common scenarios, edge cases, and potential ambiguities in the domain. For instance, if the system is designed for HR documentation, include questions about leave policies, benefits, and compliance procedures. Ensure the test set balances breadth (covering diverse topics) and depth (variations of the same question type).
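As a concrete starting point, the sketch below shows one possible structure for such a test set. The field names, queries, and answers are illustrative placeholders rather than a standard schema, but linking rephrasings of the same question (here via a `variant_of` field) makes the breadth/depth balance explicit and easy to audit.

```python
import json

# Hypothetical test-set schema; the fields, queries, and answers below are
# made-up placeholders for illustration, not drawn from any real document set.
test_cases = [
    {
        "id": "hr-001",
        "query": "How many days of parental leave am I entitled to?",
        "category": "leave_policy",        # topic bucket, used to track breadth
        "variant_of": None,                # links rephrasings, used to track depth
        "expected_doc_ids": ["hr-handbook-sec-4.2"],
        "ideal_answer": "Placeholder: summarize the leave entitlement per the handbook.",
    },
    {
        "id": "hr-001-v2",
        "query": "parental leave how long??",   # informal rephrasing of hr-001
        "category": "leave_policy",
        "variant_of": "hr-001",
        "expected_doc_ids": ["hr-handbook-sec-4.2"],
        "ideal_answer": "Placeholder: same entitlement summary as hr-001.",
    },
]

# Persist the test set so SMEs can review and extend it.
with open("rag_test_set.json", "w") as f:
    json.dump(test_cases, f, indent=2)
```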
Next, create ground-truth data for evaluation. For each query, define the expected retrieved documents and the ideal generated answer. This requires manual annotation by subject-matter experts (SMEs) to ensure accuracy. For example, if a user asks, "What’s the process for submitting an expense report?" the ground truth might include links to specific sections of a finance manual and a step-by-step summary. To reduce bias, have multiple SMEs review and validate the test cases. Additionally, simulate realistic noise—such as typos or informal phrasing in user queries—to test the system’s robustness. Crowdsourcing queries within the organization or soliciting feedback from internal employees can also help identify gaps in the test set.
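To illustrate the noise-injection step, here is a minimal sketch of a character-level typo generator. The noise rate and perturbation types are assumptions; real user errors (abbreviations, autocorrect artifacts, domain jargon) will be messier, so treat this as a lower bound on robustness testing.

```python
import random

def add_typo_noise(query: str, noise_rate: float = 0.05, seed: int | None = None) -> str:
    """Perturb a query with simple character-level typos (adjacent swaps and drops)
    to probe retrieval robustness. A rough simulation, not a model of real user errors."""
    rng = random.Random(seed)
    chars = list(query)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and rng.random() < noise_rate:
            if rng.random() < 0.5:
                # Swap two adjacent characters.
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
            else:
                # Drop the current character and re-check this position.
                chars.pop(i)
                continue
        i += 1
    return "".join(chars)

# Example: generate a noisy variant of a clean test query.
print(add_typo_noise("What's the process for submitting an expense report?", seed=7))
```

Keeping the clean and noisy variants as separate test cases lets you report robustness as the score gap between the two, rather than mixing them into one aggregate number.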
Finally, define evaluation metrics tailored to the domain. Use standard retrieval metrics (e.g., recall@k, precision@k) to assess document relevance and generation metrics (e.g., BLEU, ROUGE) for answer quality. However, supplement these with human evaluation, as automated metrics often fail to capture context-specific correctness. For example, a generated answer about compliance must be factually precise, not just semantically similar to the ground truth. Establish a scoring rubric for human evaluators (e.g., accuracy, clarity, completeness) and iterate on the test set based on evaluation results. Continuously update the test set as the domain evolves—such as when internal documents are revised—to maintain evaluation relevance.
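For the retrieval side, recall@k and precision@k are simple enough to compute directly from the ground-truth annotations. The sketch below assumes each test case carries the expected document IDs defined earlier; the IDs shown are hypothetical placeholders.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant (divides by k,
    so returning fewer than k documents is penalized)."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(relevant_ids)

# Example with hypothetical document IDs from the ground-truth annotations.
retrieved = ["finance-manual-3.1", "hr-handbook-2.4", "finance-manual-3.2"]
relevant = {"finance-manual-3.1", "finance-manual-3.2"}
print(precision_at_k(retrieved, relevant, k=3))  # 2 of 3 retrieved are relevant -> 0.67
print(recall_at_k(retrieved, relevant, k=3))     # 2 of 2 relevant were found -> 1.0
```

Generation metrics like BLEU or ROUGE can be layered on top with existing libraries, but as noted above they should feed into, not replace, the human scoring rubric.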