A RAG-generated answer might achieve high BLEU/ROUGE scores against a reference answer but still perform poorly in practice because these metrics focus on surface-level lexical overlap rather than semantic accuracy, coherence, or relevance. BLEU measures n-gram precision against a reference, while ROUGE emphasizes recall of overlapping words or phrases. However, neither metric evaluates whether the generated answer addresses the user’s intent, maintains logical consistency, or avoids factual errors. For example, a RAG model might produce an answer with correct terminology and phrasing that matches the reference (e.g., “Climate change is caused by greenhouse gases like CO₂”) but insert an unrelated or contradictory claim (e.g., “but most emissions come from volcanic activity”). The overlapping terms (“climate change,” “CO₂”) would boost BLEU/ROUGE scores, but the factual error would render the answer misleading.
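To make that failure mode concrete, here is a minimal sketch (assuming the nltk and rouge-score packages are installed) that scores the climate example above. The scores are illustrative, not benchmarks:

```python
# Minimal sketch: scoring a factually flawed answer against a reference.
# Assumes nltk and rouge-score are installed (pip install nltk rouge-score).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Climate change is caused by greenhouse gases like CO2."
candidate = ("Climate change is caused by greenhouse gases like CO2, "
             "but most emissions come from volcanic activity.")

# BLEU: n-gram precision of the candidate against the tokenized reference.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented overlap (unigrams and longest common subsequence).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.2f}")
print(f"ROUGE-1 recall: {rouge['rouge1'].recall:.2f}")  # ~1.0: every reference word appears
print(f"ROUGE-L recall: {rouge['rougeL'].recall:.2f}")
# The heavy lexical overlap keeps both scores high; neither metric flags the
# contradictory claim about volcanic activity.
```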
Another issue arises when the RAG model generates a response that is technically correct but lacks practical utility. For instance, if a user asks, “How do I debug a memory leak in Python?” a RAG answer might list general steps like “use a profiler” or “check for circular references” without providing actionable details (e.g., specific tools like tracemalloc or code examples). The answer could overlap lexically with a reference (e.g., “profiler,” “memory leaks”), scoring well on metrics, but fail to address the user’s need for concrete guidance. Similarly, overly verbose or redundant answers might repeat keywords from the reference, inflating ROUGE scores while frustrating users who expect concise solutions.
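For contrast, the kind of actionable detail users expect might look like this short sketch using the standard-library tracemalloc module (the workload line is a stand-in for real application code):

```python
# Minimal sketch of concrete memory-leak debugging with tracemalloc.
import tracemalloc

tracemalloc.start()

# ... run the code suspected of leaking; this allocation is only a placeholder ...
leaky = [bytearray(10_000) for _ in range(1_000)]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    # Each entry shows the file, line number, allocation count, and bytes still held.
    print(stat)
```

An answer containing this level of detail might share no more n-grams with the reference than the vague one, yet it is far more useful, which is exactly the gap BLEU/ROUGE cannot see.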
Finally, BLEU/ROUGE ignore contextual nuances and domain-specific requirements. In medical or legal contexts, for example, precise terminology and adherence to guidelines are critical. A RAG model might swap a specific instruction for a looser paraphrase (e.g., “administer medication” instead of “prescribe aspirin”); the surrounding text can still overlap with the reference enough to score reasonably well, yet the substitution alters the clinical meaning or violates protocol. Cultural or temporal relevance matters as well: an answer about “best practices for web security” might cite outdated techniques (e.g., SHA-1 hashing) that align lexically with a reference but are insecure in practice. These gaps highlight why human evaluation and task-specific metrics remain essential even when automated scores are high.
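As an illustration of what a task-specific check could look like, here is a deliberately toy sketch for the web-security example: a rule-based filter that flags deprecated techniques regardless of how well the answer overlaps with a reference. The term list is a hypothetical placeholder, not a real standard:

```python
# Toy task-specific check: flag outdated security techniques in a generated answer.
# The deprecated-term list below is illustrative only.
DEPRECATED_TECHNIQUES = {"sha-1", "md5", "ssl 3.0", "tls 1.0"}

def flag_outdated(answer: str) -> set[str]:
    """Return any deprecated techniques mentioned in the generated answer."""
    lowered = answer.lower()
    return {term for term in DEPRECATED_TECHNIQUES if term in lowered}

answer = "For password storage, hash credentials with SHA-1 before saving them."
print(flag_outdated(answer))  # {'sha-1'} -- a lexical-overlap metric would not catch this
```

Real deployments would pair checks like this with human review or model-based fact verification, but even a simple domain rule catches errors that BLEU/ROUGE reward.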