ROUGE and METEOR, while widely used for text generation evaluation, have significant limitations when applied to RAG (Retrieval-Augmented Generation) systems. Both metrics primarily measure lexical overlap between generated text and reference answers, which fails to account for semantic equivalence when valid responses use different phrasing or structure. For example, if a RAG system answers a question with "The event occurred in 1945" while the reference states "It happened in 1945," ROUGE penalizes the answer for the missing n-gram matches, even though both answers are correct. METEOR’s synonym matching and stemming mitigate this somewhat, but the metric still struggles with reordered information and context-dependent paraphrases not covered by its predefined linguistic resources (e.g., WordNet). This rigidity makes both metrics unreliable when multiple correct answers exist, as they prioritize surface-level similarity over factual consistency or logical coherence.
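As a concrete illustration, the minimal sketch below scores the two paraphrased answers from the example above, assuming the rouge-score package as the ROUGE implementation (any standard implementation behaves similarly).

```python
# Minimal sketch using the rouge-score package (pip install rouge-score),
# assumed here as the ROUGE implementation.
from rouge_score import rouge_scorer

reference = "It happened in 1945"
candidate = "The event occurred in 1945"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # signature is score(target, prediction)

for name, result in scores.items():
    print(f"{name}: F1 = {result.fmeasure:.2f}")
# Only "in" and "1945" overlap, so every F1 stays well below 1.0
# even though the two sentences state the same fact.
```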
Another key issue is their dependence on high-quality reference texts. RAG systems often synthesize information from multiple retrieved documents, creating answers that combine details not present in any single reference. For instance, if a question about climate change requires integrating data from two sources (e.g., temperature trends from study A and CO2 levels from study B), a correct RAG output might merge these points. However, ROUGE/METEOR would compare the generated text against each reference individually, potentially scoring it poorly if no single reference contains the combined information. This limitation is exacerbated in real-world scenarios where ideal references are scarce or incomplete, leading to underestimation of the system’s ability to aggregate knowledge.
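To make the aggregation problem concrete, the sketch below uses two hypothetical reference sentences (a "study A" temperature claim and a "study B" CO2 claim, both invented for illustration) and scores a synthesized answer against each one individually.

```python
# Hypothetical references (invented for illustration) that each cover only
# part of the information a correct synthesized answer needs.
from rouge_score import rouge_scorer

reference_a = "Global mean temperature has risen by about 1.1 degrees Celsius since pre-industrial times."
reference_b = "Atmospheric CO2 concentrations now exceed 420 parts per million."
generated = (
    "Global mean temperature has risen by about 1.1 degrees Celsius, "
    "and atmospheric CO2 concentrations now exceed 420 parts per million."
)

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for label, ref in [("study A", reference_a), ("study B", reference_b)]:
    f1 = scorer.score(ref, generated)["rougeL"].fmeasure
    print(f"vs {label}: ROUGE-L F1 = {f1:.2f}")
# Each single-reference score is depressed because roughly half of the
# generated text has no counterpart in that reference, even though the
# combined answer is fully supported by the two sources together.
```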
Finally, these metrics ignore the retrieval component of RAG systems entirely. They evaluate only the final output, not whether the system retrieved relevant source material. For example, a RAG model might produce a correct answer that happens to match the reference (drawing on knowledge stored in its parameters) even though it retrieved irrelevant documents. Conversely, it might produce a valid answer grounded in correctly retrieved information that differs lexically from the reference, and receive a low score. This decoupling of retrieval quality from text generation evaluation creates a blind spot, as a RAG system’s true performance hinges on both stages working together effectively. To address these gaps, it is often necessary to complement ROUGE/METEOR with semantic similarity metrics (e.g., BERTScore) or task-specific human evaluations.
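As one possible complement, the sketch below scores the earlier paraphrase pair with BERTScore via the bert-score package (an assumption about tooling; it downloads a pretrained model on first use), which compares contextual embeddings rather than surface n-grams.

```python
# Minimal sketch using the bert-score package (pip install bert-score);
# it downloads a pretrained model on first use.
from bert_score import score

candidates = ["The event occurred in 1945"]
references = ["It happened in 1945"]

# BERTScore compares contextual token embeddings instead of surface n-grams,
# so the paraphrased answer scores much closer to the reference than under ROUGE.
precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERTScore F1 = {f1.item():.2f}")
```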