How is BLEU calculated? BLEU (Bilingual Evaluation Understudy) measures the similarity between a generated text and one or more reference texts using n-gram overlap. The calculation involves two main steps. First, it computes modified n-gram precision for n-grams of size 1 to 4. For each n-gram size, the number of generated n-grams that also appear in a reference is divided by the total number of n-grams in the generated text, with each n-gram's count clipped at its maximum count in any single reference to avoid overcounting. For example, if the generated text contains three instances of "the cat" but no reference contains more than two, only two are counted. Next, a brevity penalty is applied to penalize overly short outputs: it is 1 when the generated text is at least as long as the reference, and decays exponentially as the output gets shorter. The final BLEU score is the geometric mean of the n-gram precisions multiplied by the brevity penalty, typically scaled to a 0–100 range.
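The clipped-precision and brevity-penalty steps are easier to see in code. Below is a minimal, self-contained Python sketch; the function name `bleu`, the equal weighting of the four n-gram orders, and the 0–100 scaling are illustrative assumptions, and production code should rely on an established implementation such as NLTK or SacreBLEU.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    # Illustrative sketch of BLEU: clipped n-gram precision + brevity penalty.
    cand = candidate.split()
    refs = [r.split() for r in references]

    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        # Clip each candidate n-gram count at its maximum count in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)  # avoid division by zero for very short candidates
        precisions.append(clipped / total)

    if min(precisions) == 0:
        return 0.0  # geometric mean is zero if any precision is zero (real tools often smooth this)

    # Brevity penalty: 1 if the candidate is at least as long as the closest reference,
    # exp(1 - ref_len / cand_len) otherwise.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))

    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return bp * geo_mean * 100  # scaled to the 0-100 range

print(bleu("the cat sat on the mat", ["the cat sat on the mat"]))  # 100.0
```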
Does BLEU favor lexical similarity or factual correctness? A higher BLEU score indicates greater lexical overlap with the reference, not factual correctness. For instance, if a generated answer rearranges words from the reference but changes a critical fact (e.g., "The cat sat on the mat" vs. "The dog sat on the mat"), BLEU might still reward the n-gram "sat on the mat" despite the factual error. Conversely, a factually correct answer using synonyms (e.g., "The feline rested on the rug") could score lower due to lexical mismatches. BLEU’s reliance on exact word matches makes it insensitive to semantic equivalence or factual accuracy. It also struggles with paraphrasing, structural variations, or domain-specific terminology not present in the reference.
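This bias can be demonstrated directly with NLTK's `sentence_bleu` (assuming NLTK is installed; the example sentences mirror the ones above, and the smoothing function only keeps the score nonzero when a higher-order n-gram has no match): the factually wrong but lexically close sentence scores higher than the correct paraphrase.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
wrong_fact = "the dog sat on the mat".split()        # changes a fact, keeps most n-grams
paraphrase = "the feline rested on the rug".split()  # same meaning, different words

smooth = SmoothingFunction().method1  # avoids zero scores when an n-gram order has no match
print(sentence_bleu([reference], wrong_fact, smoothing_function=smooth))   # relatively high
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))   # much lower
```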
What are BLEU's limitations and practical considerations? BLEU's design assumes that high n-gram overlap correlates with quality, which is not always true. For example, in machine translation, a grammatically correct but meaning-altering substitution (e.g., translating "bank" as a financial institution instead of a riverbank) might still yield a high BLEU score if the surrounding words match. Additionally, BLEU depends heavily on the quality and diversity of the reference texts: if the references are limited or biased, scores may not reflect true performance. Factual accuracy requires deeper semantic understanding, which BLEU does not measure. Alternatives such as ROUGE (a recall-oriented overlap metric common in summarization) capture different aspects of surface similarity, while task-specific metrics (e.g., exact-match or F1 for question answering) come closer to evaluating factual correctness; BLEU remains a tool for assessing surface-level similarity.
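The dependence on reference coverage is also easy to show. In the sketch below (again assuming NLTK; the sentences are invented for illustration), the same candidate scores near zero against a single reference but reaches the maximum once a paraphrased reference is included.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate = "the feline rested on the rug".split()
single_ref = ["the cat sat on the mat".split()]
diverse_refs = single_ref + ["the feline rested on the rug".split(),
                             "a cat was resting on the rug".split()]

smooth = SmoothingFunction().method1
print(sentence_bleu(single_ref, candidate, smoothing_function=smooth))    # near zero: little lexical overlap
print(sentence_bleu(diverse_refs, candidate, smoothing_function=smooth))  # 1.0: a paraphrased reference now matches
```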