To evaluate factual correctness when a reference answer is available, two common metrics are exact match and F1 score. These are widely used in question-answering (QA) benchmarks like SQuAD. Exact match checks if the predicted answer matches the reference exactly, while F1 measures token-level overlap between the prediction and reference. Both methods are simple to compute but have trade-offs in handling variations in phrasing, synonyms, or partial correctness.
Exact match works by comparing the model’s output to the reference answer character-for-character. For example, if the reference is "Paris" and the model answers "paris" (lowercase), exact match fails unless case is ignored. This strictness works well for short, unambiguous answers (e.g., dates, names) but is inflexible: "Barack Obama" versus "Obama" would be marked incorrect even though the latter is contextually valid. To mitigate this, most implementations normalize answers (lowercasing, removing punctuation and articles) before comparison, but normalization still doesn’t capture semantic equivalence (e.g., "U.S. president in 2020" vs. "Donald Trump").
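The normalization-then-compare procedure described above can be sketched as follows. This is a minimal illustration, not the official SQuAD script; the function names `normalize` and `exact_match` are chosen here for clarity, and the normalization steps (lowercasing, stripping punctuation and articles, collapsing whitespace) are the ones listed in the text.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # remove English articles
    return " ".join(text.split())

def exact_match(prediction, reference):
    """True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)
```

With this normalization, `exact_match("paris", "Paris")` succeeds, while `exact_match("Obama", "Barack Obama")` still fails, since any token difference breaks the all-or-nothing comparison.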
F1 score evaluates overlap at the token level. It splits both the prediction and reference into words, calculates precision (how many predicted tokens appear in the reference) and recall (how many reference tokens appear in the prediction), then computes their harmonic mean. For example, if the reference is "a blue sedan" and the prediction is "blue car," the shared token "blue" gives precision=1/2, recall=1/3, and F1=0.4. This works better for longer answers where partial correctness matters, but it still ignores semantics. For instance, "not a blue sedan" would have high token overlap with the reference but is factually wrong. Because F1 treats answers as bags of tokens, it also ignores word order entirely (a reordering that changes the meaning scores the same) and counts synonyms as mismatches (e.g., "automobile" vs. "car").
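The token-level F1 computation above can be written in a few lines. This is a minimal sketch (the name `token_f1` is hypothetical); it uses a multiset intersection so repeated tokens are counted correctly, and it reproduces the worked example from the text.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Multiset intersection: shared tokens, counting duplicates
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

token_f1("blue car", "a blue sedan")  # precision 1/2, recall 1/3 -> 0.4
```

In practice this is usually run on normalized text (as with exact match), so that case and articles don’t depress the score.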
While exact match and F1 are efficient and objective, they have clear limitations. They don’t account for paraphrasing, contextual relevance, or factual errors in answers with high token overlap (e.g., "World War II ended in 1946" shares most tokens with the correct "1945" reference but is wrong). For more nuanced evaluation, hybrid approaches are often used, such as combining F1 with human checks or with semantic similarity metrics (e.g., BERTScore). In practice, the choice depends on the task: exact match suits short, structured answers, while F1 is better for verbose or partially correct responses. Neither, however, fully replaces human judgment for complex factual accuracy.
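The failure mode described above is easy to demonstrate concretely: factually wrong answers can score high on token overlap. The sketch below (same hypothetical `token_f1` as before, repeated so it runs standalone) shows that both the wrong date and the negated answer score above 0.8.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1: harmonic mean of precision and recall."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)

# Wrong year, yet 5 of 6 tokens match on both sides -> F1 = 5/6
token_f1("World War II ended in 1946", "World War II ended in 1945")

# Negation flips the meaning but barely dents the score
token_f1("not a blue sedan", "a blue sedan")
```

Scores like these are why token overlap is best paired with semantic metrics or human review when factual correctness is what actually matters.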