BERTScore is an evaluation metric that uses contextual embeddings from models like BERT to measure the similarity between two texts. Unlike traditional metrics such as BLEU or ROUGE, which rely on exact word or phrase matches, BERTScore compares texts by analyzing the semantic meaning captured in their embeddings. Here’s how it works: the generated and reference texts are tokenized, and each token is converted into a high-dimensional vector using a pre-trained BERT model. Each token in one text is then greedily matched to its most similar token in the other text using cosine similarity, and these token-level similarities are aggregated into precision (how well each generated token is supported by the reference), recall (how well each reference token is covered by the generated text), and their harmonic mean, F1. This approach allows BERTScore to capture paraphrases, synonyms, and contextual nuances that rigid n-gram-based methods miss.
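The sketch below illustrates this token-level matching directly. It assumes the Hugging Face transformers and torch packages, uses bert-base-uncased with the final hidden layer for simplicity, and omits refinements of the official metric such as tuned layer selection, IDF weighting, and special-token handling.

```python
# Minimal sketch of the BERTScore computation (illustrative, not the official implementation).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_embeddings(text: str) -> torch.Tensor:
    """Return L2-normalized contextual embeddings for each token in `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def bertscore(candidate: str, reference: str) -> dict:
    """Greedily match tokens by cosine similarity and aggregate into P/R/F1."""
    cand, ref = token_embeddings(candidate), token_embeddings(reference)
    sim = cand @ ref.T  # pairwise cosine similarities between all token pairs
    precision = sim.max(dim=1).values.mean().item()  # each candidate token vs. its best reference match
    recall = sim.max(dim=0).values.mean().item()     # each reference token vs. its best candidate match
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(bertscore("The canine barked loudly.", "The dog barked loudly."))
```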
Embedding-based metrics like BERTScore are particularly useful for evaluating tasks where semantic accuracy matters more than exact wording. For example, in question answering, a generated answer might rephrase the reference answer (e.g., “canine” instead of “dog”) while retaining correctness. Traditional metrics would penalize this, but BERTScore recognizes the semantic equivalence. Similarly, in summarization, a generated summary might use different sentence structures but still convey the same key points as the source text. Embedding-based metrics can compare the summary directly to the source, checking semantic alignment even when no human-written reference is available. Studies have shown that such metrics correlate better with human judgments than older methods, making them valuable for automated evaluation pipelines.
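To make the contrast concrete, the short example below scores a paraphrased answer with the off-the-shelf bert-score package (an assumed dependency, installable via pip) and compares it with a plain exact-match check; the sentences and the resulting numbers are purely illustrative.

```python
# Comparing a rigid lexical check with BERTScore on a paraphrased answer.
from bert_score import score

references = ["A dog was seen near the park entrance."]
candidates = ["A canine was spotted by the entrance to the park."]

# Exact string match (a stand-in for rigid lexical metrics) gives no credit here.
exact_match = float(candidates[0] == references[0])

# BERTScore rewards the semantic overlap despite the different wording.
P, R, F1 = score(candidates, references, lang="en")

print(f"exact match: {exact_match:.2f}")
print(f"BERTScore F1: {F1.item():.3f}")
```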
However, embedding-based metrics have limitations. Generating embeddings requires a forward pass through a large neural model, making them slower and more resource-intensive than n-gram approaches. The choice of model (e.g., BERT vs. RoBERTa) and of the layers used for embeddings can also affect results, requiring careful configuration. Additionally, while they excel at measuring semantic similarity, they don’t directly measure factual correctness or coherence. Despite these limitations, their ability to handle paraphrasing and contextual variation makes them a significant improvement over traditional metrics for tasks like machine translation, summarization, and QA. Alternatives like MoverScore (which considers token alignment costs) or Sentence-BERT (for sentence-level embeddings) offer similar benefits, giving developers flexibility in choosing the right tool for their evaluation needs.
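For the sentence-level route, a minimal sketch with the sentence-transformers package looks like the following; the checkpoint name "all-MiniLM-L6-v2" and the example sentences are assumptions chosen for illustration, not prescribed defaults.

```python
# Sentence-level similarity with Sentence-BERT-style embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

summary = "Profits rose sharply in the third quarter."
source = "The company reported a steep increase in third-quarter earnings."

# Encode whole sentences into single vectors and compare them with cosine similarity.
embeddings = model.encode([summary, source], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"sentence-level cosine similarity: {similarity:.3f}")
```

Because each sentence is reduced to a single vector, this variant trades the token-level matching of BERTScore for much faster pairwise comparisons, which can matter when scoring large evaluation sets.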