Common Metrics for Evaluating TTS Quality
Text-to-speech (TTS) systems are evaluated using a mix of subjective and objective metrics to assess aspects like naturalness, intelligibility, and fidelity. These metrics help developers identify strengths and weaknesses in synthesized speech.
Subjective Metrics: Mean Opinion Score (MOS)
The most widely used subjective metric is the Mean Opinion Score (MOS): human listeners rate speech samples on a fixed scale (e.g., 1–5), and the ratings are averaged. Ratings typically focus on naturalness, clarity, and overall quality. For example, a score of 5 might indicate speech indistinguishable from a human, while 1 reflects poor intelligibility. MOS tests require careful design, including diverse listener groups and randomized sample order to reduce bias. While time-consuming, MOS provides direct insight into human perception, which is why it remains the gold standard despite its reliance on manual effort.
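To make the averaging step concrete, here is a minimal sketch of how a MOS and an approximate 95% confidence interval could be computed from raw listener ratings. The function name mean_opinion_score, the sample ratings, and the normal-approximation interval (z = 1.96) are illustrative assumptions, not part of any standard toolkit; real studies often use more careful statistics.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes ratings are independent scores on the same scale (e.g., 1-5)
    and uses a normal approximation, which is only rough for small panels.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    # Standard error of the mean, scaled by z for the interval half-width.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for one synthesized sample.
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean shows how much of a difference between two systems could be explained by listener disagreement alone.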
Objective Metrics: Signal Fidelity and Intelligibility
Objective metrics automate evaluation by comparing synthesized speech to ground-truth recordings or by analyzing the output algorithmically. Mel Cepstral Distortion (MCD) measures the spectral distance between synthesized and reference audio, computed over time-aligned mel-cepstral frames; lower values indicate better fidelity. Word Error Rate (WER) evaluates intelligibility by transcribing TTS output with an automatic speech recognition (ASR) system and comparing the transcript to the input text. Lower WER suggests clearer articulation, although the result also reflects errors made by the ASR system itself. Prosody metrics, such as pitch (F0) and duration variability, assess naturalness by quantifying stress and intonation patterns. For example, unnatural pauses or monotone pitch would result in higher prosody error scores.
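The two best-defined metrics here, WER and MCD, can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions: word_error_rate uses simple lowercasing and whitespace tokenization (real pipelines normalize text more carefully), and mel_cepstral_distortion assumes the mel-cepstral frames have already been extracted and aligned one-to-one, for example with dynamic time warping. Both function names and the sample inputs are hypothetical.

```python
import math


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over frames that are already aligned one-to-one."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equal-length frame sequences")
    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Coefficient 0 (overall energy) is conventionally excluded.
        sq_diff = sum((a - b) ** 2 for a, b in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(sq_diff)
    return total / len(ref_frames)


# Input text vs. a hypothetical ASR transcript of the synthesized audio.
print(word_error_rate("the quick brown fox", "the quack brown"))

# Two toy frames of three mel-cepstral coefficients each.
print(mel_cepstral_distortion([[0.0, 1.0, 0.0], [0.0, 0.5, 0.5]],
                              [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5]]))
```

In the WER example, one substitution and one deletion against a four-word reference give an error rate of 0.5; identical frame sequences give an MCD of 0.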
Speaker Similarity and Specialized Tests
Speaker similarity metrics gauge how well a TTS system mimics a target speaker’s voice. A common technique is speaker embedding comparison: a neural network maps the synthesized and target audio to vectors in a latent space, and a similarity measure such as cosine similarity scores how close they are. Additionally, diagnostic tests evaluate specific capabilities, such as handling rare words or emotional tone. For instance, a system might be tested on pronouncing technical terms correctly or on conveying anger or happiness in speech. These targeted assessments complement broader metrics to ensure robustness across diverse use cases.
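As a rough illustration of embedding comparison, the sketch below scores two speaker embeddings with cosine similarity. It assumes the embeddings have already been produced by some speaker-encoder model, which is not shown; the vectors and the name cosine_similarity are hypothetical placeholders, and real embeddings usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Placeholder embeddings standing in for speaker-encoder outputs.
target_embedding = [0.12, -0.40, 0.88, 0.05]
synthesized_embedding = [0.10, -0.35, 0.90, 0.00]

score = cosine_similarity(target_embedding, synthesized_embedding)
print(f"speaker similarity = {score:.3f}")
```

Scores closer to 1 suggest the synthesized voice sits near the target speaker in the embedding space; what counts as "close enough" depends on the encoder and is usually calibrated against real recordings of the same and of different speakers.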
By combining subjective and objective approaches, developers can optimize TTS systems for both technical accuracy and the qualities that human listeners actually perceive.