The standard evaluation metrics for text-to-speech (TTS) quality fall into two categories: subjective human assessments and objective algorithmic measurements. These metrics help developers gauge how natural, intelligible, and accurate synthesized speech sounds compared to human speech.
1. Mean Opinion Score (MOS)
MOS is the most widely used subjective metric. Human listeners rate synthesized speech on a scale (e.g., 1–5) for naturalness, clarity, and overall quality. For example, a score of 4.0 might indicate near-human quality, while 2.5 suggests noticeable artifacts. MOS is reliable but resource-intensive, requiring controlled listening tests with many participants. Developers often use it as a benchmark during system comparisons, though its reliance on human labor makes it impractical for frequent iteration. For instance, a TTS model targeting conversational agents might aim for a MOS above 3.5 to ensure user acceptance.
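To illustrate how MOS is typically reported, the sketch below aggregates a set of hypothetical 1–5 listener ratings into a mean score with a normal-approximation 95% confidence interval; the ratings and sample size are made up for the example.

```python
import numpy as np

# Hypothetical listener ratings (1-5 scale) collected for one TTS system.
ratings = np.array([4, 5, 4, 3, 4, 4, 5, 3, 4, 4], dtype=float)

mos = ratings.mean()                                # Mean Opinion Score
sem = ratings.std(ddof=1) / np.sqrt(len(ratings))   # standard error of the mean
ci95 = 1.96 * sem                                    # normal-approximation 95% interval

print(f"MOS = {mos:.2f} +/- {ci95:.2f}")
```

Real MOS studies involve many more listeners and utterances and often report t-based intervals, but the aggregation itself is just this simple average over ratings.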
2. Objective Metrics: MCD, PESQ, and STOI
Objective metrics automate evaluation by comparing synthesized speech to ground-truth recordings.
- Mel-Cepstral Distortion (MCD) measures spectral differences between synthesized and reference speech using mel-cepstral coefficients (commonly extracted as MFCCs). Lower MCD values (e.g., 6 dB vs. 10 dB) indicate closer alignment with the reference speech.
- Perceptual Evaluation of Speech Quality (PESQ) predicts MOS-like scores by analyzing the synthesized signal against the reference for distortions. Originally developed for telecom, it has been adapted for TTS evaluation.
- Short-Time Objective Intelligibility (STOI) estimates how understandable the speech is, producing a score between 0 and 1 (often read as a percentage of intelligibility).
These metrics are efficient for iterative development but may not fully capture perceptual nuances, such as prosody or emotional expression; the sketch below shows how they are typically computed.
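As a rough sketch of this workflow, the snippet below uses the third-party `pesq` and `pystoi` packages for PESQ and STOI, and a small hand-written function for MCD. It assumes mono 16 kHz reference and synthesized waveforms on disk (the file names are hypothetical) and that the mel-cepstral frames passed to the MCD function are already time-aligned, e.g., via DTW.

```python
import numpy as np
import soundfile as sf
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Frame-averaged MCD in dB between two (frames x coeffs) mel-cepstral
    matrices. Assumes the frames are already time-aligned (e.g., via DTW)
    and that the 0th (energy) coefficient has been dropped."""
    diff = ref_mcep - syn_mcep
    frame_dist = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(frame_dist)

# Hypothetical paths to a ground-truth recording and the synthesized utterance.
ref, sr = sf.read("reference.wav")      # expected: mono, 16 kHz
syn, _ = sf.read("synthesized.wav")
n = min(len(ref), len(syn))             # crude length matching for the demo
ref, syn = ref[:n], syn[:n]

print("PESQ (wideband):", pesq(sr, ref, syn, "wb"))   # roughly -0.5 to 4.5
print("STOI:", stoi(ref, syn, sr, extended=False))    # 0.0 to 1.0

# MCD needs mel-cepstral features, e.g. librosa MFCCs with c0 dropped:
# import librosa
# ref_mc = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=14).T[:, 1:]
# syn_mc = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=14).T[:, 1:]
# print("MCD (dB):", mel_cepstral_distortion(ref_mc, syn_mc))  # after DTW alignment
```

Note that PESQ only accepts 8 kHz (narrowband) or 16 kHz (wideband) input, and MCD is sensitive to frame alignment, so pipelines usually resample and DTW-align before scoring.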
3. Speaker Similarity and Task-Specific Metrics
For voice cloning or personalized TTS, speaker similarity is critical. It is measured using cosine similarity between speaker embeddings (e.g., extracted with pre-trained models such as ECAPA-TDNN); a score of 0.8 might indicate strong resemblance to the target speaker. Task-specific metrics like word error rate (WER) assess transcription accuracy by running the synthesized audio through an automatic speech recognition (ASR) system and comparing the transcript to the input text. For example, a WER below 5% is often required for accessibility tools like screen readers. Developers combine these metrics based on use-case priorities, balancing naturalness, intelligibility, and speaker identity.
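The sketch below illustrates both computations under simple assumptions: cosine similarity over speaker embeddings (the 192-dimensional vectors are random placeholders standing in for embeddings from a pre-trained model such as ECAPA-TDNN) and WER via the `jiwer` package on a hypothetical ASR transcript.

```python
import numpy as np
import jiwer  # pip install jiwer

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings; real ones would come from a speaker-verification model.
target_emb = np.random.randn(192)   # embedding of the target speaker's recording
cloned_emb = np.random.randn(192)   # embedding of the synthesized voice
print("Speaker similarity:", cosine_similarity(target_emb, cloned_emb))

# WER: transcribe the TTS output with an ASR system, then compare to the input text.
input_text     = "the quick brown fox jumps over the lazy dog"
asr_transcript = "the quick brown fox jumps over a lazy dog"   # hypothetical ASR output
print("WER:", jiwer.wer(input_text, asr_transcript))           # 1 error in 9 words, about 0.11
```

In practice the embedding extraction and ASR steps dominate the cost; the similarity and WER computations themselves are as small as shown here.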