To evaluate customized text-to-speech (TTS) output, developers can use a mix of objective and subjective metrics tailored to the specific customization goals. These metrics assess aspects like audio quality, speaker similarity, naturalness, and alignment with the intended use case. Here are three key categories of metrics:
1. Audio Quality and Intelligibility: Objective metrics like Mel-Cepstral Distortion (MCD) measure the spectral distance, in dB, between synthesized speech and a time-aligned reference recording of the same text, indicating how well the TTS captures acoustic features. Lower MCD values suggest higher fidelity. Word Error Rate (WER) evaluates intelligibility: the synthesized audio is transcribed by an automatic speech recognition (ASR) system, and the transcript is compared with the input text. For example, a WER of 5% means about one error per twenty reference words, versus one per five at 20%; some of those errors may come from the ASR system rather than the TTS. Subjective metrics like Mean Opinion Score (MOS) involve human raters scoring naturalness on a scale (e.g., 1-5). A custom voice for audiobooks might aim for a MOS ≥4.0 to ensure smooth listening.
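The two objective metrics in item 1 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: it assumes the ASR transcript is already available as a string, and that mel-cepstral coefficients have already been extracted and time-aligned (for example with dynamic time warping) by some other tool. The function names are hypothetical.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1] / len(ref)


def mel_cepstral_distortion(ref_frames, syn_frames) -> float:
    """Mean MCD in dB over frames that are ASSUMED to be time-aligned already.

    Each frame is a sequence of mel-cepstral coefficients; the 0th (energy)
    coefficient is skipped, which is a common convention.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame sequences")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


# Input text vs. what a (hypothetical) ASR system heard in the synthesized audio.
wer = word_error_rate("turn left at the next light", "turn left at next light")
print(round(wer, 3))  # one deleted word out of six -> 0.167
```

In practice the transcript would come from whichever ASR system the team trusts for the target language, and text normalization (punctuation, numerals, casing) should match on both sides before scoring.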
2. Speaker Similarity and Customization Accuracy: When cloning a specific speaker’s voice, speaker embedding similarity (e.g., cosine similarity between d-vectors or x-vectors) quantifies how closely the TTS output matches the target voice. Tools like Resemblyzer extract such embeddings so they can be compared to compute similarity scores. Prosody metrics like pitch (F0) and duration variability assess whether the TTS preserves the speaker’s unique rhythm and intonation. For instance, a custom voice for a virtual assistant should match the original speaker’s average pitch (e.g., 200 Hz) within a small margin of error (e.g., ±10 Hz).
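As a rough sketch of the comparisons in item 2, the snippet below computes cosine similarity between two speaker embeddings and checks whether mean pitch stays within a margin. It assumes the embeddings and F0 tracks were produced elsewhere (by any speaker encoder and pitch tracker), and that unvoiced frames are marked with 0 in the F0 track. The function names and the ±10 Hz default are illustrative.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equally sized embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_f0(f0_track) -> float:
    """Average pitch in Hz over voiced frames (assumes unvoiced frames are 0)."""
    voiced = [f for f in f0_track if f > 0]
    if not voiced:
        raise ValueError("no voiced frames in F0 track")
    return sum(voiced) / len(voiced)


def pitch_within_margin(target_f0, synth_f0, margin_hz: float = 10.0) -> bool:
    """True if the mean pitch of the synthesized voice is within the margin."""
    return abs(mean_f0(target_f0) - mean_f0(synth_f0)) <= margin_hz


# Toy 3-dimensional embeddings; real d-vectors or x-vectors are much larger.
target_embedding = [0.2, 0.9, 0.4]
synth_embedding = [0.25, 0.85, 0.45]
print(cosine_similarity(target_embedding, synth_embedding) > 0.9)
print(pitch_within_margin([200.0, 0.0, 204.0], [195.0, 0.0, 199.0]))
```

Mean pitch alone is a coarse check; comparing the spread of F0 and phone durations gives a fuller picture of whether rhythm and intonation were preserved.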
3. Task-Specific Performance: For domain-specific applications, metrics like user engagement (e.g., time spent listening) or task success rate (e.g., correct responses to voice commands) matter. A custom TTS for navigation systems might track how often users correctly interpret directions. Latency (e.g., 200 ms to generate an utterance) and resource efficiency (e.g., GPU memory usage) are critical for real-time applications. Developers might also test multilingual support by measuring accuracy in language-specific phoneme pronunciation.
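Generation latency, mentioned in item 3, can be measured with a small timing harness. The sketch below assumes the TTS system is exposed as a Python callable that takes text and returns once synthesis is complete; `synthesize` and `fake_synthesize` are hypothetical stand-ins, and the percentile uses the simple nearest-rank method.

```python
import math
import statistics
import time


def measure_latency_ms(synthesize, texts, warmup: int = 1) -> dict:
    """Time each call to `synthesize` and summarize the results in milliseconds."""
    if not texts:
        raise ValueError("need at least one test utterance")
    # Warm-up calls are not timed, so one-off costs such as model loading
    # do not distort the summary.
    for text in texts[:warmup]:
        synthesize(text)
    samples = []
    for text in texts:
        start = time.perf_counter()
        synthesize(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a real TTS call."""
    time.sleep(0.001)
    return b""


report = measure_latency_ms(fake_synthesize, ["Turn left.", "Continue for two miles."])
print(report["median_ms"] < 300.0)
```

Reporting a tail percentile alongside the median matters for real-time use, since occasional slow responses are what users notice.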
Combining these metrics helps ensure the TTS output meets both technical and user-centric requirements. For example, a custom celebrity voice for a video game might prioritize speaker similarity (embedding cosine similarity ≥0.90) and low latency (<300 ms) while maintaining a MOS ≥4.2 for immersion.
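One way to combine the metrics is a simple acceptance gate that lists which targets a candidate voice misses. The sketch below uses the video game example's numbers as defaults; the class and function names are hypothetical, and the thresholds are illustrative rather than recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Illustrative targets taken from the video game example."""
    min_similarity: float = 0.90   # embedding cosine similarity
    max_latency_ms: float = 300.0  # generation latency must stay below this
    min_mos: float = 4.2           # mean opinion score on a 1-5 scale


def failed_checks(similarity: float, latency_ms: float, mos: float,
                  thresholds: Thresholds = Thresholds()) -> list:
    """Return the names of the targets that were missed (empty list = pass)."""
    failures = []
    if similarity < thresholds.min_similarity:
        failures.append("speaker_similarity")
    if latency_ms >= thresholds.max_latency_ms:
        failures.append("latency")
    if mos < thresholds.min_mos:
        failures.append("mos")
    return failures


print(failed_checks(similarity=0.93, latency_ms=250.0, mos=4.3))  # []
print(failed_checks(similarity=0.85, latency_ms=320.0, mos=4.3))  # ['speaker_similarity', 'latency']
```

Returning the list of failed checks, rather than a single pass/fail flag, makes it easier to see which aspect of the customization needs more work.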