What methods are used to measure intelligibility in TTS outputs?
Intelligibility in text-to-speech (TTS) systems is measured using subjective and objective methods. Subjective evaluations rely on human listeners to assess clarity, while objective methods use automated metrics. Hybrid approaches combine both to balance accuracy and scalability.
Subjective Evaluations
The most common method is the Mean Opinion Score (MOS), where listeners rate speech samples on a scale (e.g., 1-5) for clarity. For example, a score of 5 might mean "perfectly intelligible," while 1 indicates "unintelligible." Another approach is the Diagnostic Rhyme Test (DRT), which tests listeners’ ability to distinguish phonetically similar words (e.g., "bat" vs. "pat"). The Modified Rhyme Test (MRT) expands this by evaluating recognition of initial and final consonants in controlled sets of rhyming words. These tests require careful design, a diverse participant pool, and statistical analysis to ensure reliability. While subjective methods are considered the gold standard, they are time-consuming and expensive to scale.
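As a rough illustration of the statistical-analysis step, the sketch below aggregates per-listener ratings for one utterance into a mean opinion score with an approximate 95% confidence interval. The function name and the sample ratings are made up for the example; real studies would also check rater agreement and screen out unreliable listeners.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Aggregate listener ratings (1-5) into a mean opinion score
    with an approximate 95% confidence interval on the mean."""
    mean = statistics.mean(ratings)
    # Standard error of the mean; a wide interval signals too few raters.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, (mean - z * sem, mean + z * sem)

# Example: ratings from 10 listeners for one synthesized utterance.
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 3]
score, (low, high) = mos_with_ci(ratings)
print(f"MOS = {score:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```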
Objective Metrics
Automated metrics like Word Error Rate (WER) use automatic speech recognition (ASR) systems to transcribe TTS output and compare the transcript to the original text. Lower WER indicates higher intelligibility. However, WER depends on the ASR system’s accuracy, which may introduce bias. Short-Time Objective Intelligibility (STOI) compares acoustic features of the speech under test against a clean reference signal to predict how well humans would understand it, particularly in noisy conditions. While objective methods are faster and repeatable, they may not fully align with human perception. For instance, a TTS system might achieve low WER but still sound unnatural to listeners.
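To make the WER calculation concrete, here is a minimal, self-contained sketch that computes WER as word-level edit distance between the text fed to the TTS system and an ASR transcript of its output. In practice a library such as jiwer, or an ASR toolkit's built-in scorer, would typically be used instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with dynamic-programming edit distance over word sequences."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# Reference text given to the TTS system vs. an ASR transcript of its audio.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```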
Hybrid Approaches
Some methods combine subjective and objective techniques. For example, ASR-based pre-screening can filter out clearly unintelligible samples, reducing the workload for human evaluators. Another approach uses crowdsourcing platforms to gather large-scale subjective ratings efficiently. Tools like Amazon Mechanical Turk enable rapid data collection but require quality checks to filter unreliable responses. Hybrid methods aim to balance the depth of human evaluation with the scalability of automation, though they still face challenges in standardizing results across different listening environments.
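The pre-screening idea can be sketched as a simple threshold on ASR-derived WER. The transcribe callable below stands in for whatever ASR backend is available and is purely illustrative; wer_fn could be the word_error_rate function from the previous sketch.

```python
def prescreen(samples, transcribe, wer_fn, threshold=0.3):
    """Split (text, audio_path) samples into those worth sending to human raters
    and those flagged as clearly unintelligible, based on an ASR WER threshold."""
    to_rate, flagged = [], []
    for text, audio_path in samples:
        hypothesis = transcribe(audio_path)  # any ASR backend, supplied by the caller
        wer = wer_fn(text, hypothesis)
        (flagged if wer > threshold else to_rate).append((text, audio_path, wer))
    return to_rate, flagged
```

Samples that pass the threshold would then go on to MOS or DRT panels, while flagged ones can be regenerated or debugged first, keeping the human evaluation budget focused on borderline cases.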
In practice, developers often use a mix of these methods. For example, during early testing, WER or STOI might identify glaring issues, while final validation relies on MOS or DRT to ensure user-centric quality. The choice depends on the project’s stage, resources, and the need for precision versus speed.