Lexicons and pronunciation dictionaries are core components of text-to-speech (TTS) systems because they let the system pronounce words accurately and in a context-aware way. A lexicon is a structured vocabulary that maps words to linguistic information such as part-of-speech tags, word forms, and syntactic features. This information helps the TTS system resolve homographs (words spelled the same but pronounced differently) once it has analyzed the surrounding text. For example, "lead" is pronounced /lɛd/ as a noun (the metal) and /liːd/ as a verb (to guide). The lexicon records which grammatical role goes with which pronunciation, so after the system determines the word's role in the sentence, it can select the matching entry from the pronunciation dictionary.
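The homograph lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular TTS engine stores its data: the `LEXICON` table, the tag names `NOUN` and `VERB`, and the `pronounce` function are all hypothetical, and the part-of-speech tag is assumed to come from an upstream tagger.

```python
from typing import Optional

# Hypothetical lexicon: word -> {part-of-speech tag: IPA pronunciation}.
# Simplified on purpose: real entries often need finer distinctions
# (for example, "lead" as a noun can also be pronounced /liːd/).
LEXICON = {
    "lead": {"NOUN": "lɛd", "VERB": "liːd"},
    "wind": {"NOUN": "wɪnd", "VERB": "waɪnd"},
}


def pronounce(word: str, pos: str) -> Optional[str]:
    """Return the pronunciation for `word` given its part-of-speech tag.

    Returns None when the word is not in the lexicon. If the word is known
    but the tag is not, falls back to the first listed pronunciation.
    """
    entries = LEXICON.get(word.lower())
    if entries is None:
        return None
    if pos in entries:
        return entries[pos]
    # Assumed fallback policy: use the first pronunciation on record.
    return next(iter(entries.values()))


print(pronounce("lead", "NOUN"))  # lɛd
print(pronounce("lead", "VERB"))  # liːd
```

Keying each pronunciation by grammatical role keeps the disambiguation logic out of the data: the table only states which role maps to which sounds, and the analysis of the sentence stays in a separate step.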
A pronunciation dictionary complements the lexicon by providing explicit phonetic representations for words, typically written in the International Phonetic Alphabet (IPA) or a system-specific notation such as ARPAbet. It maps each word directly to its phonetic sequence, which the TTS system converts into speech sounds. For instance, the entry for "tomato" might include the regional variants /təˈmeɪtoʊ/ (American English) and /təˈmɑːtəʊ/ (British English). Storing variants lets the system select a pronunciation that matches the target dialect or a user preference. The dictionary also covers uncommon terms, such as technical jargon or proper nouns (e.g., "São Paulo" as /sɐ̃w ˈpaʊlu/), which a generic model might mispronounce without explicit guidance.
Together, these components support text normalization (e.g., expanding "Dr." to "Doctor" or "$100" to "one hundred dollars") and phonetic accuracy. They let the TTS system handle heteronyms such as "wind" (airflow, /wɪnd/, vs. to twist, /waɪnd/) by combining the lexicon's grammatical information with precise phonetic data from the dictionary. Without them, TTS output would be less natural and less reliable, especially in specialized domains or in languages with complex pronunciation rules. Developers often extend these resources with custom entries to improve performance in specific applications, such as medical or legal TTS tools.
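The two normalization examples mentioned above can be illustrated with a small Python sketch. It is deliberately narrow: the `ABBREVIATIONS` table, `number_to_words`, and `normalize` are hypothetical names, only whole-dollar amounts from 0 to 999 are expanded, and every "Dr." is assumed to mean "Doctor" (a real system would need context to tell it apart from "Drive").

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Hypothetical abbreviation table; custom entries could be added here.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0-999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def _expand_dollars(match: "re.Match[str]") -> str:
    amount = int(match.group(1))
    unit = "dollar" if amount == 1 else "dollars"
    return f"{number_to_words(amount)} {unit}"


def normalize(text: str) -> str:
    """Expand known abbreviations and small whole-dollar amounts."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Match "$" plus 1-3 digits, but skip longer or decimal amounts
    # such as "$1000" or "$100.50", which this sketch leaves unchanged.
    return re.sub(r"\$(\d{1,3})(?![\d.,]*\d)", _expand_dollars, text)


print(normalize("Dr. Smith paid $100."))
# Doctor Smith paid one hundred dollars.
```

Amounts outside the supported range pass through untouched rather than being expanded incorrectly, which is usually the safer default for a normalizer that feeds a speech synthesizer.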