Embedding models handle out-of-vocabulary (OOV) words through techniques that generalize beyond their fixed vocabulary, typically by breaking words into smaller components or leaning on surrounding context. Traditional word embeddings like Word2Vec or GloVe assign a unique vector to each word seen during training but have no representation at all for unseen terms. To address this, modern approaches use subword information, character-level modeling, or context-aware strategies. For example, FastText represents a word as the sum of its character n-gram vectors (e.g., with boundary markers added, the trigrams of "<cat>" are "<ca", "cat", and "at>"), allowing it to build vectors for new words from their subword parts. This works well for morphologically rich languages and technical jargon, where even unfamiliar words share subword patterns with known terms.
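To make the n-gram idea concrete, here is a minimal Python sketch (using NumPy) of how a FastText-style lookup could compose a vector for an unseen word from hashed character n-grams. The bucket table is random stand-in data, the bucket count is far smaller than FastText's default of roughly two million, and Python's built-in hash stands in for the FNV-1a hash FastText actually uses:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Extract character n-grams from a word wrapped in '<' and '>' boundary
    markers, mirroring the FastText convention."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

# Toy bucketed n-gram table: FastText hashes n-grams into a fixed number of
# buckets, each holding a trained vector. These vectors are random stand-ins.
NUM_BUCKETS, DIM = 100_000, 50
rng = np.random.default_rng(0)
bucket_vectors = rng.normal(size=(NUM_BUCKETS, DIM)).astype(np.float32)

def oov_vector(word):
    """Approximate an unseen word's embedding as the average of its n-gram
    bucket vectors (a trained whole-word vector is added when available)."""
    ids = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    return bucket_vectors[ids].mean(axis=0)

print(char_ngrams("cat", 3, 3))          # ['<ca', 'cat', 'at>']
print(oov_vector("blorptastic").shape)   # (50,) even though the word was never seen
```

In practice a library such as gensim exposes this behavior directly: a trained gensim FastText model will generally return a vector for any string, whether or not it appeared in the training corpus.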
Contextual embedding models like BERT or RoBERTa take a different approach: they process text at the subword level using tokenization methods like WordPiece or Byte-Pair Encoding (BPE). These algorithms split rare or unknown words into smaller, known units. For instance, the word "unhappiness" might be split into ["un", "happiness"] by a BPE tokenizer or ["un", "##hap", "##piness"] by WordPiece, where the "##" prefix marks a piece that continues the preceding subword. Each subword unit receives its own embedding, and the model combines them during processing. This allows the system to handle OOV words by approximating their meaning through their components. For example, if "antidisestablishmentarianism" is split into ["anti", "##dis", "##establish", "##ment", "##arian", "##ism"], the model can infer the word's meaning from the prefixes, root, and suffixes it recognizes.
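The splitting itself is easy to inspect with the Hugging Face transformers tokenizer API. The sketch below assumes the bert-base-uncased checkpoint; the subword splits shown in the comments are illustrative, since the exact pieces depend on the vocabulary that particular tokenizer learned:

```python
from transformers import AutoTokenizer

# WordPiece tokenizer shipped with BERT; downloads the vocabulary on first use.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare words are decomposed into pieces the model already knows.
print(tokenizer.tokenize("unhappiness"))
# e.g. ['un', '##hap', '##pi', '##ness']
print(tokenizer.tokenize("antidisestablishmentarianism"))
# e.g. ['anti', '##dis', '##esta', '##blish', '##ment', '##arian', '##ism']

# Each piece maps to an ID in the model's embedding table, so nothing falls
# back to a generic "unknown" token as long as the characters are covered.
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("unhappiness")))
```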
When subword or character-level methods aren't sufficient, models may fall back to default behaviors. Some systems map every OOV word to a single generic "unknown" token (e.g., <UNK>), though this loses all specificity. Others compute an average vector of known words or use a hashing trick to map OOV terms to a fixed set of buckets. For developers, the choice depends on the task: subword methods are best when semantic accuracy matters, while the simpler fallbacks can suffice for lightweight applications, as the sketch below illustrates.
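The following sketch contrasts those two fallbacks with toy data: a shared <UNK>-style vector built by averaging the known vectors, and a hashing-trick lookup that at least keeps distinct OOV words distinguishable. The vocabulary and vectors are random stand-ins, not a trained model:

```python
import numpy as np

DIM = 50
rng = np.random.default_rng(1)

# Toy "trained" vocabulary; in a real system these vectors come from training.
vocab = {w: rng.normal(size=DIM) for w in ["cat", "dog", "fish"]}
unk_vector = np.mean(list(vocab.values()), axis=0)  # average-of-known fallback

NUM_BUCKETS = 1_000
hash_buckets = rng.normal(size=(NUM_BUCKETS, DIM))

def lookup_unk(word):
    """Fallback 1: every OOV word collapses onto one shared <UNK> vector."""
    return vocab.get(word, unk_vector)

def lookup_hashed(word):
    """Fallback 2 (hashing trick): OOV words map to one of a fixed set of
    buckets. A stable hash (not Python's per-process built-in) would be used
    in production so lookups survive restarts."""
    return vocab[word] if word in vocab else hash_buckets[hash(word) % NUM_BUCKETS]

print(np.allclose(lookup_unk("zebra"), lookup_unk("quokka")))        # True: no specificity
print(np.allclose(lookup_hashed("zebra"), lookup_hashed("quokka")))  # almost certainly False
```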
Libraries like Hugging Face's transformers automate OOV handling by integrating tokenizers that split words into subwords, making implementation straightforward; the end-to-end sketch after this paragraph shows the pattern. By combining these strategies, embedding models balance flexibility and efficiency, ensuring they can handle new words without requiring retraining.
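As a concrete end-to-end example of that workflow, a common pattern is to let the tokenizer break any input, OOV words included, into subwords and then pool the resulting contextual vectors. A minimal sketch, assuming the bert-base-uncased checkpoint and PyTorch; the example sentence and the pooling choice (masked mean over tokens) are only illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"  # assumed checkpoint; other encoders work the same way
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sentence = "Her thesis was on antidisestablishmentarianism."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)

# Mean-pool over non-padding tokens; the rare word contributes through its
# subword pieces exactly like any in-vocabulary token.
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_vector = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```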