LLMs handle out-of-vocabulary (OOV) words using subword tokenization techniques such as Byte Pair Encoding (BPE) or WordPiece. These methods split rare or unseen words into smaller units (subwords) or individual characters that are already part of the model’s vocabulary. For example, the word “unhappiness” might be tokenized as [“un”, “happiness”] or [“un”, “hap”, “pi”, “ness”], depending on the tokenizer’s learned vocabulary.
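A minimal sketch of the mechanism, using a toy hand-picked vocabulary and a WordPiece-style greedy longest-match rule (real tokenizers learn their vocabularies from data, so the actual splits differ by model):

```python
# Toy WordPiece-style tokenizer: greedily match the longest known subword
# from the left; continuation pieces inside a word are prefixed with "##".
VOCAB = {"un", "happy", "##happiness", "##hap", "##pi", "##ness"}

def wordpiece_tokenize(word, vocab=VOCAB, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark as a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known subword covers this span
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("unhappiness"))  # ['un', '##happiness'] with this toy vocab
```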
By breaking OOV words into subwords, the model can process and understand their components, even if the exact word has not been seen during training. This allows LLMs to generalize better to new inputs. Subword tokenization also helps in encoding domain-specific terms or technical jargon by reusing familiar components.
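For instance, with the Hugging Face `transformers` library (assuming it is installed; the exact subword pieces depend on each pretrained model’s vocabulary), an unseen or domain-specific word still maps to known subword IDs rather than a single unknown token:

```python
from transformers import AutoTokenizer  # pip install transformers

# Load a pretrained tokenizer; its subword vocabulary was fixed at training time.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Domain terms that are unlikely to be single vocabulary entries are still
# represented as sequences of familiar subword pieces.
for word in ["unhappiness", "pharmacokinetics", "kubectl"]:
    pieces = tokenizer.tokenize(word)
    ids = tokenizer.convert_tokens_to_ids(pieces)
    print(word, "->", pieces, ids)
```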
While effective, subword tokenization has limitations. Over-segmentation can fragment a rare word into pieces that carry little meaning on their own, and it lengthens the input sequence, which can hurt downstream performance. To mitigate this, developers can fine-tune the model on domain-specific data or expand the vocabulary to include specialized terms as whole tokens, ensuring better performance on OOV inputs.
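As an illustration of the vocabulary-expansion route with Hugging Face `transformers` (the added terms here are arbitrary examples; the new embedding rows start out untrained, so fine-tuning on domain data is still required for them to be useful):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Register domain-specific terms so they are no longer split into many fragments.
new_terms = ["pharmacokinetics", "kubectl"]  # illustrative examples
num_added = tokenizer.add_tokens(new_terms)

# Grow the embedding matrix to cover the new token IDs; the new rows are
# randomly initialized and only become meaningful after domain fine-tuning.
model.resize_token_embeddings(len(tokenizer))

print(num_added, tokenizer.tokenize("pharmacokinetics"))  # now a single token
```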