LLMs handle out-of-vocabulary (OOV) words using subword tokenization techniques such as Byte Pair Encoding (BPE) or WordPiece. These methods split rare or unseen words into smaller units (subwords) or individual characters that are already part of the model’s vocabulary. For example, the word “unhappiness” might be tokenized as [“un”, “happiness”] or [“un”, “hap”, “pi”, “ness”], depending on the tokenizer’s learned vocabulary.
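A minimal sketch of the mechanism, using a toy hand-picked vocabulary and a WordPiece-style greedy longest-match rule (real tokenizers learn their vocabularies from data, so the actual splits differ by model):

```python
# Toy WordPiece-style tokenizer: greedily match the longest known subword
# from the left; continuation pieces inside a word are prefixed with "##".
VOCAB = {"un", "happy", "##happiness", "##hap", "##pi", "##ness"}

def wordpiece_tokenize(word, vocab=VOCAB, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark as a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known subword covers this span
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("unhappiness"))  # ['un', '##happiness'] with this toy vocab
```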
By breaking OOV words into subwords, the model can process and understand their components, even if the exact word has not been seen during training. This allows LLMs to generalize better to new inputs. Subword tokenization also helps in encoding domain-specific terms or technical jargon by reusing familiar components.
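For instance, with the Hugging Face `transformers` library (assuming it is installed; the exact subword pieces depend on each pretrained model’s vocabulary), an unseen or domain-specific word still maps to known subword IDs rather than a single unknown token:

```python
from transformers import AutoTokenizer  # pip install transformers

# Load a pretrained tokenizer; its subword vocabulary was fixed at training time.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Domain terms that are unlikely to be single vocabulary entries are still
# represented as sequences of familiar subword pieces.
for word in ["unhappiness", "pharmacokinetics", "kubectl"]:
    pieces = tokenizer.tokenize(word)
    ids = tokenizer.convert_tokens_to_ids(pieces)
    print(word, "->", pieces, ids)
```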
While effective, subword tokenization has limitations. Over-segmentation can fragment a rare word into pieces that carry little meaning on their own, and it lengthens the input sequence, which can hurt downstream performance. To mitigate this, developers can fine-tune the model on domain-specific data or expand the vocabulary to include specialized terms as whole tokens, ensuring better performance on OOV inputs.
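As an illustration of the vocabulary-expansion route with Hugging Face `transformers` (the added terms here are arbitrary examples; the new embedding rows start out untrained, so fine-tuning on domain data is still required for them to be useful):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Register domain-specific terms so they are no longer split into many fragments.
new_terms = ["pharmacokinetics", "kubectl"]  # illustrative examples
num_added = tokenizer.add_tokens(new_terms)

# Grow the embedding matrix to cover the new token IDs; the new rows are
# randomly initialized and only become meaningful after domain fine-tuning.
model.resize_token_embeddings(len(tokenizer))

print(num_added, tokenizer.tokenize("pharmacokinetics"))  # now a single token
```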