DeepSeek's R1 model tackles out-of-vocabulary (OOV) words through a combination of subword tokenization and context-based embeddings. When the model encounters a word it has not seen during training, it does not simply discard it. Instead, subword tokenization breaks the word down into smaller units that are already in its vocabulary. For example, the word "unhappiness" might be split into the pieces "un", "happi", and "ness". This allows the model to leverage its existing vocabulary and the meanings of these smaller parts to derive some understanding of the original word.
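To make the idea concrete, here is a minimal sketch of greedy longest-match subword splitting. The vocabulary and the matching rule are purely illustrative assumptions for this example; they are not DeepSeek R1's actual tokenizer, which is not described in this text.

```python
# Illustrative only: a toy greedy longest-match subword tokenizer.
# TOY_VOCAB is made up for demonstration; it is NOT R1's real vocabulary.

TOY_VOCAB = {"un", "happi", "ness", "happy"}

def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split a word into the longest known vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest possible substring first.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No known piece matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(subword_tokenize("unhappiness", TOY_VOCAB))
# ['un', 'happi', 'ness']
```

Because every unseen word can always be decomposed down to characters in the worst case, the model never has to map an input to a single catch-all "unknown" token.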
Additionally, the R1 model employs contextual embeddings, which means it takes into account the surrounding words when determining the meaning of an OOV term. For instance, if the model encounters the phrase "unhappiness in the workplace," it can analyze the context, such as "unhappiness" being linked to "workplace," to infer that the word relates to negative feelings about one's job. In this way, the model can assign a usable representation even to words it was not trained on and produce more relevant responses or predictions based on the context in which they appear.
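The sketch below illustrates the general idea of contextual embeddings with a small Hugging Face encoder. The model name "bert-base-uncased" is a stand-in assumption chosen only because it is a widely available encoder; it is not DeepSeek R1, and the pooling strategy is just one simple way to read off a per-word vector.

```python
# Illustrative only: the same surface word gets different vectors in
# different sentences. "bert-base-uncased" is a generic stand-in encoder,
# NOT DeepSeek R1.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_word(sentence: str, word: str) -> torch.Tensor:
    """Mean-pool the hidden states of the subword pieces that make up `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = inputs["input_ids"][0].tolist()
    # Locate the word's subword ids inside the tokenized sentence.
    for i in range(len(ids) - len(word_ids) + 1):
        if ids[i : i + len(word_ids)] == word_ids:
            return hidden[i : i + len(word_ids)].mean(dim=0)
    raise ValueError(f"{word!r} not found in {sentence!r}")

a = embed_word("there is growing unhappiness in the workplace.", "unhappiness")
b = embed_word("the novel explores childhood unhappiness and loss.", "unhappiness")
# High but not 1.0: the surrounding context shifts each vector.
print(torch.cosine_similarity(a, b, dim=0))
```

The point of the comparison is that the vector for "unhappiness" is not a fixed dictionary entry; it is computed fresh from each sentence, which is what lets an OOV word inherit meaning from its neighbors.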
Overall, DeepSeek's R1 model effectively minimizes the impact of OOV words by using subword tokenization to break down unfamiliar terms and context-based embeddings to understand their meanings based on surrounding text. This strategy helps improve the robustness and flexibility of the model, ensuring that it remains effective across diverse inputs, including newly coined terms or specialized jargon not present during training. These techniques are essential for maintaining performance in real-world applications where linguistic diversity is common.