Tokenization is the process of breaking text down into smaller units, called tokens, which serve as the input for LLMs. Tokens can be whole words, subwords, or individual characters, depending on the tokenization method. For example, the sentence “The cat sat” might be tokenized into [“The”, “cat”, “sat”], or into subword pieces such as [“Th”, “e”, “cat”, “sat”], depending on the vocabulary the tokenizer was trained with.
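As a minimal sketch, that sentence can be tokenized with the tiktoken library (one of several available tokenizer packages; the exact splits depend on which vocabulary is loaded, here assumed to be “cl100k_base”):

```python
import tiktoken

# Load a pretrained BPE vocabulary; "cl100k_base" is one widely used encoding.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("The cat sat")          # list of integer token IDs
tokens = [enc.decode([i]) for i in ids]  # decode each ID back to its text piece
print(ids)
print(tokens)
```

Running this shows how the same sentence becomes both a sequence of integer IDs and a sequence of text pieces; other vocabularies would produce different splits.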
Tokenization is essential because LLMs process numerical representations of tokens rather than raw text. Once the text is tokenized, each token is mapped to an integer ID, and those IDs index into an embedding table that supplies the vector representations the model actually computes with. This is what allows the model to process and generate text efficiently.
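As a rough illustration of that lookup step (with made-up token IDs and a tiny randomly initialized embedding table, not any real model's values):

```python
import numpy as np

vocab_size, d_model = 50_000, 8   # illustrative sizes, far smaller than real models
rng = np.random.default_rng(0)
# One row (vector) per token ID in the vocabulary.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = [464, 3797, 3332]            # hypothetical IDs for "The", "cat", "sat"
vectors = embedding_table[token_ids]     # shape (3, 8): the numeric input the model computes on
print(vectors.shape)
```

In a trained model the embedding table is learned rather than random, but the mechanics of turning token IDs into vectors are the same.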
Modern tokenization methods, such as Byte Pair Encoding (BPE) and WordPiece, are commonly used in LLMs. These methods build a vocabulary of frequent character sequences, so common words stay whole while rare words are split into smaller pieces, striking a balance between meaningful units and a compact vocabulary. Proper tokenization is critical to a model’s performance, since it shapes how well the model represents its input and how coherently it generates output.
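To make the BPE idea concrete, here is a toy sketch (not a production implementation) that repeatedly merges the most frequent adjacent pair of symbols in a tiny, assumed corpus; real tokenizers learn the same kind of merge rules from far larger corpora:

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of characters, with its frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
```

After a few merges, frequent sequences like “lo” and “low” become single tokens, which is exactly the behavior that lets common words stay whole while rarer words fall back to subword pieces.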