Natural Language Processing Fundamentals: Tokens, N-Grams, and Bag-of-Words Models
This post covers Natural Language Processing fundamentals that are essential to understanding all of today’s language models.
Read the entire series
- Natural Language Processing Fundamentals: Tokens, N-Grams, and Bag-of-Words Models
- Primer on Neural Networks and Embeddings for Language Models
- Sparse and Dense Embeddings
- Sentence Transformers for Long-Form Text
- Training Your Own Text Embedding Model
- Evaluating Your Embedding Model
- Class Activation Mapping (CAM): Better Interpretability in Deep Learning Models
- CLIP Object Detection: Merging AI Vision with Language Understanding
- Discover SPLADE: Revolutionizing Sparse Data Processing
- Exploring BERTopic: An Advanced Neural Topic Modeling Technique
- Streamlining Data: Effective Strategies for Reducing Dimensionality
- All-Mpnet-Base-V2: Enhancing Sentence Embedding with AI
- Time Series Embedding in Data Analysis
- Enhancing Information Retrieval with Sparse Embeddings
- What is BERT (Bidirectional Encoder Representations from Transformers)?
- What is Mixture of Experts (MoE)?
Introduction
ChatGPT (GPT-3.5) and other LLMs such as Pi, Claude, and Bard have taken the world by storm. How exactly do these language models work, and why do they work so well for the tasks they're trained on? While nobody knows the complete answer, a proper understanding of Natural Language Processing fundamentals can help demystify their inner workings. In particular, knowledge of tokens and n-grams is invaluable for understanding nearly all modern autoregressive and autoencoding models, and is broadly applicable to today's LLMs. Let's dive right in.
Tokens and N-grams
Introductory computer science courses in C or C++ often teach the concept of strings early on. A string in C, for example, can be represented as an array of characters terminated by the null character:
char my_str[128] = "Milvus";
In this example, each character can be thought of as a discrete unit, and stringing them together produces text that has meaning - in this case, `my_str` corresponds to the world's most widely adopted vector database. Put simply, this is the definition of an n-gram: a series of characters (or some other discrete unit, which we'll discuss in the next paragraph) which, when strung together, have coherent meaning. `N` in this instance corresponds to the total number of characters in the string (6, in this case).
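To make this concrete, here's a minimal Python sketch (the helper name char_ngrams is my own, not from the original post) that extracts all character n-grams of a given length from a string:
def char_ngrams(text: str, n: int):
    """Return all contiguous character n-grams of length n in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("Milvus", 3))
# ['Mil', 'ilv', 'lvu', 'vus']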
N-grams don't have to be composed of single characters - the concept extends to words as well. The string below, for example, is a trigram (3-gram) of words:
char my_str[128] = "Milvus vector database";
In the example above, it's fairly clear that `my_str` is composed of three words, but things get a bit tricky once we take punctuation into consideration:
char my_str[128] = "Milvus's architecture is unparalleled";
The above string, strictly speaking, is four words, but the first word `Milvus's` is a possessive noun that uses another word, `Milvus`, as its base. For a language model, it can be helpful to split words such as these into discrete units so that the extra context is preserved: `Milvus` and `'s`. These are called tokens, and the underlying method for splitting sentences into tokens is called tokenization. With this strategy, the string above is now a 5-gram of tokens.
All modern language models perform some sort of input tokenization before the data is transformed. There are a number of different tokenizers out there - WordPiece, for example, is a popular one which is used in most flavors of BERT. We won't dive too deep into tokenizers in this series - for folks who want to learn a bit more, check out Huggingface's tokenizer summary.
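As a quick illustration, here's a hedged sketch of running a WordPiece tokenizer through Huggingface's transformers library (assuming the library and the bert-base-uncased checkpoint are available); the exact subword splits you get depend on the tokenizer's vocabulary:
from transformers import AutoTokenizer

# load a WordPiece-based tokenizer (bert-base-uncased uses WordPiece)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# split the sentence into subword tokens; the exact pieces depend on the vocabulary
print(tokenizer.tokenize("Milvus's architecture is unparalleled"))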
N-gram Models
Now that we have a better understanding of n-grams, we can turn our attention towards n-gram models. In simple terms, an n-gram model is a simple probabilistic language model that outputs the probability that a particular token appears after an existing string of tokens. For example, we can model the probability $p(w_n \mid w_{n-1})$ that a particular token $w_n$ follows another token $w_{n-1}$ in a sentence or phrase:
$$p(\text{vector} \mid \text{database}) = 0.1$$
The above statement says that, for this particular language model, the word "vector" follows the word "database" with 10% probability. For n-gram models, these probabilities are computed directly from the counts of bigrams in the input corpus of documents; in other language models, they can be set manually or taken from the output of a machine learning model.
The example above is a bigram model, but we can extend this to sequences of arbitrary length. Here's an example of a trigram:
$$p(\text{database} \mid \text{Milvus}, \text{vector}) = 0.9$$
This encodes the fact that the word "database" will follow the tokens "Milvus vector" with 90% probability. Likewise, we can write:
$$p(\text{chocolate} \mid \text{Milvus}, \text{vector}) = 0.001$$
This encodes the fact that the subsequent word after "Milvus vector" is unlikely to be "chocolate" (0.1% probability, to be exact). Applying this to a sequence of longer length, we have:
$$p(w_n \mid w_1, w_2, \ldots, w_{n-1})$$
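To see how these conditional probabilities are used together, here is an illustrative decomposition (my own addition, not from the original post): a bigram model approximates the probability of an entire sequence as a product of successive bigram probabilities:
$$p(w_1, w_2, \ldots, w_n) \approx p(w_1) \prod_{k=2}^{n} p(w_k \mid w_{k-1})$$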
Now that we have a solid understanding of probabilistic n-gram models, let's turn our attention to an arguably more important question: how do we compute these probabilities? The short and simple answer is: we count the number of occurrences in our document or corpus of documents. Let's walk through an example with the following three phrases (the `<S>` at the start of each sentence denotes a special start-of-sentence token). For clarity, I've also added an extra space between the ending period and its preceding token in each sentence.
<S> Milvus is the most widely adopted vector database .
<S> vector search with Milvus .
<S> Milvus rocks .
Let's list out the bigrams that begin with either `<S>`, `Milvus`, or `vector`:
some_bigrams = {
    # these bigrams begin with <S>
    ("<S>", "Milvus"): 2,
    ("<S>", "vector"): 1,
    # these bigrams begin with Milvus
    ("Milvus", "is"): 1,
    ("Milvus", "."): 1,
    ("Milvus", "rocks"): 1,
    # these bigrams begin with vector
    ("vector", "database"): 1,
    ("vector", "search"): 1,
}
Given these occurrences, we can compute probabilities by normalizing across the total number of times each token appears. For example, `<S>` appears three times in the corpus and is followed by `Milvus` twice:
$$p(\text{Milvus} \mid \text{<S>}) = \frac{2}{3}$$
Similarly, `vector` appears twice and is followed by `database` once:
$$p(\text{database} \mid \text{vector}) = \frac{1}{2}$$
Armed with this knowledge, let's generalize this and write some code to build a bigram model. For simplicity, we'll assume that the tokens in all input documents are separated by whitespace (recall from the earlier section that modern tokenizers often have much more complex rules). Let's start by defining the model itself, i.e. the bigram counts and token counts.
from typing import List, Tuple
from collections import defaultdict

# keys correspond to tokens
# values are the number of occurrences
token_counts = defaultdict(int)

# keys correspond to 2-tuple bigrams
# values are the number of occurrences
bigram_counts = defaultdict(int)


def build_bigram_model(corpus: List[str]):
    """Builds a bigram model from a corpus of documents.
    """
    # loop through all documents in the corpus
    for doc in corpus:
        prev = "<S>"
        for word in doc.split():
            # update token counts
            token_counts[word] += 1
            # update bigram counts
            bigram = (prev, word)
            bigram_counts[bigram] += 1
            prev = word
        # add a dummy end-of-sequence token so probabilities add to one
        bigram_counts[(word, "</S>")] += 1
    return (token_counts, bigram_counts)


def bigram_probability(bigram: Tuple[str, str]):
    """Computes the likelihood of the bigram from the corpus.
    """
    return bigram_counts[bigram] / token_counts[bigram[0]]
The `build_bigram_model` function loops through the entire corpus of documents, splitting each by whitespace before storing bigram and token counts. We can then call the `bigram_probability` function, which looks up the corresponding bigram count and token count and returns the ratio.
Now let's test this on President Biden's 2022 State of the Union address. You can download the transcript and try it out with the code above, if you'd like.
>>> with open("state_of_the_union.txt", "r") as f:
... build_bigram_model([f.read()])
...
>>> print(bigram_probability(("keep", "moving")))
0.07692307692307693
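As a sanity check, we can also rebuild the model on the three toy sentences from earlier and confirm the hand-computed probability. This is a minimal sketch of my own; since token_counts and bigram_counts are module-level dictionaries, we clear them first, and we keep the literal <S> tokens in each sentence so the start-of-sentence token gets counted:
>>> token_counts.clear()
>>> bigram_counts.clear()
>>> toy_corpus = [
...     "<S> Milvus is the most widely adopted vector database .",
...     "<S> vector search with Milvus .",
...     "<S> Milvus rocks .",
... ]
>>> _ = build_bigram_model(toy_corpus)
>>> bigram_probability(("<S>", "Milvus"))
0.6666666666666666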
Bag-of-Words Models
Following on our discussion of n-grams, it's also worth briefly mentioning bag-of-words (BoW) models. BoW models represent a document or corpus of documents as an unordered set of tokens - they retain the frequency with which each token appears but disregard the order in which tokens appear within each document. As such, a whole document in a BoW model can be converted into a sparse vector, where each entry of the vector corresponds to the frequency with which a particular word appears in the document. Here, we'll represent the document "Milvus is the most widely adopted vector database. Vector search with Milvus is easy." as a BoW sparse vector:
# limited vocabulary
bow_vector = [
    0,  # a
    1,  # adopted
    0,  # bag
    0,  # book
    0,  # coordinate
    1,  # database
    1,  # easy
    0,  # fantastic
    0,  # good
    0,  # great
    2,  # is
    0,  # juggle
    2,  # Milvus
    1,  # most
    0,  # never
    0,  # proof
    0,  # quotient
    0,  # ratio
    0,  # rectify
    1,  # search
    1,  # the
    0,  # undulate
    2,  # vector
    1,  # widely
    1,  # with
    0,  # yes
    0,  # zebra
]
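Here's a minimal sketch of how such a vector could be computed. The helper build_bow_vector is my own illustration (not a standard API), and it uses a smaller vocabulary than the one above for brevity; tokens are lowercased and stripped of punctuation so that "Vector" and "vector." count as the same word:
import string
from collections import Counter

def build_bow_vector(document: str, vocabulary: list):
    """Counts how often each vocabulary word appears in the document."""
    # normalize: lowercase and strip surrounding punctuation before counting
    words = [w.strip(string.punctuation).lower() for w in document.split()]
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

vocab = ["adopted", "database", "easy", "is", "milvus", "most",
         "search", "the", "vector", "widely", "with"]
doc = "Milvus is the most widely adopted vector database. Vector search with Milvus is easy."
print(build_bow_vector(doc, vocab))
# [1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1]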
These sparse vectors can then be used in a variety of NLP tasks, such as text and sentiment classification. We won't discuss training and inference of BoW models in this blog post; if you'd like to know a bit more, Jason Brownlee's blog is a good resource.
While BoW models are simple to understand and use, they have clear limitations: they capture neither context nor the semantic meaning of individual tokens, making them unsuitable for anything beyond the simplest tasks. For this reason, nearly all NLP today relies on neural networks, a topic we will discuss in a later post.
Natural Language Processing Fundamentals Wrap up
In this post, we discussed three core fundamentals of Natural Language Processing - tokenization, n-grams, and bag-of-words models. In particular, the concepts behind n-grams will prove broadly useful in later topics covering how autoregressive and autoencoding models are trained today.
In the next tutorial, we'll continue our Embeddings for NLP series by discussing "modern" NLP, i.e. recurrent networks and text embeddings. Stay tuned!