Sparse and Dense Embeddings
Learn about sparse and dense embeddings, their use cases, and a text classification example using these embeddings.
Read the entire series
- Natural Language Processing Fundamentals: Tokens, N-Grams, and Bag-of-Words Models
- Primer on Neural Networks and Embeddings for Language Models
- Sparse and Dense Embeddings
- Sentence Transformers for Long-Form Text
- Training Your Own Text Embedding Model
- Evaluating Your Embedding Model
- Class Activation Mapping (CAM): Better Interpretability in Deep Learning Models
- CLIP Object Detection: Merging AI Vision with Language Understanding
- Discover SPLADE: Revolutionizing Sparse Data Processing
- Exploring BERTopic: An Advanced Neural Topic Modeling Technique
- Streamlining Data: Effective Strategies for Reducing Dimensionality
- All-Mpnet-Base-V2: Enhancing Sentence Embedding with AI
- Time Series Embedding in Data Analysis
- Enhancing Information Retrieval with Sparse Embeddings
- What is BERT (Bidirectional Encoder Representations from Transformers)?
Introduction
In the previous post, we explored neural networks in depth and implemented our own recurrent neural network (RNN) in Python. This implementation allowed us to examine how RNNs process sequential data, such as text or time series. By sampling the hidden state at each timestep/token t, we can obtain a vector representation of all the content processed up to that point. This representation is known as a dense embedding - a fixed-length array of floating-point numbers that encodes the input in a compact, continuous vector space.
Dense embeddings have become a fundamental component of many natural language processing (NLP) and machine learning tasks, offering a powerful way to capture semantic relationships between words or other discrete entities. They're called "dense" because every dimension in the embedding vector typically contains a non-zero value, making them information-rich but potentially computationally intensive for very large vocabularies or datasets.
In this blog post, we'll build on the knowledge we acquired in the previous two articles, exploring the strengths (and weaknesses) of both dense and sparse embeddings. We'll also walk through a practical text classification example to illustrate how these embeddings can be applied in real-world situations. It's important to note that sparse embeddings, in contrast to dense embeddings, have most of their values as zero, making them more memory-efficient but potentially less expressive for certain tasks.
To set expectations, this post is not intended to be an exhaustive deep dive into embeddings. Instead, it serves as a brief and high-level overview of what embeddings are and how they can be effectively utilized in classification tasks. We'll touch on key concepts such as dimensionality reduction, semantic similarity, and the trade-offs between dense and sparse representations.
Throughout this post, we'll also briefly mention some popular word embedding techniques like Word2Vec, GloVe, and FastText for dense embeddings, and techniques like one-hot encoding and TF-IDF for sparse embeddings. These methods have revolutionized various fields of machine learning and natural language processing, enabling more nuanced understanding and processing of textual data.
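To make the contrast concrete before we go further, here is a minimal toy sketch (the vocabulary and the dense values are invented for illustration, not produced by any real model) of how the same word looks as a sparse one-hot vector versus a dense learned vector:
import numpy as np

# toy vocabulary: every word is assigned its own dimension
vocab = {"movie": 0, "great": 1, "boring": 2, "plot": 3}

# sparse one-hot encoding of "great": a single non-zero entry
one_hot = np.zeros(len(vocab))
one_hot[vocab["great"]] = 1.0
print(one_hot)  # [0. 1. 0. 0.]

# a dense embedding of "great": every dimension carries a value
# (these numbers are purely illustrative)
dense = np.array([0.12, -0.48, 0.91, 0.05])
print(dense)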
As we progress, we'll discuss how embeddings can be used to transform raw text into a format that machine learning models can understand and process effectively. We'll explore how dense embeddings can capture complex relationships between words, while sparse embeddings can be efficient for certain types of models and datasets.
Let's dive in and explore the fascinating world of embeddings, their applications, and how they can significantly enhance our machine learning models' performance on text-based tasks. By the end of this post, you'll have a clearer understanding of when and how to use different types of embeddings in your own NLP projects.
Sparse versus dense embeddings: a summary
Sparse Vectors
Sparse vectors are characterized by high dimensionality and the presence of few non-zero values. This structure makes them particularly well-suited for traditional information retrieval applications. In most cases (though not exclusively), the dimensions of a sparse vector correspond to different tokens across one or more languages, and each dimension is assigned a value that indicates the relative importance of that token within the document. This layout proves advantageous for tasks that involve some level of keyword matching.
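As a small, hypothetical sketch of this layout (the token indices and weights below are made up), a sparse vector can be stored as a mapping from token dimension to weight, and keyword-style matching reduces to a dot product over the dimensions the document and query share:
# sparse vectors stored as {token_dimension: weight}; zero entries are simply omitted
doc_vector = {101: 0.8, 2047: 0.3, 530: 1.2}   # e.g. weights for "vector", "sparse", "search"
query_vector = {101: 1.0, 9999: 0.5}           # e.g. weights for "vector", "tutorial"

# dot product over shared dimensions acts as a keyword-overlap score
score = sum(weight * query_vector[dim] for dim, weight in doc_vector.items() if dim in query_vector)
print(score)  # 0.8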
A notable algorithm for generating sparse vectors is BM25, which builds upon the foundations of TF-IDF (Term Frequency-Inverse Document Frequency). BM25 enhances the TF-IDF approach by incorporating two key improvements: a saturation function for term frequency and a length normalization factor. These additions help to address some of the limitations of TF-IDF, resulting in more robust and effective sparse representations.
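To make those two improvements tangible, here is a minimal sketch of the per-term BM25 weight, following one common (Okapi-style) formulation; there are several BM25 variants, so treat this as illustrative rather than canonical:
import math

def bm25_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """BM25 weight for a single term in a single document (Okapi-style)."""
    # inverse document frequency: rarer terms receive larger weights
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # saturation: repeated occurrences contribute less and less (bounded by k1 + 1)
    # length normalization: matches in long documents are down-weighted (controlled by b)
    saturated_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * saturated_tf

# the weight grows sub-linearly with raw term frequency
for tf in (1, 2, 5, 20):
    print(tf, round(bm25_weight(tf, doc_len=120, avg_doc_len=100, n_docs=10000, doc_freq=50), 3))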
Dense Vectors
Dense vectors, in contrast, are embeddings derived from neural networks. When arranged in an ordered array, these vectors capture the semantic essence of the input text. It's worth noting that dense embeddings are not limited to text processing; they are also extensively employed in computer vision to represent the semantic content of visual data. Text embedding models typically generate these vectors, and they are distinguished by the fact that most, if not all, of their elements are non-zero.
The non-zero nature of dense embeddings makes them particularly effective for semantic search applications. In such contexts, they can return highly similar results based on vector distance calculations, even in the absence of exact keyword matches. This capability allows for more nuanced and context-aware search results, often capturing relationships between concepts that might be missed by keyword-based approaches.
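As a minimal sketch of what "vector distance" means here (the three-dimensional embeddings below are invented stand-ins for real model outputs), documents can be ranked by cosine similarity to a query embedding even when they share no keywords with it:
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.2, 0.9, 0.1])
documents = {
    "doc_a": np.array([0.25, 0.85, 0.05]),  # semantically close to the query
    "doc_b": np.array([0.90, 0.10, 0.40]),  # unrelated content
}

# rank documents by similarity to the query
ranked = sorted(documents.items(), key=lambda item: cosine_similarity(query, item[1]), reverse=True)
print(ranked[0][0])  # doc_a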
To summarize the key differences: dense embeddings excel at encoding the semantics or fuzzy meaning of a whole word or piece of text. They can capture subtle relationships and contextual nuances that might not be immediately apparent from the words alone. Sparse embeddings, on the other hand, are superior for encoding exact or adjacent concepts. They provide a clear and efficient representation of specific terms and their importance within a document.
Both types of embeddings play crucial roles in text search applications, each bringing its own strengths to the table. However, there's a growing consensus in the field that dense embeddings have the potential to be more versatile and adaptable. They can be fine-tuned or targeted for specific tasks while maintaining their ability to capture broader semantic relationships.
The choice between sparse and dense embeddings often depends on the specific requirements of the task at hand. Sparse embeddings might be preferred in situations where exact keyword matching is crucial, such as in legal document searches or when dealing with highly specialized technical vocabularies. Dense embeddings, on the other hand, might be more suitable for tasks that require understanding of context and semantics, such as conversational AI or content recommendation systems.
As we continue to explore this topic, we'll examine a practical example using both sparse and dense embeddings to illustrate their differences and applications more concretely. This hands-on approach will help solidify our understanding of these important concepts in natural language processing and information retrieval.
Classification example
At their core, sparse and dense vectors are specialized types of feature vectors. In traditional machine learning, feature vectors were often derived from handcrafted algorithms applied to raw data. These vectors served as inputs to train various models, such as classifiers or regressors. For instance, in computer vision, images could be represented as a collection of key points, often using techniques like SIFT (Scale-Invariant Feature Transform) descriptors.
The evolution from handcrafted features to learned representations, such as those in sparse and dense embeddings, marks a significant advancement in machine learning. This shift has allowed for more nuanced and context-aware representations of data, particularly in complex domains like natural language processing and computer vision.
To illustrate the practical application of both sparse and dense embeddings, let's walk through an example addressing a classical problem in natural language processing (NLP) - sentiment classification. This task involves determining the emotional tone of a piece of text, typically categorizing it as positive, negative, or neutral. We'll train a classifier using two different approaches: dense embeddings from a pre-trained Recurrent Neural Network (RNN) and sparse embeddings generated using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm.
It's important to note that both types of embeddings carry valuable information, albeit in different forms. Dense embeddings from the RNN will capture semantic relationships and contextual nuances, while sparse embeddings from TF-IDF will highlight the importance of specific words or phrases in determining sentiment.
For our example, we'll use the IMDB dataset, a popular benchmark in sentiment analysis. This dataset consists of movie reviews from the Internet Movie Database (IMDB) website, each paired with a corresponding sentiment label (positive or negative). The IMDB dataset is particularly useful for this task as it provides a diverse range of writing styles, vocabulary, and sentence structures, making it a challenging and realistic test for our sentiment classification models.
To begin, we'll download the IMDB dataset using Hugging Face's datasets library. Hugging Face has become a central hub for NLP resources, offering easy access to a wide range of datasets and pre-trained models. The datasets library simplifies the process of acquiring and preprocessing NLP datasets, allowing us to focus on the core task of building and training our models.
Here's how we can use the datasets library to load the IMDB dataset:
from datasets import load_dataset
# load the IMDB dataset
train_dataset = load_dataset("imdb", split="train")
test_dataset = load_dataset("imdb", split="test")
With the dataset in place, let's generate embeddings from a pre-trained large RNN on the IMDB dataset. Here, we'll use RWKV, a "modern" RNN that's readily accessible in Hugging Face's transformers library:
import torch
from transformers import AutoTokenizer, RwkvModel
# instantiate the tokenizer and model
model_name = "RWKV/rwkv-4-169m-pile"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = RwkvModel.from_pretrained(model_name)
def generate_embeddings(dataset):
    """Generates embeddings for the input dataset."""
    embeddings = []
    for n, row in enumerate(dataset):
        if n % 32 == 0:
            print(f"{n}/{len(dataset)}")
        inputs = tokenizer(row["text"], return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings.append({"embedding": outputs.last_hidden_state, "label": row["label"]})
    return embeddings
# generate train and test embeddings
train_embeddings = generate_embeddings(train_dataset)
test_embeddings = generate_embeddings(test_dataset)
The generate_embeddings function collects the RNN's hidden state outputs; we'll use the hidden state at the last token of the full sequence as the embedding. We run this twice - once for the train split and once for the test split.
From here, we'll then use scikit-learn to train an SVM classifier using the train split embeddings and labels:
from sklearn.svm import SVC
# train a linear SVM classifier using the computed embeddings and the labelled training data
classifier = SVC(kernel="linear")
X = [row["embedding"][0,-1,:] for row in train_embeddings]
classifier.fit(X, train_dataset["label"])
From here, we can perform predictions using the test embeddings. scikit-learn provides a great framework for evaluations with just two lines of code:
from sklearn.metrics import accuracy_score
# predict on the test set
test_predictions = classifier.predict([row["embedding"][0,-1,:] for row in test_embeddings])
# score the classifier by percentage correct
accuracy = accuracy_score(test_dataset["label"], test_predictions)
print(f"accuracy: {accuracy}")
If you run this code, you should get accuracy: 0.86248.
Now, let's do the same, except for sparse text classification. We'll use the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer here. It combines term frequency (TF), a measure of how often a word appears in a document, with inverse document frequency (IDF), which reduces the weight of words common across the corpus. This results in sparse vectors where each dimension corresponds to a word's adjusted frequency, emphasizing words unique to a document and de-emphasizing common words.
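Before reaching for scikit-learn's optimized implementation, here is a minimal hand-rolled sketch of the textbook TF-IDF weight on a tiny invented corpus; note that TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its exact numbers will differ:
import math

corpus = [
    "a great movie with a great cast",
    "a boring movie",
    "great plot and great acting",
]

def tf_idf(term, doc, corpus):
    tokens = doc.split()
    tf = tokens.count(term) / len(tokens)              # how often the term appears in this document
    df = sum(term in d.split() for d in corpus)        # how many documents contain the term
    idf = math.log(len(corpus) / df)                   # common terms get pushed toward zero
    return tf * idf

print(round(tf_idf("great", corpus[0], corpus), 4))    # frequent in the document, but present in 2 of 3 documents
print(round(tf_idf("boring", corpus[1], corpus), 4))   # rarer across the corpus, so it gets a larger weight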
We'll go through the same process: computing the sparse embeddings on the train and test splits, training a linear SVM classifier, and evaluating the trained classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
# create a TF-IDF vectorizer and transform the training data
vectorizer = TfidfVectorizer(max_features=10000)
train_vectors = vectorizer.fit_transform(train_dataset["text"])
# transform the test data using the same vectorizer
test_vectors = vectorizer.transform(test_dataset["text"])
# repeat the process - train a linear SVM classifier, predict, and evaluate
classifier = SVC(kernel="linear")
classifier.fit(train_vectors, train_dataset["label"])
test_predictions = classifier.predict(test_vectors)
accuracy = accuracy_score(test_dataset["label"], test_predictions)
print(f"accuracy: {accuracy}")
You should see the following: accuracy: 0.88144. In this instance, sparse vectors perform slightly better than embeddings generated from RWKV.
Putting this all together, we get a script that looks like this:
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, RwkvModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
# load the IMDB dataset
train_dataset = load_dataset("imdb", split="train")
test_dataset = load_dataset("imdb", split="test")
# instantiate the tokenizer and model
model_name = "RWKV/rwkv-4-169m-pile"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = RwkvModel.from_pretrained(model_name)
def generate_embeddings(dataset):
    """Generates embeddings for the input dataset."""
    embeddings = []
    for n, row in enumerate(dataset):
        if n % 32 == 0:
            print(f"{n}/{len(dataset)}")
        inputs = tokenizer(row["text"], return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings.append({"embedding": outputs.last_hidden_state, "label": row["label"]})
    return embeddings
vectorizer = TfidfVectorizer(max_features=10000)
# generate train and test embeddings
train_embeddings = generate_embeddings(train_dataset)
test_embeddings = generate_embeddings(test_dataset)
train_vectors = vectorizer.fit_transform(train_dataset["text"])
test_vectors = vectorizer.transform(test_dataset["text"])
# classify and predict with dense vectors
classifier = SVC(kernel="linear")
X = [row["embedding"][0,-1,:] for row in train_embeddings]
classifier.fit(X, train_dataset["label"])
test_predictions = classifier.predict([row["embedding"][0,-1,:] for row in test_embeddings])
accuracy = accuracy_score(test_dataset["label"], test_predictions)
print(f"dense vector (RNN) accuracy: {accuracy}")
# classify and predict with sparse vectors
classifier = SVC(kernel="linear")
classifier.fit(train_vectors, train_dataset["label"])
test_predictions = classifier.predict(test_vectors)
accuracy = accuracy_score(test_dataset["label"], test_predictions)
print(f"sparse vector (tf-idf) accuracy: {accuracy}")
Remember that you'll need a decent amount of RAM and a fair amount of patience (or a more powerful runtime) if you'd like to run this experiment yourself - the embedding step processes all 50,000 reviews one at a time.
Wrapping up
In this post, we explored the process of generating both dense and sparse embeddings for the IMDB dataset, subsequently using these embeddings to train a Support Vector Machine (SVM) classifier. This hands-on approach allowed us to compare the effectiveness of different embedding techniques in a practical sentiment analysis task.
For the dense embeddings, we leveraged RWKV, a sophisticated Recurrent Neural Network (RNN) that has been pre-trained on a vast corpus of text data. RWKV's pre-training allows it to capture complex semantic relationships and nuances in language, which can be particularly beneficial for tasks like sentiment analysis.
In parallel, we generated sparse word embeddings by using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. TF-IDF creates high-dimensional, sparse representations that highlight the importance of specific words within the context of the entire dataset. This approach can be particularly effective in capturing key terms that are strongly associated with positive or negative sentiment.
We then performed the same classification exercise using both types of embeddings. In our specific experiment, the dense vectors demonstrated superior performance. However, it's important to note that this outcome isn't universal and can vary depending on the specific dataset, task, and model architecture.
An interesting avenue for future exploration is the possibility of combining both dense and sparse embeddings in an ensemble approach. Ensembles can often leverage the strengths of different representation techniques to achieve even better performance. We'll look deeper into ensemble methods later in this series, exploring how they can be used to create more robust and accurate models.
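As a first taste of what that could look like, here is a hedged sketch of one simple strategy - concatenating the sparse TF-IDF features with the dense RNN features and training a single classifier on the combined matrix. It reuses the variables from the full script above (train_vectors, test_vectors, train_embeddings, test_embeddings, train_dataset, test_dataset), and it's only one of many possible ways to combine the two representations:
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# stack the dense last-token embeddings into matrices
dense_train = np.stack([row["embedding"][0, -1, :].numpy() for row in train_embeddings])
dense_test = np.stack([row["embedding"][0, -1, :].numpy() for row in test_embeddings])

# concatenate sparse TF-IDF features with dense RNN features
combined_train = hstack([train_vectors, csr_matrix(dense_train)])
combined_test = hstack([test_vectors, csr_matrix(dense_test)])

# train and evaluate a single classifier on the combined representation
classifier = SVC(kernel="linear")
classifier.fit(combined_train, train_dataset["label"])
print(accuracy_score(test_dataset["label"], classifier.predict(combined_test)))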
Looking ahead to our next tutorial, we'll shift our focus from RNNs to transformers. Transformers can be conceptualized as unrolled RNNs with built-in attention mechanisms, offering several advantages in processing sequential data. We'll repeat this sentiment classification experiment using transformer-based embeddings, providing a direct comparison with the RNN-based approach we've explored here.
Additionally, we'll expand our discussion to cover the use of embeddings in retrieval situations. This is particularly relevant as it represents one of the most common applications of vector databases in today's machine learning landscape. Retrieval tasks, such as semantic search or recommendation systems, heavily rely on efficient embedding representations to find relevant information quickly.
By exploring these advanced topics, we'll gain a more comprehensive understanding of how different embedding techniques can be applied across various NLP tasks. We'll also develop insights into the trade-offs and considerations involved in choosing the most appropriate embedding method for a given problem.
Stay tuned for our upcoming tutorials, where we'll continue to deepen our exploration of these embedding techniques, their applications, and their impact on modern NLP and machine learning systems.