Sparse and Dense Embeddings
Learn about sparse and dense embeddings, their use cases, and a text classification example using these embeddings.
Read the entire series
- Natural Language Processing Fundamentals: Tokens, N-Grams, and Bag-of-Words Models
- Primer on Neural Networks and Embeddings for Language Models
- Sparse and Dense Embeddings
- Sentence Transformers for Long-Form Text
- Training Your Own Text Embedding Model
- Evaluating Your Embedding Model
- Class Activation Mapping: Unveiling The Visual Story
- CLIP Object Detection: Merging AI Vision with Language Understanding
- Discover SPLADE: Revolutionizing Sparse Data Processing
- Exploring BERTopic: A New Era of Neural Topic Modeling
- Streamlining Data: Effective Strategies for Reducing Dimensionality
- All-Mpnet-Base-V2: Enhancing Sentence Embedding with AI
- Time Series Embedding in Data Analysis
- Enhancing Information Retrieval with Learned Sparse Embeddings
Introduction
In the previous post, we recapped neural networks and implemented our recurrent neural network in Python. By sampling the hidden state at each timestep/token t, we can get a representation of all the content seen up to that point. This is known as a dense embedding - a fixed-length array of floating-point numbers that encodes the input.
In this blog post, we'll build on what we learned in the previous two articles, exploring the strengths (and weaknesses) of dense and sparse embeddings and working through a text classification example. Note that this post is not meant to be a deep dive into embeddings; it's intended as a brief, high-level overview of what embeddings are and how they can be used in a classification scenario.
Let's dive in.
Sparse versus dense embeddings: a summary
Sparse vectors are very high-dimensional but contain few non-zero values, making them suitable for traditional information retrieval use cases. Typically (but not always), the dimensions represent different tokens in one or more languages, with values assigned to each, indicating their relative importance in that document. This type of layout is beneficial for tasks that require some keyword matching. BM25 is a well-known algorithm for generating sparse vectors, improving upon TF-IDF by incorporating a saturation function for term frequency and a length normalization factor.
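To make the term-weighting idea concrete, here's a minimal sketch of the BM25 weight described above. The toy corpus, the k1 and b values, and the bm25_weight helper are illustrative choices for this post, not part of any particular library:
import math
from collections import Counter
# toy corpus: each document is a list of tokens
corpus = [
    "the movie was great".split(),
    "the plot was thin but the acting was great".split(),
    "terrible movie".split(),
]
N = len(corpus)                                  # number of documents
avgdl = sum(len(d) for d in corpus) / N          # average document length
df = Counter(t for d in corpus for t in set(d))  # document frequency of each term
def bm25_weight(term, doc, k1=1.5, b=0.75):
    """BM25 weight of one term in one document (k1 and b are typical, illustrative values)."""
    tf = doc.count(term)
    # smoothed inverse document frequency
    idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
    # term-frequency saturation plus document-length normalization
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
# the sparse vector for a document: non-zero weights only for the terms it contains
doc = corpus[0]
print({term: bm25_weight(term, doc) for term in set(doc)})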
Dense vectors, on the other hand, are embeddings from neural networks that, when combined in an ordered array, capture the semantics of the input text (they are also employed in computer vision for representing the semantics of visual data). These vectors are typically generated by text embedding models and are characterized by most or all elements being non-zero. This makes them highly effective for semantic search, as they return the most similar results based on distance, even without exact keyword matches.
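As a quick illustration of distance-based retrieval with dense vectors, the sketch below compares toy embeddings using cosine similarity; the vectors here are made up for the example rather than produced by a real embedding model:
import numpy as np
def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# toy dense embeddings (a real model would output hundreds of dimensions)
query = np.array([0.9, 0.1, 0.4])
doc_a = np.array([0.8, 0.2, 0.5])   # semantically close to the query
doc_b = np.array([-0.3, 0.9, 0.1])  # semantically distant
print(cosine_similarity(query, doc_a))  # closer to 1.0
print(cosine_similarity(query, doc_b))  # closer to 0.0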
In short, dense embeddings are better at encoding the semantics or fuzzy meaning of a piece of text, while sparse embeddings are better at encoding exact terms or closely adjacent concepts. Both are incredibly useful for text search, but it's generally believed that dense embeddings can be made more general-purpose and/or more targeted to a specific domain.
Let's go through a quick example of both.
Classification example
At their core, sparse and dense vectors are just special types of feature vectors. In more traditional machine learning, feature vectors were often derived from handcrafted algorithms applied to the raw data and used as inputs to train a classifier or regressor. Images, for example, could be represented as a collection of keypoints (SIFT descriptors).
Let's walk through an example of using both sparse and dense embeddings to tackle a classic problem in natural language processing (NLP) - sentiment classification. We'll train one classifier using dense embeddings from a pre-trained RNN and another using sparse embeddings from the TF-IDF algorithm. Both types of embeddings carry valuable information.
Let's start by downloading the IMDB dataset. Each sample in this dataset is a movie review from IMDB paired with a sentiment label (positive or negative). We can grab the dataset using Hugging Face's datasets library:
from datasets import load_dataset
# load the IMDB dataset
train_dataset = load_dataset("imdb", split="train")
test_dataset = load_dataset("imdb", split="test")
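As a quick, optional sanity check, each sample is a dictionary with a text field (the review) and a label field (0 for negative, 1 for positive in this dataset):
# peek at one training sample: a review string and its binary sentiment label
sample = train_dataset[0]
print(sample["label"])
print(sample["text"][:200])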
With the dataset in place, let's generate embeddings for it using a large pre-trained RNN. Here, we'll use RWKV, a "modern" RNN that's readily accessible in Hugging Face's transformers library:
import torch
from transformers import AutoTokenizer, RwkvModel
# instantiate the tokenizer and model
model_name = "RWKV/rwkv-4-169m-pile"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = RwkvModel.from_pretrained(model_name)
def generate_embeddings(dataset):
    """Generates embeddings for the input dataset."""
    embeddings = []
    for n, row in enumerate(dataset):
        # print progress every 32 samples
        if n % 32 == 0:
            print(f"{n}/{len(dataset)}")
        inputs = tokenizer(row["text"], return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # store the full hidden-state sequence; we'll slice out the final token later
        embeddings.append({"embedding": outputs.last_hidden_state, "label": row["label"]})
    return embeddings
# generate train and test embeddings
train_embeddings = generate_embeddings(train_dataset)
test_embeddings = generate_embeddings(test_dataset)
The generate_embeddings function stores the RNN's hidden-state outputs for every token in the review; we'll use the hidden state at the final token as the embedding for the full sequence. We run this twice - once for the train split and once for the test split.
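Before training the classifier, it can help to check what we actually stored: each embedding is a hidden-state tensor of shape (1, sequence_length, hidden_size), and slicing out the final token gives us a single fixed-length vector. This check isn't part of the original walkthrough; the hidden size noted below should be 768 for this checkpoint, but verify it on your own run:
# shape of the stored hidden states: (batch=1, sequence_length, hidden_size)
first = train_embeddings[0]["embedding"]
print(first.shape)
# the sequence-level embedding we'll feed to the classifier: the final token's hidden state
sentence_vector = first[0, -1, :]
print(sentence_vector.shape)  # (hidden_size,) - expected to be 768 for this checkpoint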
From here, we'll use scikit-learn to train an SVM classifier on the train split embeddings and labels:
from sklearn.svm import SVC
# train a linear SVM classifier using the computed embeddings and the labeled training data
classifier = SVC(kernel="linear")
# use the final token's hidden state as each review's embedding
X = [row["embedding"][0, -1, :] for row in train_embeddings]
classifier.fit(X, train_dataset["label"])
From here, we can make predictions on the test embeddings. scikit-learn lets us evaluate the results with just two lines of code:
from sklearn.metrics import accuracy_score
# predict on the test set
test_predictions = classifier.predict([row["embedding"][0,-1,:] for row in test_embeddings])
# score the classifier by percentage correct
accuracy = accuracy_score(test_dataset["label"], test_predictions)
print(f"accuracy: {accuracy}")
If you run this code, you should get accuracy: 0.86248.
Now, let's do the same, except for sparse text classification. We'll use the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer here. It combines term frequency (TF), a measure of how often a word appears in a document, with inverse document frequency (IDF), which reduces the weight of words common across the corpus. This results in sparse vectors where each dimension corresponds to a word's adjusted frequency, emphasizing words unique to a document and de-emphasizing common words.
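To make the formula concrete, here's a minimal sketch of the textbook TF-IDF weight; the toy documents and the tfidf_weight helper are illustrative only, and scikit-learn's TfidfVectorizer will produce slightly different numbers because it uses a smoothed IDF and L2-normalizes each row:
import math
# toy corpus of three "documents"
docs = [
    "great movie great acting",
    "boring movie",
    "great plot",
]
N = len(docs)
def tfidf_weight(term, doc):
    """Textbook TF-IDF: term frequency in the document times log(N / document frequency)."""
    tf = doc.split().count(term)
    df = sum(1 for d in docs if term in d.split())
    return tf * math.log(N / df)
print(tfidf_weight("great", docs[0]))   # appears twice here, but in 2 of 3 documents
print(tfidf_weight("boring", docs[1]))  # appears once, but only in 1 of 3 documents -> higher IDF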
We'll go through the same process: computing the sparse embeddings on the train and test splits, training a linear SVM classifier, and evaluating the trained classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
# create a TF-IDF vectorizer and transform the training data
vectorizer = TfidfVectorizer(max_features=10000)
train_vectors = vectorizer.fit_transform(train_dataset["text"])
# transform the test data using the same vectorizer
test_vectors = vectorizer.transform(test_dataset["text"])
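# (illustrative check, not part of the original walkthrough) the vectorizer returns a
# scipy sparse matrix: 25,000 rows by 10,000 columns, but only the vocabulary terms
# actually present in each review are stored as non-zero values
print(train_vectors.shape)
print(train_vectors.nnz / (train_vectors.shape[0] * train_vectors.shape[1]))  # fraction of non-zeros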
# repeat the process - train a linear SVM classifier, predict, and evaluate
classifier = SVC(kernel="linear")
classifier.fit(train_vectors, train_dataset["label"])
test_predictions = classifier.predict(test_vectors)
accuracy = accuracy_score(test_dataset["label"], test_predictions)
print(f"accuracy: {accuracy}")
You should see the following: accuracy: 0.88144. In this instance, sparse vectors perform slightly better than the embeddings generated from RWKV.
Putting this all together, we get a script that looks like this:
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, RwkvModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
# load the IMDB dataset
train_dataset = load_dataset("imdb", split="train")
test_dataset = load_dataset("imdb", split="test")
# instantiate the tokenizer and model
model_name = "RWKV/rwkv-4-169m-pile"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = RwkvModel.from_pretrained(model_name)
def generate_embeddings(dataset):
    """Generates embeddings for the input dataset."""
    embeddings = []
    for n, row in enumerate(dataset):
        if n % 32 == 0:
            print(f"{n}/{len(dataset)}")
        inputs = tokenizer(row["text"], return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings.append({"embedding": outputs.last_hidden_state, "label": row["label"]})
    return embeddings
vectorizer = TfidfVectorizer(max_features=10000)
# generate train and test embeddings
train_embeddings = generate_embeddings(train_dataset)
test_embeddings = generate_embeddings(test_dataset)
train_vectors = vectorizer.fit_transform(train_dataset["text"])
test_vectors = vectorizer.transform(test_dataset["text"])
# classify and predict with dense vectors
classifier = SVC(kernel="linear")
X = [row["embedding"][0,-1,:] for row in train_embeddings]
classifier.fit(X, train_dataset["label"])
test_predictions = classifier.predict([row["embedding"][0,-1,:] for row in test_embeddings])
accuracy = accuracy_score(test_dataset["label"], test_predictions)
print(f"dense vector (RNN) accuracy: {accuracy}")
# classify and predict with sparse vectors
classifier = SVC(kernel="linear")
classifier.fit(train_vectors, train_dataset["label"])
test_predictions = classifier.predict(test_vectors)
accuracy = accuracy_score(test_dataset["label"], test_predictions)
print(f"sparse vector (tf-idf) accuracy: {accuracy}")
Remember that you'll need a decent amount of RAM and a capable runtime (running the RNN over all 50,000 reviews takes a while) if you'd like to run this experiment yourself.
Wrapping up
In this post, we generated dense and sparse embeddings for the IMDB dataset before training an SVM classifier using labeled data. For the dense embeddings, we used RWKV - an RNN pre-trained across a large quantity of text data. We also generated TF-IDF sparse vectors and performed the same exercise. In this instance, the dense vectors performed better, but there's always the possibility of using both together in an ensemble. We'll discuss this much later in this series.
In the following tutorial, we'll move away from RNNs and discuss embeddings from transformers, which you can think of as unrolled RNNs with built-in attention mechanisms. We'll run this same experiment with transformers and compare the results. We'll also discuss using embeddings in a retrieval scenario, one of the most common ways vector databases are used today. Stay tuned!