Evaluating Your Embedding Model
We'll review some key considerations for selecting a model and walk through a practical example of using Arize Phoenix and RAGAS to evaluate different text embedding models.
Read the entire series
- Natural Language Processing Fundamentals: Tokens, N-Grams, and Bag-of-Words Models
- Primer on Neural Networks and Embeddings for Language Models
- Sparse and Dense Embeddings
- Sentence Transformers for Long-Form Text
- Training Your Own Text Embedding Model
- Evaluating Your Embedding Model
- Class Activation Mapping: Unveiling The Visual Story
- CLIP Object Detection: Merging AI Vision with Language Understanding
- Discover SPLADE: Revolutionizing Sparse Data Processing
- Exploring BERTopic: A New Era of Neural Topic Modeling
- Streamlining Data: Effective Strategies for Reducing Dimensionality
- All-Mpnet-Base-V2: Enhancing Sentence Embedding with AI
- Time Series Embedding in Data Analysis
- Enhancing Information Retrieval with Sparse Embeddings
Latest update: June 28
Introduction to Evaluating Your Embedding Model
In the past couple of blog posts, we discussed the architecture of today's dense embedding models and looked at some basic usage of the sentence-transformers
library. Many pre-trained models are available via sentence-transformers.
Still, nearly all use the same architecture as the original SBERT model - pooled features over a transformer encoder trained with masked language modeling.
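As a quick refresher, "pooled features" here usually means something like mean pooling over the encoder's token embeddings. A minimal sketch with the transformers library (the checkpoint name is just an example):

import torch
from transformers import AutoModel, AutoTokenizer

# the checkpoint name is just an example; any encoder works here
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def mean_pool(texts):
    """Mean-pool token embeddings into a single vector per input text."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = encoder(**inputs).last_hidden_state
    # zero out padding tokens before averaging
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

embeddings = mean_pool(["Pooled features over a transformer encoder."])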
From the perspective of building an application, choosing a proper text embedding model is crucial and often depends on the application's specific needs. In this blog, we'll review some key considerations for selecting a model. We'll also walk through a practical example of using Arize Phoenix and RAGAS to evaluate different text embedding models.
Key Considerations
Most applications today use OpenAI's embedding endpoint to generate embeddings. While it's an excellent general-purpose embedding model to get started with, it's often prudent to move to either a) your own embedding model or b) a different open- or closed-source model.
Benchmark leaderboards such as MTEB are a good place to compare candidates, but be careful when evaluating those numbers - some models overfit the benchmarks and may not be ideal for your use case. Always evaluate on your own data.
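If you're getting started with OpenAI's embedding endpoint before evaluating alternatives, here's a minimal sketch using the openai Python client (the model name below is just an example; the client reads your API key from the environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "text-embedding-3-small" is just an example model name
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The quick brown fox jumps over the lazy dog."],
)
embedding = response.data[0].embedding  # a plain list of floats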
Task Type and Complexity
The complexity of the embedding task should greatly influence the choice of an embedding model. Simple tasks like sentiment analysis or keyword matching can likely use any general-purpose model on the MTEB leaderboard and achieve reasonable performance. There are many applications, however, that require specialized embeddings. Take, for example, two sequences:
- "Let's eat, Chris."
- "Let's eat Chris."
The first and second sequences are identical except for a comma. As such, most general-purpose models would place these two sequences very close to each other in a high-dimensional embedding space. However, for specific applications (such as ones that emphasize "appropriateness"), these two sequences should be on opposite ends of the spectrum, with low similarity. Taking this one step further, more complex tasks such as question answering or language translation will require models that can capture the subtleties and nuances of the task. There is never one correct answer.
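You can check this claim for yourself with any general-purpose model. Here's a quick sketch using sentence-transformers (the model choice is arbitrary):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["Let's eat, Chris.", "Let's eat Chris."])

# the cosine similarity will typically be very high,
# despite the two sentences meaning very different things
print(util.cos_sim(embeddings[0], embeddings[1]))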
Model Performance vs. Cost
There's often a trade-off between the performance of an embedding model and its computational efficiency. High-performing models such as e5-large-v2 can be more "accurate" but have more parameters and much higher inference latency. This can be a limiting factor for applications with real-time requirements or limited hardware capabilities. Real-time, high-throughput applications such as user-facing chatbots or recommender systems need fast results with low latency. In such cases, choosing a more compact model to minimize costs is often prudent. On the other hand, it's often better to use a much bigger model for applications where accuracy is paramount, such as semantic search across a company's limited internal corpus of legal or financial documents.
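One simple way to quantify this trade-off is to time a batch of encodes with a small and a large model side by side. A rough sketch (exact numbers depend heavily on your hardware):

import time
from sentence_transformers import SentenceTransformer

sentences = ["This is a test sentence about embedding model latency."] * 256

for name in ["intfloat/e5-small-v2", "intfloat/e5-large-v2"]:
    model = SentenceTransformer(name)
    start = time.perf_counter()
    model.encode(sentences, batch_size=32)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.2f}s for {len(sentences)} sentences")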
Domain of Text
The domain specificity of the language used in the application's text data is another crucial consideration. Many embedding models are trained on general language data, which might not capture the nuances of specialized vocabularies or jargon. Models trained or fine-tuned on domain-specific datasets can provide more accurate embeddings for texts within those domains. In applications like medical diagnosis from patient records, legal document analysis, or technical support for specific products, domain-specific models can significantly outperform general-purpose models by better understanding the specialized language used in these fields.
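A quick way to sanity-check domain fit is to encode a few in-domain sentence pairs with both a general-purpose model and a domain-specific one, then compare similarities. The domain-specific checkpoint name below is a hypothetical placeholder - substitute a model trained on your domain:

from sentence_transformers import SentenceTransformer, util

pairs = [
    ("The patient presented with acute myocardial infarction.",
     "The patient had a heart attack."),
]

# "your-org/biomedical-embedding-model" is a hypothetical placeholder name
for name in ["all-MiniLM-L6-v2", "your-org/biomedical-embedding-model"]:
    model = SentenceTransformer(name)
    for a, b in pairs:
        sim = util.cos_sim(model.encode(a), model.encode(b))
        print(f"{name}: {sim.item():.3f}")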
Evaluating Text Embedding Models
Armed with the right evaluation tools, you can make an informed, data-driven choice between embedding models. In this section, we'll look at two ways to evaluate text embedding models.
Arize-phoenix
Arize AI's Phoenix library is an excellent multipurpose tool that helps evaluate LLMs and embedding models. In particular, it provides an easy and flexible way to log and view high-dimensional embeddings to understand where things may go wrong. According to the README: "Phoenix provides an A/B testing framework to help you understand how your embeddings are changing over time and how they are changing between different versions of your model."
Let's run a quick example of using Phoenix to understand problematic embeddings. To do this, we'll first revisit the IMDb dataset from before. Recall how we previously generated embeddings with the sentence-transformers
library:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# load the IMDB dataset
dataset = load_dataset("imdb", split="test")

# instantiate the model
model = SentenceTransformer("intfloat/e5-small-v2")

def generate_embeddings(dataset):
    """Generates embeddings for the input dataset."""
    global model
    return model.encode([row["text"] for row in dataset])

# generate embeddings
embeddings = generate_embeddings(dataset)
We can load these into a pandas
dataframe:
import pandas as pd
# create the pandas dataframe
df = pd.DataFrame({"embedding": embeddings.tolist()})
df["text"] = [row["text"] for row in dataset]
df["label"] = [row["label"] for row in dataset]
This dataframe can then be used directly in Phoenix:
% pip install -U arize-phoenix
import phoenix as px

# create the schema
schema = px.Schema(
    feature_column_names=["text"],
    actual_label_column_name="label",
    embedding_feature_column_names={
        "text_embedding": px.EmbeddingColumnNames(
            vector_column_name="embedding",
            # link_to_data_column_name="text",
        ),
    },
)

ds = px.Dataset(df, schema)
session = px.launch_app(primary=ds)
session.url
Here, we first created a Schema
object. This schema defines the data associated with each embedding, which, in our case, is the text and the label. Once that's done, we specify the data frame we created earlier and its schema for Phoenix before launching the app.
Opening the URL in a browser, we get a GUI that looks like this:
I've only displayed the first 100 elements of the IMDB test dataset here, along with an extra "mystery" embedding that isn't a movie review. We might expect that embedding to look like an outlier among the rest of the data. Indeed, it is:
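If you'd like to reproduce this view, here's a rough sketch of how such a "mystery" row might be appended to the dataframe before logging (the sample text and sentinel label below are arbitrary):

import pandas as pd

# an out-of-domain "mystery" text that clearly isn't a movie review
mystery_text = "Quarterly revenue grew 12% year over year, driven by cloud services."
mystery_row = pd.DataFrame({
    "embedding": [model.encode([mystery_text])[0].tolist()],
    "text": [mystery_text],
    "label": [-1],  # sentinel label for the non-review row
})

# first 100 reviews plus the mystery row
df_small = pd.concat([df.head(100), mystery_row], ignore_index=True)

ds = px.Dataset(df_small, schema)
session = px.launch_app(primary=ds)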
Ragas
Ragas (Retrieval-Augmented Generation ASsessment) is an open-source library for evaluating RAG pipelines. I'll only cover RAG briefly here, but check out our page on RAG if you'd like to learn more.
While building these RAG pipelines is supported by many existing tools such as Llamaindex or Haystack, measuring their performance can be challenging. That's where Ragas comes into play. It offers tools to evaluate the text generated by LLMs, providing insights into how well your RAG pipeline is performing. Additionally, Ragas is built to integrate with CI/CD processes, allowing for regular performance checks to maintain and improve the quality of outcomes.
Let's look at how we can use ragas to evaluate the performance of our embedding model. We'll need to use a different dataset, but before we do that, let's add in our OpenAI key so Ragas can use it for specific metrics:
import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
% pip install ragas
from ragas.metrics import context_recall, context_precision
The two metrics we just imported - context precision and context recall - measure the quality of retrieval, which depends directly on our embedding model. The amnesty_qa dataset on Huggingface datasets is purpose-built for Ragas. It contains twenty rows with four pieces of data (columns) each: the question given to the LLM, a ground truth response, a response from the LLM, and the relevant context retrieved using our embedding model and vector database.
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
amnesty_qa
DatasetDict({
    eval: Dataset({
        features: ['question', 'ground_truth', 'answer', 'contexts'],
        num_rows: 20
    })
})
from ragas import evaluate

result = evaluate(
    amnesty_qa["eval"],
    metrics=[
        context_precision,
        context_recall,
    ],
)
result
We can do the same for the IMDb dataset by storing all the dataset vectors in a vector database, creating a couple of sample questions, and running through the same insertion and search process documented in an earlier blog.
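As a sketch, an IMDb evaluation set would take the same shape as amnesty_qa - a question, the retrieved contexts, the generated answer, and a hand-written ground truth - and can be passed straight to evaluate. The retrieval and answer-generation helpers below are hypothetical placeholders for your own vector database search and LLM call:

from datasets import Dataset

# retrieve_contexts() and generate_answer() are hypothetical helpers wrapping
# your vector database search and your LLM call, respectively
questions = ["Which reviews praise the film's cinematography?"]
records = {
    "question": questions,
    "contexts": [retrieve_contexts(q) for q in questions],  # list of lists of strings
    "answer": [generate_answer(q) for q in questions],
    "ground_truth": ["..."],  # hand-written reference answers
}

imdb_eval = Dataset.from_dict(records)
result = evaluate(imdb_eval, metrics=[context_precision, context_recall])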
Wrapping up
In this post, we looked at high-level strategies for evaluating embedding models. Choosing the right text embedding model is a strategic decision - you'll need to understand the data each model was trained on and how it was trained. By carefully considering the requirements of the underlying task you're trying to solve, in addition to using tools for visualizing embeddings (such as arize-phoenix) or evaluating retrieval (such as ragas), you can pick and optimize the embedding model that works best for your application.