Optimizing RAG with Rerankers: The Role and Trade-offs
Rerankers can enhance the accuracy and relevance of answers in RAG systems, but these benefits come with increased latency and computational costs.
Read the entire series
- Build AI Apps with Retrieval Augmented Generation (RAG)
- Mastering LLM Challenges: An Exploration of Retrieval Augmented Generation
- Key NLP technologies in Deep Learning
- How to Evaluate RAG Applications
- Optimizing RAG with Rerankers: The Role and Trade-offs
- Exploring the Frontier of Multimodal Retrieval-Augmented Generation (RAG)
- Enhancing ChatGPT with Milvus: Powering AI with Long-Term Memory
- How to Enhance the Performance of Your RAG Pipeline
- Enhancing ChatGPT with Milvus: Powering AI with Long-Term Memory
- Pandas DataFrame: Chunking and Vectorizing with Milvus
- How to build a Retrieval-Augmented Generation (RAG) system using Llama3, Ollama, DSPy, and Milvus
- A Guide to Chunking Strategies for Retrieval Augmented Generation (RAG)
- Improving Information Retrieval and RAG with Hypothetical Document Embeddings (HyDE)
- Building RAG with Milvus Lite, Llama3, and LlamaIndex
- Enhancing RAG with RA-DIT: A Fine-Tuning Approach to Minimize LLM Hallucinations
RAG Basics
Retrieval-augmented generation (RAG) is a rising AI stack that enhances the capabilities of large language models (LLMs) by supplying them with additional, up-to-date knowledge. A standard RAG application consists of four key technology components:
Embedding models for converting external documents and user queries into vector embeddings,
A vector database for storing these embeddings and retrieving the Top-K most relevant information,
Prompts-as-code that combines both the user query and the retrieved context,
An LLM for generating answers.
Some vector databases like Zilliz Cloud provide built-in embedding pipelines that automatically transform documents and user queries into embeddings, streamlining the RAG architecture.
A basic RAG setup effectively addresses the issue of LLMs producing unreliable or 'hallucinated' content by grounding the model's responses in domain-specific or proprietary contexts. However, some enterprise users require higher context relevancy in their production use cases and seek a more sophisticated RAG setup. One increasingly popular solution is integrating a reranker into the RAG system.
What is a Reranker?
A reranker is a crucial component in the information retrieval (IR) ecosystem that evaluates and reorders search results or passages to enhance their relevance to a specific query. In RAG, this tool builds on the primary vector Approximate Nearest Neighbor (ANN) search, improving search quality by more effectively determining the semantic relevance between documents and queries.
Rerankers fall into two main categories: Score-based and Neural Network-based.
Score-based rerankers work by aggregating multiple candidate lists from various sources, applying weighted scoring or Reciprocal Rank Fusion (RRF) to unify and reorder these candidates into a single, prioritized list based on their score or relative position in the original list. This type of reranker is known for its efficiency and is widely used in traditional search systems due to its lightweight nature.
On the other hand, Neural Network-based rerankers, often called cross-encoder rerankers, leverage a neural network to analyze the relevance between a query and a document. They are specifically designed to compute a similarity score that reflects the semantic proximity between the two, allowing for a refined rearrangement of results from single or multiple sources. This method ensures more semantic relatedness, thus providing useful search results and enhancing the overall effectiveness of the retrieval system.
How Does a Reranker Enhance Your RAG Apps?
Incorporating a reranker into your RAG applications can significantly improve the precision of the generated answers as the reranker narrows the context to a smaller, curated set of highly relevant documents. As the context is shorter, it is also easier for LLM to “attend” to everything in the context and avoid losing focus.
The architecture above shows that a reranker-enhanced RAG contains a two-stage retrieval system. In the first stage, a vector database retrieves a set of relevant documents from a larger dataset. Then, in the second stage, a reranker analyzes and prioritizes these documents based on their relevance to the query. Finally, the reranked results are sent to the LLM as a more relevant context for generating better-quality answers.
Nevertheless, integrating a reranker into the RAG framework comes with its challenges and expenses. Deciding whether to add a reranker requires thoroughly analyzing your business needs and balancing the improved retrieval quality against potential latency and computational cost increases.
The Hidden Price of Rerankers
Rerankers can enhance the relevance of retrieved information in a RAG system, but this approach comes at a price.
Rerankers Significantly Increase Retrieval Latency
In a basic RAG system without a reranker, vector representations for a document are pre-processed and stored before the user conducts a query. As a result, the system only needs to perform a relatively inexpensive Approximate Nearest Neighbor (ANN) vector search to identify the Top-K documents based on metrics like cosine similarity. A vector search only takes milliseconds in a production-grade vector database like Zilliz Cloud.
In contrast, a reranker, particularly a cross-encoder, must process the query and each candidate document through a deep neural network. Depending on the model's size and hardware specifications, this process can take significantly longer — from hundreds of milliseconds to seconds. A reranker-enhanced RAG would take that much time before even generating the first token.
Rerankers Are Computationally Expensive
While offline indexing in a basic RAG workflow incurs the computational cost of neural network processing once per document, reranking requires expending similar resources for every query against each potential document in the result set. This repetitive computational expense can become prohibitive in high-traffic IR settings like web and e-commerce search platforms.
Let's do some simple math to understand how costly a reranker is.
According to statistics from VectorDBBench, a vector database handling over 200 queries per second (QPS) incurs only a $100 monthly fee, which breaks down to $0.0000002 per query. In contrast, it can cost as much as $0.001 per query if using a reranker to reorder the top 100 results from the first-stage retrieved documents. That’s a 5,000x increase over the vector search cost alone.
Although a vector database might not always operate at maximum query capacity, and the reranking could be limited to a smaller number of results (e.g., top 10), the expense of employing a cross-encoder reranker is still orders of magnitude higher than that of a vector-search-only retrieval approach.
Viewed from another angle, using a reranker equates to incurring the indexing cost during query time. The inference cost is tied to the input size (tokens) and the model size. Given that both embedding and reranker models typically range from hundreds of MB to several GB, let's assume they are comparable. With an average of 1,000 tokens per document and a negligible 10 tokens per query, reranking the top 10 documents alongside the queries would equate to 10x the computation cost of indexing a single document. This means serving 1 million queries against a corpus of 10 million documents would consume as much computational power as indexing the entire corpus, which is impractical in high-traffic use cases.
Cost Analysis: Basic RAG < Reranker-enhanced RAG < Pure LLMs
While a reranker-enhanced RAG is more costly than a basic vector-search-based RAG, it is substantially more cost-effective than relying solely on LLMs to generate answers. Within a RAG framework, rerankers serve to sift through the preliminary results from vector search, discarding documents weakly relevant to the query. This process reduces time and costs by preventing LLMs from processing extraneous information.
Taking a real-world example: First-stage retrieval, such as vector search engines, quickly sift through millions of candidates to identify the top 20 most relevant documents. A reranker, though more costly, then recomputes the ordering and narrows the Top20 list down to 5. Ultimately, the most expensive large language models (LLMs) analyze these top 5 documents along with the query to craft a comprehensive and high-quality response, balancing latency and cost.
Thus, the reranker serves as a crucial intermediary between the efficient but approximate initial retrieval stage and the high-cost LLM inference phase.
When to Integrate a Reranker with Your RAG System
Incorporating a reranker into your Retrieval-Augmented Generation (RAG) setup is beneficial when high accuracy and relevance of answers are crucial, such as in specialized knowledge bases or customer service systems. In these settings, each query comes with high business value, and the enhanced precision in filtering information justifies the added cost and latency of reranking. The reranker refines the initial vector search results, providing a more relevant context for the Large Language Model (LLM) to generate accurate responses, thereby improving the overall user experience.
However, in high-traffic use cases like web or e-commerce search engines, where rapid response and cost efficiency are paramount, there may be better options than using a cross-encoder reranker. The increased computational demand and slower response times can negatively affect user experience and operational costs. In such cases, a basic RAG approach, relying on efficient vector search alone or with a lightweight score-based reranker, is preferred to balance speed and expense while maintaining acceptable levels of accuracy and relevance.
If you decide to apply a reranker to your RAG application, you can easily enable it within Zilliz Cloud Pipelines or use reranker models with Milvus.
Summary
Rerankers refine search results to improve the accuracy and relevance of answers in Retrieval-Augmented Generation (RAG) systems, proving valuable in scenarios where cost and latency can be flexible and precision is critical. However, their advantages are accompanied by trade-offs, including increased latency and computational costs, making them less suitable for high-traffic applications. Ultimately, the decision to integrate a reranker should be based on the specific needs of the RAG system, weighing the demand for high-quality responses against performance and cost constraints.
In my next blog post, I will delve into the intricacies of hybrid search architecture, exploring various technologies, including sparse embedding, knowledge graphs, rule-based retrieval, and innovative academic concepts awaiting production deployments, like ColBERT and CEPE. Additionally, I will employ a qualitative analysis to assess the improvements in result quality and the impact on latency when implementing rerankers. A comparative review of prevalent reranker models in the market, such as Cohere, BGE, and Voyage AI, will also be featured, providing insights into their performance and efficacy.
- RAG Basics
- What is a Reranker?
- How Does a Reranker Enhance Your RAG Apps?
- The Hidden Price of Rerankers
- Cost Analysis: Basic RAG < Reranker-enhanced RAG < Pure LLMs
- When to Integrate a Reranker with Your RAG System
- Summary
Content
Start Free, Scale Easily
Try the fully-managed vector database built for your GenAI applications.
Try Zilliz Cloud for Free