Advanced RAG Techniques: Bridging Text and Visuals for More Accurate Responses
Large language models (LLMs) have demonstrated exceptional capabilities in understanding and generating many types of content, both textual and visual, making them widely adopted across industries. Despite their versatility, LLMs often produce inaccurate or fabricated information, known as hallucinations, because their domain knowledge is limited and their training data goes out of date.
Retrieval-Augmented Generation (RAG) is a technique that addresses these challenges by combining LLMs' generative abilities with retrieval systems that fetch relevant information from external sources. By grounding LLM responses in real-world, up-to-date data, RAG improves the accuracy and contextual relevance of answers and enhances the system's overall efficiency. This combination of generation and retrieval drives innovation, transforms data processing, and makes LLMs more reliable in real-world applications.
At a recent Unstructured Data Meetup hosted by Zilliz, Yi Ding, formerly Head of TypeScript and Partnerships at LlamaIndex, shared insights on a new approach to RAG: Small to Slide. This method enhances the performance of multimodal LLMs when dealing with visual data like presentations or documents with images. In this blog, we’ll explore how RAG works, RAG challenges, and advanced RAG techniques like Small to Slide RAG.
Understanding RAG
RAG combines the power of LLMs with the ability to retrieve specific information from a knowledge base powered by vector databases. When a user asks a question, the RAG system searches the database for relevant information and uses this information to generate a more accurate response. Let’s take a look at how the RAG process works.
Figure: How RAG works
A RAG system consists of three key components: an embedding model, a vector database, and an LLM.
The embedding model converts documents into vector embeddings, which are stored in a vector database like Milvus.
When a user asks a question, the system transforms the query into a vector using the same embedding model.
The vector database then performs a similarity search to retrieve the most relevant information. This retrieved information is combined with the original question to form a "question with context," which is then sent to the LLM.
The LLM processes this enriched input to generate a more accurate and contextually relevant answer.
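To make these steps concrete, here is a minimal sketch of the full loop in Python. It assumes Milvus Lite (via pymilvus) as the vector store and the small all-MiniLM-L6-v2 sentence-transformers model for embeddings; the sample chunks are invented, and the final LLM call is left as a stub you can point at any chat-completion API.

```python
# Minimal RAG loop: index chunks, retrieve by similarity, build the prompt.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
client = MilvusClient("rag_demo.db")                 # local Milvus Lite file
client.create_collection(collection_name="docs", dimension=384)

# 1. Index: embed each chunk and store it alongside its raw text.
chunks = ["Chris Churilo is VP of Marketing at Zilliz.",
          "Milvus is an open-source vector database."]
client.insert("docs", [{"id": i, "vector": embedder.encode(t).tolist(), "text": t}
                       for i, t in enumerate(chunks)])

# 2. Retrieve: embed the query with the SAME model, then similarity-search.
question = "Where does Ms. Churilo work?"
hits = client.search("docs", data=[embedder.encode(question).tolist()],
                     limit=2, output_fields=["text"])

# 3. Generate: form the "question with context" prompt for the LLM.
context = "\n".join(hit["entity"]["text"] for hit in hits[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # send this to the LLM of your choice
```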
This method is particularly valuable for handling specialized knowledge or dynamic, frequently updated information, ensuring that AI systems provide precise and reliable responses.
For example, a large e-commerce platform could use this approach to power its product recommendation system. The system could process millions of product descriptions, customer reviews, and usage data and store them as vectors. When a user searches for a product, the system quickly retrieves the most relevant items, considering not just keywords but the semantic meaning of the query.
RAG Challenges and Advanced RAG Techniques
While RAG offers powerful capabilities, it also faces challenges that can impact its effectiveness, especially as data becomes large and complex.
Speed: Searching larger databases takes longer. This can be mitigated by leveraging hardware-accelerated computing, such as optimizing for GPUs or other high-performance systems.
Accuracy: Ensuring that we're retrieving the most relevant information is challenging. Developers need ways to balance performance and data accuracy based on their needs.
Multimodal Data: A basic RAG system struggles with non-text data like images or charts. Handling embeddings from different modalities becomes crucial in these cases.
In the following sections, we’ll explore several advanced RAG techniques—Small to Big RAG, Small to Slide RAG, and ColPali—that address these challenges, enhancing RAG systems' speed, accuracy, and versatility.
Small to Big RAG
Small to Big RAG is a method to enhance the accuracy of RAG systems. The core idea is to split documents into small, intent-focused chunks for precise search, but to retrieve the larger context sections those chunks belong to rather than just the exact matching pieces, so the LLM has enough material to generate meaningful and complete responses. While smaller chunks improve search precision by homing in on specific details, on their own they can miss critical context, leading to incomplete or inaccurate answers.
Let’s take a look at one example.
Consider the query: “Where does Ms. Churilo work?”
When this query is embedded, the model may overweight the distinctive name “Churilo” relative to the intent word “work.” As a result, the system could retrieve sentences that merely mention “Churilo” while missing the key fact: “Chris is currently VP of Marketing at Zilliz.”
Small to Big RAG addresses this limitation by retrieving the entire paragraph where her role is mentioned, giving the LLM the full context needed to deliver a correct and complete response.
Figure: Incorrect retrieval in traditional RAG
Small to Big RAG is particularly effective for handling complex or multi-layered text data. In a technical support system, for example, retrieving a single sentence with a relevant keyword might leave out surrounding details that are critical for understanding. Instead, Small to Big RAG retrieves the full sections containing the matches. For instance, when a user asks, “How do I reset my password?” the system could retrieve the paragraph with the direct answer and include additional context, such as security recommendations, making the response more comprehensive and useful.
By combining the precision of smaller embeddings with the contextual depth of broader retrieval, Small to Big RAG ensures a balance between accuracy and completeness. This approach enables AI systems to provide answers that are not only precise but also rich in context, improving their reliability in real-world applications.
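To see the mechanics in miniature, the sketch below indexes sentence-level chunks while keeping a pointer from each chunk back to its parent paragraph, reusing the sentence-transformers model from the earlier sketch; the paragraph text is invented, and in a real system the parent lookup would live in the vector database's metadata rather than a Python list.

```python
# Small to Big: match at the sentence level, return the parent paragraph.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

paragraphs = [
    "Chris Churilo has had a long career in marketing. "
    "Chris is currently VP of Marketing at Zilliz.",
]

# Index small chunks (sentences), each remembering its parent paragraph.
small_chunks, parent_of = [], []
for p_idx, para in enumerate(paragraphs):
    for sentence in para.split(". "):
        small_chunks.append(sentence)
        parent_of.append(p_idx)

chunk_vecs = embedder.encode(small_chunks, convert_to_tensor=True)
query_vec = embedder.encode("Where does Ms. Churilo work?", convert_to_tensor=True)

# Precise matching happens on the small chunks...
best = util.cos_sim(query_vec, chunk_vecs)[0].argmax().item()
# ...but the LLM receives the whole parent paragraph as context.
print(paragraphs[parent_of[best]])
```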
Small to Slide RAG
While Small to Big RAG excels at handling text-based data, it struggles to capture information embedded in visuals such as slides, charts, or graphs. Small to Slide RAG is an improved RAG method that solves this problem by embedding text from visual elements like slides but retrieving the entire slide image. This retrieved result is then provided to and processed by a multimodal LLM capable of analyzing both text and images.
Here's how it works:
Text Embedding: The system extracts and embeds text from slides.
Visual Retrieval: Instead of retrieving only text, it retrieves the entire slide image containing the relevant information.
Multimodal Processing: The retrieved slide images are fed into a multimodal LLM (e.g., GPT-4 with vision capabilities) that can process both the visual and textual components simultaneously.
This method is particularly useful for handling complex documents where crucial information is conveyed through visuals. For example, in quarterly financial reports, key insights are often found in graphs or charts that traditional text-based RAG systems might miss. By retrieving and analyzing the entire slide image, Small to Slide RAG ensures no critical detail is overlooked, providing a more accurate and comprehensive response.
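As a rough sketch of that flow, the snippet below assumes the OpenAI Python SDK with a vision-capable model; the slide path is a hypothetical stand-in for the top result of the vector search over embedded slide text.

```python
# Small to Slide: the search runs over slide TEXT, but the slide IMAGE is
# what gets sent to the multimodal LLM for answering.
import base64
from openai import OpenAI

def answer_from_slide(question: str, image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = OpenAI().chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model works here
        messages=[{"role": "user", "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return response.choices[0].message.content

# "slides/slide_12.png" stands in for whichever slide the search ranked highest.
print(answer_from_slide("Which SaaS platforms does NVIDIA AI Enterprise support?",
                        "slides/slide_12.png"))
```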
To demonstrate the power of Small to Slide RAG, Yi Ding walked through a practical example using an NVIDIA slide deck. You can check out the demo here.
Figure: A comparison between the context retrieved by a traditional RAG and a Small to Slide RAG
In this demo, Yi Ding compared the context retrieved by a traditional RAG system with Small to Slide RAG when answering the same question: "Which SaaS platforms does NVIDIA AI Enterprise support?"
Traditional RAG: The traditional text-based RAG system struggled to provide a complete answer. It relied solely on text extracted from the slides, missing critical information embedded in charts and diagrams. This limitation resulted in an incomplete and less informative response.
Small to Slide RAG: In contrast, Small to Slide RAG delivered far better results. The system retrieved the actual slide images containing the relevant information and fed them into a multimodal LLM (e.g., GPT-4 with vision capabilities). This allowed the model to interpret both the text and visual elements, generating a much more comprehensive and accurate answer.
This example illustrates the unique strength of Small to Slide RAG: its ability to handle complex documents where key information is spread across text and visuals, making it an invaluable tool for processing visually rich content.
For these RAG methods to function smoothly, they also require infrastructure, such as a scalable vector database, that can manage complex queries and retrieval operations.
ColPali: Pushing the Boundaries
At the end of his talk, Yi introduced an emerging technique called ColPali, a novel document retrieval model that pushes the boundaries of RAG. Unlike traditional methods that convert documents with visual data (like slides and PDFs) into text, ColPali works directly with the visual features of documents, enabling it to index and retrieve information without the error-prone step of text extraction.
How ColPali Works
While still in its early stages, ColPali changes the way we usually conduct document retrieval by focusing on visual data:
Documents are treated as images, bypassing text conversion entirely.
Multiple embeddings, similar to visual n-grams, are generated per image to capture fine-grained visual features.
Searches are conducted by comparing visual embeddings to the query, identifying the most relevant areas of the document.
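That last comparison step is the late-interaction (MaxSim) scoring popularized by ColBERT, which ColPali applies to image patches: each query-token embedding picks its best-matching patch embedding, and the per-token maxima are summed into a document score. A toy PyTorch version, with random tensors standing in for real ColPali embeddings, looks like this:

```python
# Late-interaction (MaxSim) scoring over image-patch embeddings.
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (num_query_tokens, dim); doc_emb: (num_patches, dim)."""
    sim = query_emb @ doc_emb.T          # (tokens, patches) dot products
    return sim.max(dim=1).values.sum()   # best patch per token, then sum

# Toy example: 4 query-token embeddings vs. two pages of 100 patches each.
query = torch.randn(4, 128)
pages = [torch.randn(100, 128), torch.randn(100, 128)]
scores = torch.stack([maxsim_score(query, page) for page in pages])
print(scores.argmax().item())  # index of the best-scoring page
```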
Figure: French document about oil production in Kazakhstan, with certain areas highlighted
Let’s look at the example shown in the image above: a French document about oil production in Kazakhstan. When asked about offshore oil production, ColPali highlights specific areas of the document—such as charts and diagrams—that are most relevant to the query. This ensures that both textual and visual elements are taken into account, offering a comprehensive understanding of the content.
Why ColPali is Exciting
ColPali offers several key advantages over traditional RAG systems:
By working directly with images, it avoids inaccuracies introduced by converting complex documents (e.g., PDFs) to text.
It captures visual information, such as layouts, diagrams, and formatting, that is often lost in text-based methods.
ColPali enables retrieval and search for visual data, opening possibilities in fields like technical design, legal documentation, and more.
A practical application of ColPali could be in architectural design reviews. Imagine an AI system analyzing blueprints and technical drawings directly. When queried about specific design elements, the system could highlight relevant parts of the drawings, incorporating both textual annotations and visual features.
ColPali Challenges and Limitations
While ColPali is promising, it is still in its early stages and comes with some challenges:
High Computational Demand: Generating and comparing multiple embeddings per image is computationally intensive, making GPU acceleration essential for practical use.
Limited Ecosystem: Compared to more established RAG techniques, there are fewer tools and resources available for implementing ColPali.
Language- and Domain-Specific Training: ColPali models currently rely on specific training datasets, which may limit how well they generalize across languages and domains.
Pros and Cons of Each RAG Technique
Each of the techniques we have discussed above offers distinct advantages and limitations.
Figure: Comparison of different RAG techniques
Text-based RAG is the simplest to implement, benefiting from extensive tooling support. However, it struggles with maintaining multimodal information and doesn't always handle chunking perfectly. On the other hand, Small to Slide improves the capture of multimodal data, though it demands a more intricate setup and works best with specific data types. Lastly, ColPali simplifies the process by avoiding the need for text conversion and integrating built-in attribution, but it requires GPUs and custom models, which may limit accessibility for some projects.
Implementing Advanced RAG: Practical Considerations
If you're considering implementing these advanced RAG techniques in your own projects, here are some points to keep in mind.
Data Assessment
Look at the types of data you're working with. If you have many documents with charts, images, or complex layouts, approaches like Small to Slide RAG might be particularly beneficial. For example, if you're building a system to analyze financial reports, which often contain a mix of text, tables, and charts, Small to Slide RAG could provide more comprehensive insights than a text-only approach.
Computational Resources
Methods like Small to Slide RAG and ColPali can be more computationally intensive than traditional RAG. Ensure you have the necessary resources, particularly when working with large datasets.
Model Selection
These advanced techniques often require specific types of models. For Small to Slide RAG, you'll need a multimodal LLM capable of processing text and images. Make sure you choose a model that fits your needs and that you have the resources to run it.
Vector Database Capabilities
Choosing the right vector database is crucial for implementing a RAG system tailored to your needs. For example, if you're implementing a Small to Big RAG, you'll need a vector database capable of efficiently retrieving larger chunks of text using embeddings generated from smaller chunks. Scalability becomes key if your application deals with massive amounts of vector data, such as billion-scale datasets. Solutions like Milvus and its managed service, Zilliz Cloud, are designed for these scenarios, offering highly scalable and efficient vector retrieval.
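As a small illustration of that requirement, the snippet below stores a parent_id next to each small-chunk vector in Milvus so the matched parents can be fetched after the search; the collection name, field names, and dummy vectors are all illustrative, and the extra field relies on the dynamic fields enabled by Milvus's quick-setup collections.

```python
# Store a pointer from each small-chunk vector to its parent section.
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")
client.create_collection(collection_name="small_chunks", dimension=384)

# Each entity: the small chunk's vector plus a pointer to its parent section.
client.insert("small_chunks", [
    {"id": 0, "vector": [0.1] * 384, "parent_id": 7},
])

hits = client.search("small_chunks", data=[[0.1] * 384],
                     limit=3, output_fields=["parent_id"])
parent_ids = {hit["entity"]["parent_id"] for hit in hits[0]}
# Now fetch the full parent sections (e.g., from a document store) by these ids.
print(parent_ids)
```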
Evaluation Metrics
Develop clear metrics to evaluate the performance of your RAG system, including the relevance of retrieved information, the accuracy of generated responses, and the speed of retrieval and generation.
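Two simple starting points are hit rate (did the relevant document appear in the top-k results?) and mean reciprocal rank (how high up did it appear?). The sketch below assumes an evaluation set where each query has a single known relevant document id; the example ids are made up.

```python
# Hit rate and MRR over a labeled evaluation set.
def hit_rate(results: list[list[int]], relevant: list[int], k: int = 5) -> float:
    """Fraction of queries whose relevant doc shows up in the top-k results."""
    return sum(rel in res[:k] for res, rel in zip(results, relevant)) / len(relevant)

def mrr(results: list[list[int]], relevant: list[int]) -> float:
    """Mean reciprocal rank of the relevant doc (0 if it was not retrieved)."""
    ranks = [1.0 / (res.index(rel) + 1) if rel in res else 0.0
             for res, rel in zip(results, relevant)]
    return sum(ranks) / len(ranks)

# Three queries: retrieved doc ids per query vs. the known relevant id.
retrieved = [[3, 1, 4], [2, 7, 9], [8, 6, 0]]
gold = [1, 2, 5]
print(hit_rate(retrieved, gold, k=3), mrr(retrieved, gold))  # ~0.667, 0.5
```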
Iterative Development
RAG is evolving rapidly. Plan for an iterative development process that allows you to easily test and compare different approaches. Start with a basic RAG implementation, then gradually introduce more advanced techniques like Small to Big RAG or Small to Slide RAG. Continuously evaluate performance and user feedback to guide your development process.
Conclusion
RAG techniques continue to evolve, offering different approaches like text-based RAG, Small to Slide RAG, and ColPali to address varying needs. While text-based RAG provides a solid starting point for general applications, more specialized cases, especially those handling visual data, can benefit from advanced methods like Small to Slide RAG or ColPali.
Choosing the right method depends on your data and the resources at hand. Each approach provides a way to improve the efficiency and accuracy of AI systems, enabling a better balance between complexity, speed, and cost. By aligning your approach with your specific requirements, you can leverage RAG to build systems that offer users the right information at the right time in the right format.