Build AI Apps with Retrieval Augmented Generation (RAG)
A comprehensive guide to Retrieval Augmented Generation (RAG), including its definition, workflow, benefits, use cases, and challenges.
What is RAG?
So, what's the deal with Retrieval Augmented Generation, or RAG? It acts as a framework for Large Language Models (LLMs). To explain "RAG," let's first look at the "G." The "G" in "RAG" is generation: the LLM generates text in response to a user query, referred to as a prompt. Unfortunately, the model will sometimes generate a less-than-desirable response.
Question: What year did the first human land on Mars?
Incorrect Answer (Hallucinated): The first human landed on Mars in 2025.
In this example, the language model has provided a fictional answer since, as of 2024, humans have not landed on Mars! The model may generate responses based on learned patterns from training data. If it encounters a question about an event that hasn't happened, it might still attempt to provide an answer, leading to inaccuracies or hallucinations.
This answer also lacks a cited source, so you can't have much confidence in where it came from. In addition, more often than not, the answer is out of date. In our case, the LLM hasn't been trained on the recent data NASA released about its preparations to get humans to Mars. In any event, it's crucial to be aware of and address such issues when relying on language models for information.
Here are some of the problems that we face with the response generated:
- No source listed, so you don’t have much confidence in where this answer came from
- Out-of-date information
- Answers can be made up based on the data the LLM has been trained on; we refer to this as an AI hallucination
- Relevant content may not be available on the public internet, where most LLMs get their training data
When I look up Humans on Mars on the NASA website, I can see plenty of information about how NASA is preparing humans to explore Mars. Digging further into the site, you can see that a mission began in June 2023 to start a 378-day Mars surface simulation. Eventually, this mission will end, so the information about humans on Mars will keep changing. With this, I have now grounded my answer in something more believable: I have a source (the NASA website), and I have not hallucinated the answer like the LLM did.
So, what is the point of using an LLM if it is this problematic? This is where the “RA” portion of “RAG” comes in. Retrieval Augmented means that instead of relying solely on what the LLM was trained on, we retrieve the correct information along with its sources, provide it to the LLM, and ask it to generate a summary that lists those sources. This way, we keep the LLM from hallucinating the answer.
We do this by putting our content (documents, PDFs, etc.) in a data store such as a vector database. We build a chatbot interface for our users to interact with instead of having them query the LLM directly, and we create vector embeddings of our content and store them in the vector database. When a user asks our chatbot a question, the application converts the question into a vector embedding and runs a semantic similarity search against the data stored in the vector database. Once armed with this retrieved context, our chatbot app sends it, along with the sources, to the LLM and asks it to generate a summary that answers the user's question using only the data provided and citing the sources as evidence.
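To make this flow concrete, here is a minimal sketch of the retrieve-then-generate loop in Python. It is only an illustration of the idea, not a production setup: the embed() function is a stand-in for a real embedding model, the in-memory list stands in for a vector database such as Milvus, and the final LLM call is left as a comment.

```python
# Minimal sketch of the retrieve-then-generate flow described above.
# embed() is a placeholder for a real embedding model, and the `documents`
# list stands in for a vector database; both are assumptions for illustration.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding (no real semantics); replace with a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

# "Index" our content: one embedding per chunk, plus its source.
documents = [
    {"text": "NASA began a 378-day Mars surface simulation in June 2023.",
     "source": "nasa.gov"},
    {"text": "The travel policy allows remote work for up to 10 days.",
     "source": "hr-handbook.pdf"},
]
for doc in documents:
    doc["vector"] = embed(doc["text"])

def retrieve(question: str, k: int = 1) -> list[dict]:
    """Semantic similarity search: rank chunks by cosine similarity to the query."""
    q = embed(question)
    return sorted(documents, key=lambda d: float(q @ d["vector"]), reverse=True)[:k]

question = "Has a Mars surface simulation started?"
top = retrieve(question)[0]
prompt = (
    "Answer the question using ONLY the context below and cite the source.\n"
    f"Context: {top['text']} (source: {top['source']})\n"
    f"Question: {question}"
)
# In a real app, send `prompt` to your LLM of choice and return its summary.
print(prompt)
```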
Hopefully, you can see how RAG helps LLMs overcome the challenges mentioned above. First, instead of relying on incorrect or outdated training data, we provide a data store of correct data from which the application retrieves the relevant information and sends it to the LLM with strict instructions to use only that data and the original question to formulate the response. Second, we can instruct the LLM to cite the data source as evidence. We can even take it a step further and require the LLM to respond with “I don’t know” if the question can’t be reliably answered from the data stored in the vector database.
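In practice, these guardrails are expressed in the prompt. The template below is just one way to phrase such instructions; the wording and the `context`/`question` placeholders are illustrative assumptions, not a required format.

```python
# Illustrative guardrail prompt; wording and placeholder names are assumptions.
GUARDRAIL_PROMPT = """You are a helpful assistant.
Answer the user's question using ONLY the context below.
Cite the source of every fact you use.
If the context does not contain enough information to answer reliably,
reply exactly with: "I don't know."

Context:
{context}

Question:
{question}
"""

def build_prompt(context: str, question: str) -> str:
    """Fill the template with retrieved context and the user's question."""
    return GUARDRAIL_PROMPT.format(context=context, question=question)
```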
How Does RAG Work?
RAG vs. Fine tuning a Model
Beginning with RAG is a suitable entry point: it is simple and may well be adequate for your applications. Sophisticated prompt engineering will enhance the responses even more. In contrast, fine-tuning serves a distinct purpose, mainly when the goal is to modify the behavior of the language model or adapt it to comprehend a different "language." These approaches are complementary rather than mutually exclusive. A strategic approach combines fine-tuning, to improve the model's grasp of domain-specific language and desired output format, with RAG, to elevate response quality and relevance.
Addressing RAG Challenges Head-On
LLMs in the Dark about Your Data
Traditional LLMs are trained only on data available up to their training cutoff. This cutoff renders the model's knowledge static, making it prone to responding incorrectly or providing outdated information when faced with queries outside its training scope.
AI Applications Demand Custom Data for Effectiveness
To ensure LLMs deliver relevant and specific responses, organizations need models that understand their domain and provide answers derived from their own data. This is crucial for applications such as customer support bots or internal Q&A bots, which must furnish company-specific answers without extensive retraining.
Retrieval Augmentation as an Industry Standard
RAG has become standard industry practice in a remarkably short time. By including relevant retrieved data as part of the prompt, RAG connects otherwise static LLMs with real-time data retrieval, overcoming the limitations imposed by static training data.
Retrieval Augmented Generation Use Cases
Below are the most popular RAG use cases:
Question and Answer Chatbots: Automating customer support and resolving queries by deriving accurate answers from company documents and knowledge bases.
Search Augmentation: Enhancing search engines with LLM-generated answers to improve informational query responses and facilitate easier information retrieval.
Knowledge Engine for Internal Queries: Enabling employees to ask questions about company data, such as HR or finance policies or compliance documents.
Benefits of RAG
Up-to-date and Accurate Responses: RAG ensures LLM responses are based on current external data sources, mitigating the reliance on static training data.
Reduced Inaccuracies and Hallucinations: By grounding LLM output in relevant external knowledge, RAG minimizes the risk of providing incorrect or fabricated information, offering outputs with verifiable citations.
Domain-Specific, Relevant Responses: Leveraging RAG allows LLMs to provide contextually relevant responses tailored to an organization's proprietary or domain-specific data.
Efficient and Cost-Effective: RAG is simple and cost-effective compared to other customization approaches, enabling organizations to deploy it without extensive model customization.
Reference Architecture for RAG Applications
Data Preparation: Gather document data, preprocess it, and chunk it into suitable lengths based on the embedding model and downstream LLM application (a brief ingestion sketch follows this list).
Index Relevant Data: Produce document embeddings and generate a Vector Search index with this data. Vector databases will automatically create the index for you and provide a whole host of data management capabilities.
Retrieve Relevant Data: Retrieve data relevant to a user's query and provide it as part of the prompt used for the summary generation by the LLM.
Build AI Applications: Wrap prompts with augmented content and LLM querying components into an endpoint, exposing it to applications such as Q&A chatbots through a REST API.
Evaluations: Consistently evaluate how effectively responses address queries. Ground-truth metrics compare RAG responses against established answers, while metrics like the RAG Triad assess the relevance between query, context, and response. Other LLM response metrics consider qualities such as friendliness, harmfulness, and conciseness.
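To sketch the Data Preparation and Index steps, here is one possible ingestion flow in Python using pymilvus with a local Milvus Lite file. The fixed-size chunker, the fake_embed() placeholder, the collection name, and the source file are all assumptions for illustration; in practice you would plug in a real embedding model and your own documents.

```python
# Sketch of the Data Preparation and Index steps (assumes `pip install pymilvus`
# with Milvus Lite available; fake_embed() is a placeholder embedding model).
import numpy as np
from pymilvus import MilvusClient

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with overlap; tune for your embedding model and LLM."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def fake_embed(text: str, dim: int = 384) -> list[float]:
    """Placeholder embedding (no real semantics); swap in a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return (v / np.linalg.norm(v)).tolist()

client = MilvusClient("rag_demo.db")              # local Milvus Lite file (assumed setup)
client.create_collection(collection_name="docs", dimension=384)

raw_document = open("policy.txt").read()          # hypothetical source document
rows = [
    {"id": i, "vector": fake_embed(c), "text": c, "source": "policy.txt"}
    for i, c in enumerate(chunk(raw_document))
]
client.insert(collection_name="docs", data=rows)  # index the chunks
```

At query time, the Retrieve step embeds the user's question the same way and runs a similarity search against this collection to fetch the most relevant chunks, as in the earlier retrieve-then-generate sketch.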
Key Elements of RAG Architecture
Vector Database: AI applications benefit from vector databases for fast similarity searches, ensuring access to up-to-date information.
Prompt Engineering: Sophisticated instructions to the LLM to use only the provided content to generate the response.
ETL Pipeline: An ingestion pipeline that handles duplicate data, upserts, and any transformations (text splitting, metadata extraction, etc.) required before the data is stored in the vector database.
LLM: The model that generates the final response; many are available, both closed and open source.
Semantic Cache: A semantic cache, such as GPTCache, stores LLM responses keyed by query meaning to reduce spend and improve performance (a minimal illustration follows this list).
RAG Tools: Third-party tooling (LangChain, LlamaIndex, Semantic Kernel, etc.) helps build RAG applications and is often LLM-agnostic.
Evaluation Tools & Metrics: Metrics, LLMs, and third-party tools to help with evaluations (TruLens, DeepEval, LangSmith, Phoenix, etc.)
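To illustrate the semantic cache idea, here is a hand-rolled sketch rather than the GPTCache API: before calling the LLM, compare the query embedding with previously answered queries and reuse the stored response when one is similar enough. The embed and call_llm parameters and the 0.9 threshold are assumptions.

```python
# Hand-rolled semantic cache sketch (concept only, not the GPTCache API).
import numpy as np

SIMILARITY_THRESHOLD = 0.9                    # assumed cutoff; tune for your embeddings
_cache: list[tuple[np.ndarray, str]] = []     # (query embedding, cached response)

def cached_answer(question: str, embed, call_llm) -> str:
    """Reuse a cached response for semantically similar questions;
    otherwise call the LLM and cache the new result."""
    q = embed(question)                       # assumes embed() returns a unit vector
    for vec, response in _cache:
        if float(q @ vec) >= SIMILARITY_THRESHOLD:
            return response                   # cache hit: skip the LLM call
    response = call_llm(question)             # cache miss: pay for one LLM call
    _cache.append((q, response))
    return response
```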
Navigating the RAG Architectural Landscape
Across the AI industry, companies see Retrieval Augmented Generation as a game-changer, not just another tool. It seamlessly blends LLMs with custom data, delivering responses that are accurate, current, and industry-specific. RAG points AI toward a future where accuracy meets flexibility, and today's language models become tomorrow's smart conversationalists. The journey has just begun, and with RAG at the helm, the possibilities are boundless.