Kickstart Your Local RAG Setup: A Beginner's Guide to Using Llama 3 with LangChain, Ollama, and Milvus
With the rise of open-source Large Language Models (LLMs) like Llama 3, Mistral, and Gemma, it has become apparent that LLMs can be genuinely useful when run locally. This approach is not only practical but often essential, as costs can skyrocket when scaling up with commercial models like GPT-3 or GPT-4.
In this hands-on guide, we will see how to deploy a Retrieval Augmented Generation (RAG) application to create a question-answering (Q&A) chatbot that can answer questions about specific information. The setup uses Ollama to run Llama 3 for text generation, with Milvus as the vector store.
The tools we will use to build this Retrieval Augmented Generation (RAG) setup include:
Ollama: Ollama is an open-source tool that allows the management of Llama 3 on local machines. It brings the power of LLMs to your laptop, simplifying local operation.
LangChain: LangChain is a framework that simplifies the development of LLM-powered applications. It is what we use to build our Q&A chain and interact with our data.
Milvus: Milvus is the vector database we use to store and retrieve the relevant data efficiently.
Llama 3: Llama 3 is an open-source Large Language Model developed by Meta and the latest iteration in its lineup of LLMs. It defines a dedicated prompt format for interactions and supports multiple roles, including 'system', 'user', and 'assistant'.
Q&A with RAG
We will build a sophisticated question-answering (Q&A) chatbot using RAG (Retrieval Augmented Generation). This will allow us to answer questions about specific information.
What exactly is retrieval augmented generation?
RAG, or Retrieval Augmented Generation, is a technique that enhances LLMs by integrating additional data sources. A typical RAG application involves:
Indexing - a pipeline for ingesting data from a source and indexing it, which usually consists of loading, splitting, and storing the data in Milvus. The data is stored in Milvus as vector embeddings. Embeddings capture the essence of the content, allowing for more relevant search results compared to keyword searches.
Retrieval and generation - RAG systems improve the quality of responses by retrieving relevant context from a vector database and passing it to an LLM before generating answers. When grounded in relevant context, the LLM is far less likely to hallucinate. More specifically, at runtime, RAG processes the user's query, fetches relevant data from the index stored in Milvus, and the LLM generates a response based on this enriched context.
This guide is designed to be practical and hands-on, showing you how local LLMs can be used to set up a RAG application. It's not just for experts; even beginners can dive in and start building their own Q&A chatbot. Let's get started!
Prerequisites
Before starting to set up the different components of our tutorial, make sure your system has the following:
- Docker & Docker-Compose - Ensure Docker and Docker-Compose are installed on your system.
- Milvus Standalone - For our purposes, we'll use Milvus Standalone, which is easy to manage via Docker Compose; check out how to install it in our documentation.
- Ollama - Install Ollama on your system; visit their website for the latest installation guide.
LangChain Setup
Once you've installed all the prerequisites, you're ready to set up your RAG application:
- Start a Milvus Standalone instance with:
docker-compose up -d
- This command starts your Milvus instance in detached mode, running quietly in the background.
- Fetch an LLM model via:
ollama pull <name_of_model>
- View the list of available models via their library
- e.g.
ollama pull llama3
- This command downloads the default (usually the latest and smallest) version of the model.
- To chat directly with a model from the command line, use
ollama run <name-of-model>
Install dependencies for vector store
To run this application, you need to install a few Python libraries. You can either use Poetry if you use the code on GitHub directly, or install them with pip if you prefer:
pip install langchain pymilvus ollama pypdf langchainhub langchain-community langchain-experimental
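With the dependencies installed, you can optionally sanity-check that both Milvus and Ollama are reachable before building the pipeline. The following is a minimal sketch, assuming Milvus Standalone is listening on its default port (19530) and that you have already pulled the llama3 model:
# Optional sanity check: confirm Milvus and Ollama are reachable
from pymilvus import connections, utility
import ollama

# Connect to the local Milvus Standalone instance (default host/port assumed)
connections.connect(host="localhost", port="19530")
print("Milvus server version:", utility.get_server_version())

# Ask the locally served Llama 3 model for a quick reply via the ollama client
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response["message"]["content"])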
Building the RAG Application
As mentioned earlier, one of the main components of RAG is indexing the data.
- Start by ingesting the data from your PDF using PyPDFLoader:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(
    "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001813756/975b3e9b-268e-4798-a9e4-2a9a7c92dc10.pdf"
)
data = loader.load()
- Splitting the data
Break down the loaded data into manageable chunks using the RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
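Before embedding anything, it can be helpful to peek at what the splitter produced; a quick, optional check might look like this:
# Optional: inspect how the document was chunked
print(f"Loaded {len(data)} pages and produced {len(all_splits)} chunks")
print(all_splits[0].page_content[:200])  # preview the first chunk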
- Getting the Embeddings and storing the data in Milvus
Next, convert the text data into vector embeddings using Jina AI's Small English embeddings, and store them in Milvus.
from langchain_community.embeddings.jina import JinaEmbeddings
from langchain.vectorstores.milvus import Milvus
# JINA_AI_API_KEY should be set to your Jina AI API key beforehand
embeddings = JinaEmbeddings(
    jina_api_key=JINA_AI_API_KEY, model_name="jina-embeddings-v2-small-en"
)
vector_store = Milvus.from_documents(documents=all_splits, embedding=embeddings)
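By default, the LangChain Milvus wrapper connects to a Milvus instance on localhost:19530, which matches the Milvus Standalone setup above. If your instance runs elsewhere, or you want to name the collection explicitly, you can pass the details yourself; here is a sketch (the host, port, and collection name are illustrative values, not requirements):
# Explicitly point the vector store at your Milvus instance (values are examples)
vector_store = Milvus.from_documents(
    documents=all_splits,
    embedding=embeddings,
    collection_name="rag_demo",  # illustrative collection name
    connection_args={"host": "localhost", "port": "19530"},
)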
- Load your LLM
Ollama makes it easy to load and use an LLM locally. In our example, we will use Llama 3 by Meta; here is how to load it:
from langchain_community.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = Ollama(
    model="llama3",
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    stop=["<|eot_id|>"],
)
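Before wiring the model into a QA chain, you can give it a quick standalone test; with the streaming callback configured above, the tokens print to stdout as they are generated:
# Quick standalone test of the locally served Llama 3 model
llm.invoke("In one sentence, what is a vector database?")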
- Build your QA chain with LangChain
Finally, construct your QA chain to process and respond to user queries:
from langchain import hub
from langchain.chains import RetrievalQA
query = input("\nQuery: ")
prompt = hub.pull("rlm/rag-prompt")
qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=vector_store.as_retriever(), chain_type_kwargs={"prompt": prompt}
)
result = qa_chain({"query": query})
print(result)
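If you also want to see which chunks were retrieved from Milvus for a given answer, RetrievalQA can return them alongside the result; a small variation on the chain above (the extra flag and the printed fields are optional):
# Variant: also return the retrieved chunks for inspection
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vector_store.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)
result = qa_chain({"query": query})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata)  # e.g. source file and page number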
Run your application
If you are working in a notebook, execute the last cell containing the result variable. Otherwise, run your RAG application as a script:
python rag_ollama.py
Example of a Q&A interaction (the answer appears twice because the streaming callback prints tokens as they are generated, and print(result) then outputs the full result dictionary):
Query: What is this document about?
The document appears to be a 104 Cover Page Interactive Data File for an SEC filing. It contains information about the company's financial statements and certifications.
{'query': 'What is this document about?', 'result': "The document appears to be a 104 Cover Page Interactive Data File for an SEC filing. It contains information about the company's financial statements and certifications."}
And there you have it! You've just set up a local RAG application using Ollama with Llama 3, LangChain, and Milvus. This setup not only makes it feasible to handle large datasets efficiently but also enables a highly responsive local question-answering system.
Feel free to check out Milvus and the code on GitHub, and share your experiences with the community by joining our Discord.
Recap
This guide provided a walkthrough for setting up a Retrieval Augmented Generation (RAG) application using local Large Language Models (LLMs). By leveraging tools like Ollama, Llama 3, LangChain, and Milvus, we demonstrated how to create a powerful question-answering (Q&A) chatbot capable of handling specific information queries with retrieved context from a vector store.
Key takeaways include:
RAG Overview: RAG enhances LLM capabilities by integrating external data sources. It involves indexing data into vector embeddings using Milvus and retrieving relevant context during query processing to generate accurate and informed responses.
Tooling Highlights:
Ollama simplifies managing and running Llama 3 models locally.
LangChain provides an intuitive framework for developing LLM-based applications.
Milvus, as the vector store, efficiently stores and retrieves vectorized data, enabling precise query handling.
Llama 3, developed by Meta, supports advanced functionality with features like multi-role interactions and customizable system prompts.
The setup process covered essential steps:
Prerequisites: Installing Docker, Milvus Standalone, Ollama, and the required Python libraries.
Indexing Data: Using tools like PyPDFLoader and RecursiveCharacterTextSplitter to load, split, and vectorize data.
Embedding Storage: Converting text into embeddings with a Jina AI model and storing them in the Milvus vector store.
Building a QA Chain: Integrating the vector store with Llama 3 through LangChain, configuring a tailored system prompt to process and respond to user queries.
By running the application, users can interact with their local chatbot to retrieve accurate, contextually relevant answers from specific datasets. Once relevant documents are retrieved, they are passed to the LLM for answer generation, an approach that has proven effective. The guide emphasized practicality, ensuring accessibility for both beginners and experts. Keep in mind that, for optimal retrieval, parameters may need to be fine-tuned to the specifics of your application.
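One common knob, for example, is how many chunks the retriever passes to the LLM for each query; a quick sketch (the value of k is illustrative):
# Retrieve a different number of chunks per query; k=3 is an illustrative value
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=retriever, chain_type_kwargs={"prompt": prompt}
)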
With the growing prominence of open-source LLMs, this setup highlights how localized implementations reduce costs, enhance security, and offer scalable solutions for information retrieval and generation. Explore the GitHub repository and join our Discord community to share your experience and insights!