Kickstart Your Local RAG Setup: A Beginner's Guide to Using Llama 3 with LangChain, Ollama, and Milvus
With the rise of open-source Large Language Models (LLMs) like Llama 3, Mistral, and Gemma, it has become apparent that LLMs can be genuinely useful when run locally. This approach is not only practical but often essential, as costs can skyrocket when scaling up with commercial models like GPT-3 or GPT-4.
In this hands-on guide, we will see how to deploy a Retrieval Augmented Generation (RAG) application to create a question-answering (Q&A) chatbot that can answer questions about specific information. The setup uses Ollama to run Llama 3 for text generation, with Milvus as the vector store.
The tools we will use to build this Retrieval Augmented Generation (RAG) setup include:
Ollama: Ollama is an open-source tool that allows the management of Llama 3 on local machines. It brings the power of LLMs to your laptop, simplifying local operation.
LangChain: LangChain is a framework that simplifies the development of LLM-powered applications. It is what we use to build our Q&A chain and interact with our data.
Milvus: Milvus is the vector database we use to store and retrieve the relevant data efficiently.
Llama 3: Llama 3 is an open-source Large Language Model developed by Meta and the latest iteration in its lineup of LLMs. It defines a dedicated prompt format for interactions and supports multiple roles, including 'system', 'user', and 'assistant'.
Q&A with RAG
We will build a sophisticated question-answering (Q&A) chatbot using RAG (Retrieval Augmented Generation). This will allow us to answer questions about specific information.
What exactly is retrieval augmented generation?
RAG, or Retrieval Augmented Generation, is a technique that enhances LLMs by integrating additional data sources. A typical RAG application involves:
Indexing - a pipeline for ingesting data from a source and indexing it, which usually consists of loading, splitting, and storing the data in Milvus. The data is stored in Milvus as vector embeddings. Embeddings capture the essence of the content, allowing for more relevant search results compared to keyword searches.
Retrieval and generation - RAG systems improve the quality of responses by retrieving relevant context from a vector database and passing it to an LLM before generating answers. When grounded in relevant context, the LLM is far less likely to hallucinate. More specifically, at runtime, RAG processes the user's query, fetches relevant data from the index stored in Milvus, and the LLM generates a response based on this enriched context.
This guide is designed to be practical and hands-on, showing you how local LLMs can be used to set up a RAG application. It's not just for experts; even beginners can dive in and start building their own Q&A chatbot. Let's get started!
Prerequisites
Before starting to set up the different components of our tutorial, make sure your system has the following:
- Docker & Docker-Compose - Ensure Docker and Docker-Compose are installed on your system.
- Milvus Standalone - For our purposes, we'll use Milvus Standalone, which is easy to manage via Docker Compose; check out how to install it in our documentation.
- Ollama - Install Ollama on your system; visit their website for the latest installation guide.
LangChain Setup
Once you've installed all the prerequisites, you're ready to set up your RAG application:
- Start a Milvus Standalone instance with:
docker-compose up -d
- This command starts your Milvus instance in detached mode, running quietly in the background.
- Fetch an LLM model via:
ollama pull <name_of_model>
- View the list of available models via their library
- e.g.
ollama pull llama3
- This command downloads the default (usually the latest and smallest) version of the model.
- To chat directly with a model from the command line, use
ollama run <name-of-model>
Install dependencies for vector store
To run this application, you need to install a few Python libraries. You can either use Poetry if you use the code on GitHub directly, or install them with pip if you prefer:
pip install langchain pymilvus ollama pypdf langchainhub langchain-community langchain-experimental
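With the dependencies installed, you can optionally sanity-check that both Milvus and Ollama are reachable before building the pipeline. The following is a minimal sketch, assuming Milvus Standalone is listening on its default port (19530) and that you have already pulled the llama3 model:
# Optional sanity check: confirm Milvus and Ollama are reachable
from pymilvus import connections, utility
import ollama

# Connect to the local Milvus Standalone instance (default host/port assumed)
connections.connect(host="localhost", port="19530")
print("Milvus server version:", utility.get_server_version())

# Ask the locally served Llama 3 model for a quick reply via the ollama client
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response["message"]["content"])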
Building the RAG Application
As mentioned earlier, one of the main components of RAG is indexing the data.
- Start by ingesting the data from your PDF using PyPDFLoader:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(
    "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001813756/975b3e9b-268e-4798-a9e4-2a9a7c92dc10.pdf"
)
data = loader.load()
- Splitting the data
Break down the loaded data into manageable chunks using the RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
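Before embedding anything, it can be helpful to peek at what the splitter produced; a quick, optional check might look like this:
# Optional: inspect how the document was chunked
print(f"Loaded {len(data)} pages and produced {len(all_splits)} chunks")
print(all_splits[0].page_content[:200])  # preview the first chunk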
- Getting the Embeddings and storing the data in Milvus
Next, convert the text data into vector embeddings using Jina AI's Small English embeddings, and store them in Milvus.
from langchain_community.embeddings.jina import JinaEmbeddings
from langchain.vectorstores.milvus import Milvus
# JINA_AI_API_KEY should be set to your Jina AI API key beforehand
embeddings = JinaEmbeddings(
    jina_api_key=JINA_AI_API_KEY, model_name="jina-embeddings-v2-small-en"
)
vector_store = Milvus.from_documents(documents=all_splits, embedding=embeddings)
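By default, the LangChain Milvus wrapper connects to a Milvus instance on localhost:19530, which matches the Milvus Standalone setup above. If your instance runs elsewhere, or you want to name the collection explicitly, you can pass the details yourself; here is a sketch (the host, port, and collection name are illustrative values, not requirements):
# Explicitly point the vector store at your Milvus instance (values are examples)
vector_store = Milvus.from_documents(
    documents=all_splits,
    embedding=embeddings,
    collection_name="rag_demo",  # illustrative collection name
    connection_args={"host": "localhost", "port": "19530"},
)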
- Load your LLM
Ollama makes it easy to load and use an LLM locally. In our example, we will use Llama 3 by Meta; here is how to load it:
from langchain_community.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = Ollama(
    model="llama3",
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    stop=["<|eot_id|>"],
)
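Before wiring the model into a QA chain, you can give it a quick standalone test; with the streaming callback configured above, the tokens print to stdout as they are generated:
# Quick standalone test of the locally served Llama 3 model
llm.invoke("In one sentence, what is a vector database?")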
- Build your QA chain with LangChain
Finally, construct your QA chain to process and respond to user queries:
from langchain import hub
from langchain.chains import RetrievalQA
query = input("\nQuery: ")
prompt = hub.pull("rlm/rag-prompt")
qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=vector_store.as_retriever(), chain_type_kwargs={"prompt": prompt}
)
result = qa_chain({"query": query})
print(result)
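If you also want to see which chunks were retrieved from Milvus for a given answer, RetrievalQA can return them alongside the result; a small variation on the chain above (the extra flag and the printed fields are optional):
# Variant: also return the retrieved chunks for inspection
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vector_store.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)
result = qa_chain({"query": query})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata)  # e.g. source file and page number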
Run your application
If you are working in a notebook, execute the last cell containing the result variable. Otherwise, run your RAG application as a script:
python rag_ollama.py
Example of a Q&A interaction (the answer appears twice because the streaming callback prints tokens as they are generated, and print(result) then outputs the full result dictionary):
Query: What is this document about?
The document appears to be a 104 Cover Page Interactive Data File for an SEC filing. It contains information about the company's financial statements and certifications.
{'query': 'What is this document about?', 'result': "The document appears to be a 104 Cover Page Interactive Data File for an SEC filing. It contains information about the company's financial statements and certifications."}
And there you have it! You've just set up a local RAG application using Ollama with Llama 3, LangChain, and Milvus. This setup not only makes it feasible to handle large datasets efficiently but also enables a highly responsive local question-answering system.
Feel free to check out Milvus and the code on GitHub, and share your experiences with the community by joining our Discord.
Recap
This guide provided a walkthrough for setting up a Retrieval Augmented Generation (RAG) application using local Large Language Models (LLMs). By leveraging tools like Ollama, Llama 3, LangChain, and Milvus, we demonstrated how to create a powerful question-answering (Q&A) chatbot capable of handling specific information queries with retrieved context from a vector store.
Key takeaways include:
RAG Overview: RAG enhances LLM capabilities by integrating external data sources. It involves indexing data into vector embeddings using Milvus and retrieving relevant context during query processing to generate accurate and informed responses.
Tooling Highlights:
Ollama simplifies managing and running Llama 3 models locally.
LangChain provides an intuitive framework for developing LLM-based applications.
Milvus, as the vector store, efficiently stores and retrieves vectorized data, enabling precise query handling.
Llama 3, developed by Meta, supports advanced functionality with features like multi-role interactions and customizable system prompts.
The setup process covered essential steps:
Prerequisites: Installing Docker, Milvus Standalone, Ollama, and the required Python libraries.
Indexing Data: Using tools like PyPDFLoader and RecursiveCharacterTextSplitter to load, split, and vectorize data.
Embedding Storage: Converting text into embeddings with a Jina AI model and storing them in the Milvus vector store.
Building a QA Chain: Integrating the vector store with Llama 3 through LangChain, configuring a tailored system prompt to process and respond to user queries.
By running the application, users can interact with their local chatbot to retrieve accurate, contextually relevant answers from specific datasets. Once relevant documents are retrieved, they are passed to the LLM for answer generation, an approach that has proven effective. The guide emphasized practicality, ensuring accessibility for both beginners and experts. Keep in mind that, for optimal retrieval, parameters may need to be fine-tuned to the specifics of your application.
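One common knob, for example, is how many chunks the retriever passes to the LLM for each query; a quick sketch (the value of k is illustrative):
# Retrieve a different number of chunks per query; k=3 is an illustrative value
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
qa_chain = RetrievalQA.from_chain_type(
    llm, retriever=retriever, chain_type_kwargs={"prompt": prompt}
)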
With the growing prominence of open-source LLMs, this setup highlights how localized implementations reduce costs, enhance security, and offer scalable solutions for information retrieval and generation. Explore the GitHub repository and join our Discord community to share your experience and insights!