Building Intelligent RAG Applications with LangServe, LangGraph, and Milvus

Jun 25, 2024 · 4 min read

This blog post is a follow-up to my previous article about Local Agentic RAG with LangGraph and Llama 3.

In this blog post, we'll explore how to build applications using LangServe and LangGraph, two tools from the LangChain ecosystem, with Milvus as the vector database. We'll show you how to set up a FastAPI application, configure LangServe and LangGraph, and use Milvus for efficient data retrieval.

What You'll Learn

Setting up a FastAPI application with LangServe and LangGraph.
Integrating Milvus for vector storage and retrieval.
Building an LLM agent with LangGraph.

Prerequisites

Before we start, make sure you have the following:

Python 3.9+
Docker
Basic knowledge of FastAPI and Docker

Introduction to LangServe, LangGraph & Milvus

LangServe is a library built on top of FastAPI that streamlines the creation of endpoints backed by LangChain. It lets you define complex processing workflows and expose them as REST API endpoints.

LangGraph is an extension of LangChain aimed at building robust, stateful multi-actor applications with LLMs by modeling steps as nodes and edges in a graph.

Milvus is a high-performance open-source vector database built for scale. Milvus Lite, the lightweight and local version of Milvus, can be run on your personal device without the need for Docker or Kubernetes.
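
To make "vector retrieval" concrete, here is a minimal, library-free sketch of what a similarity search computes: score every stored vector against a query and keep the best matches. It is not how Milvus is implemented (a vector database relies on indexes rather than scanning every vector), and the three-dimensional vectors and names such as toy_vectors are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Brute-force search: score every vector, return the k most similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real embedding models produce hundreds of dimensions.
toy_vectors = {
    "agents": [0.9, 0.1, 0.0],
    "prompting": [0.1, 0.9, 0.0],
    "attacks": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.2, 0.0], toy_vectors, k=2)
print([doc_id for doc_id, _ in results])
```

The query points mostly along the first axis, so "agents" ranks first and "prompting" second.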

Building Tool-Calling Agents with LangGraph

In our previous blog post, we discussed how LangGraph can significantly enhance the capabilities of language models by enabling tool-calling. Here, we will demonstrate how to build such agents with LangServe and deploy them on various infrastructures using Docker.

Introduction to Agentic RAG

Agents can transform language models into powerful reasoning engines that determine actions, execute them, and evaluate the results. This process is known as Agentic RAG (Retrieval-Augmented Generation). Agents can:

Perform web searches
Browse emails
Conduct self-reflection or self-grading on retrieved documents
Execute custom user-defined functions
And more…
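
The decide, execute, evaluate cycle described above can be sketched in a few lines of plain Python. This is only an illustration of the control flow: the tools (search_notes, word_count) are hypothetical, and choose_tool and evaluate are simple rules standing in for decisions an LLM would make in a real agent.

```python
from typing import Callable


# Hypothetical tools; a real agent might wrap web search or an email client.
def search_notes(query: str) -> str:
    notes = {"milvus": "Milvus is an open-source vector database."}
    for key, text in notes.items():
        if key in query.lower():
            return text
    return ""


def word_count(query: str) -> str:
    return str(len(query.split()))


TOOLS: dict[str, Callable[[str], str]] = {
    "search_notes": search_notes,
    "word_count": word_count,
}


def choose_tool(query: str) -> str:
    """Stand-in for the LLM's decision step: a keyword rule picks the tool."""
    return "word_count" if query.lower().startswith("count") else "search_notes"


def evaluate(result: str) -> bool:
    """Stand-in for self-grading: accept any non-empty result."""
    return bool(result)


def run_agent(query: str) -> dict:
    tool_name = choose_tool(query)    # determine the action
    result = TOOLS[tool_name](query)  # execute it
    accepted = evaluate(result)       # evaluate the result
    return {"tool": tool_name, "result": result, "accepted": accepted}


answer = run_agent("What is Milvus?")
print(answer)
```

A rejected result (accepted is False) is the point where an agent would typically try another tool or rewrite the query.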

Setting Things Up

This tutorial relies on the following components:

LangGraph: An extension of LangChain for building stateful applications with LLMs, using graphs to model steps and decisions.
Ollama & Llama 3: Ollama enables running open-source language models like Llama 3 locally, allowing for offline use and greater control.
Milvus Lite: A local version of Milvus for efficient vector storage and retrieval, suitable for running on personal devices.

Using LangServe and Milvus

LangServe allows you to expose complex processing workflows as API endpoints. The example below defines a function, analyze_text, and registers it on our FastAPI app at the /analyze endpoint so that clients can query it.

Define Your FastAPI Application:

import uvicorn

from fastapi import FastAPI
from pydantic import BaseModel
from langchain_core.runnables import RunnableLambda
from langserve import add_routes


# Define Pydantic model for request body
class QuestionRequest(BaseModel):
    question: str

fastapi_app = FastAPI()

# Plain FastAPI route: POST /analyze
@fastapi_app.post("/analyze")
async def analyze_text(request: QuestionRequest):
    # Simulate text analysis (replace with your actual LangServe logic)
    entities = ["entity1", "entity2"]
    processed_data = f"Processed entities: {entities}"
    return {"entities": entities, "processed_data": processed_data}


# Wrap the same function as a runnable so LangServe also exposes it
# through its standard routes (/invoke, /batch, /stream)
add_routes(fastapi_app, RunnableLambda(analyze_text))

if __name__ == "__main__":
    uvicorn.run(fastapi_app, host="0.0.0.0", port=5001)
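
Once the app above is running, routes registered with add_routes accept a JSON body whose "input" key holds the runnable's input. The sketch below builds such a request with the standard library without sending it. The helper name build_invoke_request is hypothetical, and both the URL (no path prefix, port 5001) and the input shape (a "question" field) are assumptions based on the app above; adjust them if you pass a path to add_routes or change the request model.

```python
import json
from urllib import request


def build_invoke_request(base_url: str, question: str) -> request.Request:
    """Build (but do not send) a POST request for a LangServe invoke route."""
    # Assumed input shape: mirrors the QuestionRequest model's single field.
    payload = {"input": {"question": question}}
    return request.Request(
        url=f"{base_url.rstrip('/')}/invoke",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_invoke_request("http://localhost:5001", "What is agent memory?")
print(req.full_url)

# To send it against a running server:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["output"])
```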

Integrate LangGraph for Workflow Management:

Use LangGraph to build a custom Llama 3-powered RAG agent that can handle various tasks, such as routing user queries to the most suitable retrieval method or performing self-correction to improve answer quality. Our earlier blog post about LangGraph walks through how to do that.
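
To illustrate the two ideas mentioned here, routing and self-correction, the following is a library-free sketch of a graph whose nodes update a shared state and whose edges pick the next node. It deliberately does not use the LangGraph API; every name in it (route, grade, run_graph, and so on) is hypothetical, and the keyword rule and the "first attempt fails" generator are toy stand-ins for LLM calls.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]


def route(state: State) -> str:
    """Entry edge: pick a retrieval method from the question (toy keyword rule)."""
    return "web_search" if "latest" in state["question"].lower() else "vectorstore"


def vectorstore(state: State) -> State:
    return {"documents": ["Agents combine planning, memory, and tool use."]}


def web_search(state: State) -> State:
    return {"documents": ["(web result placeholder)"]}


def generate(state: State) -> State:
    attempts = state.get("attempts", 0) + 1
    # Toy generator: the first attempt is deliberately empty to trigger a retry.
    answer = "" if attempts == 1 else f"Answer drafted from {len(state['documents'])} document(s)"
    return {"answer": answer, "attempts": attempts}


def grade(state: State) -> str:
    """Self-correction edge: retry until an answer exists or attempts run out."""
    if state["answer"] or state["attempts"] >= 3:
        return "END"
    return "generate"


NODES: Dict[str, Callable[[State], State]] = {
    "vectorstore": vectorstore,
    "web_search": web_search,
    "generate": generate,
}
EDGES: Dict[str, Callable[[State], str]] = {
    "vectorstore": lambda state: "generate",
    "web_search": lambda state: "generate",
    "generate": grade,
}


def run_graph(question: str) -> State:
    state: State = {"question": question, "path": []}
    current = route(state)
    while current != "END":
        state["path"].append(current)          # record the visited node
        state.update(NODES[current](state))    # node returns a partial state update
        current = EDGES[current](state)        # edge decides where to go next
    return state


final_state = run_graph("What are the components of an agent?")
print(final_state["path"], final_state["answer"])
```

The recorded path shows the retry: the question goes to the vector store, and generate runs twice because the first answer is rejected by grade.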

Utilize Milvus for Efficient Data Retrieval:

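Integrate Milvus to store and retrieve vector data efficiently. The code below loads a few web pages, splits them into chunks, embeds the chunks, and stores them in Milvus Lite. The retrieve function at the end is what the agent calls to fetch the chunks relevant to a question.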

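from typing import Any, Dict

# Imports assume the langchain-community, langchain-text-splitters,
# langchain-milvus, and langchain-huggingface packages are installed
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus import Milvus
from langchain_text_splitters import RecursiveCharacterTextSplitter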

def load_and_split_documents(urls: list[str]) -> list[Document]:
    docs = [WebBaseLoader(url).load() for url in urls]
    docs_list = [item for sublist in docs for item in sublist]
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=250, chunk_overlap=0)
    return text_splitter.split_documents(docs_list)

def add_documents_to_milvus(doc_splits: list[Document], embedding_model: Embeddings, connection_args: Any):
    vectorstore = Milvus.from_documents(documents=doc_splits, collection_name="rag_milvus", embedding=embedding_model, connection_args=connection_args)
    return vectorstore.as_retriever()

urls = [
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]

doc_splits = load_and_split_documents(urls)
embedding_model = HuggingFaceEmbeddings()
connection_args = {"uri": "./milvus_rag.db"}
retriever = add_documents_to_milvus(doc_splits, embedding_model, connection_args)

# Function that the agent will call when it needs to retrieve documents.
def retrieve(state: Dict[str, Any]) -> Dict[str, Any]:
    print("---RETRIEVE---")
    question = state["question"]
    documents = retriever.invoke(question)
    return {"documents": [doc.page_content for doc in documents], "question": question}
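
The chunk_size and chunk_overlap arguments passed to the splitter in load_and_split_documents control how long each chunk is and how much neighboring chunks share. The sketch below shows that arithmetic on a plain list of words. It is a simplification under stated assumptions: whitespace-separated words stand in for tokenizer tokens, the function split_into_chunks is hypothetical, and the actual splitter additionally tries to break on natural separators.

```python
def split_into_chunks(tokens: list[str], chunk_size: int, chunk_overlap: int = 0) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


words = "agents plan remember and call tools to answer questions".split()

no_overlap = split_into_chunks(words, chunk_size=4)
with_overlap = split_into_chunks(words, chunk_size=4, chunk_overlap=1)

print(no_overlap)
print(with_overlap)
```

With chunk_overlap=0, as in the code above, chunks never share tokens. A small overlap repeats the tail of one chunk at the head of the next, which can help when a sentence straddles a chunk boundary, at the cost of storing more vectors.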

Please feel free to check out the code on my GitHub.

Conclusion

In this blog post, we've shown how to build a RAG system using agents with LangServe, LangGraph, Llama 3, and Milvus. These agents enhance LLM capabilities by incorporating planning, memory, and tool usage, leading to more robust and informative responses. By integrating LangServe, you can expose these workflows as API endpoints, which makes intelligent applications easier to build and deploy.

If you enjoyed this blog post, consider giving us a star on GitHub and joining our Discord to share your experiences with the community.

Updated on Sep 01, 2025

Stephen Batifol
Stephen Batifol is a Developer Advocate at Zilliz. He previously worked as a Machine Learning Engineer at Wolt, where he worked on the ML Platform, and as a Data Scientist at Brevo. Stephen studied Computer Science and Artificial Intelligence. He enjoys dancing and surfing.
