Ragas and Zilliz Cloud Integration
Ragas and Zilliz Cloud integrate to evaluate and optimize RAG pipelines, combining Ragas' automated evaluation framework for measuring faithfulness, answer relevance, and context precision with Zilliz Cloud's high-performance vector database for scalable retrieval in production RAG systems.
What is Ragas
Ragas is a framework designed to evaluate Retrieval Augmented Generation (RAG) pipelines. While existing tools and frameworks help build RAG pipelines, evaluating and quantifying pipeline performance can be hard — this is where Ragas (RAG Assessment) comes in. It provides assessment tools focusing on metrics including faithfulness, answer relevancy, context recall, and context precision, and supports synthetic test dataset generation, production monitoring, and integrations with platforms like LangChain, LlamaIndex, and Milvus.
By integrating with Zilliz Cloud (fully managed Milvus), Ragas enables comprehensive performance assessment of RAG pipelines built on large-scale vector databases, allowing developers to evaluate retrieval quality and answer generation accuracy at production scale, identify issues like answer hallucinations, and iteratively improve their applications.
Benefits of the Ragas + Zilliz Cloud Integration
- Comprehensive RAG evaluation at scale: Zilliz Cloud handles billion-scale vectors for enterprise applications, and Ragas enables comprehensive performance assessment on these large datasets, maintaining evaluation efficiency as data scales.
- Multi-metric assessment: Ragas provides metrics for faithfulness, answer relevancy, context recall, and context precision, giving developers a holistic view of their RAG pipeline's performance when using Zilliz Cloud as the vector store.
- Streamlined development cycle: The integration allows developers to evaluate performance over time with minimal coding, identify answer hallucinations and retrieval gaps, and iteratively improve their RAG applications.
- Production monitoring: Ragas supports monitoring RAG pipeline quality in production environments, ensuring that the retrieval from Zilliz Cloud and LLM generation maintain high standards over time.
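As a sketch of what such monitoring could look like downstream of an evaluation run (the metric values and the 0.8 threshold below are illustrative assumptions, not real output from Ragas):

```python
# Placeholder scores standing in for one Ragas evaluation result.
example_scores = {
    "answer_relevancy": 0.93,
    "faithfulness": 0.97,
    "context_recall": 0.88,
    "context_precision": 0.76,
}

THRESHOLD = 0.8  # assumed quality bar for alerting

# Flag any metric that falls below the bar.
failing = [name for name, score in example_scores.items() if score < THRESHOLD]
print(failing)  # ['context_precision']
```

A gate like this can run on a schedule against sampled production traffic, turning the per-metric scores into a simple pass/fail signal.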
How the Integration Works
Ragas serves as the evaluation framework, assessing the quality of RAG pipeline outputs across multiple dimensions. It evaluates the precision and recall of contextual information retrieved from the vector database, and measures the faithfulness and relevance of LLM-generated responses, computing a weighted score to measure overall answer quality.
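For instance, a single summary number can be derived from the per-metric scores; the scores and the equal weights here are illustrative assumptions, not Ragas defaults:

```python
# Hypothetical per-metric scores for one evaluation run.
scores = {
    "faithfulness": 0.95,
    "answer_relevancy": 0.90,
    "context_recall": 0.85,
    "context_precision": 0.80,
}
weights = {name: 0.25 for name in scores}  # equal weighting, as an assumption

# Weighted average across the four metrics.
overall = sum(scores[m] * weights[m] for m in scores)
print(round(overall, 3))  # 0.875
```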
Zilliz Cloud serves as the vector database layer in the RAG pipeline being evaluated, storing and indexing document embeddings for fast similarity search. It handles the retrieval step — finding the most relevant context for user queries — which Ragas then evaluates for precision and recall.
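For illustration, pointing the pipeline at Zilliz Cloud instead of a local Milvus instance only changes how the client is constructed. The endpoint and key below are placeholders for the Public Endpoint and API Key from your Zilliz Cloud console, and the environment variable names are assumptions:

```python
import os

# Placeholders: substitute the Public Endpoint and API Key shown in the
# Zilliz Cloud console. The environment variable names are illustrative.
zilliz_uri = os.environ.get(
    "ZILLIZ_CLOUD_URI", "https://<public-endpoint>.zillizcloud.com"
)
zilliz_token = os.environ.get("ZILLIZ_CLOUD_API_KEY", "<api-key>")

# Constructing the client opens a connection, so it is left commented here:
# from pymilvus import MilvusClient
# milvus_client = MilvusClient(uri=zilliz_uri, token=zilliz_token)
```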
Together, Ragas and Zilliz Cloud create a complete RAG development and evaluation workflow: documents are embedded and stored in Zilliz Cloud, user queries retrieve relevant context through vector similarity search, an LLM generates responses based on the retrieved context, and Ragas evaluates the entire pipeline — measuring how precise and complete the retrieved context is, and how faithful and relevant the generated answers are.
Step-by-Step Guide
1. Install Required Packages
```shell
$ pip install --upgrade pymilvus openai requests tqdm pandas ragas
```

2. Set Up the OpenAI API Key
```python
import os

os.environ["OPENAI_API_KEY"] = "sk-***********"
```

3. Define the RAG Class
Define the RAG class that uses Milvus as the vector store and OpenAI as the LLM:
```python
from typing import List

from openai import OpenAI
from pymilvus import MilvusClient
from tqdm import tqdm


class RAG:
    """
    RAG (Retrieval-Augmented Generation) class built upon OpenAI and Milvus.
    """

    def __init__(self, openai_client: OpenAI, milvus_client: MilvusClient):
        self._prepare_openai(openai_client)
        self._prepare_milvus(milvus_client)

    def _emb_text(self, text: str) -> List[float]:
        return (
            self.openai_client.embeddings.create(input=text, model=self.embedding_model)
            .data[0]
            .embedding
        )

    def _prepare_openai(
        self,
        openai_client: OpenAI,
        embedding_model: str = "text-embedding-3-small",
        llm_model: str = "gpt-3.5-turbo",
    ):
        self.openai_client = openai_client
        self.embedding_model = embedding_model
        self.llm_model = llm_model

        self.SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
        self.USER_PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

    def _prepare_milvus(
        self, milvus_client: MilvusClient, collection_name: str = "rag_collection"
    ):
        self.milvus_client = milvus_client
        self.collection_name = collection_name
        if self.milvus_client.has_collection(self.collection_name):
            self.milvus_client.drop_collection(self.collection_name)

        embedding_dim = len(self._emb_text("foo"))
        self.milvus_client.create_collection(
            collection_name=self.collection_name,
            dimension=embedding_dim,
            metric_type="IP",
            consistency_level="Strong",
        )

    def load(self, texts: List[str]):
        """Embed the texts and insert them into the Milvus collection."""
        data = []
        for i, line in enumerate(tqdm(texts, desc="Creating embeddings")):
            data.append({"id": i, "vector": self._emb_text(line), "text": line})
        self.milvus_client.insert(collection_name=self.collection_name, data=data)

    def retrieve(self, question: str, top_k: int = 3) -> List[str]:
        """Retrieve the texts most similar to the question."""
        search_res = self.milvus_client.search(
            collection_name=self.collection_name,
            data=[self._emb_text(question)],
            limit=top_k,
            search_params={"metric_type": "IP", "params": {}},
            output_fields=["text"],
        )
        retrieved_texts = [res["entity"]["text"] for res in search_res[0]]
        return retrieved_texts[:top_k]

    def answer(
        self,
        question: str,
        retrieval_top_k: int = 3,
        return_retrieved_text: bool = False,
    ):
        """Answer the question using the retrieved context."""
        retrieved_texts = self.retrieve(question, top_k=retrieval_top_k)
        user_prompt = self.USER_PROMPT.format(
            context="\n".join(retrieved_texts), question=question
        )
        response = self.openai_client.chat.completions.create(
            model=self.llm_model,
            messages=[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
        )
        if not return_retrieved_text:
            return response.choices[0].message.content
        else:
            return response.choices[0].message.content, retrieved_texts
```

4. Initialize the RAG Pipeline and Load Data
Initialize RAG with OpenAI and Milvus clients, download and load the data:
```python
openai_client = OpenAI()
milvus_client = MilvusClient(uri="./milvus_demo.db")

my_rag = RAG(openai_client=openai_client, milvus_client=milvus_client)
```

As for the argument of `MilvusClient`:

- Setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- If you have a large scale of data, you can set up a more performant Milvus server on Docker or Kubernetes.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the Public Endpoint and API Key in Zilliz Cloud.

```python
import urllib.request

url = "https://raw.githubusercontent.com/milvus-io/milvus/master/DEVELOPMENT.md"
file_path = "./Milvus_DEVELOPMENT.md"

if not os.path.exists(file_path):
    urllib.request.urlretrieve(url, file_path)
with open(file_path, "r") as file:
    file_text = file.read()

text_lines = file_text.split("# ")
my_rag.load(text_lines)
```

5. Prepare Questions, Get Answers, and Collect Results
Prepare questions with ground truth answers and collect RAG pipeline results:
```python
import pandas as pd
from ragas import EvaluationDataset

user_input_list = [
    "what is the hardware requirements specification if I want to build Milvus and run from source code?",
    "What is the programming language used to write Knowhere?",
    "What should be ensured before running code coverage?",
]
reference_list = [
    "If you want to build Milvus and run from source code, the recommended hardware requirements specification is:\n\n- 8GB of RAM\n- 50GB of free disk space.",
    "The programming language used to write Knowhere is C++.",
    "Before running code coverage, you should make sure that your code changes are covered by unit tests.",
]

retrieved_contexts_list = []
response_list = []
for user_input in tqdm(user_input_list, desc="Answering questions"):
    response, retrieved_context = my_rag.answer(user_input, return_retrieved_text=True)
    retrieved_contexts_list.append(retrieved_context)
    response_list.append(response)

df = pd.DataFrame(
    {
        "user_input": user_input_list,
        "retrieved_contexts": retrieved_contexts_list,
        "response": response_list,
        "reference": reference_list,
    }
)
rag_results = EvaluationDataset.from_pandas(df)
```

6. Evaluate with Ragas
Use Ragas to evaluate the RAG pipeline with multiple metrics:
```python
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    AnswerRelevancy,
    ContextPrecision,
    ContextRecall,
    Faithfulness,
)

llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

results = evaluate(
    dataset=rag_results,
    metrics=[
        AnswerRelevancy(llm=evaluator_llm),
        Faithfulness(llm=evaluator_llm),
        ContextRecall(llm=evaluator_llm),
        ContextPrecision(llm=evaluator_llm),
    ],
)
results
```

Learn More
- Evaluation with Ragas — Official Milvus tutorial for RAG evaluation with Ragas
- RAG Evaluation Using Ragas — Zilliz blog on RAG evaluation metrics and implementation
- How to Evaluate Retrieval Augmented Generation (RAG) Applications — Zilliz blog on evaluating RAG applications
- Ragas Documentation — Official Ragas documentation
- RAGAS: Automated Evaluation of Retrieval Augmented Generation — Ragas academic paper