DeepEval and Zilliz Cloud Integration
DeepEval and Zilliz Cloud integrate to test and evaluate RAG pipeline performance: DeepEval's evaluation framework supplies metrics for contextual precision, contextual recall, and contextual relevancy as well as answer relevancy and faithfulness, while Zilliz Cloud's high-performance vector database provides scalable retrieval for production RAG systems.
What is DeepEval
DeepEval is an evaluation framework built for testing Retrieval-Augmented Generation (RAG) pipelines. It helps developers assess and quantify pipeline performance through comprehensive metrics covering both retrieval quality (contextual precision, recall, and relevancy) and generation quality (answer relevancy and faithfulness). DeepEval enables developers to identify issues, ensure quality standards, and iteratively improve their RAG systems.
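To build intuition for what the retrieval metrics capture, here is a deliberately simplified sketch that scores a toy retrieval result with exact string matching. DeepEval's real metrics use LLM judges rather than keyword logic, and the function names and scoring rules below are illustrative assumptions, not DeepEval's API:

```python
from typing import List


def toy_contextual_recall(retrieved: List[str], relevant: List[str]) -> float:
    """Fraction of the relevant documents that made it into the retrieved set."""
    hits = sum(1 for doc in relevant if doc in retrieved)
    return hits / len(relevant)


def toy_contextual_precision(retrieved: List[str], relevant: List[str]) -> float:
    """Fraction of retrieved documents that are actually relevant.
    DeepEval's ContextualPrecisionMetric is also rank-aware; this toy version is not."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


relevant = ["Milvus needs 8GB of RAM.", "Milvus needs 50GB of disk."]
retrieved = [
    "Milvus needs 8GB of RAM.",
    "Knowhere is written in C++.",
    "Milvus needs 50GB of disk.",
]

print(toy_contextual_recall(retrieved, relevant))     # 1.0: both relevant docs were retrieved
print(toy_contextual_precision(retrieved, relevant))  # ~0.67: one retrieved doc is off-topic
```

The point of separating the two scores is that a retriever can fetch everything relevant (high recall) while still padding the context with noise (low precision), and each failure mode calls for a different fix.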
By integrating with Zilliz Cloud (fully managed Milvus), DeepEval provides a complete toolkit to build and test production RAG systems, measuring the relevance and accuracy of retrieval results from Zilliz Cloud's vector search and evaluating LLM-generated responses for faithfulness and relevance.
Benefits of the DeepEval + Zilliz Cloud Integration
- Comprehensive retrieval evaluation: DeepEval measures contextual precision, recall, and relevancy of documents retrieved from Zilliz Cloud, providing detailed insights into how well the vector search surfaces relevant information.
- Generation quality assessment: DeepEval evaluates answer relevancy and faithfulness of LLM responses, ensuring outputs are both relevant to the question and factually grounded in the context retrieved from Zilliz Cloud.
- Automated testing with LLM judges: DeepEval uses LLM-based evaluation (e.g., GPT-4o) to automatically score and explain retrieval and generation quality, reducing manual evaluation effort.
- Production-ready testing: The combination gives developers a complete toolkit to build and test production RAG systems, with Zilliz Cloud's managed infrastructure reducing operational complexity while DeepEval ensures quality standards are met.
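The LLM-as-judge pattern mentioned above can be sketched offline. The prompt wording, the `score|explanation` reply format, and the helper names below are assumptions for illustration only; DeepEval assembles its own judge prompts and parses replies internally:

```python
JUDGE_PROMPT = """You are an impartial judge. Given a question, a retrieved context,
and an answer, rate the answer's faithfulness to the context from 0.0 to 1.0
and explain your rating. Respond as: <score>|<explanation>"""


def build_judge_request(question: str, context: str, answer: str) -> str:
    """Assemble the full prompt that would be sent to a judge model such as GPT-4o."""
    return f"{JUDGE_PROMPT}\n\nQuestion: {question}\nContext: {context}\nAnswer: {answer}"


def parse_judge_reply(reply: str):
    """Split a 'score|explanation' reply into a float score and a reason string."""
    score, _, reason = reply.partition("|")
    return float(score), reason.strip()


# A canned reply stands in for a real model call in this offline sketch.
score, reason = parse_judge_reply(
    "0.9|The answer restates only facts found in the context."
)
```

Because the judge returns a reason alongside the score, failed test cases come with an explanation of what went wrong, which is what removes most of the manual evaluation effort.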
How the Integration Works
DeepEval serves as the evaluation framework, providing metrics and test cases to assess RAG pipeline quality. It evaluates the retriever through contextual precision, recall, and relevancy metrics, and assesses the generator through answer relevancy and faithfulness metrics — all using LLM-based judges for automated scoring and explanations.
Zilliz Cloud serves as the vector database layer in the RAG pipeline being evaluated, storing and indexing document embeddings for fast similarity search. It handles the retrieval step — finding the most relevant context for user queries — which DeepEval then evaluates for quality.
Together, DeepEval and Zilliz Cloud create a complete RAG development and evaluation workflow: documents are embedded and stored in Zilliz Cloud, user queries retrieve relevant context through vector similarity search, an LLM generates responses, and DeepEval evaluates the entire pipeline — measuring retrieval ranking quality, contextual relevance, answer accuracy, and factual faithfulness.
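The workflow above can be sketched end to end with stand-in components. The tiny in-memory index, bag-of-words "embeddings", and template "LLM" below are assumptions for illustration only; they stand in for Zilliz Cloud's vector search and a real chat model:

```python
import math
from typing import Dict, List


def embed(text: str) -> Dict[str, float]:
    """Stand-in embedding: a bag-of-words frequency vector (a real pipeline
    calls an embedding model and stores dense vectors in Zilliz Cloud)."""
    vec: Dict[str, float] = {}
    for word in text.lower().split():
        word = word.strip(".,?!")
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


docs = [
    "Milvus is a vector database for similarity search.",
    "Knowhere is the vector execution engine of Milvus.",
    "Paris is the capital of France.",
]
index = [(doc, embed(doc)) for doc in docs]  # the "embed and store" step


def retrieve(question: str, top_k: int = 2) -> List[str]:
    # The "vector search" step: rank stored docs by similarity to the query.
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]


def generate(question: str, contexts: List[str]) -> str:
    """Stand-in for the LLM call: just echo the top retrieved context."""
    return f"Based on the context, the answer to '{question}' is: {contexts[0]}"


contexts = retrieve("What engine does Milvus use for vectors?")
answer = generate("What engine does Milvus use for vectors?", contexts)
```

In the real pipeline of the steps below, `retrieve` becomes a Zilliz Cloud similarity search and `generate` an OpenAI chat completion, while the (question, contexts, answer) triple is exactly what DeepEval's test cases consume.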
Step-by-Step Guide
1. Install Required Packages
```shell
$ pip install --upgrade pymilvus openai requests tqdm pandas deepeval
```
2. Set Up the OpenAI API Key
```python
import os

os.environ["OPENAI_API_KEY"] = "sk-***********"
```
3. Define the RAG Class
Define the RAG class that uses Milvus as the vector store and OpenAI as the LLM:
```python
from typing import List

from openai import OpenAI
from pymilvus import MilvusClient
from tqdm import tqdm


class RAG:
    """
    RAG (Retrieval-Augmented Generation) class built upon OpenAI and Milvus.
    """

    def __init__(self, openai_client: OpenAI, milvus_client: MilvusClient):
        self._prepare_openai(openai_client)
        self._prepare_milvus(milvus_client)

    def _emb_text(self, text: str) -> List[float]:
        return (
            self.openai_client.embeddings.create(input=text, model=self.embedding_model)
            .data[0]
            .embedding
        )

    def _prepare_openai(
        self,
        openai_client: OpenAI,
        embedding_model: str = "text-embedding-3-small",
        llm_model: str = "gpt-4o-mini",
    ):
        self.openai_client = openai_client
        self.embedding_model = embedding_model
        self.llm_model = llm_model
        self.SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
        self.USER_PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

    def _prepare_milvus(
        self, milvus_client: MilvusClient, collection_name: str = "rag_collection"
    ):
        self.milvus_client = milvus_client
        self.collection_name = collection_name
        if self.milvus_client.has_collection(self.collection_name):
            self.milvus_client.drop_collection(self.collection_name)
        embedding_dim = len(self._emb_text("demo"))
        self.milvus_client.create_collection(
            collection_name=self.collection_name,
            dimension=embedding_dim,
            metric_type="IP",
            consistency_level="Strong",
        )

    def load(self, texts: List[str]):
        data = []
        for i, line in enumerate(tqdm(texts, desc="Creating embeddings")):
            data.append({"id": i, "vector": self._emb_text(line), "text": line})
        self.milvus_client.insert(collection_name=self.collection_name, data=data)

    def retrieve(self, question: str, top_k: int = 3) -> List[str]:
        search_res = self.milvus_client.search(
            collection_name=self.collection_name,
            data=[self._emb_text(question)],
            limit=top_k,
            search_params={"metric_type": "IP", "params": {}},
            output_fields=["text"],
        )
        retrieved_texts = [res["entity"]["text"] for res in search_res[0]]
        return retrieved_texts[:top_k]

    def answer(
        self,
        question: str,
        retrieval_top_k: int = 3,
        return_retrieved_text: bool = False,
    ):
        retrieved_texts = self.retrieve(question, top_k=retrieval_top_k)
        user_prompt = self.USER_PROMPT.format(
            context="\n".join(retrieved_texts), question=question
        )
        response = self.openai_client.chat.completions.create(
            model=self.llm_model,
            messages=[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
        )
        if not return_retrieved_text:
            return response.choices[0].message.content
        else:
            return response.choices[0].message.content, retrieved_texts
```
4. Initialize the RAG Pipeline, Load Data, and Collect Results
```python
openai_client = OpenAI()
milvus_client = MilvusClient(uri="./milvus_demo.db")

my_rag = RAG(openai_client=openai_client, milvus_client=milvus_client)
```
As for the argument of MilvusClient:
- Setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- If you have a large scale of data, you can set up a more performant Milvus server on Docker or Kubernetes.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the Public Endpoint and API Key in Zilliz Cloud.

Download and load data, then collect RAG results:
```python
import urllib.request

import pandas as pd

url = "https://raw.githubusercontent.com/milvus-io/milvus/master/DEVELOPMENT.md"
file_path = "./Milvus_DEVELOPMENT.md"

if not os.path.exists(file_path):
    urllib.request.urlretrieve(url, file_path)
with open(file_path, "r") as file:
    file_text = file.read()

text_lines = file_text.split("# ")
my_rag.load(text_lines)

question_list = [
    "what is the hardware requirements specification if I want to build Milvus and run from source code?",
    "What is the programming language used to write Knowhere?",
    "What should be ensured before running code coverage?",
]
ground_truth_list = [
    "If you want to build Milvus and run from source code, the recommended hardware requirements specification is:\n\n- 8GB of RAM\n- 50GB of free disk space.",
    "The programming language used to write Knowhere is C++.",
    "Before running code coverage, you should make sure that your code changes are covered by unit tests.",
]

contexts_list = []
answer_list = []
for question in tqdm(question_list, desc="Answering questions"):
    answer, contexts = my_rag.answer(question, return_retrieved_text=True)
    contexts_list.append(contexts)
    answer_list.append(answer)

df = pd.DataFrame(
    {
        "question": question_list,
        "contexts": contexts_list,
        "answer": answer_list,
        "ground_truth": ground_truth_list,
    }
)
```
5. Evaluate Retriever with DeepEval
Assess retrieval quality using contextual precision, recall, and relevancy metrics:
```python
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)
from deepeval.test_case import LLMTestCase

contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()

test_cases = []
for index, row in df.iterrows():
    test_case = LLMTestCase(
        input=row["question"],
        actual_output=row["answer"],
        expected_output=row["ground_truth"],
        retrieval_context=row["contexts"],
    )
    test_cases.append(test_case)

result = evaluate(
    test_cases=test_cases,
    metrics=[contextual_precision, contextual_recall, contextual_relevancy],
    print_results=False,
)
```
6. Evaluate Generation with DeepEval
Assess generation quality using answer relevancy and faithfulness metrics:
```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()

test_cases = []
for index, row in df.iterrows():
    test_case = LLMTestCase(
        input=row["question"],
        actual_output=row["answer"],
        expected_output=row["ground_truth"],
        retrieval_context=row["contexts"],
    )
    test_cases.append(test_case)

result = evaluate(
    test_cases=test_cases,
    metrics=[answer_relevancy, faithfulness],
    print_results=False,
)
```
Learn More
- Evaluation with DeepEval — Official Milvus tutorial for RAG evaluation with DeepEval
- Evaluating RAG: Everything You Should Know — Zilliz blog on RAG evaluation best practices
- How to Evaluate Retrieval Augmented Generation (RAG) Applications — Zilliz blog on evaluating RAG applications
- DeepEval Documentation — Official DeepEval documentation
- DeepEval GitHub Repository — DeepEval source code and community resources