Arize Phoenix and Zilliz Cloud Integration
Arize Phoenix and Zilliz Cloud integrate to evaluate and optimize RAG pipelines: Phoenix provides an open-source evaluation and observability framework for hallucination detection, QA accuracy, and LLM tracing, while Zilliz Cloud supplies a high-performance vector database for scalable retrieval in production AI systems.
What is Arize Phoenix
Arize Phoenix is an open-source tool for evaluating Retrieval-Augmented Generation (RAG) pipelines. It delivers metrics and insights measuring retrieval quality and response accuracy, focusing on hallucination evaluation (determining if content is factual or hallucinatory) and QA evaluation (assessing the accuracy of model answers). Phoenix also provides OTEL-compatible tracing for LLM applications, capturing application latency, token usage, runtime exceptions, and retrieved document analysis across frameworks like LangChain, LlamaIndex, and OpenAI.
By integrating with Zilliz Cloud (fully managed Milvus), Arize Phoenix enables comprehensive evaluation of RAG pipelines built on scalable vector databases, helping teams identify high-performing queries, recognize improvement areas, and understand how retrieval system modifications affect overall performance through data-driven analysis.
Benefits of the Arize Phoenix + Zilliz Cloud Integration
- Hallucination detection: Phoenix evaluates whether LLM responses are grounded in the context retrieved from Zilliz Cloud, identifying factual vs. hallucinatory content with detailed explanations.
- QA accuracy assessment: Phoenix measures the accuracy of model answers against ground truth, enabling teams to quantify how well their Zilliz Cloud-backed RAG pipeline performs.
- End-to-end LLM tracing: Phoenix's OTEL-compatible tracing captures the entire request flow — from embedding generation to Zilliz Cloud retrieval to LLM response — providing insights into latency, token usage, and runtime exceptions.
- Data-driven optimization: The combination of Zilliz Cloud's retrieval metrics and Phoenix's evaluation insights enables teams to iteratively improve their RAG pipelines with quantified performance data.
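The quantified data behind this optimization loop is ordinary tabular output. As a minimal sketch, a hallucination rate can be computed from per-row evaluator labels (the rows below are hypothetical, and the "factual"/"hallucinated" label values follow the convention of Phoenix's hallucination evaluator):

```python
# Toy evaluator output (hypothetical rows); Phoenix's hallucination evaluator
# labels each answer "factual" or "hallucinated" and attaches an explanation.
rows = [
    {"input": "What is Knowhere written in?", "hallucination_eval": "factual"},
    {"input": "Who founded Milvus?", "hallucination_eval": "hallucinated"},
    {"input": "What RAM is recommended?", "hallucination_eval": "factual"},
]

labels = [r["hallucination_eval"] for r in rows]
hallucination_rate = labels.count("hallucinated") / len(labels)
print(f"hallucination rate: {hallucination_rate:.0%}")  # prints: hallucination rate: 33%
```

Tracking this rate across pipeline changes (chunking, top-k, embedding model) is what turns the evaluators into an optimization signal.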
How the Integration Works
Arize Phoenix serves as the evaluation and observability layer, assessing the quality of RAG pipeline outputs through hallucination and QA evaluators powered by LLM judges. It also provides tracing that captures the full request lifecycle for performance analysis and debugging.
Zilliz Cloud serves as the vector database layer in the RAG pipeline being evaluated, storing and indexing document embeddings for fast similarity search. It handles the retrieval step — finding the most relevant context for user queries — which Phoenix then evaluates for quality.
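For reference, swapping the retrieval layer between a local Milvus Lite file and a Zilliz Cloud cluster is only a change of connection parameters; a sketch (the endpoint and key below are placeholders, not real credentials):

```python
# Connection parameters only; no client is created here.
# For Zilliz Cloud, the uri is the cluster's Public Endpoint and the token
# is its API Key (both values below are placeholders).
zilliz_conn = {
    "uri": "https://in03-xxxx.api.gcp-us-west1.zillizcloud.com",
    "token": "your-zilliz-cloud-api-key",
}

# For local development, Milvus Lite stores everything in a single file.
milvus_lite_conn = {"uri": "./milvus_demo.db"}

# Either dict can be expanded into MilvusClient(**conn);
# the rest of the RAG pipeline is unchanged.
```

The step-by-step guide that follows uses the Milvus Lite form for simplicity.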
Together, Arize Phoenix and Zilliz Cloud create a complete RAG development and evaluation workflow: documents are embedded and stored in Zilliz Cloud, user queries retrieve relevant context through vector similarity search, an LLM generates responses, and Phoenix evaluates the entire pipeline — detecting hallucinations, measuring answer accuracy, and tracing performance across all components.
Step-by-Step Guide
1. Install Required Packages
```shell
$ pip install --upgrade pymilvus openai requests tqdm pandas "arize-phoenix>=4.29.0" nest_asyncio
```
2. Set Up the OpenAI API Key
```python
import os

os.environ["OPENAI_API_KEY"] = "sk-***********"
```
3. Define the RAG Class
Define the RAG class that uses Milvus as the vector store and OpenAI as the LLM:
```python
from typing import List

from openai import OpenAI
from pymilvus import MilvusClient
from tqdm import tqdm


class RAG:
    """
    RAG (Retrieval-Augmented Generation) class built upon OpenAI and Milvus.
    """

    def __init__(self, openai_client: OpenAI, milvus_client: MilvusClient):
        self._prepare_openai(openai_client)
        self._prepare_milvus(milvus_client)

    def _emb_text(self, text: str) -> List[float]:
        return (
            self.openai_client.embeddings.create(input=text, model=self.embedding_model)
            .data[0]
            .embedding
        )

    def _prepare_openai(
        self,
        openai_client: OpenAI,
        embedding_model: str = "text-embedding-3-small",
        llm_model: str = "gpt-4o-mini",
    ):
        self.openai_client = openai_client
        self.embedding_model = embedding_model
        self.llm_model = llm_model
        self.SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
        self.USER_PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

    def _prepare_milvus(
        self, milvus_client: MilvusClient, collection_name: str = "rag_collection"
    ):
        self.milvus_client = milvus_client
        self.collection_name = collection_name
        if self.milvus_client.has_collection(self.collection_name):
            self.milvus_client.drop_collection(self.collection_name)
        embedding_dim = len(self._emb_text("demo"))
        self.milvus_client.create_collection(
            collection_name=self.collection_name,
            dimension=embedding_dim,
            metric_type="IP",
            consistency_level="Strong",
        )

    def load(self, texts: List[str]):
        data = []
        for i, line in enumerate(tqdm(texts, desc="Creating embeddings")):
            data.append({"id": i, "vector": self._emb_text(line), "text": line})
        self.milvus_client.insert(collection_name=self.collection_name, data=data)

    def retrieve(self, question: str, top_k: int = 3) -> List[str]:
        search_res = self.milvus_client.search(
            collection_name=self.collection_name,
            data=[self._emb_text(question)],
            limit=top_k,
            search_params={"metric_type": "IP", "params": {}},
            output_fields=["text"],
        )
        retrieved_texts = [res["entity"]["text"] for res in search_res[0]]
        return retrieved_texts[:top_k]

    def answer(
        self,
        question: str,
        retrieval_top_k: int = 3,
        return_retrieved_text: bool = False,
    ):
        retrieved_texts = self.retrieve(question, top_k=retrieval_top_k)
        user_prompt = self.USER_PROMPT.format(
            context="\n".join(retrieved_texts), question=question
        )
        response = self.openai_client.chat.completions.create(
            model=self.llm_model,
            messages=[
                {"role": "system", "content": self.SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
        )
        if not return_retrieved_text:
            return response.choices[0].message.content
        else:
            return response.choices[0].message.content, retrieved_texts
```
4. Initialize the RAG Pipeline, Load Data, and Get Results
```python
openai_client = OpenAI()
milvus_client = MilvusClient(uri="./milvus_demo.db")

my_rag = RAG(openai_client=openai_client, milvus_client=milvus_client)
```
As for the argument of `MilvusClient`:
- Setting the `uri` as a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
- If you have a large scale of data, you can set up a more performant Milvus server on Docker or Kubernetes.
- If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the Public Endpoint and API Key in Zilliz Cloud.

Download and load data, then prepare questions with ground truth:
```python
import urllib.request

url = "https://raw.githubusercontent.com/milvus-io/milvus/master/DEVELOPMENT.md"
file_path = "./Milvus_DEVELOPMENT.md"

if not os.path.exists(file_path):
    urllib.request.urlretrieve(url, file_path)
with open(file_path, "r") as file:
    file_text = file.read()

text_lines = file_text.split("# ")
my_rag.load(text_lines)
```
Collect RAG pipeline results for evaluation:
```python
import pandas as pd

question_list = [
    "what is the hardware requirements specification if I want to build Milvus and run from source code?",
    "What is the programming language used to write Knowhere?",
    "What should be ensured before running code coverage?",
]
ground_truth_list = [
    "If you want to build Milvus and run from source code, the recommended hardware requirements specification is:\n\n- 8GB of RAM\n- 50GB of free disk space.",
    "The programming language used to write Knowhere is C++.",
    "Before running code coverage, you should make sure that your code changes are covered by unit tests.",
]

contexts_list = []
answer_list = []
for question in tqdm(question_list, desc="Answering questions"):
    answer, contexts = my_rag.answer(question, return_retrieved_text=True)
    contexts_list.append(contexts)
    answer_list.append(answer)

df = pd.DataFrame(
    {
        "question": question_list,
        "contexts": contexts_list,
        "answer": answer_list,
        "ground_truth": ground_truth_list,
    }
)
```
5. Launch Phoenix and Run Evaluations
Start the Phoenix server and instrument OpenAI:
```python
import phoenix as px
from phoenix.trace.openai import OpenAIInstrumentor

session = px.launch_app()
OpenAIInstrumentor().instrument()
```
Run hallucination and QA evaluators:
```python
import nest_asyncio

from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals

nest_asyncio.apply()

eval_model = OpenAIModel(model="gpt-4o")
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_evaluator = QAEvaluator(eval_model)

df["context"] = df["contexts"]
df["reference"] = df["contexts"]
df.rename(columns={"question": "input", "answer": "output"}, inplace=True)

hallucination_eval_df, qa_eval_df = run_evals(
    dataframe=df,
    evaluators=[hallucination_evaluator, qa_evaluator],
    provide_explanation=True,
)
```
6. View Evaluation Results
```python
results_df = df.copy()
results_df["hallucination_eval"] = hallucination_eval_df["label"]
results_df["hallucination_explanation"] = hallucination_eval_df["explanation"]
results_df["qa_eval"] = qa_eval_df["label"]
results_df["qa_explanation"] = qa_eval_df["explanation"]
results_df.head()
```
Learn More
- Evaluation with Arize Phoenix — Official Milvus tutorial for RAG evaluation with Phoenix
- The Path to Production: LLM Application Evaluations and Observability — Zilliz blog on LLM evaluation and observability
- Top 10 RAG & LLM Evaluation Tools for AI Success — Zilliz tutorial on RAG evaluation tools
- Arize Phoenix Documentation — Official Phoenix documentation
- Arize Phoenix GitHub Repository — Phoenix source code and community resources