Arm and Zilliz Cloud Integration
The Arm and Zilliz Cloud integration supports cost-effective RAG applications on Arm-based infrastructure. It combines Arm's energy-efficient processor architecture with Zilliz Cloud's high-performance vector database to run vector search workloads on Arm servers and cloud instances.
What is Arm
Arm designs processor architectures powering billions of devices worldwide. Their energy-efficient designs increasingly support data centers and cloud computing, delivering strong AI and machine learning performance. Server-grade Arm processors, such as AWS Graviton, offer cost-effective alternatives to traditional x86 architectures, with expanding adoption among cloud providers and enterprises for running ML workloads.
Zilliz Cloud (fully managed Milvus) runs natively on Arm architecture and handles vector storage, indexing, and similarity search. Paired with Arm-based infrastructure, it lets organizations run vector search and RAG workloads efficiently while benefiting from the cost savings of Arm's energy efficiency.
Benefits of the Arm + Zilliz Cloud Integration
- Cost-effective AI infrastructure: Arm's energy-efficient processors deliver strong AI performance at lower cost compared to traditional x86 architectures, and Zilliz Cloud runs natively on Arm, maximizing the cost savings for vector search workloads.
- Native Arm architecture support: Zilliz Cloud supports Arm architecture natively, enabling seamless deployment on Arm-based servers and cloud instances like AWS Graviton without compatibility concerns.
- Optimized vector search performance: The platform automatically manages resource allocation and optimizes query performance specifically for Arm processors, ensuring efficient vector similarity search operations.
- Local LLM inference on Arm CPUs: The integration supports running LLMs like Llama 3.1 directly on Arm-based CPUs using llama.cpp with quantized models, enabling complete RAG pipelines without GPU requirements.
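To make the "without GPU requirements" point more concrete, here is a rough back-of-the-envelope estimate of the weight memory of an 8B-parameter model in Q4_0. It is a sketch under stated assumptions: the usual Q4_0 block layout (32 weights stored as 4-bit values plus one 16-bit scale, so 4.5 bits per weight) and a round 8 billion parameters. Real GGUF files are typically somewhat larger, and the KV cache adds to the total at run time.

```python
# Rough weight-memory estimate for a Q4_0-quantized model (illustrative only).
Q4_0_BLOCK_WEIGHTS = 32    # weights per quantization block (assumed layout)
Q4_0_BLOCK_BYTES = 16 + 2  # 32 four-bit values (16 bytes) + one 16-bit scale


def q4_0_weight_gb(n_params):
    """Approximate size in GB (10^9 bytes) of n_params weights stored as Q4_0."""
    blocks = n_params / Q4_0_BLOCK_WEIGHTS
    return blocks * Q4_0_BLOCK_BYTES / 1e9


def fp16_weight_gb(n_params):
    """Approximate size in GB of the same weights stored as 16-bit floats."""
    return n_params * 2 / 1e9


N_PARAMS = 8_000_000_000  # assumption: a round 8B, not an exact parameter count

print(f"fp16: {fp16_weight_gb(N_PARAMS):.1f} GB, Q4_0: {q4_0_weight_gb(N_PARAMS):.1f} GB")
```

Under these assumptions the quantized weights need roughly a quarter of the fp16 footprint, which is what lets the model fit in the memory of a mid-sized CPU instance.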
How the Integration Works
Arm provides the processor architecture and infrastructure layer, powering servers and cloud instances like AWS Graviton that run both the vector database and LLM inference workloads. Arm CPUs support efficient execution of embedding models and quantized LLMs through optimizations like SVE 256 and MATMUL_INT8.
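Whether a given instance exposes these CPU features can be checked from Python. The sketch below parses /proc/cpuinfo-style text and looks for the Linux feature flags `sve` and `i8mm`; treating `i8mm` as the flag behind MATMUL_INT8 is an assumption, and the sample text is made up for illustration rather than copied from a real machine.

```python
def cpu_feature_flags(cpuinfo_text):
    """Collect the feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features":
            flags.update(value.split())
    return flags


def check_arm_features(cpuinfo_text, wanted=("sve", "i8mm")):
    """Map each wanted flag name to True/False depending on its presence."""
    flags = cpu_feature_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


# Made-up sample; on a real Arm Linux host, read the text from /proc/cpuinfo:
#     with open("/proc/cpuinfo") as f:
#         report = check_arm_features(f.read())
SAMPLE_CPUINFO = "processor\t: 0\nFeatures\t: fp asimd aes sve i8mm bf16\n"

report = check_arm_features(SAMPLE_CPUINFO)
print(report)
```

If either flag is missing, llama.cpp should still run, but it may fall back to slower code paths.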
Zilliz Cloud serves as the vector database layer running natively on Arm architecture, storing and indexing document embeddings for fast similarity search. It provides high-performance retrieval with low latency on Arm-based infrastructure, enabling efficient RAG applications.
Together, Arm and Zilliz Cloud form a cost-effective, end-to-end RAG platform. Documents are embedded by models running on Arm CPUs and stored in Zilliz Cloud's vector database deployed on Arm infrastructure. When a user queries the system, relevant context is retrieved through vector similarity search and passed to an LLM served by llama.cpp on Arm, so the whole pipeline runs on energy-efficient Arm processors.
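The flow described above (embed, store, retrieve, prompt) can be sketched in a few lines of plain Python. This is a toy illustration, not the integration itself: `toy_embed` is a hypothetical stand-in for a real embedding model, the in-memory list stands in for the Zilliz Cloud collection, and the three documents are invented for the example.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical embedding: L2-normalized word counts as a sparse vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def inner_product(a, b):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(value * b.get(word, 0.0) for word, value in a.items())


def retrieve(question, store, limit=2):
    """Return the texts of the `limit` entries most similar to the question."""
    query = toy_embed(question)
    ranked = sorted(
        store, key=lambda item: inner_product(query, item["vector"]), reverse=True
    )
    return [item["text"] for item in ranked[:limit]]


def build_prompt(question, passages):
    """Wrap retrieved passages and the question in simple tags for the LLM."""
    context = "\n".join(passages)
    return f"<context>\n{context}\n</context>\n<question>\n{question}\n</question>"


docs = [
    "zilliz cloud stores and indexes document embeddings",
    "arm designs energy-efficient processor architectures",
    "llama.cpp serves quantized llms on arm cpus",
]
# Same record shape as the guide's insert step: id, vector, text.
store = [{"id": i, "vector": toy_embed(d), "text": d} for i, d in enumerate(docs)]

question = "what stores document embeddings"
top = retrieve(question, store, limit=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the real pipeline the embedding model, the vector store, and the LLM call replace `toy_embed`, `store`, and the final `print`, but the data flow stays the same.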
Step-by-Step Guide
1. Set Up the Arm-based Environment
We recommend using AWS Graviton instances (e.g., c7g.2xlarge with Ubuntu 22.04 LTS). You need at least four cores, 8GB of RAM, and 32GB of disk storage.
Install Python and create a virtual environment:

    $ sudo apt update
    $ sudo apt install python-is-python3 python3-pip python3-venv -y
    $ python -m venv venv
    $ source venv/bin/activate

Install the required Python dependencies:

    $ pip install --upgrade pymilvus openai requests langchain-huggingface huggingface_hub tqdm

2. Create the Collection on Zilliz Cloud
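This step needs the cluster's Public Endpoint and API Key. As an optional alternative to pasting them into a script, the sketch below reads them from environment variables and fails early if either is missing. The variable names `ZILLIZ_URI` and `ZILLIZ_TOKEN` and the helper `load_zilliz_credentials` are assumptions made for this example, not names defined by Zilliz Cloud or pymilvus.

```python
import os


def load_zilliz_credentials(env=None):
    """Read the endpoint and API key from environment-style variables.

    ZILLIZ_URI and ZILLIZ_TOKEN are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    values = {
        "uri": env.get("ZILLIZ_URI", "").strip(),
        "token": env.get("ZILLIZ_TOKEN", "").strip(),
    }
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"missing Zilliz Cloud settings: {', '.join(missing)}")
    return values


# Example with an explicit mapping instead of the real environment.
example = load_zilliz_credentials(
    {"ZILLIZ_URI": " https://example.invalid ", "ZILLIZ_TOKEN": "placeholder"}
)
print(example["uri"])
```

The returned dictionary uses the same `uri` and `token` keyword names that the client constructor in this step takes, so it could be unpacked into that call.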
Set uri and token to the Public Endpoint and API Key from Zilliz Cloud:

    from pymilvus import MilvusClient

    # Connect to the Zilliz Cloud cluster.
    milvus_client = MilvusClient(
        uri="<your_zilliz_public_endpoint>", token="<your_zilliz_api_key>"
    )

    collection_name = "my_rag_collection"

    # Start from a clean collection if one with this name already exists.
    if milvus_client.has_collection(collection_name):
        milvus_client.drop_collection(collection_name)

    milvus_client.create_collection(
        collection_name=collection_name,
        dimension=384,  # output dimension of all-MiniLM-L6-v2
        metric_type="IP",  # inner product
        consistency_level="Strong",
    )

3. Prepare the Data and Insert Embeddings
Download the Milvus documentation and load the FAQ pages:

    $ wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
    $ unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs

Read each FAQ file and split it into chunks at the "# " heading markers:

    from glob import glob

    text_lines = []

    for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
        with open(file_path, "r") as file:
            file_text = file.read()

        text_lines += file_text.split("# ")

Prepare the embedding model and insert data:

    from langchain_huggingface import HuggingFaceEmbeddings
    from tqdm import tqdm

    embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

    data = []

    # Embed all chunks, then pair each chunk with its vector.
    text_embeddings = embedding_model.embed_documents(text_lines)

    for i, (line, embedding) in enumerate(
        tqdm(zip(text_lines, text_embeddings), desc="Creating embeddings")
    ):
        data.append({"id": i, "vector": embedding, "text": line})

    milvus_client.insert(collection_name=collection_name, data=data)

4. Build and Launch llama.cpp on Arm
Install build tools, then clone and build llama.cpp:

    $ sudo apt install make cmake -y
    $ sudo apt install gcc g++ -y
    $ sudo apt install build-essential -y
    $ git clone https://github.com/ggerganov/llama.cpp
    $ cd llama.cpp
    $ make GGML_NO_LLAMAFILE=1 -j$(nproc)

Download and re-quantize the model for optimal Arm performance:

    $ huggingface-cli download cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --local-dir . --local-dir-use-symlinks False
    $ ./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8

This requantization is optimal specifically for Graviton3. For Graviton2, use the Q4_0_4_4 format, and for Graviton4, use the Q4_0_4_8 format.
Start the LLM server:

    $ ./llama-server -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -n 2048 -t 64 -c 65536 --port 8080

Here -n caps the number of tokens generated per request, -t sets the number of CPU threads, and -c sets the context size in tokens. Adjust -t to the vCPU count of your instance.
5. Perform RAG Query
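Loading an 8B model can take a while, so it may help to wait until the server responds before sending the first query. The sketch below polls a health URL until it returns HTTP 200. It assumes the llama-server build exposes a `/health` endpoint on the port used above; `wait_for_server` and its parameters are hypothetical helpers written for this example.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. an error status while the model is still loading
    except (urllib.error.URLError, OSError):
        return None


def wait_for_server(
    url="http://localhost:8080/health",  # assumed endpoint, see lead-in
    attempts=30,
    delay=2.0,
    fetch=http_status,
    sleep=time.sleep,
):
    """Poll url until it answers with 200; return False after `attempts` tries."""
    for _ in range(attempts):
        if fetch(url) == 200:
            return True
        sleep(delay)
    return False


# Dry run with a fake fetch function, so no server is needed to try the logic:
fake_statuses = iter([None, 503, 200])
ready = wait_for_server(fetch=lambda _url: next(fake_statuses), sleep=lambda _s: None)
print(ready)
```

Against a live server, calling `wait_for_server()` with the defaults polls for up to about a minute before giving up.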
Initialize the LLM client, search for relevant context, and generate a response:

    from openai import OpenAI

    # Point the OpenAI client at the local llama-server endpoint.
    llm_client = OpenAI(base_url="http://localhost:8080/v1", api_key="no-key")

    question = "How is data stored in milvus?"

    # Retrieve the three chunks most similar to the question.
    search_res = milvus_client.search(
        collection_name=collection_name,
        data=[embedding_model.embed_query(question)],
        limit=3,
        search_params={"metric_type": "IP", "params": {}},
        output_fields=["text"],
    )

    retrieved_lines_with_distances = [
        (res["entity"]["text"], res["distance"]) for res in search_res[0]
    ]
    context = "\n".join(
        [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
    )

    SYSTEM_PROMPT = """
    Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
    """
    USER_PROMPT = f"""
    Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
    <context>
    {context}
    </context>
    <question>
    {question}
    </question>
    """

    # Send the retrieved context and the question to the local LLM.
    response = llm_client.chat.completions.create(
        model="not-used",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT},
        ],
    )
    print(response.choices[0].message.content)

Learn More
- Build RAG on Arm Architecture — Official Milvus tutorial for building RAG on Arm
- Arm Learning Path: Servers and Cloud Computing with Milvus — Arm official learning path for Milvus
- Building RAG with Milvus, vLLM, and Meta's Llama 3.1 — Zilliz blog on building RAG with Llama 3.1
- llama.cpp GitHub Repository — llama.cpp source code for LLM inference
- Milvus Installation Documentation — Milvus installation guide including Arm support


