Arm and Zilliz Cloud Integration

Arm and Zilliz Cloud work together for building cost-effective RAG applications on Arm-based infrastructure. The integration pairs Arm's energy-efficient processor architecture with Zilliz Cloud's vector database to run vector search workloads on Arm servers and cloud instances.

What is Arm
Arm designs processor architectures that power billions of devices worldwide. Its energy-efficient designs are increasingly used in data centers and cloud computing, including for AI and machine learning workloads. Server-grade Arm processors, such as AWS Graviton, offer a cost-effective alternative to traditional x86 architectures and are seeing growing adoption among cloud providers and enterprises running ML workloads.

Paired with Zilliz Cloud (fully managed Milvus), Arm-based infrastructure lets organizations run vector search and RAG workloads at lower cost thanks to Arm's energy efficiency, while Zilliz Cloud handles vector storage, indexing, and similarity search natively on Arm architecture.
Benefits of the Arm + Zilliz Cloud Integration
- Cost-effective AI infrastructure: Arm's energy-efficient processors deliver strong AI performance at lower cost compared to traditional x86 architectures, and Zilliz Cloud runs natively on Arm, maximizing the cost savings for vector search workloads.
- Native Arm architecture support: Zilliz Cloud supports Arm architecture natively, enabling seamless deployment on Arm-based servers and cloud instances like AWS Graviton without compatibility concerns.
- Optimized vector search performance: The platform automatically manages resource allocation and optimizes query performance specifically for Arm processors, ensuring efficient vector similarity search operations.
- Local LLM inference on Arm CPUs: The integration supports running LLMs like Llama 3.1 directly on Arm-based CPUs using llama.cpp with quantized models, enabling complete RAG pipelines without GPU requirements.
How the Integration Works
Arm provides the processor architecture and infrastructure layer, powering servers and cloud instances like AWS Graviton that run both the vector database and LLM inference workloads. Arm CPUs support efficient execution of embedding models and quantized LLMs through optimizations like SVE 256 and MATMUL_INT8.

Zilliz Cloud serves as the vector database layer running natively on Arm architecture, storing and indexing document embeddings for fast similarity search. It provides high-performance retrieval with low latency on Arm-based infrastructure, enabling efficient RAG applications.

Together, Arm and Zilliz Cloud form a cost-effective, end-to-end RAG platform. Documents are embedded using models running on Arm CPUs and stored in Zilliz Cloud's vector database deployed on Arm infrastructure. When a user queries the system, relevant context is retrieved through vector similarity search and passed to an LLM served by llama.cpp, so the whole pipeline runs on Arm processors.
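
The retrieval step in this flow can be illustrated without any services running. The following is a minimal sketch in plain Python, assuming made-up three-dimensional vectors in place of real embeddings; the helper names inner_product and top_k are hypothetical and are not part of pymilvus or Zilliz Cloud. It ranks stored texts by inner product against a query vector and joins the best matches into a context string, which is the role vector search plays in the pipeline.

```python
# Toy illustration of inner-product (IP) retrieval. The vectors and texts
# below are made up and stand in for real embeddings stored in Zilliz Cloud.

def inner_product(a, b):
    """Return the inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, corpus, k=2):
    """Return the k (text, score) pairs with the highest inner product."""
    scored = [(text, inner_product(query, vec)) for text, vec in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


corpus = [
    ("doc about storage", [0.9, 0.1, 0.0]),
    ("doc about indexing", [0.1, 0.9, 0.0]),
    ("doc about pricing", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.0, 0.0]  # hypothetical embedding of a storage question

hits = top_k(query, corpus, k=2)
context = "\n".join(text for text, _ in hits)
print(context)
```

In the real pipeline the scan over the corpus is replaced by an indexed search inside Zilliz Cloud, and the resulting context is placed into the LLM prompt.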

Step-by-Step Guide

1. Set Up the Arm-based Environment

We recommend using AWS Graviton instances (e.g., c7g.2xlarge with Ubuntu 22.04 LTS). You need at least four cores, 8 GB of RAM, and 32 GB of disk storage.

Install Python and create a virtual environment:

$ sudo apt update
$ sudo apt install python-is-python3 python3-pip python3-venv -y
$ python -m venv venv
$ source venv/bin/activate
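
Before going further, it can be useful to confirm that the machine really is 64-bit Arm and to see which CPU features Linux reports. The sketch below uses only the standard library. It assumes that the kernel flags sve and i8mm in /proc/cpuinfo correspond to the SVE and MATMUL_INT8 optimizations mentioned earlier; that mapping is an assumption of this example, and the helper names are hypothetical.

```python
import platform


def is_arm64(machine=None):
    """Return True if the machine string denotes a 64-bit Arm CPU."""
    machine = (machine or platform.machine()).lower()
    return machine in ("aarch64", "arm64")


def cpu_features(cpuinfo_text):
    """Collect the flags listed on 'Features' lines of /proc/cpuinfo text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Features":
            features.update(value.split())
    return features


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read /proc/cpuinfo, returning an empty string if it is unavailable."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        return ""


features = cpu_features(read_cpuinfo())
print("Arm64 machine:", is_arm64())
print("SVE reported:", "sve" in features)
print("Int8 matmul (i8mm) reported:", "i8mm" in features)
```

On a non-Arm or non-Linux machine the script still runs and simply reports False for each check.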

Install the required Python dependencies:

$ pip install --upgrade pymilvus openai requests langchain-huggingface huggingface_hub tqdm

2. Create the Collection on Zilliz Cloud

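Set uri and token to the Public Endpoint and API Key of your Zilliz Cloud cluster. The dimension of 384 matches the output size of the all-MiniLM-L6-v2 embedding model used in step 3: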

from pymilvus import MilvusClient

milvus_client = MilvusClient(
    uri="<your_zilliz_public_endpoint>", token="<your_zilliz_api_key>"
)

collection_name = "my_rag_collection"

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=384,
    metric_type="IP",
    consistency_level="Strong",
)
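
Rather than pasting the endpoint and API key into the script, one option is to read them from environment variables. This is a minimal sketch, assuming the variable names ZILLIZ_URI and ZILLIZ_TOKEN, which are hypothetical names chosen for this example and are not defined by pymilvus or Zilliz Cloud.

```python
import os


def load_zilliz_credentials(env=None):
    """Return (uri, token) read from environment variables.

    ZILLIZ_URI and ZILLIZ_TOKEN are hypothetical names used only in this
    example; pick whatever names fit your deployment.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("ZILLIZ_URI", "ZILLIZ_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["ZILLIZ_URI"], env["ZILLIZ_TOKEN"]
```

The returned pair can then be passed to the client as MilvusClient(uri=uri, token=token), which keeps secrets out of source control.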

3. Prepare the Data and Insert Embeddings

Download the Milvus FAQ documentation and load it:

$ wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
$ unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs

from glob import glob

text_lines = []

for file_path in glob("milvus_docs/en/faq/*.md"):
    with open(file_path, "r", encoding="utf-8") as file:
        file_text = file.read()

    text_lines += file_text.split("# ")
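
Splitting on "# " can produce empty strings, for example when a file begins with a heading, and those empty chunks would still be embedded and inserted. A small variant that drops them is sketched below; split_markdown is a hypothetical helper, and note that unlike the loop above it also strips surrounding whitespace from each chunk.

```python
def split_markdown(text, separator="# "):
    """Split text on the separator and drop empty or whitespace-only chunks."""
    return [chunk.strip() for chunk in text.split(separator) if chunk.strip()]


# Made-up sample text with two top-level headings.
sample = "# A\nalpha\n# B\nbeta\n"
chunks = split_markdown(sample)
print(chunks)
```

To use it in the loop above, text_lines += split_markdown(file_text) would replace the plain split call.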

Prepare the embedding model and insert data:

from langchain_huggingface import HuggingFaceEmbeddings
from tqdm import tqdm

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

data = []
text_embeddings = embedding_model.embed_documents(text_lines)

for i, (line, embedding) in enumerate(
    tqdm(zip(text_lines, text_embeddings), total=len(text_lines), desc="Creating embeddings")
):
    data.append({"id": i, "vector": embedding, "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
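
The FAQ dataset is small enough to insert in one call. For larger corpora, sending rows in batches keeps each request to a manageable size. The helper below is a hedged sketch: batched_rows is a hypothetical name, the rows are made up, and the right batch size depends on vector dimension and payload size.

```python
def batched_rows(rows, batch_size):
    """Yield consecutive slices of rows, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical rows shaped like the data list built above.
rows = [{"id": i, "vector": [0.0] * 4, "text": f"chunk {i}"} for i in range(5)]
sizes = [len(batch) for batch in batched_rows(rows, batch_size=2)]
print(sizes)
```

Each batch can then be passed to milvus_client.insert(collection_name=collection_name, data=batch) inside a loop.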

4. Build and Launch llama.cpp on Arm

Install build tools, clone and build llama.cpp:

$ sudo apt install make cmake -y
$ sudo apt install gcc g++ -y
$ sudo apt install build-essential -y

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ make GGML_NO_LLAMAFILE=1 -j$(nproc)

Download the model and requantize it for better performance on Arm:

$ huggingface-cli download cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --local-dir . --local-dir-use-symlinks False

$ ./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8

The Q4_0_8_8 format used here targets Graviton3. On Graviton2, use Q4_0_4_4 instead; on Graviton4, use Q4_0_4_8.
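
The per-processor guidance can be captured in a small lookup so the right command is generated for each instance type. This is only a sketch: the dictionary keys and the requantize_command helper are names invented for this example, and the format values are copied from the guidance above.

```python
# Formats taken from the guidance above; the key names are this sketch's own.
QUANT_FORMAT_BY_CPU = {
    "graviton2": "Q4_0_4_4",
    "graviton3": "Q4_0_8_8",
    "graviton4": "Q4_0_4_8",
}


def requantize_command(cpu, source_gguf):
    """Build the llama-quantize argument list for a Graviton generation."""
    fmt = QUANT_FORMAT_BY_CPU[cpu.lower()]
    # Derive the output file name by swapping the format suffix.
    target = source_gguf.replace("Q4_0.gguf", f"{fmt}.gguf")
    return ["./llama-quantize", "--allow-requantize", source_gguf, target, fmt]


print(" ".join(requantize_command("graviton3", "dolphin-2.9.4-llama3.1-8b-Q4_0.gguf")))
```

The returned list can be run with subprocess.run from inside the llama.cpp directory, or printed and pasted into the shell.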

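Start the LLM server. The -t flag sets the number of threads, -c the context size, and -n the maximum number of tokens to generate; consider lowering -t to match your instance's vCPU count: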

$ ./llama-server -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -n 2048 -t 64 -c 65536 --port 8080
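
Loading an 8B model takes a moment, so it helps to confirm the server is accepting requests before sending queries. The sketch below polls the server's /health endpoint using only the standard library. The helper name server_is_ready and the default timeout are assumptions of this example.

```python
import urllib.request


def server_is_ready(base_url="http://localhost:8080", timeout=2.0):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors such as 503,
        # since urllib's URLError and HTTPError both derive from OSError.
        return False
```

Calling server_is_ready() in a short retry loop with time.sleep between attempts gives a simple startup wait.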

5. Perform RAG Query

Initialize the LLM client, search for relevant context, and generate a response:

from openai import OpenAI

llm_client = OpenAI(base_url="http://localhost:8080/v1", api_key="no-key")

question = "How is data stored in milvus?"

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[embedding_model.embed_query(question)],
    limit=3,
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],
)

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]

context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)

SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

response = llm_client.chat.completions.create(
    model="not-used",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)
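
The retrieved_lines_with_distances list is built above but only its text is used. Inspecting the scores alongside the text can help judge whether the retrieved context is relevant. A minimal sketch, using made-up hits shaped like the items of search_res[0]; format_hits is a hypothetical helper.

```python
def format_hits(hits):
    """Return one 'distance<TAB>first line of text' string per search hit."""
    lines = []
    for hit in hits:
        text = hit["entity"]["text"].strip()
        first_line = text.splitlines()[0] if text else ""
        lines.append(f"{hit['distance']:.3f}\t{first_line}")
    return lines


# Made-up hits shaped like the items of search_res[0] used above.
sample_hits = [
    {"entity": {"text": "Where does Milvus store data?\nDetails..."}, "distance": 0.71},
    {"entity": {"text": "How do I flush data?\nDetails..."}, "distance": 0.52},
]
for line in format_hits(sample_hits):
    print(line)
```

With the IP metric used in this guide, a larger distance value means a closer match, so the most relevant chunk appears first.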

Learn More
- Build RAG on Arm Architecture — Official Milvus tutorial for building RAG on Arm
- Arm Learning Path: Servers and Cloud Computing with Milvus — Arm official learning path for Milvus
- Building RAG with Milvus, vLLM, and Meta's Llama 3.1 — Zilliz blog on building RAG with Llama 3.1
- llama.cpp GitHub Repository — llama.cpp source code for LLM inference
- Milvus Installation Documentation — Milvus installation guide including Arm support
