BentoML and Zilliz Cloud Integration
BentoML and Zilliz Cloud integrate to build end-to-end AI applications: BentoML's open-source AI inference platform serves and deploys machine learning models, while Zilliz Cloud's high-performance vector database provides scalable embedding storage and retrieval for RAG systems.
What is BentoML
BentoML is an open-source AI inference platform for serving and deploying machine learning models. It bridges development and operations by streamlining production deployment, encapsulating models, dependencies, and inference logic into standardized units called "Bentos." The platform offers high-performance API serving over HTTP, gRPC, and CLI interfaces, with deployment to Docker containers, Kubernetes, and cloud platforms. BentoCloud, its managed service, provides pre-built models, including Llama 3, Stable Diffusion, CLIP, and Sentence Transformers, with single-click deployment.
By integrating with Zilliz Cloud (fully managed Milvus), BentoML enables developers to convert unstructured data into vector embeddings using served models and store them in a scalable vector database for efficient retrieval, powering end-to-end RAG applications, semantic search, and recommendation systems with minimal infrastructure overhead.
Benefits of the BentoML + Zilliz Cloud Integration
- Single-click model deployment with scalable storage: BentoCloud provides instant deployment of state-of-the-art embedding and LLM models, while Zilliz Cloud handles the vector storage and retrieval at scale.
- Access to advanced models: BentoCloud offers immediate availability of cutting-edge AI models like Llama 3 and Sentence Transformers without training requirements, with embeddings efficiently stored and searched in Zilliz Cloud.
- Reduced infrastructure burden: Both managed services minimize setup and maintenance overhead, allowing teams to focus on building AI applications rather than managing infrastructure.
- Flexible deployment options: BentoML supports self-hosting via its open-source framework alongside BentoCloud's managed service, with Zilliz Cloud providing the vector database layer in either deployment model.
- Standardized model serving: BentoML standardizes model deployment across ML frameworks (PyTorch, TensorFlow, scikit-learn), while Zilliz Cloud provides a consistent vector storage interface regardless of the model framework used.
How the Integration Works
BentoML serves as the AI inference platform, hosting and serving embedding models and LLMs through BentoCloud or self-hosted deployments. It handles model serving via HTTP endpoints, enabling applications to generate embeddings from text data using models like Sentence Transformers and generate LLM responses using models like Llama 3.
Zilliz Cloud serves as the vector database layer, storing and indexing the embeddings generated by BentoML-served models. It provides high-performance similarity search with low latency, enabling efficient retrieval of the most relevant context from large collections.
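Similarity search at this layer ranks stored vectors by how close they are to the query vector under a chosen metric; this guide's collection uses cosine similarity. A minimal pure-Python sketch of the idea (the tiny 2-dimensional vectors and document names are illustrative only — real embeddings have hundreds of dimensions):

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
docs = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}

# Rank documents by cosine similarity to the query, most similar first
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # doc_a points almost the same way as the query, so it ranks first
```

A vector database performs this ranking over millions of vectors using approximate nearest-neighbor indexes rather than the exhaustive scan shown here.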
Together, BentoML and Zilliz Cloud create a complete RAG solution: BentoML serves embedding models that convert text data into vectors, which are stored in Zilliz Cloud. When a user asks a question, BentoML embeds the query, Zilliz Cloud retrieves the most relevant documents through similarity search, and BentoML's LLM service generates a contextually informed response based on the retrieved context.
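The flow above can be sketched with stand-in functions. Everything below is an illustrative placeholder, not the BentoML or Milvus API: `embed` fakes the served embedding model, `vector_search` fakes the Zilliz Cloud retrieval step, and `generate` fakes the served LLM.

```python
def embed(texts: list) -> list:
    # Stand-in for the BentoML-served embedding model (one fake 1-D vector per text)
    return [[float(len(t))] for t in texts]


def vector_search(query_vector: list, store: list, top_k: int = 2) -> list:
    # Stand-in for Zilliz Cloud similarity search: nearest by 1-D distance
    return sorted(store, key=lambda item: abs(item["vector"][0] - query_vector[0]))[:top_k]


def generate(prompt: str) -> str:
    # Stand-in for the BentoML-served LLM
    return f"Answer based on: {prompt}"


# Indexing phase: embed documents and store the vectors alongside the text
documents = ["Cambridge is in Massachusetts.", "Seattle is in Washington."]
store = [{"vector": v, "text": t} for v, t in zip(embed(documents), documents)]

# Query phase: embed the question, retrieve context, generate a grounded answer
question = "What state is Cambridge in?"
hits = vector_search(embed([question])[0], store)
context = " ".join(hit["text"] for hit in hits)
print(generate(f"Context: {context}\nQuestion: {question}"))
```

The remainder of this guide implements each of these three roles with the real services: `SyncHTTPClient` for embedding and generation, and `MilvusClient` for storage and search.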
Step-by-Step Guide
1. Install Required Packages
```shell
$ pip install -U pymilvus bentoml
```

2. Serving Embeddings with BentoML/BentoCloud
Import bentoml and set up an HTTP client using SyncHTTPClient by specifying the endpoint and, optionally, the token:

```python
import bentoml

BENTO_EMBEDDING_MODEL_END_POINT = "BENTO_EMBEDDING_MODEL_END_POINT"
BENTO_API_TOKEN = "BENTO_API_TOKEN"

embedding_client = bentoml.SyncHTTPClient(
    BENTO_EMBEDDING_MODEL_END_POINT, token=BENTO_API_TOKEN
)
```

3. Prepare and Process Data
Read files and preprocess the text, then download the city data:
```python
import os
import requests
import urllib.request


def chunk_text(filename: str) -> list:
    with open(filename, "r") as f:
        text = f.read()
    sentences = text.split("\n")
    return sentences


repo = "ytang07/bento_octo_milvus_RAG"
directory = "data"
save_dir = "./city_data"
api_url = f"https://api.github.com/repos/{repo}/contents/{directory}"

response = requests.get(api_url)
data = response.json()

if not os.path.exists(save_dir):
    os.makedirs(save_dir)

for item in data:
    if item["type"] == "file":
        file_url = item["download_url"]
        file_path = os.path.join(save_dir, item["name"])
        urllib.request.urlretrieve(file_url, file_path)
```

Process each file and generate embeddings:
```python
cities = os.listdir("city_data")
city_chunks = []
for city in cities:
    chunked = chunk_text(f"city_data/{city}")
    cleaned = []
    for chunk in chunked:
        if len(chunk) > 7:
            cleaned.append(chunk)
    mapped = {"city_name": city.split(".")[0], "chunks": cleaned}
    city_chunks.append(mapped)


def get_embeddings(texts: list) -> list:
    # The embedding endpoint accepts at most 25 sentences per call, so batch larger inputs
    if len(texts) > 25:
        splits = [texts[x : x + 25] for x in range(0, len(texts), 25)]
        embeddings = []
        for split in splits:
            embedding_split = embedding_client.encode(sentences=split)
            embeddings += embedding_split
        return embeddings
    return embedding_client.encode(sentences=texts)


entries = []
for city_dict in city_chunks:
    embedding_list = get_embeddings(city_dict["chunks"])
    for i, embedding in enumerate(embedding_list):
        entry = {
            "embedding": embedding,
            "sentence": city_dict["chunks"][i],
            "city": city_dict["city_name"],
        }
        entries.append(entry)
```

4. Insert Data into Milvus
Initialize a Milvus Lite client, create a collection with schema and index, and insert the data:
```python
from pymilvus import MilvusClient, DataType

COLLECTION_NAME = "Bento_Milvus_RAG"
DIMENSION = 384

milvus_client = MilvusClient("milvus_demo.db")

schema = MilvusClient.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=DIMENSION)

index_params = milvus_client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="AUTOINDEX",
    metric_type="COSINE",
)

if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)
milvus_client.create_collection(
    collection_name=COLLECTION_NAME, schema=schema, index_params=index_params
)

milvus_client.insert(collection_name=COLLECTION_NAME, data=entries)
```

As for the argument of `MilvusClient`: setting the `uri` to a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file. If you have a large amount of data, you can set up a more performant Milvus server on Docker or Kubernetes. If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the Public Endpoint and API Key in Zilliz Cloud.

5. Set Up the LLM and Build RAG
Deploy an LLM on BentoCloud and set up the RAG function:
```python
BENTO_LLM_END_POINT = "BENTO_LLM_END_POINT"

llm_client = bentoml.SyncHTTPClient(BENTO_LLM_END_POINT, token=BENTO_API_TOKEN)


def dorag(question: str, context: str):
    prompt = (
        f"You are a helpful assistant. The user has a question. "
        f"Answer the user question based only on the context: {context}. \n"
        f"The user question is {question}"
    )
    # The LLM endpoint streams tokens; concatenate them into the full response
    results = llm_client.generate(
        max_tokens=1024,
        prompt=prompt,
    )
    res = ""
    for result in results:
        res += result
    return res
```

6. Ask a Question with RAG
Search Milvus for relevant context and generate a response:
```python
question = "What state is Cambridge in?"


def ask_a_question(question):
    embeddings = get_embeddings([question])
    res = milvus_client.search(
        collection_name=COLLECTION_NAME,
        data=embeddings,
        anns_field="embedding",
        limit=5,
        output_fields=["sentence"],
    )
    sentences = []
    for hits in res:
        for hit in hits:
            sentences.append(hit["entity"]["sentence"])
    context = ". ".join(sentences)
    return context


context = ask_a_question(question=question)
print(dorag(question=question, context=context))
```

Learn More
- Retrieval-Augmented Generation (RAG) with Milvus and BentoML — Official Milvus tutorial for building RAG with BentoML
- RAG Without OpenAI: BentoML, OctoAI and Milvus — Zilliz blog on building RAG without OpenAI
- Infrastructure Challenges in Scaling RAG with Custom AI Models — Zilliz blog on scaling RAG with BentoML and Milvus
- BentoML Documentation — Official BentoML documentation
- BentoCloud Documentation — BentoCloud managed service documentation