vLLM and Zilliz Cloud Integration
vLLM and Zilliz Cloud integrate to build high-performance Retrieval-Augmented Generation (RAG) systems: vLLM provides optimized LLM inference with PagedAttention and up to 24x higher throughput, while Zilliz Cloud provides a scalable vector database for efficient context retrieval.
What is vLLM
vLLM is an open-source library started at UC Berkeley SkyLab, now an incubation-stage project at the LF AI & Data Foundation, focused on optimizing LLM serving performance. It uses efficient memory management with PagedAttention, continuous batching, and optimized CUDA kernels to improve serving performance by up to 24x while cutting GPU memory usage in half compared to traditional methods like HuggingFace Transformers and Text Generation Inference.
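The PagedAttention idea described above can be illustrated with a toy sketch: KV-cache memory is allocated in fixed-size blocks on demand, instead of reserving one contiguous region per request up front. The block and pool sizes here are made up for illustration; real vLLM manages actual GPU memory.

```python
BLOCK_SIZE = 4  # tokens per block (illustrative; not vLLM's real value)

class BlockPool:
    """A pool of free physical blocks, shared by all sequences."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

class Sequence:
    """Maps a sequence's logical KV positions to physical blocks on demand."""
    def __init__(self, pool):
        self.pool = pool
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the current one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

pool = BlockPool(num_blocks=8)
seq = Sequence(pool)
for _ in range(10):  # generate 10 tokens
    seq.append_token()
print(len(seq.block_table))  # 10 tokens occupy ceil(10/4) = 3 blocks
```

Because blocks are claimed only as tokens are generated, memory that a contiguous preallocation would waste stays in the pool for other sequences, which is where the roughly 50% memory saving comes from.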
By integrating with Zilliz Cloud (fully managed Milvus), vLLM's high-throughput LLM inference is paired with a scalable vector database for efficient retrieval of relevant context, enabling developers to build production-grade RAG systems that ground LLM responses in actual retrieved data to mitigate AI hallucinations.
Benefits of the vLLM + Zilliz Cloud Integration
- High-throughput RAG inference: vLLM's PagedAttention delivers up to 24x higher throughput for LLM inference, while Zilliz Cloud provides fast vector retrieval, creating an end-to-end high-performance RAG pipeline.
- Efficient GPU memory usage: vLLM reduces GPU memory usage by approximately 50% through virtual memory management for the KV cache, allowing more resources for handling larger knowledge bases stored in Zilliz Cloud.
- Grounded responses with reduced hallucinations: Zilliz Cloud retrieves relevant context from large knowledge bases, and vLLM generates responses grounded in this retrieved data, mitigating AI hallucinations.
- Open-source flexibility: Both vLLM and Milvus are open-source projects under the LF AI & Data Foundation, giving developers full control over their RAG infrastructure with managed options available through Zilliz Cloud.
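The throughput benefit comes largely from continuous batching. A toy simulation (with a made-up batch capacity of 2 sequences and fabricated request lengths) contrasts it with static batching, where the whole batch waits for its slowest request:

```python
requests = [3, 1, 4, 2]  # tokens each request still needs to generate

def static_batching(lengths, capacity=2):
    # The whole batch runs until its longest request finishes.
    steps = 0
    pending = list(lengths)
    while pending:
        group, pending = pending[:capacity], pending[capacity:]
        steps += max(group)  # everyone waits for the slowest
    return steps

def continuous_batching(lengths, capacity=2):
    # Finished requests leave immediately; waiting requests join right away.
    steps = 0
    running, pending = [], list(lengths)
    while running or pending:
        while pending and len(running) < capacity:
            running.append(pending.pop(0))
        running = [r - 1 for r in running if r > 1]  # one decode step each
        steps += 1
    return steps

print(static_batching(requests), continuous_batching(requests))
```

In this example continuous batching finishes in 5 decode steps versus 7 for static batching, because freed batch slots are refilled immediately instead of idling until the batch drains.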
How the Integration Works
vLLM serves as the LLM inference engine, providing optimized serving of large language models like Meta's Llama 3.1-8B. It uses PagedAttention for efficient memory management, continuous batching for high throughput, and optimized CUDA kernels for fast generation of contextually informed responses.
Zilliz Cloud serves as the vector database layer, storing and indexing document embeddings for fast similarity search. When user queries arrive, it retrieves the most relevant text chunks from the knowledge base to provide context for the LLM.
Together, vLLM and Zilliz Cloud create a complete RAG solution: documents are chunked, embedded using models like BAAI/bge-large-en-v1.5, and stored in Zilliz Cloud. When a user asks a question, the query is embedded and Zilliz Cloud retrieves relevant context through similarity search. This context is then passed to vLLM, which serves the LLM to generate accurate, grounded responses augmented by the retrieved text.
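The retrieve-then-generate flow above can be sketched end to end with toy components. Here the embedding model is faked with bag-of-words counts and generation is reduced to building the prompt; in the real pipeline these are BAAI/bge-large-en-v1.5 and vLLM respectively, as the steps below show.

```python
def embed(text):
    # Hypothetical stand-in for a real embedding model: word-count vectors.
    words = text.lower().split()
    return {w: words.count(w) for w in words}

def similarity(a, b):
    # Dot product over the shared vocabulary (cosine-like, unnormalized).
    return sum(a[w] * b.get(w, 0) for w in a)

knowledge_base = [
    "HNSW builds a multi-layer graph for approximate nearest neighbor search",
    "Milvus Lite stores all data in a single local file",
]
# Indexing: embed every chunk once, up front (Zilliz Cloud's role).
index = [(doc, embed(doc)) for doc in knowledge_base]

def retrieve(query, top_k=1):
    # Query time: embed the question and rank chunks by similarity.
    q = embed(query)
    ranked = sorted(index, key=lambda item: similarity(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

question = "How does HNSW search work?"
context = retrieve(question)
# The retrieved context is stuffed into the prompt for the LLM (vLLM's role).
prompt = f"Context: {context[0]}\nQuestion: {question}"
print(prompt)
```

The step-by-step guide that follows replaces each toy component with the production one while keeping this same shape.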
Step-by-Step Guide
1. Install Required Packages
```shell
pip install -U pymilvus

# (Recommended) Create a new conda environment.
conda create -n myenv python=3.11 -y
conda activate myenv

# Install vLLM with CUDA 12.1.
pip install -U vllm transformers torch
```
2. Prepare Your Dataset
Use the official Milvus documentation as the dataset and load the HTML files:
```python
from langchain.document_loaders import DirectoryLoader

path = "../../RAG/rtdocs_new/"
global_pattern = '*.html'
loader = DirectoryLoader(path=path, glob=global_pattern)
docs = loader.load()
print(f"loaded {len(docs)} documents")
```
3. Download an Embedding Model
Download a free, open-source embedding model from HuggingFace:
```python
import torch
from sentence_transformers import SentenceTransformer

N_GPU = torch.cuda.device_count()
# Use a GPU if one is available, otherwise fall back to CPU.
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_name = "BAAI/bge-large-en-v1.5"
encoder = SentenceTransformer(model_name, device=DEVICE)

EMBEDDING_DIM = encoder.get_sentence_embedding_dimension()
MAX_SEQ_LENGTH_IN_TOKENS = encoder.get_max_seq_length()
```
4. Chunk and Encode Your Data as Vectors
Use a fixed length of 512 characters with 10% overlap:
```python
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter

CHUNK_SIZE = 512
chunk_overlap = int(CHUNK_SIZE * 0.10)  # 10% overlap = 51 characters
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=chunk_overlap)
chunks = child_splitter.split_documents(docs)

list_of_strings = [doc.page_content for doc in chunks if hasattr(doc, 'page_content')]
embeddings = torch.tensor(encoder.encode(list_of_strings))
# Normalize each embedding to unit length so inner product equals cosine similarity.
embeddings = np.array(embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True))
converted_values = list(map(np.float32, embeddings))

dict_list = []
for chunk, vector in zip(chunks, converted_values):
    chunk_dict = {
        'chunk': chunk.page_content,
        'source': chunk.metadata.get('source', ""),
        'vector': vector,
    }
    dict_list.append(chunk_dict)
```
5. Save the Vectors in Milvus
Ingest the encoded vector embeddings into the Milvus vector database:
```python
from pymilvus import MilvusClient

mc = MilvusClient("milvus_demo.db")

COLLECTION_NAME = "MilvusDocs"
mc.create_collection(
    COLLECTION_NAME,
    EMBEDDING_DIM,
    consistency_level="Eventually",
    auto_id=True,
    overwrite=True)
mc.insert(COLLECTION_NAME, data=dict_list, progress_bar=True)
```
As for the argument of MilvusClient: setting the uri as a local file, e.g. ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file. If you have a large scale of data, you can set up a more performant Milvus server on Docker or Kubernetes. If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API Key in Zilliz Cloud.
6. Perform a Vector Search
Ask a question and search for the nearest neighbor chunks:
```python
import torch.nn.functional as F

SAMPLE_QUESTION = "What do the parameters for HNSW mean?"

# Embed the question as a batch of one so normalization over dim=1 is valid.
query_embeddings = torch.tensor(encoder.encode([SAMPLE_QUESTION]))
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
query_embeddings = list(map(np.float32, query_embeddings))

# Return the chunk text and source, but not the embedding vector itself.
OUTPUT_FIELDS = list(dict_list[0].keys())
OUTPUT_FIELDS.remove('vector')

TOP_K = 2
results = mc.search(
    COLLECTION_NAME,
    data=query_embeddings,
    output_fields=OUTPUT_FIELDS,
    limit=TOP_K,
    consistency_level="Eventually")
```
7. Run RAG Generation with vLLM and Llama 3.1
Instantiate a vLLM model instance and generate answers augmented by the retrieved text:
```python
from vllm import LLM, SamplingParams

MODELTORUN = "meta-llama/Meta-Llama-3.1-8B-Instruct"

torch.cuda.empty_cache()
llm = LLM(
    model=MODELTORUN,
    enforce_eager=True,
    dtype=torch.bfloat16,
    gpu_memory_utilization=0.5,
    max_model_len=1000,
    seed=415,
    max_num_batched_tokens=3000)
```
Write a prompt using contexts and sources retrieved from Milvus:
```python
# Collect the retrieved chunks and their sources; each hit exposes the
# requested output fields under 'entity'.
contexts = [hit['entity']['chunk'] for hit in results[0]]
sources = [hit['entity']['source'] for hit in results[0]]

contexts_combined = ' '.join(reversed(contexts))
# Deduplicate sources while preserving order.
source_combined = ' '.join(reversed(list(dict.fromkeys(sources))))

SYSTEM_PROMPT = f"""First, check if the provided Context is relevant to the user's question.
Second, only if the provided Context is strongly relevant, answer the question using the Context.
Otherwise, if the Context is not strongly relevant, answer the question without using the Context.
Be clear, concise, relevant. Answer clearly, in fewer than 2 sentences.
Grounding sources: {source_combined}
Context: {contexts_combined}
User's question: {SAMPLE_QUESTION}
"""

prompts = [SYSTEM_PROMPT]
sampling_params = SamplingParams(temperature=0.2, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Question: {SAMPLE_QUESTION!r}")
    print(f"Generated text: {generated_text!r}")
```
Learn More
- Building RAG with Milvus, vLLM, and Llama 3.1 — Official Milvus tutorial for building RAG with vLLM
- Building RAG with Milvus, vLLM, and Meta's Llama 3.1 — Zilliz blog on building RAG with vLLM
- Deploying a Multimodal RAG System with vLLM and Milvus — Zilliz blog on multimodal RAG with vLLM
- Building RAG Applications with Milvus, Qwen, and vLLM — Zilliz blog on RAG with Qwen and vLLM
- vLLM Documentation — Official vLLM documentation