vLLM
Zilliz Cloud and vLLM enable efficient RAG systems with vector search and LLM inference.
About vLLM
vLLM is an open-source library for large language model (LLM) inference and serving, developed at UC Berkeley SkyLab. It focuses on optimizing LLM serving performance through efficient memory management, continuous batching, and optimized CUDA kernels. vLLM's PagedAttention technology improves serving performance by up to 24x while reducing GPU memory usage by half compared to traditional methods.
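To make this concrete, here is a minimal sketch of vLLM's offline batched inference API; the model name and sampling settings below are illustrative placeholders, not requirements of the integration.

```python
# Minimal sketch of vLLM offline batched inference.
# The model name and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

# Load a model; vLLM manages the KV cache internally with PagedAttention.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

# generate() continuously batches prompts for high throughput.
outputs = llm.generate(["What is PagedAttention?"], sampling)
print(outputs[0].outputs[0].text)
```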
Why Zilliz Cloud and vLLM
Combining Zilliz Cloud and vLLM creates a powerful solution for building high-performance Retrieval Augmented Generation (RAG) systems. Zilliz Cloud, based on the Milvus vector database, provides efficient vector storage and retrieval capabilities essential for RAG applications. vLLM complements this by offering optimized LLM inference and serving.
This integration allows developers to build RAG systems that can efficiently retrieve relevant information from large datasets stored in Zilliz Cloud and generate high-quality responses using vLLM's optimized LLM serving. The combination addresses common challenges in AI applications, such as AI hallucinations, by grounding LLM responses in accurate, retrieved information.
How Zilliz Cloud and vLLM work together
The integration of Zilliz Cloud and vLLM works by leveraging the strengths of both technologies in a RAG system. First, text data is embedded and stored as vector embeddings in Zilliz Cloud. When a user query is received, Zilliz Cloud performs efficient vector similarity search to retrieve the most relevant text chunks from its knowledge base.
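As a sketch of this retrieval step, the following uses the pymilvus `MilvusClient` against Zilliz Cloud; the endpoint, API key, collection name (`rag_docs`), and embedding model are assumptions chosen for illustration.

```python
# Retrieval sketch against Zilliz Cloud. The endpoint, token, collection
# name, and embedding model are placeholder assumptions.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

client = MilvusClient(
    uri="https://<your-cluster>.zillizcloud.com",  # your Zilliz Cloud endpoint
    token="<your-api-key>",
)

# Embed the user query and retrieve the most similar stored text chunks.
query = "How does PagedAttention reduce GPU memory waste?"
hits = client.search(
    collection_name="rag_docs",
    data=[encoder.encode(query).tolist()],
    limit=3,
    output_fields=["text"],
)
context_chunks = [hit["entity"]["text"] for hit in hits[0]]
```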
These retrieved text chunks are then passed to vLLM, which uses them to augment the context for the LLM (such as Meta's Llama 3.1). vLLM's optimized serving technology, including PagedAttention for efficient memory management, enables fast and resource-efficient LLM inference. The LLM then generates a response based on both the user query and the retrieved context, resulting in more accurate and contextually relevant answers.
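Continuing the sketch, the retrieved chunks can be folded into the prompt before generation; the prompt template here is an illustrative assumption, not a fixed part of the integration.

```python
# Generation sketch: augment the prompt with retrieved context, then call vLLM.
# The prompt template below is an illustrative assumption.
from vllm import LLM, SamplingParams

# Placeholders standing in for the retrieval sketch above.
query = "How does PagedAttention reduce GPU memory waste?"
context_chunks = ["<chunk retrieved from Zilliz Cloud>", "<another chunk>"]

context = "\n".join(context_chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
result = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=256))
print(result[0].outputs[0].text)
```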
Learn
The best way to get started is with a hands-on tutorial that walks you through building a large language model application with vLLM and Zilliz Cloud.
Tutorial: [Build and Perform RAG-Retrieval with Milvus and vLLM](https://milvus.io/docs/milvus_rag_with_vllm.md)
And here are a few more resources:
- Blog: Building RAG with Milvus, vLLM, and Llama 3.1
- [vLLM GitHub repository](https://github.com/vllm-project/vllm)
- 2023 vLLM paper on PagedAttention
- 2023 vLLM presentation at Ray Summit
- vLLM blog: vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention
- A helpful blog on running the vLLM server: Deploying vLLM: a Step-by-Step Guide
- The Llama 3 Herd of Models (Meta AI Research)