Leveraging Milvus and Friendli Serverless Endpoints for Advanced RAG and Multi-Modal Queries
FriendliAI specializes in generative AI infrastructure, offering solutions that enable organizations to efficiently deploy and manage large language models (LLMs) and other generative AI models with optimized performance and reduced cost. Users can choose between production-ready LLMs accessible through APIs and custom fine-tuned LLMs deployed on hardware of their choice, whether in the public cloud or on private on-premises clusters.
Milvus is an open-source vector database that stores, indexes, and searches billion-scale unstructured data through high-dimensional vector embeddings. It is well suited to building modern AI applications such as retrieval-augmented generation (RAG), semantic search, multimodal search, and recommendation systems.
In this article, we'll explore how to use Milvus with Friendli Serverless Endpoints to perform Retrieval-Augmented Generation (RAG) on particular documents and materials and execute multi-modal queries that incorporate images and other visual content. This powerful combination allows for more sophisticated and context-aware AI applications.
Understanding RAG and Multi-Modal Models
Retrieval-Augmented Generation (RAG)
RAG is a technique that enhances language models by providing them with relevant information, primarily retrieved from a vector database-powered knowledge base. This approach allows AI models to generate more accurate and contextually appropriate responses by referencing designated external data sources.
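To make the retrieve-then-generate idea concrete, here is a minimal, dependency-free sketch. It assumes a toy bag-of-words "embedding" and an in-memory list in place of a real embedding model and vector database; the function names and the tiny knowledge base are illustrative only and are not part of Milvus or Friendli.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Place the retrieved passages in front of the question for the LLM."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Illustrative knowledge base (hypothetical passages).
KNOWLEDGE_BASE = [
    "Milvus stores metadata in etcd.",
    "Bananas are rich in potassium.",
    "Milvus stores inserted data in object storage.",
]

user_question = "where does milvus store metadata"
top = retrieve(user_question, KNOWLEDGE_BASE, k=2)
prompt = build_prompt(user_question, top)
print(prompt)
```

In a real pipeline the toy `embed` function is replaced by an embedding model, the list by a vector database, and the final prompt is sent to an LLM.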
Multi-Modal Models
Multimodal models can process and understand multiple types of input data, such as text, images, and audio. They can analyze and generate responses based on diverse information sources, enabling more comprehensive and nuanced interactions.
Why Combine RAG and Multi-Modal Models?
The combination of RAG and multi-modal capabilities significantly improves AI systems by providing the following features simultaneously:
- Allowing for more diverse and rich input types of the user’s choice
- Providing up-to-date information
- Enhancing accuracy and relevance of responses
- Enabling context-aware interactions
Hands-On Implementation
Let's dive into the practical implementation of RAG and multi-modal queries using the Milvus vector database and Friendli Serverless Endpoints.
Step 1: Install Prerequisites and Download Milvus Docs
First, we'll install the necessary libraries and download the Milvus documentation that we'll use for our RAG job:
!pip install --upgrade pymilvus requests tqdm langchain langchain-community langchain-huggingface langchain-openai friendli-client tiktoken
!wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip
!rm -rf milvus_docs
!unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs
Step 2: Process Documentation Files
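Next, we'll read the Milvus FAQ documentation files and use a simple splitting strategy: each file is split on the "# " marker, which roughly separates it into heading-delimited sections, and each section is treated as an individual chunk: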
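from glob import glob

text_lines = []

# Read every FAQ markdown file and split it on "# " so that each
# heading-delimited section becomes one chunk.
for file_path in glob("milvus_docs/en/faq/*.md", recursive=True):
    with open(file_path, "r") as file:
        file_text = file.read()

    text_lines += file_text.split("# ")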
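A plain `split("# ")` leaves an empty first chunk when a file starts with a heading, and stray `#` characters where deeper headings such as `##` were cut. The following is a small sketch of one way to tidy that up; the helper name `split_markdown_sections` is hypothetical and not part of the original notebook.

```python
def split_markdown_sections(markdown_text: str, separator: str = "# ") -> list[str]:
    """Split on the heading marker, trim whitespace and leftover '#'
    characters, and drop chunks that end up empty."""
    chunks = (chunk.strip(" \t\r\n#") for chunk in markdown_text.split(separator))
    return [chunk for chunk in chunks if chunk]


sample = "# Title\nIntro text.\n## Section A\nDetails about A.\n"
sections = split_markdown_sections(sample)
print(sections)
```

Dropping empty chunks also avoids embedding and storing empty strings later in the pipeline.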
Step 3: Prepare Embeddings
We'll use the Hugging Face embeddings integration with the small all-MiniLM-L6-v2 model to create vector representations of our text:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedding = HuggingFaceEmbeddings(model_name=embeddings_model_name)
test_embedding = embedding.embed_query("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])
Step 4: Set Up Milvus Client
Now, let's prepare the Milvus client for our RAG implementation. In this simple example, we use Milvus Lite, which runs locally and persists its data in a local file. You can also consider other Milvus deployment options:
- If you only need a local vector database for small-scale data or prototyping, setting the uri to a local file, e.g. ./milvus.db, is the most convenient method, as it automatically uses Milvus Lite to store all data in this file.
- For larger-scale data and traffic in production, you can set up a Milvus server on Docker or Kubernetes. In this setup, use the server address and port as your uri, e.g. http://localhost:19530. If you enable the authentication feature on Milvus, set the token to "<your_username>:<your_password>"; otherwise there is no need to set the token.
- You can also use fully managed Milvus on Zilliz Cloud. Simply set the uri and token to the Public Endpoint and API key of your Zilliz Cloud instance.
from pymilvus import MilvusClient
milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"
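One way to switch between these deployment options without editing code is to read the connection settings from the environment. This is only a sketch: the environment variable names `MILVUS_URI` and `MILVUS_TOKEN` and the helper `milvus_connection_kwargs` are assumptions made for illustration, not conventions defined by Milvus.

```python
import os
from typing import Mapping, Optional


def milvus_connection_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for MilvusClient from environment settings.

    MILVUS_URI and MILVUS_TOKEN are hypothetical variable names. Without a
    URI we fall back to a local Milvus Lite file, as in the example above.
    """
    env = os.environ if env is None else env
    kwargs = {"uri": env.get("MILVUS_URI", "./milvus_demo.db")}
    token = env.get("MILVUS_TOKEN")
    if token:  # only pass a token when authentication is actually configured
        kwargs["token"] = token
    return kwargs


local = milvus_connection_kwargs({})
server = milvus_connection_kwargs(
    {"MILVUS_URI": "http://localhost:19530", "MILVUS_TOKEN": "<your_username>:<your_password>"}
)
print(local)
print(server)

# Usage with the real client (requires pymilvus):
# milvus_client = MilvusClient(**milvus_connection_kwargs())
```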
Step 5: Create Milvus Collection
We'll create a fresh collection, first dropping any existing collection with the same name:
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)
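The collection uses the inner product ("IP") metric. As a rough illustration of what that implies, the sketch below shows that the inner product of two vectors equals their cosine similarity only when both are scaled to unit length. The example vectors are made up, and whether your embedding model already returns unit-length vectors is something to verify rather than assume.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

ip_raw = dot(a, b)                          # depends on vector lengths
cos = cosine(a, b)                          # depends only on direction
ip_unit = dot(normalize(a), normalize(b))   # matches the cosine value

print(ip_raw, cos, ip_unit)
```

If the vectors are not normalized, inner product scores also reflect vector magnitude, which can change the ranking compared with cosine similarity.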
Step 6: Embed and Insert Text into Milvus
Let's embed our text and insert it into the Milvus collection:
from tqdm import tqdm

data = []

for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": embedding.embed_query(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)
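The example above inserts all rows in a single call, which is fine for the FAQ files. For larger corpora, one common pattern is to insert in fixed-size batches. The helper below is a generic sketch; the name `batched` and the batch sizes shown are arbitrary choices for illustration, not Milvus recommendations.

```python
from typing import Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up rows standing in for the `data` list built above.
rows = [{"id": i, "text": f"chunk {i}"} for i in range(7)]
batch_sizes = [len(batch) for batch in batched(rows, 3)]
print(batch_sizes)

# Usage with the client from the previous steps (requires pymilvus):
# for batch in batched(data, 100):
#     milvus_client.insert(collection_name=collection_name, data=batch)
```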
Step 7: Perform RAG Query
Now we can ask a question and search for relevant data within our Milvus database:
question = "How is data stored in milvus?"
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        embedding.embed_query(question)
    ],
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))

context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
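The code above indexes `search_res[0]` because one query vector was sent, and each hit exposes a `distance` and an `entity` holding the requested output fields. The sketch below mimics that shape with hand-written data (the texts and scores are invented) and adds an optional score cutoff, so weakly related chunks are left out of the context. The helper `build_context` and the threshold value are assumptions for illustration.

```python
def build_context(search_res: list, min_score: float = 0.0) -> str:
    """Join the text of the hits for the first query, keeping only hits whose
    score is at least min_score (with the IP metric, higher means more similar)."""
    hits = search_res[0]  # results for the first (and only) query vector
    kept = [hit["entity"]["text"] for hit in hits if hit["distance"] >= min_score]
    return "\n".join(kept)


# Hand-written stand-in for a search result: one query, three hits.
mock_search_res = [[
    {"id": 12, "distance": 0.71, "entity": {"text": "Chunk about storage."}},
    {"id": 40, "distance": 0.64, "entity": {"text": "Chunk about metadata."}},
    {"id": 7, "distance": 0.22, "entity": {"text": "Loosely related chunk."}},
]]

filtered_context = build_context(mock_search_res, min_score=0.5)
print(filtered_context)
```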
Step 8: Create Prompts for RAG
Let's create the system and user prompts for our RAG query:
SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
Step 9: Set Up Friendli Token
Obtain your FRIENDLI_TOKEN from the Friendli Suite and set it as an environment variable:
import os

if "FRIENDLI_TOKEN" not in os.environ:
    os.environ["FRIENDLI_TOKEN"] = 'flp_FILL_IN_WITH_YOUR_OWN_PERSONAL_ACCESS_TOKEN'
Step 10: Execute RAG Query
Now we can execute our RAG query using the Friendli Serverless Endpoints:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(
    model="meta-llama-3.1-70b-instruct",
    base_url="https://api.friendli.ai/serverless/v1",
    api_key=os.environ["FRIENDLI_TOKEN"],
)

# USER_PROMPT was already filled in with the context and question in Step 8,
# so (assuming the retrieved context contains no literal curly braces) the
# template has no remaining variables and "input" is not substituted anywhere.
prompt = ChatPromptTemplate.from_messages([
    ("system", SYSTEM_PROMPT),
    ("user", USER_PROMPT),
])
output_parser = StrOutputParser()
chain = prompt | llm | output_parser

print(chain.invoke({"input": question}))
This produces the answer based on the provided documents:
In Milvus, data is stored in two forms: inserted data and metadata.
Inserted data (vector data, scalar data, and collection-specific schema) is stored in persistent storage as incremental logs. Milvus supports multiple object storage backends, including MinIO, AWS S3, Google Cloud Storage (GCS), Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage (COS).
Metadata, on the other hand, is generated within Milvus and is stored in etcd, with each Milvus module having its own metadata.
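One caveat with the prompts in Steps 8 and 10: the retrieved context is interpolated into the prompt string before it is handed to a prompt template. Template engines that use Python-style `{name}` placeholders (which, to our understanding, is the default for LangChain's ChatPromptTemplate) would treat any literal curly braces in the context as variables. The sketch below shows one way to guard against that by doubling braces in the context and leaving the question as a real placeholder. It uses plain `str.format` as a stand-in for the template engine, and the helper name and sample context are made up.

```python
def escape_braces(text: str) -> str:
    """Double curly braces so a format-style template treats them literally."""
    return text.replace("{", "{{").replace("}", "}}")


# Made-up retrieved context that happens to contain curly braces.
retrieved_context = 'Set params like {"nprobe": 10} when searching.'

# The context is escaped and baked in; {question} stays a real placeholder.
template = (
    "<context>\n"
    f"{escape_braces(retrieved_context)}\n"
    "</context>\n"
    "<question>\n"
    "{question}\n"
    "</question>"
)

rendered = template.format(question="How do I tune search?")
print(rendered)
```

With a template built this way, the question can be passed at invocation time instead of being fixed when the prompt string is created.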
Step 11: Multi-Modal Queries
For multi-modal queries, we'll use the Llama-3.2-11b-vision model:
from langchain_core.messages import HumanMessage

multimodalllm = ChatOpenAI(
    model="llama-3.2-11b-vision-instruct",
    base_url="https://api.friendli.ai/serverless/beta",
    api_key=os.environ["FRIENDLI_TOKEN"],
)

image_url = "https://milvus.io/docs/v2.4.x/assets/highly-decoupled-architecture.png"

# A multi-modal message combines a text part and an image part.
message = HumanMessage(
    content=[
        {"type": "text", "text": "describe what is in this image"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
)

response = multimodalllm.invoke([message])
print(response.content)
From its response, we can infer that the model correctly understands the image:
The image depicts a flowchart of the components of a system, with the following components:
**Coordinator Service**
* Root
* Query
…
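The example above passes a public image URL. For images that only exist locally, OpenAI-style chat APIs commonly accept a base64 data URL in the same `image_url` field. Whether the Friendli endpoint used here accepts data URLs is an assumption that this article does not confirm, so treat the following as a sketch to test against the service. The helper name is hypothetical, and the bytes below are just the PNG file signature rather than a real image.

```python
import base64


def image_bytes_to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes: the 8-byte PNG signature, not a complete image.
placeholder_png = b"\x89PNG\r\n\x1a\n"
data_url = image_bytes_to_data_url(placeholder_png)

# Same content structure as in the message above, with the data URL swapped in.
content = [
    {"type": "text", "text": "describe what is in this image"},
    {"type": "image_url", "image_url": {"url": data_url}},
]
print(content[1]["image_url"]["url"][:40])

# With a real file:
# with open("diagram.png", "rb") as f:
#     data_url = image_bytes_to_data_url(f.read())
```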
Step 12: Combine RAG and Multi-Modal Capabilities
Finally, let's combine the RAG and multi-modal capabilities:
question = "How is data stored in milvus with respect to this picture?"
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""
message = HumanMessage(
    content=[
        {"type": "text", "text": USER_PROMPT},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
)

response = multimodalllm.invoke([message])
print(response.content)
The model generates a response that draws on both the image and the documents:
**Step 1: Identify the components involved in storing data in Milvus.**
The components involved in storing data in Milvus include:
* Access Layer
* Message Storage
* Worker Node
**Step 2: Determine how data is stored in Milvus.**
Data is stored in the Access Layer and Message Storage.
**Step 3: Determine where data is stored in Milvus.**
Data is stored in both Access Layer and Message Storage.
**Answer:** Data is stored in both Access Layer and Message Storage.
For the full code and more details, check out this Colab notebook.
Conclusion
This tutorial has demonstrated how to leverage Milvus and Friendli Serverless Endpoints to implement advanced RAG and multi-modal queries. By combining these powerful technologies, you can create more sophisticated AI applications that can understand and process diverse types of information, leading to more accurate and context-aware responses.