Getting Started with Milvus Hybrid Search
With the release of Milvus 2.4, we introduced multi-vector search, also called hybrid search. It lets users query multiple vector fields simultaneously and merge the results with re-ranking strategies.
Hybrid search is the process of searching across several vector fields within the same dataset. These vectors can represent different facets of the data, come from different embedding models, or result from distinct data processing methods. The results from each field are then combined using re-rankers.
In this tutorial, you will learn how to use Milvus 2.4’s hybrid search capabilities to improve your search results. We’ll cover how to:
Create sparse embeddings
Create dense embeddings
Index your data in Milvus
Perform a multi-vector search on a single collection
This tutorial uses the ESCI dataset, a product search dataset from Amazon. We’ll also use the BGE-M3 model via the pymilvus[model] extra, which makes it easy to generate both sparse and dense embeddings from Python.
Introduction to Hybrid Search
Hybrid search combines the strengths of sparse and dense retrieval methods to improve search results. Sparse retrieval methods match exact keywords well but tend to miss synonyms and paraphrases, while dense methods capture meaning but can overlook exact terms. Hybrid search addresses these limitations by merging the two, enabling more accurate search outcomes.
By leveraging the complementary strengths of sparse and dense vectors, hybrid search can capture both the exact matches and the semantic nuances of the data. This results in a more comprehensive and precise search experience. Whether you’re dealing with text, images, or other types of data, hybrid search can significantly improve the relevance and quality of your search results.
Understanding Vector Search
Vector search is a type of search that uses vectors to represent data. Vectors are mathematical representations that capture the essence of data points, allowing for the calculation of similarities between them. In vector search, both the data and the search queries are transformed into vectors. The similarity between the query vector and the data vectors is then computed, and the most similar data points are returned as search results.
This approach is particularly useful for searching large datasets, as it enables efficient and scalable retrieval of relevant information. Vector search can be applied to various domains, including image and text search, where it excels in finding semantically similar items. By leveraging the power of vectors, you can achieve more accurate and meaningful search results.
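To make the similarity computation concrete, here is a minimal sketch of a brute-force vector search in plain Python. The three-dimensional vectors and the names `cosine_similarity`, `top_k`, and `toy_vectors` are made up for illustration; real embedding models produce hundreds or thousands of dimensions, and Milvus uses indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy embeddings, not produced by any real model.
toy_vectors = {
    "bathroom fan": [0.9, 0.1, 0.0],
    "ceiling fan": [0.8, 0.3, 0.1],
    "envelopes": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]
toy_hits = top_k(toy_query, toy_vectors, k=2)
```

The two fan documents point in nearly the same direction as the query, so they rank ahead of the envelope document.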
Install and Import the Needed Libraries
! pip install pymilvus[model] datasets
import pandas as pd
from datasets import load_dataset
from pymilvus import (
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
    AnnSearchRequest,
    RRFRanker,
    connections,
)
from pymilvus.model.hybrid import BGEM3EmbeddingFunction
Prepare the Dataset
The ESCI dataset is designed for the semantic matching of queries and products. In this section, we'll prepare the ESCI dataset. We'll focus on selecting a subset of the data and ensuring it's clean and ready for processing.
Download and Select a Subset
dataset = load_dataset("tasksource/esci", split="train")
dataset = dataset.select(range(500))
dataset = dataset.filter(lambda x: x["product_locale"] == "us")
dataset
Clean the data
Cleaning the dataset prevents duplicates and missing values from degrading the search results.
source_df = dataset.to_pandas()
df = source_df.drop_duplicates(
    subset=["product_text", "product_title", "product_bullet_point", "product_brand"]
)
# Drop rows with missing values
df = df.dropna(
    subset=["product_text", "product_title", "product_bullet_point", "product_brand"]
)
df.head()
Here’s a quick look at what the data includes:
|   | example_id | query | query_id | product_id | product_locale | esci_label | small_version | large_version | product_title | product_description | product_bullet_point | product_brand | product_color | product_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | revent 80 cfm | 0 | B000MOO21W | us | Irrelevant | 0 | 1 | Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil... | None | WhisperCeiling fans feature a totally enclosed... | Panasonic | White | Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil... |
| 2 | 1 | revent 80 cfm | 0 | B07X3Y6B1V | us | Exact | 0 | 1 | Homewerks 7141-80 Bathroom Fan Integrated LED ... | None | OUTSTANDING PERFORMANCE: This Homewerk's bath ... | Homewerks | 80 CFM | Homewerks 7141-80 Bathroom Fan Integrated LED ... |
| 3 | 2 | revent 80 cfm | 0 | B07WDM7MQQ | us | Exact | 0 | 1 | Homewerks 7140-80 Bathroom Fan Ceiling Mount E... | None | OUTSTANDING PERFORMANCE: This Homewerk's bath ... | Homewerks | White | Homewerks 7140-80 Bathroom Fan Ceiling Mount E... |
| 4 | 3 | revent 80 cfm | 0 | B07RH6Z8KW | us | Exact | 0 | 1 | Delta Electronics RAD80L BreezRadiance 80 CFM ... | This pre-owned or refurbished product has been... | Quiet operation at 1.5 sones\nBuilt-in thermos... | DELTA ELECTRONICS (AMERICAS) LTD. | White | Delta Electronics RAD80L BreezRadiance 80 CFM ... |
| 5 | 4 | revent 80 cfm | 0 | B07QJ7WYFQ | us | Exact | 0 | 1 | Panasonic FV-08VRE2 Ventilation Fan with Reces... | None | The design solution for Fan/light combinations... | Panasonic | White | Panasonic FV-08VRE2 Ventilation Fan with Reces... |
With the dataset now prepared and cleaned, we can generate vector embeddings and index them in Milvus for our hybrid search.
Generating Vector Embeddings with BGE-M3
Once your data is clean and ready, the next step is to generate vector embeddings. We'll use the BGE-M3 embedding model to transform the raw text data into numerical vectors that the Milvus vector database can index and search effectively.
Merge text data
First, concatenate the text fields associated with each product into a single text. This lets one embedding capture all the relevant information about the product:
df["merged_text"] = df["product_title"] + "\n" + df["product_text"] + "\n" + df["product_bullet_point"]
docs = df["merged_text"].to_list()
Generate Embeddings
ef = BGEM3EmbeddingFunction(use_fp16=False, device="cpu")
dense_dim = ef.dim["dense"]
docs_embeddings = ef(docs)
query = "Do you have an example of a Panasonic product?"
query_embeddings = ef([query])
Setting Up Your Milvus Collection
After generating embeddings, the next step is to store these vectors in Milvus by creating a collection that can handle both sparse and dense vectors.
Connect to Milvus
Start by establishing a connection to your Milvus server:
from pymilvus import connections
connections.connect()
Define Your Collection Schema
fields = [
    # Use auto generated id as primary key
    FieldSchema(
        name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100
    ),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=8192),
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
]
schema = CollectionSchema(fields, "")
col = Collection("sparse_dense_demo", schema)
Create Indexes for Vectors
sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
dense_index = {"index_type": "FLAT", "metric_type": "COSINE"}
col.create_index("sparse_vector", sparse_index)
col.create_index("dense_vector", dense_index)
# Load the collection into memory; Milvus requires this before searching
col.load()
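To give some intuition for what an inverted index over sparse vectors does with the IP metric, here is a toy sketch in plain Python. It is not how Milvus implements SPARSE_INVERTED_INDEX; the function names and the sample vectors are hypothetical. The idea it illustrates is that only the dimensions present in the query need to be visited.

```python
from collections import defaultdict


def build_inverted_index(sparse_docs):
    """Map each dimension to the (doc_id, weight) pairs that are non-zero there."""
    index = defaultdict(list)
    for doc_id, vec in sparse_docs.items():
        for dim, weight in vec.items():
            index[dim].append((doc_id, weight))
    return index


def sparse_ip_search(index, query, limit=2):
    """Score documents by inner product, visiting only the query's dimensions."""
    scores = defaultdict(float)
    for dim, q_weight in query.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical sparse vectors stored as {dimension: weight}.
toy_sparse_docs = {
    "doc_a": {1: 0.5, 7: 1.0},
    "doc_b": {1: 0.25, 3: 2.0},
    "doc_c": {9: 1.5},
}
toy_index = build_inverted_index(toy_sparse_docs)
toy_sparse_hits = sparse_ip_search(toy_index, {1: 2.0, 7: 1.0})
```

Here `doc_c` is never touched because it shares no dimension with the query, which is why this structure suits vectors that are mostly zeros.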
Insert the Data into the Collection
entities = [
    docs,
    docs_embeddings["sparse"],
    docs_embeddings["dense"],
]
col.insert(entities)
Executing Hybrid Searches in Milvus
Once the Milvus collection holds the data and indexes, we can perform hybrid searches. This step queries the collection with both sparse and dense vectors and refines the results using a re-ranker.
def query_hybrid_search(query: str):
    query_embeddings = ef([query])
    sparse_req = AnnSearchRequest(
        query_embeddings["sparse"], "sparse_vector", {"metric_type": "IP"}, limit=2
    )
    dense_req = AnnSearchRequest(
        query_embeddings["dense"], "dense_vector", {"metric_type": "COSINE"}, limit=2
    )
    res = col.hybrid_search(
        [sparse_req, dense_req], rerank=RRFRanker(), limit=2, output_fields=["text"]
    )
    return res
This function generates embeddings for the input query. It then constructs two AnnSearchRequest objects, one for the sparse vector and one for the dense vector, each specifying its similarity metric (IP for inner product, COSINE for cosine similarity). The hybrid_search method combines the results from both requests using an RRFRanker, which re-ranks them to prioritize the most relevant matches.
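For intuition about the re-ranking step, here is a minimal sketch of reciprocal rank fusion (RRF) in plain Python. It assumes the common formulation in which each list contributes 1 / (k + rank) with ranks starting at 1, and k = 60, which we believe is the RRFRanker default. The function name and the sample rankings are hypothetical.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, limit=2):
    """Fuse ranked lists of ids; each list adds 1 / (k + rank), with rank starting at 1."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical per-field rankings (best first), as the sparse and dense requests might return them.
sparse_ranking = ["fan_1", "fan_2"]
dense_ranking = ["fan_1", "fan_3"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Under these assumptions, a document ranked first in both lists scores 2/61 ≈ 0.0328, and a document ranked first in only one list scores 1/61 ≈ 0.0164. That appears consistent with the distance values in the hybrid search outputs shown in this tutorial.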
A Vector Search Example
To see this function in action, let's perform a hybrid search with a practical query:
query_hybrid_search("Do you have a Homewerks product?")[0]
Output
['id: 449353344520491318, distance: 0.032786883413791656, entity: {\'text\': "Homewerks 7141-80 Bathroom Fan Integrated LED Light Ceiling Mount Exhaust Ventilation, 1.1 Sones, 80 CFM\\nHomewerks 7141-80 Bathroom Fan Integrated LED Light Ceiling Mount Exhaust Ventilation, 1.1 Sones, 80 CFM\\nHomewerks\\n80 CFM\\nNone\\nOUTSTANDING PERFORMANCE: This Homewerk\'s bath fan ensures comfort in your home by quietly eliminating moisture and humidity in the bathroom. This exhaust fan is 1.1 sones at 80 CFM which means it’s able to manage spaces up to 80 square feet and is very quiet..\\nBATH FANS HELPS REMOVE HARSH ODOR: When cleaning the bathroom or toilet, harsh chemicals are used and they can leave an obnoxious odor behind. Homewerk’s bathroom fans can help remove this odor with its powerful ventilation\\n …]
Hybrid Search vs Simple Vector Similarity Search
Using only dense embeddings, rather than a mix of sparse and dense embeddings, can cause trouble with queries that depend on exact keyword matches or categorical distinctions, which sparse embeddings handle well.
For example, suppose you are looking for a very specific attribute, like "shipping included". Dense embeddings might retrieve a broad range of items related to the main keywords, such as the product type or brand, but miss the crucial "shipping included" aspect.
Let’s compare hybrid search with a simple vector search that uses only dense embeddings:
def query_dense_search(query: str):
    query_embeddings = ef([query])
    search_param = {
        "data": query_embeddings["dense"],
        "anns_field": "dense_vector",
        "param": {"metric_type": "COSINE"},
        "limit": 2,
        "output_fields": ["text"],
    }
    res_dense = col.search(**search_param)
    return res_dense
> query_dense_search("shipping included")
[
{
"id": "449353344520491390",
"distance": 0.5341320037841797,
"entity": {
"text": "BAZIC Self Seal White Envelope 3 5/8" x 6 1/2" #6, No Window Mailing Envelopes, Peel & Seal Mailer for Business Invoice Check (100/Pack), 1-Pack\nBAZIC Self Seal White Envelope 3 5/8" x 6 1/2" #6, No Window Mailing Envelopes, Peel & Seal Mailer for Business Invoice Check (100/Pack), 1-Pack\nBAZIC Products\n#6 3/4 (100-count)\n<p><strong>BACK TO BAZIC</strong></p> <p>Our goal is to provide each customer with long-lasting supplies at an affordable cost. Since 1998, we’ve delivered on this promise and will only continue to improve every year. We’ve built our brand on integrity and quality, so customers know exactly what to expect.</p> <p><strong>COMMITTED TO VALUES</strong></p> <p>We are a value-driven company, guided by the principles of excellence through strong product design at low cost. Our commitment to these values is reflected in our dedication to improving current products and developing new exciting products for our consumers. We thrive on imagination, passion and leadership. We have great products and will to continue to rise with our customer expectations.</p> <p><strong>SUCCESS BASED ON SATISFACTION</strong></p> <p>Headquarters in Los Angeles, California, United States. Each and every product we send out, we expect our 100% customer satisfaction. Our success stems from individual consumer fulfillment. We create products that people want to recommend to others.</p>\n#6-3/4 SELF-SEAL WHITE ENVELOPES. These 3 5/8" x 6 1/2" inch self-sealing envelopes are designed to save your time and money. Helps you fly through mailings in record time.\nSECURE. Just peel and seal to create a strong lasting seal without licking or moistening. Our self seal design is quick and easy and is guaranteed to stay sealed with no need for tape or glue sticks.\nWINDOWLESS DESIGN. Envelopes manufactured with a windowless front panel for easy printing, labeling or hand-addressing, perfect for quick, mass business mailings.\nWHITE 20LB STOCK. Perfect for everyday business use, these standard white envelopes are crafted of heavy, durable 20lb paper and securely holds hefty files during transit.\nMULTI-PURPOSE. These envelopes can easily fit invoices, letters, checks, gift cards, etc. These envelopes can be used for virtually anything you need them for!"
}
},
Here is the same query with hybrid search:
> query_hybrid_search("shipping included")
[
{
"id": "449353344520491358",
"distance": 0.016393441706895828,
"entity": {
"text": "ASURION 4 Year Home Improvement Protection Plan $20-29.99\nASURION 4 Year Home Improvement Protection Plan $20-29.99\nASURION\nNone\nAsurion is taking the guesswork out of finding product protection plans to fit your needs. Products fail - often at the most inconvenient time. It’s a good thing you’re covered because no other plan can protect your stuff the way an Asurion Protection Plan can. Simply put, Asurion Protection Plans cover your products when you need it most with a fast and easy claims process. Buy a protection plan from a company that you know and trust. Add an Asurion Protection Plan to your cart today! Please see "User Guide [pdf]" below for detailed terms and conditions related to this plan.\nNO ADDITIONAL COST: You pay $0 for repairs – parts, labor and shipping included.\n
}
}
]
Here, the hybrid search finds a product whose text literally contains “shipping included”, whereas the dense-only search returns a product that is semantically related to shipping (mailing envelopes) but never mentions the phrase.
Conclusion
Throughout this tutorial, we've explored a new capability of Milvus 2.4, focusing on hybrid search, which allows for vector searches across different types of data embeddings.
Feel free to check out Milvus and the code on GitHub, and share your experiences with the community by joining our Discord.
Conducting a Hybrid Search
Conducting a hybrid search involves combining sparse and dense retrieval methods to improve search results. The process typically involves the following steps:
Data Preparation: Prepare the data by converting it into vectors. This can be done using various techniques, such as word embeddings for text data or image embeddings for image data. By transforming the raw data into numerical vectors, you enable efficient indexing and retrieval.
Create Indexes: Create indexes for the data vectors. Indexes are data structures that enable fast lookup and retrieval of data. In Milvus, you can create indexes for both sparse and dense vectors, ensuring that your search queries can be processed quickly and accurately.
Perform Hybrid Search: Perform a hybrid search by combining sparse and dense retrieval methods. Techniques such as reciprocal rank fusion can be used to merge the results from both retrieval methods, producing a final ranking that balances precision and recall. This step is crucial for leveraging the strengths of both sparse and dense vectors.
Retrieve Results: Retrieve the search results based on the final ranking. The search results can be further filtered and refined using various techniques, such as attribute filtering. This ensures that the most relevant and high-quality results are presented to the user.
Hybrid search offers several benefits, including improved search results, increased efficiency, and flexibility. It can be used for various applications, such as image and text search, and can be implemented using various techniques, such as reciprocal rank fusion and vector-based methods. By conducting a hybrid search, you can achieve a more comprehensive and accurate search experience, tailored to the specific needs of your application.
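The attribute filtering mentioned in the last step can be sketched in plain Python as a post-filter over already-ranked results. The `brand` attribute, the scores, and the `filter_results` helper are hypothetical; the collection built in this tutorial stores only a text field.

```python
def filter_results(results, predicate, limit):
    """Keep results that satisfy the predicate, preserving rank order, up to `limit`."""
    kept = [hit for hit in results if predicate(hit)]
    return kept[:limit]


# Hypothetical fused results, already sorted by score (best first).
toy_results = [
    {"id": 1, "score": 0.0328, "brand": "Homewerks"},
    {"id": 2, "score": 0.0164, "brand": "Panasonic"},
    {"id": 3, "score": 0.0161, "brand": "Homewerks"},
]
homewerks_only = filter_results(
    toy_results, lambda hit: hit["brand"] == "Homewerks", limit=2
)
```

One design point to keep in mind: if the filter runs after the search limit has already been applied, fewer results than requested may survive. Fetching extra candidates first, or applying the filter during the search itself where the database supports it, avoids that.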