Getting Started with Hybrid Search with Milvus
With the release of Milvus 2.4, we introduced multi-vector search, which makes hybrid search possible. This functionality enhances vector similarity search and data analysis by allowing simultaneous queries across multiple vector fields and integrating the results with re-ranking strategies.
What is Hybrid Search in Milvus?
Milvus supports up to 10 vector fields for the same dataset within a single collection. Building on this, hybrid search lets users search across multiple vector columns simultaneously. This enables combinations such as multimodal search, hybrid sparse and dense vector search, and hybrid dense and full-text search, offering versatile and flexible search functionality.
These vectors in different columns represent diverse facets of data, originating from different embedding models or undergoing distinct processing methods. The results of hybrid searches are integrated using various reranking strategies.
Tutorial Overview
In this tutorial, you will learn how to leverage Milvus 2.4's hybrid search capabilities to enhance your search. We'll cover:
Creating sparse embeddings
Creating dense embeddings
Indexing your data in Milvus
Performing a hybrid search on the same collection
This tutorial uses the ESCI dataset, a comprehensive product search dataset from Amazon. Additionally, we'll use the BGE-M3 model via the pymilvus[model] library, which makes it easier to generate embeddings directly for use with Milvus.
Install and Import the Needed Libraries
! pip install pymilvus[model] datasets
import pandas as pd
from datasets import load_dataset
from pymilvus import (
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
    AnnSearchRequest,
    RRFRanker,
    connections,
)
from pymilvus.model.hybrid import BGEM3EmbeddingFunction
Prepare the Dataset
The ESCI dataset is designed for the semantic matching of queries and products. In this section, we'll select a subset of the data and make sure it's clean and ready for processing.
Download and Select a Subset
dataset = load_dataset("tasksource/esci", split="train")
dataset = dataset.select(range(500))
dataset = dataset.filter(lambda x: x["product_locale"] == "us")
dataset
Clean the Data
Cleaning the dataset is crucial to avoid poor search results caused by duplicates or missing information.
source_df = dataset.to_pandas()
# Drop duplicate products
df = source_df.drop_duplicates(
    subset=["product_text", "product_title", "product_bullet_point", "product_brand"]
)
# Drop rows with missing values
df = df.dropna(
    subset=["product_text", "product_title", "product_bullet_point", "product_brand"]
)
df.head()
Here’s a quick look at what the data includes:
| | example_id | query | query_id | product_id | product_locale | esci_label | small_version | large_version | product_title | product_description | product_bullet_point | product_brand | product_color | product_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | revent 80 cfm | 0 | B000MOO21W | us | Irrelevant | 0 | 1 | Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil... | None | WhisperCeiling fans feature a totally enclosed... | Panasonic | White | Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil... |
| 2 | 1 | revent 80 cfm | 0 | B07X3Y6B1V | us | Exact | 0 | 1 | Homewerks 7141-80 Bathroom Fan Integrated LED ... | None | OUTSTANDING PERFORMANCE: This Homewerk's bath ... | Homewerks | 80 CFM | Homewerks 7141-80 Bathroom Fan Integrated LED ... |
| 3 | 2 | revent 80 cfm | 0 | B07WDM7MQQ | us | Exact | 0 | 1 | Homewerks 7140-80 Bathroom Fan Ceiling Mount E... | None | OUTSTANDING PERFORMANCE: This Homewerk's bath ... | Homewerks | White | Homewerks 7140-80 Bathroom Fan Ceiling Mount E... |
| 4 | 3 | revent 80 cfm | 0 | B07RH6Z8KW | us | Exact | 0 | 1 | Delta Electronics RAD80L BreezRadiance 80 CFM ... | This pre-owned or refurbished product has been... | Quiet operation at 1.5 sones\nBuilt-in thermos... | DELTA ELECTRONICS (AMERICAS) LTD. | White | Delta Electronics RAD80L BreezRadiance 80 CFM ... |
| 5 | 4 | revent 80 cfm | 0 | B07QJ7WYFQ | us | Exact | 0 | 1 | Panasonic FV-08VRE2 Ventilation Fan with Reces... | None | The design solution for Fan/light combinations... | Panasonic | White | Panasonic FV-08VRE2 Ventilation Fan with Reces... |
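The row index in the preview skips 1, which is what one would expect when a row is dropped during cleaning. The following is a minimal sketch on a made-up two-column DataFrame (the values are invented for illustration, not taken from ESCI) showing how drop_duplicates and dropna leave gaps in the index, and how reset_index can renumber the rows if a contiguous index is preferred.

```python
import pandas as pd

# Toy data: row 1 duplicates row 0, and row 2 has a missing brand.
toy = pd.DataFrame(
    {
        "product_title": ["Fan A", "Fan A", "Fan B", "Fan C"],
        "product_brand": ["Acme", "Acme", None, "Zenith"],
    }
)
cols = ["product_title", "product_brand"]

# Same two cleaning steps as in the tutorial, applied to the toy columns.
clean = toy.drop_duplicates(subset=cols).dropna(subset=cols)
print(list(clean.index))  # rows 1 and 2 are gone, so the index has a gap

# Optional: renumber the surviving rows from 0.
renumbered = clean.reset_index(drop=True)
print(list(renumbered.index))
```

Renumbering is not required for the rest of the tutorial, since only the column values are used later.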
With the dataset now prepared and cleaned, we can generate vector embeddings and index them in Milvus for our hybrid search.
Generating Vector Embeddings with BGE-M3
Once your data is clean and ready, the next step is to generate vector embeddings. We'll use the BGE-M3 embedding model to transform the raw text data into numerical vectors that the Milvus vector database can index and search effectively.
Merge Text Data
First, concatenate the different text fields associated with each product into a single text string. This captures all relevant information about the product in one document, which is then embedded as a whole:
df["merged_text"] = df["product_title"] + "\n" + df["product_text"] + "\n" + df["product_bullet_point"]
docs = df["merged_text"].to_list()
Generate Embeddings
ef = BGEM3EmbeddingFunction(use_fp16=False, device="cpu")
dense_dim = ef.dim["dense"]
docs_embeddings = ef(docs)
query = "Do you have an example of a Panasonic product?"
query_embeddings = ef([query])
Setting Up Your Milvus Collection
After generating embeddings, the next step is to store these vectors in Milvus by creating a collection that can handle both sparse and dense vectors.
Connect to Milvus
Start by establishing a connection to your Milvus server:
from pymilvus import connections
connections.connect()
Define Your Collection Schema
fields = [
    # Use auto generated id as primary key
    FieldSchema(
        name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100
    ),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=8192),
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
]
schema = CollectionSchema(fields, "")
col = Collection("sparse_dense_demo", schema)
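The text field above is capped at max_length=8192, and merged product texts can be long. As a precaution, the documents could be trimmed before insertion. The sketch below uses a hypothetical helper, truncate_utf8, and assumes the limit is counted in UTF-8 bytes rather than characters; if your Milvus version counts differently, adjust accordingly.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial multi-byte character, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


MAX_TEXT_BYTES = 8192  # assumption: mirrors max_length of the "text" field

sample = "é" * 5000  # each "é" takes 2 bytes in UTF-8
short = truncate_utf8(sample, MAX_TEXT_BYTES)
print(len(short.encode("utf-8")) <= MAX_TEXT_BYTES)
```

In this tutorial, the helper could be applied as docs = [truncate_utf8(d, MAX_TEXT_BYTES) for d in docs] before the insert step.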
Create Indexes for Vectors
sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
dense_index = {"index_type": "FLAT", "metric_type": "COSINE"}
col.create_index("sparse_vector", sparse_index)
col.create_index("dense_vector", dense_index)
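To make the two metric types concrete, here is a small plain-Python sketch of what IP (inner product) and COSINE compute. It is an illustration only, not how Milvus implements them; the sparse vectors are represented as hypothetical {dimension_index: weight} dictionaries with made-up values.

```python
import math


def sparse_inner_product(a: dict, b: dict) -> float:
    """Inner product of two sparse vectors stored as {dimension_index: weight}."""
    # Only dimensions present in both vectors contribute to the score.
    return sum(weight * b[i] for i, weight in a.items() if i in b)


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The query and the document share only dimension 101.
query_sparse = {101: 0.5, 2047: 0.25}
doc_sparse = {101: 0.5, 999: 0.75}
print(sparse_inner_product(query_sparse, doc_sparse))  # 0.25

# Vectors pointing in the same direction have cosine similarity 1.0,
# regardless of their length.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
```

The sparse score grows only when the query and the document share weighted terms, which is why sparse vectors tend to reward exact keyword overlap.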
Insert the Data into the Collection
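# Column order must match the schema (the auto-generated "pk" is omitted)
entities = [
    docs,
    docs_embeddings["sparse"],
    docs_embeddings["dense"],
]
col.insert(entities)
# Persist the inserted data and load the collection into memory before searching
col.flush()
col.load()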
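For larger datasets, the single insert call above could be replaced with batched inserts. The sketch below shows only the batching logic on toy lists; iter_batches is a hypothetical helper, and it assumes every column supports row slicing with [start:stop].

```python
from typing import Iterator, List, Sequence


def iter_batches(columns: Sequence, n_rows: int, batch_size: int) -> Iterator[List]:
    """Yield column-wise batches, each holding at most batch_size rows."""
    for start in range(0, n_rows, batch_size):
        yield [column[start : start + batch_size] for column in columns]


# Toy columns standing in for texts and vectors.
toy_columns = [["a", "b", "c", "d", "e"], [1, 2, 3, 4, 5]]
batches = list(iter_batches(toy_columns, n_rows=5, batch_size=2))
print(len(batches))  # 3
print(batches[-1])   # [['e'], [5]]
```

With the tutorial's objects, the loop would look like "for batch in iter_batches(entities, len(docs), 50): col.insert(batch)". The batch size of 50 is an arbitrary choice.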
Executing Hybrid Searches in Milvus
Once the Milvus collection is prepared with the necessary data and indexes, we can perform hybrid searches. This step queries the collection with both sparse and dense vectors and refines the results using a re-ranker.
def query_hybrid_search(query: str):
    # Embed the query once; the result holds both sparse and dense vectors
    query_embeddings = ef([query])
    sparse_req = AnnSearchRequest(
        query_embeddings["sparse"], "sparse_vector", {"metric_type": "IP"}, limit=2
    )
    dense_req = AnnSearchRequest(
        query_embeddings["dense"], "dense_vector", {"metric_type": "COSINE"}, limit=2
    )
    # Fuse the two result lists with Reciprocal Rank Fusion
    res = col.hybrid_search(
        [sparse_req, dense_req], rerank=RRFRanker(), limit=2, output_fields=["text"]
    )
    return res
This function generates embeddings for an input query. It then constructs two AnnSearchRequest objects, one for the sparse vector and one for the dense vector, specifying the similarity metric for each (IP for inner product and COSINE for cosine similarity). The hybrid_search method combines the results of both requests using an RRFRanker, which re-ranks the combined results to prioritize the most relevant matches.
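To give a feel for what Reciprocal Rank Fusion does, here is a plain-Python sketch. It is not the Milvus implementation; it assumes the common formulation in which each document scores the sum of 1 / (k + rank) over the result lists, with k = 60 and ranks starting at 1. Under those assumptions, a document ranked first in both lists scores 2/61, about 0.0328, which is consistent with the distance value in the example output in this tutorial. The function name rrf_fuse and the document ids are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Higher positions (smaller rank) contribute more to the score.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, best first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sparse_hits = ["doc_a", "doc_b"]  # ranked results of the sparse request
dense_hits = ["doc_a", "doc_c"]   # ranked results of the dense request

fused = rrf_fuse([sparse_hits, dense_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Because only ranks are used, the fusion does not need the IP and COSINE scores to be on comparable scales.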
A Vector Search Example
To see this function in action, let's perform a hybrid search with a practical query:
query_hybrid_search("Do you have a Homewerks product?")[0]
Output
['id: 449353344520491318, distance: 0.032786883413791656, entity: {\'text\': "Homewerks 7141-80 Bathroom Fan Integrated LED Light Ceiling Mount Exhaust Ventilation, 1.1 Sones, 80 CFM\\nHomewerks 7141-80 Bathroom Fan Integrated LED Light Ceiling Mount Exhaust Ventilation, 1.1 Sones, 80 CFM\\nHomewerks\\n80 CFM\\nNone\\nOUTSTANDING PERFORMANCE: This Homewerk\'s bath fan ensures comfort in your home by quietly eliminating moisture and humidity in the bathroom. This exhaust fan is 1.1 sones at 80 CFM which means it’s able to manage spaces up to 80 square feet and is very quiet..\\nBATH FANS HELPS REMOVE HARSH ODOR: When cleaning the bathroom or toilet, harsh chemicals are used and they can leave an obnoxious odor behind. Homewerk’s bathroom fans can help remove this odor with its powerful ventilation\\n …]
Hybrid Search vs Simple Vector Similarity Search
Using only dense embeddings, instead of a mix of sparse and dense embeddings, can run into trouble with queries that depend on exact keyword matches or categorical distinctions, which sparse embeddings handle well.
For example, suppose you are looking for a very specific attribute, like "shipping included." Dense embeddings might retrieve a broad range of items related to the main keywords, such as the product type or brand, but miss the crucial "shipping included" aspect.
Let's compare a hybrid search with a simple vector search that uses only dense embeddings:
def query_dense_search(query: str):
    query_embeddings = ef([query])
    # Search only the dense vector field
    search_param = {
        "data": query_embeddings["dense"],
        "anns_field": "dense_vector",
        "param": {"metric_type": "COSINE"},
        "limit": 2,
        "output_fields": ["text"],
    }
    res_dense = col.search(**search_param)
    return res_dense
> query_dense_search("shipping included")
[
{
"id": "449353344520491390",
"distance": 0.5341320037841797,
"entity": {
"text": "BAZIC Self Seal White Envelope 3 5/8" x 6 1/2" #6, No Window Mailing Envelopes, Peel & Seal Mailer for Business Invoice Check (100/Pack), 1-PacknBAZIC Self Seal White Envelope 3 5/8" x 6 1/2" #6, No Window Mailing Envelopes, Peel & Seal Mailer for Business Invoice Check (100/Pack), 1-PacknBAZIC Productsn#6 3/4 (100-count)n<p><strong>BACK TO BAZIC</strong></p> <p>Our goal is to provide each customer with long-lasting supplies at an affordable cost. Since 1998, we’ve delivered on this promise and will only continue to improve every year. We’ve built our brand on integrity and quality, so customers know exactly what to expect.</p> <p><strong>COMMITTED TO VALUES</strong></p> <p>We are a value-driven company, guided by the principles of excellence through strong product design at low cost. Our commitment to these values is reflected in our dedication to improving current products and developing new exciting products for our consumers. We thrive on imagination, passion and leadership. We have great products and will to continue to rise with our customer expectations.</p> <p><strong>SUCCESS BASED ON SATISFACTION</strong></p> <p>Headquarters in Los Angeles, California, United States. Each and every product we send out, we expect our 100% customer satisfaction. Our success stems from individual consumer fulfillment. We create products that people want to recommend to others.</p>n#6-3/4 SELF-SEAL WHITE ENVELOPES. These 3 5/8" x 6 1/2" inch self-sealing envelopes are designed to save your time and money. Helps you fly through mailings in record time.nSECURE. Just peel and seal to create a strong lasting seal without licking or moistening. Our self seal design is quick and easy and is guaranteed to stay sealed with no need for tape or glue sticks.nWINDOWLESS DESIGN. Envelopes manufactured with a windowless front panel for easy printing, labeling or hand-addressing, perfect for quick, mass business mailings.nWHITE 20LB STOCK. 
Perfect for everyday business use, these standard white envelopes are crafted of heavy, durable 20lb paper and securely holds hefty files during transit.nMULTI-PURPOSE. These envelopes can easily fit invoices, letters, checks, gift cards, etc. These envelopes can be used for virtually anything you need them for!"
}
},
Now the same query with hybrid search:
> query_hybrid_search("shipping included")
[
{
"id": "449353344520491358",
"distance": 0.016393441706895828,
"entity": {
"text": "ASURION 4 Year Home Improvement Protection Plan $20-29.99nASURION 4 Year Home Improvement Protection Plan $20-29.99nASURIONnNonenAsurion is taking the guesswork out of finding product protection plans to fit your needs. Products fail - often at the most inconvenient time. It’s a good thing you’re covered because no other plan can protect your stuff the way an Asurion Protection Plan can. Simply put, Asurion Protection Plans cover your products when you need it most with a fast and easy claims process. Buy a protection plan from a company that you know and trust. Add an Asurion Protection Plan to your cart today! Please see "User Guide [pdf]" below for detailed terms and conditions related to this plan.nNO ADDITIONAL COST: You pay $0 for repairs – parts, labor and shipping included.n
}
}
]
Here, we can see that the hybrid search finds a product whose description literally contains "shipping included," whereas the dense-only search returns a product that is merely related to shipping (mailing envelopes).
Conclusion
Throughout this tutorial, we've explored a new capability of Milvus 2.4, focusing on hybrid search, which allows for vector searches across different types of data embeddings.
Feel free to check out Milvus and the code on GitHub, and share your experiences with the community by joining our Discord.