Getting Started with Milvus Hybrid Search

Apr 30, 2024 · 10 min read

With the release of Milvus 2.4, we introduced hybrid search, also called multi-vector search. It lets you run simultaneous queries across multiple vector fields and merge the results with re-ranking strategies.

Hybrid search, often referred to as multi-vector search, is the process of searching across several vector fields within the same dataset and combining the results with a re-ranker. The vectors can represent different facets of the data, come from different embedding models, or result from distinct data processing methods.

In this tutorial, you will learn how to use Milvus 2.4’s hybrid search capabilities to improve your search results. We’ll cover how to:

Create sparse embeddings
Create dense embeddings
Index your data in Milvus
Perform a multi-vector search on a single collection

This tutorial uses the ESCI dataset, a comprehensive product search dataset from Amazon. We’ll also use the BGE-M3 model via the pymilvus[model] extra, which makes it easier to generate embeddings directly from pymilvus.

Introduction to Hybrid Search

Hybrid search is a powerful technique that combines the strengths of sparse and dense retrieval methods to enhance search results. Traditional sparse retrieval methods, while effective in certain scenarios, often suffer from limited expressiveness and lack of flexibility. Hybrid search addresses these limitations by merging sparse and dense retrieval methods, enabling more accurate and efficient search outcomes.

By leveraging the complementary strengths of sparse and dense vectors, hybrid search can capture both the exact matches and the semantic nuances of the data. This results in a more comprehensive and precise search experience. Whether you’re dealing with text, images, or other types of data, hybrid search can significantly improve the relevance and quality of your search results.

Understanding Vector Search

Vector search is a type of search that uses vectors to represent data. Vectors are mathematical representations that capture the essence of data points, allowing for the calculation of similarities between them. In vector search, both the data and the search queries are transformed into vectors. The similarity between the query vector and the data vectors is then computed, and the most similar data points are returned as search results.

This approach is particularly useful for searching large datasets, as it enables efficient and scalable retrieval of relevant information. Vector search can be applied to various domains, including image and text search, where it excels in finding semantically similar items. By leveraging the power of vectors, you can achieve more accurate and meaningful search results.
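To make the idea concrete, here is a minimal pure-Python sketch of a brute-force vector search using cosine similarity. The three-dimensional vectors, the item names, and the helper functions are all made up for illustration; a vector database such as Milvus uses real embeddings and index structures instead of scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return the k (id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional embeddings; real models produce far more dimensions.
toy_vectors = {
    "bathroom fan": [0.9, 0.1, 0.0],
    "ceiling fan": [0.8, 0.3, 0.1],
    "envelope": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

results = top_k(toy_query, toy_vectors, k=2)
print(results)
```

The two fan vectors point in nearly the same direction as the query, so they rank ahead of the envelope vector.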

Install and Import the Needed Libraries

! pip install "pymilvus[model]" datasets

import pandas as pd
from datasets import load_dataset

from pymilvus import (
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
    AnnSearchRequest,
    RRFRanker,
    connections,
)

from pymilvus.model.hybrid import BGEM3EmbeddingFunction

Prepare the Dataset

The ESCI dataset is designed for the semantic matching of queries and products. In this section, we'll prepare the ESCI dataset. We'll focus on selecting a subset of the data and ensuring it's clean and ready for processing.

Download and Select a Subset

dataset = load_dataset("tasksource/esci", split="train")

dataset = dataset.select(range(500))
dataset = dataset.filter(lambda x: x["product_locale"] == "us")
dataset

Clean the data

Cleaning the dataset is crucial, because duplicates and missing values can degrade search results.

source_df = dataset.to_pandas()

# Drop rows that duplicate another product's text fields
df = source_df.drop_duplicates(
    subset=["product_text", "product_title", "product_bullet_point", "product_brand"]
)
# Drop rows with missing values
df = df.dropna(
    subset=["product_text", "product_title", "product_bullet_point", "product_brand"]
)
df.head()

Here’s a quick look at what the data includes:

| | example_id | query | query_id | product_id | product_locale | esci_label | small_version | large_version | product_title | product_description | product_bullet_point | product_brand | product_color | product_text |
| 0 | 0 | revent 80 cfm | 0 | B000MOO21W | us | Irrelevant | 0 | 1 | Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil... | None | WhisperCeiling fans feature a totally enclosed... | Panasonic | White | Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceil... |
| 2 | 1 | revent 80 cfm | 0 | B07X3Y6B1V | us | Exact | 0 | 1 | Homewerks 7141-80 Bathroom Fan Integrated LED ... | None | OUTSTANDING PERFORMANCE: This Homewerk's bath ... | Homewerks | 80 CFM | Homewerks 7141-80 Bathroom Fan Integrated LED ... |
| 3 | 2 | revent 80 cfm | 0 | B07WDM7MQQ | us | Exact | 0 | 1 | Homewerks 7140-80 Bathroom Fan Ceiling Mount E... | None | OUTSTANDING PERFORMANCE: This Homewerk's bath ... | Homewerks | White | Homewerks 7140-80 Bathroom Fan Ceiling Mount E... |
| 4 | 3 | revent 80 cfm | 0 | B07RH6Z8KW | us | Exact | 0 | 1 | Delta Electronics RAD80L BreezRadiance 80 CFM ... | This pre-owned or refurbished product has been... | Quiet operation at 1.5 sones\nBuilt-in thermos... | DELTA ELECTRONICS (AMERICAS) LTD. | White | Delta Electronics RAD80L BreezRadiance 80 CFM ... |
| 5 | 4 | revent 80 cfm | 0 | B07QJ7WYFQ | us | Exact | 0 | 1 | Panasonic FV-08VRE2 Ventilation Fan with Reces... | None | The design solution for Fan/light combinations... | Panasonic | White | Panasonic FV-08VRE2 Ventilation Fan with Reces... |

With the dataset now prepared and cleaned, we can generate vector embeddings and index them in Milvus for our hybrid search.

Generating Vector Embeddings with BGE-M3

Once your data is clean and ready, the next step is to generate vector embeddings. We'll use the BGE-M3 embedding model to transform the raw text data into numerical vectors that the Milvus vector database can index and search effectively.

Merge text data

First, concatenate the different text fields associated with each product into a single string. This captures all relevant information about the product in one text, which is then embedded as a whole:

df["merged_text"] = df["product_title"] + "\n" + df["product_text"] + "\n" + df["product_bullet_point"]
docs = df["merged_text"].to_list()

Generate Embeddings

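# Load BGE-M3 on CPU (use_fp16 should stay False when running on CPU)
ef = BGEM3EmbeddingFunction(use_fp16=False, device="cpu")
# Dimension of the dense vectors, needed later for the collection schema
dense_dim = ef.dim["dense"]
# ef(...) returns both representations, under the "dense" and "sparse" keys
docs_embeddings = ef(docs)
query = "Do you have an example of a Panasonic product?"
query_embeddings = ef([query])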

Setting Up Your Milvus Collection

After generating embeddings, the next step is to store these vectors in Milvus by creating a collection that can handle both sparse and dense vectors.

Connect to Milvus

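Start by establishing a connection to your Milvus server. Called without arguments, connections.connect() targets a local instance at the default address (localhost:19530):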

from pymilvus import connections
connections.connect()

Define Your Collection Schema

fields = [
    # Use auto generated id as primary key
    FieldSchema(
        name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=True, max_length=100
    ),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=8192),
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
    FieldSchema(name="dense_vector", dtype=DataType.FLOAT_VECTOR, dim=dense_dim),
]
schema = CollectionSchema(fields, "")
col = Collection("sparse_dense_demo", schema)

Create Indexes for Vectors

sparse_index = {"index_type": "SPARSE_INVERTED_INDEX", "metric_type": "IP"}
dense_index = {"index_type": "FLAT", "metric_type": "COSINE"}
col.create_index("sparse_vector", sparse_index)
col.create_index("dense_vector", dense_index)
# Load the collection into memory; Milvus requires this before searching
col.load()

Insert the Data into the Collection

entities = [
    docs,
    docs_embeddings["sparse"],
    docs_embeddings["dense"],
]
col.insert(entities)

Executing Hybrid Searches in Milvus

Once your Milvus collection holds the data and indexes, you can run hybrid searches. This step queries the collection with both sparse and dense vectors and refines the results with a re-ranker.

def query_hybrid_search(query: str):
    query_embeddings = ef([query])

    sparse_req = AnnSearchRequest(
        query_embeddings["sparse"], "sparse_vector", {"metric_type": "IP"}, limit=2
    )
    dense_req = AnnSearchRequest(
        query_embeddings["dense"], "dense_vector", {"metric_type": "COSINE"}, limit=2
    )

    res = col.hybrid_search(
        [sparse_req, dense_req], rerank=RRFRanker(), limit=2, output_fields=["text"]
    )

    return res

This function generates embeddings for an input query. It then constructs two AnnSearchRequest objects, one for the sparse vector and one for the dense vector, each specifying its similarity metric (IP for inner product and COSINE for cosine similarity). The hybrid_search method combines the results from both requests using an RRFRanker, which applies reciprocal rank fusion to re-rank the combined results and prioritize the most relevant matches.
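To give an intuition for the re-ranking step, here is a minimal pure-Python sketch of reciprocal rank fusion. It is an illustration, not Milvus's internal implementation: the function name and the document ids are hypothetical, and the smoothing constant k=60 is assumed to match the ranker's default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1, so the top hit of a list contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result ids from a sparse and a dense search (limit=2 each)
sparse_hits = ["doc_a", "doc_b"]
dense_hits = ["doc_a", "doc_c"]

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Under this formula, a document ranked first by both searches scores 2/61 ≈ 0.0328, and a document ranked first by only one search scores 1/61 ≈ 0.0164. These values are consistent with the distance values in the hybrid search outputs shown later in this tutorial, which suggests those scores are fused ranks rather than raw similarities.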

A Vector Search Example

To see this function in action, let's perform a hybrid search with a practical query:

query_hybrid_search("Do you have a Homewerks product?")[0]

Output

['id: 449353344520491318, distance: 0.032786883413791656, entity: {\'text\': "Homewerks 7141-80 Bathroom Fan Integrated LED Light Ceiling Mount Exhaust Ventilation, 1.1 Sones, 80 CFM\\nHomewerks 7141-80 Bathroom Fan Integrated LED Light Ceiling Mount Exhaust Ventilation, 1.1 Sones, 80 CFM\\nHomewerks\\n80 CFM\\nNone\\nOUTSTANDING PERFORMANCE: This Homewerk\'s bath fan ensures comfort in your home by quietly eliminating moisture and humidity in the bathroom. This exhaust fan is 1.1 sones at 80 CFM which means it’s able to manage spaces up to 80 square feet and is very quiet..\\nBATH FANS HELPS REMOVE HARSH ODOR: When cleaning the bathroom or toilet, harsh chemicals are used and they can leave an obnoxious odor behind. Homewerk’s bathroom fans can help remove this odor with its powerful ventilation\\n …]

Hybrid Search vs Simple Vector Similarity Search

Using only dense embeddings, rather than a mix of sparse and dense embeddings, can cause trouble with queries that depend on exact keyword matches or categorical distinctions, which sparse embeddings handle well.

For example, suppose you are looking for a very specific attribute, like "shipping included." Dense embeddings might retrieve a broad range of items related to the main keywords, such as the product type or brand, but miss the crucial "shipping included" aspect.
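The reason sparse vectors favor exact matches can be sketched in a few lines of pure Python. A sparse vector only has weights for the terms that actually occur, so the inner product (the IP metric used for the sparse field) is non-zero only when the query and the document share terms. The term weights and document names below are invented for illustration; a real sparse model learns its own weights.

```python
def sparse_inner_product(query, doc):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)

# Hypothetical term weights, chosen by hand for this example
toy_query_vec = {"shipping": 0.6, "included": 0.5}
toy_doc_vecs = {
    "protection plan": {"parts": 0.2, "labor": 0.2, "shipping": 0.4, "included": 0.3},
    "mailing envelopes": {"envelopes": 0.5, "mailing": 0.4, "seal": 0.3},
}

toy_scores = {
    name: sparse_inner_product(toy_query_vec, vec) for name, vec in toy_doc_vecs.items()
}
print(toy_scores)
```

Only the document that literally contains both query terms gets a non-zero score. A dense model, by contrast, may still place the envelope text close to the query because both relate to mailing.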

Let’s compare hybrid search against a simple vector search that uses only dense embeddings:


def query_dense_search(query: str):
    query_embeddings = ef([query])
    search_param = {
        "data": query_embeddings["dense"],
        "anns_field": "dense_vector",
        "param": {"metric_type": "COSINE"},
        "limit": 2,
        "output_fields": ["text"],
    }
    res_dense = col.search(**search_param)
    return res_dense


> query_dense_search("shipping included")

[
    {
        "id": "449353344520491390",
        "distance": 0.5341320037841797,
        "entity": {
            "text": "BAZIC Self Seal White Envelope 3 5/8" x 6 1/2" #6, No Window Mailing Envelopes, Peel & Seal Mailer for Business Invoice Check (100/Pack), 1-Pack\nBAZIC Self Seal White Envelope 3 5/8" x 6 1/2" #6, No Window Mailing Envelopes, Peel & Seal Mailer for Business Invoice Check (100/Pack), 1-Pack\nBAZIC Products\n#6 3/4 (100-count)\n<p><strong>BACK TO BAZIC</strong></p> <p>Our goal is to provide each customer with long-lasting supplies at an affordable cost. Since 1998, we’ve delivered on this promise and will only continue to improve every year. We’ve built our brand on integrity and quality, so customers know exactly what to expect.</p> <p><strong>COMMITTED TO VALUES</strong></p> <p>We are a value-driven company, guided by the principles of excellence through strong product design at low cost. Our commitment to these values is reflected in our dedication to improving current products and developing new exciting products for our consumers. We thrive on imagination, passion and leadership. We have great products and will to continue to rise with our customer expectations.</p> <p><strong>SUCCESS BASED ON SATISFACTION</strong></p> <p>Headquarters in Los Angeles, California, United States. Each and every product we send out, we expect our 100% customer satisfaction. Our success stems from individual consumer fulfillment. We create products that people want to recommend to others.</p>\n#6-3/4 SELF-SEAL WHITE ENVELOPES. These 3 5/8" x 6 1/2" inch self-sealing envelopes are designed to save your time and money. Helps you fly through mailings in record time.\nSECURE. Just peel and seal to create a strong lasting seal without licking or moistening. Our self seal design is quick and easy and is guaranteed to stay sealed with no need for tape or glue sticks.\nWINDOWLESS DESIGN. Envelopes manufactured with a windowless front panel for easy printing, labeling or hand-addressing, perfect for quick, mass business mailings.\nWHITE 20LB STOCK. Perfect for everyday business use, these standard white envelopes are crafted of heavy, durable 20lb paper and securely holds hefty files during transit.\nMULTI-PURPOSE. These envelopes can easily fit invoices, letters, checks, gift cards, etc. These envelopes can be used for virtually anything you need them for!"
        }
    },

Now the same query with hybrid search:


> query_hybrid_search("shipping included")

[
    {
        "id": "449353344520491358",
        "distance": 0.016393441706895828,
        "entity": {
            "text": "ASURION 4 Year Home Improvement Protection Plan $20-29.99\nASURION 4 Year Home Improvement Protection Plan $20-29.99\nASURION\nNone\nAsurion is taking the guesswork out of finding product protection plans to fit your needs. Products fail - often at the most inconvenient time. It’s a good thing you’re covered because no other plan can protect your stuff the way an Asurion Protection Plan can. Simply put, Asurion Protection Plans cover your products when you need it most with a fast and easy claims process. Buy a protection plan from a company that you know and trust. Add an Asurion Protection Plan to your cart today! Please see "User Guide [pdf]" below for detailed terms and conditions related to this plan.\nNO ADDITIONAL COST: You pay $0 for repairs – parts, labor and shipping included.\n
        }
    }
]

Here, the hybrid search finds a product whose description literally states that “shipping” is included, whereas the dense-only search returns a product that is merely related to shipping things (mailing envelopes).

Conclusion

Throughout this tutorial, we've explored a new capability of Milvus 2.4, focusing on hybrid search, which allows for vector searches across different types of data embeddings.

Feel free to check out Milvus and the code on GitHub, and share your experiences with the community by joining our Discord.

Conducting a Hybrid Search

Conducting a hybrid search involves combining sparse and dense retrieval methods to improve search results. The process typically involves the following steps:

Data Preparation: Prepare the data by converting it into vectors. This can be done using various techniques, such as word embeddings for text data or image embeddings for image data. By transforming the raw data into numerical vectors, you enable efficient indexing and retrieval.
Create Indexes: Create indexes for the data vectors. Indexes are data structures that enable fast lookup and retrieval of data. In Milvus, you can create indexes for both sparse and dense vectors, ensuring that your search queries can be processed quickly and accurately.
Perform Hybrid Search: Perform a hybrid search by combining sparse and dense retrieval methods. Techniques such as reciprocal rank fusion can be used to merge the results from both retrieval methods, producing a final ranking that balances precision and recall. This step is crucial for leveraging the strengths of both sparse and dense vectors.
Retrieve Results: Retrieve the search results based on the final ranking. The search results can be further filtered and refined using various techniques, such as attribute filtering. This ensures that the most relevant and high-quality results are presented to the user.
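The final step, refining a ranked list by an attribute, can be sketched in pure Python. The records, field names, and the filter_by_attribute helper are hypothetical and only illustrate the idea on the client side; in Milvus, such a filter would more typically be expressed as a filter expression on the search request itself.

```python
def filter_by_attribute(ranked_hits, field, allowed_values, limit=2):
    """Keep the ranking order, drop hits whose field is not in allowed_values."""
    kept = [hit for hit in ranked_hits if hit.get(field) in allowed_values]
    return kept[:limit]

# Hypothetical fused results, already sorted by relevance score
toy_ranked_hits = [
    {"id": 1, "score": 0.0328, "brand": "Homewerks"},
    {"id": 2, "score": 0.0164, "brand": "Panasonic"},
    {"id": 3, "score": 0.0161, "brand": "Homewerks"},
]

homewerks_only = filter_by_attribute(toy_ranked_hits, "brand", {"Homewerks"})
print([hit["id"] for hit in homewerks_only])
```

Filtering after ranking preserves the relative order produced by the re-ranker, but it can return fewer than limit results when few hits match the attribute.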

Hybrid search offers several benefits, including improved search results, increased efficiency, and flexibility. It can be used for various applications, such as image and text search, and can be implemented using various techniques, such as reciprocal rank fusion and vector-based methods. By conducting a hybrid search, you can achieve a more comprehensive and accurate search experience, tailored to the specific needs of your application.

Updated on Jan 20, 2025

Stephen Batifol
Stephen Batifol is a Developer Advocate at Zilliz. He previously worked as a Machine Learning Engineer at Wolt, where he was working on the ML Platform and as a Data Scientist at Brevo. Stephen studied Computer Science and Artificial Intelligence. He enjoys dancing and surfing.
