Exploring BGE-M3: The Future of Information Retrieval with Milvus
Picture yourself at a bustling international food market where each vendor speaks a different language and offers a unique selection of spices. You want to find specific ingredients for a recipe, but the recipe names them in languages that never appear on your store's labels, and some ingredients are hidden behind counters. This is the challenge of multilingual, multifaceted information retrieval: finding the right information across languages and formats so you can cook the perfect dish, no matter where the ingredients come from.
BGE-M3 is a state-of-the-art embedding model built to navigate the complexities of retrieving relevant information from data that is diverse in both language and structure. Like a pilot flying an aircraft whose sensors pick up signals across many spectrums and languages, BGE-M3 brings precision and adaptability to Information Retrieval (IR), changing how systems make sense of vast, varied data.
BGE-M3 strengthens IR systems along three dimensions, which together give the model its name:
Multi-Linguality: searching through documents written in many languages. For example, if a healthcare organization needs to gather research on a new treatment from global sources, the information could be in English, Chinese, Spanish, or any number of languages. The challenge is not only understanding different languages but also recognizing that the same medical concepts may go by different names or related terms across them. BGE-M3 can encode text in more than 100 languages, ensuring no critical information is lost to language barriers.
Multi-Functionality: supporting several retrieval modes with a single model. BGE-M3 can produce dense embeddings for semantic search, sparse (lexical) weights for keyword-style matching, and multi-vector (ColBERT-style) representations for fine-grained relevance scoring. A business analyst hunting for market trends can therefore combine broad semantic recall with exact-term precision in one system; a sketch of all three output types follows this list.
Multi-Granularity: handling inputs of very different lengths, from short sentences to documents of up to 8,192 tokens. A researcher might need to match a one-line query against entire papers one day and against individual passages the next, and BGE-M3 produces useful embeddings at each of these granularities.
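To make multi-functionality concrete, here is a minimal sketch using the FlagEmbedding library's BGEM3FlagModel (the package is installed in Step 1 below), which can emit all three output types in a single call; treat this as a sketch following the library's published usage, not a definitive recipe:
from FlagEmbedding import BGEM3FlagModel

# Load BGE-M3; set use_fp16=True on a compatible GPU to halve memory use
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)

sentences = ["Climate change is a significant global issue."]

# Request dense, sparse (lexical), and multi-vector (ColBERT) outputs at once
output = model.encode(
    sentences,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
    max_length=8192  # multi-granularity: inputs up to 8,192 tokens
)

print(output['dense_vecs'][0].shape)    # dense: one 1024-dim vector per input
print(output['lexical_weights'][0])     # sparse: token-id-to-weight mapping
print(output['colbert_vecs'][0].shape)  # multi-vector: one vector per token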
BGE-M3's integration into Milvus through the BGEM3EmbeddingFunction class enables these capabilities, making it a powerhouse for cross-lingual and diverse IR applications.
Integrating BGE-M3 with Milvus
Integrating BGE-M3 with Milvus involves setting up the system (installation) and then putting its functionality to work (instantiation and parameter adjustment) so you can process information efficiently.
Step-by-Step Guide on Integration
Step 1: Installation of the Required Python Packages
Before you can use BGE-M3 with Milvus, you need to install the Python packages that create the embeddings and communicate with the database.
pip install FlagEmbedding
pip install pymilvus
pip install libclang
pip install tensorflow-io-gcs-filesystem
pip install milvus-model
These commands install the required packages, which are crucial for encoding your data into a format that Milvus can efficiently index and search.
Step 2: Instantiation of BGEM3EmbeddingFunction
With the package installed, the next step is to set up the embedding function within your Python environment.
from pymilvus.model.hybrid import BGEM3EmbeddingFunction

bge_m3 = BGEM3EmbeddingFunction(
    model_name='BAAI/bge-m3',
    device='cpu',
    use_fp16=False
)
- model_name='BAAI/bge-m3': specifies the model used to create the embeddings.
- device='cpu': selects the computation device. You can change this to 'cuda:0' if you're using a GPU.
- use_fp16=False: determines whether to use half-precision floating points to save memory. Set this to True if efficiency is needed and you're on a compatible GPU.
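For reference, on a machine with a CUDA-capable GPU the same function might be instantiated as below; this is a sketch, and 'cuda:0' simply names the first GPU:
# Hypothetical GPU setup: first CUDA device, half precision to save memory
bge_m3_gpu = BGEM3EmbeddingFunction(
    model_name='BAAI/bge-m3',
    device='cuda:0',
    use_fp16=True
)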
Step 3: Embedding Document and Query Processing
Once the function is instantiated, you can begin processing documents and queries.
Example: Encoding Documents
# Example documents in different languages
documents = [
    "Climate change is a significant global issue.",
    "El cambio climático es un problema global significativo.",
    "气候变化是一个重大的全球问题。"
]

# Encoding the documents
docs_embeddings = bge_m3.encode_documents(documents)
print("Document Embeddings:", docs_embeddings)
This code encodes a list of documents into embeddings, which Milvus can then use for indexing and retrieval.
Example: Encoding Queries
# Example queries
queries = [
    "What are the effects of climate change?",
    "¿Cuáles son los efectos del cambio climático?",
    "气候变化有什么影响?"
]

# Encoding the queries
query_embeddings = bge_m3.encode_queries(queries)
print("Query Embeddings:", query_embeddings)
This code encodes queries into embeddings that can then be used to search the indexed documents in Milvus.
These examples demonstrate how to convert text data into embeddings that Milvus can process efficiently. Each text string, whether a document or a query, is transformed into a vector representation of its semantic content, and Milvus uses these vectors to perform fast, accurate searches across large datasets.
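One detail worth checking: in the pymilvus model library, encode_documents returns a dictionary keyed by embedding type rather than a bare list of vectors, and the insert step later in this article relies on its 'dense' entry. A quick sanity check on your installed version:
# Inspect the structure encode_documents returned on your version
print(docs_embeddings.keys())             # expected: entries such as 'dense' and 'sparse'
print(len(docs_embeddings['dense']))      # one dense vector per input document
print(docs_embeddings['dense'][0].shape)  # expected: (1024,) for BAAI/bge-m3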
Implementing Similarity Search with Milvus
Once you have your document and query embeddings ready, using Milvus for similarity search involves several steps: indexing the document embeddings and then searching these indices with your query embeddings to find the most similar documents. Here’s how you can do it:
Step 1: Set Up Milvus and Connect to the Server
First, ensure that Milvus is properly set up and running, and establish a connection to the Milvus server from your application. You can go through the How to Get Started article to learn more.
NB! The mentioned article includes a Docker Compose file for configuring the Milvus server. Make sure the container version matches the pymilvus version we installed at the beginning of this article. To find the version you installed, run the following lines of code:
import pymilvus
print(pymilvus.__version__)
You can change the version of pymilvus using the following command:
pip install --force-reinstall -v "pymilvus==your version"
For this tutorial, we will use version 2.4.0. Once the versions match, proceed with the following code to establish a connection to the Milvus server from your application:
# Importing the libraries
import numpy as np
from pymilvus import (
    connections,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
    utility
)

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")
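To confirm the connection succeeded, you can ask the server for its version; utility.get_server_version() is a standard pymilvus call, though the exact output format may vary by release:
# Sanity check: print the server version to confirm the connection works
print("Connected to Milvus:", utility.get_server_version())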
Step 2: Define and Create Collection
This step initializes a collection in Milvus, a vector database optimized for vector similarity search:
# fields is a list of FieldSchema objects, each defining the structure of a
# field (column) within a Milvus collection (similar to a table in a
# relational database).
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024)
]

# This line creates a CollectionSchema object, passing the previously
# defined fields list and a description of the collection.
schema = CollectionSchema(fields, description="Document Embeddings Collection")
collection_name = "doc_embeddings"
collection = Collection(name=collection_name, schema=schema)

# Note: we do not load the collection yet. In Milvus 2.x, loading a
# collection that has no index raises an error, so loading happens in the
# next step, after the index is created.
Step 3: Create (or Recreate) the Index
# Release the collection before modifying the index; a loaded collection
# cannot have its index dropped or rebuilt
if utility.has_collection(collection_name):
    try:
        collection.release()
    except Exception:
        pass  # the collection was not loaded, so there is nothing to release

# Drop the existing index, if there is one
if collection.has_index():
    collection.drop_index()

# Create a new index on the vector field
index_params = {
    "index_type": "IVF_FLAT",  # inverted file index over flat (exact) vectors
    "metric_type": "L2",       # Euclidean distance
    "params": {"nlist": 128}   # number of clusters to partition the vectors into
}
collection.create_index("embedding", index_params)
print("New index created successfully.")

# Load the collection into memory for the operations that follow
collection.load()
This code releases the collection from memory if it is loaded, drops any existing index, and then creates a new index on the "embedding" field using the IVF_FLAT index type and the L2 distance metric. Performing these operations on a released (unloaded) collection keeps them safe and preserves data integrity and performance.
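IVF_FLAT with nlist=128 is only one reasonable choice. As a hedged alternative, a graph-based HNSW index often trades a little extra memory for better recall and latency; the parameter values below are illustrative defaults, not tuned settings:
# Illustrative alternative: HNSW graph index (example parameter values)
hnsw_params = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {
        "M": 16,               # maximum edges per node in the graph
        "efConstruction": 200  # build-time search breadth
    }
}
# collection.create_index("embedding", hnsw_params)  # instead of IVF_FLAT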
Step 4: Prepare the dataset
This block of code initializes the BGEM3EmbeddingFunction for encoding multilingual documents into vector embeddings, using the BAAI/bge-m3 model on the cpu device, and then inserts the resulting dense embeddings into the Milvus collection for the vector-based search operations that follow.
# Prepare and Insert Data
from pymilvus.model.hybrid import BGEM3EmbeddingFunction

bge_m3 = BGEM3EmbeddingFunction(
    model_name='BAAI/bge-m3',
    device='cpu',
    use_fp16=False
)

# Your data
documents = [
    "Climate change is a significant global issue.",
    "El cambio climático es un problema global significativo.",
    "气候变化是一个重大的全球问题。"
]

# Encode the documents and insert their dense embeddings
docs_embeddings = bge_m3.encode_documents(documents)
entities = [{"embedding": doc.tolist()} for doc in docs_embeddings['dense']]
insert_result = collection.insert(entities)
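After inserting, it is good practice to flush the collection so the new rows are sealed and persisted before you search them; collection.flush() is a standard pymilvus call, and insert_result.primary_keys holds the auto-generated IDs:
# Persist the freshly inserted embeddings before querying them
collection.flush()
print("Inserted IDs:", insert_result.primary_keys)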
Step 5: Search and Query Operations
This code block performs a vector-based search on the Milvus collection. It first loads the collection into memory, then sets up the search parameters: the L2 (Euclidean) distance metric and nprobe=10, meaning ten of the 128 IVF clusters are probed for each query. The search uses the third document's embedding as the query, asks for the six most similar entries, and returns their IDs. Finally, it prints each hit and its corresponding ID to show the outcome of the search.
# Load the collection into memory before searching
collection.load()

# Define search parameters
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10}
}

# Example: search with the third document's embedding as the query
query_embeddings = docs_embeddings['dense'][2:3]

result = collection.search(
    query_embeddings,
    "embedding",
    search_params,
    limit=6,
    output_fields=["id"]
)

for hits in result:
    for hit in hits:
        print(f"hit: {hit}, id: {hit.id}")
Overcoming Challenges with BGE-M3 and Milvus
Working with advanced technologies like BGE-M3 and Milvus can revolutionize information retrieval, particularly when managing data across diverse languages and ensuring optimal performance. However, like any sophisticated technology, challenges can arise, especially around customization and scalability.
Explaining Common Challenges with Examples
Challenge 1: Maximizing Multilingual Capabilities
While BGE-M3 is adept at handling diverse languages, the challenge often lies in maximizing its capabilities to ensure consistent performance across all languages. For example, a global customer support center utilizes BGE-M3 to analyze and retrieve customer queries worldwide. Even though BGE-M3 supports multilingual content, ensuring that it performs equally well for languages with less training data or more complex grammatical structures (such as Hungarian or Finnish) requires careful calibration of the model and continuous updates to its training datasets to capture the nuances of these languages effectively.
Challenge 2: Ensuring Efficient Scalability and Performance
While BGE-M3 integrated with Milvus efficiently manages and retrieves large datasets, the challenge intensifies when scaling to accommodate exponentially growing data volumes without compromising query response times and accuracy. For instance, a financial analytics firm relies on BGE-M3 and Milvus for real-time transactional data analysis across global markets. As data accumulates rapidly, maintaining quick retrieval speeds and precise results requires strategic indexing management and careful distribution of computational resources. Optimizing these aspects ensures that performance remains robust as data scales, which is crucial for applications requiring immediate data access and analysis.
Best Practices
- Tuning Model Parameters:
BGE-M3 offers various parameters that can be tuned according to specific needs, such as adjusting the dimensions of embeddings or the precision of the model. Fine-tuning these parameters helps balance between retrieval accuracy and computational efficiency.
- Leveraging Sparse and Dense Embeddings:
Combining sparse and dense embeddings addresses different needs within the same system. Sparse embeddings aid scalability and keyword-level precision in high-dimensional, sparsely populated data, while dense embeddings capture the complex semantic patterns that are crucial for tasks like semantic search; a hybrid-search sketch follows this list.
- Ensuring Scalability and Efficiency:
As data volumes grow, it's essential to scale the system efficiently. Techniques include implementing sharding to distribute data across multiple nodes and using load balancing to manage query traffic, ensuring the system remains responsive and efficient as it scales.
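As a hedged illustration of the sparse-plus-dense practice above, pymilvus 2.4 exposes a hybrid-search API built from AnnSearchRequest objects and a ranker such as RRFRanker. The sketch assumes a collection hybrid_col that, unlike the one built earlier, has both a dense field and a sparse field; the field names dense_vec and sparse_vec are hypothetical:
from pymilvus import AnnSearchRequest, RRFRanker

# Hypothetical collection with fields "dense_vec" (FLOAT_VECTOR) and
# "sparse_vec" (SPARSE_FLOAT_VECTOR); encode the query both ways
q = bge_m3.encode_queries(["What are the effects of climate change?"])

dense_req = AnnSearchRequest(
    data=q["dense"],
    anns_field="dense_vec",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10
)
sparse_req = AnnSearchRequest(
    data=q["sparse"],
    anns_field="sparse_vec",
    param={"metric_type": "IP"},  # inner product is typical for sparse vectors
    limit=10
)

# Reciprocal Rank Fusion merges the two rankings into a single result list
results = hybrid_col.hybrid_search(
    [dense_req, sparse_req], RRFRanker(), limit=5, output_fields=["id"]
)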
Advanced Features and Future Directions of BGE-M3 and Milvus
The integration of BGE-M3 with Milvus brings a robust set of capabilities to information retrieval systems, equipped with advanced features and potential for significant future advancements.
Advanced Features and Customization Options
Fine-Tuning for Specific Domains: BGE-M3 allows for domain-specific fine-tuning, enhancing relevance and accuracy within specialized areas. For example, in the legal field, BGE-M3 can be fine-tuned on legal terminologies and case law to enhance retrieval accuracy of legal documents, a process that involves re-training with targeted legal datasets.
Integration with Other AI Models: BGE-M3 can be seamlessly integrated with other AI models, such as Named Entity Recognition (NER) systems. This integration is particularly useful in fields like medical research, where it is crucial to accurately identify and retrieve information on specific medical terms from vast datasets. (Example: Embedding Inference at Scale for RAG Applications with Ray Data and Milvus)
Future Trajectory and Developments
Anticipated Developments in IR Technologies: Future developments in BGE-M3 and Milvus are expected to further reduce latency and enhance accuracy for complex queries in dynamic environments. This includes algorithmic enhancements and more efficient data handling mechanisms. (Read more: Emerging Trends in Vector Database Research and Development)
Potential New Applications: The technologies are set to expand into areas such as real-time multimedia information retrieval, enabling them to handle and analyze not just text but also audio and video content across various applications, significantly transforming sectors like digital media and entertainment.
Conclusion: Revolutionizing Search with BGE-M3 and Milvus
The use of BGE-M3 with Milvus marks a significant advancement in Information Retrieval (IR), transforming how data is accessed and analyzed across various domains. Here's a recap of this integration's transformative impact on IR and its contributions to the AI community.
Enhanced Multilingual Capabilities: BGE-M3's remarkable ability to handle over 100 languages seamlessly integrates with Milvus, providing robust, scalable solutions for multilingual document retrieval. This ensures that no linguistic data is left behind or misunderstood, making global applications more inclusive and effective.
Precision and Scalability in Retrieval: BGE-M3's fine-tuning capabilities, combined with Milvus's efficient indexing, enable precise and scalable retrieval operations. This synergy supports dense, multi-vector, and sparse retrieval, making it ideal for applications ranging from academic research to real-time consumer data analysis.
Fostering Innovation Across Sectors: By enabling efficient and accurate retrieval of complex data sets, BGE-M3 and Milvus empower sectors such as healthcare, finance, and legal industries to innovate and enhance their services. These technologies provide the backbone for systems that require quick access to detailed, relevant information, thereby improving decision-making processes and operational efficiencies.
The potential of BGE-M3 and Milvus is limitless, offering vast opportunities for innovation in virtually any field that relies on information retrieval. This exciting prospect encourages researchers, developers, and business leaders to explore how these technologies can be integrated into their projects, pushing the boundaries of what's possible in their domains.