Blog
Getting Started with PyMilvus

Getting Started with PyMilvus

Jul 07, 20238 min read

Milvus, an open-source vector database, paired with PyMilvus - its Python SDK, is a powerful tool for handling large data sets and performing advanced computations and searches.

This tutorial will guide you in installing and setting up a development environment for using Milvus and PyMilvus. Then, you'll walk through example code for analyzing audio files, storing their data in Milvus, and then using it to compare audio samples for similarities. Let's get started.

What is Milvus?

Milvus is an open-source vector database that handles trillion-scale vector similarity search and analytics. It's a powerful and efficient platform for managing and searching unstructured data through embedding vectors, an increasingly important capability in machine learning, computer vision, recommendation systems, and other artificial intelligence (AI) fields.

At its core, Milvus leverages the concept of embeddings, where data points are represented as high-dimensional vectors. These vectors can capture the similarity between data points based on their semantic or visual features. Milvus enables these vectors' storage, indexing, and retrieval, allowing users to perform fast similarity searches over massive datasets.

With its distributed architecture and support for GPU acceleration, Milvus can efficiently process large volumes of vector data and deliver real-time search results. In addition, it offers a range of indexing algorithms optimized for different use cases, including approximate nearest neighbor (ANN) search algorithms, to balance accuracy and efficiency.

Milvus supports various programming languages and provides easy-to-use APIs, making it accessible to developers and researchers. In addition, it integrates well with popular machine learning frameworks and data processing tools, enabling seamless integration into existing workflows.

What is PyMilvus?

PyMilvus is the Python SDK for Milvus. It's designed to allow Python developers to interact with Milvus easily and efficiently, facilitating tasks such as managing data, conducting vector similarity searches, building indexes, and more.

With PyMilvus, developers leverage the full power of Milvus in their Python applications without having to worry about the intricacies of the underlying database operations. It abstracts away many of the complexities of working with Milvus, providing a simple, easy-to-use API that integrates seamlessly with existing Python workflows.

Setup for the tutorial

The command line examples in this post will be from Linux, but they should translate easily to macOS or Windows with the Windows Subsystem for Linux (WSL). You'll need a system with Docker or Docker Desktop, Python 3.10+, and git.

Milvus bootcamp repository

We're going to use code from the Milvus Bootcamp repository on Github for this tutorial. Start by cloning the repo to your local system.

[egoebelbecker@ares bootcamp]$ git clone https://github.com/milvus-io/bootcamp.git
Cloning into 'bootcamp'...
remote: Enumerating objects: 61861, done.
remote: Counting objects: 100% (4488/4488), done.
remote: Compressing objects: 100% (1628/1628), done.
remote: Total 61861 (delta 2983), reused 4180 (delta 2791), pack-reused 57373
Receiving objects: 100% (61861/61861), 166.78 MiB | 4.91 MiB/s, done.
Resolving deltas: 100% (28214/28214), done.

Python virtual environment

Next, create a virtual environment to run your code. You can put it in the directory where you cloned the bootcamp repo. Activate the environment after you've created it.

[egoebelbecker@ares bootcamp]$ cd bootcamp
[egoebelbecker@ares bootcamp]$ git checkout archive20230625
[egoebelbecker@ares bootcamp]$ python3 -m venv ./venv
[egoebelbecker@ares bootcamp]$ source venv/bin/activate
(venv) [egoebelbecker@ares bootcamp]$

Install Python dependencies

Now it's time to set up our project.

We're going to use the Audio Similarity Search example. It uses a publicly-available set of audio files to demonstrate how you can use Milvus to detect similarities between different samples.

Move to the project's subdirectory and look at the requirements.txt file supplied with the code.

(venv) [egoebelbecker@ares bootcamp]$ cd solutions/audio/audio_similarity_search/
(venv) [egoebelbecker@ares audio_similarity_search]$ cat requirements.txt
pymilvus
redis
librosa
numpy
panns_inference
torch
diskcache
gdown

These are the Python libraries needed to run the project. Use pip to install them. Then, add Jupyter so we can run the notebook.

(venv) [egoebelbecker@ares audio_similarity_search]$ pip install -r requirements.txt
(Lots of output)
(venv) [egoebelbecker@ares audio_similarity_search]$ pip install jupyter
(More output)

Start Redis

Now, start a Redis Docker container.

docker run --name redis -d -p 6379:6379 redis

Install and start Milvus Lite

Finally, you'll need a Milvus server. The easiest way to get up and running with Milvus is Milvus Lite, which runs in Python. Install the milvus packages.

(venv) [egoebelbecker@ares audio_similarity_search]$ pip install milvus
(Lots of output)
(venv) [egoebelbecker@ares audio_similarity_search]$ milvus-server 


    __  _________ _   ____  ______
   /  |/  /  _/ /| | / / / / / __/
  / /|_/ // // /_| |/ / /_/ /\ \
 /_/  /_/___/____/___/\____/___/ {Lite}

 Welcome to use Milvus!

 Version:   v2.2.8-lite
 Process:   275118
 Started:   2023-05-25 12:41:47
 Config:    /home/egoebelbecker/.milvus.io/milvus-server/2.2.8/configs/milvus.yaml
 Logs:      /home/egoebelbecker/.milvus.io/milvus-server/2.2.8/logs

 Ctrl+C to exit ...

PyMilvus Project

Now, with an environment set up, you can run the tutorial.

Start Jupyter

Start Jupyter from your virtual environment.

(venv) [egoebelbecker@ares audio_similarity_search]$ python -m jupyter notebook --notebook-dir=`pwd`
  _   _          _      _
 | | | |_ __  __| |__ _| |_ ___
 | |_| | '_ \/ _` / _` |  _/ -_)
  \___/| .__/\__,_\__,_|\__\___|
       |_|
                       
Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.

https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html

(More output)

This will open the Jupyter Files page in your default browser.

Navigate to the solutions/audio/audio_similarity_search directory.

Open the audio_similarity_search notebook.

Connect to Redis and Milvus

You've already done the initial setup, so scroll right down to the Code Overview section. The code starts by importing libraries for Redis and Milvus.

On the second line, you can see the classes you'll need to connect to Milvus, create a Collection, and add and retrieve data from it.

import redis
from pymilvus import connections, DataType, FieldSchema, CollectionSchema, Collection, utility

connections.connect(host = '127.0.0.1', port = 19530)
red = redis.Redis(host = '127.0.0.1', port=6379, db=0)

Then, the code connects to the two servers. Run this block. Now, it's time to create a Milvus collection to store information about each audio file.

import time

red.flushdb()
time.sleep(.1)
collection_name = "audio_test_collection"

if utility.has_collection(collection_name):
    print("Dropping existing collection...")
    collection = Collection(name=collection_name)
    collection.drop()

#if not utility.has_collection(collection_name):
field1 = FieldSchema(name="id", dtype=DataType.INT64, description="int64", is_primary=True,auto_id=True)
field2 = FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, description="float vector", dim=2048, is_primary=False)
schema = CollectionSchema(fields=[ field1,field2], description="collection description")
collection = Collection(name=collection_name, schema=schema)
print("Created new collection with name: " + collection_name)

Create a Milvus Collection

This step creates a collection named audio_test_collection with two fields:

id - an integer identification field
embedding - a vector of 2048 floating point numbers to store data about the audio files

Run this block and you'll see this message:

Dropping existing collection...
Created new collection with name: audio_test_collection

Next, add an index to your new collection:

if utility.has_collection(collection_name):
    collection = Collection(name = collection_name)
default_index = {"index_type": "IVF_SQ8", "metric_type": "L2", "params": {"nlist": 16384}}
status = collection.create_index(field_name = "embedding", index_params = default_index)
if not status.code:
    print("Successfully create index in collection:{} with param:{}".format(collection_name, default_index))

Run this block:

Successfully create index in collection:audio_test_collection with param:{'index_type': 'IVF_SQ8', 'metric_type': 'L2', 'params': {'nlist': 16384}}

Store audio data

Milvus is ready to store the audio data now. Let's break the next code block down and take a close look.

import os
import librosa
import gdown
import zipfile
import numpy as np
from panns_inference import SoundEventDetection, labels, AudioTagging

data_dir = './example_audio'
at = AudioTagging(checkpoint_path=None, device='cpu')

def download_audio_data():
    url = 'https://drive.google.com/uc?id=1bKu21JWBfcZBuEuzFEvPoAX6PmRrgnUp'
    gdown.download(url)
    with zipfile.ZipFile('example_audio.zip', 'r') as zip_ref:
        zip_ref.extractall(data_dir)

This section imports libraries for processing the files. Then. it creates an AudioTagging object from the Panns Inference library.

This is a Python interface to the Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition (PANNs).

The AudioTagging class will create a set of data points for each audio file, similar to this:

Speech: 0.893
Telephone bell ringing: 0.754
Inside, small room: 0.235
Telephone: 0.183
Music: 0.092
Ringtone: 0.047
Inside, large room or hall: 0.028
Alarm: 0.014
Animal: 0.009
Vehicle: 0.008
embedding: (2048,)

Next is the code to download the files, analyze them, and insert the data into Milvus, and the file information into Redis:

def embed_and_save(path, at):
    
    audio, _ = librosa.core.load(path, sr=32000, mono=True)
    audio = audio[None, :]
    try:
        _, embedding = at.inference(audio)
        embedding = embedding/np.linalg.norm(embedding)
        embedding = embedding.tolist()[0]
        mr = collection.insert([[embedding]])
        ids = mr.primary_keys
        collection.load()
        red.set(str(ids[0]), path)
    except Exception as e:
        print("failed: " + path + "; error {}".format(e))

This code block puts it all together:

print("Starting Insert")
download_audio_data()
for subdir, dirs, files in os.walk(data_dir):
    for file in files:
        path = os.path.join(subdir, file)
        embed_and_save(path, at)
print("Insert Done")

Run the code:

Checkpoint path: /home/egoebelbecker/panns_data/Cnn14_mAP=0.431.pth
Using CPU.
Starting Insert

Downloading...
From: [https://drive.google.com/uc?id=1bKu21JWBfcZBuEuzFEvPoAX6PmRrgnUp](https://drive.google.com/uc?id=1bKu21JWBfcZBuEuzFEvPoAX6PmRrgnUp)
To: /home/egoebelbecker/src/bootcamp/solutions/audio/audio_similarity_search/example_audio.zip
100%|█████████████████████████████████████████████████████████████████████████████████████████| 474k/474k [00:00<00:00, 8.87MB/s]

Insert Done

Search for similarities

To search the database, you need to perform the same analysis on the files you need to match. Here's code to do that to a randomly selected set of file ids.

```python
def get_embed(paths, at):
    embedding_list = []
    for x in paths:
        audio, _ = librosa.core.load(x, sr=32000, mono=True)
        audio = audio[None, :]
        try:
            _, embedding = at.inference(audio)
            embedding = embedding/np.linalg.norm(embedding)
            embedding_list.append(embedding)
        except:
            print("Embedding Failed: " + x)
    return np.array(embedding_list, dtype=np.float32).squeeze()

random_ids = [int(red.randomkey()) for x in range(2)]
search_clips = [x.decode("utf-8") for x in red.mget(random_ids)]
embeddings = get_embed(search_clips, at)
print(embeddings.shape)

Run this block to select two files and run the analysis. Now, it's time to perform a search in Milvus. This code displays the search results in audio players:

import IPython.display as ipd

def show_results(query, results, distances):
    print("Query: ")
    ipd.display(ipd.Audio(query))
    print("Results: ")
    for x in range(len(results)):
        print("Distance: " + str(distances[x]))
        ipd.display(ipd.Audio(results[x]))
    print("-"*50)

Next, the program takes the embeddings array build above and converts it into a list. Then it creates our search parameters. We're searching for two similar sounds, and using nprobe for the criteria. This matches the index type you used above.

embeddings_list = embeddings.tolist()

search_params = {"metric_type": "L2", "params": {"nprobe": 16}}

Finally, here is the search.

try:
    start = time.time()
    results = collection.search(embeddings_list, anns_field="embedding", param=search_params, limit=3)
    end = time.time() - start
    print("Search took a total of: ", end)
    for x in range(len(results)):
        query_file = search_clips[x]
        result_files = [red.get(y.id).decode('utf-8') for y in results[x]]
        distances = [y.distance for y in results[x]]
        show_results(query_file, result_files, distances)
except Exception as e:
    print("Failed to search vectors in Milvus: {}".format(e))

Run it, and you'll see the randomly selected files, and two search results, with their distance from the original.

Listen to each file, and you'll hear the exact match and its closest neighbor.

Summary

In this post, you learned how to build a Python environment for working with Milvus. After setting up a server and the PyMilvus SDK, you ran a sample tutorial from the Milvus Bootcamp repository. You saw code to create a collection, index it, store data, and perform a basic search.

Now that you've seen how easy it is to get started with Milvus, continue with the bootcamp projects and see how you can use Milvus to enhance your projects.

This post was written by Eric Goebelbecker. Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).

Updated on Mar 28, 2025

Eric Goebelbecker
Eric Goebelbecker has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).

Content

Start Free, Scale Easily

Try the fully-managed vector database built for your GenAI applications.

Try Zilliz Cloud for Free

Share this article

Keep Reading

Producing Structured Outputs from LLMs with Constrained Sampling

Discuss the role of semantic search in processing unstructured data, how finite state machines enable reliable generation, and practical implementations using modern tools for structured outputs from LLMs.

Catch a Cute Ghost this Halloween with Milvus

Run ghastly multimodal analytics and Retrieval Augmented Generation with our "ghosts" collections in the open-source Milvus vector database.

The Role of LLMs in Modern Travel: Opportunities and Challenges Ahead

Explore How GetYourGuide use LLMs to improve customer experiences and How RAG address common LLM issues