Getting Started with Voyager: Spotify's Nearest-Neighbor Search Library

Nov 05, 2024 · 8 min read

Voyager is Spotify's new open-source library for fast nearest-neighbor searches. It uses the HNSW algorithm and outperforms Spotify's previous library, Annoy.

By Haziqa Sajid

In today's data-driven world, efficient search functionalities are important for many applications. Nearest-neighbor search (NNS) is a key technique in this regard, supporting tasks such as music recommendations and image retrieval. As application systems need to handle large amounts of data, it’s important to have efficient algorithms to quickly find similar data relevant to a given query.

One of the most common approaches is using approximate nearest-neighbor (ANN) search algorithms. Libraries like Annoy (Approximate Nearest Neighbors, Oh Yeah) are widely used to implement these algorithms, enabling fast nearest-neighbor searches in large, high-dimensional datasets. However, with the increasing scale of data, there has been a need for more advanced, faster, and more accurate techniques.

To address this challenge, Spotify released Voyager in 2023, a new open-source library designed to perform nearest-neighbor searches quickly. Voyager uses the HNSW (Hierarchical Navigable Small World) algorithm and is a significant advancement over Spotify's previous library, Annoy.

According to Spotify's engineering team, Voyager achieves speeds up to 10 times faster than Annoy while maintaining a similar recall rate. It also delivers up to 50% higher accuracy at comparable speeds and requires up to four times less memory than Annoy.

Figure: Voyager: Spotify's Nearest-Neighbor Search Library

This article will discuss Voyager’s key features, including multithreaded index creation and querying capabilities. Additionally, we will cover best practices for optimizing Voyager and provide a step-by-step guide for getting started with it.

Under the Hood of Voyager: Core Features

Voyager builds on its predecessor, Annoy, by addressing key challenges in nearest-neighbor search, such as scaling to large, high-dimensional datasets. It is especially well-suited for managing the scale and complexity of data in modern applications. Let's understand Voyager's core and supported features.

Enhanced Speed and Accuracy

One of Voyager’s standout features is its significantly improved speed in both indexing and querying. Spotify’s engineering team reports speeds up to 10 times faster than Annoy at a similar recall rate. This makes Voyager a better choice for applications that need low-latency search.

Voyager optimizes the search process using the Hierarchical Navigable Small World (HNSW) algorithm. HNSW organizes vectors into a multi-layer graph in which each vector is linked to some of its nearby vectors. A search starts in the sparse top layer, moves greedily toward the query, and then descends into denser layers to refine the result. Because only a small part of the graph is visited, Voyager can find the nearest neighbors in high-dimensional spaces quickly while keeping the computational load low.
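The graph navigation that HNSW relies on can be illustrated with a small plain-Python sketch. This is a simplified, single-layer greedy walk and not Voyager's implementation: real HNSW stacks several layers and tracks a list of candidates instead of a single current node. The points, the graph, and the function names are hypothetical and chosen only for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(points, graph, query, entry):
    """Walk the graph from `entry`, always stepping to the neighbor closest to `query`.

    Stops when no neighbor is closer than the current node and returns
    (index of the node reached, list of visited nodes).
    """
    current = entry
    path = [current]
    while True:
        best = min(graph[current], key=lambda n: euclidean(points[n], query))
        if euclidean(points[best], query) >= euclidean(points[current], query):
            return current, path
        current = best
        path.append(current)

# Hypothetical toy data: six points on a line
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]

# Each point links to its direct neighbors; 0 <-> 3 is an extra long-range link
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 4, 0], 4: [3, 5], 5: [4]}

nearest, path = greedy_search(points, graph, query=(4.2, 0.0), entry=0)
print(nearest, path)  # 4 [0, 3, 4]
```

The long-range link from point 0 to point 3 lets the walk skip points 1 and 2, which is the intuition behind "small world" graphs. A plain greedy walk like this can stop at a local optimum, which is one reason production implementations keep several candidates in play.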

Moreover, Voyager provides up to 50% better accuracy than Annoy at comparable speeds. This boost in accuracy yields higher-quality search results, which is critical for recommendation engines that aim to improve user experience.

Reduced Memory Usage

Voyager can store vectors in the E4M3 8-bit floating-point representation (4 exponent bits and 3 mantissa bits per value) instead of 32-bit floats. This lets it handle larger datasets efficiently while requiring up to four times less memory than Annoy, which makes it an attractive option for organizations that deal with high-dimensional data and want to optimize their resource usage.
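A quick back-of-the-envelope calculation shows where a factor of four can come from. The sketch below only counts the bytes needed for the vector values themselves, assuming the baseline stores 32-bit floats. It ignores graph links and other index overhead, so real numbers will differ. The dataset size is hypothetical.

```python
def raw_vector_bytes(num_vectors, num_dimensions, bytes_per_value):
    """Bytes needed for the vector values alone (ignores graph links and metadata)."""
    return num_vectors * num_dimensions * bytes_per_value

num_vectors = 10_000_000   # hypothetical catalog size
num_dimensions = 200

float32_bytes = raw_vector_bytes(num_vectors, num_dimensions, 4)  # 32-bit floats
e4m3_bytes = raw_vector_bytes(num_vectors, num_dimensions, 1)     # 8-bit E4M3 values

print(f"float32: {float32_bytes / 1e9:.1f} GB")            # float32: 8.0 GB
print(f"E4M3:    {e4m3_bytes / 1e9:.1f} GB")               # E4M3:    2.0 GB
print(f"ratio:   {float32_bytes / e4m3_bytes:.0f}x")       # ratio:   4x
```

The trade-off is precision: an 8-bit value carries far fewer significant digits than a 32-bit float, so it is worth checking search quality on your own data before switching storage types.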

Multithreaded Index Creation and Querying

Another important feature of Voyager is fully multithreaded index creation and querying. Multithreading lets Voyager spread work across multiple CPU cores and run computations in parallel, which reduces indexing and query time on large datasets and improves the efficiency of the whole system.

Multithreading is important for large-scale applications like Spotify, where millions of users constantly create data and search requests. It lets the system respond to high volumes of queries and indexing for a better and more efficient user experience.

Language Support

Voyager is designed for flexibility in different production environments. While many nearest-neighbor libraries like HNSWlib focus primarily on Python, Voyager supports both Python and Java. This dual-language capability supports backend engineers who prefer JVM-based languages like Java and Scala for deploying performance-critical systems, as well as data scientists who work in Python for machine learning.

Setting up Voyager: A Step-by-Step Guide

This section walks through setting up Voyager in Python, building an index, and performing searches. To set up Voyager in Java, please refer to the official documentation.

Installation and Prerequisites

To use Voyager's Python bindings, we'll first need to install the Voyager package. You can do this using pip:

!pip install voyager

Import libraries

Let's import the necessary libraries. We'll need numpy for numerical operations and voyager for efficient and accurate nearest-neighbor search.

import numpy as np
from voyager import Index, Space

Prepare data

Voyager works with vector embeddings. We're using pre-trained word embeddings from Google's Word2Vec project.

Download the vector data (word2vec_10000_200d_tensors.bytes)
Download the corresponding labels (word2vec_10000_200d_labels.tsv).
Load the vectors into a NumPy array (vectors) and the labels into a list (labels).

# Vector data
!wget https://storage.googleapis.com/embedding-projector/data/word2vec_10000_200d_tensors.bytes

# Labels
!wget https://storage.googleapis.com/embedding-projector/data/word2vec_10000_200d_labels.tsv

Once the vectors and labels are ready, load the vectors into a NumPy array and the labels into a list.

num_dimensions = 200

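# Read the raw float32 values and reshape them into rows of 200-dimensional vectors
with open("word2vec_10000_200d_tensors.bytes", "rb") as f:
    vectors = np.fromfile(f, np.float32).reshape(-1, num_dimensions)

# Take the first column of each row as the label, skipping the header line
with open("word2vec_10000_200d_labels.tsv", "r") as f:
    labels = [line.split("\t")[0] for line in f.readlines()[1:]]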

Building an Index

Use Voyager's Index class to store and manage the vectors.

# Create an Index object that can store 200-dimensional vectors
index = Index(Space.Cosine, num_dimensions=num_dimensions)

We use Space.Cosine (cosine distance) as our similarity metric, but other metrics such as Space.Euclidean are also available. num_dimensions is set to 200 because the vectors have 200 dimensions.
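To see how these metrics differ, here is a small plain-Python sketch of the textbook definitions. It is independent of Voyager, and a library may use a variant of these formulas internally (for example, a squared Euclidean distance), so treat it as an illustration of the concepts and not as a description of Voyager's exact output.

```python
import math

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance; sensitive to both direction and length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, twice the length
c = [0.0, 1.0]   # orthogonal to a

print(cosine_distance(a, b))     # 0.0 (only direction matters)
print(euclidean_distance(a, b))  # 1.0 (length matters)
print(cosine_distance(a, c))     # 1.0 (orthogonal)
print(dot(a, b))                 # 2.0 (inner product grows with length)
```

Cosine distance ignores vector length, which is often a good fit for word embeddings like the ones used here.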

Now, use the add_items() method to add the vectors to the index.

index.add_items(vectors)

Performing Searches

Once the index is built, we can perform nearest-neighbor searches.

# Look up the embedding for the word 'dog' to use as the query vector
query_vector = vectors[labels.index('dog')]

# query() returns two arrays: neighbors (indices of the nearest neighbors
# in the index) and distances (the corresponding distances)
neighbors, distances = index.query(query_vector, k=10)

for neighbor, distance in zip(neighbors, distances):
    print(f"\t{labels[neighbor]!r} is {distance:.2f} away from 'dog'")

Output

    'dog' is  0.00 away from dog
    'dogs' is  0.28 away from dog
    'cat' is  0.35 away from dog
    'bird' is  0.38 away from dog
    'breed' is  0.38 away from dog
    'fish' is  0.41 away from dog
    'pet' is  0.41 away from dog
    'cats' is  0.41 away from dog
    'cow' is  0.42 away from dog
    'rat' is  0.42 away from dog

When comparing the word “dog” to itself, the distance is 0 since they are identical. Closely related words, such as the plural form “dogs,” have a small distance, indicating high similarity. Words that are related but conceptually different, like “cat,” are a little farther away. Toward the end of the list, words such as “cow” and “rat” have the largest distances (0.42) of the ten results. Because these are the ten nearest neighbors, even the last entries are still among the closest words to “dog” in the vocabulary.

And there you have it! This is a basic example of using Voyager for nearest-neighbor search in Python with word embeddings. Of course, you can explore more advanced features of Voyager, such as multithreading and saving or loading indexes. However, this example should provide a solid foundation for starting with Voyager.

Best Practices and Optimization Tips

Follow these best practices and strategies to optimize Voyager for nearest-neighbor search tasks.

Data Preparation

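Normalization: Standardize your vectors so that each dimension has zero mean and unit variance. This keeps dimensions with large numeric ranges from dominating the result and can improve search accuracy, especially for distance-based metrics like Euclidean distance.
Dimensionality Reduction: To handle high-dimensional data, consider using dimensionality reduction techniques like PCA. These methods can reduce the number of dimensions while retaining important information, ultimately enhancing the efficiency of indexing and searching. t-SNE is another well-known technique, but standard t-SNE does not provide a mapping for new query vectors, so it is better suited to visualization.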
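As a minimal sketch of the normalization step, the plain-Python helpers below standardize each dimension and, as a related option that is not part of the advice above, scale a vector to unit length (a common preparation for cosine or inner-product search). The function names and sample data are hypothetical. With NumPy arrays the same operations are one-liners, but this version has no dependencies.

```python
import math
import statistics

def standardize_columns(vectors):
    """Rescale each dimension to zero mean and unit (population) variance."""
    columns = list(zip(*vectors))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(vec, means, stds)] for vec in vectors]

def l2_normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

# Hypothetical data: the second dimension has a much larger range than the first
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]

standardized = standardize_columns(data)
unit = l2_normalize([3.0, 4.0])

print(standardized[1])  # [0.0, 0.0] (the middle row sits exactly at the mean)
print(unit)             # [0.6, 0.8]
```

Whatever transformation you apply to the indexed vectors must also be applied to every query vector, using the statistics computed from the indexed data.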

Index Construction

Space Selection: Choose the similarity space based on your data and needs. Voyager's options are Euclidean distance (Space.Euclidean), inner product (Space.InnerProduct), and cosine distance (Space.Cosine).
Multithreaded Construction: Use multiple threads during index construction to speed it up, especially for large datasets. In Python, add_items() accepts a num_threads argument for this. Experiment to find the best setting for your system.

Search Optimization

k Value: Select an appropriate k value, representing the number of nearest neighbors to retrieve. Balance the need for retrieving sufficient neighbors with the impact on search latency.
Parallelize the Search: When you have many queries, pass them to query() as a batch and use multiple threads (the num_threads argument in Python) to speed up the process.
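One way to judge whether a given search setting is good enough is to measure recall: the share of the true nearest neighbors that the approximate search actually returns. The sketch below computes recall against a brute-force baseline in plain Python. The data and the approximate result list are hypothetical stand-ins. In practice, the approximate ids would come from the index you are evaluating.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_neighbors(vectors, query, k):
    """Brute-force baseline: indices of the k vectors closest to `query`."""
    order = sorted(range(len(vectors)), key=lambda i: euclidean(vectors[i], query))
    return order[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true k nearest neighbors found by the approximate search."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [5.0, 1.0]]
query = [0.2, 0.1]

truth = exact_neighbors(vectors, query, k=3)
approx = [0, 1, 3]  # hypothetical output of an approximate index

recall = recall_at_k(approx, truth)
print(truth)             # [0, 1, 2]
print(f"{recall:.2f}")   # 0.67
```

Running this over a sample of real queries while timing the searches gives a simple recall-versus-latency picture for each setting you try.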

Memory Management

Large Datasets: For very large datasets, consider dividing your data into smaller chunks and building a separate index for each chunk. This can help manage memory usage and improve performance.
Saving and Loading: Save built indexes to disk using index.save() and load them later with Index.load() to avoid rebuilding the index every time, especially for large datasets.
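If you split data across several indexes, each query has to be sent to every chunk and the partial results merged. The sketch below shows one way to do that merge in plain Python. The search_shard function is a brute-force stand-in for querying a real index, and all names and data are hypothetical. Voyager itself does not provide this helper.

```python
import heapq
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_shard(shard, query, k):
    """Stand-in for querying one index: (distance, local_id) pairs, closest first."""
    scored = [(euclidean(vec, query), i) for i, vec in enumerate(shard)]
    return heapq.nsmallest(k, scored)

def search_all(shards, query, k):
    """Query every shard, convert local ids to global ids, keep the overall top k."""
    candidates = []
    offset = 0
    for shard in shards:
        for dist, local_id in search_shard(shard, query, k):
            candidates.append((dist, offset + local_id))
        offset += len(shard)  # global ids continue where the previous shard ended
    return heapq.nsmallest(k, candidates)

shards = [
    [[0.0, 0.0], [4.0, 4.0]],               # global ids 0, 1
    [[1.0, 1.0], [0.5, 0.0], [9.0, 9.0]],   # global ids 2, 3, 4
]

results = search_all(shards, query=[0.0, 0.0], k=2)
ids = [global_id for _, global_id in results]
print(ids)  # [0, 3]
```

Asking each shard for k results before merging guarantees that the overall top k is not missed, at the cost of some extra work per query.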

Voyager vs. Purpose-Built Vector Databases

Both vector search libraries like Voyager and purpose-built vector databases like Milvus address the challenge of similarity search in high-dimensional vector data. However, they serve different needs and scales.

Vector Search Libraries: Focused Efficiency

Libraries like Voyager prioritize efficient nearest-neighbor search. They provide lightweight and fast solutions for finding similar vectors, which is ideal for smaller, single-node environments with static or moderately sized datasets. However, they generally lack features for managing dynamic data, ensuring persistence, or scaling across distributed systems. Developers typically need to handle data management, updates, and scaling manually.

Purpose-Built Vector Databases: Comprehensive Solutions

Purpose-built vector databases like Milvus and Zilliz Cloud (powered by Milvus) offer a more comprehensive approach to managing vector data. These databases go beyond simple vector search, providing:

Persistent Storage: Ensuring data durability and availability.
Real-time Updates: Handling dynamic data and frequent updates effectively.
Distributed Architecture and Scalability: Scaling horizontally to handle increasing datasets and query workloads.
Advanced Querying Capabilities: Supporting complex queries, including filtering, metadata searches, and vector similarity searches.

These features make vector databases like Milvus ideal for production environments that require scalability, high availability, and complex search functionalities.

Conclusion

Voyager advances nearest-neighbor search technology. Its HNSW-based index and multithreaded indexing and querying handle high-dimensional data efficiently, while 8-bit E4M3 storage reduces the memory footprint. Because it supports both Python and Java, it's a flexible tool for a variety of applications.

However, it’s important to consider specific needs when choosing between Voyager and a purpose-built vector database like Milvus. Voyager is a good choice for lightweight and efficient search, while vector databases like Milvus and Zilliz Cloud have more comprehensive, enterprise-level features like persistence, real-time updates, and the ability to scale to handle large production workloads. By following best practices and using Voyager’s capabilities, you can optimize search operations to improve application performance.

Updated on Mar 31, 2025
