Getting Started with Voyager: Spotify's Nearest-Neighbor Search Library
Voyager is a new open-source library for fast nearest-neighbor searches. It uses the HNSW algorithm and outperforms Spotify's previous library, Annoy.
In today's data-driven world, efficient search is important for many applications. Nearest-neighbor search (NNS) is a key technique here, supporting tasks such as music recommendation and image retrieval. As applications handle ever larger amounts of data, they need efficient algorithms that quickly find the items most similar to a given query.
One of the most common approaches is using approximate nearest-neighbor (ANN) search algorithms. Libraries like Annoy (Approximate Nearest Neighbors, Oh Yeah) are widely used to implement these algorithms, enabling fast nearest-neighbor searches in large, high-dimensional datasets. However, with the increasing scale of data, there has been a need for more advanced, faster, and more accurate techniques.
To address this challenge, Spotify released Voyager in 2023, a new open-source library designed to perform nearest-neighbor searches quickly. Voyager uses the HNSW (Hierarchical Navigable Small World) algorithm and is a significant advance over Spotify's previous library, Annoy.
According to Spotify's engineering team, Voyager achieves speeds up to 10 times faster than Annoy at a similar recall rate. It also delivers up to 50% higher accuracy at comparable speeds and uses up to four times less memory than Annoy.
Figure: Voyager: Spotify's Nearest-Neighbor Search Library
This article will discuss Voyager’s key features, including multithreaded index creation and querying capabilities. Additionally, we will cover best practices for optimizing Voyager and provide a step-by-step guide for getting started with it.
Under the Hood of Voyager: Core Features
Voyager builds on its predecessor, Annoy, by addressing key challenges in nearest-neighbor search, such as scaling to large, high-dimensional datasets, which makes it well suited to the scale and complexity of data in modern applications. Let's look at Voyager's core features.
Enhanced Speed and Accuracy
One of Voyager's standout features is its significantly improved speed in both indexing and querying. Spotify's engineering team reports speeds up to 10 times faster than Annoy at a similar recall rate, which makes Voyager a better fit for applications that need low-latency search.
Voyager performs its searches with the Hierarchical Navigable Small World (HNSW) algorithm. HNSW organizes the vectors into a layered graph in which each vector is linked to a few nearby vectors. A search starts in the sparse top layer, repeatedly hops to whichever linked vector is closest to the query, and drops down a layer whenever it can get no closer. Because only a small fraction of the vectors is ever compared with the query, nearest neighbors in high-dimensional spaces can be found quickly and with little computation.
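HNSW searches by walking a graph of linked vectors toward the query. To make that concrete, here is a toy, single-layer sketch of the greedy step that HNSW-style search repeats on each layer. It is an illustration only, not Voyager's implementation: the points, the links, and the greedy_search helper are all made up for this example, and real HNSW additionally uses several layers and tracks a list of candidates rather than a single current node.

```python
import math

# Toy data: six points on a line. Each point links to its immediate
# neighbors, and point 0 also has one long-range link to point 3.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0),
          3: (3.0, 0.0), 4: (4.0, 0.0), 5: (5.0, 0.0)}
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2, 4], 4: [3, 5], 5: [4]}


def greedy_search(points, graph, entry, query):
    """Hop to the linked point closest to the query until no hop gets closer."""
    current = entry
    current_dist = math.dist(points[current], query)
    path = [current]
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query))
        best_dist = math.dist(points[best], query)
        if best_dist >= current_dist:
            return current, path  # no linked point is closer: stop here
        current, current_dist = best, best_dist
        path.append(current)


nearest, path = greedy_search(points, graph, entry=0, query=(4.2, 0.0))
print(nearest, path)  # 4 [0, 3, 4]
```

The long-range link lets the walk skip from point 0 to point 3 in a single hop, which is the role the sparse upper layers play in HNSW.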
Moreover, Spotify reports up to 50% better accuracy than Annoy at comparable speeds. This boost in accuracy yields higher-quality search results, which is critical for recommendation engines that aim to improve the user experience.
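Accuracy for approximate search is commonly reported as recall: the fraction of the true nearest neighbors that the approximate search returns. The sketch below shows that calculation on made-up ID lists. recall_at_k is a hypothetical helper, not part of Voyager, and in practice the exact neighbors would come from a brute-force search over the same data.

```python
def recall_at_k(approximate_ids, exact_ids):
    """Fraction of the exact nearest neighbors found by the approximate search."""
    exact = set(exact_ids)
    return len(exact.intersection(approximate_ids)) / len(exact)


# Made-up example: the approximate search found 3 of the 4 true neighbors.
approximate = [7, 2, 9, 4]
exact = [7, 2, 5, 4]
recall = recall_at_k(approximate, exact)
print(recall)  # 0.75
```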
Reduced Memory Usage
By optionally storing vectors in the E4M3 8-bit floating-point representation instead of 32-bit floats, Voyager can handle larger datasets efficiently while requiring up to four times less memory than Annoy. This makes it an attractive option for organizations that deal with high-dimensional data and want to optimize their resource usage.
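The four-times figure follows from simple arithmetic if the baseline stores each value as a 32-bit float (an assumption here): an E4M3 value takes 1 byte instead of 4. The sketch below works this out for a dataset of 10,000 vectors with 200 dimensions. It counts only the raw vector values and ignores the graph links and other overhead that an index also stores, so real savings can be smaller. raw_vector_bytes is a hypothetical helper for illustration.

```python
def raw_vector_bytes(num_vectors, num_dimensions, bytes_per_value):
    """Bytes needed for the vector values alone (no index overhead)."""
    return num_vectors * num_dimensions * bytes_per_value


float32_bytes = raw_vector_bytes(10_000, 200, bytes_per_value=4)
e4m3_bytes = raw_vector_bytes(10_000, 200, bytes_per_value=1)

print(float32_bytes, e4m3_bytes, float32_bytes // e4m3_bytes)  # 8000000 2000000 4
```

The trade-off is precision: with only 8 bits per value, stored vectors are rounded more coarsely than 32-bit floats.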
Multithreaded Index Creation and Querying
Another important feature of Voyager is fully multithreaded index creation and querying. Multithreading lets Voyager spread the work across multiple CPU cores and run computations in parallel, which reduces indexing and query time on large datasets and improves overall system efficiency.
Multithreading is important for large-scale applications like Spotify, where millions of users constantly create data and search requests. It lets the system respond to high volumes of queries and indexing for a better and more efficient user experience.
Language Support
Voyager is designed for flexibility in different production environments. While many nearest-neighbor libraries like HNSWlib focus primarily on Python, Voyager supports both Python and Java. This dual-language capability supports backend engineers who prefer JVM-based languages like Java and Scala for deploying performance-critical systems, as well as data scientists who work in Python for machine learning.
Setting up Voyager: A Step-by-Step Guide
This section walks through setting up Voyager in Python, building an index, and performing searches. To set up Voyager in Java, please refer to the official documentation.
Installation and Prerequisites
To use Voyager's Python bindings, we'll first need to install the Voyager package. You can do this using pip:
!pip install voyager
Import libraries
Let's import the necessary libraries. We'll need numpy for numerical operations and voyager for efficient and accurate nearest-neighbor search.
import numpy as np
from voyager import Index, Space
Prepare data
Voyager works with vector embeddings. We're using pre-trained word embeddings from Google's Word2Vec project.
Download the vector data (word2vec_10000_200d_tensors.bytes) and the corresponding labels (word2vec_10000_200d_labels.tsv).
#Vector data
!wget https://storage.googleapis.com/embedding-projector/data/word2vec_10000_200d_tensors.bytes
#labels
!wget https://storage.googleapis.com/embedding-projector/data/word2vec_10000_200d_labels.tsv
Once the vectors and labels are ready, load the vectors into a NumPy array and the labels into a list.
num_dimensions = 200
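# Read the flat file of float32 values and reshape it into one 200-dimensional row per word.
with open("word2vec_10000_200d_tensors.bytes", "rb") as f:
    vectors = np.fromfile(f, np.float32).reshape(-1, num_dimensions)
# Skip the first line of the TSV and keep the first column (the word) of each remaining line.
with open("word2vec_10000_200d_labels.tsv", "r") as f:
    labels = [line.split("\t")[0] for line in f.readlines()[1:]]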
Building an Index
Use Voyager's Index class to store and manage the vectors.
# Create an Index object that can store vectors:
index = Index(Space.Cosine, num_dimensions=200)
We use Space.Cosine, which ranks neighbors by cosine distance, and set num_dimensions to 200 because the vectors have 200 dimensions. Other metrics, such as Space.Euclidean, are also available.
Now, use the add_items() method to add the vectors to the index.
index.add_items(vectors)
Performing Searches
Once the index is built, we can perform nearest-neighbor searches.
# Use the embedding of 'dog' as the query vector.
query_vector = vectors[labels.index('dog')]
# query() returns two arrays: neighbors (the IDs of the nearest neighbors in the index)
# and distances (the corresponding distances to the query).
neighbors, distances = index.query(query_vector, k=10)
for neighbor, distance in zip(neighbors, distances):
    print(f"{labels[neighbor]!r} is {distance:.2f} away from dog")
Output
'dog' is 0.00 away from dog
'dogs' is 0.28 away from dog
'cat' is 0.35 away from dog
'bird' is 0.38 away from dog
'breed' is 0.38 away from dog
'fish' is 0.41 away from dog
'pet' is 0.41 away from dog
'cats' is 0.41 away from dog
'cow' is 0.42 away from dog
'rat' is 0.42 away from dog
Comparing “dog” with itself gives a distance of 0, since the vectors are identical. Closely related words, such as the plural “dogs,” have a small distance, indicating high similarity. Related but conceptually different words, such as “cat,” are somewhat farther away. Words lower in the list, such as “cow” and “rat,” are the least similar of the ten, although they are still among the closest of all the words in the index.
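The distances above are easier to read with the usual definition of cosine distance in mind: one minus the cosine of the angle between two vectors. The sketch below assumes that Space.Cosine reports this quantity, which the 0.00 for “dog” against itself is consistent with. cosine_distance is a hypothetical helper written in plain Python for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # approximately 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 1.0
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # 2.0
```

Under this definition, the 0.28 reported for “dogs” corresponds to a cosine similarity of 0.72.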
And there you have it! This is a basic example of using Voyager for nearest-neighbor search in Python with word embeddings. Of course, you can explore more advanced features of Voyager, such as multithreading and saving or loading indexes. However, this example should provide a solid foundation for starting with Voyager.
Best Practices and Optimization Tips
Follow these best practices and strategies to optimize Voyager for nearest-neighbor search tasks.
Data Preparation
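Normalization: Scale your vectors consistently before indexing, for example by normalizing each vector to unit length or standardizing each dimension to zero mean and unit variance. This keeps a few large-magnitude values from dominating distance-based metrics such as Euclidean distance and can improve search accuracy.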
Dimensionality Reduction: To handle high-dimensional data, consider a dimensionality reduction technique such as PCA. (t-SNE is designed mainly for visualization and cannot easily project new query vectors, so it is less suited to this role.) Reducing the number of dimensions while retaining the important information makes indexing and searching more efficient.
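One common form of normalization for vector search is scaling each vector to unit length. The sketch below does this in plain Python and checks a useful consequence: for unit-length vectors, squared Euclidean distance equals 2 - 2 × cosine similarity, so Euclidean and cosine rankings agree. l2_normalize is a hypothetical helper; with NumPy arrays the same scaling is usually done in one vectorized step.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


a = l2_normalize([3.0, 4.0])  # length 5, so roughly [0.6, 0.8]
b = l2_normalize([8.0, 6.0])  # length 10, so roughly [0.8, 0.6]

cosine_similarity = sum(x * y for x, y in zip(a, b))
squared_euclidean = sum((x - y) ** 2 for x, y in zip(a, b))

# For unit-length vectors the two quantities are tied together.
print(math.isclose(squared_euclidean, 2 - 2 * cosine_similarity))  # True
```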
Index Construction
Space Selection: Choose the appropriate similarity space parameter based on your data and needs. Voyager options include Euclidean distance, inner product, and cosine (cosine similarity).
Multithreading: Use multiple threads during index construction, especially for large datasets. In the Python API, add_items() accepts a num_threads argument; experiment to find the best setting for your system.
Search Optimization
k Value: Select an appropriate k value, representing the number of nearest neighbors to retrieve. Balance the need for retrieving sufficient neighbors with the impact on search latency.
Parallelize the Search: Use multiple threads during the search to speed up the process.
Memory Management
Large Datasets: For large datasets, consider dividing your data into smaller chunks and building a separate index for each chunk. This can help manage memory usage and improve performance, at the cost of querying every index and merging the results.
Saving and Loading: Save built indexes to disk using index.save() and load them later with Index.load() to avoid rebuilding the index every time, especially for large datasets.
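If the data is split across several indexes, as suggested for large datasets, each query has to be run against every index and the per-index results merged into one top-k list. The sketch below shows only the merge step, on made-up results. merge_shard_results is a hypothetical helper, and it assumes that item IDs are unique across all indexes and that every index uses the same distance metric.

```python
import heapq


def merge_shard_results(shard_results, k):
    """Merge per-index lists of (distance, item_id) into the k closest overall."""
    all_pairs = (pair for results in shard_results for pair in results)
    return heapq.nsmallest(k, all_pairs)


# Made-up top-3 results from two separate indexes, as (distance, item_id).
shard_a = [(0.10, 11), (0.35, 12), (0.50, 13)]
shard_b = [(0.20, 21), (0.30, 22), (0.90, 23)]

top_3 = merge_shard_results([shard_a, shard_b], k=3)
print(top_3)  # [(0.1, 11), (0.2, 21), (0.3, 22)]
```

Each index must return at least k candidates for the merged list to be a correct top-k over all of the data.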
Voyager vs. Purpose-Built Vector Databases
Both vector search libraries like Voyager and purpose-built vector databases like Milvus address the challenge of similarity search in high-dimensional vector data. However, they serve different needs and scales.
Vector Search Libraries: Focused Efficiency
Libraries like Voyager prioritize efficient nearest-neighbor search. They provide lightweight and fast solutions for finding similar vectors, which is ideal for smaller, single-node environments with static or moderately sized datasets. However, they generally lack features for managing dynamic data, ensuring persistence, or scaling across distributed systems. Developers typically need to handle data management, updates, and scaling manually.
Purpose-Built Vector Databases: Comprehensive Solutions
Purpose-built vector databases like Milvus and Zilliz Cloud (powered by Milvus) offer a more comprehensive approach to managing vector data. These databases go beyond simple vector search, providing:
Persistent Storage: Ensuring data durability and availability.
Real-time Updates: Handling dynamic data and frequent updates effectively.
Distributed Architecture and Scalability: Scaling horizontally to handle increasing datasets and query workloads.
Advanced Querying Capabilities: Supporting complex queries, including filtering, metadata searches, and vector similarity searches.
These features make vector databases like Milvus ideal for production environments that require scalability, high availability, and complex search functionalities.
Conclusion
Voyager advances nearest-neighbor search technology. Its multithreaded indexing and querying, reduced memory footprint, and HNSW-based search let it handle high-dimensional data efficiently. Because it supports both Python and Java, it is a flexible tool for a variety of applications.
However, it’s important to consider specific needs when choosing between Voyager and a purpose-built vector database like Milvus. Voyager is a good choice for lightweight and efficient search, while vector databases like Milvus and Zilliz Cloud have more comprehensive, enterprise-level features like persistence, real-time updates, and the ability to scale to handle large production workloads. By following best practices and using Voyager’s capabilities, you can optimize search operations to improve application performance.