What is Voyager?

Dec 19, 2024 · 8 min read

Voyager is an Approximate Nearest Neighbor (ANN) search library optimized for high-dimensional vector data.

Introduction

Modern AI systems rely on Nearest Neighbor Search (NNS) and vector similarity search to find patterns and relationships within vast datasets. NNS identifies the closest data points to a given query, enabling applications like recommendation engines, search platforms, and AI-driven exploration tools to provide relevant and personalized results. By representing data as high-dimensional vectors, vector similarity search measures the closeness of these vectors using mathematical metrics such as cosine similarity or Euclidean distance.
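As a minimal sketch of the two metrics named above, the following plain-Python functions compute Euclidean distance and cosine similarity for small example vectors. The function names and the sample vectors are illustrative only and are not taken from any particular library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors: one pointing the same way as the query, one at a right angle.
query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(cosine_similarity(query, same_direction))   # direction matches, length differs
print(euclidean_distance(query, same_direction))  # length difference still counts here
print(cosine_similarity(query, orthogonal))       # no directional overlap
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to both, which is why the choice of metric depends on how the embeddings were produced.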

NNS plays a critical role in AI applications by allowing systems to efficiently compare and retrieve similar items. For example, a recommendation system can identify songs with features similar to a user’s listening history, while an image search engine can retrieve visually related results. The speed and accuracy of these searches are essential for delivering meaningful user experiences in real time.

To address the growing demands of large-scale and real-time vector search, Spotify developed Voyager, an open-source ANN library designed to optimize search performance. Voyager builds on the Hierarchical Navigable Small World (HNSW) algorithm and introduces features like dynamic indexing and multithreading to support scalable and adaptive AI systems. This article will explore Voyager’s architecture, functionality, and role in advancing vector similarity search.

What is Voyager?

Voyager is an Approximate Nearest Neighbor (ANN) search library optimized for high-dimensional vector data. It organizes datasets into a graph-based structure, enabling fast and efficient similarity searches. Designed to support dynamic and real-time systems, Voyager is capable of handling the evolving needs of modern AI-driven applications.

Why Was Voyager Built?

Voyager was developed to address specific challenges encountered with existing ANN libraries, particularly Annoy, which Spotify previously relied on. Annoy uses a tree-based indexing approach, which works well for static datasets but struggles in environments where data changes frequently. Because Annoy requires a complete rebuild of the index each time new data is added, frequent updates lead to inefficiencies and operational challenges.

Additionally, as Spotify’s datasets grew into billions of vectors, Annoy’s memory usage and slower query times became bottlenecks. Spotify also explored other libraries, such as hnswlib, which implements the HNSW algorithm. While hnswlib offered better performance and accuracy, it lacked production-ready features like fault tolerance and multi-language support.

Voyager was built to overcome these challenges. Let’s see how Voyager functions.

How Voyager Works

Voyager achieves its efficiency and flexibility through a combination of graph-based navigation, dynamic indexing, and multithreaded operations. These features allow it to handle large-scale, high-dimensional datasets while delivering fast and accurate results.

Core Mechanism: Graph-Based Navigation

At the heart of Voyager lies the HNSW algorithm, which organizes vectors into a graph structure. Each node in the graph represents a data point, and edges connect nodes based on their proximity in vector space. The graph is hierarchical, with layers that balance broad exploration and precise refinement.

Upper Layers: Sparse connections allow for quick exploration of the dataset, narrowing down candidates efficiently.
Lower Layers: Denser connections over progressively more data points, with the bottom layer containing every vector, allow precise searches within the region that the upper layers located.

The following diagram illustrates the layered structure of the HNSW graph:

Figure: A diagram from the HNSW paper that visualizes the layered graph concept.

In the diagram, searches start at Layer 2, where fewer nodes are connected. This allows the algorithm to identify broad regions of relevance quickly. Then, more connections are evaluated to refine the candidates at Layer 1. Finally, at Layer 0, which contains the densest connections, the nearest neighbors are identified with high precision.
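The top-down descent can be sketched with a toy three-layer graph. This is a simplified illustration, not Voyager's implementation: the points, the hand-written adjacency lists, and the function names are all hypothetical, and real HNSW keeps a list of candidates on the bottom layer instead of following a single node.

```python
import math

# Hypothetical toy data: eight 2-D points on a line, keyed by node id.
points = {i: (float(i), 0.0) for i in range(8)}

# layers[0] is the dense bottom layer; higher indices hold fewer nodes.
layers = [
    {i: [j for j in (i - 1, i + 1) if 0 <= j <= 7] for i in range(8)},  # Layer 0: every node
    {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]},                             # Layer 1: every other node
    {0: [4], 4: [0]},                                                   # Layer 2: two nodes
]

def greedy_search_layer(graph, entry, query):
    """Walk to whichever neighbor is closer to the query until none improves."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(points[n], query), default=current)
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best

def layered_search(query, entry=0):
    """Start at the sparsest layer and reuse each layer's result as the next entry point."""
    current = entry
    for graph in reversed(layers):
        current = greedy_search_layer(graph, current, query)
    return current

print(layered_search((5.2, 0.0)))  # Layer 2 jumps to node 4, Layer 1 to node 6, Layer 0 settles on node 5
```

The long edges in the upper layers cover most of the distance in a few hops, so the bottom layer only has to refine the result locally.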

Consider a music streaming service recommending songs similar to a user’s favorite track. At Layer 2, the algorithm navigates through sparse connections, identifying broad categories like pop, rock, or classical, and directs the search to the relevant genre, such as pop. At Layer 1, more connections are evaluated within the pop category, narrowing down to subgenres like synth-pop, indie-pop, or electro-pop. This step focuses the search on tracks with stylistic features closer to the user’s query. Finally, at Layer 0, the algorithm performs precision matching, identifying specific songs within the synth-pop subgenre that share attributes like tempo, vocal style, or instrumentation with the query track. This multi-layered approach ensures both speed and accuracy in generating personalized recommendations.

This hierarchical approach gives searches roughly logarithmic scaling with dataset size, enabling Voyager to efficiently navigate large datasets.
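The logarithmic behavior comes from how nodes are assigned to layers. In the HNSW paper, each new node draws its top layer as floor(-ln(U) * mL), where U is uniform on (0, 1) and mL is commonly set to 1 / ln(M), with M the number of connections per node. The sketch below simulates that rule; the value M = 16 and the sample size are arbitrary choices for illustration, and whether Voyager uses exactly these defaults is an assumption.

```python
import math
import random
from collections import Counter

def random_level(m, rng):
    """Draw a top layer for a new node: floor(-ln(U) * mL), with mL = 1 / ln(m)."""
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # shift to (0, 1] so log(0) cannot occur
    return int(-math.log(u) * m_l)

rng = random.Random(0)  # fixed seed so the run is repeatable
levels = Counter(random_level(16, rng) for _ in range(100_000))

for level in sorted(levels):
    print(level, levels[level])
```

Under this rule the share of nodes reaching layer l or higher is M^(-l), so each layer is about M times smaller than the one below it and the number of layers grows with the logarithm of the dataset size.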

Dynamic Indexing

Unlike static libraries, Voyager supports real-time additions, updates, and deletions of vectors. This feature eliminates the need for complete index rebuilds, ensuring that the system remains responsive even as data evolves. For example, in a music recommendation system, new songs can be added or old ones removed without disrupting the user experience.
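To make the idea concrete, here is a toy in-memory index that accepts additions and deletions between queries without any rebuild step. It is a brute-force stand-in, not Voyager's graph index, and the class and method names are hypothetical.

```python
import math

class ToyDynamicIndex:
    """Hypothetical brute-force index that can change between queries."""

    def __init__(self):
        self._vectors = {}     # item id -> vector
        self._deleted = set()  # ids hidden from future queries
        self._next_id = 0

    def add_item(self, vector):
        """Store a vector and return its newly assigned id."""
        item_id = self._next_id
        self._vectors[item_id] = tuple(vector)
        self._next_id += 1
        return item_id

    def mark_deleted(self, item_id):
        """Hide an item from results without touching the other entries."""
        self._deleted.add(item_id)

    def query(self, vector, k=1):
        """Return the ids of the k closest live items."""
        live = [(math.dist(v, vector), i)
                for i, v in self._vectors.items() if i not in self._deleted]
        return [i for _, i in sorted(live)[:k]]

index = ToyDynamicIndex()
old_song = index.add_item([0.0, 0.0])
hit_song = index.add_item([1.0, 1.0])
print(index.query([0.9, 0.9]))     # the hit song is closest
index.mark_deleted(hit_song)       # remove it from the catalog
new_song = index.add_item([1.0, 0.9])
print(index.query([0.9, 0.9]))     # the newly added song is now returned
```

A graph-based index has more bookkeeping to do on each change, since it must link new nodes to their neighbors, but the caller-facing behavior is the same: changes are visible to the next query.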

Multithreaded Operations

Voyager leverages multithreading to enhance both indexing and querying performance. It achieves this by distributing tasks across multiple CPU cores, which accelerates processing times and increases throughput. This capability is essential for handling high query volumes in real-time environments.
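The pattern of spreading a batch of queries over a pool of workers can be sketched as follows. This is an illustration of the idea only: the data, function names, and thread count are hypothetical, and because of CPython's global interpreter lock, pure-Python threads like these do not run the distance computations in parallel the way threads inside a native library can.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical dataset of 1,000 two-dimensional vectors.
vectors = [(float(i), float(i % 7)) for i in range(1000)]

def nearest(query):
    """Exact nearest neighbor by linear scan; returns the index of the closest vector."""
    return min(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))

def batch_query(queries, num_threads=4):
    """Hand each query to a worker thread; results come back in input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(nearest, queries))

print(batch_query([(10.0, 3.0), (500.2, 3.0)]))
```

Queries are independent of one another, which is what makes a read-only search workload straightforward to distribute across cores.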

Key Features of Voyager

Voyager includes several features that make it a reliable tool for vector search in modern AI applications.

Scalability

Voyager is designed to handle billions of vectors without compromising performance. It achieves this by using memory-efficient techniques, such as E4M3 8-bit floating-point representations, which reduce resource consumption while maintaining accuracy. This scalability makes Voyager suitable for enterprise-scale deployments.

Real-Time Updates

Dynamic indexing is one of Voyager’s key features, allowing systems to integrate new data or update existing entries without downtime. This capability is crucial for applications like recommendation engines, where data is constantly evolving.

High Accuracy

Despite using approximate methods, Voyager delivers high accuracy through hierarchical graph refinement. Developers can also adjust parameters to balance speed and recall, tailoring Voyager to specific application needs.
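Recall is usually measured by comparing approximate results against the exact top-k neighbors from a brute-force scan. A minimal sketch of that calculation follows; the function name and the id lists are made up for illustration.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

exact_top4 = [3, 7, 1, 9]    # hypothetical ground truth from an exhaustive scan
approx_top4 = [3, 1, 9, 4]   # hypothetical result from an approximate index

print(recall_at_k(approx_top4, exact_top4))  # 3 of the 4 true neighbors were found
```

In HNSW-style indexes, widening the candidate list examined during search generally raises recall at the cost of slower queries, so this metric is typically tracked alongside latency while tuning.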

Production-Ready Design

Voyager incorporates fault-tolerant features, such as corruption detection for index files. Its cloud-compatible design supports stream-based I/O, simplifying deployment in distributed systems.
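One common way to detect corruption is to store a checksum next to the data and verify it on load. The sketch below writes vectors to any file-like stream with a CRC32 in the header. It illustrates the general technique under assumed names and an invented byte layout; it is not Voyager's index file format.

```python
import io
import struct
import zlib

def save_vectors(stream, vectors):
    """Write a count/dimension/CRC32 header followed by little-endian float32 data."""
    payload = b"".join(struct.pack(f"<{len(v)}f", *v) for v in vectors)
    header = struct.pack("<III", len(vectors), len(vectors[0]), zlib.crc32(payload))
    stream.write(header + payload)

def load_vectors(stream):
    """Read vectors back, raising ValueError if the checksum does not match."""
    count, dims, expected_crc = struct.unpack("<III", stream.read(12))
    payload = stream.read(count * dims * 4)
    if zlib.crc32(payload) != expected_crc:
        raise ValueError("index data is corrupted (checksum mismatch)")
    return [list(struct.unpack_from(f"<{dims}f", payload, i * dims * 4))
            for i in range(count)]

buffer = io.BytesIO()  # stands in for a file, socket, or cloud object stream
save_vectors(buffer, [[0.5, 1.0], [2.0, -3.0]])
buffer.seek(0)
print(load_vectors(buffer))
```

Because both functions only call `read` and `write`, the same code works with local files, in-memory buffers, or any object-store client that exposes a stream.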

Optimized for CPU-Based Operations

Voyager is designed to operate efficiently on CPUs, forgoing GPU acceleration to prioritize simplicity and broad compatibility. This approach simplifies deployment, reduces dependencies on specialized hardware, and ensures that Voyager can be easily integrated into diverse environments.

Comparison with Other Libraries

Voyager exists within a growing ecosystem of ANN libraries. Each has strengths and limitations that make it suitable for specific use cases.

Voyager vs. Annoy

Annoy was one of the earliest libraries Spotify used for ANN search. It relies on tree-based indexing, which is effective for static datasets but requires full index rebuilds for updates. Voyager addresses this limitation with dynamic indexing, making it better suited for real-time and dynamic applications. Additionally, Voyager outperforms Annoy in terms of speed and accuracy, especially as datasets grow larger.

Voyager vs. hnswlib

The hnswlib library is a well-known implementation of the HNSW algorithm. While it offers excellent performance and accuracy, it lacks production-grade features like multi-language support and fault tolerance. Voyager builds on hnswlib by adding these enhancements, making it a more robust and versatile tool for real-world deployments.

Voyager vs. ScaNN

ScaNN, short for Scalable Nearest Neighbors, excels in inner-product search but does not support dynamic datasets. This limitation makes it less suitable for real-time systems that require frequent updates. Voyager’s flexibility and support for evolving data make it a better choice for dynamic environments.

Voyager vs. Faiss

Faiss offers GPU acceleration and is well suited to large batch workloads, making it a strong fit for offline tasks like training or preprocessing large datasets. In contrast, Voyager is designed for real-time applications, with CPU-based operations that allow for dynamic indexing and low-latency responses.

These comparisons show Voyager excels as a vector search library, but let’s see how it compares to vector databases, which address additional needs beyond search.

Vector Search Libraries vs. Vector Databases

Vector search libraries and vector databases serve different purposes in managing and searching vector data. While libraries like Voyager focus on speed and efficient in-memory operations, vector databases offer additional features such as persistent storage, distributed scalability, and advanced querying. Deciding between the two depends on the specific needs of the application.

Vector Search Libraries

Vector search libraries are designed for rapid querying and dynamic updates, making them ideal for real-time applications where data changes frequently. They operate entirely in memory, which allows for low-latency responses but limits their ability to handle long-term data storage or complex workloads. For example, libraries like Voyager excel in recommendation engines or personalized search platforms where datasets are updated in real time.

However, these libraries do not provide persistent storage or built-in support for distributed systems, meaning users must manage the infrastructure and data lifecycle manually. This makes libraries well-suited for scenarios that prioritize speed over data permanence or scale.

Vector Databases

Vector databases extend the functionality of libraries by incorporating persistent storage, distributed architectures, and query optimization for large-scale datasets. They are built to manage vector data over time, ensuring durability and accessibility across system restarts. In addition to similarity searches, vector databases often support hybrid queries that combine vector-based retrieval with structured metadata filtering.

Milvus, an open-source vector database created by Zilliz, is a prime example of a purpose-built vector database. It is designed to manage large-scale vector datasets and supports horizontal scaling to handle growing data volumes and query demands. Milvus enables hybrid searches, allowing users to combine similarity searches with filtering on additional attributes, such as timestamps or categories. It also supports hybrid search that combines full-text and vector search. These capabilities make it suitable for e-commerce recommendations, image retrieval, and video analytics applications.
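The idea of combining similarity search with attribute filtering can be sketched in a few lines of plain Python. This is a brute-force illustration with a hypothetical catalog and function name. It does not use the Milvus API, and a real database applies the filter inside its index instead of scanning every item.

```python
import math

# Hypothetical catalog: each item has an embedding plus structured metadata.
items = [
    {"id": 1, "vector": (0.1, 0.9), "category": "shoes", "year": 2023},
    {"id": 2, "vector": (0.2, 0.8), "category": "bags", "year": 2024},
    {"id": 3, "vector": (0.9, 0.1), "category": "shoes", "year": 2024},
    {"id": 4, "vector": (0.3, 0.7), "category": "shoes", "year": 2024},
]

def filtered_search(query, predicate, k=2):
    """Keep items whose metadata passes the predicate, then rank them by distance."""
    candidates = [item for item in items if predicate(item)]
    candidates.sort(key=lambda item: math.dist(item["vector"], query))
    return [item["id"] for item in candidates[:k]]

# Nearest shoes from 2024 to the query embedding.
result = filtered_search(
    (0.15, 0.85),
    lambda item: item["category"] == "shoes" and item["year"] == 2024,
)
print(result)
```

Items 1 and 2 are closest to the query in vector space, but neither satisfies the filter, so the result contains only items that meet both the metadata and similarity criteria.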

Zilliz also offers Zilliz Cloud, a managed vector database service powered by Milvus that simplifies deployment and maintenance. Zilliz Cloud provides all the functionality of Milvus while eliminating the operational overhead associated with managing infrastructure. This makes it a good option for organizations seeking to scale their vector search capabilities without the need to handle backend complexities.

When to Use Each

Choosing between a vector search library and a vector database depends on the specific requirements of the application:

Use a library like Voyager for applications that require rapid querying and real-time updates. These are ideal for lightweight, in-memory operations where datasets change frequently and long-term storage is not a priority.
Opt for a database like Milvus when persistent storage, distributed scalability, and advanced query capabilities are necessary. Vector databases are well-suited for enterprise use cases involving large datasets, hybrid search needs, or applications requiring high availability and scalability.

Conclusion

Voyager provides an effective solution for Approximate Nearest Neighbor (ANN) search, designed to meet the needs of real-time, high-dimensional data applications. Its dynamic indexing, graph-based navigation, and scalability make it a reliable tool for managing evolving datasets with speed and precision.

Libraries like Voyager are well-suited for lightweight, real-time applications that prioritize fast querying. However, vector databases like Milvus offer a comprehensive alternative when persistent storage, scalability, and advanced search capabilities are required. Together, these tools address a wide range of vector search requirements, supporting the growing demands of modern AI applications.

Updated on Mar 26, 2025

Simon Mwaniki
