HNSWlib vs Voyager: Choosing the Right Vector Search Tool for Your GenAI Application


Oct 06, 2024 · 10 min read

Vector search is central to AI applications that need to find similar items among high-dimensional data points. Libraries like HNSWlib and Voyager perform nearest-neighbor searches efficiently so that systems can quickly fetch related items from large datasets. While HNSWlib has gained popularity for its speed and accuracy, Voyager is a newer library from Spotify designed to address HNSWlib's limitations.

This post compares the two, explaining their features and strengths and how they differ so you can decide which one is better for your project.

What is Vector Search?

Before diving into the specifics of HNSWlib and Voyager, it's essential to understand vector search. Simply put, vector search, or vector similarity search, finds the vectors (data points) in a high-dimensional space that are closest to a given query vector. These vectors are often generated by machine learning models to capture the essence of unstructured data (e.g., the meaning of a sentence or the features of an image).

Unlike traditional databases, where searches are based on exact matches or filtering, vector search focuses on similarity. The goal is to find vectors that are "close" to each other based on a distance metric (such as Euclidean distance or cosine similarity). For instance, vectors can represent words or sentences in natural language processing (NLP), and vector search helps find the most semantically similar words or texts. In recommendation systems, vector search identifies items closest to a user's preferences. Vector search also plays a crucial part in retrieval-augmented generation (RAG), a technique that augments the output of large language models (LLMs) by providing them with extra contextual information.
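To make the idea of "closeness" concrete, here is a minimal sketch of exact (brute-force) vector search using only the Python standard library. The toy vectors, their labels, and the function names are made up for illustration; real embeddings have hundreds or thousands of dimensions, and libraries like HNSWlib and Voyager exist precisely to avoid scanning every vector like this.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k):
    """Exact search: rank every stored id by Euclidean distance to the query."""
    ranked = sorted(vectors, key=lambda vid: euclidean_distance(query, vectors[vid]))
    return ranked[:k]

# Hypothetical 3-dimensional "embeddings" (illustrative values only).
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
    "truck": [0.0, 0.8, 0.5],
}

query = [0.9, 0.1, 0.05]
top2 = nearest_neighbors(query, toy_vectors, k=2)
print(top2)
```

Brute force compares the query against every stored vector, so its cost grows linearly with the dataset. Approximate methods trade a little accuracy to avoid that full scan.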

There are many solutions available on the market for performing vector searches, including:

Vector search libraries such as HNSWlib and Voyager.
Purpose-built vector databases such as Milvus and Zilliz Cloud (fully managed Milvus).
Lightweight vector databases such as Chroma and Milvus Lite.
Traditional databases with vector search add-ons.

What is HNSWlib? An Overview

HNSWlib is an open-source library for approximate nearest-neighbor search (ANNS). It’s built on the Hierarchical Navigable Small World (HNSW) algorithm, which forms a graph-based structure where data points are nodes. The algorithm navigates this graph to quickly find approximate neighbors, making HNSWlib very efficient for vector search.

HNSWlib Features and Strengths

HNSW Algorithm: The core of HNSWlib’s power is the HNSW algorithm, which uses a multi-layer graph structure to move between data points by proximity until it reaches the nearest neighbors.
Speed and Accuracy: HNSWlib is known for balancing speed and accuracy. It returns results quickly without significant loss of precision, which suits use cases that require high-quality nearest-neighbor results.
Memory Efficiency: HNSWlib keeps a low memory footprint while still processing large datasets, which helps in memory-constrained applications.
Scalability: HNSWlib scales well to datasets with millions of entries, so it works for both small and large applications.
Flexibility: The library exposes parameters such as M, ef_construction, and ef, which let you trade recall against speed and memory for your use case.

HNSWlib has become the go-to choice for ANN search tasks because of its speed, flexibility, and reliability.
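The graph navigation described above can be illustrated with a heavily simplified sketch. This is not HNSWlib's implementation: it uses a single hand-built layer instead of HNSW's hierarchy, and the graph, points, and function names are hypothetical. It only shows the greedy step at the heart of the search, which is to keep moving to whichever neighbor is closer to the query until no neighbor improves.

```python
import math

def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_search(graph, points, entry, query):
    """Walk the graph from `entry`, always stepping to the neighbor closest
    to `query`; stop when no neighbor is closer than the current node."""
    current = entry
    current_dist = distance(points[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = distance(points[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:
            return current, current_dist  # local minimum reached
        current, current_dist = best, best_dist

# Hypothetical toy data: five points on a line, linked as a chain.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0), 4: (4.0, 0.0)}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

found, found_dist = greedy_search(graph, points, entry=0, query=(3.2, 0.0))
print(found)
```

In full HNSW, the upper layers contain fewer nodes with longer-range links, so the walk covers large distances quickly before descending to denser layers. The search also tracks a list of candidates rather than a single node, which is what parameters like ef control.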

What is Voyager? An Overview

Voyager is Spotify’s newest nearest-neighbor search library, built after extensive use of HNSWlib and identifying areas for improvement. While it’s based on the HNSW algorithm like HNSWlib, Voyager has several optimizations and additional features that make it more suitable for production environments.

Voyager Features and Strengths

Faster and More Accurate: Voyager builds on HNSWlib’s speed and accuracy and, in some cases, delivers faster searches and higher precision, especially in complex, large-scale applications.
Memory Efficiency: Voyager can store vectors in an 8-bit floating-point format (E4M3), which lets it handle high-dimensional data with less memory than HNSWlib's 32-bit floats.
Multithreading and Scalability: Voyager supports fully multithreaded index creation and querying, which helps it process larger datasets, especially in distributed or cloud environments.
Language Support: HNSWlib is a header-only C++ library that is most often used through its Python bindings, while Voyager supports both Python and Java, making it more flexible for production environments that require multiple languages.
Production Ready: Voyager is designed for enterprise use, with features like fault-tolerant index files, corruption detection, and Google Cloud integration, making it more robust and scalable for high-traffic applications.

Voyager takes the solid foundation of HNSWlib and enhances it with features that make it more suitable for modern, large-scale systems.
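To get a rough feel for what 8-bit storage means for memory, here is a back-of-the-envelope calculation. The dataset size and dimensionality are assumptions chosen for illustration, and the figures cover raw vector storage only; the graph links and other index overhead are not included. The name E4M3 denotes a float with 4 exponent bits and 3 mantissa bits plus a sign bit, so each value occupies one byte instead of the four bytes of a 32-bit float.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value):
    """Bytes needed to store the vector values alone (no graph overhead)."""
    return num_vectors * dimensions * bytes_per_value

# Assumed workload, for illustration only.
num_vectors = 1_000_000
dimensions = 768

float32_bytes = raw_vector_bytes(num_vectors, dimensions, 4)  # 32-bit floats
e4m3_bytes = raw_vector_bytes(num_vectors, dimensions, 1)     # 8-bit floats

print(f"float32: {float32_bytes / 1e9:.3f} GB")
print(f"E4M3:    {e4m3_bytes / 1e9:.3f} GB")
print(f"ratio:   {float32_bytes / e4m3_bytes:.0f}x")
```

The saving comes at the cost of precision, since an 8-bit float can represent far fewer distinct values. Whether that affects recall depends on your data, so it is worth measuring on your own vectors.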

Key Differences Between HNSWlib and Voyager

Search Method

Both HNSWlib and Voyager use the HNSW algorithm, which is known for its speed and accuracy in nearest-neighbor searches. However, Voyager has optimizations that make it faster and more memory efficient than HNSWlib. For example, Voyager’s multithreaded index creation can process large datasets faster, and its optimized memory handling reduces resource consumption, making it more suitable for enterprise applications.

Data

Both tools handle high-dimensional vector data, but Voyager has more features that suit large-scale, cloud-based environments. Voyager’s support for streaming data from Google Cloud Services and its fault-tolerant index files make it more reliable for distributed systems. HNSWlib works well for local setups or smaller applications but lacks the advanced data handling features that make Voyager more versatile in complex environments.

Scalability and Performance

HNSWlib already scales well and works for most use cases. Voyager's multithreading gives it an advantage with larger datasets or in environments that need parallel processing: it can create and query indexes in parallel, which reduces processing time in large-scale systems. Its memory optimizations, such as the 8-bit floating-point representation, also let it handle larger datasets with less memory, making it more resource-efficient than HNSWlib.

Flexibility and Customization

Both libraries are flexible regarding search parameters, but Voyager offers more customization for production use. It supports both Python and Java, so it can be integrated into more environments. Its cloud-oriented features, like Google Cloud integration and fault tolerance, also make it more suitable for modern large-scale applications. HNSWlib is flexible but lacks some of these advanced features, which limits it in environments that need them.

Integration and Ecosystem

HNSWlib is designed to be integrated into Python-based workflows, so it’s good for machine learning pipelines and smaller applications. However, it lacks the broader integration capabilities of Voyager. Voyager supports Python, Java, and Google Cloud integration, so it’s more versatile in enterprise-level deployments. Voyager can handle distributed systems and cloud-based environments, making it a more comprehensive solution for organizations with complex infrastructure needs.

Ease of Use

HNSWlib is easy to use and set up, especially for Python users. It’s a good fit for those who want a simple, no-frills library for ANN search. Voyager has more advanced features but a slightly steeper learning curve because of its multithreading, fault tolerance, and cloud integration options. However, its production-ready design and extensive documentation for both Python and Java make it easier to integrate into larger, complex systems.

Cost

Both are open-source and free to use. However, Voyager’s memory efficiency and multithreading can reduce resource consumption in large-scale or cloud-based deployments, leading to cost savings. The decision here would depend on whether Voyager's extra speed, memory efficiency, and features are worth the added complexity or infrastructure cost.

Security

Neither HNSWlib nor Voyager has built-in security features like encryption or access control, so these must be implemented separately. However, Voyager’s fault-tolerant index files and corruption detection make it more reliable in environments where data integrity is critical.

When to Choose HNSWlib

HNSWlib is a great choice if:

You need a fast and accurate ANN search tool for smaller-scale applications.
Your project is Python-based and does not require Java support.
You are working in a local environment or with smaller datasets where advanced fault tolerance and cloud features are unnecessary.
You want a simple, easy-to-implement solution with minimal overhead.

When to Choose Voyager

Voyager is a better fit if:

You need a production-ready solution with support for both Python and Java.
Your project involves large-scale datasets and requires multithreaded processing for faster index creation and querying.
You need memory efficiency for handling high-dimensional data in resource-constrained environments.
Your infrastructure includes cloud-based environments, and you need features like Google Cloud integration and fault-tolerant index files.
You require advanced customization and a tool that is optimized for enterprise-level deployments.

Ultimately, your choice between HNSWlib and Voyager depends on your specific project requirements. Both tools offer strong performance, but your choice should align with the scale and complexity of your application and the resources and infrastructure you have available.

Comparing Vector Search Libraries and Purpose-built Vector Databases

Both vector search libraries like HNSWlib and Voyager and purpose-built vector databases like Milvus aim to solve the similarity search problem for high-dimensional vector data, but they serve different roles.

Vector search libraries focus solely on the task of efficient nearest neighbor search. They offer lightweight, fast solutions for finding vectors similar to a query vector. They are often used in smaller, single-node environments or for applications with static or moderately sized datasets. However, they generally lack features for managing dynamic data, providing persistence, or scaling across distributed systems. Developers using these libraries typically need to manually handle data management, updates, and scaling.

On the other hand, purpose-built vector databases like Milvus and Zilliz Cloud (the managed Milvus) are comprehensive systems designed for large-scale vector data management. These databases go beyond simple vector search, offering features like persistent storage, real-time updates, distributed architecture, and advanced querying capabilities. They support dynamic datasets and can easily handle real-time applications where data is frequently updated. Additionally, vector databases often include integrated support for combining vector searches with traditional filtering and metadata queries, making them ideal for production environments requiring scalability, high availability, and more complex search functionalities.
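As a rough illustration of combining vector search with metadata filtering, the sketch below filters records by a metadata field first and then ranks the survivors by distance. This is plain Python, not the API of Milvus or any other database; the record layout, the `category` field, and the function name are all hypothetical. A vector database does this at scale with indexes, persistence, and distribution handled for you, which is exactly the work you take on yourself when using a library alone.

```python
import math

def filtered_search(query, records, category, k):
    """Keep only records whose metadata matches `category`, then return the
    ids of the k closest by Euclidean distance."""
    candidates = [r for r in records if r["category"] == category]
    candidates.sort(key=lambda r: math.dist(query, r["vector"]))
    return [r["id"] for r in candidates[:k]]

# Hypothetical product records with a 2-dimensional embedding each.
records = [
    {"id": 1, "category": "shoes", "vector": [0.10, 0.90]},
    {"id": 2, "category": "shoes", "vector": [0.80, 0.20]},
    {"id": 3, "category": "hats", "vector": [0.12, 0.88]},
    {"id": 4, "category": "shoes", "vector": [0.20, 0.70]},
]

result = filtered_search([0.10, 0.85], records, category="shoes", k=2)
print(result)
```

Record 3 is closer to the query than record 4, but it is excluded because its metadata does not match the filter.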


When to Choose Each Vector Search Solution

Choose Vector Search Libraries if:
- You have a small to medium-sized, relatively static dataset.
- You prefer full control over indexing and search algorithms.
- You're embedding search in an existing system and can manage the infrastructure.

Choose Purpose-Built Vector Databases if:
- You need to scale to billions of vectors across distributed systems.
- Your dataset changes frequently, requiring real-time updates.
- You prefer managed solutions that handle storage, scaling, and query optimizations for you.

In summary, vector search libraries are best suited for simpler, smaller-scale use cases where speed and memory efficiency are priorities, but operational complexity is minimal. Purpose-built vector databases, by contrast, are designed for large-scale, production-grade systems that demand dynamic data handling, scalability, and ease of use, often providing significant operational benefits for developers managing complex applications.

Evaluating and Comparing Vector Search Solutions

Now that we've covered how the different vector search solutions differ, two questions follow: how do you ensure your search algorithm returns accurate results at high speed, and how do you evaluate different ANN algorithms, especially at scale?

To answer these questions, we need a benchmarking tool. Many such tools are available, and two stand out: ANN Benchmarks and VectorDBBench.

ANN Benchmarks

ANN Benchmarks (Approximate Nearest Neighbor Benchmarks) is an open-source project designed to evaluate and compare the performance of various approximate nearest neighbor (ANN) algorithms. It provides a standardized framework for benchmarking different algorithms on tasks such as high-dimensional vector search, allowing developers and researchers to measure metrics like search speed, accuracy, and memory usage across various datasets. By using ANN-Benchmarks, you can assess the trade-offs between speed and precision for algorithms like those found in libraries such as Faiss, Annoy, HNSWlib, and others, making it a valuable tool for understanding which algorithms perform best for specific applications.

ANN Benchmarks GitHub repository: https://github.com/erikbern/ann-benchmarks

ANN Benchmarks Website: https://ann-benchmarks.com/
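The accuracy side of such benchmarks is commonly reported as recall: of the true nearest neighbors found by an exact search, what fraction did the approximate search return? Here is a minimal sketch of that metric. The function name and the id lists are made up for illustration, and this is not code from the ANN Benchmarks project.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that also appear in the
    approximate top-k results."""
    true_top_k = set(exact_ids[:k])
    found = set(approx_ids[:k]) & true_top_k
    return len(found) / len(true_top_k)

# Hypothetical results for one query: ids from an exact (brute-force)
# search and from an approximate index.
exact = [7, 2, 9, 4, 1]
approx = [7, 9, 3, 2, 8]

recall = recall_at_k(approx, exact, k=5)
print(recall)
```

In practice, you would average this over many queries and plot it against queries per second, since tuning an index for higher recall usually lowers its throughput.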

VectorDBBench: an open source benchmarking tool

VectorDBBench is an open-source benchmarking tool designed for users who require high-performance data storage and retrieval systems, particularly vector databases. This tool allows users to test and compare the performance of different vector database systems such as Milvus and Zilliz Cloud (the managed Milvus) using their own datasets, and determine the most suitable one for their use cases. VectorDBBench is written in Python and licensed under the MIT open-source license, meaning anyone can freely use, modify, and distribute it.

VectorDBBench GitHub repository: https://github.com/zilliztech/VectorDBBench
Take a quick look at the performance of mainstream vector databases on the VectorDBBench Leaderboard.
Techniques & Insights on VectorDB Evaluation:
- Benchmark Vector Database Performance: Techniques & Insights
- Compare any vector database to an alternative

Updated on Oct 15, 2024

Chloe Williams
Chloe Williams is a technical writer at Zilliz.
