pgvector vs Neo4j: Choosing the Right Vector Database for Your Needs
As AI and data-driven technologies advance, selecting an appropriate vector database for your application is becoming increasingly important. pgvector and Neo4j are two options in this space. This article compares these technologies to help you make an informed decision for your project.
What is a Vector Database?
Before we compare pgvector and Neo4j, let's first explore the concept of vector databases.
A vector database is specifically designed to store and query high-dimensional vectors, which are numerical representations of unstructured data. These vectors encode complex information, such as the semantic meaning of text, the visual features of images, or product attributes. By enabling efficient similarity searches, vector databases play a pivotal role in AI applications, allowing for more advanced data analysis and retrieval.
Common use cases for vector databases include e-commerce product recommendations, content discovery platforms, anomaly detection in cybersecurity, medical image analysis, and natural language processing (NLP) tasks. They also play a crucial role in Retrieval Augmented Generation (RAG), a technique that enhances the performance of large language models (LLMs) by providing external knowledge to reduce issues like AI hallucinations.
There are many types of vector databases available in the market, including:
- Purpose-built vector databases such as Milvus, Zilliz Cloud (fully managed Milvus), and Weaviate.
- Vector search libraries such as Faiss and Annoy.
- Lightweight vector databases such as Chroma and Milvus Lite.
- Traditional databases with vector search add-ons capable of performing small-scale vector searches.
pgvector is an extension for PostgreSQL, a traditional relational database, and Neo4j is a graph database. Both offer vector search as an add-on. This post compares their vector search capabilities.
pgvector: Overview and Core Technology
pgvector is an extension for PostgreSQL that adds support for vector operations. It allows users to store and query vector embeddings directly within their PostgreSQL database, providing vector similarity search capabilities without the need for a separate vector database.
Key features of pgvector include:
- Support for exact and approximate nearest neighbor search
- Integration with PostgreSQL's indexing mechanisms
- Ability to perform vector operations like addition and subtraction
- Support for various distance metrics (Euclidean, cosine, inner product)
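To make the distance metrics concrete, here is a minimal pure-Python sketch of the three measures listed above. The function names are illustrative, not part of pgvector; the comments note which pgvector operator each one corresponds to (`<->`, `<#>`, `<=>`).

```python
import math


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance; corresponds to pgvector's `<->` operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """pgvector's `<#>` returns the NEGATIVE inner product, so smaller means closer."""
    return -inner_product(a, b)


def cosine_distance(a, b):
    """1 - cosine similarity; corresponds to pgvector's `<=>` operator."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)


a = [1.0, 0.0]
b = [0.0, 1.0]
print(l2_distance(a, b))
print(cosine_distance(a, b))
print(negative_inner_product(a, a))
```

All three are written as distances (smaller is closer), which is why the inner product is negated: it lets every metric be used with an ascending ORDER BY.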
pgvector, by default, employs exact nearest neighbor search, which guarantees perfect recall but can be slower for large datasets. To optimize performance, pgvector offers the option to create indexes for approximate nearest neighbor search. This approach trades some accuracy for significantly improved speed, which is often a worthwhile tradeoff in many real-world applications.
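As a rough sketch of what the default exact search looks like in practice, the snippet below holds the SQL you might send from Python. The `items` table, its `embedding` column, and the 3-dimensional vector size are hypothetical, and the `%s` placeholders assume a psycopg-style driver; no database connection is opened here.

```python
def to_vector_literal(values):
    """Format a Python sequence as a pgvector text literal such as '[1.0,2.0,3.0]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


# One-time setup: enable the extension and create a table with a vector column.
# Table and column names are made up for illustration.
SETUP_SQL = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    "CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3))",
]

# With no index on `embedding`, this ORDER BY scans every row,
# which is the exact (perfect-recall) search described above.
EXACT_KNN_SQL = "SELECT id FROM items ORDER BY embedding <-> %s::vector LIMIT %s"


def exact_knn_params(query_vector, k):
    """Build the parameter tuple for EXACT_KNN_SQL."""
    return (to_vector_literal(query_vector), k)


print(EXACT_KNN_SQL)
print(exact_knn_params([0.5, 1, 2], 5))
```

With a driver such as psycopg, you would pass `EXACT_KNN_SQL` and the parameter tuple to `cursor.execute`.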
It's important to note that adding an approximate index can change the results of your queries. This is different from typical database indexes, which don't affect the actual results returned. The two types of approximate indexes supported by pgvector are:
- HNSW (Hierarchical Navigable Small World): Introduced in pgvector version 0.5.0, HNSW is known for its high performance and quality of results. It builds a multi-layer graph structure that allows for fast traversal during searches.
- IVFFlat (Inverted File Flat): This method divides the vector space into clusters. During a search, it first identifies the most relevant clusters and then performs an exact search within those clusters. This can significantly speed up searches in large datasets.
The choice between these index types depends on your use case: dataset size, required query speed, and how much accuracy you can trade away. HNSW generally offers a better speed-recall tradeoff at query time but takes longer to build and uses more memory, while IVFFlat builds faster and uses less memory but tends to have lower query performance.
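The cluster-then-probe idea behind IVFFlat can be illustrated with a toy in-memory version. This is a simplified sketch, not pgvector's implementation: it picks random vectors as centroids instead of running k-means, and all names are hypothetical. It shows how probing more lists raises recall toward that of exact search.

```python
import math
import random


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_search(vectors, query, k):
    """Brute-force k nearest neighbors; returns row indexes sorted by distance."""
    return sorted(range(len(vectors)), key=lambda i: l2(vectors[i], query))[:k]


def build_ivf(vectors, n_lists, seed=0):
    """Toy inverted-file index: random centroids, each vector filed under its nearest one."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_lists)
    lists = [[] for _ in range(n_lists)]
    for i, v in enumerate(vectors):
        nearest = min(range(n_lists), key=lambda c: l2(centroids[c], v))
        lists[nearest].append(i)
    return centroids, lists


def ivf_search(vectors, index, query, k, probes):
    """Scan only the `probes` lists whose centroids are closest to the query."""
    centroids, lists = index
    order = sorted(range(len(centroids)), key=lambda c: l2(centroids[c], query))
    candidates = [i for c in order[:probes] for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(vectors[i], query))[:k]


rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
query = [rng.random() for _ in range(8)]
index = build_ivf(vectors, n_lists=10)

truth = exact_search(vectors, query, k=10)
for probes in (1, 3, 10):
    found = ivf_search(vectors, index, query, k=10, probes=probes)
    recall = len(set(found) & set(truth)) / len(truth)
    print(f"probes={probes}: recall={recall:.2f}")
```

When `probes` equals the number of lists, every vector is scanned and the result matches exact search; smaller values scan less data and may miss neighbors that fell into an unprobed cluster.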
When implementing pgvector in your project, experiment with both index types and their parameters (for example, m and ef_construction for HNSW, or the number of lists and probes for IVFFlat) to find the best configuration for your needs. These settings directly affect both the speed and the recall of your vector searches.
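One way to organize such experiments is to generate the index and session statements for each candidate configuration. The helpers below are a hypothetical sketch that only builds SQL strings; the `items` table and `embedding` column are assumptions. The defaults shown (m = 16, ef_construction = 64) are pgvector's documented defaults for HNSW.

```python
def hnsw_index_sql(table, column, m=16, ef_construction=64, opclass="vector_l2_ops"):
    """Build-time HNSW settings. Identifiers are interpolated, so pass trusted names only."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)})"
    )


def ivfflat_index_sql(table, column, lists=100, opclass="vector_l2_ops"):
    """Build-time IVFFlat setting: the number of clusters (lists)."""
    return (
        f"CREATE INDEX ON {table} USING ivfflat ({column} {opclass}) "
        f"WITH (lists = {int(lists)})"
    )


def hnsw_ef_search_sql(ef_search):
    """Query-time HNSW knob: size of the candidate list during search."""
    return f"SET hnsw.ef_search = {int(ef_search)}"


def ivfflat_probes_sql(probes):
    """Query-time IVFFlat knob: how many lists to scan per query."""
    return f"SET ivfflat.probes = {int(probes)}"


# A small, hypothetical parameter sweep for HNSW.
for m in (8, 16, 32):
    print(hnsw_index_sql("items", "embedding", m=m))
print(ivfflat_index_sql("items", "embedding", lists=200, opclass="vector_cosine_ops"))
print(hnsw_ef_search_sql(100))
print(ivfflat_probes_sql(10))
```

The operator class must match the distance operator used in queries (`vector_l2_ops` for `<->`, `vector_ip_ops` for `<#>`, `vector_cosine_ops` for `<=>`); otherwise the planner will not use the index.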
Neo4j: The Basics
Neo4j’s vector search allows developers to create vector indexes to search for similar data across their graph. These indexes work with node properties that contain vector embeddings - numerical representations of data like text, images or audio that capture the meaning of the data. The system supports vectors up to 4096 dimensions and cosine and Euclidean similarity functions.
The implementation uses Hierarchical Navigable Small World (HNSW) graphs to perform fast approximate k-nearest neighbor searches. When querying a vector index, you specify how many neighbors you want to retrieve, and the system returns matching nodes ordered by similarity score. Scores range from 0 to 1, with higher values indicating greater similarity. HNSW works well because it keeps links between similar vectors, which lets a search jump quickly between regions of the vector space and then refine locally.
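To see how a similarity score can land in the 0 to 1 range, here is a small sketch. It assumes the normalization described in Neo4j's vector index documentation as I understand it: cosine similarity is shifted and scaled to (1 + cos) / 2, and Euclidean distance d is mapped to 1 / (1 + d²). Treat both formulas as assumptions to verify against the documentation for your Neo4j version.

```python
import math


def cosine_score(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 1] (assumed normalization)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return (1.0 + dot / (norm_a * norm_b)) / 2.0


def euclidean_score(a, b):
    """Map squared Euclidean distance from [0, inf) onto (0, 1] (assumed normalization)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)


print(cosine_score([1.0, 0.0], [1.0, 0.0]))
print(cosine_score([1.0, 0.0], [-1.0, 0.0]))
print(euclidean_score([0.0, 0.0], [1.0, 0.0]))
```

Under these formulas, identical vectors score 1 with either function, and the score falls toward 0 as vectors point in opposite directions (cosine) or move far apart (Euclidean).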
Vector indexes are created and used through Neo4j's query language, Cypher. You create an index with the CREATE VECTOR INDEX command and specify parameters such as the vector dimensions and the similarity function. The system validates that only vectors of the configured dimensions are indexed. To query an index, call the db.index.vector.queryNodes procedure, which takes an index name, the number of results, and a query vector as input.
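The two steps described above might look like the following when issued from Python. This sketch only builds the Cypher text; the index name `moviePlots`, the `Movie` label, and the `embedding` property are hypothetical, and running the statements would require a Neo4j driver session, which is not shown.

```python
def create_vector_index_cypher(index_name, label, prop, dimensions, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement for a node property."""
    if not 1 <= dimensions <= 4096:
        raise ValueError("Neo4j vector indexes support 1 to 4096 dimensions")
    if similarity not in ("cosine", "euclidean"):
        raise ValueError("similarity must be 'cosine' or 'euclidean'")
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON n.{prop}\n"
        "OPTIONS { indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )


# Arguments are passed as query parameters: index name, neighbor count, query vector.
QUERY_NODES_CYPHER = (
    "CALL db.index.vector.queryNodes($index, $k, $queryVector) "
    "YIELD node, score "
    "RETURN node, score"
)

print(create_vector_index_cypher("moviePlots", "Movie", "embedding", 1536))
print(QUERY_NODES_CYPHER)
```

The dimension check mirrors the 4096-dimension limit mentioned earlier, so a misconfigured embedding model fails early instead of at index creation time.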
Neo4j’s vector indexing has performance optimizations like quantization which reduces memory usage by compressing the vector representations. You can tune the index behavior with parameters like max connections per node (M) and number of nearest neighbors tracked during insertion (ef_construction). While these parameters allow you to balance between accuracy and performance, the defaults work well for most use cases. The system also supports relationship vector indexes from version 5.18, so you can search for similar data on relationship properties.
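To give a feel for why quantization saves memory, here is a generic scalar quantization sketch that stores each component in one byte instead of a 4-byte float. It illustrates the general technique only and is not Neo4j's internal implementation; the value range of -1 to 1 is an assumption that suits normalized embeddings.

```python
def quantize(vector, lo=-1.0, hi=1.0):
    """Map each float in [lo, hi] to an integer code in 0..255 (one byte per component)."""
    scale = 255.0 / (hi - lo)
    return bytes(min(255, max(0, round((x - lo) * scale))) for x in vector)


def dequantize(codes, lo=-1.0, hi=1.0):
    """Recover approximate floats from the byte codes."""
    step = (hi - lo) / 255.0
    return [lo + c * step for c in codes]


vector = [0.12, -0.57, 0.99, -1.0]
codes = quantize(vector)
restored = dequantize(codes)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(len(codes), "bytes instead of", 4 * len(vector), "for float32")
print("max reconstruction error:", max_error)
```

The reconstruction error is bounded by half a quantization step, which is usually small relative to the distances between embeddings; that is why search quality can stay close to the unquantized index.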
This allows developers to build AI-powered applications. By combining graph queries with vector similarity search, applications can find related data based on semantic meaning rather than exact matches. For example, a movie recommendation system could use plot embedding vectors to find similar movies while using the graph structure to ensure the recommendations come from the genre or era the user prefers.
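The movie example could be sketched as a single Cypher query that starts from the vector index and then filters through the graph. Everything about the data model here is hypothetical: the `moviePlots` index, the `IN_GENRE` relationship, the `Genre` label, and the `title` property. The snippet only prepares the query text and parameters.

```python
HYBRID_CYPHER = """
CALL db.index.vector.queryNodes($index, $candidates, $queryVector)
YIELD node AS movie, score
MATCH (movie)-[:IN_GENRE]->(:Genre {name: $genre})
RETURN movie.title AS title, score
ORDER BY score DESC
LIMIT $k
"""


def hybrid_params(query_vector, genre, k=5, oversample=4, index="moviePlots"):
    """Build query parameters, requesting extra candidates to survive the genre filter."""
    return {
        "index": index,
        "candidates": k * oversample,
        "queryVector": list(query_vector),
        "genre": genre,
        "k": k,
    }


print(HYBRID_CYPHER)
print(hybrid_params([0.1, 0.2, 0.3], "Sci-Fi"))
```

Because the genre filter runs after the vector search, the query asks the index for more candidates than it finally returns; the oversampling factor of 4 is an arbitrary starting point that you would tune to how selective the filter is.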
Key Differences
Search Methodology
pgvector runs vector operations directly within PostgreSQL and supports both exact and approximate nearest neighbor (ANN) search:
- Exact search: perfect recall, suitable for smaller datasets or when accuracy is the top priority.
- Approximate search: HNSW and IVFFlat indexes give faster queries in exchange for some accuracy.
Neo4j uses an HNSW index for approximate k-nearest neighbor search inside a graph database. The vector index is separate from your data graph, but its results are ordinary nodes or relationships, so a query can continue straight into graph traversals. This matters for applications where the relationships between entities are as important as their similarity.
Both support cosine and Euclidean metrics (pgvector also supports inner product), but Neo4j's graph relationships add a layer of complexity, and capability, for hybrid graph + vector search scenarios.
Data Handling
- pgvector is a good fit where structured and semi-structured data is already handled natively by PostgreSQL. You can store vectors alongside relational data in the same database, which avoids running and syncing a separate system.
- Neo4j is optimized for graph data, so it is a better fit if your data is naturally a network (e.g., a social network or recommendation system). It can combine graph queries with vector search for semantic retrieval within a graph context.
If you are embedding search within structured tabular data, pgvector will feel more natural. For graph-connected data, Neo4j has the edge.
Scalability and Performance
pgvector relies on PostgreSQL's scaling mechanisms, which may require partitioning or external sharding for very large datasets. Performance tuning involves experimenting with index types, index parameters, and PostgreSQL configuration.
Neo4j supports clustered deployments for distributing graph storage and query execution. Its HNSW-based vector indexes support quantization, which reduces memory usage while maintaining good performance.
If your workload scales heavily or benefits from a distributed architecture, Neo4j might handle growth better, especially for graph-centric data.
Flexibility and Customization
pgvector integrates directly with PostgreSQL's indexing and querying mechanisms and allows custom vector operations (e.g., addition and subtraction). It's a good choice for applications that need close control over indexing strategy.
Neo4j provides customization through its query language, Cypher, and supports vector search on both nodes and relationships, which allows creative data models for AI-powered applications. However, Cypher comes with a learning curve for developers who are not familiar with graph databases.
For traditional data models, pgvector feels more natural, while Neo4j shines in graph-first architectures.
Integration and Ecosystem
- pgvector fits neatly into PostgreSQL's ecosystem and works with existing ORMs and analytics platforms.
- Neo4j integrates well with graph-based tools and frameworks. Its ecosystem includes drivers for languages like Python, tools like Neo4j Bloom, and AI/ML workflows.
Your choice depends on whether your stack revolves around relational or graph data tools.
Ease of Use
pgvector is easy for PostgreSQL users to adopt and requires minimal changes to existing workflows. It's a simple step for teams already familiar with relational databases.
Neo4j has a steeper learning curve for teams without graph database experience, but its rich documentation and community resources help developers get up to speed.
If simplicity is the priority, pgvector is easier to get started with.
Cost
- pgvector is open source and benefits from PostgreSQL's open-source model. Cost depends largely on the infrastructure you deploy it on.
- Neo4j has a more complex cost structure, especially for its enterprise and cloud-managed offerings. Its advanced features might justify the cost for graph-heavy use cases.
If budget is a constraint, pgvector is more cost-effective unless Neo4j's features are a must-have.
Security
Both systems have robust security options, but the implementations differ:
- pgvector inherits PostgreSQL's mature security features, including role-based access control, SSL, and data encryption.
- Neo4j has security features designed for graph data, including role-based and fine-grained access control and encryption.
Your choice depends on whether you need fine-tuned security for graph data or prefer to rely on PostgreSQL's security model.
When to use pgvector
pgvector suits teams already using PostgreSQL, or working with structured and semi-structured data, for whom vector embeddings are a new requirement. It's a good fit for applications that need simple integration with relational data, such as e-commerce recommendations, document similarity search, or AI-enhanced analytics. pgvector supports both exact and approximate search, but because it is so tightly coupled with PostgreSQL, it works best for smaller datasets or scenarios where the entire application can run within a single database.
When to use Neo4j
Neo4j is the better choice when your data is naturally connected, as in social networks, recommendation systems, or knowledge graphs. Its ability to combine graph queries with vector search unlocks hybrid use cases like finding semantically similar items within specific graph constraints. If you have large-scale distributed graph data or need advanced optimizations for graph traversals combined with vector operations, Neo4j is the way to go.
Conclusion
pgvector offers simplicity and seamless integration with PostgreSQL for structured and semi-structured data, while Neo4j offers more flexibility to combine graph data with vector search. The choice ultimately depends on your use case: pgvector is good for straightforward relational database scenarios, and Neo4j is good for graph-centric applications. Evaluate your data types, workload complexity, and scaling needs to see which tool fits your goals.
This article gives an overview of pgvector and Neo4j, but the right choice depends on evaluating both against your own use case. One tool that can help is VectorDBBench, an open-source benchmarking tool for vector database comparison. In the end, thorough benchmarking with your own datasets and query patterns will be key to deciding between these two powerful but different approaches to vector search.
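If you want a quick first measurement before reaching for a full benchmarking tool, the two numbers that matter most are recall against exact search and query latency. The sketch below is a minimal, hypothetical harness: `search_fn` stands in for whatever function runs a query against pgvector or Neo4j in your setup.

```python
import time


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k results that the approximate search also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


def time_queries(search_fn, queries, k):
    """Run search_fn(query, k) for each query; return (results, mean latency in seconds)."""
    results = []
    start = time.perf_counter()
    for query in queries:
        results.append(search_fn(query, k))
    elapsed = time.perf_counter() - start
    return results, elapsed / len(queries)


# Stand-in search function for demonstration: always returns the same ids.
def fake_search(query, k):
    return [1, 2, 3, 9][:k]


results, mean_latency = time_queries(fake_search, [[0.0], [1.0]], k=4)
print("recall@4:", recall_at_k(results[0], [1, 2, 3, 4], 4))
print("mean latency (s):", mean_latency)
```

Run the same queries once with an exact search to get the ground truth, then with each index configuration, and compare recall and latency side by side.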
Using Open-source VectorDBBench to Evaluate and Compare Vector Databases on Your Own
VectorDBBench is an open-source benchmarking tool for users who need high-performance data storage and retrieval systems, especially vector databases. This tool allows users to test and compare different vector database systems like Milvus and Zilliz Cloud (the managed Milvus) using their own datasets and find the one that fits their use cases. With VectorDBBench, users can make decisions based on actual vector database performance rather than marketing claims or hearsay.
VectorDBBench is written in Python and licensed under the MIT open-source license, meaning anyone can freely use, modify, and distribute it. The tool is actively maintained by a community of developers committed to improving its features and performance.
Download VectorDBBench from its GitHub repository to reproduce our benchmark results or obtain performance results on your own datasets.
Take a quick look at the performance of mainstream vector databases on the VectorDBBench Leaderboard.