pgvector vs ClickHouse: Choosing the Right Vector Database for Your AI Apps
What is a Vector Database?
Before we compare pgvector and ClickHouse, let's first explore the concept of vector databases.
A vector database is specifically designed to store and query high-dimensional vectors, which are numerical representations of unstructured data. These vectors encode complex information, such as the semantic meaning of text, the visual features of images, or product attributes. By enabling efficient similarity searches, vector databases play a pivotal role in AI applications, allowing for more advanced data analysis and retrieval.
Common use cases for vector databases include e-commerce product recommendations, content discovery platforms, anomaly detection in cybersecurity, medical image analysis, and natural language processing (NLP) tasks. They also play a crucial role in Retrieval Augmented Generation (RAG), a technique that enhances the performance of large language models (LLMs) by providing external knowledge to reduce issues like AI hallucinations.
There are many types of vector databases available in the market, including:
- Purpose-built vector databases such as Milvus and Zilliz Cloud (fully managed Milvus).
- Vector search libraries such as Faiss and Annoy.
- Lightweight vector databases such as Chroma and Milvus Lite.
- Traditional databases with vector search add-ons capable of performing small-scale vector searches.
pgvector is an extension for PostgreSQL, a traditional relational database, and ClickHouse is an open-source column-oriented database. Both offer vector search as an add-on capability. This post compares their vector search capabilities.
pgvector: Overview and Core Technology
pgvector is an extension for PostgreSQL that adds support for vector operations. It allows users to store and query vector embeddings directly within their PostgreSQL database, providing vector similarity search capabilities without the need for a separate vector database.
Key features of pgvector include:
- Support for exact and approximate nearest neighbor search
- Integration with PostgreSQL's indexing mechanisms
- Ability to perform vector operations like addition and subtraction
- Support for various distance metrics (Euclidean, cosine, inner product)
pgvector, by default, employs exact nearest neighbor search, which guarantees perfect recall but can be slower for large datasets. To optimize performance, pgvector offers the option to create indexes for approximate nearest neighbor search. This approach trades some accuracy for significantly improved speed, which is often a worthwhile tradeoff in many real-world applications.
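To make the idea of exact search concrete, here is a minimal pure-Python sketch of a brute-force nearest neighbor scan using the three distance metrics listed above. It is not pgvector's implementation; the function names and the sample vectors are made up for illustration. The inner-product variant is negated so that, as with the other two metrics, a smaller score means a closer match.

```python
import math

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def negative_inner_product(a, b):
    """Negated inner product, so that smaller still means closer."""
    return -inner_product(a, b)

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)

def exact_knn(query, vectors, k, distance=l2_distance):
    """Brute-force search: score every stored vector, keep the k closest ids."""
    scored = sorted((distance(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in scored[:k]]

# Illustrative data only.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [5.0, 0.5]]
query = [1.0, 0.0]

nearest_l2 = exact_knn(query, vectors, k=2)
nearest_cos = exact_knn(query, vectors, k=2, distance=cosine_distance)
nearest_ip = exact_knn(query, vectors, k=2, distance=negative_inner_product)
```

Every stored vector is scored on every query, which is why recall is perfect and why the cost grows linearly with the dataset. The three result lists also differ on this data, a reminder that the choice of metric changes which neighbors count as "nearest".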
It's important to note that adding an approximate index can change the results of your queries. This is different from typical database indexes, which don't affect the actual results returned. The two types of approximate indexes supported by pgvector are:
- HNSW (Hierarchical Navigable Small World): Introduced in pgvector version 0.5.0, HNSW is known for its high performance and quality of results. It builds a multi-layer graph structure that allows for fast traversal during searches.
- IVFFlat (Inverted File Flat): This method divides the vector space into clusters. During a search, it first identifies the most relevant clusters and then performs an exact search within those clusters. This can significantly speed up searches in large datasets.
The choice between these index types depends on your specific use case, considering factors like dataset size, required query speed, and acceptable trade-off in accuracy. HNSW generally offers better performance but may use more memory, while IVFFlat can be more memory-efficient but might be slightly slower or less accurate in some cases.
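The accuracy trade-off of a cluster-based index such as IVFFlat can be seen in a toy version. The sketch below is a simplified illustration in pure Python, not pgvector's code: the centroids are hand-picked here (a real index would typically learn them with k-means), and the names `build_ivf`, `ivf_search`, and `probes` are chosen for this example.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, probes):
    """Scan only the `probes` closest lists exactly and return the top k ids."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in lists[c]]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:k]

# Illustrative data: hand-picked centroids and five 2-D vectors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.0, 1.0], [1.0, 0.0], [4.9, 4.9], [6.0, 6.0], [10.0, 9.0]]
lists = build_ivf(vectors, centroids)

query = [5.6, 5.6]
one_probe = ivf_search(query, vectors, centroids, lists, k=2, probes=1)
two_probes = ivf_search(query, vectors, centroids, lists, k=2, probes=2)
```

With one probe, the search misses vector 2, which is the true second-nearest neighbor but sits just across the cluster boundary. Probing both lists recovers it, at the cost of scanning more vectors.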
When implementing pgvector in your project, experiment with both index types and their parameters to find the best configuration for your needs. This tuning can significantly affect both the speed and the accuracy of your vector searches.
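As a starting point for that experimentation, the sketch below generates the SQL statements you would typically vary. It only builds strings and runs nothing against a database. The table name `items`, the column `embedding`, and the helper function names are hypothetical. The index options (`m`, `ef_construction`, `lists`), the session settings (`hnsw.ef_search`, `ivfflat.probes`), and the `<->` operator follow pgvector's documented syntax as I understand it, so verify them against the version you have installed.

```python
def hnsw_index_sql(table, column, m=16, ef_construction=64, opclass="vector_l2_ops"):
    """CREATE INDEX statement for an HNSW index (identifiers must be trusted)."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)});"
    )

def ivfflat_index_sql(table, column, lists=100, opclass="vector_l2_ops"):
    """CREATE INDEX statement for an IVFFlat index."""
    return (
        f"CREATE INDEX ON {table} USING ivfflat ({column} {opclass}) "
        f"WITH (lists = {int(lists)});"
    )

def search_setting_sql(index_type, value):
    """Query-time knob: larger values usually raise recall and latency."""
    names = {"hnsw": "hnsw.ef_search", "ivfflat": "ivfflat.probes"}
    return f"SET {names[index_type]} = {int(value)};"

def knn_query_sql(table, column, k, operator="<->"):
    """Nearest neighbor query; %s is a driver-style placeholder for the vector."""
    return f"SELECT id FROM {table} ORDER BY {column} {operator} %s LIMIT {int(k)};"

# A small grid of candidate configurations to benchmark (values are illustrative).
candidates = [hnsw_index_sql("items", "embedding", m=m) for m in (8, 16, 32)]
candidates += [ivfflat_index_sql("items", "embedding", lists=n) for n in (50, 100)]
```

Swapping `vector_l2_ops` and `<->` for `vector_cosine_ops` and `<=>`, or for `vector_ip_ops` and `<#>`, switches the metric. The index operator class and the query operator need to match for the index to be used.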
Wanna learn how to get started using pgvector? Check out this tutorial!
ClickHouse: Overview and Core Technology
ClickHouse is an open-source OLAP database for real-time analytics with full SQL support and fast query processing. Its fully parallelized query pipeline makes it well suited to analytical queries and lets it run vector searches quickly. High compression, customizable through codecs, lets it store and query large datasets. One of its main advantages is that it can handle multi-terabyte datasets without being memory-bound, which makes it a strong option for users with large volumes of vector data. It also supports filtering and aggregation on metadata, so you can query vectors together with their metadata.
ClickHouse exposes vector search through SQL, where vector distance operations work like any other SQL function. This means you can combine them with traditional filtering and aggregation, which suits use cases that query vector data alongside metadata or other information. For faster but approximate matching, ClickHouse offers experimental Approximate Nearest Neighbor (ANN) indexes. For exact matching, it performs a linear scan over rows, parallelized for speed and efficiency.
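Because the distance is just another SQL expression, a filtered vector search is an ordinary `SELECT`. The sketch below builds such a query as a string in Python without connecting to a server. The table `products`, the columns `id`, `embedding`, and `category`, and the helper names are hypothetical. `L2Distance` is one of ClickHouse's built-in distance functions (`cosineDistance` is another), but check the function list for your version.

```python
def format_vector(vector):
    """Render a Python sequence as a ClickHouse array literal."""
    return "[" + ", ".join(repr(float(x)) for x in vector) + "]"

def clickhouse_knn_sql(table, column, query_vector, k, where=None,
                       distance_fn="L2Distance"):
    """Exact k-nearest-neighbor query with an optional metadata filter.

    `table`, `column` and `where` are interpolated as-is, so they must come
    from trusted code, never from user input.
    """
    where_clause = f"WHERE {where} " if where else ""
    return (
        f"SELECT id, {distance_fn}({column}, {format_vector(query_vector)}) AS dist "
        f"FROM {table} {where_clause}ORDER BY dist ASC LIMIT {int(k)}"
    )

filtered = clickhouse_knn_sql(
    "products", "embedding", [0.1, 0.2, 0.3], k=5, where="category = 'shoes'"
)
unfiltered = clickhouse_knn_sql(
    "products", "embedding", [0.1, 0.2, 0.3], k=5, distance_fn="cosineDistance"
)
```

The `WHERE` clause narrows the rows before they are ranked by distance, which is the combination of metadata filtering and vector matching described above.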
ClickHouse is a good fit for vector search when you need to combine vector matching with metadata filtering or aggregation, especially for very large vector datasets that must be processed in parallel across multiple CPU cores. It also works well when you need SQL support and your vector dataset is too large for memory-only indexes. If you already keep related data in ClickHouse, or don’t want to learn another tool to manage millions of vectors, it can save you time and resources. Its strengths are fast, parallelized exact matching and the handling of large datasets, which makes it most useful to advanced search users.
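The general pattern behind a parallel exact scan is to split the rows into chunks, find the best matches in each chunk independently, and then merge the partial results. The sketch below illustrates that pattern in Python. It is not how ClickHouse is implemented, and because of Python's global interpreter lock the threads here show the structure rather than deliver a real multi-core speed-up. All names and data are invented for the example.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scan_chunk(query, chunk, k):
    """Exact scan of one chunk of (row_id, vector) pairs; returns its local top k."""
    return heapq.nsmallest(k, ((l2(query, vec), row_id) for row_id, vec in chunk))

def parallel_exact_knn(query, rows, k, workers=4):
    """Split rows into chunks, scan each chunk separately, merge the partial top-k lists."""
    size = max(1, math.ceil(len(rows) / workers))
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: scan_chunk(query, chunk, k), chunks)
        merged = heapq.nsmallest(k, (hit for part in partials for hit in part))
    return [row_id for _, row_id in merged]

# Illustrative data: ten 2-D vectors lying along the x-axis.
rows = [(i, [float(i), 0.0]) for i in range(10)]
top3 = parallel_exact_knn([3.2, 0.0], rows, k=3)
```

Each chunk only needs to return its own top k, so the merge step stays small no matter how many rows were scanned. The result is identical to a single-threaded exact scan.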
Overall, ClickHouse is a general-purpose platform for vector search that fits large datasets needing parallel processing and workloads that combine vector search with SQL-based filtering and aggregation. It is not as strong as specialized vector databases for small, memory-bound datasets or high-QPS scenarios, but it handles complex queries that include metadata, which makes it a good choice for developers who know SQL and need fast vector search.
pgvector vs ClickHouse for Vector Search: What’s the Difference?
When choosing between pgvector and ClickHouse for vector search, consider the following factors:
Search Methodology
pgvector supports both exact and approximate nearest neighbor search. It provides HNSW and IVFFlat indexes for approximate search and supports several distance metrics (Euclidean, cosine, inner product). ClickHouse performs vector search through SQL functions, offering exact matching with parallel processing and experimental ANN indexes.
pgvector extends PostgreSQL with vector operations and stores vector embeddings directly in the PostgreSQL database. ClickHouse handles structured and semi-structured data and combines vector search with metadata filtering and aggregation.
Scalability and Performance
pgvector’s exact search slows down as datasets grow, while its approximate indexes keep queries fast on large datasets at some cost in accuracy. ClickHouse is designed for multi-terabyte datasets and has a fully parallelized query pipeline. It is not memory-bound, so it can handle large volumes of vector data.
Flexibility and Customization
pgvector integrates with PostgreSQL’s existing features and supports vector operations such as addition and subtraction. ClickHouse has full SQL support and can combine vector matching with regular SQL operations.
Integration and Ecosystem
pgvector integrates with the PostgreSQL ecosystem, which makes it a good fit for projects already using PostgreSQL. ClickHouse is a standalone OLAP database, which suits projects that need real-time analytics alongside vector search.
Ease of Use
pgvector is familiar if you already use PostgreSQL, but it requires an understanding of vector operations and indexing options. ClickHouse uses SQL syntax for vector operations but has a steeper learning curve if you’re new to OLAP databases.
Cost
pgvector can run on existing PostgreSQL infrastructure, which saves costs if you already use PostgreSQL. ClickHouse requires separate infrastructure but can reduce storage costs thanks to its high compression.
Security
pgvector inherits PostgreSQL’s security features and benefits from the PostgreSQL security ecosystem. ClickHouse has built-in security features but requires additional configuration for enterprise-level security.
When to Use Each
pgvector is the way to go when you’re already using PostgreSQL and want to add vector search to your existing relational database setup. It’s well suited to projects that need to integrate vector operations with relational data, especially with moderately sized datasets. pgvector also works well when you need precise control over vector operations and want to leverage the PostgreSQL ecosystem and features.
ClickHouse is the better choice for very large vector datasets, especially when you need to combine vector search with complex SQL and real-time analytics. It’s best for projects with multi-terabyte datasets that need high-performance analytical processing and vector search. ClickHouse is especially useful when you need to search vectors with extensive metadata filtering and aggregation.
Summary
pgvector is a strong choice for vector search within PostgreSQL: it is familiar territory for PostgreSQL users and integrates seamlessly with relational data. ClickHouse is a strong choice for very large datasets that need high-performance vector search together with analytical capabilities. Choose between them based on your use case, data size, existing infrastructure, and whether you need to combine vector search with complex queries. In short, use pgvector for PostgreSQL-based projects with moderate data volumes and ClickHouse for large analytical workloads with vector search.
While this article provides an overview of pgvector and ClickHouse, it's important to evaluate these databases against your specific use case. One tool that can assist in this process is VectorDBBench, an open-source benchmarking tool designed for comparing vector database performance. Ultimately, thorough benchmarking with your own datasets and query patterns is essential for making an informed decision between these two powerful, yet distinct, approaches to vector search.
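Whatever tool you benchmark with, one number worth tracking alongside latency is recall: the share of the true nearest neighbors that an approximate search actually returns. The helper below is a minimal sketch of that calculation. It is independent of VectorDBBench, and the function names and sample ids are invented for illustration. In practice, the "exact" ids would come from a brute-force search over the same data.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth.intersection(approx_ids[:k])) / len(truth)

def mean_recall_at_k(approx_results, exact_results, k):
    """Average recall@k over a batch of queries (one id list per query)."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores) if scores else 0.0

# Illustrative ids for two queries: ground truth vs. an approximate index.
exact_results = [[3, 2, 7], [1, 4, 9]]
approx_results = [[3, 7, 5], [1, 4, 9]]

per_query = [recall_at_k(a, e, 3) for a, e in zip(approx_results, exact_results)]
overall = mean_recall_at_k(approx_results, exact_results, 3)
```

Measuring recall and query time together for each index configuration shows where extra speed starts to cost more accuracy than your application can tolerate.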
Using Open-source VectorDBBench to Evaluate and Compare Vector Databases on Your Own
VectorDBBench is an open-source benchmarking tool designed for users who require high-performance data storage and retrieval systems, particularly vector databases. This tool allows users to test and compare the performance of different vector database systems such as Milvus and Zilliz Cloud (the managed Milvus) using their own datasets and determine the most suitable one for their use cases. Using VectorDBBench, users can make informed decisions based on the actual vector database performance rather than relying on marketing claims or anecdotal evidence.
VectorDBBench is written in Python and licensed under the MIT open-source license, meaning anyone can freely use, modify, and distribute it. The tool is actively maintained by a community of developers committed to improving its features and performance.
Download VectorDBBench from its GitHub repository to reproduce our benchmark results or obtain performance results on your own datasets.
Take a quick look at the performance of mainstream vector databases on the VectorDBBench Leaderboard.