pgvector vs MyScale: Choosing the Right Vector Database for Your Needs
As AI and data-driven technologies advance, selecting an appropriate vector database for your application is becoming increasingly important. pgvector and MyScale are two options in this space. This article compares these technologies to help you make an informed decision for your project.
What is a Vector Database?
Before we compare pgvector and MyScale, let's first explore the concept of vector databases.
A vector database is specifically designed to store and query high-dimensional vectors, which are numerical representations of unstructured data. These vectors encode complex information, such as the semantic meaning of text, the visual features of images, or product attributes. By enabling efficient similarity searches, vector databases play a pivotal role in AI applications, allowing for more advanced data analysis and retrieval.
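To make "similarity search" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and item names are made up for illustration (real embeddings typically have hundreds of dimensions and come from a model), and the brute-force loop stands in for what a vector database does at much larger scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, items, k=2):
    """Return the names of the k items most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in items.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical embeddings, chosen by hand for this example.
catalog = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.3, 0.1],
    "coffee maker": [0.0, 0.2, 0.9],
}

nearest = top_k([1.0, 0.2, 0.0], catalog, k=2)
print(nearest)
```

The two shoe items rank ahead of the coffee maker because their vectors point in nearly the same direction as the query.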
Common use cases for vector databases include e-commerce product recommendations, content discovery platforms, anomaly detection in cybersecurity, medical image analysis, and natural language processing (NLP) tasks. They also play a crucial role in Retrieval Augmented Generation (RAG), a technique that enhances the performance of large language models (LLMs) by providing external knowledge to reduce issues like AI hallucinations.
There are many types of vector databases available in the market, including:
- Purpose-built vector databases such as Milvus, Zilliz Cloud (fully managed Milvus), and Weaviate.
- Vector search libraries such as Faiss and Annoy.
- Lightweight vector databases such as Chroma and Milvus Lite.
- Traditional databases with vector search add-ons capable of performing small-scale vector searches.
pgvector is a vector search add-on for a traditional database, PostgreSQL. MyScale is a database built on ClickHouse that combines SQL analytics with vector search. This post compares their vector search capabilities.
pgvector: Overview and Core Technology
pgvector is an extension for PostgreSQL that adds support for vector operations. It allows users to store and query vector embeddings directly within their PostgreSQL database, providing vector similarity search capabilities without the need for a separate vector database.
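As a rough sketch of what this looks like in practice, the Python snippet below only builds the SQL statements and the vector parameter; it does not connect to a database. The table name `items`, the 3-dimensional column, and the `%s` placeholder style (as used by drivers such as psycopg) are assumptions for illustration.

```python
def to_vector_literal(values):
    """Format numbers in pgvector's text form, e.g. '[1.0,2.0,3.0]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

# Illustrative schema: the table and column names are not from the article.
SETUP_SQL = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    "CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3))",
]

# Parameterized statements; the driver substitutes %s with the vector literal.
INSERT_SQL = "INSERT INTO items (embedding) VALUES (%s::vector)"

# '<->' is pgvector's Euclidean (L2) distance operator.
QUERY_SQL = "SELECT id FROM items ORDER BY embedding <-> %s::vector LIMIT 5"

query_param = to_vector_literal([3, 1, 2])
print(query_param)
```

With a live PostgreSQL connection, each statement would be passed to the driver's execute call together with `query_param`.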
Key features of pgvector include:
- Support for exact and approximate nearest neighbor search
- Integration with PostgreSQL's indexing mechanisms
- Ability to perform vector operations like addition and subtraction
- Support for various distance metrics (Euclidean, cosine, inner product)
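The three distance metrics in the list above can be written out in a few lines of plain Python. This is only a reference sketch of the math, not pgvector's implementation. In pgvector's SQL these correspond to the operators `<->` (Euclidean), `<#>` (inner product, returned negated so that smaller means closer), and `<=>` (cosine distance).

```python
import math

def l2_distance(a, b):
    """Euclidean distance: straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Dot product: larger values mean more similar vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for same direction, 1 for orthogonal."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)

a, b = [1.0, 0.0], [0.0, 2.0]
print(l2_distance(a, b))      # sqrt(1 + 4)
print(inner_product(a, b))    # orthogonal vectors, so 0.0
print(cosine_distance(a, b))  # orthogonal vectors, so 1.0
```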
By default, pgvector performs exact nearest neighbor search, which guarantees perfect recall but can be slow on large datasets. To improve performance, pgvector lets you create indexes for approximate nearest neighbor search. This approach trades some accuracy for significantly improved speed, which is a worthwhile tradeoff in many real-world applications.
It's important to note that adding an approximate index can change the results of your queries. This is different from typical database indexes, which don't affect the actual results returned. The two types of approximate indexes supported by pgvector are:
- HNSW (Hierarchical Navigable Small World): Introduced in pgvector version 0.5.0, HNSW is known for its high performance and quality of results. It builds a multi-layer graph structure that allows for fast traversal during searches.
- IVFFlat (Inverted File Flat): This method divides the vector space into clusters. During a search, it first identifies the most relevant clusters and then performs an exact search within those clusters. This can significantly speed up searches in large datasets.
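The IVFFlat idea described above (cluster first, then search only the closest clusters) can be sketched as a toy in plain Python. This is a simplified illustration, not pgvector's code: the k-means initialization, the tiny 2-D dataset, and the function names are all assumptions made for the example. The `probes` argument plays the role of the number of clusters inspected per query.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return [sum(coords) / len(points) for coords in zip(*points)]

def assign(vectors, centroids):
    """Put each vector's index into the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists

def build_ivf(vectors, n_lists, iterations=5):
    """Plain k-means with a naive, deterministic init (the first n_lists vectors)."""
    centroids = [list(v) for v in vectors[:n_lists]]
    for _ in range(iterations):
        lists = assign(vectors, centroids)
        # Move each centroid to the mean of its members; keep it if the list is empty.
        centroids = [
            mean([vectors[i] for i in members]) if members else centroids[c]
            for c, members in enumerate(lists)
        ]
    return centroids, assign(vectors, centroids)

def search(query, vectors, centroids, lists, k=1, probes=1):
    # Step 1: find the `probes` clusters whose centroids are closest to the query.
    order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in lists[c]]
    # Step 2: exact search, but only among vectors in those clusters.
    candidates.sort(key=lambda i: dist(query, vectors[i]))
    return candidates[:k]

vectors = [[0, 0], [10, 10], [0, 1], [1, 0], [10, 11], [11, 10], [4, 4]]
centroids, lists = build_ivf(vectors, n_lists=2)

one_probe = search([6, 6], vectors, centroids, lists, k=1, probes=1)
two_probes = search([6, 6], vectors, centroids, lists, k=1, probes=2)
print(one_probe, two_probes)
```

In this toy dataset the true nearest neighbor of the query `[6, 6]` is `[4, 4]`, which sits in the cluster that a single probe skips. Probing both clusters finds it, at the cost of scanning more vectors, which is the speed versus accuracy tradeoff in miniature.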
The choice between these index types depends on your specific use case, considering factors like dataset size, required query speed, and acceptable trade-off in accuracy. HNSW generally offers better performance but may use more memory, while IVFFlat can be more memory-efficient but might be slightly slower or less accurate in some cases.
When implementing pgvector in your project, experiment with both index types and their parameters to find the best configuration for your needs. This tuning can noticeably affect both the speed and the accuracy of your vector search operations.
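One simple way to judge an index configuration is to compare its results against exact search and compute recall. The helper below is a minimal sketch; the ID lists are made-up stand-ins for the results of an exact query and an indexed query over the same data. The settings you would vary while measuring are, per pgvector's documentation, `lists` and `ivfflat.probes` for IVFFlat, and `m`, `ef_construction`, and `hnsw.ef_search` for HNSW.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k results that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

# Hypothetical result IDs for one query (top 5 from each search).
exact = [7, 3, 9, 1, 4]
approx = [7, 9, 3, 8, 2]

recall = recall_at_k(exact, approx)
print(recall)  # 3 of the 5 exact results were found
```

Averaging this value over a representative set of queries, alongside query latency, gives a basis for comparing configurations.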
Wanna learn how to get started using pgvector? Check out this tutorial!
What is MyScale? Overview and Core Technology
MyScale is a cloud-based database built on the open-source ClickHouse database and designed for AI and machine learning workloads. It handles structured and vector data alongside real-time analytics. MyScale focuses on time-series data, vector search, and full-text search, which suits it to real-time processing and AI-driven insights. Building on ClickHouse's architecture gives it high performance and scalability for AI workloads.
A key feature of MyScale is native SQL support, which simplifies AI-driven queries by integrating vector search, full-text search, and traditional SQL queries in one system. This reduces the need for multiple tools. MyScale uses an OLAP database architecture to run analytical processing over both structured and vectorized data on one platform. Developers interact with MyScale through SQL, so it is accessible to anyone familiar with relational databases.
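To illustrate what combining a structured filter, a keyword match, and vector ranking in a single query means, here is a toy sketch in plain Python. It is not MyScale's API or SQL syntax; the rows, field names, and the `search` function are hypothetical, and a real system would express the same logic as one SQL statement over indexed data.

```python
import math

# Hypothetical rows mixing a structured column, free text, and an embedding.
rows = [
    {"id": 1, "category": "shoes", "text": "lightweight running shoes", "vec": [0.9, 0.1]},
    {"id": 2, "category": "shoes", "text": "leather dress shoes", "vec": [0.2, 0.9]},
    {"id": 3, "category": "kitchen", "text": "running water filter", "vec": [0.8, 0.2]},
]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(rows, query_vec, category, keyword, k=5):
    """Filter on a column and a keyword, then rank survivors by vector distance.

    Conceptually: WHERE category = ... AND text contains keyword
                  ORDER BY distance to query_vec LIMIT k
    """
    matches = [
        r for r in rows
        if r["category"] == category and keyword in r["text"].split()
    ]
    matches.sort(key=lambda r: l2(r["vec"], query_vec))
    return [r["id"] for r in matches[:k]]

running_shoes = search(rows, [1.0, 0.0], "shoes", "running")
all_shoes = search(rows, [1.0, 0.0], "shoes", "shoes")
print(running_shoes, all_shoes)
```

Row 3 is close to the query vector but is excluded by the category filter, which is the kind of joint query that otherwise requires stitching together results from separate systems.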
MyScale offers multiple vector index types and similarity metrics for different use cases. It supports the common metrics: Euclidean distance (L2), inner product (IP), and cosine similarity. Its indexing algorithms include MSTG (Multi-Scale Tree Graph), ScaNN, IVFFLAT, IVFPQ, IVFSQ, and HNSW, each with its own tunable parameters. MyScale's proprietary MSTG vector engine uses NVMe SSDs to increase data density, which MyScale claims lets it outperform specialized vector databases on both performance and cost.
By combining an SQL database, a vector database, and a full-text search engine in one system, MyScale reduces infrastructure and maintenance costs. This unification allows joint queries and analytics across data types and gives AI applications a single data foundation. MyScale also offers MyScale Telemetry for observability of LLM systems, which helps with monitoring and debugging. As data grows more complex, MyScale aims to handle newer data modalities and larger databases while maintaining computing performance and integration between data types.
Key Differences
Search Methodology
pgvector offers HNSW and IVFFlat indexing with standard distance metrics (Euclidean, cosine, inner product). MyScale provides more options including MSTG, ScaNN, IVFFLAT, IVFPQ, IVFSQ and HNSW, supporting the same distance metrics.
Data Handling
pgvector inherits PostgreSQL's relational capabilities for structured data management. MyScale combines vector, structured data, and full-text search using OLAP architecture.
Scalability and Performance
pgvector works best with moderately sized datasets and requires tuning at larger scales. MyScale's MSTG engine uses NVMe SSDs for higher data density and claims better performance than specialized vector databases.
Flexibility and Customization
pgvector relies on PostgreSQL's query flexibility. MyScale offers SQL customization and multiple indexing algorithms with tunable parameters.
Integration and Ecosystem
pgvector integrates with existing PostgreSQL deployments and tools. MyScale provides native SQL support and telemetry tools for LLM system monitoring.
Ease of Use
pgvector is straightforward for PostgreSQL users. MyScale's SQL interface makes it accessible to developers familiar with relational databases.
When to Choose pgvector
Choose pgvector when you already use PostgreSQL, need basic vector search capabilities, want to avoid managing multiple databases, and work with moderate-sized datasets that don't require complex scaling or advanced vector operations.
When to Choose MyScale
Choose MyScale when you need advanced vector indexing options, combined vector and full-text search capabilities, high-performance scaling for large datasets, built-in monitoring for LLM systems, or plan to handle complex data types requiring sophisticated query operations.
Conclusion
pgvector excels in providing vector search capabilities within PostgreSQL environments, offering simplicity and integration with existing workflows. MyScale stands out with its advanced indexing options, combined search capabilities, and scalability features. Your choice should depend on your current infrastructure, dataset size, search complexity requirements, and whether you need specialized AI and LLM monitoring tools.
This article gives an overview of pgvector and MyScale, but the right choice depends on your use case. One tool that can help is VectorDBBench, an open-source benchmarking tool for comparing vector databases. In the end, thorough benchmarking with your own datasets and query patterns will be key to deciding between these two powerful but different approaches to vector search.
Using Open-source VectorDBBench to Evaluate and Compare Vector Databases on Your Own
VectorDBBench is an open-source benchmarking tool for users who need high-performance data storage and retrieval systems, especially vector databases. This tool allows users to test and compare different vector database systems like Milvus and Zilliz Cloud (the managed Milvus) using their own datasets and find the one that fits their use cases. With VectorDBBench, users can make decisions based on actual vector database performance rather than marketing claims or hearsay.
VectorDBBench is written in Python and licensed under the MIT open-source license, meaning anyone can freely use, modify, and distribute it. The tool is actively maintained by a community of developers committed to improving its features and performance.
Download VectorDBBench from its GitHub repository to reproduce our benchmark results or obtain performance results on your own datasets.
Take a quick look at the performance of mainstream vector databases on the VectorDBBench Leaderboard.