TiDB vs Neo4j: Choosing the Right Vector Database for Your AI Apps
What is a Vector Database?
Before we compare TiDB and Neo4j, let's first explore the concept of vector databases.
A vector database is specifically designed to store and query high-dimensional vectors, which are numerical representations of unstructured data. These vectors encode complex information, such as the semantic meaning of text, the visual features of images, or product attributes. By enabling efficient similarity searches, vector databases play a pivotal role in AI applications, allowing for more advanced data analysis and retrieval.
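To make the idea of similarity search concrete, here is a minimal sketch of a brute-force nearest-neighbor lookup using cosine similarity. The toy catalog, its three-dimensional vectors, and the function names are made up for illustration; real embeddings come from a model, and a real vector database replaces the linear scan with an index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, items, k=2):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy "embeddings"; real ones have hundreds or thousands of dimensions.
catalog = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.2, 0.1],
    "coffee maker": [0.0, 0.1, 0.9],
}

results = top_k([1.0, 0.0, 0.0], catalog, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The scan compares the query against every stored vector, which is fine for a handful of items but is exactly the cost that approximate indexes are designed to avoid at scale.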
Common use cases for vector databases include e-commerce product recommendations, content discovery platforms, anomaly detection in cybersecurity, medical image analysis, and natural language processing (NLP) tasks. They also play a crucial role in Retrieval Augmented Generation (RAG), a technique that enhances the performance of large language models (LLMs) by providing external knowledge to reduce issues like AI hallucinations.
There are many types of vector databases available in the market, including:
- Purpose-built vector databases such as Milvus and Zilliz Cloud (fully managed Milvus).
- Vector search libraries such as Faiss and Annoy.
- Lightweight vector databases such as Chroma and Milvus Lite.
- Traditional databases with vector search add-ons capable of performing small-scale vector searches.
TiDB is a traditional database and Neo4j is a graph database; both offer vector search as an add-on. This post compares their vector search capabilities.
TiDB: Overview and Core Technology
TiDB, developed by PingCAP, is an open-source, distributed SQL database that offers hybrid transactional and analytical processing (HTAP) capabilities. It is MySQL-compatible, making it easy to adopt for teams already familiar with the MySQL ecosystem. TiDB's distributed SQL architecture provides horizontal scalability like NoSQL databases while retaining the relational model of SQL databases, making it highly flexible for handling both transactional and analytical workloads.
One of TiDB's core strengths is its HTAP architecture, which allows it to process transactional (OLTP) and analytical (OLAP) workloads in a single database, reducing the need for separate systems. Additionally, TiDB's MySQL compatibility makes it easy to integrate into existing environments that rely on MySQL without significant changes to the application code. The database also features auto-sharding, automatically distributing data across nodes to improve read and write performance while maintaining strong consistency.
TiDB supports vector search through integration with external libraries and plugins, enabling efficient management and querying of vectorized data. This feature, combined with TiDB's HTAP architecture, makes it a versatile option for businesses needing vector search capabilities alongside transactional and analytical workloads. The distributed architecture of TiDB allows it to handle large-scale vector queries once the necessary configurations are in place.
While including vector search functionalities in TiDB requires additional configuration, the system's SQL compatibility allows developers to combine vector search with traditional relational queries. This flexibility makes TiDB suitable for complex applications that require both vector search and relational database capabilities, offering a comprehensive solution for diverse data management needs.
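The pattern of mixing a relational filter with a similarity ranking can be sketched as follows. This is an illustration only: it uses Python's built-in sqlite3 as a stand-in for a MySQL-compatible database, and the `products` table, the JSON-encoded `embedding` column, and the `cosine_distance` SQL function are assumptions made for this example, not TiDB's API.

```python
import json
import math
import sqlite3

def cosine_distance(a_json, b_json):
    """1 - cosine similarity for two JSON-encoded vectors."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL (stand-in for a
# vector distance function supplied by the database or a plugin).
conn.create_function("cosine_distance", 2, cosine_distance)
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL, embedding TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "budget headphones", 25.0, json.dumps([0.9, 0.1])),
        (2, "studio headphones", 300.0, json.dumps([0.95, 0.05])),
        (3, "desk lamp", 20.0, json.dumps([0.1, 0.9])),
    ],
)

def search(conn, query_vec, max_price, k=2):
    """Relational filter (price) and vector ranking in a single SQL statement."""
    rows = conn.execute(
        "SELECT name FROM products WHERE price <= ? "
        "ORDER BY cosine_distance(embedding, ?) LIMIT ?",
        (max_price, json.dumps(query_vec), k),
    ).fetchall()
    return [name for (name,) in rows]

hits = search(conn, [1.0, 0.0], max_price=50.0)
print(hits)
```

The point of the sketch is the shape of the query: an ordinary `WHERE` clause narrows the candidates, and the distance function orders whatever remains.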
Neo4j: The Basics
Neo4j’s vector search allows developers to create vector indexes to search for similar data across their graph. These indexes work with node properties that contain vector embeddings, which are numerical representations of data such as text, images, or audio that capture its meaning. The system supports vectors of up to 4096 dimensions and offers both cosine and Euclidean similarity functions.
The implementation uses Hierarchical Navigable Small World (HNSW) graphs to perform fast approximate k-nearest neighbor searches. When querying a vector index, you specify how many neighbors you want to retrieve, and the system returns matching nodes ordered by similarity score. Scores range from 0 to 1, with higher values indicating greater similarity. The HNSW approach works well because it keeps connections between similar vectors, allowing the system to jump quickly between different parts of the vector space.
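A 0-to-1 score implies that the raw cosine or Euclidean value is normalized before it is returned. The sketch below shows one such mapping, which to our understanding matches the formulas given in the Neo4j documentation; treat it as an assumption and check the documentation for your version. The function names are ours.

```python
import math

def cosine_score(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return (1.0 + dot / norm) / 2.0

def euclidean_score(a, b):
    """Map squared Euclidean distance from [0, inf) onto (0, 1]."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)

print(cosine_score([1.0, 0.0], [1.0, 0.0]))     # identical direction
print(cosine_score([1.0, 0.0], [-1.0, 0.0]))    # opposite direction
print(euclidean_score([0.0, 0.0], [1.0, 0.0]))  # one unit apart
```

Under this mapping, identical vectors score 1 with either function, opposite vectors score 0 under cosine, and the Euclidean score shrinks toward 0 as the distance grows.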
Creating and using vector indexes is done through Cypher, Neo4j’s query language. You create an index with the CREATE VECTOR INDEX command, specifying parameters such as the vector dimensions and the similarity function. The system validates that only vectors of the configured dimensions are indexed. Querying an index is done with the db.index.vector.queryNodes procedure, which takes an index name, the number of results, and a query vector as input.
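As a rough sketch, the two statements described above might be assembled from Python like this. The index name `moviePlots`, the `Movie` label, the `embedding` property, and the helper function are hypothetical; the Cypher follows the Neo4j 5.x syntax as we understand it, so verify it against the documentation for your version. Nothing here connects to a database.

```python
def create_vector_index_cypher(index_name, label, prop, dimensions, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement.

    Identifiers cannot be passed as Cypher parameters, so they are interpolated
    here; only use trusted values for index_name, label, and prop.
    """
    if similarity not in ("cosine", "euclidean"):
        raise ValueError("similarity must be 'cosine' or 'euclidean'")
    if not 1 <= dimensions <= 4096:
        raise ValueError("dimensions must be between 1 and 4096")
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON n.{prop}\n"
        "OPTIONS {indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )

# The query takes the index name, the number of neighbors, and the query
# vector as parameters.
QUERY_NODES_CYPHER = (
    "CALL db.index.vector.queryNodes($index_name, $k, $query_vector)\n"
    "YIELD node, score\n"
    "RETURN node, score"
)

statement = create_vector_index_cypher("moviePlots", "Movie", "embedding", 1536)
print(statement)
print(QUERY_NODES_CYPHER)

# With the official driver installed, the statements could be sent roughly like
# this (uri, user, password, and my_vector are placeholders):
#   from neo4j import GraphDatabase
#   with GraphDatabase.driver(uri, auth=(user, password)) as driver:
#       driver.execute_query(statement)
#       driver.execute_query(QUERY_NODES_CYPHER, index_name="moviePlots",
#                            k=5, query_vector=my_vector)
```

The dimension check mirrors the 4096-dimension limit mentioned earlier, so an invalid configuration fails before it reaches the database.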
Neo4j’s vector indexing includes performance optimizations such as quantization, which reduces memory usage by compressing the vector representations. You can tune index behavior with parameters such as the maximum number of connections per node (M) and the number of nearest neighbors tracked during insertion (ef_construction). These parameters let you balance accuracy against performance, although the defaults work well for most use cases. The system also supports relationship vector indexes from version 5.18, so you can search for similar data on relationship properties.
This allows developers to build AI-powered applications. By combining graph queries with vector similarity search, applications can find related data based on semantic meaning rather than exact matches. For example, a movie recommendation system could use plot embedding vectors to find similar movies while using the graph structure to ensure the recommendations come from the genre or era the user prefers.
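The recommendation idea can be illustrated with a small in-memory sketch: a vector step picks the movies with the most similar plot embeddings, then a graph step keeps only those that share a genre with the seed movie. The movie names, embeddings, and genre edges are toy data invented for the example, and plain dictionaries stand in for the graph.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical plot embeddings (node properties) and movie-to-genre edges.
plot_embedding = {
    "movie_a": [1.0, 0.0],
    "movie_b": [0.9, 0.1],
    "movie_c": [0.95, 0.05],
    "movie_d": [0.0, 1.0],
}
in_genre = {
    "movie_a": {"sci-fi"},
    "movie_b": {"sci-fi"},
    "movie_c": {"drama"},
    "movie_d": {"sci-fi"},
}

def recommend(seed, k=2):
    """Vector step: top-k by plot similarity. Graph step: keep shared genres."""
    others = [m for m in plot_embedding if m != seed]
    nearest = sorted(
        others,
        key=lambda m: cosine(plot_embedding[seed], plot_embedding[m]),
        reverse=True,
    )[:k]
    return [m for m in nearest if in_genre[m] & in_genre[seed]]

recommendations = recommend("movie_a", k=2)
print(recommendations)
```

Here `movie_c` has the closest plot embedding but is dropped because it shares no genre with the seed, which is the kind of constraint a graph traversal expresses naturally.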
Key Differences
Search Methodology
TiDB: TiDB uses external libraries and plugins for vector search, so the system integrates third-party tools to handle vectorized data. This gives flexibility but relies heavily on external configuration to optimize vector query performance. Because it can run hybrid transactional and analytical processing (HTAP) workloads, it is a good choice for applications that combine vector search with traditional SQL-based operations.
Neo4j: Neo4j supports vector indexing using Hierarchical Navigable Small World (HNSW) graphs. It can do efficient approximate k-nearest neighbor (k-NN) search with built-in support for cosine and Euclidean similarity metrics. Neo4j’s methodology is tightly integrated with its graph architecture so it can handle vector-based queries along with graph traversal operations.
Data
TiDB: As a distributed SQL database, TiDB is good at managing structured data with MySQL compatibility. It supports hybrid workloads and can integrate unstructured data through external tools, so it suits environments that need a mix of relational and vector data management. This flexibility comes with extra configuration for vector-specific tasks.
Neo4j: Neo4j is good at graph data modeling and is ideal for managing highly connected and semi-structured data. Its native vector search capabilities complement its strength in traversing relationships and handling graph structures. It suits applications such as recommendation systems, fraud detection, or knowledge graphs that require semantic understanding and connections between entities.
Scalability and Performance
TiDB: TiDB scales horizontally thanks to its distributed architecture. Auto-sharding distributes data evenly across nodes, which makes it suitable for large-scale workloads. High-performance vector search, however, may require tuning the external libraries and ensuring optimal integration with TiDB’s architecture.
Neo4j: Neo4j’s vector search performance is optimized through HNSW graphs, which reduce query time by structuring connections among similar vectors. Features like quantization help conserve memory while maintaining query accuracy. While Neo4j scales well for graph workloads, managing very large vector datasets may require careful resource planning.
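To show why quantization saves memory, here is a minimal sketch of scalar quantization, which stores each 32-bit float as an 8-bit integer plus one shared scale factor. This is a generic illustration of the technique, not a description of how Neo4j implements it; the function names and sample values are ours.

```python
import array

def quantize(vec):
    """Map floats to signed 8-bit codes using a single per-vector scale."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0  # avoid a zero scale
    codes = array.array("b", (round(x / scale) for x in vec))
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the 8-bit codes."""
    return [c * scale for c in codes]

original = array.array("f", [0.12, -0.53, 0.98, 0.0])  # float32 storage
codes, scale = quantize(original)
restored = dequantize(codes, scale)

print("bytes before:", len(original) * original.itemsize)
print("bytes after: ", len(codes) * codes.itemsize)
print("max error:   ", max(abs(o - r) for o, r in zip(original, restored)))
```

The codes take a quarter of the space of the float32 values, and each restored value is off by at most half the scale, which is the accuracy-for-memory trade that quantized indexes make.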
Flexibility and Customization
TiDB: TiDB is flexible thanks to its SQL compatibility and integration with existing MySQL-based applications. It can combine vector and relational queries, so it suits applications that need both. Customization, however, often depends on the capabilities of the integrated vector libraries.
Neo4j: Highly customizable for graph-based applications, Neo4j allows developers to tune vector indexing parameters to balance performance and accuracy. Its ability to integrate vector search into graph queries is a distinct advantage for applications that rely on semantic relationships.
Integration and Ecosystem
TiDB: TiDB integrates well with MySQL tools and the MySQL ecosystem, so it is a natural choice for teams already working in SQL-based workflows. Vector search requires external plugins, but compatibility with the broader MySQL ecosystem makes TiDB easier to adopt.
Neo4j: Neo4j’s integration capabilities are strong within the graph-centric ecosystem, with good support for AI/ML workflows. Handling graph and vector operations in one environment is a big advantage for AI-powered applications.
Ease of Use
TiDB: If you’re familiar with MySQL, the learning curve for TiDB is lower. But setting up vector search requires understanding the external libraries used, which can add complexity.
Neo4j: While Neo4j’s graph query language (Cypher) has a steeper learning curve for SQL users, its native vector search is easy to use and requires less external setup compared to TiDB.
Cost
TiDB: Costs depend on the number of distributed nodes and on any additional licensing or operational cost of the integrated vector libraries. Managed services are available but add to the overall cost.
Neo4j: Neo4j’s cost depends on the scale of graph workloads and the features required. For vector search the native implementation has lower overhead compared to TiDB’s reliance on third-party tools.
Security
TiDB: TiDB provides SQL-based security features, including encryption, access control, and authentication. Security for vector operations depends on the external library used.
Neo4j: Built-in security features like encryption and fine-grained access controls for graph and vector data. It integrates vector search into the core platform so security management is simplified.
When to use TiDB
TiDB is for applications that need large-scale distributed data management with both transactional and analytical workloads. Its HTAP architecture handles structured data natively, while semi-structured or unstructured data can be brought in through external integrations. If your use case combines vector search with SQL queries or integrates vector operations into an existing MySQL-compatible environment, TiDB is a flexible and scalable solution. It is a strong fit for scenarios where strong consistency and scalability across distributed systems matter.
When to use Neo4j
Neo4j is for applications that are based on graph data models and need advanced capabilities to explore relationships between entities. Its native vector search, integrated with graph queries, is well suited to building AI-powered applications such as recommendation systems, knowledge graphs, or fraud detection systems. If you are focused on semantic understanding and finding connections in highly connected datasets, Neo4j’s graph-centric approach with vector indexing is a distinct advantage. Combining graph traversal with similarity search is efficient for workloads that prioritize connected data exploration.
Summary
TiDB and Neo4j serve different use cases, and each excels in different areas. TiDB’s strengths are hybrid transactional and analytical processing, distributed scalability, and MySQL compatibility, so it is a good choice for SQL-focused applications that need vector search. Neo4j’s graph-based architecture and native vector indexing suit applications that prioritize relationships and semantic insights. Choose between the two based on your use case: do you need robust distributed SQL capabilities with vector search, or a graph database that integrates vector search into connected data workflows? Evaluate your data types, workload patterns, and performance requirements to decide.
This post gives an overview of TiDB and Neo4j, but the right choice depends on your own use case. One tool that can help is VectorDBBench, an open-source benchmarking tool for comparing vector databases. In the end, thorough benchmarking with your own datasets and query patterns will be key to deciding between these two powerful but different approaches to vector search in distributed database systems.
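If you want a feel for what such benchmarking measures before reaching for a full tool, the two basic quantities are recall and latency. The sketch below computes both for any search function; it is a simplified illustration with hypothetical result IDs, not VectorDBBench's implementation.

```python
import time

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def average_latency(search_fn, queries):
    """Average wall-clock seconds per query for the given search function."""
    start = time.perf_counter()
    for query in queries:
        search_fn(query)
    return (time.perf_counter() - start) / len(queries)

# Hypothetical result IDs for one query: exact (brute force) vs. approximate (index).
exact = [3, 7, 1, 9, 4]
approx = [3, 1, 8, 7, 2]

recall = recall_at_k(approx, exact, k=5)
print(f"recall@5 = {recall:.2f}")

# Any callable can be timed; a trivial placeholder stands in for a real query.
latency = average_latency(lambda q: sorted(q), [[3, 1, 2]] * 100)
print(f"average latency = {latency:.6f} s")
```

In practice you would average recall over many queries drawn from your own workload and compare it against latency, since index settings that raise one usually cost the other.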
Using Open-source VectorDBBench to Evaluate and Compare Vector Databases on Your Own
VectorDBBench is an open-source benchmarking tool for users who need high-performance data storage and retrieval systems, especially vector databases. This tool allows users to test and compare different vector database systems like Milvus and Zilliz Cloud (the managed Milvus) using their own datasets and find the one that fits their use cases. With VectorDBBench, users can make decisions based on actual vector database performance rather than marketing claims or hearsay.
VectorDBBench is written in Python and licensed under the MIT open-source license, meaning anyone can freely use, modify, and distribute it. The tool is actively maintained by a community of developers committed to improving its features and performance.
Download VectorDBBench from its GitHub repository to reproduce our benchmark results or obtain performance results on your own datasets.
Take a quick look at the performance of mainstream vector databases on the VectorDBBench Leaderboard.