Elasticsearch vs MyScale: Selecting the Right Database for GenAI Applications

Nov 23, 2024 · 9 min read

As AI-driven applications evolve, the importance of vector search capabilities in supporting these advancements cannot be overstated. This blog post will discuss two prominent databases with vector search capabilities: Elasticsearch and MyScale. Each provides robust capabilities for handling vector search, an essential feature for applications such as recommendation engines, image retrieval, and semantic search. Our goal is to provide developers and engineers with a clear comparison, aiding in the decision of which database best aligns with their specific requirements.

What is a Vector Database?

Before we compare Elasticsearch and MyScale, let's first explore the concept of vector databases.

A vector database is specifically designed to store and query high-dimensional vectors, which are numerical representations of unstructured data. These vectors encode complex information, such as the semantic meaning of text, the visual features of images, or product attributes. By enabling efficient similarity searches, vector databases play a pivotal role in AI applications, allowing for more advanced data analysis and retrieval.
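To make the idea of a similarity search concrete, here is a minimal brute-force sketch in plain Python. The catalog entries and their 3-dimensional vectors are made up for illustration (real embedding models produce hundreds of dimensions), and a real vector database would use an index rather than scanning every item.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, items, k=2):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy "embeddings" for three products.
CATALOG = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.2, 0.1],
    "coffee maker": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    # A query vector that points in roughly the same direction as the shoe items.
    for name, score in top_k([1.0, 0.1, 0.0], CATALOG):
        print(f"{name}: {score:.3f}")
```

The two shoe items rank ahead of the coffee maker because their vectors point in nearly the same direction as the query, which is the behavior a recommendation or semantic search feature relies on.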

Common use cases for vector databases include e-commerce product recommendations, content discovery platforms, anomaly detection in cybersecurity, medical image analysis, and natural language processing (NLP) tasks. They also play a crucial role in Retrieval Augmented Generation (RAG), a technique that enhances the performance of large language models (LLMs) by providing external knowledge to reduce issues like AI hallucinations.

There are many types of vector databases available in the market, including:

Purpose-built vector databases such as Milvus and Zilliz Cloud (fully managed Milvus).
Vector search libraries such as Faiss and Annoy.
Lightweight vector databases such as Chroma and Milvus Lite.
Traditional databases with vector search add-ons capable of performing small-scale vector searches.

Elasticsearch is a search engine based on Apache Lucene, and MyScale is a database built on ClickHouse that combines vector search and SQL analytics. Both offer vector search as an add-on, and this post compares those capabilities.

Elasticsearch: Overview and Core Technology

Elasticsearch is an open-source search engine built on top of the Apache Lucene library. It’s known for real-time indexing and full-text search, which makes it a go-to choice for search-heavy applications and log analytics. Elasticsearch lets you search and analyze large amounts of data quickly and efficiently.

Elasticsearch was built for search and analytics, with features like fuzzy searching, phrase matching and relevance ranking. It’s a good fit for scenarios where complex search queries and real-time data retrieval are required. With the rise of AI applications, Elasticsearch has added vector search capabilities so it can perform similarity search and semantic search, both of which AI use cases like image recognition, document retrieval and Generative AI depend on.

Vector Search

Vector search is integrated into Elasticsearch through Apache Lucene. Lucene organizes data into immutable segments that are merged periodically, and vectors are added to those segments the same way as other data structures. At index time, vectors are buffered in memory, and the buffers are serialized as part of segments when needed. Segments are merged periodically for optimization, and a search combines vector hits across all segments.
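The segment model can be illustrated with a small sketch. This is not Lucene's implementation: each segment here is a plain dictionary, a brute-force scan stands in for the per-segment vector index, and all function names are hypothetical. It only shows the shape of the idea, which is to search each segment independently, then keep the best hits overall.

```python
import heapq
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_segment(segment, query, k):
    """Top-k (distance, doc_id) hits from one segment (brute force stand-in)."""
    return heapq.nsmallest(
        k, ((l2(vec, query), doc_id) for doc_id, vec in segment.items())
    )

def search_all(segments, query, k):
    """Search every segment on its own, then keep the global best k hits."""
    per_segment = (search_segment(seg, query, k) for seg in segments)
    return heapq.nsmallest(k, (hit for hits in per_segment for hit in hits))

def merge_segments(segments):
    """Merging builds a new segment; the input segments are left untouched."""
    merged = {}
    for seg in segments:
        merged.update(seg)
    return merged

# Two hypothetical segments holding 2-dimensional vectors.
SEGMENTS = [
    {"a": [0.0, 0.0], "b": [5.0, 5.0]},
    {"c": [1.0, 0.0], "d": [9.0, 9.0]},
]

if __name__ == "__main__":
    print(search_all(SEGMENTS, [0.0, 0.0], k=2))
```

In this exact-search sketch, querying the separate segments and querying the merged segment return the same hits; merging only reduces how many structures a search has to visit.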

For vector indexing, Elasticsearch uses the HNSW (Hierarchical Navigable Small World) algorithm, which builds a graph in which similar vectors are connected to each other. HNSW was chosen for its simplicity, strong benchmark performance and ability to handle incremental updates without retraining the whole index. Vector searches typically complete in tens or hundreds of milliseconds, much faster than brute-force approaches.
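As a rough sketch of what this looks like in practice, the snippet below builds the JSON bodies for an index mapping with an HNSW-indexed dense_vector field and for an approximate kNN search. It follows the Elasticsearch 8.x request format as we understand it, so check it against the documentation for your version. The field name, dimension count and tuning values are assumptions chosen for illustration, and nothing here contacts a cluster.

```python
import json

def hnsw_mapping(field="embedding", dims=384, m=16, ef_construction=100):
    """Index mapping body with an HNSW-indexed dense_vector field.

    The field name and default values are illustrative assumptions.
    """
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                    "index_options": {
                        "type": "hnsw",
                        "m": m,  # graph connections per node
                        "ef_construction": ef_construction,  # build-time candidate list size
                    },
                }
            }
        }
    }

def knn_search_body(query_vector, field="embedding", k=10, num_candidates=100):
    """Search body for an approximate kNN query against that field."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,  # candidates examined per shard
        }
    }

if __name__ == "__main__":
    print(json.dumps(hnsw_mapping(dims=3), indent=2))
    print(json.dumps(knn_search_body([0.1, 0.2, 0.3], k=5), indent=2))
```

Raising m, ef_construction or num_candidates generally trades indexing or query time for better recall, so these are the values worth tuning against your own data.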

Elasticsearch’s technical architecture is one of its biggest strengths. The system supports lock-free searching even during concurrent indexing and maintains strict consistency across different fields when documents are updated. If you update both vector and keyword fields, a search will see either all the old values or all the new values, never a mix. The system can scale beyond available RAM, but performance is best when the vector data fits in memory.

Beyond the core vector search capabilities, Elasticsearch provides practical integration features. Vector searches can be combined with traditional Elasticsearch filters, and you can run hybrid searches that mix vector similarity with full-text search results. Vector search is also fully compatible with Elasticsearch’s security features, aggregations and index sorting, which makes it a complete solution for modern search use cases.
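A hybrid request of this kind might look like the sketch below, which builds a search body that pairs a full-text match query with a filtered kNN clause. The structure follows the Elasticsearch 8.x search API as we understand it; the field names (title, embedding, category) are hypothetical, and the snippet only constructs the body without sending it.

```python
import json

def hybrid_search_body(text, query_vector, category, k=10, num_candidates=100):
    """Search body combining a keyword query with a filtered vector query.

    Field names are illustrative assumptions, not a fixed schema.
    """
    return {
        # Full-text side: classic relevance scoring on a text field.
        "query": {"match": {"title": {"query": text}}},
        # Vector side: approximate nearest neighbors, restricted by a filter.
        "knn": {
            "field": "embedding",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
            "filter": {"term": {"category": category}},
        },
        "size": k,
    }

if __name__ == "__main__":
    body = hybrid_search_body("trail running", [0.1, 0.2, 0.3], "footwear", k=5)
    print(json.dumps(body, indent=2))
```

Keeping the filter inside the knn clause means it is applied during the vector search rather than after it, so the request can still return k neighbors that satisfy the filter.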

What is MyScale? Overview and Core Technology

MyScale is a cloud-based database built on top of the open-source ClickHouse database and designed for AI and machine learning workloads. It handles both structured and vector data and supports real-time analytics. MyScale focuses on time series, vector search and full-text search, which suits it to real-time processing and AI-driven insights. By building on the ClickHouse architecture, MyScale aims for high performance and scalability in AI workloads.

One of MyScale’s key features is native SQL support, which simplifies AI-driven queries by integrating vector search, full-text search and traditional SQL queries in one system. This reduces the need for multiple tools. MyScale uses an OLAP database architecture to run analytical processing over both structured and vectorized data on one platform. Developers interact with MyScale using SQL, so it’s accessible to programmers familiar with relational databases.
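To give a feel for the SQL-first approach, the sketch below assembles a query that filters on a structured column and orders rows by vector distance. It assumes MyScale's distance() function as described in its documentation; the table name, column names and filter are hypothetical, and the code only builds the SQL string rather than connecting to a database.

```python
def vector_search_sql(query_vector, min_year, k=5, table="default.articles"):
    """Build a SQL query mixing a structured filter with vector ordering.

    Assumes a hypothetical table with id, title, published_year and an
    embedding column, plus MyScale's distance() function.
    """
    # Render the query vector as a SQL array literal.
    vec = "[" + ", ".join(f"{float(x):.6f}" for x in query_vector) + "]"
    return (
        f"SELECT id, title, distance(embedding, {vec}) AS dist\n"
        f"FROM {table}\n"
        f"WHERE published_year >= {int(min_year)}\n"
        f"ORDER BY dist ASC\n"
        f"LIMIT {int(k)}"
    )

if __name__ == "__main__":
    print(vector_search_sql([0.1, 0.2, 0.3], min_year=2023))
```

Because the result is ordinary SQL, the same statement could be extended with joins or aggregations over the structured columns, which is the main appeal of keeping both kinds of data in one system.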

MyScale has multiple vector index types and similarity metrics to support different use cases. It supports common distance metrics: Euclidean distance (L2), inner product (IP) and cosine similarity. The database offers multiple indexing algorithms: MSTG (Multi-Scale Tree Graph), ScaNN, IVFFLAT, IVFPQ, IVFSQ and HNSW, each with its own set of parameters to tune. MyScale’s proprietary MSTG vector engine uses NVMe SSDs to increase data density, which MyScale says lets it outperform specialized vector databases in both performance and cost.

By combining the functionality of a SQL database, a vector database and a full-text search engine in one system, MyScale reduces infrastructure and maintenance costs. This unification allows joint data queries and analytics over a single data foundation for AI applications. MyScale also offers MyScale Telemetry for observability of LLM systems, so you can monitor and debug them efficiently. As data grows more complex, MyScale is positioned to handle newer data modalities and larger databases while keeping computing performance and integration between different data types.
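The choice of metric matters because the three metrics can rank the same vectors differently. The following self-contained sketch uses made-up 2-dimensional vectors to show this; it is independent of MyScale and only illustrates the math.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Dot product: higher means more similar, and vector length counts."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of normalized vectors: higher means more similar."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

# Hypothetical data: "long" points the same way as the query but is far
# away; "close" sits right next to the query but is slightly off-axis.
QUERY = [1.0, 0.0]
VECTORS = {"long": [3.0, 0.0], "close": [1.0, 0.1]}

def best_match(metric, higher_is_better):
    """Name of the vector the given metric ranks first for QUERY."""
    pick = max if higher_is_better else min
    return pick(VECTORS, key=lambda name: metric(QUERY, VECTORS[name]))

if __name__ == "__main__":
    print("L2:    ", best_match(l2_distance, higher_is_better=False))
    print("IP:    ", best_match(inner_product, higher_is_better=True))
    print("cosine:", best_match(cosine_similarity, higher_is_better=True))
```

L2 prefers the nearby vector, while inner product and cosine prefer the one aligned with the query, so the metric should match how your embedding model was trained.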

Key Differences

When choosing a vector search solution, knowing the main differences between Elasticsearch and MyScale will help you make a decision. Let’s break them down:

Architecture and Foundation

Elasticsearch is built on top of the Apache Lucene library and is focused on search and analytics. It stores vectors in immutable segments that merge periodically and uses the HNSW algorithm for vector indexing. HNSW builds a graph in which similar vectors are connected, so searches typically complete in tens or hundreds of milliseconds.

MyScale takes a different approach, built on top of ClickHouse. It uses OLAP architecture designed for AI and machine learning workloads. MyScale offers multiple indexing options including MSTG, ScaNN, IVFFLAT, IVFPQ, IVFSQ, and HNSW so you have more flexibility to choose the right algorithm for your use case.

Search and Data Management

Elasticsearch is good at combining vector search with traditional search. You can mix vector similarity searches with full-text queries, so it’s strong for hybrid search scenarios. The system is strict about consistency during updates: when you modify both vector and keyword fields, searches will see either all old or all new values.

MyScale stands out with native SQL support so you can combine vector search, full-text search and SQL queries in one system. It handles both structured and vector data with its OLAP architecture which might be what you’re used to if you already work with SQL databases.

Performance and Storage

Elasticsearch performs best when vector data fits in memory, but can scale beyond available RAM. Its lock-free search architecture allows concurrent indexing without blocking searches.
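A quick back-of-the-envelope estimate can tell you whether a dataset is likely to fit in memory. The sketch below counts only the raw float32 vector values; it deliberately ignores HNSW graph overhead, other fields and operating system needs, so treat the result as a lower bound. The corpus size and dimension count are hypothetical.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Bytes for the raw vector values alone (float32 is 4 bytes per value)."""
    return num_vectors * dims * bytes_per_value

def fits_in_memory(num_vectors, dims, available_bytes, bytes_per_value=4):
    """True if the raw vectors alone fit in the given memory budget."""
    return raw_vector_bytes(num_vectors, dims, bytes_per_value) <= available_bytes

if __name__ == "__main__":
    # Hypothetical corpus: 10 million vectors with 768 dimensions each.
    needed = raw_vector_bytes(10_000_000, 768)
    print(f"{needed / 1e9:.2f} GB of raw vector data")
    print("fits in 64 GiB:", fits_in_memory(10_000_000, 768, 64 * 1024**3))
```

For this example the raw vectors come to about 30.72 GB, so a 64 GiB node has headroom, while a 16 GiB node would rely on disk for part of the data.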

MyScale takes a different approach with its MSTG vector engine, which uses NVMe SSDs to increase data density. According to MyScale’s documentation, this gives better performance and cost efficiency than specialized vector databases.

Integration and Monitoring

Elasticsearch’s vector search works with its existing security features, aggregations and index sorting, which makes it suitable for most modern search use cases.

MyScale has MyScale Telemetry to monitor LLM systems, so you can track and debug your apps. It aims to reduce infrastructure complexity by combining SQL database, vector database and full-text search in one system.

When to use each

Elasticsearch is great for hybrid search scenarios where you need to combine full-text search with vector similarity search. Its architecture, built on top of Apache Lucene, suits applications that require real-time indexing, strict data consistency and concurrent search while maintaining performance. It is a strong choice if you’re already in the Elastic ecosystem or building a search application that needs to balance semantic and keyword search.

MyScale is better suited for organizations that have SQL-based workflows and need more vector indexing options. Its ClickHouse and OLAP architecture suits AI and machine learning workloads that combine structured data analysis with vector operations. Its use of NVMe SSDs through the MSTG vector engine and its built-in LLM system monitoring make it a good fit for teams building AI applications that need cost-effective storage and observability.

Conclusion

Ultimately, the choice between Elasticsearch and MyScale depends on your technical requirements and existing infrastructure. Elasticsearch has mature hybrid search capabilities and proven scalability with strict consistency guarantees, so it’s a strong option for search-heavy applications. MyScale has SQL-native vector operations with multiple indexing options and efficient storage utilization, so it’s a good option for AI-focused applications that need structured data analysis. Your decision should be based on your team’s expertise (SQL vs search-specific knowledge), existing technology stack, storage requirements and whether you need hybrid search capabilities or AI workload optimization.

This post gives an overview of Elasticsearch and MyScale, but to choose between them you need to evaluate both against your own use case. One tool that can help is VectorDBBench, an open-source benchmarking tool for vector database comparison. In the end, thorough benchmarking with your own datasets and query patterns will be key to deciding between these two powerful but different approaches to vector search.
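If you want a feel for what such a benchmark measures before reaching for a full tool, the sketch below computes two common numbers, recall at k and 95th percentile latency, for any search function you plug in. It is a simplified illustration and not how VectorDBBench works internally; search_fn is a hypothetical callable that takes a query and k and returns a list of ids, and the ground truth is assumed to come from an exact search you ran beforehand.

```python
import math
import time

def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true top-k ids that appear in the retrieved top-k."""
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    return len(truth & set(retrieved[:k])) / len(truth)

def benchmark(search_fn, queries, ground_truths, k=10):
    """Run every query once and report mean recall and p95 latency."""
    if not queries:
        raise ValueError("need at least one query")
    latencies, recalls = [], []
    for query, truth in zip(queries, ground_truths):
        start = time.perf_counter()
        ids = search_fn(query, k)
        latencies.append(time.perf_counter() - start)
        recalls.append(recall_at_k(ids, truth, k))
    latencies.sort()
    # Nearest-rank percentile: the value at position ceil(0.95 * n).
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "mean_recall": sum(recalls) / len(recalls),
        "p95_latency_s": latencies[p95_index],
    }

if __name__ == "__main__":
    # Stand-in search function that always returns the same ids.
    def fake_search(query, k):
        return [1, 2, 3][:k]

    print(benchmark(fake_search, ["q1", "q2"], [[1, 2, 4], [1, 2, 3]], k=3))
```

Running the same queries and ground truth against each candidate system keeps the comparison fair, since recall and latency only mean something when measured together.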

Using Open-source VectorDBBench to Evaluate and Compare Vector Databases on Your Own

VectorDBBench is an open-source benchmarking tool for users who need high-performance data storage and retrieval systems, especially vector databases. This tool allows users to test and compare different vector database systems like Milvus and Zilliz Cloud (the managed Milvus) using their own datasets and find the one that fits their use cases. With VectorDBBench, users can make decisions based on actual vector database performance rather than marketing claims or hearsay.

VectorDBBench is written in Python and licensed under the MIT open-source license, meaning anyone can freely use, modify, and distribute it. The tool is actively maintained by a community of developers committed to improving its features and performance.

Download VectorDBBench from its GitHub repository to reproduce our benchmark results or obtain performance results on your own datasets.
Take a quick look at the performance of mainstream vector databases on the VectorDBBench Leaderboard.

Updated on Nov 23, 2024

Chloe Williams
Chloe Williams is a technical writer at Zilliz.
