We spent 8 years making vector databases faster. Then we stopped.
Cost matters. It always has. But there's an order: you can only cut costs after you've met the performance bar. A system that's cheap but returns wrong results isn't useful. Neither is one that can't hold latency under load.
Milvus started in 2017 with a simple belief: vector databases would become core data infrastructure, not a feature hidden inside an application. For eight years, that belief led us in one direction: make vector search faster and more predictable. Index compression, segment scheduling, HNSW tuning, prefetch strategies — almost every major optimization pointed at the same thing: get data into local cache and search faster.
That work is still the foundation. Always-on serving is the right architecture for high-QPS, low-latency vector search workloads. If a collection is queried constantly, keeping indexes resident in memory is not waste — it is the cost of serving the product experience.
Then we turned to cost. Tiered storage helped — hot segments in memory, cold data on disk and object storage, real savings. But the nodes never turned off. For a workload that runs five hours a month, you were still paying for the other 715.
That gap is one of the problems the new Zilliz Vector Lakebase is designed to solve. The bigger shift is not simply “make vector search cheaper.” It is to let persistent semantic data support more than one compute lifecycle: always-on serving when latency and throughput matter, and on-demand compute when the data needs to stay queryable but does not need dedicated machines running all month.
The physics behind the always-on serving model
S3 read latency is 20–50 ms per request. HNSW graph traversal touches hundreds of nodes per query, and each hop depends on the one before it, so the latencies add up rather than overlap: a single query served from S3 would take seconds to tens of seconds. The conclusion is obvious: vector indexes have to live in local memory to serve queries. Not a design flaw — physics.
To make this concrete: 100M vectors, 768 dimensions, float32. Raw vector data is ~286 GB; the HNSW graph (M=48) adds another ~55 GB in neighbor links — roughly 340 GB total.
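To keep the arithmetic honest, here is the back-of-envelope version in Python. The sizes and latencies are the figures above; the 300-hop traversal and 4-byte neighbor IDs are illustrative assumptions, not measured values.

```python
N, DIM, BYTES = 100_000_000, 768, 4       # 100M vectors, 768-dim float32
M = 48                                     # HNSW graph degree

raw_gib = N * DIM * BYTES / 2**30                  # ~286 GiB of raw vectors
base_links_gib = N * 2 * M * 4 / 2**30             # base layer: ~2*M neighbor IDs/node
print(f"raw: {raw_gib:.0f} GiB, base-layer links: {base_links_gib:.0f} GiB")
# upper layers and bookkeeping bring the graph to ~55 GB, ~340 GB in total

# Why the index can't stay on S3: traversal is a chain of *dependent* reads,
# so per-hop latency multiplies instead of pipelining away.
hops = 300                                         # "hundreds of nodes per query"
lo, hi = (hops * ms / 1000 for ms in (20, 50))     # S3: 20-50 ms per request
print(f"one query served from S3: {lo:.0f}-{hi:.0f} s")   # ~6-15 s
```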
Traditional Milvus QueryNode model:
┌──────────────────────────────────────────────────────────────┐
│ Traditional Milvus architecture │
│ │
│ 100M × 768-dim float32 → ~340 GB split across 3 QueryNodes │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ QueryNode 1 │ │ QueryNode 2 │ │ QueryNode 3 │ │
│ │ 128GB RAM │ │ 128GB RAM │ │ 128GB RAM │ │
│ │ + NVMe │ │ + NVMe │ │ + NVMe │ │
│ │ seg 0-99 │ │ seg 100-199 │ │ seg 200-299 │ │
│ │ (~113 GB) │ │ (~113 GB) │ │ (~113 GB) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ load() │ load() │ load() │
│ └─────────────────┼─────────────────┘ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ S3 (source of truth) │ │
│ │ 340 GB full dataset │ │
│ └───────────────────────┘ │
│ Collection queryable only when all 340 GB are loaded │
│ Node fails → its segments go dark → reload from S3 │
└──────────────────────────────────────────────────────────────┘
Every segment needs a resident node before the collection is queryable. 340 GB of data, three 128 GB machines, running 24/7. For frequently queried collections, this works fine. Then AI changed the demand pattern.
Product teams run two-week A/B experiments, after which those embeddings are never queried again. In SaaS products, 90% of users didn't log in last week. In RAG knowledge bases, 80% of documents haven't been retrieved in the past month. The data isn't useless — it might be queried anytime — but it's rarely queried. Traditional databases handle this with tiering: hot data in memory, cold data on disk, paged in on demand. Vector databases had no such concept. Either you loaded the entire collection, or it wasn't queryable.
Before AI-generated embeddings became widespread, that binary wasn't a problem. Most vector workloads were either clearly online serving systems, where keeping indexes resident in memory made sense, or offline experiments that could tolerate bespoke pipelines. AI changed that middle ground.
We started seeing this shift in customer conversations. Embeddings were no longer just powering production RAG chatbots. A global GPU leader was embedding autonomous driving data — camera frames, driving sessions, weather, location, timestamps, and other metadata — so engineers could mine rare driving scenarios across tens of billions of vectors. An education technology company was using semantic search for multilingual plagiarism detection, where workloads could swing from a handful of documents to 10,000+ documents in a batch during exam periods.
This is the context for Vector Lakebase. AI teams are accumulating unstructured data that needs to remain persistent and discoverable, but the access pattern is uneven. Some paths need continuous serving. Others need occasional search, exploration, or batch discovery over the same underlying data. Treating all of those paths as always-on serving leaves too much infrastructure idle.
A user posted in our community Slack:
"My embeddings are already in S3. You're telling me I need to spend three hours importing them, keep three machines with 128 GB RAM running 24/7, and pay $24,000 a year — just to run occasional queries?"
He was right. The problem wasn't where the data lived or whether the index was fast enough. He was paying hot-data prices for a cold-data access pattern: 0.7% active, 100% billed.
The market had already started proving that object-storage-first economics mattered for vector workloads, and stateless compute over object storage was a direction many users wanted to go. But the harder question for us was how to bring that cost model into a complete vector database: with filtered search, database semantics, operational isolation, and a path that still connects back to always-on serving when workloads become hot.
That is our Vector Lakebase thesis: keep semantic data persistent, and let the compute layer match the workload. On-demand search is one expression of that architecture. Getting it right required clearing four technical obstacles.
Four barriers to Lakebase on-demand search
In the Lakebase on-demand search model, QueryNodes spin up on demand, serve queries, then release. Data stays in object storage as the source of truth. Compute scales to zero between query sessions. That sounds simple, but making it usable required addressing cold-start latency, scan volume, I/O amplification during retrieval, and control-plane fixed costs.
Cold start was too slow
340 GB of HNSW index. Loading from S3: over four minutes. Four minutes of cold start kills any on-demand use case. A user fires a query and waits four minutes — that's not a delay, that's a broken product.
The solution was to compress the index while keeping it usable. We built 1+3-bit matryoshka quantization based on RabitQ (Gao & Long, 2024). Two layers, nested like matryoshka dolls.
The 1-bit layer loads first — 13 GB instead of 340 GB. Search runs on it immediately: RabitQ gives a provable error bound on 1-bit distances, so you can safely prune candidates and guarantee nothing in the true top-k gets dropped. 85–90% recall at first query.
The 3-bit layer downloads in the background while the 1-bit search is running. Once ready, it's used as a refinement pass — survivors from the 1-bit stage get rescored at full 1+3-bit precision. Recall goes to 95%+. The two layers aren't alternatives: the 1-bit layer filters candidates, and the 3-bit layer refines the ranking.
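A minimal numpy sketch of that two-stage flow. It illustrates coarse-filter-then-refine only: real RabitQ also applies a random rotation, prunes with its provable error bound, and rescores against the 1+3-bit reconstruction, whereas this toy version rescores against full-precision vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((10_000, 768)).astype(np.float32)
query = rng.standard_normal(768).astype(np.float32)

# Stage 1: search the 1-bit codes the moment they're loaded.
codes = base > 0                                 # sign bits, 1 bit/dim logically
coarse = (codes != (query > 0)).sum(axis=1)      # Hamming distance as a proxy
candidates = np.argsort(coarse)[:200]            # keep a generous candidate pool

# Stage 2: once the finer layer has downloaded, rescore only the survivors.
refined = np.linalg.norm(base[candidates] - query, axis=1)
top10 = candidates[np.argsort(refined)[:10]]
print(top10)
```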
Raw quantization throughput is a bottleneck at scale. GPU-accelerated index build and AVX512 / ARM SVE query kernels bring distance computation throughput to the point where quantization overhead is negligible. Two further improvements push recall higher: per-vector optimal scaling, where each vector gets its own quantization error minimized rather than sharing a global factor; and non-uniform bit allocation across dimensions based on variance, so information-dense dimensions get more bits. Both directly reduce quantization error without increasing index size.
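Here is a sketch of the per-vector scaling idea (the variance-based bit allocation is omitted). For sign codes, the error-minimizing scale has a closed form, which is what keeps it cheap to compute per vector:

```python
import numpy as np

def one_bit_quantize(x: np.ndarray):
    """1-bit codes with a per-vector optimal scale.

    For codes s = sign(x), the alpha minimizing ||x - alpha*s||^2 is
    alpha* = mean(|x|): expand the square in alpha, set the derivative to 0.
    """
    signs = np.sign(x).astype(np.int8)
    alpha = np.float32(np.abs(x).mean())     # this vector's own scale
    return signs, alpha

rng = np.random.default_rng(1)
batch = rng.standard_normal((1_000, 768)).astype(np.float32)
x = batch[0]

signs, alpha = one_bit_quantize(x)
err_per_vector = np.linalg.norm(x - alpha * signs)

# A single shared scale for the whole batch -- what per-vector scaling beats.
global_alpha = np.abs(batch).mean()
err_global = np.linalg.norm(x - global_alpha * np.sign(x))
print(err_per_vector <= err_global)          # True: alpha* is optimal for x
```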
First obstacle cleared. But even with full quantization, scanning 100M vectors is still expensive.
Scanning 100M vectors
The 1-bit index is small, but distance computation over 100M vectors is still linear. In an on-demand model, this compounds: longer compute time means the QueryNode stays resident longer, which shrinks the window for elastic release.
IVF clustering with global index pruning (bucket count scales with data volume):
┌──────────────────────────────────────────────────────────────┐
│ Global Index + IVF pruning │
│ │
│ 100M vectors → IVF clustering (N buckets, N scales with │
│ data volume) │
│ │
│ ┌───┬───┬───┬───┬───┬───┬───┬───┬─── ··· ───┬───┐ │
│ │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ │ N │ │
│ └───┴───┴───┴───┴───┴───┴───┴───┴─── ··· ───┴───┘ │
│ ▲ ▲ ▲ │
│ █ █ █ ← scan only ~3% │
│ │
│ Query q → find nearest centroids → search those buckets │
│ │
│ Data scanned: ~3% of total │
│ S3 I/O: pull ~3% of data │
│ Compute: distance calc on ~3% only │
└──────────────────────────────────────────────────────────────┘
IVF isn't new, but two things make ours different.
First, scale. Most IVF implementations fall apart at the billion-vector scale because building the index requires loading everything into memory at once. We built distributed index construction that shards the clustering work across nodes — IVF at any scale, including billions of vectors.
Second, the Lakebase interaction. At query time, only the relevant buckets get pulled from S3. Probe ~3% of the clusters, fetch ~3% of the data, hold ~3% in QueryNode memory. A node that only loaded 3% of the dataset can be reclaimed almost immediately after the query completes.
Together with 1-bit quantization, the two optimizations compound: 340 GB → 13 GB (quantization) → ~400 MB per query (IVF pruning). Cold start loads only the cluster centroids and the 1-bit index metadata — 5–10 seconds. Each subsequent query fetches only the relevant buckets, not the full 13 GB.
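A sketch of that query path, with an in-memory dictionary standing in for S3 range reads. The names and the toy bucketing are illustrative, not Milvus APIs:

```python
import numpy as np

def ivf_search(query, centroids, fetch_bucket, nprobe=3, k=10):
    """Probe the nearest IVF buckets; only those are pulled from storage."""
    d2c = np.linalg.norm(centroids - query, axis=1)   # rank buckets by centroid
    probed = np.argsort(d2c)[:nprobe]                 # probe only a few percent

    ids, dists = [], []
    for b in probed:
        bucket_ids, bucket_vecs = fetch_bucket(b)     # the only "S3" reads
        ids.append(bucket_ids)
        dists.append(np.linalg.norm(bucket_vecs - query, axis=1))
    ids, dists = np.concatenate(ids), np.concatenate(dists)

    order = np.argsort(dists)[:k]                     # top-k over ~3% of data
    return ids[order], dists[order]

# Usage, with random labels standing in for a real k-means clustering:
rng = np.random.default_rng(2)
data = rng.standard_normal((100_000, 128)).astype(np.float32)
labels = rng.integers(0, 64, size=len(data))
centroids = np.stack([data[labels == b].mean(axis=0) for b in range(64)])
buckets = {b: (np.where(labels == b)[0], data[labels == b]) for b in range(64)}
print(ivf_search(rng.standard_normal(128).astype(np.float32),
                 centroids, lambda b: buckets[b]))
```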
Second obstacle cleared.
Retrieval was amplifying S3 I/O
Vector search returns IDs, not raw data. Getting original vectors or scalar fields means a second round of reads, and in a storage-native query path, each one is an S3 point read.
The problem was the storage format. Standard Parquet files use 64 MB row groups. A single vector record is around 3 KB. Reading it means downloading the whole row group: 3 KB of useful data, 64 MB of actual I/O — about 20,000x amplification. Tolerable on local disk. Brutal on S3.
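The amplification is just the ratio of row-group size to useful record size; a quick calculator makes the scale of the problem, and of the fix described next, concrete:

```python
def read_amplification(row_group_bytes: int, record_bytes: int = 3 * 1024) -> float:
    """Bytes downloaded per useful record when a point read must pull
    the entire row group containing it."""
    return row_group_bytes / record_bytes

for size_mb in (64, 1):                  # standard Parquet vs. 1 MB row groups
    amp = read_amplification(size_mb * 2**20)
    print(f"{size_mb:>2} MB row groups -> {amp:>8,.0f}x amplification")
```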
Storage V2 tackled half of it: separate wide and narrow columns, with vectors and scalar fields stored independently, and row groups shrunk to 1 MB — 64x less amplification. The catch: Parquet's block-level compression relies on large row groups. Shrink them, and compression degrades; files grow larger. Small row groups and good compression are mutually exclusive in Parquet. That's where Vortex comes in.
Vortex, developed by Spiral and hosted by the Linux Foundation, brings three things:
- a fully configurable layout with no forced row group structure;
- direct point queries on compressed data through Delta → RLE → BitPacking nested encoding, no decompression required;
- automatic encoding selection, based on the BtrBlocks algorithm, that balances compression ratio, encode speed, and decode speed.
Benchmarks: 3M rows, 128-dimensional vectors, S3, 256 concurrent readers, 10-row batch per read.
| Metric | Parquet | Lance | Vortex |
|---|---|---|---|
| Point read throughput (reads/s) | 162 | 464 | 620 |
| S3 bytes per read (MB) | 9.44 | 0.006 | 0.07 |
| S3 GETs per row | ~2 | ~5 | ~2 |
| Full scan throughput (MB/s) | 638 | 730 | 1,548 |
| Write throughput (MB/s) | 216 | 247 | 244 |
Parquet downloads 9.44 MB per read — the entire row group. Lance gets that down to 0.006 MB by reading at 512-byte granularity, but pays for it in IOPS: ~5 S3 GETs per row vs. ~2 for the others. Vortex lands at 0.07 MB with ~2 GETs per row — 135x less traffic than Parquet, without the IOPS penalty. Full-scan throughput is 2.4x higher than Parquet; writes are comparable.
Third obstacle cleared.
Control plane costs didn't scale to zero
The first three changes were in the query path. The fourth was hidden in the control plane.
Even when all QueryNodes are idle, each Milvus instance keeps its Coordinator and etcd alive; N tenants means N sets of both. QueryNodes could scale to zero, but those two components couldn't — they're stateful and must stay resident. At a million tenants, the control plane overhead exceeds the QueryNode cost.
The Lakebase control plane changes this from O(N) to O(1):
Traditional Milvus: control plane cost O(N)
┌──────────────────────────────────────────────────────────────┐
│ Shared infrastructure │
│ Kafka / Pulsar (shared) Index Pool (shared) │
└──────────────────────────────────────────────────────────────┘
| | |
Tenant A Tenant B Tenant C
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Coordinator │ │ Coordinator │ │ Coordinator │
│ etcd │ │ etcd │ │ etcd │
├──────────────────┤ ├──────────────────┤ ├──────────────────┤
│ QueryNode │ │ QueryNode │ │ QueryNode │
│ (dedicated) │ │ (dedicated) │ │ (dedicated) │
└────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
└─────────────────────┼─────────────────────┘
↓
┌──────┐
│ S3 │
└──────┘
Lakebase: control plane cost O(1)
┌───────────────────────────────────────────────────────────────┐
│ Shared control plane (per-region) │
│ │
│ ┌──────────────────┐ ┌──────────┐ ┌───────────────────┐ │
│ │ Shared │ │ Catalog │ │ WAL Service │ │
│ │ Coordinator │ │ ≠ etcd │ │ → S3, ≠ Kafka │ │
│ │ │ │ │ │ │ │
│ └──────────────────┘ └──────────┘ └───────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Index Service (GPU Build Pool) │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────┬─────────────────────────────────┘
┌────────────────────┼─────────────────────┐
Tenant A NS Tenant B NS Tenant C NS
┌────────────┐ ┌────────────┐ ┌───────────┐
│ QueryNode │ │ (idle) │ │ QueryNode │
│ QueryNode │ │ scale = 0 │ └────┬──────┘
└──────┬─────┘ └────────────┘ │
└─────────────────────┬─────────────────────┘
↓
┌──────┐
│ S3 │
└──────┘
Lakebase replaces each piece of the old per-tenant model. The coordinator is shared across tenants, replacing per-tenant coordinators. Catalog replaces per-instance etcd and removes the 2 GB storage cap. WAL Service writes directly to S3 without local disk — 750 MB/s measured throughput, 5.8x Kafka — replacing Kafka/Pulsar. Index Service is a GPU build pool shared across tenants, replacing per-instance GPU allocation.
"Scale to zero" stops meaning "QueryNodes can release" and starts meaning "the entire instance costs nearly nothing when idle."
┌──────────────────────────────────────────────────────────────┐
│ Multi-tenant × Lakebase On-demand Search │
│ │
│ S3 storage layer compute (on demand) │
│ ┌──────────────┐ │
│ │Tenant A data │ ◄──── query ──── QueryNode A (active) │
│ ├──────────────┤ │
│ │Tenant B data │ (idle, no QueryNode) │
│ ├──────────────┤ │
│ │Tenant C data │ (idle, no QueryNode) │
│ ├──────────────┤ │
│ │Tenant N data │ ◄──── query ──── QueryNode N (active) │
│ └──────────────┘ │
│ │
│ 1M tenants, 1% active → 99% of data has zero compute cost │
└──────────────────────────────────────────────────────────────┘
Traditionally, multi-tenancy meant sharing a cluster across tenants via separate collections or partitions — but that cluster had hard ceilings: etcd's 2 GB metadata limit, the coordinator's throughput, and fixed QueryNode capacity. Scaling beyond those limits meant more clusters, which meant more overhead.
Lakebase changes the ceiling. Catalog replaces etcd with a scalable metadata store, and the shared coordinator handles far more tenants without per-tenant overhead. S3 provides storage elasticity. The result is a single cluster that can serve many more isolated tenants — and only the tenants actively receiving queries consume compute. The rest pay only for storage.
Back to that Slack user
Same scenario: 100M vectors, 768-dimensional float32, 10 queries a day, one minute each. Active ~5 hours a month.
For this workload, the important difference is not just where the bytes live. It is whether compute has to remain attached to those bytes while nobody is querying them.
For both self-hosted Milvus and the Zilliz Cloud tiered storage model, cold start is a one-time loading cost — once warm, queries are fast. Lakebase on-demand cold start happens at the start of each session, after the node has scaled back to zero, which for this workload is essentially every time. 5–10 seconds per session is the tradeoff for paying nothing between sessions.
Self-hosted cost is mostly always-on EC2: 3 × r6g.4xlarge on-demand at roughly $2,073/month, plus Kafka. The Zilliz Cloud tiered storage model removes the ops burden, but the billing model stays the same. Lakebase on-demand search changes the model: pay only for the five hours you actually use.
| | Self-hosted Milvus | Zilliz Cloud Tiered Storage Model | Lakebase On-demand Search |
|---|---|---|---|
| Compute lifecycle | Always on | Always on | On demand |
| Idle compute cost | Full rate | Full rate | $0 |
| Cold start pattern | One-time load, then warm | One-time load, then warm | 5–10s at session start |
| Best fit | Hot serving workloads | Managed hot/cold tiering | Rarely queried semantic data |
~$240/year. Zero compute cost 99% of the time. Four obstacles, four layers of change.
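For completeness, the arithmetic behind those two price tags. The implied ~$4 per compute-hour is backed out from the ~$240/year figure, not taken from a price sheet:

```python
always_on_monthly = 2_073          # 3 x r6g.4xlarge on-demand, before Kafka
active_hours_per_month = 5         # 10 queries/day x ~1 minute each

always_on_yearly = always_on_monthly * 12        # ~$24.9k/year, idle or not
on_demand_hours = active_hours_per_month * 12    # 60 compute-hours/year
implied_rate = 240 / on_demand_hours             # ~$4 per compute-hour

print(f"always-on: ${always_on_yearly:,}/yr | "
      f"on-demand: ~$240/yr (~${implied_rate:.0f}/hr) | "
      f"active: {active_hours_per_month / 720:.1%} of the month")
```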
The physics didn't change. S3 is still 20–50 ms per read.
What changed is the compute model around those physics: tiered storage reduced the cost of storing colder data, but Lakebase on-demand search removes the always-on compute floor for workloads that are mostly idle.
That gap matters more than the savings. The Slack user who couldn't justify $24,000/year didn't just save money when he moved over — he started indexing more data because search was cheap enough to do more of it. Lower price, more demand.
That is the larger Vector Lakebase story. Once semantic data can persist independently from a single always-on serving cluster, teams can choose the compute shape that matches the workload: continuous serving for hot paths, on-demand search for rarely queried data, and batch compute for discovery or processing jobs.
Zilliz Vector Lakebase is available in public preview
We've launched the public preview of Zilliz Vector Lakebase — a major evolution of Zilliz Cloud from a managed vector database to a unified semantic data platform, combining low-latency vector serving with the openness, scalability, and economics of a data lake.
Vector Lakebase core capabilities:
- Tiered serving optimized for different real-time performance-cost trade-offs
- On-demand search for large-scale or exploratory workloads without always-on compute
- External data lake search — index and search directly over your existing lake data
- Full-spectrum search across vectors, text, JSON, and geospatial data with hybrid retrieval and reranking
- Unified lake-native storage built on Vortex, an open format with faster and cheaper random reads than Lance or Parquet
If your current stack splits serving and discovery into separate systems, Vector Lakebase might be worth a look. Try it on Zilliz Cloud — new work-email signups get $100 free credits — or talk to us about your use case.