What is compute-storage separation in vector databases?

Last updated: 2026-06-09 · By Vector Search Engineering, Zilliz

Direct answer. Compute-storage separation in a vector database means the vectors and their index live in a persistent storage layer — typically object storage like Amazon S3 — decoupled from the compute that serves queries, so the two scale independently. Instead of a fixed cluster where adding capacity means adding CPU and disk together, you keep one copy of the data at rest and attach, resize, or release compute to match the workload. This decoupling is what makes elastic behaviors — scale-to-zero, per-query compute, independent storage growth — possible, and it is the architectural shift behind serverless and on-demand vector search.

How this works

A traditional vector database bundles storage and compute in the same node: the vectors, the index, and the query engine sit together. Scaling throughput means provisioning more nodes that also carry more disk, and the cluster is sized for peak load even while it sits idle. Compute-storage separation breaks that bond. The persistent layer — vectors and index — lives on cheap, durable object storage such as Amazon S3, while a separate, stateless compute tier loads what it needs and caches hot data on local NVMe and RAM.
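The split described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `ObjectStore` and `ComputeNode` are hypothetical names, the store is an in-memory dict standing in for S3, and the search is brute force over one segment. The point it shows is that a compute node holds nothing but a cache, so it can be released and replaced without touching the data.

```python
from collections import OrderedDict


class ObjectStore:
    """Stand-in for a durable object store such as S3 (an in-memory dict here)."""

    def __init__(self):
        self._blobs = {}
        self.reads = 0  # counts round trips to "object storage"

    def put(self, key, value):
        self._blobs[key] = value

    def get(self, key):
        self.reads += 1
        return self._blobs[key]


class ComputeNode:
    """Stateless query worker: everything it holds is a cache of the store."""

    def __init__(self, store, cache_size=2):
        self.store = store
        self.cache = OrderedDict()  # small LRU cache of loaded segments
        self.cache_size = cache_size

    def load_segment(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        segment = self.store.get(key)  # cache miss: read from the store
        self.cache[key] = segment
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used
        return segment

    def search(self, key, query, k=1):
        """Brute-force nearest neighbours by squared Euclidean distance."""
        segment = self.load_segment(key)
        scored = sorted(
            segment,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(item[1], query)),
        )
        return [vector_id for vector_id, _ in scored[:k]]


store = ObjectStore()
store.put("segment-0", [("a", [0.0, 0.0]), ("b", [1.0, 1.0])])

node_1 = ComputeNode(store)
first = node_1.search("segment-0", [0.9, 1.1])  # cold: reads the store
again = node_1.search("segment-0", [0.1, 0.0])  # warm: served from cache
del node_1  # release compute; the stored data is untouched

node_2 = ComputeNode(store)  # a fresh node attaches to the same data
second = node_2.search("segment-0", [0.9, 1.1])
```

Only two reads reach the store here, one per node, because the second query on `node_1` is answered from its cache.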

Data warehouses made this shift first; Snowflake and BigQuery decoupled storage from compute years ago. Vector databases followed. Pinecone's serverless tier, Weaviate and Qdrant in their managed deployment modes, and pgvector running on storage-decoupled Postgres platforms like Amazon Aurora all separate the stored index from query compute, so each scales on its own axis.

The hard part is latency. Object stores answer reads in roughly 20-50 ms, orders of magnitude slower than the nanoseconds of RAM, so a naive design that reads the index from S3 on every query is too slow to serve. Separated architectures close the gap with three techniques: tiered caching (RAM → NVMe → object storage), partial loading (fetch only the clusters a query touches), and quantization to shrink what moves. Techniques like RaBitQ (Gao & Long, 2024) compress each vector by an order of magnitude while keeping search quality within a provable error bound, which is what makes streaming an index from object storage tolerable. Done well, the data stays in one durable place and compute becomes elastic and disposable.
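Tiered caching and partial loading can be sketched together as follows. Everything here is illustrative: `TieredReader`, `nearest_clusters`, and `search` are hypothetical names, the RAM and NVMe latencies are rough assumed figures (only the object-storage number is taken from the 20-50 ms range above), and the sketch omits eviction and quantization. Cluster centroids are assumed small enough to stay in memory, so a query ranks centroids first and fetches only the clusters it needs.

```python
import math

# Assumed per-read latencies in seconds. Only the object-storage figure is
# drawn from the text's 20-50 ms range; the other two are rough placeholders.
LATENCY_S = {"ram": 100e-9, "nvme": 100e-6, "object": 35e-3}


class TieredReader:
    """Read-through lookup: RAM, then NVMe, then object storage."""

    def __init__(self, object_store):
        self.ram = {}
        self.nvme = {}
        self.object_store = object_store  # dict standing in for S3
        self.elapsed_s = 0.0  # simulated time spent on reads

    def read(self, cluster_id):
        if cluster_id in self.ram:
            self.elapsed_s += LATENCY_S["ram"]
            return self.ram[cluster_id]
        if cluster_id in self.nvme:
            self.elapsed_s += LATENCY_S["nvme"]
            data = self.nvme[cluster_id]
        else:
            self.elapsed_s += LATENCY_S["object"]
            data = self.object_store[cluster_id]
            self.nvme[cluster_id] = data  # populate the disk tier
        self.ram[cluster_id] = data  # promote to the fastest tier
        return data


def nearest_clusters(query, centroids, nprobe):
    """Partial loading: pick the nprobe clusters whose centroids are closest."""
    ranked = sorted(centroids, key=lambda cid: math.dist(query, centroids[cid]))
    return ranked[:nprobe]


def search(query, centroids, reader, nprobe=1, k=1):
    candidates = []
    for cluster_id in nearest_clusters(query, centroids, nprobe):
        candidates.extend(reader.read(cluster_id))  # fetch only these clusters
    candidates.sort(key=lambda item: math.dist(query, item[1]))
    return [vector_id for vector_id, _ in candidates[:k]]


centroids = {"c0": [0.0, 0.0], "c1": [10.0, 10.0]}
object_store = {
    "c0": [("a", [0.0, 1.0]), ("b", [1.0, 0.0])],
    "c1": [("c", [10.0, 9.0]), ("d", [9.0, 10.0])],
}

reader = TieredReader(object_store)
cold = search([9.0, 9.9], centroids, reader)  # first touch goes to object storage
cold_s = reader.elapsed_s
warm = search([9.8, 9.0], centroids, reader)  # same cluster, now in RAM
warm_s = reader.elapsed_s - cold_s
```

In this toy run the cold query pays one simulated object-storage read, the warm query pays only a RAM read, and cluster `c0` is never fetched at all.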

In practice (example)

Zilliz Vector Lakebase is built on compute-storage separation as its core architecture: vectors and indexes persist on object storage, and compute attaches per workload on Zilliz Cloud. Lakebase builds on the Milvus serving engine; it is the same engine with storage and compute pulled apart, not a separate product. Two rebuilds make the separation hold at scale. The control plane moved from O(N) to O(1) coordination so the metadata layer doesn't bottleneck as collections grow: a Catalog service replaces per-instance etcd (removing its 2 GB ceiling), and a write-ahead log (WAL) service writes to object storage at 750 MB/s in Zilliz's internal setup. The payoff is On-Demand Search: because the data sits persistently and compute is decoupled, a query can spin compute up, run, and release it, paying only for active minutes. The principle, in Zilliz's engineering account, is to keep semantic data persistent and let the compute layer match the workload.
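The "paying only for active minutes" idea reduces to simple arithmetic. The sketch below uses made-up numbers throughout: the rate, the burst pattern, and the function name are hypothetical and are not Zilliz pricing. It also assumes the same per-minute rate for both modes and ignores storage charges and cold-start time, so it shows only the shape of the comparison.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def compute_cost_cents(rate_cents_per_minute, billed_minutes):
    """Compute-only cost; storage is billed separately in a decoupled design."""
    return rate_cents_per_minute * billed_minutes


# Hypothetical workload: 20 query bursts a day, each keeping compute busy
# for 3 minutes, at an assumed rate of 1 cent per compute-minute.
RATE_CENTS_PER_MINUTE = 1
bursts_per_day = 20
minutes_per_burst = 3
active_minutes = bursts_per_day * minutes_per_burst * 30

# Always-on: compute is billed for every minute, busy or idle.
always_on_cents = compute_cost_cents(RATE_CENTS_PER_MINUTE, MINUTES_PER_MONTH)

# On-demand: compute is billed only while a query is running.
on_demand_cents = compute_cost_cents(RATE_CENTS_PER_MINUTE, active_minutes)

utilization = active_minutes / MINUTES_PER_MONTH  # share of the month in use
```

With these assumed inputs the cluster is busy about 4% of the month, so billing by active minute costs 1/24 of the always-on figure. The gap narrows as utilization rises, and a workload that is busy most of the time gains little from on-demand compute.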
