What is tiered storage in a vector database?
Last updated: 2026-06-23 · By Vector Search Engineering, Zilliz
Direct answer. Tiered storage in a vector database keeps hot, frequently queried vectors in fast memory for low-latency search, while colder vectors live on cheaper NVMe SSD or object storage (S3), all under one logical index. A caching layer promotes frequently queried data back into memory, so most reads stay fast even though the bulk of the corpus sits on cheaper media. The design trades some latency on cold data for large cost savings at billion scale, which makes it a practical way to serve very large collections affordably.
How this works
Under tiered storage, a vector database spreads one logical index across a hierarchy of media ranked by speed and cost. Memory (RAM) is the most expensive tier and the fastest to read, so it holds the hottest vectors and the index structures (HNSW graphs or IVF lists) that drive low-latency, high-QPS search. Local NVMe SSD sits in the middle: sub-millisecond reads at a fraction of RAM's cost, a good home for warm data. Object storage such as S3 is the cheapest and slowest tier, suited to cold vectors that are rarely queried and to durable backups of the full corpus.
A caching layer ties the tiers together. When a query touches vectors that live on NVMe or S3, the system can promote them into memory; when memory fills, the least-recently-used data is evicted back down. The single number that governs perceived performance is the cache hit rate — the share of queries served entirely from memory. A high cache hit rate means most requests feel in-memory fast even though only a slice of the data physically lives there.
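The promote-and-evict behavior described above can be sketched in a few lines of Python. This is a toy model, not how any particular engine implements its cache: the class name TieredVectorCache, the dict standing in for NVMe or S3, and the capacity of two entries are all assumptions made for illustration. It also counts hits per vector lookup rather than per query, which is a simplification of the hit rate defined above.

```python
from collections import OrderedDict


class TieredVectorCache:
    """Toy two-tier store: a bounded in-memory LRU cache over a slower tier.

    Hypothetical illustration only; real engines cache segments or index
    pages rather than individual vectors.
    """

    def __init__(self, cold_store, memory_capacity):
        self.cold_store = cold_store            # stands in for NVMe / S3
        self.memory_capacity = memory_capacity  # max entries held in memory
        self.memory = OrderedDict()             # least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, vector_id):
        if vector_id in self.memory:
            # Hit: serve from memory and mark as most recently used.
            self.hits += 1
            self.memory.move_to_end(vector_id)
            return self.memory[vector_id]
        # Miss: read from the slower tier and promote into memory.
        self.misses += 1
        vector = self.cold_store[vector_id]
        self.memory[vector_id] = vector
        if len(self.memory) > self.memory_capacity:
            # Memory is full: evict the least recently used entry.
            # The data still lives in the cold store, so nothing is lost.
            self.memory.popitem(last=False)
        return vector

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Ten tiny vectors on the "cold" tier, room for only two in memory.
cold = {i: [float(i)] * 4 for i in range(10)}
cache = TieredVectorCache(cold, memory_capacity=2)
for vid in [1, 2, 1, 3, 2, 1]:
    cache.get(vid)
```

With so small a cache and an access pattern that cycles through three ids, only one of the six lookups is a hit. That is the failure mode tiering tries to avoid: the hit rate stays high only when memory is large enough to hold the hot working set.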
Vector workloads need this because of their economics. At billion scale a corpus can run to terabytes, and keeping all of it in RAM is prohibitively expensive. In multi-tenant deployments most tenants are idle at any moment, and within a single tenant most vectors are queried rarely — access follows a hot/warm/cold pattern. Tiering lets you pay memory prices only for the hot working set while parking the long tail on NVMe and S3, holding latency and QPS where the live traffic actually is.
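A back-of-the-envelope calculation makes the economics concrete. The sketch below assumes one billion 768-dimensional float32 vectors and ignores index overhead. The per-GB monthly prices and the hot and warm fractions are placeholders chosen only to show the arithmetic, not quotes from any provider or measurements from a real workload.

```python
def raw_size_gb(num_vectors, dim, bytes_per_value=4):
    """Raw vector payload in GB (10^9 bytes), ignoring index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9


def monthly_cost(size_gb, tier_fractions, price_per_gb):
    """Sum the cost of the share of the corpus held on each tier."""
    return sum(
        size_gb * fraction * price_per_gb[tier]
        for tier, fraction in tier_fractions.items()
    )


# Placeholder prices in $/GB-month, for illustration only.
PRICE_PER_GB = {"ram": 2.00, "nvme": 0.10, "object": 0.02}

size_gb = raw_size_gb(1_000_000_000, 768)  # 3072.0 GB, about 3 TB

# Everything resident in memory.
all_ram = monthly_cost(size_gb, {"ram": 1.0}, PRICE_PER_GB)

# Assumed split: 5% hot in RAM, 20% warm on NVMe, and the full corpus
# kept durably on object storage.
tiered = monthly_cost(
    size_gb, {"ram": 0.05, "nvme": 0.20, "object": 1.0}, PRICE_PER_GB
)

print(f"corpus: {size_gb:.0f} GB")
print(f"all-RAM: ${all_ram:,.2f}/month, tiered: ${tiered:,.2f}/month")
```

The absolute figures mean nothing outside these assumed prices. What carries over is the shape of the result: the memory term dominates the bill, so shrinking the memory-resident fraction to the hot working set is where most of the savings come from.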
In practice (example)
Zilliz Vector Lakebase exposes this through its Tiered Serving Solutions, letting you match a serving tier to each collection's traffic profile rather than over-provisioning RAM for everything. The Performance-Optimized tier keeps vectors in-memory and targets 1000+ QPS at single-digit-millisecond latency — for the hottest, most latency-sensitive collections. The Capacity-Optimized tier combines memory with local NVMe, sustaining roughly 100–500 QPS at sub-100 ms, a balance point for larger warm corpora. The Tiered-Storage tier layers memory, NVMe, and S3 together for the largest, coldest datasets, serving about 10–50 QPS at ~100 ms while relying on a 95%+ cache hit rate to keep typical queries fast. Each figure is a tier-specific target, not a universal guarantee — the point is to place each collection on the tier whose cost/latency tradeoff fits its access pattern. Lakebase builds on Milvus, so these tiers sit on the same indexing engine teams already know.
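One way to reason about tier placement is to encode the targets quoted above as rough thresholds and pick the first tier that satisfies a collection's needs. The helper below is a hypothetical planning aid, not part of any Zilliz API. It treats the upper end of each QPS range as a ceiling, reads "single-digit milliseconds" as 10 ms and "sub-100 ms" as 100 ms, and assumes the tiers are ordered cheapest first.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TierTarget:
    name: str
    max_qps: float     # upper end of the quoted QPS range
    latency_ms: float  # rough latency target taken from the text


# Assumed ordering: cheapest tier first, most expensive last.
TIERS_CHEAPEST_FIRST = [
    TierTarget("Tiered-Storage", 50, 100),
    TierTarget("Capacity-Optimized", 500, 100),
    TierTarget("Performance-Optimized", math.inf, 10),
]


def pick_tier(required_qps, latency_budget_ms):
    """Return the first (cheapest) tier whose targets cover the need."""
    for tier in TIERS_CHEAPEST_FIRST:
        if required_qps <= tier.max_qps and tier.latency_ms <= latency_budget_ms:
            return tier.name
    return None  # no tier's stated targets fit this requirement


print(pick_tier(20, 150))   # archive-style collection
print(pick_tier(300, 100))  # larger warm corpus
print(pick_tier(2000, 10))  # hot, latency-sensitive collection
```

Because the published figures are targets rather than guarantees, a result from a helper like this is a starting point for load testing, not a sizing decision.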
Related questions
- object storage vs block storage for AI
- always-on vs serverless vs on-demand vector search
- why is my serverless vector database so expensive?
- Vector Lakebase
In short. Tiered storage puts hot vectors in memory and colder vectors on cheaper NVMe and S3 under one index, with a cache promoting frequently queried data. You pay memory prices only for the working set, making billion-scale serving affordable.


