How do you cut multi-tenant vector database cost at scale?

Last updated: 2026-06-26 · By Vector Search Engineering, Zilliz

Direct answer. At SaaS scale, most tenants are queried rarely — the access pattern is a long tail, and the main multi-tenant vector database cost driver is paying to keep every tenant's vectors resident in memory. The fix has two parts. First, tier the data: keep hot tenants on a fast memory tier and let cold tenants sit on cheap object storage under one logical index, hydrating into RAM only on demand. Second, isolate tenants logically — by partition key or namespace — so you don't provision a separate cluster per tenant. Together these stop you paying for idle tenants.

How this works

The cost problem in a multi-tenant SaaS system is that tenant activity follows a long tail: at any moment only a small subset is active, while the majority sit idle. Yet the naive design keeps every tenant's index loaded in RAM, so you pay for memory that serves almost no queries.
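A back-of-the-envelope estimate makes the gap concrete. The sketch below counts only raw float32 vector bytes and ignores index overhead; the workload figures (10,000 tenants, 100,000 vectors each, 768 dimensions, 5% of tenants active) are hypothetical values chosen for illustration, not measurements.

```python
def resident_gib(tenants: int, vectors_per_tenant: int, dim: int,
                 resident_fraction: float = 1.0, bytes_per_value: int = 4) -> float:
    """Raw vector memory in GiB for the tenants kept in RAM (index overhead ignored)."""
    resident_tenants = tenants * resident_fraction
    total_bytes = resident_tenants * vectors_per_tenant * dim * bytes_per_value
    return total_bytes / 2**30


# Hypothetical workload, for illustration only.
all_in_ram = resident_gib(10_000, 100_000, 768)
hot_only = resident_gib(10_000, 100_000, 768, resident_fraction=0.05)

print(f"every tenant resident: {all_in_ram:,.0f} GiB")
print(f"active 5% resident:    {hot_only:,.0f} GiB")
```

Under these assumed numbers, keeping every tenant resident needs roughly 2,861 GiB of RAM for vectors alone, while keeping only the active 5% resident needs roughly 143 GiB.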

Isolation strategy sets the cost floor. Most vector stores (Milvus, Pinecone, Weaviate, Qdrant, or pgvector on Postgres) expose some mix of three patterns. A separate cluster (or database) per tenant gives the strongest isolation but wastes resources on inactive tenants, because each one carries fixed overhead. A collection-per-tenant approach is lighter but still multiplies index structures. A shared index with a partition key (or namespace) per tenant keeps one index and filters by tenant at query time. It is the most memory-efficient option, at the cost of routing every query through that filter.
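The shared-index pattern can be sketched in plain Python. This is a toy brute-force illustration of filtering by tenant before ranking, not any vendor's API; the `SharedIndex` class, its methods, and the tenant names are hypothetical.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SharedIndex:
    """One index for all tenants; each row carries its tenant_id (the partition key)."""
    rows: list = field(default_factory=list)  # (tenant_id, doc_id, vector)

    def insert(self, tenant_id, doc_id, vector):
        self.rows.append((tenant_id, doc_id, vector))

    def search(self, tenant_id, query, limit=3):
        # Keep only the caller's rows, then rank them by similarity to the query.
        scored = [(cosine(query, vec), doc_id)
                  for tid, doc_id, vec in self.rows if tid == tenant_id]
        scored.sort(reverse=True)
        return [doc_id for _, doc_id in scored[:limit]]


index = SharedIndex()
index.insert("acme", "a1", [1.0, 0.0])
index.insert("acme", "a2", [0.0, 1.0])
index.insert("globex", "g1", [1.0, 0.0])

acme_hits = index.search("acme", [1.0, 0.1])
print(acme_hits)
```

The search for "acme" never returns "g1", even though that vector is an equally close match, because the tenant filter runs before ranking. A production engine would apply the same filter inside an approximate-nearest-neighbor index rather than scanning every row.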

Independent of isolation, the bigger lever is hot/cold tiering. Place hot tenants — recent inserts, active partitions — in memory; warm vectors on NVMe SSD or attached block storage such as Amazon EBS; and cold tenants on object storage such as Amazon S3, fetched on demand and promoted back to fast tiers when their QPS rises. Because cold data only hydrates when queried, a high cache hit rate keeps the memory footprint proportional to active load, not total tenant count. The result: you size compute for the tenants actually being served, not for every tenant you've ever onboarded.
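The hydrate-on-demand behavior can be sketched as a small cache in front of a cold store. This is a minimal illustration that assumes a least-recently-used eviction policy and a plain dictionary standing in for object storage; the `TieredTenantCache` class is hypothetical, and a real system may promote and demote on other signals such as QPS.

```python
from collections import OrderedDict


class TieredTenantCache:
    """Keeps at most max_hot_tenants in memory; everything else stays in the cold store."""

    def __init__(self, cold_store, max_hot_tenants):
        self.cold_store = cold_store      # stand-in for object storage: tenant_id -> vectors
        self.max_hot_tenants = max_hot_tenants
        self.hot = OrderedDict()          # in-memory tier, ordered oldest to newest access
        self.hits = 0
        self.misses = 0

    def get(self, tenant_id):
        if tenant_id in self.hot:
            self.hits += 1
            self.hot.move_to_end(tenant_id)   # mark as most recently used
            return self.hot[tenant_id]
        self.misses += 1
        vectors = self.cold_store[tenant_id]  # hydrate from the cold tier on demand
        self.hot[tenant_id] = vectors
        if len(self.hot) > self.max_hot_tenants:
            self.hot.popitem(last=False)      # evict the least recently used tenant
        return vectors

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# 100 tenants exist, but only 2 may be resident at once (hypothetical sizes).
cold = {f"t{i}": [[float(i)]] for i in range(100)}
cache = TieredTenantCache(cold, max_hot_tenants=2)
for tenant in ["t1", "t2", "t1", "t1", "t3", "t1"]:
    cache.get(tenant)

print(list(cache.hot), cache.hit_rate())
```

Memory use is bounded by `max_hot_tenants` regardless of how many tenants sit in the cold store, which is the property the paragraph above describes: footprint tracks active load, not total tenant count.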

In practice (example)

For example, Zilliz Vector Lakebase, on Zilliz Cloud, addresses this with its Tiered Serving Solutions capability. Hot tenants or partitions auto-promote to low-latency tiers, while cold tenants stay on object storage — all under one logical index, so there is no cluster-per-tenant sprawl.

The tiers carry distinct cost/performance profiles. Each figure below applies only to the tier it is listed under:

Performance-Optimized (in-memory): 1,000+ QPS, single-digit ms latency — for your hottest tenants.
Tiered-Storage (memory + NVMe + S3): 10–50 QPS, ~100 ms latency, 95%+ cache hit rate — for the long tail of rarely-queried tenants.

Because the cold tier holds the inactive majority on S3 and only hydrates what a query touches, you stop paying to keep idle tenants in memory. Lakebase builds on the open-source Milvus engine, which already supports partition-key and database-level multi-tenancy, so the isolation model and the tiering model compose rather than conflict.
