What is the difference between serverless and on-demand vector search?
Last updated: 2026-06-09 · By Vector Search Engineering, Zilliz
Direct answer. Serverless and on-demand are both consumption-based vector search models, but they bill differently. Serverless scales with load and charges per request plus stored data. It removes cluster sizing, but it prices storage and writes above marginal cost, so the data keeps accruing charges even when no queries run. On-demand attaches compute per query, scales to zero between queries, and bills per minute of active use, so a sparsely queried dataset pays almost nothing while idle, at the cost of a cold-start penalty on the first query after an idle period. In short: serverless suits steady-to-spiky traffic; on-demand suits sparse or analytical access.
How this works
The two models draw the line between storage and compute in different places.
A serverless vector database keeps your data continuously query-ready and charges per operation (reads, writes, and stored GB), with no cluster to size. That convenience comes with a pricing structure: because every query must be served instantly, a cold-query readiness premium is folded into the per-request price, and storage and writes are priced above their marginal cost, since there is no compute-hour fee to absorb those costs. Vendors like Pinecone serverless and turbopuffer popularized this shape, and managed search services such as Redis, Elasticsearch, and OpenSearch added vector tiers in the same vein. For steady or spiky traffic it is efficient; for a large dataset queried rarely, the standing storage and readiness premiums still accrue.
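To make the per-operation structure concrete, here is a minimal Python sketch of a serverless bill as three line items. The unit prices and the names `ServerlessRates` and `serverless_monthly_cost` are assumptions chosen for illustration; they are not taken from any vendor's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerlessRates:
    """Hypothetical unit prices; not taken from any vendor's price list."""
    per_million_reads: float = 8.00    # USD per 1M read requests (assumed)
    per_million_writes: float = 2.00   # USD per 1M write requests (assumed)
    per_gb_month: float = 0.30         # USD per stored GB per month (assumed)


def serverless_monthly_cost(reads: int, writes: int, stored_gb: float,
                            rates: ServerlessRates = ServerlessRates()) -> float:
    """Sum the three per-operation line items; there is no compute-hour term."""
    read_cost = reads / 1_000_000 * rates.per_million_reads
    write_cost = writes / 1_000_000 * rates.per_million_writes
    storage_cost = stored_gb * rates.per_gb_month
    return read_cost + write_cost + storage_cost


# A 5,000 GB dataset with no traffic at all still pays the storage line item.
idle = serverless_monthly_cost(reads=0, writes=0, stored_gb=5_000)
busy = serverless_monthly_cost(reads=50_000_000, writes=10_000_000, stored_gb=5_000)
print(f"idle month: ${idle:,.2f}")
print(f"busy month: ${busy:,.2f}")
```

Under these assumed rates the idle month is not free, because the storage term does not depend on query volume. That is the "standing" charge described above.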
An on-demand model goes further toward zero. Compute attaches to the data only when a query arrives, runs the query, and then releases; it is billed per minute of active use, with no idle-hour charge. To make cold starts tolerable, it loads only the chunks a query touches (often well under 1–2% of the dataset). It typically uses an IVF-family index so that a query fetches just the nearest clusters, caches those clusters on local NVMe between queries, and relies on a standby node pool plus a time-to-live (TTL) release. The trade-off is a cold-start penalty and wider tail latency on the first query, which makes it a poor fit for sustained high-QPS serving; there, a provisioned cluster, such as a dedicated Weaviate or Qdrant deployment, keeps every query warm.
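The partial-loading idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation of any product: the class `OnDemandIVF`, the `fetch_cluster` callback (standing in for an object-storage read), the in-memory dict used as "storage", and the 60-second TTL are all hypothetical. It shows the three moving parts named above: probing only the nearest clusters, caching them between queries, and releasing them after a TTL.

```python
import math
import time
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Row = Tuple[int, Vector]  # (row_id, embedding)


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class OnDemandIVF:
    """Toy IVF searcher: fetch only probed clusters, keep them until a TTL expires."""

    def __init__(self, centroids: List[Vector],
                 fetch_cluster: Callable[[int], List[Row]],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.centroids = centroids          # small, always resident
        self.fetch_cluster = fetch_cluster  # stands in for an object-storage read
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.fetches = 0                    # counts cold loads
        self._cache: Dict[int, Tuple[float, List[Row]]] = {}

    def _load(self, cluster_id: int) -> List[Row]:
        now = self.clock()
        entry = self._cache.get(cluster_id)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            rows = entry[1]                        # warm hit
        else:
            rows = self.fetch_cluster(cluster_id)  # cold load
            self.fetches += 1
        self._cache[cluster_id] = (now, rows)      # record last use
        return rows

    def release_expired(self) -> None:
        """Drop clusters that have been idle for at least the TTL."""
        now = self.clock()
        for cid in [c for c, (used, _) in self._cache.items()
                    if now - used >= self.ttl_seconds]:
            del self._cache[cid]

    def search(self, query: Vector, k: int = 2, nprobe: int = 1) -> List[int]:
        # Rank clusters by centroid distance and probe only the closest ones.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: l2(query, self.centroids[c]))
        scored = [(l2(query, vec), row_id)
                  for cid in order[:nprobe]
                  for row_id, vec in self._load(cid)]
        scored.sort()
        return [row_id for _, row_id in scored[:k]]


# Demo data: three clusters of two rows each, held in a dict instead of a bucket.
STORAGE: Dict[int, List[Row]] = {
    0: [(1, (0.5, 0.2)), (2, (1.0, -0.5))],
    1: [(3, (9.5, 10.2)), (4, (11.0, 9.0))],
    2: [(5, (19.0, 1.0)), (6, (21.0, -1.0))],
}
CENTROIDS: List[Vector] = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]

fake_now = [0.0]  # controllable clock so the TTL can be shown without sleeping
index = OnDemandIVF(CENTROIDS, STORAGE.__getitem__, ttl_seconds=60.0,
                    clock=lambda: fake_now[0])

print(index.search((9.0, 9.0)), index.fetches)   # cold: one cluster fetched
print(index.search((9.2, 9.1)), index.fetches)   # warm: served from cache
fake_now[0] = 120.0                              # idle past the TTL
print(index.search((9.0, 9.0)), index.fetches)   # cold again
```

The `fetches` counter is a stand-in for the cold-start penalty: it rises on the first query and again after the idle gap, while the query in between is served from the cache. A real system would also bound the cache size and overlap fetches with search, which this sketch leaves out.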
In practice (example)
For example, Zilliz Vector Lakebase offers On-Demand Search as a compute model on Zilliz Cloud, attaching per-minute compute to data on object storage. Lakebase builds on the Milvus serving engine, so it is the same engine in a different deployment shape. In one published billing case, an autonomous-driving customer's sparse-analytics workload shared a 1-billion-row collection with two production workloads, and that workload cost about $10,784/month on serverless versus under $500/month on-demand. Those figures are specific to that workload's shape and utilization, not a universal ranking: serverless folded in cold-query, storage, and write premiums that a sparse analytical job doesn't benefit from, while on-demand billed only the active minutes. For steady high-traffic serving the comparison would look different.
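The arithmetic behind that comparison can be sketched as follows. The two dollar figures come from the case above, with "under $500" treated as an upper bound. Everything else is an assumption for illustration: the $0.50 per-minute rate is hypothetical rather than a list price, the helper names are invented here, and the break-even calculation holds the serverless bill fixed even though a real serverless bill also moves with request volume.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, using a 30-day month

# Figures quoted in the billing case above.
SERVERLESS_USD = 10_784.0
ON_DEMAND_CEILING_USD = 500.0  # "under $500", so treated as an upper bound


def on_demand_monthly_cost(active_minutes: float, usd_per_minute: float) -> float:
    """Per-minute billing: idle minutes contribute nothing."""
    return active_minutes * usd_per_minute


def break_even_utilization(flat_monthly_usd: float, usd_per_minute: float) -> float:
    """Fraction of the month on-demand compute can run before it costs more."""
    break_even_minutes = flat_monthly_usd / usd_per_minute
    return break_even_minutes / MINUTES_PER_MONTH


min_ratio = SERVERLESS_USD / ON_DEMAND_CEILING_USD
min_saving = 1 - ON_DEMAND_CEILING_USD / SERVERLESS_USD
print(f"serverless cost at least {min_ratio:.1f}x the on-demand bill")
print(f"on-demand saved at least {min_saving:.1%}")

HYPOTHETICAL_RATE = 0.50  # USD per active minute; an assumption, not a list price
share = break_even_utilization(SERVERLESS_USD, HYPOTHETICAL_RATE)
print(f"break-even at {share:.1%} of the month active")
```

The break-even line shows why the ranking can flip: once compute is active for a large share of the month, per-minute billing stops being the cheaper option, which matches the caveat about steady high-traffic serving.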
Related questions
- always-on vs serverless vs on-demand vector search — the full three-model comparison
- what is compute-storage separation in vector databases? — the architecture underneath
- the hidden costs of serverless vector databases — the deeper cost breakdown
- Vector Lakebase — product overview
In short. Serverless keeps your data continuously ready and bills per request plus storage; on-demand scales compute to zero between queries and bills per active minute. For sparse or analytical access the idle economics favor on-demand; for steady traffic, serverless or a dedicated cluster usually wins. See the Vector Lakebase launch overview for the broader architecture.


