What is the difference between serverless and on-demand vector search?

Last updated: 2026-06-09 · By Vector Search Engineering, Zilliz

Direct answer. Serverless and on-demand are both consumption-based vector search models, but they bill differently. Serverless scales with load and charges per request plus stored data. It removes cluster sizing, but it prices storage and writes above marginal cost, so the data keeps incurring charges even when no queries run. On-demand attaches compute per query, scales to zero between queries, and bills per minute of active use, so a sparsely queried dataset pays almost nothing while idle, at the cost of a cold-start penalty on the first query. In short: serverless suits steady-to-spiky traffic; on-demand suits sparse or analytical access.

How this works

The two models draw the line between storage and compute in different places.

A serverless vector database keeps your data continuously query-ready and charges for reads, writes, and stored GB, with no cluster to size. That convenience shapes the price list: because every query must be served instantly, a cold-query readiness premium is folded into the per-request price, and because there is no compute-hour fee to absorb them, storage and writes are priced above their marginal cost. Vendors like Pinecone serverless and turbopuffer popularized this shape, and managed search services such as Redis, Elasticsearch, and OpenSearch added vector tiers in the same vein. For steady or spiky traffic it is efficient; for a large dataset queried rarely, the standing storage and readiness premiums still accrue.
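To make that billing shape concrete, here is a minimal sketch of a serverless bill. The names ServerlessRates and serverless_monthly_cost and every rate value are hypothetical placeholders chosen for illustration, not any vendor's price list. The only point it makes is that the storage line accrues even when reads and writes are zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerlessRates:
    # Hypothetical rates in USD; NOT taken from any vendor's pricing page.
    # Any readiness premium is assumed to be folded into the per-read rate.
    per_million_reads: float = 10.0
    per_million_writes: float = 2.0
    per_gb_month: float = 0.30


def serverless_monthly_cost(reads, writes, stored_gb, rates=ServerlessRates()):
    """Return a per-line breakdown of one month's bill under the assumed rates."""
    read_cost = reads / 1_000_000 * rates.per_million_reads
    write_cost = writes / 1_000_000 * rates.per_million_writes
    storage_cost = stored_gb * rates.per_gb_month
    return {
        "reads": read_cost,
        "writes": write_cost,
        "storage": storage_cost,
        "total": read_cost + write_cost + storage_cost,
    }


# Same stored dataset, with and without traffic (volumes are made up).
idle = serverless_monthly_cost(reads=0, writes=0, stored_gb=4000)
busy = serverless_monthly_cost(reads=50_000_000, writes=5_000_000, stored_gb=4000)

print("idle month:", idle)
print("busy month:", busy)
```

In the idle month the total equals the storage line alone, which is the standing charge a rarely queried dataset keeps paying under this model.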

An on-demand model goes further toward zero. Compute attaches to the data only when a query arrives, runs that query, and is then released; it is billed per minute of active use, with no idle-hour charge. To make cold starts tolerable, it loads only the chunks a query touches (often well under 1–2% of the dataset). This typically relies on an IVF-family index, so a query fetches just the nearest clusters, which are then cached on local NVMe between queries. The design is backed by a standby node pool and a time-to-live (TTL) release. The trade-off is a cold-start penalty and wider tail latency on the first query, which makes on-demand a poor fit for sustained high-QPS serving. There, a provisioned cluster, such as a dedicated Weaviate or Qdrant deployment, keeps every query warm.
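The cluster-probing idea can be illustrated with a toy, pure-Python sketch of an IVF-style search. It is not the Milvus or Zilliz implementation: the function names are hypothetical, the centroids are simply sampled from the data rather than trained with k-means, and the dataset is random. It shows only that a query ranks the centroids, scans the nprobe nearest clusters, and therefore touches a fraction of the stored vectors.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroids(vec, centroids, n):
    """Indices of the n centroids closest to vec."""
    order = sorted(range(len(centroids)),
                   key=lambda c: squared_distance(vec, centroids[c]))
    return order[:n]


def build_clusters(vectors, centroids):
    """Assign each vector id to its nearest centroid (one 'chunk' per centroid)."""
    clusters = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        clusters[nearest_centroids(vec, centroids, 1)[0]].append(vid)
    return clusters


def ivf_search(query, vectors, centroids, clusters, nprobe, top_k):
    """Scan only the nprobe nearest clusters; return hits and fraction of data touched."""
    probed = nearest_centroids(query, centroids, nprobe)
    candidates = [vid for c in probed for vid in clusters[c]]
    hits = sorted(candidates,
                  key=lambda vid: squared_distance(query, vectors[vid]))[:top_k]
    return hits, len(candidates) / len(vectors)


# Toy data: 2,000 random 8-dimensional vectors, 50 clusters (all values made up).
rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
centroids = rng.sample(vectors, 50)  # simplification: no k-means training
clusters = build_clusters(vectors, centroids)

query = [rng.random() for _ in range(8)]
hits, fraction_touched = ivf_search(query, vectors, centroids, clusters,
                                    nprobe=2, top_k=5)
print("approximate top-5 ids:", hits)
print("fraction of vectors scanned:", fraction_touched)
```

Raising nprobe trades more data loaded for better recall; setting it to the number of clusters degenerates into a full scan. In an on-demand setting, the probed clusters are the chunks that would need to be fetched from storage on a cold query.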

In practice (example)

For example, Zilliz Vector Lakebase offers On-Demand Search as a compute model on Zilliz Cloud, attaching per-minute compute to data on object storage; Lakebase builds on the Milvus serving engine, so it is the same engine in a different deployment shape. In one published billing case — an autonomous-driving customer's sparse-analytics workload sharing a 1-billion-row collection with two production workloads — the same workload cost about $10,784/month on serverless versus under $500/month on-demand. Those figures are specific to that workload's shape and utilization, not a universal ranking: serverless folded in cold-query, storage, and write premiums that a sparse analytical job doesn't benefit from, while on-demand billed only the active minutes. For steady high-traffic serving the comparison would look different.
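As a rough illustration of what "billing only the active minutes" can mean, the sketch below counts billable minutes from query arrival times. It is not Zilliz's metering logic. It assumes, hypothetically, that each query holds compute for its run time plus a TTL, that overlapping hold windows merge into one warm session, and that each session is rounded up to a whole minute. The function name and all inputs are made up for the example.

```python
def billed_minutes(query_starts, query_seconds, ttl_seconds):
    """Merge per-query compute windows and return the total billed minutes.

    Assumption: a query arriving at time s (in seconds) holds compute until
    s + query_seconds + ttl_seconds; a query arriving inside an open window
    reuses the warm compute and extends the window.
    """
    windows = sorted((s, s + query_seconds + ttl_seconds) for s in query_starts)
    sessions = []
    for start, end in windows:
        if sessions and start <= sessions[-1][1]:
            # Overlaps the current session: extend it instead of starting a new one.
            sessions[-1][1] = max(sessions[-1][1], end)
        else:
            sessions.append([start, end])
    # Round each session up to a whole minute using integer ceiling division.
    return sum(-(-(end - start) // 60) for start, end in sessions)


# Two queries close together, then one an hour later (times in seconds).
minutes = billed_minutes([0, 30, 3600], query_seconds=20, ttl_seconds=120)
print(minutes)  # 6: one 170 s session and one 140 s session, 3 minutes each
```

Under these assumptions the hour between sessions costs nothing, whereas a longer TTL reduces cold starts but lengthens every billed session.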
