Always-On vs Serverless vs On-Demand Vector Search: Which Compute Model Fits Your Workload

Last updated: 2026-06-09 · By Vector Search Engineering, Zilliz

Quick answer. In the serverless vs dedicated vector search debate, a third option now matters: on-demand. An always-on (dedicated) cluster stays provisioned around the clock — best for steady, high-QPS serving with tight latency. Serverless scales with usage and bills per request plus storage — best for unpredictable traffic. On-demand attaches compute per query, scales to zero between queries, and bills by the minute — best for sparse or analytical workloads. The deciding factor is your utilization pattern: how often the data is actually queried versus how long the compute would otherwise sit idle.

What Is Each Compute Model

Always-on (dedicated). A cluster of fixed capacity that stays provisioned 24/7. You pay per provisioned hour whether or not queries arrive. It delivers the lowest, most consistent latency and the highest sustained QPS, which is why self-managed vector stores like Weaviate, Qdrant, Chroma, and pgvector on Postgres default to this shape. Pinecone also offers it through its pod-based tier, though the platform now leads with serverless. The trade-off is that idle hours still bill.

Serverless. Capacity scales with load and you pay per operation — reads, writes, and stored data — with no cluster to size or manage. Pinecone serverless popularized this model and newer entrants like turbopuffer take a similar shape; managed search services such as Redis, Elasticsearch, and OpenSearch also added vector tiers alongside their existing indexes. It removes idle-hour billing, but unit prices fold in the cost of being instantly ready, so heavy storage or bursty writes can cost more than expected.

On-demand. Compute attaches to your data only when a query runs, scales to zero between queries, and bills per minute of active use. It trades a cold-start penalty on the first query for near-zero idle cost, which makes it suited to sparse access, analytical iteration, and batch mining with engines like Apache Spark or Ray rather than high-QPS serving.

Key Differences

The three models differ less in what they can store than in how they bill for time and how they handle idle capacity.

Dimension	Always-On (Dedicated)	Serverless	On-Demand
Billing unit	Per provisioned hour	Per request + stored GB	Per minute of active compute
Idle cost	Full (24/7)	Lower, but storage/writes priced up	Near zero (scales to zero)
Cold start	None — always warm	Low	Seconds (loads only queried chunks)
Latency profile	Lowest, most consistent	Variable	Higher tail on the cold query
QPS ceiling	High	Moderate to high	Low to moderate (tens of QPS)
Best workload	Steady high-QPS serving	Bursty / unpredictable	Sparse / analytical / batch

The cost difference is structural, not incidental. A dedicated cluster bills for every provisioned hour whether or not queries arrive, so on a workload active only a few hours a month, most of the spend is idle capacity. Serverless removes the idle-hour bill but folds a cold-query premium into every request's unit price and prices storage and writes above marginal cost; a workload with large stored data or bursty writes can end up paying more than it would on a dedicated cluster. On-demand attacks idle cost differently: by scaling compute to zero between queries and billing per minute, it trades a cold-start penalty on the first query for near-zero cost while the data sits unqueried.

This is why no model is universally lowest-cost — the winner flips with utilization. A high-traffic API that queries constantly amortizes a dedicated cluster's fixed cost; a dataset touched a few hours a month wastes most of it.
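The flip with utilization can be sketched as a small cost model. This is a minimal illustration, not any vendor's pricing formula: the `Workload` fields, the function names, the 730-hour month, and every rate are hypothetical, chosen only to show the shape of the three billing units from the table above.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumed average month length


@dataclass
class Workload:
    active_hours: float  # hours per month in which queries actually run
    requests: int        # queries issued per month
    stored_gb: float     # size of the stored vectors and indexes


def dedicated_cost(hourly_rate: float) -> float:
    """Bills every provisioned hour, whether or not queries arrive."""
    return hourly_rate * HOURS_PER_MONTH


def serverless_cost(w: Workload, per_request: float, per_gb_month: float) -> float:
    """Bills per request plus stored data; no idle-hour charge."""
    return w.requests * per_request + w.stored_gb * per_gb_month


def on_demand_cost(w: Workload, per_minute: float) -> float:
    """Bills per minute of active compute; object storage cost is left out."""
    return w.active_hours * 60 * per_minute


def break_even_active_hours(hourly_rate: float, per_minute: float) -> float:
    """Active hours per month at which on-demand compute matches dedicated."""
    return dedicated_cost(hourly_rate) / (60 * per_minute)


# Hypothetical rates, chosen only to show the shape of the comparison.
sparse = Workload(active_hours=3, requests=20_000, stored_gb=4_000)
costs = {
    "always-on": dedicated_cost(hourly_rate=10.0),
    "serverless": serverless_cost(sparse, per_request=0.001, per_gb_month=0.5),
    "on-demand": on_demand_cost(sparse, per_minute=0.5),
}
for model, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{model:<11} ${cost:,.2f}/month")
print(f"break-even: {break_even_active_hours(10.0, 0.5):.1f} active hours/month")
```

With real rates substituted, the break-even figure is the number to watch: a workload whose active hours sit well below it favors on-demand, and one well above it favors a dedicated cluster.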

When to Use Each

Choose always-on (dedicated) when you serve steady, high-QPS traffic with strict tail-latency requirements: a production retrieval-augmented generation (RAG) endpoint built with LangChain or LlamaIndex, a recommendation API, or anything else where queries arrive continuously and a cold start is unacceptable. Constant traffic amortizes the fixed hourly cost. Stopping a dedicated cluster between sessions is not a workaround for sparse access, because restarting it incurs a cold start of 10+ minutes on a billion-row dataset while it preloads the full working set.

Choose serverless when traffic is unpredictable or spiky and you would rather not size a cluster — early-stage products, internal tools, or workloads where convenience and zero ops outweigh per-unit price. Watch storage and write volume — embeddings from models like OpenAI or Cohere accumulate fast — since those dominate the serverless bill.

Choose on-demand when access is sparse or analytical — periodic batch jobs, dataset exploration, re-embedding runs, or a large collection queried only a few hours a month. Here the dedicated model's idle hours dominate its bill, and on-demand's scale-to-zero economics win decisively. It is the wrong tool for sustained high-QPS serving, where a cold query's extra fetch widens tail latency.
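The three rules above can be condensed into a rough first-pass filter. This is a sketch under stated assumptions: the function name and all three thresholds are hypothetical, picked only to mirror the qualitative guidance (sparse access, a QPS ceiling in the tens for on-demand, strict tail latency for dedicated), and should be tuned against real bills and latency targets.

```python
# Hypothetical thresholds; the article gives qualitative guidance only.
SPARSE_ACTIVE_FRACTION = 0.05   # queried less than ~5% of the month
STEADY_ACTIVE_FRACTION = 0.50   # queried at least half of the month
ON_DEMAND_QPS_CEILING = 50      # "tens" of QPS, per the comparison table


def suggest_compute_model(
    active_fraction: float,
    sustained_qps: float,
    strict_tail_latency: bool,
) -> str:
    """Return a starting-point recommendation, not a final answer."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")

    # A cold start is unacceptable, or traffic is steady enough to
    # amortize the fixed hourly cost.
    if strict_tail_latency or active_fraction >= STEADY_ACTIVE_FRACTION:
        return "always-on"

    # Sparse or analytical access that stays under the on-demand QPS ceiling.
    if (active_fraction <= SPARSE_ACTIVE_FRACTION
            and sustained_qps <= ON_DEMAND_QPS_CEILING):
        return "on-demand"

    # Everything in between: unpredictable or spiky traffic.
    return "serverless"


print(suggest_compute_model(0.90, 500, strict_tail_latency=True))
print(suggest_compute_model(0.004, 5, strict_tail_latency=False))
print(suggest_compute_model(0.20, 100, strict_tail_latency=False))
```

A filter like this only narrows the choice. Storage volume and write rate, which dominate the serverless bill, still need to be priced separately.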

How Vector Lakebase Approaches This

Zilliz Vector Lakebase ships On-Demand Search as a third compute model alongside dedicated and serverless on Zilliz Cloud, attaching per-minute compute to data that stays on object storage. Lakebase builds on the Milvus serving engine, so dedicated, serverless, and on-demand are deployment shapes over one engine, not separate products. In one published billing case — an autonomous-driving customer's sparse-analytics workload sharing a 1B-row collection with two production workloads — the same workload cost about $7,165/month on dedicated (24 compute units, where ~99.6% of provisioned hours were idle), $10,784/month on serverless, and under $500/month on-demand. Those figures are specific to that workload's shape and utilization, not a universal ranking — the point is that the right compute model depends on the access pattern, which is exactly what On-Demand Search adds to the menu.
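A few quantities implied by those figures can be derived with simple arithmetic. The sketch assumes a 730-hour month and that dedicated spend is spread evenly across provisioned hours; neither is stated in the billing case, and the per-compute-unit rate it derives is an inference, not a published price.

```python
HOURS_PER_MONTH = 730  # assumption; the billing case does not state month length

# Figures quoted in the billing case.
dedicated_monthly = 7165.0
serverless_monthly = 10784.0
on_demand_ceiling = 500.0   # "under $500/month", so treated as an upper bound
compute_units = 24
idle_fraction = 0.996

# Hours per month in which the dedicated cluster was actually serving queries.
active_hours = HOURS_PER_MONTH * (1 - idle_fraction)

# Implied price per compute-unit-hour (derived, not a list price).
implied_cu_hour_rate = dedicated_monthly / (compute_units * HOURS_PER_MONTH)

# Share of the dedicated bill that paid for idle capacity.
idle_spend = dedicated_monthly * idle_fraction

# Relative positions of the three options for this one workload.
serverless_vs_dedicated = serverless_monthly / dedicated_monthly
min_saving_vs_dedicated = 1 - on_demand_ceiling / dedicated_monthly

print(f"active hours/month:       {active_hours:.2f}")
print(f"implied $/CU-hour:        {implied_cu_hour_rate:.3f}")
print(f"idle spend/month:         {idle_spend:,.2f}")
print(f"serverless vs dedicated:  {serverless_vs_dedicated:.2f}x")
print(f"on-demand saving (min):   {min_saving_vs_dedicated:.0%}")
```

Under those assumptions, the workload was active for roughly three hours a month, which is why nearly all of the dedicated bill went to idle capacity.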

Frequently Asked Questions

What is the difference between serverless and dedicated vector search? A dedicated deployment runs a fixed-capacity cluster that stays on 24/7 and bills per provisioned hour, giving consistent low latency and high QPS. Serverless scales capacity with load and bills per request plus stored data, with no cluster to manage. Dedicated wins on steady high-traffic serving; serverless wins on unpredictable or low-volume traffic where idle-hour billing would dominate.

Is serverless always cheaper than a dedicated cluster? No. Serverless removes idle-hour billing, but it prices storage and writes above marginal cost and folds a readiness premium into each request. A workload with large stored data or heavy write volume can cost more on serverless than on a dedicated cluster. The crossover depends on how steadily you query and how much data you keep hot.

When does on-demand vector search make sense? When access is sparse or analytical rather than steady. If a large collection is queried only a few hours a month, a dedicated cluster spends most of its bill on idle capacity, and serverless still prices the standing storage. On-demand scales compute to zero between queries and bills per minute, so you pay roughly in proportion to actual use.

Why do serverless vector databases get expensive at scale? Because storage and writes are priced above their marginal cost — there is no compute-hour fee to hide them behind, and the data must stay query-ready. As stored vectors and write throughput grow, those line items dominate, and a high-frequency workload effectively subsidizes the cold queries and idle storage of others in a shared pool.

Can you scale a vector database to zero? With an on-demand model, yes — compute releases when no query is running and re-attaches on the next request, so idle cost approaches zero. The trade-off is a cold-start penalty (typically seconds) on the first query, since the engine loads only the chunks that query touches rather than the full working set.
