What is cold start in a vector database, and how is it cut from minutes to seconds?

Last updated: 2026-06-26 · By Vector Search Engineering, Zilliz

Direct answer. A vector database cold start is the delay you pay when a scaled-to-zero or idle service has to load its index before it can answer the first query. For a billion-vector index sitting in object storage, a naive cold start can take minutes, because the full index has to transfer over the network before any search runs. It drops to seconds when the system loads only the small fraction of data a query actually touches — using an inverted-file index to fetch the nearest clusters — and serves compressed, quantized vectors instead of full-precision ones. The lever is how much you load, not just how fast.

How this works

Cold start happens because modern vector systems detach compute from storage. When a service scales to zero or sits idle (the model behind serverless offerings like Pinecone serverless or Zilliz Cloud), its compute is released; the index persists on object storage like Amazon S3, or in lake formats such as Parquet files and Iceberg tables. The next query has to wake compute and pull index data back before it can run approximate nearest neighbor (ANN) search, the indexed method that returns near-best matches without scanning every vector. (A brute-force scan with Spark needs no index load, but reads everything, so it trades cold start for a slow query.)

The cost is set by physics and by index choice. Each S3 read carries roughly 20–50 ms of latency, and a graph index like HNSW (Hierarchical Navigable Small World) traverses hundreds of nodes per query. If each hop were a separate S3 read, a single query would spend several seconds on latency alone, so an HNSW index is normally loaded into memory in full before it serves anything. That full load is the slow part: a 1B × 768-dim float32 dataset with an HNSW neighbor graph is about 340 GB, and transferring all of it before the first query takes minutes.
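The two costs in this paragraph can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the 340 GB size and the 20–50 ms latency come from the text, while the sustained throughput of 1.4 GB/s and the count of 200 dependent hops are assumptions picked only for illustration.

```python
def transfer_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to stream size_gb at a sustained throughput."""
    return size_gb / throughput_gb_per_s


def sequential_read_seconds(num_reads: int, latency_ms: float) -> float:
    """Time for dependent reads that cannot overlap (each hop waits for the last)."""
    return num_reads * latency_ms / 1000.0


FULL_INDEX_GB = 340.0        # from the text
THROUGHPUT_GB_PER_S = 1.4    # assumption: sustained object-storage throughput per node
GRAPH_HOPS = 200             # assumption: the text only says "hundreds of nodes"

full_load_s = transfer_seconds(FULL_INDEX_GB, THROUGHPUT_GB_PER_S)
print(f"full index load: about {full_load_s / 60:.1f} min")

for latency_ms in (20, 50):  # latency range from the text
    hop_s = sequential_read_seconds(GRAPH_HOPS, latency_ms)
    print(f"{GRAPH_HOPS} dependent reads at {latency_ms} ms: {hop_s:.0f} s")
```

Under these assumptions the full transfer lands around four minutes, and walking the graph hop by hop on object storage costs 4 to 10 seconds per query. Both numbers scale linearly with the inputs, so changing the assumed throughput or hop count changes the result in direct proportion.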

Three levers cut this. First, index structure: an IVF-family index (Inverted File: vectors are bucketed into clusters, and a query reads only the nearest buckets) lets a cold query fetch no more than about 1–2% of the dataset instead of the whole graph. Second, quantization (compressing each vector to fewer bits, e.g. 1-bit codes plus a refinement pass): the same 340 GB index shrinks to roughly 13 GB, so less data crosses the network. Third, caching: the touched chunks are kept on local NVMe SSD and a standby node pool is kept warm, so the second query and the next cold start are cheaper. Together these turn "load everything" into "load only what this query needs."
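To make the first lever concrete, here is a minimal IVF sketch in plain Python. It is a toy under stated assumptions, not how any production system builds its index: centroids are random samples rather than trained with k-means, the data is synthetic, and the names build_ivf, ivf_search, and nprobe (the number of buckets probed) are chosen for illustration. The point it shows is that a query ranks the centroids first and then reads only the vectors in the closest buckets.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, num_clusters, seed=0):
    """Bucket vector ids by nearest centroid.

    Centroids are random samples of the data, a stand-in for k-means training.
    """
    rng = random.Random(seed)
    centroids = rng.sample(vectors, num_clusters)
    buckets = [[] for _ in range(num_clusters)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(num_clusters), key=lambda c: sq_dist(vec, centroids[c]))
        buckets[nearest].append(idx)
    return centroids, buckets


def ivf_search(query, vectors, centroids, buckets, nprobe, k):
    """Return (top-k ids, number of vectors read) probing only nprobe buckets."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k], len(candidates)


rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
centroids, buckets = build_ivf(vectors, num_clusters=100)

query = vectors[0]
ids, read = ivf_search(query, vectors, centroids, buckets, nprobe=2, k=5)
print(f"read {read} of {len(vectors)} vectors ({100 * read / len(vectors):.1f}%)")
```

In a cold-start setting the buckets would live as separate chunks on object storage, so the fraction of vectors read is also roughly the fraction of bytes fetched. Raising nprobe reads more buckets and improves recall at the cost of more data transferred.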

In practice (example)

For example, Zilliz Vector Lakebase's On-Demand Search capability is built to make cold start cheap rather than just fast. On a 1B × 768-dim workload, pulling a full ~340 GB index from S3 can take more than four minutes; On-Demand Search instead loads only the chunks the current query touches (no more than about 1–2% of the dataset) by using an IVF-family index that fetches the closest clusters rather than a graph index that needs the whole structure loaded before it can be walked.

It pairs that with 1+3-bit matryoshka quantization (based on RaBitQ): the 340 GB index compresses to about 13 GB, where the 1-bit stage gives roughly 85–90% recall and a 3-bit refinement pass brings it back above 95%. A standby node pool and TTL-based release keep nodes ready and free them when idle. The combined effect, in Zilliz's reported figures for that 1B-vector workload, is a cold start of about 5–10 seconds instead of 10+ minutes. Lakebase builds on Milvus, so this serving path inherits its index machinery.
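The coarse-then-refine idea behind that recall recovery can be sketched in a few lines. This is not RaBitQ and not Zilliz's implementation: it is a simplified stand-in that uses plain sign bits for the 1-bit stage and re-ranks a shortlist with the original full-precision vectors instead of a 3-bit code. The names sign_code, two_stage_search, and shortlist are hypothetical.

```python
import random


def sign_code(vec):
    """Pack the sign of each dimension into an int: 1 bit per dimension."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of bit positions where two codes differ."""
    return bin(a ^ b).count("1")


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_stage_search(query, vectors, codes, shortlist, k):
    """Coarse pass on 1-bit codes, then exact re-ranking of the shortlist only."""
    q_code = sign_code(query)
    coarse = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))
    refined = sorted(coarse[:shortlist], key=lambda i: sq_dist(query, vectors[i]))
    return refined[:k]


rng = random.Random(7)
vectors = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(1000)]
codes = [sign_code(v) for v in vectors]  # the small artifact loaded first

query = vectors[3]
top = two_stage_search(query, vectors, codes, shortlist=50, k=5)
print(top)
```

The design choice is that the coarse pass touches only the small codes, and the more expensive, more precise data is read for the shortlist alone. A larger shortlist trades extra reads for higher recall.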
