Can you search a data lake without moving data?

Last updated: 2026-06-09 · By Vector Search Engineering, Zilliz

Direct answer. Yes. You can run vector search directly over data that already sits in your data lake — Parquet, Iceberg, or Lance files on S3 or another object store — by building a vector index over those files in place. The index references the existing data, queries hit the index, the source files stay put, and only changed files are reprocessed on refresh. That removes the ETL job, the sync pipeline, and the duplicate storage a separate vector database would otherwise require. The trade-off is cold-start latency on the first query, which lake-native engines reduce with caching, partial loading, and quantization.

How searching in place works

The pattern has two parts: register the files, then build an index over them.

The standard alternative is an export pipeline — Spark for batch embedding, Kafka for change capture, a separate vector store such as Pinecone, Weaviate, or Qdrant for serving. Every update has to flow through that pipeline, and your data lives in two places. In-place search skips the pipeline: the engine reads the embedding column directly from your Parquet, Iceberg, or Delta Lake files on object storage and builds an approximate-nearest-neighbor (ANN) index that points at those files. Queries run against the index; the source data stays where it is. When a file changes, an incremental refresh reprocesses only that file rather than re-embedding the whole dataset.

lake table (Iceberg / Parquet on S3)
      │  register + index in place (no copy)
      ▼
vector index  ──►  query ──► results
      ▲  incremental refresh on changed files only
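
The refresh step in the diagram can be sketched as a manifest comparison. This is a minimal illustration under stated assumptions, not the implementation of any particular engine: the InPlaceIndex class, the fingerprint strings (standing in for something like an object-store ETag), and the read_vectors callback are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class InPlaceIndex:
    # file path -> fingerprint seen at the last refresh (hypothetical; e.g. an ETag)
    fingerprints: dict = field(default_factory=dict)
    # file path -> list of (row_offset, vector); entries point back at the source file
    entries: dict = field(default_factory=dict)

    def refresh(self, listing, read_vectors):
        """Reprocess only new or changed files.

        listing: {path: fingerprint} for the files currently in the lake table.
        read_vectors: callable(path) -> list of vectors from the embedding column.
        Returns the sorted list of paths that were reprocessed.
        """
        reprocessed = []
        for path, fingerprint in listing.items():
            if self.fingerprints.get(path) != fingerprint:
                self.entries[path] = list(enumerate(read_vectors(path)))
                self.fingerprints[path] = fingerprint
                reprocessed.append(path)
        # Drop index entries for files that no longer exist in the table.
        for path in set(self.fingerprints) - set(listing):
            del self.fingerprints[path]
            del self.entries[path]
        return sorted(reprocessed)


# In-memory stand-ins for two data files and their fingerprints.
files = {"part-0.parquet": [[0.0, 1.0]], "part-1.parquet": [[1.0, 0.0]]}
listing = {"part-0.parquet": "etag-a", "part-1.parquet": "etag-b"}

index = InPlaceIndex()
first = index.refresh(listing, files.__getitem__)   # both files are new

# One file is rewritten; its fingerprint changes.
files["part-1.parquet"] = [[1.0, 0.0], [0.5, 0.5]]
listing["part-1.parquet"] = "etag-c"
second = index.refresh(listing, files.__getitem__)  # only the changed file
```

The source files are only read, never copied or rewritten. The index stores offsets that point back into them, which is what keeps the lake as the single copy of the data.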

The reason this was historically slow is physics, not design. Object stores answer reads in tens of milliseconds — orders of magnitude slower than RAM — and a naive HNSW (Hierarchical Navigable Small World) graph traversal touches hundreds of nodes per query, so each lookup fans out into many sequential S3 reads. A one-billion-vector HNSW index can reach hundreds of gigabytes; pulling it cold from object storage takes minutes. Lake-native engines close that gap with two tools. The first is clustering and pruning, so a query scans only a small fraction of the data — around three percent in published setups. The second is quantization: a RaBitQ-style 1+3-bit two-stage scheme can compress a roughly 340 GB index to about 13 GB at the 1-bit coarse stage (with 85–90% recall) and then rerank with the extra 3 bits to about 95% recall.
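
The two-stage idea can be sketched with a much simpler stand-in than RaBitQ: plain sign-bit codes compared by Hamming distance for the coarse stage, followed by an exact rerank of a shortlist. This is not RaBitQ itself, and the rerank here uses the full-precision vectors rather than the extra 3 bits described above. All function names and parameters are illustrative. If the original vectors are 32-bit floats (an assumption), a 1-bit code is at most 32x smaller, the same order of magnitude as the roughly 26x reduction from 340 GB to 13 GB.

```python
import random


def sign_code(vec):
    """Pack the sign of each dimension into an int: 1 bit per dimension."""
    code = 0
    for i, x in enumerate(vec):
        if x >= 0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_stage_search(query, vectors, codes, k=5, shortlist=50):
    q_code = sign_code(query)
    # Stage 1 (coarse): rank all vectors using only their 1-bit codes.
    coarse = sorted(range(len(vectors)),
                    key=lambda i: hamming(q_code, codes[i]))[:shortlist]
    # Stage 2 (rerank): re-score the shortlist with a higher-precision distance.
    return sorted(coarse, key=lambda i: squared_l2(query, vectors[i]))[:k]


random.seed(0)
dim = 64
vectors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2000)]
codes = [sign_code(v) for v in vectors]

# Query: a slightly perturbed copy of vector 123.
query = [x + random.gauss(0, 0.1) for x in vectors[123]]
top = two_stage_search(query, vectors, codes)
```

Only the compact codes need to be resident for stage 1. The higher-precision data is touched for the shortlist alone, which is why this layout suits storage where every read is expensive.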

In practice (example)

For example, in Zilliz Vector Lakebase this is the External Data Lake Search capability, surfaced through External Collections. You call create_external_collection against an S3 bucket or Iceberg path, call create_index to build the ANN index in place (the index persists back to S3 alongside the source files), and then search. The data itself never moves, and a refresh only reprocesses files that changed. Vector Lakebase builds on the Milvus serving engine, so the query path inherits Milvus's index types and tunables.

Our architecture write-up gives a mode-by-mode latency profile for a one-billion-vector Iceberg setup. These are illustrative figures from that engineering account, not a formally specified benchmark:

Mode	Latency	Context
Spark brute-force scan (no index)	hours	baseline lake scan over the same data
Cold — just-built index	~30 seconds	index builds from the Iceberg table in ~20 minutes
Warm — disk cache	double-digit milliseconds	index cached on local SSD
Hot — in-memory	single-digit milliseconds	production serving

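The Spark-scan baseline shows why pure lake-side scans don't substitute for an indexed serving path: the same query that takes hours as a brute-force scan takes seconds against a cold index and milliseconds once the index is cached.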
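
The cold, warm, and hot rows in the table correspond to where the index is read from. A read-through cache makes the relationship concrete. This is a minimal sketch with hypothetical names (TieredIndexCache, segment IDs, a dict standing in for the local SSD), not a description of how any engine actually manages its cache.

```python
class TieredIndexCache:
    """Read-through lookup: memory (hot) -> local disk (warm) -> object store (cold)."""

    def __init__(self, fetch_from_object_store):
        self.memory = {}   # hot tier
        self.disk = {}     # warm tier; a dict stands in for a local SSD cache
        self.fetch = fetch_from_object_store
        self.cold_reads = 0

    def get(self, segment_id):
        if segment_id in self.memory:
            return self.memory[segment_id], "hot"
        if segment_id in self.disk:
            # Promote to memory so the next read is hot.
            self.memory[segment_id] = self.disk[segment_id]
            return self.memory[segment_id], "warm"
        # Cold path: the only branch that touches object storage.
        data = self.fetch(segment_id)
        self.cold_reads += 1
        self.disk[segment_id] = data
        self.memory[segment_id] = data
        return data, "cold"

    def evict_from_memory(self, segment_id):
        self.memory.pop(segment_id, None)


object_store = {"segment-0": b"index bytes"}
cache = TieredIndexCache(object_store.__getitem__)

modes = [cache.get("segment-0")[1]]        # first read goes to object storage
modes.append(cache.get("segment-0")[1])    # now served from memory
cache.evict_from_memory("segment-0")
modes.append(cache.get("segment-0")[1])    # memory miss, served from disk
```

After the first read, the object store is not consulted again unless both tiers lose the segment, so the cold-start cost is paid once per segment rather than once per query.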
