Why is brute-force vector search on Spark so slow?
Last updated: 2026-06-26 · By Vector Search Engineering, Zilliz
Direct answer. A brute-force vector search on Spark is slow because it is an exact KNN (k-nearest-neighbor) scan: it computes the distance from your query to every row, so the cost is O(n) and grows linearly with the data. On a billion-vector table, Spark must read and score the whole dataset for each query, which pushes latency from minutes into hours. Spark is built for distributed batch scans over the lake, not for the sublinear lookups that an approximate-nearest-neighbor (ANN) index provides. Building an ANN index over those same vectors is what turns a full-scan query into a millisecond-scale lookup.
How this works
A brute-force (exact KNN) search compares the query vector against every vector in the table, computing a distance (cosine, dot product, or L2) for each one, then keeping the top-k closest. That is O(n) distance computations per query, or O(n·d) arithmetic for d-dimensional vectors: double the rows and you double the query time, with no way out except faster or more hardware.
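The scan described above can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not Spark code: the names `l2_distance` and `brute_force_knn` are hypothetical, the data is random, and L2 is picked arbitrarily from the three metrics mentioned.

```python
import heapq
import math
import random

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(query, vectors, k):
    """Exact top-k: score every vector, then keep the k smallest distances."""
    scored = ((l2_distance(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)  # list of (distance, row_index), closest first

random.seed(0)
dim = 8
table = [[random.random() for _ in range(dim)] for _ in range(1000)]

query = table[42]                      # reuse a stored row as the query
top = brute_force_knn(query, table, k=3)
print(top)                             # the query's own row comes back first, at distance 0.0
```

Every call walks all 1,000 rows regardless of k, which is the O(n) behavior in question: nothing in the data layout lets the search skip a row.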
Spark parallelizes this scan across executors, which helps throughput, but it does not change the fundamental cost. Every executor still reads its partitions in full from object storage, so the query is dominated by I/O (pulling Parquet files off S3) plus the distance math over the whole dataset. At 1M × 768-dim that's roughly 768M multiply-adds per query; at 100M vectors it is roughly 76.8 billion, and a single query slides from milliseconds into seconds and beyond.
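The arithmetic behind those figures is easy to reproduce. The sketch below assumes one multiply-add per dimension per row and 4-byte float32 values read uncompressed; both are simplifying assumptions (Parquet encoding and compression would change the byte count), and `scan_cost` is a hypothetical helper name.

```python
def scan_cost(num_vectors, dim, bytes_per_value=4):
    """Per-query cost of a full scan: multiply-adds and raw bytes touched.

    Assumes float32 storage (4 bytes per value) and ignores compression.
    """
    multiply_adds = num_vectors * dim
    bytes_read = num_vectors * dim * bytes_per_value
    return multiply_adds, bytes_read

for n in (1_000_000, 100_000_000, 1_000_000_000):
    macs, nbytes = scan_cost(n, 768)
    print(f"{n:>13,} vectors: {macs:.3g} multiply-adds, {nbytes / 1e9:,.1f} GB of raw floats")
```

Under these assumptions, 1M vectors come to about 3.1 GB of raw floats per query and 1B vectors to about 3.1 TB, which is why the read from object storage tends to dominate before the math does.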
ANN indexes change the shape of the problem. Structures like HNSW (a navigable small-world graph) or IVF (inverted-file clustering) organize vectors so a query touches only a fraction of them — clusters near the query, or a few hops through a graph — making search sublinear instead of linear. The trade-off: an ANN result is approximate (very close to the true top-k, not provably exact), and the index has to be built first.
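To make the IVF idea concrete, here is a toy sketch in plain Python: a few rounds of k-means to pick centroids, one bucket of row ids per centroid, and a search that scores only the `nprobe` closest buckets. It is an illustration under stated assumptions, not how Milvus or any production index is implemented; `build_ivf`, `ivf_search`, `nlist`, and `nprobe` are names chosen for this example.

```python
import heapq
import math
import random

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(v, centroids):
    return min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))

def assign(vectors, centroids):
    """Bucket every row id under its nearest centroid."""
    buckets = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        buckets[nearest_centroid(v, centroids)].append(i)
    return buckets

def build_ivf(vectors, nlist, iters=5, seed=0):
    """Toy IVF build: a few k-means rounds, then a final bucket assignment."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    dim = len(vectors[0])
    for _ in range(iters):
        for c, ids in enumerate(assign(vectors, centroids)):
            if ids:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(vectors[i][d] for i in ids) / len(ids)
                                for d in range(dim)]
    return centroids, assign(vectors, centroids)

def ivf_search(query, vectors, centroids, buckets, k, nprobe):
    """Score only rows in the nprobe clusters whose centroids are closest to the query."""
    probe = heapq.nsmallest(nprobe, range(len(centroids)),
                            key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in probe for i in buckets[c]]
    top = heapq.nsmallest(k, ((l2(query, vectors[i]), i) for i in candidates))
    return top, len(candidates)

random.seed(1)
table = [[random.random() for _ in range(8)] for _ in range(2000)]
centroids, buckets = build_ivf(table, nlist=16)

query = table[7]
approx, scanned = ivf_search(query, table, centroids, buckets, k=5, nprobe=2)
exact, scanned_all = ivf_search(query, table, centroids, buckets, k=5, nprobe=16)
print(f"nprobe=2 scored {scanned} rows; nprobe=16 scored {scanned_all} rows")
```

With `nprobe` equal to `nlist` the search degenerates into the exact full scan; lowering `nprobe` scores fewer rows at the risk of missing a true neighbor that sits in an unprobed cluster. That is the approximate-versus-exact trade-off in miniature, and building the buckets is the up-front index cost.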
Teams reach for Spark because the embeddings already live in the lake — in Iceberg, Delta Lake, or Parquet on S3 — so a quick brute-force scan feels free. It works at a few million rows. It hits a wall at scale, exactly where interactive RAG and agent retrieval need low latency. Spark's real strength is the batch side: computing embeddings (alongside engines like Ray) and processing data, not serving each query. Dedicated stores like Pinecone or Milvus exist precisely to take over that low-latency serving step.
In practice (example)
For example, Zilliz Vector Lakebase addresses this with External Data Lake Search: instead of repeatedly scanning the lake, you build an ANN index in place over the existing lake table, so the data stays in the lake but every query goes through the index rather than a full scan. Our architecture write-up illustrates the gap on a 1B-vector Iceberg table — a Spark brute-force scan with no index runs in hours, while a just-built index answers a cold query in roughly 30 seconds, dropping to double-digit milliseconds once warm (illustrative figures for a 1B × 768-dim HNSW setup, not a formally specified benchmark).
The pattern shows up with real workloads too. In a pharmaceutical molecular-similarity case, a brute-force Spark scan over lake data ran on the order of 1000× slower than IVF-based indexed retrieval, with the exact multiple depending on data distribution, index parameters, and hardware. Vector Lakebase builds on the open-source Milvus engine, so IVF and HNSW behave the way Milvus users already expect, and the index stays queryable through Spark, Ray, and LangChain rather than a bespoke connector.
Related questions
- Can you search a data lake without moving data? — the zero-copy premise behind in-place indexing
- How to add vector search to Apache Iceberg tables — the index-build mechanics on Iceberg
- Can you run RAG directly on your data lake? — why retrieval latency matters for RAG
- Vector Lakebase — the product page
In short. Brute-force vector search on Spark is slow because it scans every vector for every query — O(n) work that scales linearly into minutes or hours at billion-row scale. Spark excels at the batch side; an ANN index over the same lake data is what delivers millisecond lookups. Dig deeper: from vector database to vector lakebase.


