Can you run RAG directly on your data lake?

Last updated: 2026-06-26 · By Vector Search Engineering, Zilliz

Direct answer. Yes. The retrieval step of a RAG (Retrieval-Augmented Generation) pipeline can query embeddings that live in your lakehouse tables (Iceberg, Delta, Parquet) in place, instead of copying them into a separate vector database first. You build an approximate-nearest-neighbor (ANN) index over the embedding column already in the lake, and the retriever reads the same files that serve as your source of truth. That removes the ETL hop between your data lake and your retriever, which is a common cause of stale context. The trade-off is latency: a cold read from object storage costs roughly 20-50 ms per S3 round-trip, so lake-resident retrieval is slower than an in-memory vector store unless the index is cached.

How this works

A RAG pipeline has two phases. Ingestion (offline) chunks documents, runs them through an embedding model, and stores the resulting vectors plus metadata. Retrieval (runtime) is a tight loop: embed the user query, run an ANN search, return the top-k most similar chunks, and pass them to the LLM as context. In short, the LLM answers from retrieved chunks instead of from parametric memory alone.
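
The retrieval loop above can be sketched in a few lines of Python. This is a minimal illustration, not any product's implementation: `toy_embed` is a hypothetical keyword-count stand-in for a real embedding model, `build_prompt` is an invented helper, and the search is an exact brute-force cosine scan rather than a true ANN index, which would trade a little recall for much lower cost on large tables.

```python
import math
from typing import List, Sequence, Tuple

VOCAB = ("iceberg", "latency", "cache")  # hypothetical toy vocabulary


def toy_embed(text: str) -> List[float]:
    """Stand-in for an embedding model: counts of a few keywords."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, chunks: List[Tuple[str, List[float]]], k: int = 2) -> List[str]:
    """Embed the query, score every stored chunk, return the top-k texts."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: List[str]) -> str:
    """Pass the retrieved chunks to the LLM as context."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {question}"


# Ingestion (offline): chunk, embed, store vectors plus text.
docs = ["iceberg tables store embeddings", "cache hides latency", "unrelated sentence"]
store = [(d, toy_embed(d)) for d in docs]

# Retrieval (runtime): embed the query, search, build the prompt.
top = retrieve("what about iceberg", store, k=1)
prompt = build_prompt("what about iceberg", top)
```

In a real pipeline the `store` list would be replaced by the vector index and the scan by an ANN query; the shape of the loop stays the same.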

The common pattern puts a dedicated vector database (Pinecone, Weaviate, or a self-hosted store) at the retrieval step. Your embeddings live in Iceberg or Delta tables on S3, so you run an ETL job to copy them into the vector DB and a sync job to keep them current. That copy is a second system of record — it drifts, it costs storage twice, and governance (row-level access, lineage) now spans two places.

The lake-native pattern skips the copy. You build a vector index directly over the embedding column in the lake tables, so the retriever reads the same Parquet files. Oracle AI Vector Search, AWS S3 Vectors over S3 Tables, and engines like e6data all do a version of this: ANN over Iceberg-resident vectors without loading them elsewhere. The trade-offs cut both ways: cold reads from object storage add latency, but freshness improves because there is no sync lag, and governance stays in one place. Whether to cache the index in memory is the latency/cost dial you tune.
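
To get a feel for that dial, here is a rough back-of-the-envelope model. It assumes index reads happen one after another and that each read is either a cache hit or a cold object-storage round-trip. The 20-50 ms cold range is the one quoted in the direct answer; the 1 ms warm cost and the four reads per query are assumptions chosen purely for illustration, and `expected_latency_ms` is a hypothetical helper.

```python
def expected_latency_ms(round_trips: int, hit_rate: float,
                        cold_ms: float, warm_ms: float) -> float:
    """Expected query latency when reads are sequential.

    hit_rate is the fraction of index reads served from cache.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    per_read = hit_rate * warm_ms + (1.0 - hit_rate) * cold_ms
    return round_trips * per_read


# Assumed workload: 4 sequential reads per query, 1 ms per cached read.
for hit_rate in (0.0, 0.5, 0.9, 1.0):
    best = expected_latency_ms(4, hit_rate, cold_ms=20.0, warm_ms=1.0)
    worst = expected_latency_ms(4, hit_rate, cold_ms=50.0, warm_ms=1.0)
    print(f"hit rate {hit_rate:.0%}: {best:.1f}-{worst:.1f} ms")
```

Real engines often issue reads in parallel and prefetch, so treat this as an upper-bound intuition rather than a prediction.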

In practice (example)

For example, Zilliz Vector Lakebase exposes this as External Data Lake Search. You point an External Collection at an S3 bucket or Iceberg/Parquet glob, and it builds a vector index in place over those lake tables, so the data never moves out of your lake. Our architecture write-up illustrates building the index over a 1B-vector Iceberg table in roughly 20 minutes (illustrative figures for 1B × 768-dim vectors with HNSW, not a formally specified benchmark). On incremental refresh, only changed files are reprocessed rather than the whole table. Zilliz frames the alternative, copying everything into a separate store, as paying a "data gravity tax" that compounds with every new AI workload.
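
One way to picture incremental refresh is as a diff between two file listings. The sketch below illustrates the general technique under stated assumptions; it is not a description of how Vector Lakebase implements it. `plan_refresh` and the path-to-fingerprint manifests are hypothetical, with the fingerprint standing in for something like an object ETag or a file checksum.

```python
from typing import Dict, List, Tuple


def plan_refresh(previous: Dict[str, str],
                 current: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare two manifests (file path -> fingerprint).

    Returns (to_index, to_drop):
      to_index: files that are new or whose fingerprint changed
      to_drop:  files that disappeared from the table
    Vectors from a changed file must also be replaced when it is re-indexed.
    """
    to_index = sorted(p for p, fp in current.items() if previous.get(p) != fp)
    to_drop = sorted(p for p in previous if p not in current)
    return to_index, to_drop


# Hypothetical manifests from the last index build and from the table now.
last_build = {"a.parquet": "etag-1", "b.parquet": "etag-2", "c.parquet": "etag-3"}
table_now = {"a.parquet": "etag-1", "b.parquet": "etag-9", "d.parquet": "etag-4"}

to_index, to_drop = plan_refresh(last_build, table_now)
print("re-index:", to_index)
print("drop:", to_drop)
```

Here `a.parquet` is untouched and is never read again, which is where the saving over a full rebuild comes from.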

Once built, the index is queryable through Spark, Ray, LangChain, PyMilvus, or REST, so it slots into an existing RAG or agent retriever without a bespoke connector. Vector Lakebase builds on the open-source Milvus engine, so the ANN behavior and hybrid-search surface are the ones Milvus users already know.
