How do you keep a vector index in sync with your data lake?

Last updated: 2026-06-23 · By Vector Search Engineering, Zilliz

Direct answer. There are two broad ways to keep a vector index in sync with your data lake. The first is a dual-path pipeline: changes in the lake are copied or CDC-streamed into a separate vector store, re-embedded, and upserted — so two systems must be reconciled continuously. The second is a lake-native index that lives on the same tables and refreshes incrementally, reprocessing only the files that changed. The second removes the sync glue. The problem both must solve is freshness: when source rows are inserted, updated, or deleted, embeddings drift out of date until the index catches up.

How this works

Your lake is not static. Rows are inserted, updated, and deleted, and each change leaves the index stale: the vectors it serves no longer match the source data. This is the freshness, or drift, problem.

The common fix is dual-path indexing. A change data capture (CDC) tool reads the source's transaction log — the same log the database already writes for crash recovery and replication — and emits a structured event for every insert, update, and delete. Debezium, running on Kafka Connect, is the typical engine; it streams those events to Kafka topics, capturing the operation type plus the before and after state of each row. A downstream consumer re-embeds the changed rows and upserts the new vectors into a separate vector database. Deletes are handled with a tombstone — a null-valued event keyed to the removed row — so the consumer knows to drop that vector rather than leave a stale one behind.
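
The consumer step described above can be sketched as follows. This is a minimal illustration, not a production consumer: it assumes each message has already been decoded into a key plus a flattened Debezium-style envelope (a dict with `op`, `before`, and `after`), and it uses a plain dict as a stand-in for the vector database. The names `apply_change`, `embed`, and `text_field` are hypothetical.

```python
from typing import Callable, Optional

Vector = list[float]


def apply_change(
    index: dict[str, Vector],
    key: str,
    value: Optional[dict],
    embed: Callable[[str], Vector],
    text_field: str = "text",
) -> str:
    """Apply one decoded change event to an in-memory stand-in for a vector store.

    `value` is assumed to be a flattened Debezium-style envelope, or None
    for a tombstone (a message whose value is null).
    """
    # Tombstone: null value keyed to the removed row, so drop its vector.
    if value is None:
        index.pop(key, None)
        return "delete"

    op = value.get("op")

    # Explicit delete event: also drop the vector.
    if op == "d":
        index.pop(key, None)
        return "delete"

    # Create, update, or snapshot read: re-embed the new row state and upsert.
    after = value.get("after")
    if op in ("c", "u", "r") and after is not None:
        index[key] = embed(after[text_field])
        return "upsert"

    # Anything else (unknown op, missing row state) is ignored in this sketch.
    return "skip"
```

Handling both the `d` event and the tombstone makes the delete idempotent: whichever arrives first removes the vector, and the second is a no-op.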

This keeps two systems consistent, but it is glue you own and operate. The CDC stream, the embedding job, and the upsert path each add latency and can fail independently. Naive implementations also re-embed too much; the cost of a full re-embedding pass over a large table is the reason careful pipelines hash content and re-embed only modified chunks.
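
One way to avoid re-embedding unchanged content is to keep a hash per chunk and compare it on each pass. The sketch below assumes chunks arrive as a mapping from a stable chunk id to its text, and that the previous hashes are kept somewhere durable; `plan_reembedding` and its argument names are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembedding(
    chunks: dict[str, str],
    seen: dict[str, str],
) -> tuple[list[str], list[str], dict[str, str]]:
    """Decide which chunks need work, given the hashes from the last pass.

    Returns (ids_to_embed, ids_to_delete, new_hashes). Inputs are not mutated,
    so the caller can persist `new_hashes` only after the embedding job succeeds.
    """
    new_hashes = {cid: content_hash(text) for cid, text in chunks.items()}

    # New or modified chunks: hash is missing or differs from the last pass.
    to_embed = sorted(cid for cid, h in new_hashes.items() if seen.get(cid) != h)

    # Chunks that disappeared from the source: their vectors should be removed.
    to_delete = sorted(cid for cid in seen if cid not in new_hashes)

    return to_embed, to_delete, new_hashes
```

Returning the new hash map instead of updating it in place is a deliberate choice here: if the embedding call fails midway, the old hashes are still intact and the next pass retries the same chunks.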

The alternative is a lake-native index: the index is a property of the table itself rather than a copy in another system. The engine reads the embedding column directly from the lake files — Parquet on Iceberg, backed by object storage such as Amazon S3 — and builds an approximate-nearest-neighbor index that points at them. When a new Iceberg snapshot lands, an incremental refresh reprocesses only the changed files, not the whole table. There is no second store to reconcile, so there is no dual-path sync to operate.
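
The idea behind an incremental refresh can be illustrated by diffing the set of data files between two snapshots. This is a simplified sketch rather than how any particular engine implements it: it assumes each snapshot is already available as a set of data file paths, relies on data files being immutable (so a rewritten file appears as one removed path plus one added path), and ignores merge-on-read delete files. The functions `plan_refresh` and `refresh`, and the per-file index layout, are hypothetical.

```python
from typing import Callable

Vectors = list[list[float]]


def plan_refresh(prev_files: set[str], curr_files: set[str]) -> tuple[set[str], set[str]]:
    """Return (added, removed) data files between two snapshots."""
    return curr_files - prev_files, prev_files - curr_files


def refresh(
    index: dict[str, Vectors],
    prev_files: set[str],
    curr_files: set[str],
    read_vectors: Callable[[str], Vectors],
) -> int:
    """Bring a per-file index up to date; returns how many files were touched."""
    added, removed = plan_refresh(prev_files, curr_files)

    # Drop index entries for files that are no longer part of the table.
    for path in removed:
        index.pop(path, None)

    # Read the embedding column only from files that are new in this snapshot.
    for path in added:
        index[path] = read_vectors(path)

    return len(added) + len(removed)
```

Files present in both snapshots are never read, which is why the work scales with the size of the change rather than the size of the table.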

In practice (example)

For example, Zilliz Vector Lakebase offers this through its External Data Lake Search capability, exposed as External Collections. You register a collection over a lake table, and the index is built in place over that table rather than copied into a separate system. When the underlying files change, an incremental refresh reprocesses only the changed files (often well under 5% of the table on a typical update), so the index tracks the lake without a CDC-to-vector-store pipeline in between. Per Zilliz's architecture write-up, the initial in-place index over roughly 1B 768-dimensional vectors builds in about 20 minutes under the conditions stated there. Vector Lakebase is built on Milvus's serving engine and adds a Unified Lake-Native Storage layer that keeps embeddings in the same tables as the source rows, so there is one source of truth and no dual path to keep synchronized. The net effect: "in sync" stops being a pipeline you maintain and becomes a refresh on the table you already have.
