Hudi vs Iceberg for AI and Vector Workloads: Streaming Upserts vs Snapshot Tables
Last updated: 2026-06-23 · By Vector Search Engineering, Zilliz
Quick answer. In the Hudi vs Iceberg decision, Apache Hudi is built around streaming upserts and change data capture through its merge-on-read storage and commit timeline, while Apache Iceberg is an engine-neutral snapshot table format with broad query-engine support. For AI and vector workloads, the deciding factor is rarely raw scan speed; it is how quickly new embeddings land in the table and which engines can read them. Neither format ships a native vector type or ANN index, so vector search is always layered on top. Plan for that layer up front.
What is Apache Hudi
Apache Hudi is an open table format and data lake platform originally developed at Uber to support low-latency database ingestion, and now governed as a top-level Apache Software Foundation project. Its defining feature is record-level upserts and deletes over files on object storage such as S3. Hudi offers two table types: copy-on-write (CoW), which rewrites base files on update and keeps the table read-optimized, and merge-on-read (MoR), which appends lightweight row-based log files and reconciles them with columnar base files through background compaction. A timeline of commits records every action on the table, enabling incremental queries and database-style change data capture (CDC) streams that return the records inserted, updated, and deleted since a given point in time.
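The merge-on-read behavior described above can be pictured with a small in-memory model. This is a toy sketch, not Hudi's implementation or API: the class `ToyMorTable` and its methods are hypothetical names, a dict stands in for columnar base files, and a list stands in for row-based log files.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMorTable:
    """Toy merge-on-read table (hypothetical, for illustration only)."""
    base: dict = field(default_factory=dict)  # key -> row; stands in for base files
    log: list = field(default_factory=list)   # (op, key, row); stands in for log files

    def upsert(self, key, row):
        # A write only appends to the log; the base is not rewritten.
        self.log.append(("upsert", key, row))

    def delete(self, key):
        self.log.append(("delete", key, None))

    def _merged(self):
        # Replay the log over the base in order; later entries win.
        merged = dict(self.base)
        for op, key, row in self.log:
            if op == "upsert":
                merged[key] = row
            else:
                merged.pop(key, None)
        return merged

    def read_snapshot(self):
        # Latest state: the merge cost is paid at read time.
        return self._merged()

    def read_optimized(self):
        # Base only: cheaper to read, but stale until compaction runs.
        return dict(self.base)

    def compact(self):
        # Fold the log into a new base, then clear the log.
        self.base = self._merged()
        self.log = []


table = ToyMorTable(base={"u1": {"name": "Ada"}, "u2": {"name": "Bob"}})
table.upsert("u1", {"name": "Ada L."})
table.delete("u2")
table.upsert("u3", {"name": "Cy"})

print(table.read_snapshot())   # merged view, includes the logged changes
print(table.read_optimized())  # base-only view, still the old rows
table.compact()
print(table.read_optimized())  # after compaction the two views agree
```

The two read methods mirror the trade-off in the text: a merged read is fresh but does more work, while a base-only read is cheap but lags until compaction folds the log in.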
What is Apache Iceberg
Apache Iceberg is an open table format for large analytic tables, originally developed at Netflix to fix the correctness and atomicity gaps of Hive-style tables, and now a top-level Apache project. Each table state is captured as an immutable snapshot. A snapshot points to a manifest list, which in turn references manifest files: Avro files that list data files along with their partition tuples and column metrics. Iceberg's hidden partitioning lets queries skip irrelevant files without users hand-coding partition predicates, and its schema evolution adds, drops, or renames columns without rewriting data files. Iceberg is deliberately engine-neutral: Spark, Trino, Flink, Presto, Hive, and Impala can safely read and write the same tables concurrently.
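Hidden partitioning, as described above, amounts to the table deriving a partition value from a column and the planner translating a filter on that column into a partition filter. The sketch below is a simplified illustration under that assumption; `DataFile`, `day_transform`, and `plan_scan` are hypothetical names, not Iceberg APIs.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class DataFile:
    path: str
    partition_day: date  # derived by the table from row timestamps, not by the user


def day_transform(ts: datetime) -> date:
    """Stand-in for a day-granularity partition transform on a timestamp column."""
    return ts.date()


def plan_scan(files, ts_from: datetime, ts_to: datetime):
    """Turn a filter on the timestamp column into a filter on partition values."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [f.path for f in files if lo <= f.partition_day <= hi]


files = [
    DataFile("data/00001.parquet", date(2026, 6, 1)),
    DataFile("data/00002.parquet", date(2026, 6, 2)),
    DataFile("data/00003.parquet", date(2026, 6, 3)),
]

# The query filters on the timestamp itself; no partition column is mentioned.
selected = plan_scan(files, datetime(2026, 6, 2, 8, 0), datetime(2026, 6, 2, 20, 0))
print(selected)
```

The caller never names a partition column. Only the file whose derived day overlaps the requested time range is kept for scanning.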
Key Differences
| Axis | Apache Hudi | Apache Iceberg |
|---|---|---|
| Write / change model | Copy-on-write and merge-on-read; record-level upsert as a first-class operation | Snapshot-based; updates produce new snapshots (copy-on-write or position/equality delete files) |
| Streaming upserts / CDC | Native upsert/delete + incremental and CDC queries over the commit timeline | Supported via row-level deletes and incremental reads; CDC is less central to the design |
| Schema evolution | Supported | Supported, without rewriting data files |
| Engine ecosystem | Strong with Spark and Flink; broadening engine support | Broad and engine-neutral: Spark, Trino, Flink, Presto, Hive, Impala |
| Read / write amplification | MoR defers cost to read/compaction; CoW pays it on write | Snapshot model favors read-optimized files; delete files add merge cost on read |
| Native vector primitives | No native vector type or ANN index | No native vector type or ANN index |
The clearest split is when each format pays its write cost. Hudi's merge-on-read mode is engineered to absorb a high rate of small changes cheaply — it appends log files and defers the heavy merge to compaction or read time. That is exactly the shape of a CDC or streaming-ingest workload, where rows change constantly and you want the latest state without rewriting whole partitions on every commit.
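A back-of-the-envelope calculation makes the write-cost difference concrete. All numbers here are illustrative assumptions, not measurements: 128 MB base files, 1,000 changed records of about 1 KB each spread across 10 files, and a compaction every 20 commits that rewrites those same 10 files. File headers, metadata, and compression are ignored.

```python
MB = 1024 * 1024


def cow_bytes_per_commit(files_touched: int, base_file_bytes: int) -> int:
    # Copy-on-write rewrites every base file that contains a changed record.
    return files_touched * base_file_bytes


def mor_bytes_per_commit(records_changed: int, avg_record_bytes: int) -> int:
    # Merge-on-read appends only the changed records to log files.
    return records_changed * avg_record_bytes


def mor_amortized_bytes(log_bytes: int, compaction_bytes: int,
                        commits_per_compaction: int) -> float:
    # Compaction eventually rewrites the base files; spread that over N commits.
    return log_bytes + compaction_bytes / commits_per_compaction


cow = cow_bytes_per_commit(files_touched=10, base_file_bytes=128 * MB)
mor = mor_bytes_per_commit(records_changed=1_000, avg_record_bytes=1_024)
amortized = mor_amortized_bytes(mor, compaction_bytes=cow, commits_per_compaction=20)

print(f"CoW bytes per commit:           {cow:,}")
print(f"MoR log bytes per commit:       {mor:,}")
print(f"MoR amortized incl. compaction: {amortized:,.0f}")
```

Under these assumptions the deferred merge does not disappear, but it is paid once per compaction cycle rather than on every commit.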
Iceberg approaches mutation differently. Each write produces a new immutable snapshot, which gives clean snapshot isolation and easy time travel, but its sweet spot is large, mostly-append analytic tables read by many different engines rather than high-frequency row churn.
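The snapshot model above can be sketched as a list of immutable file sets, where each commit appends a new snapshot instead of modifying an old one. This is a toy model under that assumption; `Snapshot` and `ToySnapshotTable` are hypothetical names and do not reflect Iceberg's metadata layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset  # immutable set of data file paths


class ToySnapshotTable:
    """Toy snapshot table (hypothetical, for illustration only)."""

    def __init__(self):
        self.snapshots = [Snapshot(0, frozenset())]

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, add=(), remove=()) -> Snapshot:
        # A write never edits an existing snapshot; it appends a new one.
        prev = self.current()
        files = (prev.files - frozenset(remove)) | frozenset(add)
        snap = Snapshot(prev.snapshot_id + 1, files)
        self.snapshots.append(snap)
        return snap

    def as_of(self, snapshot_id: int) -> Snapshot:
        # Time travel: look up an earlier table state by snapshot id.
        return next(s for s in self.snapshots if s.snapshot_id == snapshot_id)


table = ToySnapshotTable()
table.commit(add=["a.parquet", "b.parquet"])
pinned = table.current()  # a reader pins this snapshot

table.commit(add=["c.parquet"], remove=["a.parquet"])  # a concurrent writer commits

print(sorted(pinned.files))           # the reader's view is unchanged
print(sorted(table.current().files))  # new readers see the new snapshot
print(sorted(table.as_of(1).files))   # time travel back to snapshot 1
```

Because a pinned snapshot never changes, a long-running read is isolated from later commits, and any retained snapshot can be queried again later.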
For AI workloads the shared limitation matters more than the differences: both formats store columns, not vectors. Neither defines a native embedding type or a built-in approximate-nearest-neighbor (ANN) index in its specification. You can store embedding arrays as binary or list columns, but similarity search and indexing must come from a layer above the table format.
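To illustrate what "a layer above the table format" has to do, the sketch below stores embeddings as packed float32 bytes, the way a binary column might hold them, and runs an exhaustive cosine-similarity search over the rows. The row layout and helper names are hypothetical, and brute-force scanning stands in for a real ANN index, which would avoid comparing the query against every row.

```python
import math
import struct


def pack_embedding(vec):
    """Encode a vector as little-endian float32 bytes, as a binary column might."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_embedding(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(rows, query, k):
    """Exhaustive search: score every row, then keep the k best ids."""
    scored = [(cosine(unpack_embedding(r["embedding"]), query), r["id"]) for r in rows]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Rows as they might come back from a table scan (hypothetical schema).
rows = [
    {"id": "doc-1", "embedding": pack_embedding([1.0, 0.0, 0.0])},
    {"id": "doc-2", "embedding": pack_embedding([0.0, 1.0, 0.0])},
    {"id": "doc-3", "embedding": pack_embedding([0.5, 0.5, 0.0])},
]

print(top_k(rows, query=[0.9, 0.1, 0.0], k=2))
```

The table format only stores and returns the bytes. Decoding, scoring, and ranking all happen outside it, which is the gap an index layer fills at scale.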
When to Use Each
Choose Hudi when your workload is dominated by change rather than append. If you are ingesting database CDC streams, applying frequent upserts and deletes, or running incremental ETL pipelines, or if you need before/after change records out of the table, Hudi's merge-on-read storage and commit timeline are built for that pattern. Teams standardizing on Spark and Flink for streaming ingestion tend to land here.
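The incremental and CDC pattern mentioned above can be pictured as a filter over an ordered commit timeline. The sketch below is a toy model: the timeline layout, the `op` values, and the function `changes_since` are hypothetical, and treating the begin instant as exclusive is a choice made for this example rather than a statement about Hudi's semantics.

```python
def changes_since(timeline, begin_instant):
    """Return change records from commits strictly after begin_instant."""
    out = []
    for commit in timeline:
        # Instants are fixed-width timestamp strings, so string order is time order.
        if commit["instant"] > begin_instant:
            for change in commit["changes"]:
                out.append({**change, "instant": commit["instant"]})
    return out


# Hypothetical timeline: each commit lists its changes with before/after images.
timeline = [
    {"instant": "20260601090000", "changes": [
        {"op": "insert", "key": "u1", "before": None, "after": {"plan": "free"}},
    ]},
    {"instant": "20260601100000", "changes": [
        {"op": "update", "key": "u1", "before": {"plan": "free"}, "after": {"plan": "pro"}},
        {"op": "insert", "key": "u2", "before": None, "after": {"plan": "free"}},
    ]},
    {"instant": "20260601110000", "changes": [
        {"op": "delete", "key": "u2", "before": {"plan": "free"}, "after": None},
    ]},
]

# A downstream job that last processed the 09:00 commit asks only for what is new.
for change in changes_since(timeline, "20260601090000"):
    print(change["instant"], change["op"], change["key"])
```

A consumer that remembers the last instant it processed can pick up exactly the later changes, which is what makes incremental ETL cheaper than rescanning the table.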
Choose Iceberg when you want an engine-neutral analytic table that many query engines read concurrently. If your priorities are broad ecosystem support across Spark, Trino, Flink, Presto, Hive, and Impala, clean snapshot isolation, time travel, and schema/partition evolution without file rewrites, Iceberg fits. Mostly-append tables that feed BI and large-scale analytics are its home ground.
Both are mature Apache projects, and the choice is rarely about which is "better" in the abstract — it is about whether your data is churning (Hudi) or accumulating (Iceberg), and which engines need to read it. Neither choice, on its own, gives you vector search; that decision sits one layer up regardless of which format you pick.
How Vector Lakebase Approaches This
Once you accept that the table format will not provide vector search, the question becomes how to add it without copying data into a separate vector store. Zilliz Vector Lakebase addresses this with External Data Lake Search: External Collections read Iceberg and other lake tables in place and build a vector index on top, so the data never moves. The index becomes a first-class property of the table and is refreshed incrementally as underlying files change. Each refresh reprocesses only the changed files, often well under 5% of the table on a typical update, which keeps the index aligned with the freshness model Hudi and Iceberg already provide. Under the stated benchmark conditions (1B × 768-dim vectors), building such an index takes on the order of 20 minutes. Learn more about Vector Lakebase.
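One way to picture an incremental refresh is as a set difference over data file paths. The sketch below is a hypothetical illustration, not the Vector Lakebase implementation: it assumes data files are immutable, so any change shows up as a new path, and `plan_refresh` is an invented helper name.

```python
def plan_refresh(indexed_files, current_files):
    """Compare the files already indexed with the table's current file list."""
    indexed, current = set(indexed_files), set(current_files)
    to_index = sorted(current - indexed)  # new files that need embeddings indexed
    to_drop = sorted(indexed - current)   # files no longer part of the table
    changed_fraction = len(to_index) / len(current) if current else 0.0
    return to_index, to_drop, changed_fraction


# Illustrative numbers: a 100-file table where an update replaces 3 files.
indexed_files = [f"data/f{i:03d}.parquet" for i in range(100)]
current_files = indexed_files[3:] + [f"data/g{i:03d}.parquet" for i in range(3)]

to_index, to_drop, fraction = plan_refresh(indexed_files, current_files)
print(to_index)
print(to_drop)
print(f"{fraction:.0%} of current files need reprocessing")
```

In this example only the three replacement files are reprocessed and the three files they superseded are dropped from the index; the other 97 files are left alone.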
Frequently asked questions
Is Hudi or Iceberg better for AI workloads? Neither is inherently better for AI, because neither stores vectors natively. Hudi suits AI pipelines fed by frequent CDC and upserts; Iceberg suits engine-neutral analytic tables read by many tools. The vector index that makes either AI-ready is a separate layer you add on top.
Does Apache Iceberg support vector search or embeddings? Iceberg has no native vector data type and no ANN index in its specification. You can store embeddings as binary or list columns, but similarity search requires an external indexing layer that reads the table.
Does Apache Hudi support change data capture? Yes. Hudi exposes incremental and CDC queries over its commit timeline, returning records inserted, updated, or deleted since a point in time. Releases that include the CDC query type also return before and after images for each change. This is a core Hudi use case.
What is the difference between copy-on-write and merge-on-read in Hudi? Copy-on-write rewrites base files on each update, optimizing reads at higher write cost. Merge-on-read appends row-based log files and merges them with base files during compaction or at read time, favoring high-frequency updates and streaming ingestion.
Can the same table feed both analytics and vector search? Yes, if the vector index reads the table in place rather than requiring a copy. That is the model External Data Lake Search uses, keeping one source of truth on the lake while serving similarity search from an index layered on top.
Related reading
- Iceberg vs Delta Lake vs Hudi vs Lance for AI
- Parquet vs ORC vs Avro
- How to add vector search to Apache Iceberg tables
- What is the Vortex file format
Bottom line. Pick Hudi when your data churns through streaming upserts and CDC; pick Iceberg when you need an engine-neutral snapshot table read by many query engines. For AI, remember that neither ships native vector search, so plan the index layer up front rather than retrofitting it.


