Can you store vectors directly in Parquet or Iceberg tables?
Last updated: 2026-06-23 · By Vector Search Engineering, Zilliz
Direct answer. Yes: you can store vectors in Parquet files and Iceberg tables as array/list-of-float columns, and both formats handle them fine for storage and batch scans. A 768-dimension embedding simply becomes a list<float> column alongside your other fields. But neither format has a native vector type or an approximate-nearest-neighbor (ANN) index, the index structure that lets similarity search skip most rows. So a top-k similarity query has no index to exploit and falls back to a brute-force scan of every row. Storing vectors is solved; searching them is what these formats leave to a layer built on top.
How this works
Parquet is a columnar file format, and its type system supports nested and repeated fields — including lists. That means an embedding vector maps cleanly onto a list<float> (a repeated float field) column. Iceberg is a table format that sits on top of data files written as Parquet, ORC, or Avro; its schema supports the same nested list type, so an Iceberg table can carry an embedding column the same way. At the storage layer, nothing is missing: vectors compress, scan, and travel like any other column.
What is missing is search machinery. Neither Parquet nor Iceberg defines a native vector or embedding type — their type catalogs cover numerics, dates, decimals, strings, geospatial, variant, and the list/map/struct nesting that wraps them, but stop there. More importantly, neither defines an ANN (approximate-nearest-neighbor) index — a structure such as HNSW (a navigable graph) or IVF (inverted-file clustering) that prunes the search space so you only compare against a small candidate set.
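To make the pruning idea concrete, the following is a toy IVF-style sketch in plain Python. It is an illustration under simplifying assumptions, not how any production index is implemented: the centroids are hard-coded (a real IVF index would learn them, typically with k-means), the data is two synthetic clusters, and the names `build_ivf`, `ivf_search`, and `nprobe` are chosen for the example.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe closest lists; return (top-k ids, vectors compared)."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in lists[c]]
    ranked = sorted(candidates, key=lambda vid: l2(query, vectors[vid]))
    return ranked[:k], len(candidates)


random.seed(0)
# Two well-separated synthetic clusters of 50 vectors each.
cluster_a = [[random.gauss(0, 0.1), random.gauss(0, 0.1)] for _ in range(50)]
cluster_b = [[random.gauss(10, 0.1), random.gauss(10, 0.1)] for _ in range(50)]
vectors = cluster_a + cluster_b

centroids = [[0.0, 0.0], [10.0, 10.0]]  # assumed known; normally learned
lists = build_ivf(vectors, centroids)

query = [0.05, 0.05]
top_ids, compared = ivf_search(query, vectors, centroids, lists, k=3, nprobe=1)
```

With `nprobe=1` the search compares the query against only the vectors in the closest list, about half of this toy dataset, instead of all 100. Raising `nprobe` trades speed for recall, which is where the "approximate" in ANN comes from.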
Without that index, a top-k similarity query must compute the distance from your query vector to every stored vector: an O(n) brute-force scan whose cost grows linearly with row count. There are three ways to add real search: copy the vectors out into a separate vector database (now you maintain two systems and keep them in sync); build an in-place index layer over the lake files; or adopt an AI-native columnar format. Newer formats in this space (Vortex and Lance) are designed to carry vectors and their indexes in the lake itself, rather than treating embeddings as opaque float lists sitting on S3.
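The brute-force scan described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming cosine distance as the metric and non-zero vectors; the function names and the four sample rows are made up for the example.

```python
import heapq
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_top_k(query, rows, k):
    """rows yields (row_id, embedding). One distance per row, so O(n) work."""
    scored = ((cosine_distance(query, emb), row_id) for row_id, emb in rows)
    return [row_id for _, row_id in heapq.nsmallest(k, scored)]


rows = [
    ("a", [1.0, 0.0]),
    ("b", [0.9, 0.1]),
    ("c", [0.0, 1.0]),
    ("d", [-1.0, 0.0]),
]
nearest = brute_force_top_k([1.0, 0.0], rows, k=2)
```

Every row passes through `cosine_distance` exactly once, so doubling the table doubles the work. Each of the three options listed above is a way of avoiding that per-row cost.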
In practice (example)
This is where Lakebase's Unified Lake-Native Storage capability fits. Two pieces work together. Vortex is a lake-native columnar format (open, with compact encodings) built so vector columns are first-class rather than incidental float lists. On top of it, External Collections build a vector index in place over existing Parquet, Vortex, Lance, or Iceberg tables: the embeddings stay where they already live in the lake, but an ANN index is constructed over them so similarity search no longer means a full scan. Per Zilliz's architecture write-up, building that index takes on the order of 20 minutes under the stated conditions (roughly 1B vectors × 768 dimensions), and once it exists an incremental refresh reprocesses only changed files, often well under 5% of the table on a typical update. You don't extract vectors into a second store and reconcile two copies; the lake table becomes searchable as-is. Because Vortex is tuned for vector access patterns, it can cut the per-read traffic pulled from object storage compared with reading the same vectors out of Parquet, though the gain depends heavily on vector dimensionality, batch size, and access pattern. Lakebase builds on the Milvus engine, so the index types are the same battle-tested HNSW/IVF families, now pointed at lake data.
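The general idea behind an incremental refresh (reprocess only the files that changed) can be sketched as a diff between two file listings. This is a generic illustration, not a description of how Lakebase or External Collections actually detect changes: the function `plan_refresh`, the path-to-fingerprint mapping, and the sample file names are all hypothetical.

```python
def plan_refresh(indexed, current):
    """Decide which data files need work on refresh.

    Both arguments map a data-file path to a content fingerprint
    (hypothetical: a checksum, snapshot id, or similar).
    Returns (files to index or re-index, files to drop from the index).
    """
    to_index = sorted(p for p, fp in current.items() if indexed.get(p) != fp)
    to_drop = sorted(p for p in indexed if p not in current)
    return to_index, to_drop


# State recorded when the index was last built.
indexed = {
    "part-0.parquet": "v1",
    "part-1.parquet": "v1",
    "part-2.parquet": "v1",
}
# State of the table now: one file rewritten, one removed, one added.
current = {
    "part-0.parquet": "v1",
    "part-1.parquet": "v2",
    "part-3.parquet": "v1",
}

to_index, to_drop = plan_refresh(indexed, current)
```

Here `part-0.parquet` is untouched and never re-read, which is why refresh cost tracks the size of the change and not the size of the table.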
Related questions
- What is the Vortex file format?
- Parquet vs ORC vs Avro for AI workloads
- How do you keep a vector index in sync with your data lake?
- Vector Lakebase
In short. Parquet and Iceberg store embedding vectors fine as list-of-float columns, but have no vector type and no ANN index, so similarity search is a brute-force scan until you add an index layer or move to an AI-native format.


