What is the Vortex file format?

Last updated: 2026-06-09 · By Vector Search Engineering, Zilliz

Direct answer. The Vortex file format is an open, extensible columnar format for analytics and AI, designed as a faster successor to Apache Parquet on cloud object storage. Originally built at Spiral and now hosted by the Linux Foundation's LF AI & Data, it uses a compression approach based on the BtrBlocks research, with zero-copy Apache Arrow integration and pluggable encodings. Its aim is random access and scans that are far faster than Parquet's on stores like Amazon S3. That matters for vector and multimodal workloads, where queries touch scattered rows rather than reading whole tables.

How this works

Apache Parquet was built for large sequential scans: rows are bundled into row groups, so a small point read can force downloading a whole group — kilobytes needed, megabytes pulled. That read amplification is fine for batch analytics but costly for vector workloads that fetch scattered rows.
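To make "kilobytes needed, megabytes pulled" concrete, here is a minimal sketch of the read-amplification ratio for a single point read. The 8 MiB chunk size and the float32 element width are assumptions chosen for illustration, not figures from Parquet or from any benchmark.

```python
# Toy model of read amplification for one point read.
# All sizes below are illustrative assumptions, not measurements.

def read_amplification(bytes_needed: int, bytes_fetched: int) -> float:
    """Ratio of bytes pulled from storage to bytes the query actually needed."""
    if bytes_needed <= 0:
        raise ValueError("bytes_needed must be positive")
    return bytes_fetched / bytes_needed

# One 128-dimensional vector, assuming float32 elements: 128 * 4 = 512 bytes.
vector_bytes = 128 * 4

# Hypothetical amount of data that must be downloaded to reach that one row.
chunk_bytes = 8 * 1024 * 1024  # 8 MiB, assumed

amplification = read_amplification(vector_bytes, chunk_bytes)
print(f"needed {vector_bytes} B, fetched {chunk_bytes} B, "
      f"amplification {amplification:.0f}x")
```

Under these assumed sizes, the reader downloads thousands of times more data than the query uses, which is the cost a format tuned for scattered reads tries to avoid.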

Vortex targets the opposite access pattern. It separates logical types from physical layout, integrates zero-copy with Apache Arrow, and uses cascading, pluggable compression based on the BtrBlocks research, with compute kernels that operate directly on encoded data. The result is fast random access on object storage without giving up scan throughput or compression.
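The idea of cascading encodings with compute on encoded data can be illustrated with a toy example. The sketch below is plain Python and does not use the Vortex API. It chains a dictionary encoding into a run-length encoding, then answers an equality count directly on the encoded runs. All function names are hypothetical.

```python
# Toy illustration of cascading encodings; not Vortex code.

def dictionary_encode(values):
    """Stage 1: map each distinct value to a small integer code."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def run_length_encode(codes):
    """Stage 2: collapse consecutive repeats into (code, run_length) pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs

def count_equal(dictionary, runs, target):
    """Count rows equal to target using only the encoded form."""
    if target not in dictionary:
        return 0
    code = dictionary.index(target)
    return sum(length for c, length in runs if c == code)

def decode(dictionary, runs):
    """Reverse both stages to recover the original column."""
    out = []
    for c, length in runs:
        out.extend([dictionary[c]] * length)
    return out

column = ["cat", "cat", "cat", "dog", "dog", "cat", "bird", "bird"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

print(runs)                                  # [(0, 3), (1, 2), (0, 1), (2, 2)]
print(count_equal(dictionary, runs, "cat"))  # 4
```

The count touches four runs instead of eight rows and never materializes the decoded column. A kernel that works on the encoded representation saves work in the same way, only at much larger scale.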

The gap shows up in measurement. In a Zilliz benchmark (3M rows, 128-dim vectors, on S3, with 256 concurrent readers), Vortex cut per-read S3 traffic from 9.44 MB to 0.07 MB — about 135x less than Parquet — while delivering roughly 2.4x Parquet's full-scan throughput. Those figures depend on data shape and access pattern, but the direction is consistent: Vortex moves far less data per read.
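The reduction factor follows directly from the two per-read figures quoted above. A quick check of the arithmetic is below. The one-million-read workload at the end is a hypothetical chosen only to show how the per-read difference adds up.

```python
# Per-read S3 traffic from the benchmark figures quoted in the text.
parquet_mb_per_read = 9.44
vortex_mb_per_read = 0.07

reduction = parquet_mb_per_read / vortex_mb_per_read
print(f"traffic reduction: {reduction:.1f}x")  # 134.9x, i.e. about 135x

# Hypothetical workload size, assumed for illustration only.
reads = 1_000_000
parquet_total_tb = parquet_mb_per_read * reads / 1_000_000  # MB -> TB (decimal)
vortex_total_tb = vortex_mb_per_read * reads / 1_000_000
print(f"{reads:,} reads: Parquet ~{parquet_total_tb:.2f} TB, "
      f"Vortex ~{vortex_total_tb:.2f} TB")
```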

In practice (example)

Zilliz Vector Lakebase uses Vortex as the lake-native layer for its Unified Lake-Native Storage capability: embeddings and source data live in open columnar files on object storage, and Vortex's low read amplification is what makes serving vectors directly off S3 practical rather than forcing a copy into a separate database. Lakebase builds on the Milvus serving engine, so Vortex sits under that engine as a storage layer, alongside support for Apache Parquet, Lance, and Iceberg. The point isn't the format name; it's that a format tuned for scattered reads is what lets a vector index live on the lake table instead of in a second system.
