DUCKLAKE VS ICEBERG: CHOOSING YOUR LAKEHOUSE FORMAT IN 2026
If you’re evaluating lakehouse formats in 2026, you’re staring at the same question I was last year: DuckLake or Iceberg?
Both solve the same core problem — ACID transactions, schema evolution, and petabyte-scale analytics on object storage — but they make radically different architectural tradeoffs. Pick wrong and you’re fighting your metadata layer instead of your actual data problems.
DuckLake v1.0 shipped April 2026 with backward-compatibility guarantees. Apache Iceberg — approaching a decade of production use at Netflix, Snowflake, and AWS — is the incumbent. This isn’t a choice between good and bad. It’s a choice between two valid designs that serve different use cases. I’ll give you the decision framework so you don’t have to learn the hard way.
Who Is This Guide For?
Data engineers, platform architects, and CTOs choosing a lakehouse format for a new project or evaluating whether to migrate an existing Iceberg deployment. You know what Parquet is and you’ve probably read the Iceberg docs. What you need is a decision framework, not another explainer.
By the End of This, You’ll Know
- Exactly when to pick DuckLake vs Iceberg, by data size, team count, and engine requirements
- Why the catalog question is the actual question — and how each format answers it
- What DuckLake v1.0’s data inlining means for streaming workloads
- Where Iceberg still wins unconditionally
- How to migrate between formats if you change your mind
The Verdict at a Glance
Your first question should be about scale and engine diversity. Find your row:
| Workload Tier | Format | Catalog | Primary Engines |
|---|---|---|---|
| Up to 100 GB, single team | DuckLake | Postgres or DuckDB file | DuckDB-native |
| 100 GB - 5 TB, one team | DuckLake | Managed Postgres (RDS/Cloud SQL) | DuckDB-centric + Iceberg reads via interop |
| 1 - 50 TB, multi-team read-heavy | Either — depends on engine plans | DuckLake: Postgres. Iceberg: REST + Polaris or Lakekeeper | DuckDB-first or Spark/Trino |
| 50 TB - 5 PB, multi-engine | Iceberg | REST catalog (Polaris, Lakekeeper) or Glue | Spark, Trino, Snowflake, Athena, BigQuery |
| 5 PB+, regulated, multi-region | Iceberg | Your compliance-approved catalog | Whatever the org standardized on |
These aren’t hard byte-count limits. The format choice tracks with how many engines, teams, and write clusters touch your data.
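If it helps to see the table as logic, here is a minimal sketch of it as a function. The `Workload` fields, the exact cutoffs, and the "Either" fallthrough are my own simplification of the rows above, not an official rule from either project.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    size_tb: float           # total data volume in terabytes
    teams: int               # number of teams touching the data
    multi_engine: bool       # True if several engines must share the same tables
    regulated: bool = False  # True if compliance mandates a specific catalog


def recommend_format(w: Workload) -> str:
    """Return 'DuckLake', 'Iceberg', or 'Either', loosely following the table."""
    # 5 PB+ or a compliance-mandated catalog: the Iceberg rows.
    if w.regulated or w.size_tb >= 5000:
        return "Iceberg"
    # 50 TB and up with several engines: also Iceberg.
    if w.multi_engine and w.size_tb >= 50:
        return "Iceberg"
    # One team and at most a few terabytes: the DuckLake rows.
    if w.teams == 1 and w.size_tb <= 5:
        return "DuckLake"
    # Everything else is the judgment-call band; decide on engine plans.
    return "Either"


print(recommend_format(Workload(size_tb=0.05, teams=1, multi_engine=False)))
print(recommend_format(Workload(size_tb=200, teams=4, multi_engine=True)))
print(recommend_format(Workload(size_tb=20, teams=3, multi_engine=False)))
```

Treat the thresholds as defaults to argue with, not limits to enforce.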
DuckLake v1.0: What Changed
DuckLake was first sketched as a spec in May 2025. Just under a year later, v1.0 shipped with production guarantees. Here’s what you need to know.
Data inlining is the headline feature. When you insert fewer rows than a configurable threshold (default 10), DuckLake stores the data directly in the catalog database instead of writing a tiny Parquet file to object storage. The DuckDB Labs team published a streaming benchmark showing 926x faster queries and 105x faster ingestion compared to Iceberg on a streaming workload. Those numbers are vendor-published, not third-party validated, but the architectural advantage is real: Iceberg’s small-file problem requires compaction tooling, while DuckLake doesn’t create the problem in the first place.
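To make the inlining idea concrete, here is a toy model of the write path. It uses an in-memory SQLite database as a stand-in catalog. The table names, the `write_batch` helper, and the simulated file paths are all hypothetical; this is not DuckLake’s real schema or code, only the threshold decision described above.

```python
import sqlite3

INLINE_ROW_LIMIT = 10  # mirrors the default threshold quoted above


def write_batch(catalog: sqlite3.Connection, rows: list[tuple[int, str]], files: list[str]) -> str:
    """Store a batch inline in the catalog, or as a simulated data file."""
    if len(rows) < INLINE_ROW_LIMIT:
        # Small batch: rows go straight into a catalog table, no object is written.
        with catalog:
            catalog.executemany("INSERT INTO inlined_rows VALUES (?, ?)", rows)
        return "inlined"
    # Large batch: stand-in for writing one Parquet file to object storage.
    path = f"data/part-{len(files):05d}.parquet"
    files.append(path)
    with catalog:
        catalog.execute("INSERT INTO data_files VALUES (?, ?)", (path, len(rows)))
    return "file"


catalog = sqlite3.connect(":memory:")
catalog.execute("CREATE TABLE inlined_rows (id INTEGER, payload TEXT)")
catalog.execute("CREATE TABLE data_files (path TEXT, row_count INTEGER)")
files: list[str] = []

# 100 single-row streaming inserts, then one bulk load of 1,000 rows.
for i in range(100):
    write_batch(catalog, [(i, "event")], files)
write_batch(catalog, [(i, "bulk") for i in range(1000)], files)

inlined = catalog.execute("SELECT COUNT(*) FROM inlined_rows").fetchone()[0]
print(f"inlined rows: {inlined}, data files: {len(files)}")
```

In this model the hundred streaming inserts produce no files at all, which is the whole point: a format that writes one object per commit would have produced a hundred tiny ones.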
Production-readiness: DuckLake v1.0 comes with backward-compatibility guarantees, a stable spec, and client implementations for DataFusion, Spark, Trino, and Pandas alongside DuckDB-native. Companies like Definite — an AI-native analytics platform — have been running it in production for over a year.
Apache Iceberg: The Incumbent
Iceberg started at Netflix in 2017 to solve a specific problem: petabyte-scale data lakes where Spark jobs needed consistent snapshots and schema evolution. It solved that problem so well that it became the industry standard.
Iceberg’s design philosophy is file-based metadata with no required external dependencies. You can put an Iceberg table on a bare S3 bucket and it works. The cost of that design freedom is that the catalog has to live somewhere, and “somewhere” turned into a five-year ecosystem race: AWS Glue, Apache Polaris, Lakekeeper, Project Nessie, Hive Metastore, Snowflake’s managed catalog, the REST catalog spec. Each one is a service you operate, integrate engines with, and monitor at 2am.
The ecosystem is Iceberg’s moat. Spark, Trino, Flink, Snowflake, Athena, BigQuery, Dremio, and ClickHouse (read-only) all support it: every major engine reads Iceberg, and nearly all of them write it. If you need multi-engine federation today, Iceberg is the answer, no asterisks.
The Core Architectural Difference: Catalogs
This is the single most important thing to understand about the DuckLake vs Iceberg choice.
Iceberg committed to file-based metadata. Everything — table snapshots, manifest lists, manifest files — lives as JSON and Avro files in object storage alongside your data. The catalog is just a pointer to the current metadata location. This design means Iceberg has zero required infrastructure dependencies. It also means every query traverses a tree of file reads just to figure out what to scan.
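Here is a rough sketch of that tree walk. The dictionary stands in for object storage, and the paths and keys are invented for illustration; real Iceberg metadata is a JSON table-metadata file plus Avro manifest lists and manifests with much richer contents. The shape of the traversal is what matters.

```python
# Toy object store: path -> parsed contents (dicts stand in for JSON and Avro).
object_store = {
    "metadata/v3.metadata.json": {"manifest_list": "metadata/snap-3.avro"},
    "metadata/snap-3.avro": {"manifests": ["metadata/m0.avro", "metadata/m1.avro"]},
    "metadata/m0.avro": {"data_files": ["data/a.parquet", "data/b.parquet"]},
    "metadata/m1.avro": {"data_files": ["data/c.parquet"]},
}
# The catalog's only job in this model: remember the current metadata location.
catalog_pointer = "metadata/v3.metadata.json"


def plan_scan(pointer: str) -> tuple[list[str], int]:
    """Return the data files to scan and the number of metadata reads used."""
    reads = 0

    def read(path: str) -> dict:
        nonlocal reads
        reads += 1  # each call models one GET against object storage
        return object_store[path]

    table_metadata = read(pointer)
    manifest_list = read(table_metadata["manifest_list"])
    data_files: list[str] = []
    for manifest_path in manifest_list["manifests"]:
        data_files.extend(read(manifest_path)["data_files"])
    return data_files, reads


files, reads = plan_scan(catalog_pointer)
print(files, reads)
```

One table-metadata read, one manifest-list read, and one read per manifest: four round trips before any data is touched in this tiny example. Real engines cache and prune aggressively, so take the count as an illustration of the structure, not a benchmark.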
DuckLake commits to a database-backed catalog. Metadata lives in Postgres, DuckDB, MySQL, or SQLite. This single dependency buys you two things:
- ACID transactions come free — they’re how databases work, not something you have to build with file-level primitives
- Data inlining — small writes land directly in the catalog database instead of creating Parquet files, eliminating the small-file compaction problem entirely
Both BigQuery and Snowflake use database-as-catalog internally (Spanner and FoundationDB respectively). DuckLake is the first lakehouse format that exposes this pattern as an open spec.
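A minimal sketch of the database-as-catalog pattern, again with in-memory SQLite as the stand-in. The table and column names here are illustrative assumptions, not DuckLake’s actual catalog schema. It shows the two properties that matter: a commit is one database transaction, and "what do I scan?" is one query.

```python
import sqlite3

catalog = sqlite3.connect(":memory:")
catalog.execute("CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY)")
catalog.execute(
    "CREATE TABLE data_files (path TEXT, begin_snapshot INTEGER, end_snapshot INTEGER)"
)


def commit(paths: list[str]) -> int:
    """Atomically register new files under a new snapshot."""
    with catalog:  # one transaction: the snapshot and all its files land, or none do
        cur = catalog.execute("INSERT INTO snapshots DEFAULT VALUES")
        snapshot_id = cur.lastrowid
        catalog.executemany(
            "INSERT INTO data_files VALUES (?, ?, NULL)",
            [(p, snapshot_id) for p in paths],
        )
    return snapshot_id


def files_at(snapshot_id: int) -> list[str]:
    """A single SQL query lists the files visible at a given snapshot."""
    rows = catalog.execute(
        "SELECT path FROM data_files "
        "WHERE begin_snapshot <= ? AND (end_snapshot IS NULL OR end_snapshot > ?) "
        "ORDER BY path",
        (snapshot_id, snapshot_id),
    )
    return [r[0] for r in rows]


first = commit(["data/a.parquet"])
second = commit(["data/b.parquet", "data/c.parquet"])
print(files_at(first))   # only the file from the first commit
print(files_at(second))  # all three files
```

Time travel falls out of the same query: pass an older snapshot id and you get the older file list.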
DuckLake vs Iceberg: Side by Side
| Dimension | Apache Iceberg | DuckLake |
|---|---|---|
| Catalog | Pointer-based (REST, Glue, Hive, Polaris) | Database-native (Postgres, DuckDB, MySQL, SQLite) |
| Metadata format | JSON table metadata + Avro manifest lists and manifests | SQL database tables |
| ACID transactions | Optimistic concurrency on object storage | Database transactions |
| Small writes | Creates tiny Parquet files — compaction needed | Inlined in catalog — zero files |
| Streaming | Requires compaction tooling | Data inlining handles it natively |
| Engine ecosystem | Spark, Trino, Flink, Snowflake, Athena, BigQuery, Dremio, ClickHouse | DuckDB-native, DataFusion, Spark (via MotherDuck), Trino (community), Pandas |
| Scaling model | Horizontal through object storage | Catalog database is coordination point |
| Starting fresh | Deploy a catalog service (Polaris, Glue, etc.) | Bring a database you already run |
| Production track record | 2017+, Netflix, Snowflake, AWS | 2025+, Definite, select early adopters |
Real-World Production Use Cases
The format debate is informed by who’s actually running each in production today. Here’s what the landscape looks like as of mid-2026.
DuckLake in Production
Definite, an AI-native analytics platform, migrated their entire infrastructure from Snowflake to DuckDB in May 2024 and later adopted DuckLake as their lakehouse format. Their production system powers customer dashboards, AI agent queries, and data pipelines. Co-founder John Mark explained the decision: “We already run Postgres for product state. Adding a Postgres-backed DuckLake catalog cost us nothing operationally, and it gave us ACID semantics over the lake without adding a service.” They published the full business case and an operator’s verdict after a year in production.
On Reddit’s r/DuckDB, multiple engineers report running DuckLake in production for analytics workloads in the “few GB per day” range, with one planning a full rollout across their data platform by end of 2026.
UK consultancy endjin published a comprehensive three-part analysis concluding DuckLake’s simplified architecture positions it as a potential disruptor to established lakehouse formats, particularly for teams that already run a database.
InfoQ covered DuckLake 1.0 as a notable data engineering milestone in May 2026, highlighting the SQL-catalog-metadata approach as a fundamental rethinking of lakehouse architecture.
Iceberg in Production
Netflix built Iceberg in 2017 for its own Spark-on-S3 workloads, then open-sourced it. It remains one of the largest Iceberg deployments, operating at multi-petabyte scale with multi-region replication.
Apple, LinkedIn, and Airbnb all run Iceberg in production. Airbnb presented their migration journey at the 2025 Iceberg Summit, covering how they moved from Hive to Iceberg for their data lakehouse. Qlik’s report on Iceberg adoption cites these companies as reference deployments powering both analytics and AI workloads.
Snowflake natively reads and writes Iceberg tables — both managed catalogs and external tables. This integration alone makes Iceberg the default choice for any Snowflake-centric shop.
AWS Glue and Athena have deep Iceberg support. AWS doubled down on Iceberg as the open table format for their data lakehouse strategy.
The pattern is clear: Iceberg dominates at hyperscale with multi-engine, multi-team deployments. DuckLake is winning where teams run Postgres, use DuckDB as their primary engine, and value operational simplicity over ecosystem breadth. Both are legitimate choices for their respective use cases.
When to Pick Each
Pick DuckLake when:
You already run a database. If your stack includes Postgres, DuckLake’s catalog is just another schema. For a small team, “the catalog is free” is a meaningful operational unlock.
Your workload is AI-agent-driven. A human analyst runs maybe 50 queries a day. An AI agent doing schema inspection, query planning, and iterative refinement runs thousands. Iceberg’s metadata path walks S3 objects per read; DuckLake’s is a single SQL query. At human scale the difference is invisible. At agent scale, it compounds.
You’re building a DuckDB-centric stack. If DuckDB is your primary query engine and you don’t need Spark or Trino, DuckLake is the natural fit. For an in-depth look at DuckDB’s analytical capabilities, see our DuckDB guide.
You have streaming workloads with frequent small writes. DuckLake’s data inlining means you don’t need compaction tooling. The small-file problem is solved at write time, not patched by a maintenance job.
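To put the agent-scale point in rough numbers, here is a back-of-the-envelope calculation. The 50 queries a day comes from the text; the 5,000 agent queries and the four metadata reads per Iceberg plan are assumptions I picked for illustration, and real engines cache metadata, so actual counts will be lower.

```python
def daily_metadata_round_trips(queries_per_day: int, round_trips_per_plan: int) -> int:
    """Metadata round trips per day = queries x round trips needed to plan each one."""
    return queries_per_day * round_trips_per_plan


HUMAN_QUERIES = 50             # from the text
AGENT_QUERIES = 5_000          # assumption standing in for "thousands"
ICEBERG_READS_PER_PLAN = 4     # assumption: metadata file, manifest list, two manifests
DUCKLAKE_QUERIES_PER_PLAN = 1  # the "single SQL query" from the text

for label, queries in [("human", HUMAN_QUERIES), ("agent", AGENT_QUERIES)]:
    iceberg = daily_metadata_round_trips(queries, ICEBERG_READS_PER_PLAN)
    ducklake = daily_metadata_round_trips(queries, DUCKLAKE_QUERIES_PER_PLAN)
    print(f"{label}: file-based {iceberg}, database-backed {ducklake}")
```

Under these assumptions the gap is 150 round trips a day for a human and 15,000 for an agent. The ratio is the same; only the absolute cost changes, which is why it goes unnoticed until the query count jumps.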
Pick Iceberg when:
You need multi-engine federation. If Spark, Trino, Snowflake, and Athena all need to read the same tables today, Iceberg is the only answer.
You’re already on Snowflake with Iceberg tables. Snowflake reads and writes Iceberg natively. Migration costs almost certainly outweigh the design wins. Run the cost numbers before you touch anything.
Your compliance team has a catalog mandate. If they’ve signed off on Glue or Unity Catalog as the system of record, you don’t get to swap in Postgres. That’s an audit decision, not a technical one.
You’re operating at 50 TB+ with multi-write-cluster workloads. Iceberg’s optimistic concurrency on object storage scales horizontally without a single coordinator. DuckLake’s catalog database is a coordination point: fine at low-to-mid scale, but a bottleneck at the high end.
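For readers who haven’t worked with optimistic concurrency, here is a small sketch of the commit protocol in general terms. `PointerCatalog` and `append` are hypothetical names, and the in-process lock stands in for whatever atomic swap a real catalog provides. Writers prepare their changes independently, and the only shared step is the final compare-and-swap of the table pointer.

```python
import threading


class PointerCatalog:
    """Holds the current table state and swaps it atomically."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.version = 0
        self.files: tuple[str, ...] = ()

    def compare_and_swap(self, expected_version: int, new_files: tuple[str, ...]) -> bool:
        # Succeeds only if nobody has committed since the writer read the table.
        with self._lock:
            if self.version != expected_version:
                return False
            self.version += 1
            self.files = new_files
            return True


def append(catalog: PointerCatalog, path: str, max_retries: int = 5) -> int:
    """Append a file, re-reading and retrying on conflict. Returns attempts used."""
    for attempt in range(1, max_retries + 1):
        base_version, base_files = catalog.version, catalog.files
        if catalog.compare_and_swap(base_version, base_files + (path,)):
            return attempt
    raise RuntimeError("too many concurrent commits")


catalog = PointerCatalog()
stale_version, stale_files = catalog.version, catalog.files  # writer B reads the table
append(catalog, "data/a.parquet")                            # writer A commits first
ok = catalog.compare_and_swap(stale_version, stale_files + ("data/b.parquet",))
print(ok)  # False: B's commit was based on a stale version and is rejected
attempts = append(catalog, "data/b.parquet")                 # B retries from fresh state
print(catalog.files)
```

The design choice to notice is that conflicts cost a retry rather than a wait. That works well when commits are large and infrequent, and less well when many writers make small commits to the same table.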
The Two Formats Are Converging
Here’s what doesn’t fit on a vendor slide: both formats store data as Parquet files. The bytes on disk are identical. A Parquet reader doesn’t know or care which catalog wrote them.
DuckLake 0.3 shipped Iceberg interoperability in September 2025: you can COPY data and table metadata between DuckLake and Iceberg in either direction. DuckLake’s deletion vectors are designed to be Iceberg-compatible.
On the Iceberg side, the V4 spec work is exploring pluggable catalog backends. A DuckLake-style RDBMS catalog could plausibly fit inside a future Iceberg spec. Whether that happens depends on community direction, but the architectural drift is real.
In eighteen months, the “DuckLake or Iceberg” question may matter less. The right move is to pick what fits your team now, knowing the migration cost in either direction is bounded.
Migration Path Between Formats
If you pick DuckLake and want to switch to Iceberg later, the data on disk doesn’t move — it’s Parquet. You export the catalog metadata, write it as Iceberg manifests, and point a catalog service at the result. DuckLake ships COPY operations to Iceberg that handle most of the mechanics.
It’s a real project, measured in weeks rather than months, but it isn’t a rewrite.
If you’re inheriting an existing Snowflake-on-Iceberg deployment, the migration cost almost certainly outweighs the benefits. Stay where you are. The formats are converging anyway.
What You Can Actually Use Today
- DuckDB v1.5.2 includes the `ducklake` extension: run `FORCE INSTALL ducklake; LOAD ducklake;` and you’re running
- DuckLake v1.0 is the production-ready spec with backward-compatibility guarantees
- Apache Iceberg is available through every major query engine and cloud vendor — no installation needed
- Apache Polaris is now a top-level Apache project (as of April 2026) for Iceberg catalog management
- For a complete managed lakehouse, Definite and MotherDuck offer DuckLake-native platforms