DUCKLAKE VS ICEBERG: CHOOSING YOUR LAKEHOUSE FORMAT IN 2026

If you’re evaluating lakehouse formats in 2026, you’re staring at the same question I was last year: DuckLake or Iceberg?

Both solve the same core problem — ACID transactions, schema evolution, and petabyte-scale analytics on object storage — but they make radically different architectural tradeoffs. Pick wrong and you’re fighting your metadata layer instead of your actual data problems.

DuckLake v1.0 shipped in April 2026 with backward-compatibility guarantees. Apache Iceberg, approaching a decade of production use since it started at Netflix and now backed by Snowflake and AWS, is the incumbent. This isn’t a choice between good and bad. It’s a choice between two valid designs that serve different use cases. I’ll give you the decision framework so you don’t have to learn the hard way.

Who Is This Guide For?

Data engineers, platform architects, and CTOs choosing a lakehouse format for a new project or evaluating whether to migrate an existing Iceberg deployment. You know what Parquet is and you’ve probably read the Iceberg docs. What you need is a decision framework, not another explainer.

By the End of This, You’ll Know

  • Exactly when to pick DuckLake vs Iceberg, by data size, team count, and engine requirements
  • Why the catalog question is the actual question — and how each format answers it
  • What DuckLake v1.0’s data inlining means for streaming workloads
  • Where Iceberg still wins unconditionally
  • How to migrate between formats if you change your mind

The Verdict at a Glance

Your first question should be about scale and engine diversity. Find your row:

| Workload Tier | Format | Catalog | Primary Engines |
|---|---|---|---|
| Up to 100 GB, single team | DuckLake | Postgres or DuckDB file | DuckDB-native |
| 100 GB - 5 TB, one team | DuckLake | Managed Postgres (RDS/Cloud SQL) | DuckDB-centric + Iceberg reads via interop |
| 1 - 50 TB, multi-team read-heavy | Either (depends on engine plans) | DuckLake: Postgres. Iceberg: REST + Polaris or Lakekeeper | DuckDB-first or Spark/Trino |
| 50 TB - 5 PB, multi-engine | Iceberg | REST catalog (Polaris, Lakekeeper) or Glue | Spark, Trino, Snowflake, Athena, BigQuery |
| 5 PB+, regulated, multi-region | Iceberg | Your compliance-approved catalog | Whatever the org standardized on |

These aren’t hard byte-count limits. The format choice tracks with how many engines, teams, and write clusters touch your data.
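
If it helps to see the table as logic, here is a minimal Python sketch of how I’d encode it. The Workload fields, the recommend_format name, and the exact cutoffs are my own illustrative assumptions drawn from the rows above, not part of either spec. The function deliberately lets engine and compliance requirements outrank raw size, and it defaults to DuckLake for cases the table doesn’t cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    size_tb: float             # total data volume in terabytes
    teams: int                 # teams that read or write the lake
    needs_multi_engine: bool   # Spark/Trino/Snowflake/Athena must share tables
    regulated: bool = False    # compliance mandates a specific catalog


def recommend_format(w: Workload) -> str:
    """Return 'DuckLake', 'Iceberg', or 'Either' following the tier table."""
    # Engine diversity and compliance mandates outrank byte counts.
    if w.regulated or w.needs_multi_engine:
        return "Iceberg"
    # The 50 TB cutoff mirrors the table; it is a heuristic, not a hard limit.
    if w.size_tb >= 50:
        return "Iceberg"
    # Multi-team, terabyte-scale workloads sit in the "Either" row.
    if w.teams > 1 and w.size_tb >= 1:
        return "Either"
    # Assumption: anything the table doesn't cover falls back to DuckLake.
    return "DuckLake"


if __name__ == "__main__":
    print(recommend_format(Workload(size_tb=0.05, teams=1, needs_multi_engine=False)))
    print(recommend_format(Workload(size_tb=10, teams=3, needs_multi_engine=False)))
    print(recommend_format(Workload(size_tb=200, teams=5, needs_multi_engine=True)))
```

Treat the result as a starting point for the conversation, not a verdict: the "Either" case still comes down to your engine plans.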

DuckLake v1.0: What Changed

DuckLake was first sketched as a spec in May 2025. A year later, v1.0 shipped with production guarantees. Here’s what you need to know.

Data inlining is the headline feature. When you insert fewer rows than a configurable threshold (default 10), DuckLake stores the data directly in the catalog database instead of writing a tiny Parquet file to object storage. The DuckDB Labs team published a streaming benchmark showing 926x faster queries and 105x faster ingestion compared to Iceberg on a streaming workload. Those numbers are vendor-published, not third-party validated, but the architectural advantage is real: Iceberg’s small-file problem requires compaction tooling, while DuckLake doesn’t create the problem in the first place.
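
To make the inlining idea concrete, here is a toy model using Python’s built-in sqlite3 as a stand-in catalog. It is a sketch of the routing decision only, not DuckLake’s implementation: the table names (inlined_rows, data_files), the write_batch function, and the file path pattern are hypothetical, and no Parquet file is actually written.

```python
import sqlite3

INLINE_ROW_LIMIT = 10  # mirrors the default threshold quoted above


def open_toy_catalog() -> sqlite3.Connection:
    """Create an in-memory catalog with hypothetical table names."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE inlined_rows (batch_id INTEGER, payload TEXT)")
    con.execute(
        "CREATE TABLE data_files (batch_id INTEGER, path TEXT, row_count INTEGER)"
    )
    return con


def write_batch(con, batch_id, rows, limit=INLINE_ROW_LIMIT) -> str:
    """Route a batch: small batches go into the catalog, large ones to a file."""
    with con:  # one catalog transaction per batch
        if len(rows) < limit:
            con.executemany(
                "INSERT INTO inlined_rows VALUES (?, ?)",
                [(batch_id, r) for r in rows],
            )
            return "inlined"
        # Hypothetical path; a real writer would create the Parquet file first.
        path = f"data/batch-{batch_id}.parquet"
        con.execute(
            "INSERT INTO data_files VALUES (?, ?, ?)", (batch_id, path, len(rows))
        )
        return "file"


if __name__ == "__main__":
    con = open_toy_catalog()
    for i in range(1000):                      # a trickle of single-row events
        write_batch(con, i, [f"event-{i}"])
    write_batch(con, 1000, [f"row-{j}" for j in range(50)])  # one bulk load
    print(con.execute("SELECT COUNT(*) FROM data_files").fetchone()[0])
```

In this toy, a thousand single-row writes register no files at all; only the bulk load does. That is the shape of the advantage, independent of the benchmark numbers.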

Production-readiness: DuckLake v1.0 comes with backward-compatibility guarantees, a stable spec, and client implementations for DataFusion, Spark, Trino, and Pandas alongside DuckDB-native. Companies like Definite — an AI-native analytics platform — have been running it in production for over a year.

Apache Iceberg: The Incumbent

Iceberg started at Netflix in 2017 to solve a specific problem: petabyte-scale data lakes where Spark jobs needed consistent snapshots and schema evolution. It solved that problem so well that it became the industry standard.

Iceberg’s design philosophy is file-based metadata with no required database. Table data and metadata can sit in a bare S3 bucket, and the only thing that has to live elsewhere is the pointer to the current metadata. The cost of that design freedom is that this pointer, the catalog, has to live somewhere, and “somewhere” turned into a five-year ecosystem race: AWS Glue, Apache Polaris, Lakekeeper, Project Nessie, Hive Metastore, Snowflake’s managed catalog, the REST catalog spec. Each one is a service you operate, integrate engines with, and monitor at 2am.

The ecosystem is Iceberg’s moat. Spark, Trino, Flink, Snowflake, Athena, BigQuery, Dremio, and ClickHouse (read-only) all support it: nearly every major engine reads Iceberg, and most write it too. If you need multi-engine federation today, Iceberg is the answer, no asterisks.

The Core Architectural Difference: Catalogs

This is the single most important thing to understand about the DuckLake vs Iceberg choice.

Iceberg committed to file-based metadata. Everything (table snapshots, manifest lists, manifest files) lives as JSON and Avro files in object storage alongside your data. The catalog is just a pointer to the current metadata location. This design means Iceberg needs no infrastructure beyond object storage and that pointer. It also means every query traverses a tree of file reads just to figure out what to scan.

DuckLake commits to a database-backed catalog. Metadata lives in Postgres, DuckDB, MySQL, or SQLite. This single dependency buys you two things:

  1. ACID transactions come free — they’re how databases work, not something you have to build with file-level primitives
  2. Data inlining — small writes land directly in the catalog database instead of creating Parquet files, eliminating the small-file compaction problem entirely
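
Point 1 is easy to demonstrate with any SQL database. The sketch below uses Python’s built-in sqlite3 (SQLite is one of the catalog options listed above) to register a snapshot and its data files in one transaction. The schema and the commit_snapshot function are hypothetical simplifications, not the DuckLake catalog schema.

```python
import sqlite3


def open_snapshot_catalog() -> sqlite3.Connection:
    """In-memory catalog with a hypothetical two-table schema."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE snapshots (snapshot_id INTEGER PRIMARY KEY, note TEXT)")
    con.execute(
        "CREATE TABLE data_files (path TEXT PRIMARY KEY, snapshot_id INTEGER NOT NULL)"
    )
    return con


def commit_snapshot(con, snapshot_id, note, paths) -> bool:
    """Register a snapshot and its files atomically; roll back on any conflict."""
    try:
        with con:  # commits on success, rolls back if anything raises
            con.execute("INSERT INTO snapshots VALUES (?, ?)", (snapshot_id, note))
            con.executemany(
                "INSERT INTO data_files VALUES (?, ?)",
                [(p, snapshot_id) for p in paths],
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    con = open_snapshot_catalog()
    print(commit_snapshot(con, 1, "initial load", ["a.parquet", "b.parquet"]))
    # The second commit reuses a.parquet, so the whole transaction is rejected.
    print(commit_snapshot(con, 2, "conflicting write", ["c.parquet", "a.parquet"]))
    print(con.execute("SELECT COUNT(*) FROM snapshots").fetchone()[0])
```

The failed commit leaves no trace: neither the second snapshot row nor c.parquet is visible afterwards. The database’s transaction machinery does the work that a file-based format has to assemble from object-store primitives.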

Both BigQuery and Snowflake use database-as-catalog internally (Spanner and FoundationDB respectively). DuckLake is the first lakehouse format that exposes this pattern as an open spec.

graph TD
  subgraph "Iceberg Metadata Path"
    I_Catalog[REST / Glue Catalog]
    I_Root[Root Metadata JSON]
    I_ManifestList[Manifest List Avro]
    I_Manifest[Manifest Avro]
    I_Parquet[Parquet Data Files]
    I_Catalog --> I_Root
    I_Root --> I_ManifestList
    I_ManifestList --> I_Manifest
    I_Manifest --> I_Parquet
  end
  subgraph "DuckLake Metadata Path"
    D_Catalog[SQL Catalog - Postgres / DuckDB]
    D_Parquet[Parquet Data Files]
    D_Catalog -.-> D_Parquet
  end
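
The two metadata paths can be compared with a toy planner that counts round trips. This is a sketch under loud assumptions: a Python dict stands in for object storage, plain dicts stand in for parsed JSON and Avro contents, and every key, table, and function name is hypothetical. It ignores caching, which real engines use heavily.

```python
import sqlite3

# Toy object store keyed by path. Real Iceberg metadata is JSON and Avro;
# plain dicts stand in for the parsed contents here.
OBJECT_STORE = {
    "metadata/v3.json": {"manifest_list": "metadata/snap-3.avro"},
    "metadata/snap-3.avro": {"manifests": ["metadata/m0.avro", "metadata/m1.avro"]},
    "metadata/m0.avro": {"data_files": ["data/a.parquet", "data/b.parquet"]},
    "metadata/m1.avro": {"data_files": ["data/c.parquet"]},
}


def plan_file_based(pointer):
    """Walk pointer -> root metadata -> manifest list -> manifests."""
    reads = 0

    def fetch(key):
        nonlocal reads
        reads += 1  # each fetch models one object-store GET
        return OBJECT_STORE[key]

    root = fetch(pointer)
    manifest_list = fetch(root["manifest_list"])
    files = []
    for manifest_key in manifest_list["manifests"]:
        files.extend(fetch(manifest_key)["data_files"])
    return sorted(files), reads


def build_sql_catalog() -> sqlite3.Connection:
    """The same three data files, registered as rows in a SQL table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE data_files (path TEXT)")
    con.executemany(
        "INSERT INTO data_files VALUES (?)",
        [("data/a.parquet",), ("data/b.parquet",), ("data/c.parquet",)],
    )
    con.commit()
    return con


def plan_sql_catalog(con):
    """Resolve the file list with a single query, i.e. one round trip."""
    rows = con.execute("SELECT path FROM data_files ORDER BY path").fetchall()
    return [r[0] for r in rows], 1


if __name__ == "__main__":
    print(plan_file_based("metadata/v3.json"))
    print(plan_sql_catalog(build_sql_catalog()))
```

Both planners return the same file list. The difference is in how many dependent reads it takes to get there, and the file-based count grows with the number of manifests.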

DuckLake vs Iceberg: Side by Side

| Dimension | Apache Iceberg | DuckLake |
|---|---|---|
| Catalog | Pointer-based (REST, Glue, Hive, Polaris) | Database-native (Postgres, DuckDB, MySQL, SQLite) |
| Metadata format | JSON table metadata + Avro manifest lists and manifests | SQL database tables |
| ACID transactions | Optimistic concurrency on object storage | Database transactions |
| Small writes | Creates tiny Parquet files; compaction needed | Inlined in catalog; zero files |
| Streaming | Requires compaction tooling | Data inlining handles it natively |
| Engine ecosystem | Spark, Trino, Flink, Snowflake, Athena, BigQuery, Dremio, ClickHouse | DuckDB-native, DataFusion, Spark (via MotherDuck), Trino (community), Pandas |
| Scaling model | Horizontal through object storage | Catalog database is coordination point |
| Starting fresh | Deploy a catalog service (Polaris, Glue, etc.) | Bring a database you already run |
| Production track record | 2017+, Netflix, Snowflake, AWS | 2025+, Definite, select early adopters |

Real-World Production Use Cases

The format debate is informed by who’s actually running each in production today. Here’s what the landscape looks like as of mid-2026.

DuckLake in Production

Definite, an AI-native analytics platform, migrated their entire infrastructure from Snowflake to DuckDB in May 2024 and later adopted DuckLake as their lakehouse format. Their production system powers customer dashboards, AI agent queries, and data pipelines. Co-founder John Mark explained the decision: “We already run Postgres for product state. Adding a Postgres-backed DuckLake catalog cost us nothing operationally, and it gave us ACID semantics over the lake without adding a service.” They published the full business case and an operator’s verdict after a year in production.

On Reddit’s r/DuckDB, multiple engineers report running DuckLake in production for analytics workloads in the “few GB per day” range, with one planning a full rollout across their data platform by end of 2026.

UK consultancy endjin published a comprehensive three-part analysis concluding DuckLake’s simplified architecture positions it as a potential disruptor to established lakehouse formats, particularly for teams that already run a database.

InfoQ covered DuckLake 1.0 as a notable data engineering milestone in May 2026, highlighting the SQL-catalog-metadata approach as a fundamental rethinking of lakehouse architecture.

Iceberg in Production

Netflix created Iceberg in 2017 to solve a specific problem: Spark jobs needing consistent snapshots across petabytes of data in S3. It worked so well they open-sourced it, and it became the industry standard. Netflix remains one of the largest Iceberg deployments, operating at multiple-petabyte scale with multi-region replication.

Apple, LinkedIn, and Airbnb all run Iceberg in production. Airbnb presented their migration journey at the 2025 Iceberg Summit, covering how they moved from Hive to Iceberg for their data lakehouse. Qlik’s report on Iceberg adoption cites these companies as reference deployments powering both analytics and AI workloads.

Snowflake natively reads and writes Iceberg tables, both Snowflake-managed tables and tables tracked in an external catalog. This integration alone makes Iceberg the default choice for any Snowflake-centric shop.

AWS Glue and Athena have deep Iceberg support. AWS has doubled down on Iceberg as the open table format for its data lakehouse strategy.

The pattern is clear: Iceberg dominates at hyperscale with multi-engine, multi-team deployments. DuckLake is winning where teams run Postgres, use DuckDB as their primary engine, and value operational simplicity over ecosystem breadth. Both are legitimate choices for their respective use cases.

When to Pick Each

Pick DuckLake when:

You already run a database. If your stack includes Postgres, DuckLake’s catalog is just another schema. For a small team, “the catalog is free” is a meaningful operational unlock.

Your workload is AI-agent-driven. A human analyst runs maybe 50 queries a day. An AI agent doing schema inspection, query planning, and iterative refinement runs thousands. Iceberg’s metadata path walks S3 objects per read; DuckLake’s is a single SQL query. At human scale the difference is invisible. At agent scale, it compounds.
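
A back-of-envelope calculation shows how that compounds. The 50 queries a day comes from the paragraph above; everything else is a hypothetical placeholder I picked for illustration: 5,000 queries a day as a stand-in for “thousands”, 4 sequential object reads per plan, 50 ms per object read, and 5 ms for one catalog query. Caching is not modeled, so read this as an upper-bound sketch rather than a measurement.

```python
def daily_metadata_ms(queries_per_day: int, round_trips: int, ms_per_trip: int) -> int:
    """Total time per day spent resolving metadata, in milliseconds."""
    return queries_per_day * round_trips * ms_per_trip


# Hypothetical inputs; only HUMAN_QUERIES comes from the article text.
HUMAN_QUERIES = 50
AGENT_QUERIES = 5_000        # assumed stand-in for "thousands"
FILE_TREE_TRIPS = 4          # assumed: root metadata, manifest list, two manifests
OBJECT_READ_MS = 50          # assumed object-store GET latency
SQL_TRIPS = 1
SQL_QUERY_MS = 5             # assumed catalog query latency

if __name__ == "__main__":
    for label, queries in (("human", HUMAN_QUERIES), ("agent", AGENT_QUERIES)):
        file_based = daily_metadata_ms(queries, FILE_TREE_TRIPS, OBJECT_READ_MS)
        sql_based = daily_metadata_ms(queries, SQL_TRIPS, SQL_QUERY_MS)
        print(f"{label}: file-based {file_based} ms, SQL catalog {sql_based} ms")
```

With these placeholder numbers the human analyst loses about ten seconds a day to metadata either way you squint at it, while the agent workload spends minutes on one path and seconds on the other. Plug in your own latencies before drawing conclusions.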

You’re building a DuckDB-centric stack. If DuckDB is your primary query engine and you don’t need Spark or Trino, DuckLake is the natural fit. For an in-depth look at DuckDB’s analytical capabilities, see our DuckDB guide.

You have streaming workloads with frequent small writes. DuckLake’s data inlining means you don’t need compaction tooling. The small-file problem is solved at write time, not patched by a maintenance job.

Pick Iceberg when:

You need multi-engine federation. If Spark, Trino, Snowflake, and Athena all need to read the same tables today, Iceberg is the only answer.

You’re already on Snowflake with Iceberg tables. Snowflake reads and writes Iceberg natively. Migration costs almost certainly outweigh the design wins. Run the cost numbers before you touch anything.

Your compliance team has a catalog mandate. If they’ve signed off on Glue or Unity Catalog as the system of record, you don’t get to swap in Postgres. That’s an audit decision, not a technical one.

You’re operating at 50 TB+ with multi-write-cluster workloads. Iceberg’s optimistic concurrency on object storage scales horizontally: writers prepare data and metadata files independently and contend only on the final metadata-pointer swap. DuckLake’s catalog database is a coordination point for every commit, which is fine at low-to-mid scale but a bottleneck at the high end.
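
For readers who haven’t seen optimistic concurrency spelled out, here is a minimal sketch of the commit loop. It is a generic compare-and-swap illustration, not Iceberg’s code: PointerCatalog, commit_with_retry, and the string “metadata locations” are all hypothetical, and a threading.Lock stands in for whatever atomic swap a real catalog provides.

```python
import threading


class PointerCatalog:
    """Holds the current metadata location and swaps it atomically."""

    def __init__(self, initial: str):
        self._current = initial
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap

    def current(self) -> str:
        return self._current

    def compare_and_swap(self, expected: str, new: str) -> bool:
        """Install `new` only if nobody else committed since `expected`."""
        with self._lock:
            if self._current != expected:
                return False
            self._current = new
            return True


def commit_with_retry(catalog, make_new_metadata, max_attempts: int = 5) -> int:
    """Optimistic commit: build on the latest base, retry if we lost the race."""
    for attempt in range(1, max_attempts + 1):
        base = catalog.current()
        # The expensive work (writing new metadata) happens without any lock.
        candidate = make_new_metadata(base)
        if catalog.compare_and_swap(base, candidate):
            return attempt
    raise RuntimeError("commit failed after retries")


if __name__ == "__main__":
    catalog = PointerCatalog("v1")
    attempts = commit_with_retry(catalog, lambda base: base + "+append")
    print(attempts, catalog.current())
```

The only shared step is the swap itself, which is why this design spreads well across many write clusters. The price is retries under contention: a writer that loses the race has to rebuild its metadata on top of the winner’s.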

The Two Formats Are Converging

Here’s what doesn’t fit on a vendor slide: both formats store data as Parquet files. The bytes on disk are identical. A Parquet reader doesn’t know or care which catalog wrote them.

DuckLake 0.3 shipped Iceberg interoperability in September 2025: you can COPY data and table metadata between DuckLake and Iceberg in either direction. DuckLake’s deletion vectors are designed to be Iceberg-compatible.

On the Iceberg side, the V4 spec work is exploring pluggable catalog backends. A DuckLake-style RDBMS catalog could plausibly fit inside a future Iceberg spec. Whether that happens depends on community direction, but the architectural drift is real.

In eighteen months, the “DuckLake or Iceberg” question may matter less. The right move is to pick what fits your team now, knowing the migration cost in either direction is bounded.

Migration Path Between Formats

If you pick DuckLake and want to switch to Iceberg later, the data on disk doesn’t move — it’s Parquet. You export the catalog metadata, write it as Iceberg manifests, and point a catalog service at the result. DuckLake ships COPY operations to Iceberg that handle most of the mechanics.

It’s a real project, measured in weeks rather than months, but it isn’t a rewrite.

If you’re inheriting an existing Snowflake-on-Iceberg deployment, the migration cost almost certainly outweighs the benefits. Stay where you are. The formats are converging anyway.

What You Can Actually Use Today

  • DuckDB v1.5.2 includes the ducklake extension — run FORCE INSTALL ducklake; LOAD ducklake; and you’re running
  • DuckLake v1.0 is the production-ready spec with backward-compatibility guarantees
  • Apache Iceberg is available through every major query engine and cloud vendor — no installation needed
  • Apache Polaris is now a top-level Apache project (as of April 2026) for Iceberg catalog management
  • For a complete managed lakehouse, Definite and MotherDuck offer DuckLake-native platforms
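
If you want to try the first bullet from Python, a minimal sketch looks like this. The helper names, the "lake" alias, and the file paths are placeholders of mine; the ATTACH ... (DATA_PATH ...) form follows the DuckLake documentation as I understand it, so check it against the version you install. I use plain INSTALL here; FORCE INSTALL as written above re-downloads the extension. The duckdb import is third-party (pip install duckdb) and is only needed when you actually run the statements.

```python
def ducklake_setup_sql(catalog_path: str, data_path: str, alias: str = "lake") -> list[str]:
    """Build the statements that attach a DuckLake catalog in DuckDB."""
    return [
        "INSTALL ducklake",
        "LOAD ducklake",
        # Metadata goes to catalog_path, Parquet files go under data_path.
        f"ATTACH 'ducklake:{catalog_path}' AS {alias} (DATA_PATH '{data_path}')",
        f"USE {alias}",
    ]


def run(statements):
    """Execute the statements; requires the third-party duckdb package."""
    import duckdb  # imported lazily so the builder works without it

    con = duckdb.connect()
    for statement in statements:
        con.execute(statement)
    return con


if __name__ == "__main__":
    # Placeholder paths; point these at your own catalog file and data directory.
    for statement in ducklake_setup_sql("metadata.ducklake", "data_files/"):
        print(statement + ";")
```

Calling run(...) on that list should leave you with a connection whose default catalog is the DuckLake, at which point ordinary CREATE TABLE and INSERT statements apply.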
