Bringing DuckLake to Apache DataFusion

Eddie A Tejeda · Jun 23, 2026 · 7 min read

We've just released a new version of DuckLake + DataFusion and have donated it to the datafusion-contrib repository. We’re delighted to bring a new lakehouse catalog format to the DataFusion ecosystem. But let’s back up and give context on what Apache DataFusion is, what DuckLake is, and why we think it’s a useful combination for building low-latency data systems.

What is Apache DataFusion?

Apache DataFusion is a query engine written in Rust that uses Apache Arrow as its in-memory format. It provides SQL and DataFrame APIs, a query planner, query optimization, vectorized processing, and Parquet support. It is also embeddable, which means you can put it directly inside your application without running a separate service. It provides the core machinery of a database while leaving room for developers to extend it as they see fit.

If you use DuckDB, the idea of a pure query engine may not be clear. To give some context: while DuckDB and DataFusion are both embeddable query engines, DuckDB also includes the surrounding database product: a command-line tool, a storage format, a catalog, transaction semantics, readers, writers, and a polished end-user experience. DuckDB is a complete database that you can configure and embed.

DataFusion operates at a lower level. The library gives you planning, optimization, and execution, but it does not prescribe the rest of the architecture. To build a full database on top of DataFusion, you need to define several important pieces yourself: the storage format, the catalog format, indexing, caching, and how tables are represented on disk.

So that’s DataFusion in a nutshell. It is a powerful foundation for building a database, but it purposely leaves the higher-level product decisions to the builder. That tradeoff is what drew us to DataFusion for building Hotdata. We wanted control over the internals of the system: how data is stored, how it is indexed, how it is cached, how requests are planned, and how the execution layer fits into the rest of the platform.

This is where it’s useful to know about lakehouse architectures.

What is a Lakehouse?

Unlike a traditional database, where your data files and database are on the same machine, lakehouses and other distributed data systems store their data files in object storage, and those files have to be ingested into the execution engine before results get back to the user.

At first glance, this may seem slow since disk access is much faster than going out to the network. But object storage has gotten so fast that it's possible to build large systems that ingest data from remote objects in milliseconds. And that's why lakehouses are so popular.

At a simplified level, a lakehouse maps table names to objects that live in object storage, usually as Parquet files in a bucket such as S3. A request comes in with details like the organization, dataset, table, and version. The system resolves that metadata, finds the relevant files, and passes those files into an execution engine such as DataFusion or DuckDB.

How a lakehouse works: a request keyed by (org, dataset, version) hits a metadata lookup table, which resolves Parquet and DuckDB files in S3 for execution.
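
The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not our production code: the `TableKey` type, the `METADATA` dictionary, the `resolve_files` function, and the bucket paths are all hypothetical names made up for the example.

```python
from typing import NamedTuple


class TableKey(NamedTuple):
    """Hypothetical request key: which table, and which version of it."""
    org: str
    dataset: str
    table: str
    version: int


# Hypothetical metadata lookup table: each key maps to the Parquet objects
# that make up that version of the table. The paths are made up.
METADATA: dict[TableKey, list[str]] = {
    TableKey("acme", "sales", "orders", 1): [
        "s3://example-bucket/acme/sales/orders/v1/part-0.parquet",
    ],
    TableKey("acme", "sales", "orders", 2): [
        "s3://example-bucket/acme/sales/orders/v2/part-0.parquet",
        "s3://example-bucket/acme/sales/orders/v2/part-1.parquet",
    ],
}


def resolve_files(key: TableKey) -> list[str]:
    """Return the object paths an execution engine would be asked to scan."""
    try:
        return METADATA[key]
    except KeyError:
        raise LookupError(f"unknown table version: {key}") from None


files = resolve_files(TableKey("acme", "sales", "orders", 2))
print(files)
```

Everything beyond this mapping (versions, snapshots, schema changes) has to be layered on by hand, which is where the bookkeeping starts to grow.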

When we previously built custom lakehouse metadata systems, this is the approach we took. It is a barebones system that works well when you have clear product requirements and can focus on reducing overhead.

But as products evolve, so do the requirements, and extending this basic architecture becomes extremely difficult. Table versions, snapshots, schema evolution, reassigning data from one customer to another, and auditability all require custom bookkeeping. That is why Apache Iceberg was exciting when it was released. Iceberg defines a table format with metadata, snapshots, schemas, manifests, and object-store-backed files. It gives structure to a problem that otherwise tends to become a collection of custom conventions.

In Iceberg, a request comes in, the system resolves the table, and then walks through metadata files stored in object storage. It then reads snapshot metadata, partition specs, schema details, manifest lists, file manifests, per-file statistics, and information about deleted files. After traversing that metadata graph, the engine can identify the relevant Parquet files and begin execution.

Iceberg-style metadata path: the query engine asks a REST catalog for the current snapshot, walks immutable metadata files in S3 (snapshot, manifest list, manifest files, delete files) to prune partitions, then opens the candidate Parquet files for row-group and page pruning.
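
To make the cost of that walk concrete, here is a deliberately simplified model of it. It is not Iceberg's real file layout or any library's API: `CountingStore`, `plan_scan`, and the keys and fields in the toy store are assumptions for illustration, and delete files and statistics are left out. The only point is that every level of the metadata graph costs at least one more object-store request before a data file is opened.

```python
class CountingStore:
    """In-memory stand-in for an object store that counts GET requests."""

    def __init__(self, objects: dict[str, dict]):
        self.objects = objects
        self.gets = 0

    def get(self, key: str) -> dict:
        self.gets += 1
        return self.objects[key]


def plan_scan(store: CountingStore, table_metadata_key: str) -> list[str]:
    """Walk table metadata -> manifest list -> manifests, collecting data files."""
    table_meta = store.get(table_metadata_key)
    manifest_list = store.get(table_meta["current_manifest_list"])
    data_files: list[str] = []
    for manifest_key in manifest_list["manifests"]:
        manifest = store.get(manifest_key)
        data_files.extend(manifest["data_files"])
    return data_files


# Toy metadata graph with hypothetical keys: one table metadata file,
# one manifest list, and two manifests.
store = CountingStore({
    "meta/v3.json": {"current_manifest_list": "meta/snap-3.json"},
    "meta/snap-3.json": {"manifests": ["meta/m0.json", "meta/m1.json"]},
    "meta/m0.json": {"data_files": ["data/a.parquet"]},
    "meta/m1.json": {"data_files": ["data/b.parquet", "data/c.parquet"]},
})

files = plan_scan(store, "meta/v3.json")
print(files)
print("metadata requests before the first data file is opened:", store.gets)
```

In this toy graph, planning issues four metadata requests to find three data files. Real engines cache and parallelize some of this, so treat the count as a shape, not a benchmark.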

That model is powerful, especially when a broad data ecosystem and throughput matter. For a low-latency query system, though, that metadata path can involve many steps. Multiple object-store requests before opening the actual data files can become expensive when the goal is to serve small, fast queries.

That is why DuckLake caught our attention.

What is DuckLake?

DuckLake defines a lakehouse format where the metadata lives in a database such as Postgres, DuckDB, or MySQL, while the data files live in object storage. Instead of walking through many metadata files in S3, the system can query a metadata database and retrieve the information it needs about tables, snapshots, schemas, and Parquet files.

For our use case, that is perfect. With one metadata query, we can resolve the table, understand the relevant files, get the paths to the Parquet files, and begin execution in DataFusion.

DuckLake-style relational metadata catalog: the query engine issues one SQL call to a metadata database (Postgres, SQLite, or DuckDB) that returns the candidate Parquet files and row-group stats directly — no manifest walk, no extra object-store hops — before opening the files for row-group and page pruning.
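
Here is a minimal sketch of what "one metadata query" looks like, using SQLite from Python's standard library as the metadata database. The two tables and their columns are a simplified, hypothetical schema invented for this example; the real DuckLake specification defines its own catalog tables with more columns, and the function name `files_at_snapshot` is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tables (table_id INTEGER PRIMARY KEY, table_name TEXT);
CREATE TABLE data_files (
    table_id INTEGER,
    path TEXT,
    begin_snapshot INTEGER,   -- snapshot that added the file
    end_snapshot INTEGER      -- snapshot that removed it, NULL if still live
);
INSERT INTO tables VALUES (1, 'orders');
INSERT INTO data_files VALUES
    (1, 's3://example-bucket/orders/f0.parquet', 1, NULL),
    (1, 's3://example-bucket/orders/f1.parquet', 2, 4),
    (1, 's3://example-bucket/orders/f2.parquet', 4, NULL);
""")


def files_at_snapshot(conn: sqlite3.Connection, table_name: str, snapshot_id: int) -> list[str]:
    """Resolve a table name and snapshot to Parquet paths with one SQL query."""
    rows = conn.execute(
        """
        SELECT f.path
        FROM data_files AS f
        JOIN tables AS t ON t.table_id = f.table_id
        WHERE t.table_name = ?
          AND f.begin_snapshot <= ?
          AND (f.end_snapshot IS NULL OR f.end_snapshot > ?)
        ORDER BY f.path
        """,
        (table_name, snapshot_id, snapshot_id),
    ).fetchall()
    return [path for (path,) in rows]


print(files_at_snapshot(conn, "orders", 3))
print(files_at_snapshot(conn, "orders", 4))
```

Because each file row carries the range of snapshots in which it is visible, reading an older snapshot is the same query with a different snapshot id, with no chain of metadata files to walk.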

The important point is that DuckLake is a specification. It is not only an application or a feature inside DuckDB. It defines how the metadata is represented and how the data files are organized. That means other systems can implement it.

That led us to build DataFusion + DuckLake.

Our goal is to combine the flexibility of Apache DataFusion with the structure of DuckLake. DataFusion gives us the execution engine. DuckLake gives us a concrete table and catalog model that works well with object storage and low-latency metadata access.

So far, we have implemented key parts of the DuckLake architecture. We support reads and writes, multiple catalog backends including DuckDB, Postgres, and MySQL, encrypted Parquet files, hints for optimized I/O, filter pushdown for row-group pruning, and page-level filtering. We are also working closely with the DuckLake team to help advance the standard.
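
For readers who have not met row-group pruning before, the idea behind it can be sketched as follows. This is a conceptual illustration, not the code in our crate or in DataFusion's Parquet reader: `RowGroupStats` and `prune_row_groups` are hypothetical names, and the example assumes a single integer column with a range filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Min/max statistics for one column in one Parquet row group."""
    index: int
    min_value: int
    max_value: int


def prune_row_groups(stats: list[RowGroupStats], lo: int, hi: int) -> list[int]:
    """Keep row groups whose [min, max] range can overlap lo <= x <= hi."""
    return [s.index for s in stats if s.max_value >= lo and s.min_value <= hi]


stats = [
    RowGroupStats(index=0, min_value=1, max_value=100),
    RowGroupStats(index=1, min_value=101, max_value=200),
    RowGroupStats(index=2, min_value=201, max_value=300),
]

# Filter: 150 <= x <= 220. Row group 0 cannot contain a match and is skipped.
print(prune_row_groups(stats, lo=150, hi=220))
```

Pruning is conservative: a row group that survives may still contain no matching rows, but a row group that is skipped is guaranteed to contain none. Page-level filtering applies the same test at a finer granularity.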

Why DuckLake Fits DataFusion

Implementing DuckLake replaces a large amount of custom database infrastructure that we would otherwise have to build ourselves. Instead of inventing our own table format, catalog model, versioning system, and metadata layout, we can implement a shared format and focus our energy on execution, performance, caching, indexing, and the developer experience around the system.

Where time goes when you execute a query: Iceberg-style planning spends it on metadata traversal, object-store latency, and planning before the Parquet scan, while DuckLake-style planning collapses that to a single metadata DB lookup plus I/O.

We do not think DuckLake will replace Iceberg for every use case. If you are building a lakehouse that needs deep interoperability, Iceberg is the right choice. DuckLake is great for a different class of systems: applications, embedded query engines, low-latency serving layers, and systems that want the economics of object storage with the responsiveness of a database-backed catalog.

That is the gap we care about for Hotdata.

Conclusion

The project is now available in the Apache DataFusion Contrib GitHub repository, and we would like more people to help shape it. If you are building with DataFusion, DuckLake, Parquet, object storage, or embedded query engines, contributions are welcome. Issues, bug reports, tests, documentation improvements, catalog backend work, and performance benchmarks are all useful at this stage.