DuckDB flips lakehouse model with bring-your-own compute

With a combined market value of around $150 billion, Snowflake and Databricks have divergent visions on how to get customers’ analytics and machine learning tools to their data, which is often spread across different systems.

And now fledgling database DuckDB is throwing another fresh format into the pot, along with a new architecture and database extension.

In-process analytics database DuckDB first appeared in 2019, built by a small team in the Netherlands, and quickly won over a fanbase at Google, Facebook, and Airbnb. Now past its 1.0 release with 20 million monthly downloads, the database promises users their own data warehouse, whether that’s on a laptop or embedded alongside the Python data-analysis library Pandas, without importing or copying data. Written in C++, DuckDB is free and open source under the MIT License.

The DuckDB concept contrasts with the model of data warehousing established over the last 30 years or so. Whether on-prem with systems like Teradata, Netezza, or Greenplum, or in the cloud on blob storage with Google’s BigQuery or Snowflake, users were supposed to share a single, consistent set of data.

Self-funded and eschewing VC money, DuckDB says it can now offer that “single version of the truth” while promising to solve the problems of data lakehouses, which are less orderly but more flexible than data warehouses, by way of a new architecture and database extension.

To allow users to access data across different systems, data lake vendors have backed table formats Iceberg, Delta Lake, and – to a lesser extent – Hudi.

Iceberg originated in 2015 at Netflix, and has broad support from a range of vendors and tech big hitters such as AWS, Google, Apple, and Snowflake. Delta Lake, meanwhile, emerged from data lake vendor Databricks, and is the table format of choice for Microsoft and SAP. Since Databricks spent at least $1 billion on Tabular, an early-stage startup founded by Iceberg’s authors from Netflix, the prospect of merging the standards has been on the cards.

But both table formats share a metadata storage problem, according to a DuckDB blog post co-authored by CTO Mark Raasveldt and Hannes Mühleisen, a professor at Amsterdam’s Centrum Wiskunde & Informatica (CWI), the Dutch national research institute for mathematics and computer science.

“For example, every single root file in Iceberg contains all existing snapshots complete with schema information, etc,” DuckDB said in the blog.

“For every single change, a new file is written that contains the complete history. A lot of other metadata had to be batched together, for example, in the two-layer manifest files to avoid writing or reading too many small files, something that would not be efficient on blob stores. Making small changes to data is also a largely unsolved problem that requires complex cleanup procedures that are still not very well understood nor supported by open-source implementations,” the database purveyor added.

Open storage and metadata

DuckDB proposes to solve this problem, and upend the architecture at the same time, with two ideas. First, storing data files in open formats on blob storage (AWS S3, Google Cloud Storage, and Microsoft Azure Blob Storage) is a “great idea for scalability and to prevent lock-in.”

Second, the metadata management problem is handed over to a dedicated database management system.

To implement its radical new architecture, DuckDB has launched two pieces: an extension that lets DuckDB operate as the metadata store, and a new table format, DuckLake. DuckLake promises to simplify lakehouses by keeping all metadata in a standard SQL database, rather than in complex file-based systems, while still storing data in open formats like Parquet. This should make the system more reliable, faster, and easier to manage, DuckDB claimed.
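
In practice, getting started looks like attaching a catalog and then issuing ordinary SQL. The sketch below follows the ATTACH syntax documented for the ducklake extension; the file paths and table are placeholders, and exact option names may vary between versions.

```sql
-- Minimal local DuckLake setup (a sketch; paths are placeholders).
INSTALL ducklake;
LOAD ducklake;

-- The catalog (metadata) lives in a DuckDB database file; the table
-- data is written out as plain Parquet files under DATA_PATH.
ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'data_files/');
USE my_lake;

CREATE TABLE demo (id INTEGER, name VARCHAR);
INSERT INTO demo VALUES (1, 'duck'), (2, 'lake');
SELECT * FROM demo;
```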

“DuckLake is turning this all upside down. We’re using standard Parquet files as the storage, not some custom thing. We’re using standard SQL and tables for the metadata, which is much better, much more efficient, much safer than throwing a bunch of files around,” Mühleisen told The Register.

DuckDB can serve as the dedicated metadata database, but so can PostgreSQL, SQLite and MySQL, giving users choice.

“One of the cool things about DuckLake as a standard is that it actually is not bound to specific metadata storage,” said Mühleisen, who is also CEO and co-founder of DuckDB Labs, which provides consultancy and services for DuckDB. “You can use DuckLake with DuckDB. You can use DuckLake with SQLite, MySQL; it doesn’t really matter. And you can use DuckLake with an arbitrary file storage back end: S3, Google, an FTP server for all I care. It is flexible in that way. For example, you can run DuckLake fully locally on your laptop, and you can scale it up to run on a gigantic Amazon cluster. We’re just saying you need a file storage, and you need a SQL database for the metadata, and that’s it, really.

“Compute is absolutely local: everybody brings their own compute, so to speak. You can still get a centralized, unified view of your data and then scale out compute by basically pushing that all into the clients,” Mühleisen said.
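
To make that concrete: swapping in a shared catalog is, in principle, just a different ATTACH string. A hedged sketch, assuming a reachable PostgreSQL instance and an S3 bucket; the host, database, bucket, and table names below are invented.

```sql
-- Hypothetical shared setup: PostgreSQL holds the DuckLake catalog,
-- S3 holds the Parquet data files. Every client that attaches this
-- way sees the same tables, while query execution stays local.
INSTALL ducklake;
LOAD ducklake;  -- DuckDB's postgres extension may also be required

ATTACH 'ducklake:postgres:dbname=lake host=db.example.com' AS shared_lake
    (DATA_PATH 's3://example-bucket/lake/');
USE shared_lake;

-- The scan work happens wherever this runs: a laptop or a big VM.
SELECT count(*) FROM demo;
```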

The DuckLake format supports the new architecture, as does the DuckLake extension. Both are free and open source software under the MIT license, with all IP resting in the non-profit DuckDB Foundation.

Andrew Pavlo, associate professor of databaseology at Carnegie Mellon University, said DuckDB’s assessment of the problems with lakehouse systems was fair, but noted that they concerned the internals.

“End users are unaware of how these systems store metadata and handle updates. It is about the inefficient implementations of metadata catalogs. But if the only place that your system can store data is in a blob/object store like S3 that doesn’t support updates in place, then you would end up with a design similar to Iceberg,” he said.

“The DuckLake proposal is about defining a generic database schema for representing a catalog. Think of it like defining a file format like Parquet that anybody could write a reader to interpret. The DuckLake extension integrated into DuckDB can connect to another DBMS (using their existing functionalities) and query against that known schema. I applaud using a relational database with SQL to maintain a catalog. This is what nearly all DBMSs do today,” Pavlo said.
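
To illustrate Pavlo’s point, a catalog-as-schema can be pictured as a handful of ordinary tables that any reader could interpret. The sketch below is a drastic simplification invented for illustration, not the published DuckLake schema.

```sql
-- Illustrative only: a toy catalog schema in the spirit Pavlo
-- describes, not DuckLake's actual table definitions.
CREATE TABLE snapshot (
    snapshot_id  BIGINT PRIMARY KEY,
    committed_at TIMESTAMP
);

CREATE TABLE table_info (
    table_id   BIGINT PRIMARY KEY,
    table_name VARCHAR
);

CREATE TABLE data_file (
    file_id        BIGINT PRIMARY KEY,
    table_id       BIGINT REFERENCES table_info (table_id),
    path           VARCHAR, -- Parquet file in the object store
    begin_snapshot BIGINT,  -- snapshot where the file became visible
    end_snapshot   BIGINT   -- NULL while the file is still live
);

INSERT INTO snapshot VALUES (7, TIMESTAMP '2025-06-01 12:00:00');
INSERT INTO table_info VALUES (42, 'orders');
-- Adding data is one transactional insert, not a new metadata file:
INSERT INTO data_file
VALUES (1, 42, 's3://bucket/orders/part-0001.parquet', 7, NULL);
```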

However, there were also trade-offs and potential disadvantages to the DuckLake approach, he said.

“One key advantage of Iceberg’s approach is that the metadata is self-contained and can be included alongside the data files. That means if you trash your Iceberg service, you can still read Iceberg’s metadata Parquet files on the object store to extract the catalog and access your data. With DuckLake, if the external DBMS hosting the catalog is corrupted or goes away, there is no way to recreate it using the object store’s data files. This design choice is not a fatal flaw. Instead, it is a trade-off for DuckLake’s more efficient fine-grained updates in exchange for a (minor) increase in operational complexity,” he said.

Hyoun Park, CEO and chief analyst at Amalgam Insights, also highlighted the big innovation: metadata management for Parquet files handled in plain SQL, eliminating the “small change mess that is endemic in current lakehouse deployments.”

However, Park opined that adoption of the new architecture would depend on how willing user organizations are to diverge from standards backed by the largest vendors and carrying the most momentum in the market.

“In the business world, there are all sorts of reasons why the common sense decision can’t or won’t happen. DuckDB is doing needed and often thankless work in making the Lakehouse more usable and performant. In a world where many data vendors often try to solve problems by throwing more compute and storage at it, DuckDB is taking a more reasonable approach to sorting out metadata challenges,” he said.

DuckDB has gained support from former Google executive and BigQuery engineering lead Jordan Tigani, who is founder and CEO of MotherDuck – which built a serverless analytics system based around DuckDB.

“DuckLake corrects some of the architectural weaknesses of the other lakehouse formats regarding how they store metadata, which is the information about where to find Parquet files in your object store. A good analytics engine can use this metadata to make queries dramatically faster by reading fewer files and less data. Data warehouses like BigQuery and Snowflake store their metadata in a database: Spanner and FoundationDB, respectively. Iceberg, Delta, and Hudi try to operate under the constraint that you don’t have a database around and so jump through a lot of hoops to store their metadata on S3 itself,” he said.

“To really use lakehouses in production you need a catalog such as Polaris, Unity, or Glue. What is a catalog? It is a database. So all of the awkward hacks that were added to avoid using a database are no longer relevant. As such, DuckLake can now work around many problems in other lakehouse formats, like dealing with frequent updates, problems with small files, and difficulty performing multi-statement and multi-table transactions,” he said.
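
As a rough illustration of the file pruning Tigani describes: if per-file column statistics live in SQL tables, file skipping becomes a query. The table and column names below are invented for the example.

```sql
-- Hypothetical per-file statistics, as a SQL-backed catalog might
-- store them (invented names; values kept as strings for brevity).
CREATE TABLE file_column_stats (
    path        VARCHAR, -- Parquet file in the object store
    column_name VARCHAR,
    min_value   VARCHAR,
    max_value   VARCHAR
);

-- Before touching the object store, the engine asks the catalog which
-- files could possibly match the predicate and skips all the others.
SELECT path
FROM file_column_stats
WHERE column_name = 'order_date'
  AND min_value <= '2025-06-01'
  AND max_value >= '2025-06-01';
```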

MotherDuck is planning a hosted DuckLake solution. “People will be able to use DuckDB from their laptops to analyze DuckLake data in S3, but generally you’re not going to want to pull all of your data locally to do your analytics; that would be both slow and expensive. MotherDuck will be a great solution to offload computation to the cloud, while also providing interoperability with other data tools in the ecosystem,” Tigani said.

But DuckLake must contend with Iceberg’s popularity and momentum in the market, which might prove an “uphill battle,” said Tigani.

Whether DuckDB can win against incumbent vendors such as Snowflake and Databricks may come down to whether architecture or inertia prevails. ®
