Why Starburst’s Icehouse Is A Bad Bet

StarRocks Engineering
4 min read · Mar 7, 2024


Last week, Starburst’s CEO posted a manifesto on the future of the open data lake, calling its next generation the “Icehouse”. This “Icehouse”, an architecture based on Trino and Iceberg, left a lot of engineers scratching their heads, wondering how it is anything more than just another data lakehouse architecture. The confusion around the concept is understandable, and it conveniently obscures what Starburst is trying to do in the lakehouse market.

Let’s break down this “Icehouse” concept to understand why it is not only unnecessary but also a bad bet for anyone who cares about open data lake architectures.

The Icehouse Architecture: A Step Backward For Open Architectures

Justin Borgman, Starburst’s CEO, calls the “Icehouse” a “revolution”, but it is anything but. Elaborating on the concept, Borgman tries to make the case that it’s “the sixth” modern data platform, but there’s really nothing new here. In fact, it’s an objectively worse architecture for many enterprises. Hardly a “revolution”. Even engineers at Databricks were quick to point this out, noting that Starburst’s “Icehouse” concept is simply a less open take on the data lakehouse.

Icehouse vs. Lakehouse: What’s the Difference?

So what is the “Icehouse”, really? Follow the money to see who stands to benefit from its adoption and it becomes pretty clear: the “Icehouse” is just another vendor buzzword meant to guide the market toward its products. It should be quite telling that this idea was first posted on Starburst’s blog by its CEO rather than coming from within the Trino community. Starburst, whose commercial offering is built on Trino, is better positioned than anyone to profit from the adoption of the “Icehouse”.

And that’s all the “Icehouse” is: Starburst’s recommended lakehouse architecture. Is it a solid foundation for a lakehouse design? Absolutely, but it’s a far cry from a “revolution” and certainly not differentiated enough to be “the sixth” data platform.

Revolution or not, is Starburst’s lakehouse architecture a step backward for open data lakehouse designs? It depends, and that’s the problem.

By pre-designating Trino and Iceberg as the de facto components of its architecture, Starburst implicitly suggests that the optimal path to an open lakehouse has already been determined. This significantly narrows the field, sidelining other technologies that might offer different advantages or better alignment with specific organizational needs.

And where does it end? The whole point of openness is choice, and as soon as you start hard-coding specific solutions into an architecture, you are denying that choice and killing the openness. At what point does openness end? Two pre-defined solutions? Five? Ten?

In essence, choosing two components to specify an architecture, especially one celebrated for its flexibility in swapping components, doesn’t create a fundamentally different architecture worthy of its own name. Starburst’s architecture should be seen as a specific example of a variation within the broader concept of open lakehouse architecture, not the future of what an open lakehouse should be.

Choices That Are Left Out By Starburst

While Apache Iceberg and Trino are excellent components for an open lakehouse architecture, they are not the only options. That’s the beauty of an open architecture: you have the power to pick and choose a technological foundation that’s right for your use cases. Here are just a few open-source table formats and compute engines that may suit you better than Iceberg or Trino, depending on your needs (a short, illustrative sketch follows each list):

Table Formats

Apache Hudi: Apache Hudi is a streaming data lake platform that replaces historically slow, old-school batch processing with a powerful incremental processing framework for low-latency, minute-level analytics. It is a great choice if you are streaming data from a CDC source.

Delta Lake: Delta Lake provides an optimized storage layer that serves as the foundation for storing data and tables in the Databricks lakehouse. If you are in the Databricks ecosystem, Delta Lake is the way to go.

Apache Paimon: Apache Paimon is a data lake format that makes it possible to build a real-time lakehouse architecture with Flink and Spark for both streaming and batch operations. If you are in the Apache Flink ecosystem for real-time analytics, consider Apache Paimon.
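To make the point concrete, here is a minimal PySpark sketch of how the table format can be a per-table write choice rather than a platform-wide commitment. It is illustrative only: it assumes a SparkSession that already has the Hudi, Delta Lake, and Iceberg connectors configured, and the catalog name `lake`, table names, and paths are all hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Assumes the Hudi, Delta Lake, and Iceberg Spark connectors are on the classpath
# and an Iceberg catalog named `lake` is configured (hypothetical names throughout).
spark = SparkSession.builder.appName("table-format-choice").getOrCreate()

events = spark.createDataFrame(
    [(1, "click", "2024-03-07 10:00:00"), (2, "view", "2024-03-07 10:01:00")],
    ["event_id", "event_type", "event_ts"],
)

# Apache Hudi: suited to CDC-style incremental upserts keyed on a record key.
(events.write.format("hudi")
    .option("hoodie.table.name", "events_hudi")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .mode("append")
    .save("s3://my-bucket/lake/events_hudi"))

# Delta Lake: the natural choice inside the Databricks ecosystem.
(events.write.format("delta")
    .mode("append")
    .save("s3://my-bucket/lake/events_delta"))

# Apache Iceberg: written through the configured Spark catalog.
events.writeTo("lake.db.events_iceberg").using("iceberg").createOrReplace()
```

The same DataFrame lands in three different table formats; the format choice is an option on the write path, not a decision that locks down the rest of your stack.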

Compute

Apache Spark: Apache Spark is an open-source unified engine for large-scale data processing. Spark is great if you prioritize stability for long-running ETL-like workloads over query latency.

Presto: Presto is an open-source SQL query engine that’s fast, reliable, and efficient at scale. Presto is great for running interactive/ad hoc queries against your open data lake.

StarRocks: StarRocks is a high-performance lakehouse SQL engine designed to run the most demanding workloads on the data lakehouse. If you need low-latency, high-concurrency, customer-facing SQL workloads directly on your open data lakehouse, StarRocks is a fantastic option. (Join the StarRocks community on Slack)
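As a companion to the table format sketch above, here is a minimal, heavily simplified Python sketch of the “swap the engine, keep the table” idea: the same Iceberg table is read once with Spark for a batch job and once through StarRocks for low-latency serving. It assumes an existing Iceberg table `db.events` in a shared catalog, a Spark catalog configured under the hypothetical name `lake`, and a StarRocks cluster with an Iceberg external catalog named `iceberg_lake` reachable on its MySQL-protocol port; hosts, names, and credentials are placeholders.

```python
from pyspark.sql import SparkSession
import pymysql  # StarRocks is reachable over the MySQL wire protocol

# Engine 1: Spark handles the long-running, ETL-style aggregation.
spark = SparkSession.builder.appName("engine-choice").getOrCreate()
spark.sql("""
    SELECT event_type, count(*) AS n
    FROM lake.db.events
    GROUP BY event_type
""").show()

# Engine 2: StarRocks serves the same Iceberg table for low-latency, high-concurrency
# queries. No data is copied; the external catalog simply points at the same metadata.
conn = pymysql.connect(host="starrocks-fe.example.com", port=9030, user="root", password="")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT event_type, count(*) FROM iceberg_lake.db.events GROUP BY event_type"
        )
        for event_type, n in cur.fetchall():
            print(event_type, n)
finally:
    conn.close()
```

Both engines read the same open table; which one you reach for depends on the workload, not on the name someone gave the architecture.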

The Takeaway

The real question for you isn’t which catchy name sounds cooler. It’s about whether these architectures — lakehouse or otherwise — truly fit your needs.

Remember, the best choice for your data isn’t about following the latest buzzword. It’s about keeping your options open while finding a combination of technologies that work for your unique situation.
