Demandbase Ditches Denormalization By Switching off ClickHouse

StarRocks Engineering
Oct 22, 2024

Demandbase is the leading account-based GTM platform for B2B enterprises to identify and target the right customers, at the right time, with the right message. With a unified view of intent data, AI-powered insights, and prescriptive actions, go-to-market teams can align and execute confidently. Thousands of businesses rely on Demandbase to maximize revenue, minimize waste, and consolidate their data and tech stacks into one platform. To do this effectively, Demandbase needs a high-performance data infrastructure that can handle growing and complex workloads.

The Problem: Growing Pains with the Previous Data Infrastructure

Demandbase previously used ClickHouse for its analytical workloads. As its data operations scaled, however, the company ran into performance and flexibility problems that affected both internal teams and customers.

Challenges with JOIN operations

ClickHouse could not perform multi-table queries efficiently at scale, forcing Demandbase to denormalize all data. This imposed several critical limitations:

  • Data became locked in a single-view format, making updates and schema changes difficult due to the need for extensive backfilling, especially with the large data volumes Demandbase handles.
  • Denormalized data wasted storage, consuming more than 10 times the space actually required.
  • The denormalization pipelines added overhead, demanding more compute resources and infrastructure.
  • Real-time analytics became impractical due to the complexity caused by these denormalization pipelines.
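
The backfilling problem described above can be illustrated with a small sketch. It uses Python's built-in SQLite purely as a stand-in (not ClickHouse or StarRocks), and the table and column names are hypothetical. It shows that changing one attribute touches a single row in a normalized layout but every copied row in a denormalized one, while a JOIN at query time returns the same answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized layout: account attributes live in exactly one place.
cur.execute("CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, industry TEXT)")
cur.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY, account_id INTEGER, action TEXT)")

# Denormalized layout: every event row carries a copy of the account attributes.
cur.execute(
    "CREATE TABLE events_flat "
    "(event_id INTEGER PRIMARY KEY, account_id INTEGER, industry TEXT, action TEXT)"
)

cur.execute("INSERT INTO accounts VALUES (1, 'Software')")
for event_id in range(1, 1001):
    cur.execute("INSERT INTO events VALUES (?, 1, 'page_view')", (event_id,))
    cur.execute("INSERT INTO events_flat VALUES (?, 1, 'Software', 'page_view')", (event_id,))

# The account's industry changes. Count how many rows each layout must rewrite.
normalized_rows_touched = cur.execute(
    "UPDATE accounts SET industry = 'Fintech' WHERE account_id = 1"
).rowcount
flat_rows_touched = cur.execute(
    "UPDATE events_flat SET industry = 'Fintech' WHERE account_id = 1"
).rowcount

# Both layouts answer the same question; the normalized one joins at query time.
joined = cur.execute(
    "SELECT a.industry, COUNT(*) FROM events e "
    "JOIN accounts a ON a.account_id = e.account_id GROUP BY a.industry"
).fetchall()
flat = cur.execute(
    "SELECT industry, COUNT(*) FROM events_flat GROUP BY industry"
).fetchall()

print(normalized_rows_touched, flat_rows_touched)
print(joined == flat)
```

At toy scale the difference is trivial, but the same ratio applied to the data volumes the article describes is what turns an attribute change into a large backfill.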

Scalability limitations

ClickHouse did not scale well and required manual configuration to handle increasing workloads. As a workaround, Demandbase deployed 49 ClickHouse clusters to serve multiple users, but this led to further issues:

  • The deployment of many ClickHouse clusters resulted in overprovisioned hardware, leading to massive resource waste.
  • The operational overhead of managing 49 clusters became a significant burden, draining the engineering teams’ time and resources.

To Overcome Scaling Limitations, Demandbase Needed a New Data Warehouse Solution

Demandbase sought a new data warehouse that would resolve the scalability and denormalization problems while delivering better performance at scale and real-time data freshness for customers. The team identified several key requirements for the new architecture:

  1. Robust Multi-Tenant Support: The solution needed to support thousands of tenants within a single database instance, ensure data isolation for privacy and security, and use partitioning to efficiently manage and segregate tenant data.
  2. Real-Time Data Updates: The data warehouse required support for row-level updates to provide better data freshness across different datasets.
  3. Scalability: The solution needed to support distributed SQL execution across the entire cluster so that it could handle dozens of concurrent queries and peak demand periods without performance bottlenecks.
  4. High-Performance JOINs: Given the complex JOIN query requirements, the new warehouse needed to handle large-scale JOINs while delivering query latencies on the order of seconds.

Among the options evaluated, CelerData Cloud stood out as the only solution that could meet all these requirements. It provided the scalability, performance, and flexibility Demandbase needed to serve its customer-facing applications at scale.

The Solve: Upgrading Demandbase’s Data Architecture

To overcome the challenges of their previous architecture, Demandbase implemented a modern data infrastructure that combined Apache Iceberg with CelerData Cloud, a high-performance data warehouse and a lakehouse query engine built on the StarRocks project.

The new system operates as follows:

  • Apache Iceberg data lakehouse: Demandbase’s centralized data store and source of truth for analytics data.
  • ETL Pipelines: Data flows from OLTP databases serving production systems through ETL pipelines, undergoing cleaning, transformation, and selective pre-computations before being ingested into Apache Iceberg.
  • CelerData Cloud: Data is ingested into CelerData Cloud, where it is stored in CelerData’s performance-optimized format.
  • Customer-facing applications: CelerData Cloud serves customer-facing applications directly via a MySQL connector, with query latencies on the order of seconds.
  • Reduced denormalization: CelerData Cloud’s efficient JOIN performance means denormalization is needed only for specific use cases, unlike the previous ClickHouse solution, which required it for all data. This leads to significant savings in storage and resources.
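
As a rough sketch of what serving an application over a MySQL connector with on-the-fly JOINs might look like, the snippet below builds a tenant-scoped JOIN query and runs it through a MySQL-protocol client. Everything specific here is an assumption rather than Demandbase's actual setup: the table and column names are hypothetical, PyMySQL is just one possible client, and port 9030 is the default query port of open-source StarRocks, which may differ in a given CelerData Cloud deployment.

```python
from typing import Any, List, Tuple

# Hypothetical schema: table and column names are illustrative only.
TENANT_ACTIVITY_SQL = """
SELECT a.industry, COUNT(*) AS events
FROM events AS e
JOIN accounts AS a
  ON a.tenant_id = e.tenant_id AND a.account_id = e.account_id
WHERE e.tenant_id = %s
GROUP BY a.industry
"""


def build_tenant_query(tenant_id: int) -> Tuple[str, Tuple[Any, ...]]:
    """Return the SQL text and bound parameters for one tenant."""
    return TENANT_ACTIVITY_SQL, (tenant_id,)


def run_tenant_query(tenant_id: int, host: str, user: str, password: str,
                     database: str, port: int = 9030) -> List[tuple]:
    """Run the JOIN at query time over the MySQL wire protocol.

    Requires the third-party PyMySQL package, imported lazily so that the
    query-building part of this sketch works without it.
    """
    import pymysql  # assumption: any MySQL-compatible client would do

    sql, params = build_tenant_query(tenant_id)
    conn = pymysql.connect(host=host, port=port, user=user,
                           password=password, database=database)
    try:
        with conn.cursor() as cur:
            # Parameters are bound by the driver, not interpolated by hand.
            cur.execute(sql, params)
            return list(cur.fetchall())
    finally:
        conn.close()


sql, params = build_tenant_query(42)
print(params)
```

Filtering both sides of the JOIN by tenant keeps each query scoped to a single tenant's data, which is one simple way to reflect the multi-tenant requirement in application queries.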

Results

By leveraging StarRocks’ on-the-fly JOIN capabilities, Demandbase successfully replaced its existing ClickHouse clusters, optimizing performance while significantly reducing costs across multiple areas:

  • Cluster Costs: The original ClickHouse deployment, 49 clusters of 3 nodes each, was replaced with a more efficient 45-node StarRocks cluster, resulting in a 60% reduction in hardware resources.
  • Storage Costs: By dramatically reducing denormalized data, Demandbase reduced storage costs by 90%.
  • ETL Costs: Eliminating the need for heavy ETL pipelines to maintain denormalized data simplified their data pipeline and reduced the associated operational burden.

These improvements enabled Demandbase to achieve a more scalable, cost-effective infrastructure without sacrificing query performance or data freshness for their customers.
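
The figures in the results list can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in the article. Note that it compares node counts alone; the reported 60% reduction is in hardware resources, which need not equal the node-count reduction if node sizes differ (the article does not say).

```python
# Figures taken from the article.
clickhouse_clusters = 49
nodes_per_cluster = 3
starrocks_nodes = 45

clickhouse_nodes = clickhouse_clusters * nodes_per_cluster
node_count_reduction = 1 - starrocks_nodes / clickhouse_nodes

# "More than 10 times the required space": treat 10x as a lower bound.
storage_blowup = 10
storage_reduction_if_removed = 1 - 1 / storage_blowup

print(clickhouse_nodes)                       # 147
print(round(node_count_reduction * 100))      # 69 (percent, by node count)
print(round(storage_reduction_if_removed * 100))  # 90 (percent)
```

A 10x storage overhead that is fully removed corresponds to a 90% reduction, which lines up with the reported storage savings.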

What’s Next

With denormalization no longer required, Demandbase can now focus on enhancing data freshness, which was previously hindered by the complexity of the ETL pipeline. This opens up opportunities to deliver even more real-time analytics and insights to their customers.

Looking ahead, Demandbase plans to explore using CelerData Cloud to query data directly from Apache Iceberg. This would simplify their data architecture by reducing the need for data ingestion and pre-processing, allowing for a more seamless and efficient connection between their data lake and customer-facing applications.

Join Us on Slack

If you’re interested in the StarRocks project, have questions, or simply seek to discover solutions or best practices, join our StarRocks community on Slack. It’s a great place to connect with project experts and peers from your industry.
