Herdwatch Enables Customer-Facing Analytics Using Apache Iceberg and CelerData Cloud

Feb 13, 2025

Authors:

Alfred Johnson, Head of Data, Herdwatch

At Herdwatch, our mission is to empower farmers across Europe with a platform that simplifies farm compliance, tracks animal lifecycle data, and improves operational efficiency. By centralizing information on livestock health, medication schedules, and farm productivity, we enable farmers to make data-driven decisions. As our operations expanded across Ireland, the UK, and beyond, we needed a scalable, unified data platform that could deliver real-time insights to our customers and handle growing analytics workloads.

The Challenges: Siloed Data and Slow Dashboards

In the past, our analytics workloads relied on region-specific MySQL RDS databases. While this worked initially, it soon created major issues:

Siloed Analytics: Each region had its own database, which meant users had to consult separate dashboards for insights. Unified reporting was impossible.
Performance Bottlenecks: Dashboards querying RDS replicas were slow, prone to timeouts, and hard to scale, which made them unsuitable for customer-facing use.

We knew it was time for a significant overhaul of our architecture.

Exploring Apache Iceberg: Eliminating Data Silos

To address these challenges, we decided to rebuild our data foundation with Apache Iceberg as the backbone. Iceberg gave us the ability to unify regional datasets and create a scalable, efficient analytics framework.

With Iceberg in place, we built a robust data lakehouse architecture that included:

Centralized Data Lake: All regional pipelines now store data in Iceberg tables on AWS S3, giving us a single source of truth.
ETL Pipelines: Using AWS Glue and dbt, we transformed raw data into bronze, silver, and gold layers, with cleaning and aggregation applied at each stage.
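
As a rough illustration of the bronze, silver, and gold layering, here is a minimal sketch in plain Python. It is not the Glue or dbt code described above. The record fields (`region`, `farm_id`, `animal_id`, `treated_on`, `dose_ml`) and the sample rows are hypothetical, chosen only to show how regional raw data can be cleaned and then aggregated into one unified table.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw (bronze) rows, as they might land from two regional sources.
BRONZE = [
    {"region": "IE", "farm_id": "f1", "animal_id": "a1", "treated_on": "2025-01-03", "dose_ml": "5"},
    # Exact duplicate of the row above.
    {"region": "IE", "farm_id": "f1", "animal_id": "a1", "treated_on": "2025-01-03", "dose_ml": "5"},
    {"region": "UK", "farm_id": "f9", "animal_id": "a7", "treated_on": "2025-01-04", "dose_ml": "2.5"},
    # Incomplete row: no treatment date.
    {"region": "UK", "farm_id": "f9", "animal_id": "a8", "treated_on": "", "dose_ml": "3"},
]


def to_silver(rows):
    """Silver layer: drop incomplete rows and exact duplicates, then parse types."""
    seen, silver = set(), []
    for r in rows:
        if not r["treated_on"]:
            continue
        key = (r["region"], r["farm_id"], r["animal_id"], r["treated_on"], r["dose_ml"])
        if key in seen:
            continue
        seen.add(key)
        silver.append({
            **r,
            "treated_on": date.fromisoformat(r["treated_on"]),
            "dose_ml": float(r["dose_ml"]),
        })
    return silver


def to_gold(silver):
    """Gold layer: one aggregate per (region, farm) across all regions."""
    agg = defaultdict(lambda: {"treatments": 0, "total_dose_ml": 0.0})
    for r in silver:
        g = agg[(r["region"], r["farm_id"])]
        g["treatments"] += 1
        g["total_dose_ml"] += r["dose_ml"]
    return dict(agg)


gold = to_gold(to_silver(BRONZE))
```

The point of the layering is that dashboards read only the small, pre-aggregated gold output and never touch the raw regional rows.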

Exploring Athena: Limited Success

Initially, we tried AWS Athena as the analytics engine for querying Iceberg data. Its serverless model allowed us to get started quickly, but we ran into several issues:

High Latency: Dashboard queries often took 2–5 minutes to load — far too slow for interactive, customer-facing analytics.
Limited Configurability: Athena’s serverless design left us with little room for query optimization.
Cost Inefficiency: With frequent, complex dashboard queries, Athena’s pay-per-scan model became increasingly expensive.
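
To see why pay-per-scan pricing adds up for dashboards, consider the back-of-the-envelope sketch below. The price per terabyte, the per-query billing minimum, and the binary terabyte unit are assumptions to check against current AWS pricing for your region. The scan size and query volume in the usage note are hypothetical and are not Herdwatch figures.

```python
# Assumed pricing inputs; verify against current AWS Athena pricing.
PRICE_PER_TB_USD = 5.00
MIN_BYTES_PER_QUERY = 10 * 1024**2  # assumed per-query billing minimum (10 MB)
TB = 1024**4                        # assumed binary terabyte for simplicity


def scan_cost_usd(bytes_scanned_per_query, queries):
    """Estimate the cost of `queries` identical queries under pay-per-scan pricing."""
    billed = max(bytes_scanned_per_query, MIN_BYTES_PER_QUERY)
    return queries * billed / TB * PRICE_PER_TB_USD


# Hypothetical workload: 10,000 dashboard queries a day, each scanning 2 GiB.
daily_cost = scan_cost_usd(2 * 1024**3, 10_000)
```

Under these assumptions the estimate comes to roughly 97.66 USD per day. Every repeated dashboard load pays for the scan again, which is the cost pattern that caching and pre-aggregation avoid.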

It was clear we needed a more robust solution to meet our performance and cost requirements.

The Solution: Building a Modern Data Stack

This is when we discovered StarRocks — a lakehouse query engine purpose-built for customer-facing analytics on open table formats like Apache Iceberg and Delta Lake. After thorough testing, we adopted a modern architecture combining Iceberg and StarRocks to tackle our analytics challenges:

Centralized Data Lake: All regional pipelines now flow into a unified Iceberg data lake on AWS S3.
ETL Pipelines: AWS Glue and dbt transform data into bronze, silver, and gold layers, streamlining aggregation and optimization.
StarRocks as Query Layer: StarRocks dramatically accelerates queries on the gold layer, powering both internal BI tools and customer-facing applications.
Materialized Views: For complex queries, materialized views pre-aggregate data, ensuring optimal performance.
Customer-Facing Dashboards: Dashboards directly query StarRocks via MySQL-compatible APIs, delivering sub-second response times.
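
The query layer described above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production configuration: the catalog name `iceberg_lake`, the region `eu-west-1`, the table `gold.treatments`, the view `farm_treatment_summary`, its columns, the hourly refresh interval, and the choice of AWS Glue as the Iceberg catalog are all hypothetical. The statements follow StarRocks' documented SQL for external Iceberg catalogs and asynchronous materialized views, and the helper accepts any Python DB-API cursor from a MySQL client library.

```python
# Register the Iceberg data lake as an external catalog (hypothetical names and region).
CATALOG_DDL = """
CREATE EXTERNAL CATALOG iceberg_lake
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "glue",
    "aws.glue.use_instance_profile" = "true",
    "aws.glue.region" = "eu-west-1",
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.region" = "eu-west-1"
)
"""

# Pre-aggregate a hypothetical gold table into an asynchronously refreshed materialized view.
MV_DDL = """
CREATE MATERIALIZED VIEW farm_treatment_summary
DISTRIBUTED BY HASH(farm_id)
REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
AS
SELECT farm_id, COUNT(*) AS treatments, SUM(dose_ml) AS total_dose_ml
FROM iceberg_lake.gold.treatments
GROUP BY farm_id
"""

# Parameterized dashboard query; %s is the placeholder style used by common MySQL drivers.
DASHBOARD_QUERY = (
    "SELECT farm_id, treatments, total_dose_ml "
    "FROM farm_treatment_summary WHERE farm_id = %s"
)


def farm_summary(cursor, farm_id):
    """Fetch one farm's pre-aggregated summary through a DB-API cursor."""
    cursor.execute(DASHBOARD_QUERY, (farm_id,))
    return cursor.fetchall()
```

Because StarRocks speaks the MySQL wire protocol, a dashboard backend can obtain such a cursor from an ordinary MySQL client library pointed at the StarRocks frontend, with no engine-specific driver.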

The Results: Faster Analytics, Lower Costs, Better Governance

To further enhance our architecture, we adopted CelerData Cloud, the cloud-managed version of StarRocks. With features like multi-warehouse support and advanced security controls, our new setup delivered transformative improvements:

Unified Analytics:

Consolidated all regional datasets into a single source of truth, enabling unified reporting and simplified governance.
Eliminated the complexity of region-specific dashboards.
Served both customer-facing dashboards and internal BI use cases from the same platform.

Improved Performance:

Reduced query latency from 2–5 minutes (Athena) to between 700 milliseconds and 1.5 seconds (StarRocks).
Achieved sub-second load times for dashboards with multiple queries per page.

Cost Savings:

Moved off Athena’s pay-per-scan pricing and onto StarRocks, whose caching reduces repeated S3 reads and minimizes scan costs.
Used materialized views to reduce the need for on-the-fly joins, lowering compute usage.

Scalability:

Supported analytics for millions of livestock records across thousands of farms, with room to grow.

Operational Efficiency:

Unified our stack with Iceberg and StarRocks’ MySQL compatibility, reducing maintenance overhead.
Simplified developer onboarding with centralized data governance and lineage.

What’s Next for Herdwatch?

Looking ahead, we’re excited to expand our analytics capabilities:

IoT Integration: Incorporating real-time telemetry from wearable livestock devices for instant insights on health, location, and productivity.
Direct Iceberg Querying: Exploring StarRocks’ ability to query Iceberg tables directly, simplifying our architecture further and reducing latency.
Advanced Use Cases: Scaling to support more complex analytics for enterprise customers.

Join Us on Slack
If you’re curious about StarRocks or want to learn best practices, join the StarRocks community on Slack. We’d love to hear from you!