Xiaohongshu Builds Integrated Advertising Analytics Platform with StarRocks

Journey of Improving OLAP Engines

  • Redshift can not achieve auto-scaling without compromising online SQL query performance. That means auto-scaling may cause data redistribution and compromise the performance and availability of clusters.
  • Parallel ETL operations may severely undermine the availability of clusters. When multiple ETL tasks run in parallel on Redshift, these tasks may preempt resources, which reduces data analysis efficiency and causes queries to time out. Even worse, the entire cluster might become unavailable due to overload.
  • The lack of a storage-compute separation architecture results in bottlenecks in storage capacity, which cannot accommodate the increasing amount of business data.

Current Data Analysis Architecture

Data Collection Layer

  • Website event tracking data that is pulled by Flume from server logs and app logs is distributed to Amazon S3 and Kafka.
  • Data changes in MySQL binary logs are captured by Canal, an open-source middleware developed by Alibaba Group.

Data Storage and Processing Layer

  • Offline data processing: We useHive and Spark’s high scalability and batch processing capability to perform ETL operations and data modeling in offline data warehouses. We also use scheduling tools to pull offline tables into ClickHous or StarRocks.
  • Online data processing: Flink helps build a streaming ETL pipeline where data can be enriched, joined, and aggregated in real time. Data from ETL pipelines can be synchronized through Flink connectors to ClickHouse or StarRocks.

Data Analysis Layer

Data Application Layer

Comparison of OLAP Databases

ClickHouse

  • Analysis of log data and data in flat tables
  • Supports single-table queries and ad hoc queries of flat tables.
  • Provides the MergeTree engine family and supports pre-aggregation.
  • Manages huge amounts of log data.
  • Users must create a local table on each shard and then map it to the distributed table.
  • ClickHouse uses bucketing to organize data. It heavily relies on the performance of Zookeeper, its metadata store.
  • ClickHouse periodically performs compaction in the background to delete duplicate primary keys, consuming server resources and degrading performance.
  • Flink sink for Clickhouse only provides the at-least-once semantics. ClickHouse cannot automatically choose between LocalJoin or GlobalJoin.
  • ClickHouse doesn’t support UPDATE or DELETE commands at the moment.
  • Unsatisfactory join performance.Not suitable for high concurrency scenarios.

StarRocks

  • Analysis of real-time data for various data applications
  • Excellent performance in single-table and multi-table queries, support for flat-table queries and complex queriesExcellent in query concurrency
  • Support for real-time ETL and micro-batch processing
  • Outstanding in batch and streaming data processing
  • Compatible with MySQL protocols and standard SQL
  • Materialized views do not support filter clauses, such as WHERE and CASE WHEN.
  • Flink writes data in JSON format into StarRocks, which consumes more resources.
  • Routine Load jobs read data from Kafka to StarRocks. During this process, StarRocks uses its own store to save offsets, instead of using Kafka’s built-in store. When a Routine Load job fails, users must manually check and reset the offset to restart the Routine Load job.
  • When there are massive aggregated columns in StarRocks, COUNT queries must scan a large amount of data, which may cause node failures.
  • StarRocks ecosystem is still in development.

TiDB

  • Data analysis in OLTP and OLAP scenarios
  • Support for UPDATE and DELETE commands
  • Suitable for both OLTP and OLAP scenarios
  • Support for exactly-once semantics in Flink and idempotent operations
  • The performance in OLAP scenarios is not as good as ClickHouse and StarRocks.
  • Pre-aggregation is not supported.
  • The query performance in OLAP scenarios is not satisfactory.
  • TiDB does not support real-time pre-aggregation.
  • TiFlash supports only some of the SQL statements and functions.

StarRocks Powers Advertising Analytics for Xiaohongshu

Scenario Description

Previous Solution

  • Perform JOIN operations on data streams, such as news feed ads and ads from algorithm-based recommendations.
  • Detect and prevent commercial cheating.
  • Write aggregate data in different scenarios to different databases.
  • It is difficult to aggregate complex and disorganized data.
  • ClickHouse is not suitable for high concurrency scenarios and the storage capacity cannot scale on demand, which presents a challenge for the future.
  • Complex Flink jobs cannot ensure high availability.

Solution Offered by StarRocks

  • High throughput
  • Support for multidimensional queries
  • TP99 latency less than 100 ms
  • Materialized views
  • Capable to handle over 10,000 QPS
  • Capable to extract data from MySQL binary logs in real time and offer data encapsulation
  • Support for multiple-table join
  • Well-designed data processing architecture. Flink focuses on data cleaning. The business logic implementation is moved from Flink into StarRocks.
  • Unified and consistent advertising statistics through standardized data ingestion and transformation platform.
  • High QPS. The dual-active architecture can better support high-QPS workloads.

Database Design

  • Data on advertising exposures and clicks are written to the aggregate key model. This model helps pre-aggregate data of multiple dimensions and generates metrics on demand during the query process. Dimensions are data attributes with non-quantitative values, such as advertisers, ad types, ideas, ad units, search keywords, regions, and user attributes.
  • The primary key model is responsible for updating real-time ads data extracted from MySQL databases.
  • The duplicate key model can process data from offline data warehouses.

Data Ingestion

  • For data that is processed by ETL jobs, Flink is used to convert the processing logic of such data and write the data to StarRocks by using the StarRocks-Flink connector.
  • For data that is stored in real-time data warehouses, StarRocks Routine Load directly schedules one batch of such data to StarRocks every 10 seconds.

Data Query

Maintenance

Conclusion

--

--

--

A modern OLAP database enabling blazing fast and unified analytics. https://github.com/StarRocks/starrocks

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

The 11 best apps for your new Mac — My List

Elmedia Player

RSpec Cheat Sheet

ENS Is Now In Chinese — How You Can Help Us Get More Languages

CHUCHEL

An Introduction to Identity Access Management (IAM) in AWS

Salesforce Process Builder Best Practices

A SYNOPSIS ON CRODOS: Finding New Stars In The Cronos Network Ecosystem

Message Processing in PHP — Symfony Messenger, Laravel Queues and Ecotone

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
StarRocks

StarRocks

A modern OLAP database enabling blazing fast and unified analytics. https://github.com/StarRocks/starrocks

More from Medium

StarRocks Provides Ultra-fast User Behavior Analysis with Greater Business Value

Metadata — Meet Big Data’s Little Brother

Apache Kafka Landscape for Automotive and Manufacturing