Tencent Games X StarRocks: The Road to Cloud Native
The largest gaming company on the planet, Tencent Games, is the video game division of Tencent Interactive Entertainment. Tencent Games has published titles such as Honor of Kings, one of the world’s most popular multiplayer games, and Call of Duty: Mobile. It also provides a platform to host games like PUBG and League of Legends.
With hundreds of games across game consoles, mobile phones, and websites, Tencent Games’ total data volume has exceeded 100 PB and continues to grow at a rate of 300 TB per day. The Tencent Public Data Platform Department (shortened as PDP in this article) provides a common data platform for hundreds of games across varied and complex business environments. PDP evaluated different technologies and eventually chose StarRocks as the data warehouse engine for its platform.
In that process, PDP worked with the StarRocks community to design and implement a new serverless solution to separate the storage and compute. This project has been merged into StarRocks’ main code branch and is being used at Tencent Games.
This article is based on the StarRocks Community Talk given by the PDP team, with some editing for clarification and formatting purposes.
StarRocks’ Solution for Separation of Storage And Compute
How Did the Solution Come About?
In StarRocks’ original architecture, storage and compute were tightly coupled: Backend Nodes (BEs) acted as both storage and compute nodes. As analytics scenarios became more diversified, we found that the bottleneck for many queries was compute (CPU and memory), not I/O. To improve user experience, we had to add more BEs, which brought storage resources we did not need and added extra data migration load. In addition, we wanted to isolate nodes for different types of analytics workloads. For example, some reports use relatively fixed SQL statements, while other workloads need custom SQL queries.
We created a new type of node, Compute Nodes (CN), from BEs and transformed StarRocks’ original architecture into a serverless architecture with decoupled storage and compute.
After separating storage and compute, we used Kubernetes (K8s) as the cloud O&M base for StarRocks. We developed a StarRocks Custom Resource Definition (CRD) and a StarRocks Controller based on K8s. The StarRocks CRD allows us to deploy the Operator and StarRocks by using declarative YAML files. The StarRocks Controller reconciles the cluster toward the desired state.
How It Works
Lifecycles of SQL Queries in StarRocks
In StarRocks, after an SQL request arrives, the optimizer in the frontend node (FE) generates a distributed execution plan tree and transforms it into plan fragments that can be directly executed by BEs. The plan fragments are then sent to the corresponding BEs.
We use simple aggregation as an example to illustrate this process. Before the separation of storage and compute, this process consists of the following steps:
- StarRocks reads data from corresponding BEs based on the location of data and performs initial aggregation. If data is located in external tables, the data is also read and aggregated by using these BEs.
- StarRocks generates a partition ID for each BE, groups the aggregated data based on the hash values of the aggregation keys, and sends the grouped data to BEs based on partition IDs.
- BEs receive the grouped data by using exchange nodes and aggregate the data for a second time.
- The aggregated data is sent to the BE which is elected as the result sink for final aggregation.
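The two-stage aggregation described above can be sketched in a few lines of Python. This is a simplified illustrative model, not StarRocks code; the node counts, data layout, and hash-modulo partitioning scheme are assumptions made for the example:

```python
from collections import defaultdict

def first_stage_aggregate(rows):
    """Each BE pre-aggregates the rows in its local tablets (initial aggregation)."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return partial

def shuffle(partials, num_nodes):
    """Group partial results by the hash of the aggregation key and route each
    group to the node that owns that partition ID (the exchange step)."""
    buckets = [defaultdict(int) for _ in range(num_nodes)]
    for partial in partials:
        for key, value in partial.items():
            buckets[hash(key) % num_nodes][key] += value
    return buckets

def final_merge(buckets):
    """The result sink concatenates the per-node results; each key lives in
    exactly one bucket, so a plain merge yields the final aggregation."""
    result = {}
    for bucket in buckets:
        result.update(bucket)
    return result

# Three BEs, each holding part of the data.
be_data = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5)],
]
partials = [first_stage_aggregate(rows) for rows in be_data]
result = final_merge(shuffle(partials, num_nodes=2))
print(dict(sorted(result.items())))  # {'a': 4, 'b': 7, 'c': 4}
```

The key point the sketch illustrates is that data is aggregated twice: once locally on each node, and once after being redistributed by key.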
Separating Compute Nodes
To separate compute capabilities from BEs, we performed the following steps:
- Create a dummy storage engine that stores only the StarRocks cluster ID.
- Mock InternalService and HttpService of BEs for CNs, and allow CNs to start from empty storage paths.
- Modify startup parameters and add startup scripts to allow CNs to run simultaneously with BEs.
Then, the compute capabilities of BEs are separated as stateless CNs. The entire SQL processing logic is changed after we configure scheduling parameters for CNs, as shown in the following figure.
We still use simple aggregation to illustrate the procedure, which is much simpler:
- An OLAP scan node (OlapScanNode) reads data from BEs where tablets reside and performs initial aggregation. An HDFS scan node (HdfsScanNode) distributes data evenly to each CN for data reading and initial aggregation.
- StarRocks generates a partition ID for each CN, groups the data aggregated by BEs and CNs based on the hash values of the aggregation keys, and sends the grouped data to the corresponding CNs based on partition IDs. CNs receive the grouped data by using exchange nodes (ExchangeNode) and aggregate the data for a second time.
- A CN is elected as the result sink to perform final aggregation.
Operation & Maintenance on Tencent Kubernetes Engine (TKE)
Traffic volume surges during festivals, promotional activities, and holidays, which brings a heavy load to the data analytics platform. After CNs are separated, they can be integrated with compute platforms of Tencent by using Tencent Kubernetes Engine (TKE). This enables scalable computing. Inspired by the concept of cloud native, we deployed CNs in a containerized manner and leveraged TKE for rapid container creation and scaling.
StarRocks Operator is used to monitor events related to custom resources in K8s clusters, such as resource creation, modification, and deletion. StarRocks Operator consists of the following two modules:
- StarRocks CRD: CRD is used to define the resource types of CN groups and deploy and manage CNs in K8s clusters by using declarative configurations. This facilitates the management of resource types and CN states.
- StarRocks Controller: Controller creates deployments by using declarative configurations to help manage cluster states. When we scale nodes by using declarative configurations, StarRocks Controller helps transition clusters to the desired state.
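A CN group might be declared with a YAML manifest along these lines. The API group, kind, and field names here are illustrative assumptions for the sketch, not the exact schema of the StarRocks CRD:

```yaml
apiVersion: starrocks.example.com/v1alpha1  # illustrative group/version
kind: ComputeNodeGroup                      # hypothetical CRD kind
metadata:
  name: cn-group-demo
spec:
  replicas: 3                  # desired number of CN pods
  image: starrocks/cn:latest   # CN container image
  resources:
    requests:
      cpu: "4"
      memory: 16Gi
  feEndpoint: fe-service:9030  # FE address the CNs register with on startup
```

With a declarative spec like this, scaling is just a change to `replicas`; the Controller observes the difference and drives the cluster to the new state.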
After containers start, they automatically call FE APIs to register the CNs with the cluster. Once CNs are deployed to K8s, we can use the K8s Horizontal Pod Autoscaler (HPA) to scale CNs horizontally, adding and removing nodes automatically. HPA works on monitored CPU and memory metrics. StarRocks Controller checks the metric values every 15 seconds; if the conditions for scaling are met, it sends a request to K8s to change the number of CNs. To prevent frequent scaling, HPA is limited to one scaling request every 5 minutes.
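The scaling policy just described (15-second metric checks, at most one scaling request every 5 minutes) can be modeled as a simple control loop. The thresholds below are assumptions for illustration; this is a behavioral sketch, not the actual controller code:

```python
import time

CHECK_INTERVAL_S = 15    # the controller polls metrics every 15 seconds
SCALE_COOLDOWN_S = 300   # at most one scaling request every 5 minutes

class ScalePolicy:
    def __init__(self, cpu_high=0.8, cpu_low=0.3):
        self.cpu_high = cpu_high          # assumed scale-out threshold
        self.cpu_low = cpu_low            # assumed scale-in threshold
        self.last_scale_at = float("-inf")

    def decide(self, cpu_utilization, now=None):
        """Return +1 (scale out), -1 (scale in), or 0 (do nothing)."""
        now = time.monotonic() if now is None else now
        if now - self.last_scale_at < SCALE_COOLDOWN_S:
            return 0                      # still in cooldown, skip this check
        if cpu_utilization > self.cpu_high:
            self.last_scale_at = now
            return +1
        if cpu_utilization < self.cpu_low:
            self.last_scale_at = now
            return -1
        return 0

policy = ScalePolicy()
print(policy.decide(0.95, now=0))    # 1  (scale out)
print(policy.decide(0.95, now=15))   # 0  (within the 5-minute cooldown)
print(policy.decide(0.10, now=400))  # -1 (cooldown elapsed, scale in)
```

The cooldown is what keeps a noisy metric from thrashing the cluster between sizes.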
Tiered Storage of Cold and Hot Data
After CNs are separated from BEs, CNs can be scaled on demand by using K8s HPA. To enhance the storage capability of BEs, we worked with the StarRocks community to develop tiered storage for cold and hot data. We moved cold data from BEs to less expensive stores, such as HDFS or Tencent Cloud Object Storage (COS). This not only reduces costs but also facilitates business development. We expect better integration of data lakes with data warehouses.
The Sinking of Cold Data
We use BEs to store hot business data. Cold business data is stored in more cost-effective stores such as HDFS and COS.
Tiered storage can efficiently use compute resources and isolate workloads in the following two scenarios:
- In BI scenarios, hot data is frequently accessed. The data is stored in BEs, and query results can be returned within seconds.
- Custom SQL queries are usually performed on cold data. We can use StarRocks Operator to generate a CN cluster for such queries.
The CN cluster can be scaled according to analytics workloads with just a few clicks. After custom SQL queries are complete, the CN cluster can be released. This ensures efficient use of compute resources.
Our solution combines Apache Iceberg and HDFS or COS. Partition-based data sinking can be achieved in four steps:
- Generate an ORC file on a BE and load the file to COS or HDFS by using Broker.
- Extract statistics from the ORC file, and add a partition by using the Iceberg API.
- Create a scheduled task to generate information about the partition where the data is to be stored. After the data arrives at the partition, it can be queried and analyzed by StarRocks.
- When a query is executed, the system automatically determines whether the data exists in the local or external storage based on the metadata of partitions and generates different scan nodes. To obtain data from both local and external storage, we can use a Union operator to combine the results for full data.
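The routing step above can be sketched as follows. The partition metadata layout and the scan-node names used here are simplifications for illustration:

```python
def plan_scans(partitions):
    """Split partitions into local (BE) and external (Iceberg on HDFS/COS)
    groups, mirroring how the planner generates different scan nodes."""
    local, external = [], []
    for p in partitions:
        (local if p["location"] == "local" else external).append(p["name"])
    return local, external

def execute(partitions):
    local, external = plan_scans(partitions)
    fragments = []
    if local:
        fragments.append(("OlapScanNode", local))     # scans BE tablets
    if external:
        fragments.append(("HdfsScanNode", external))  # scans cold storage
    # When both exist, a Union combines the two result sets into full data.
    return ("Union", fragments) if len(fragments) > 1 else fragments[0]

plan = execute([
    {"name": "p202301", "location": "local"},     # hot partition on BEs
    {"name": "p202201", "location": "external"},  # cold partition on COS/HDFS
])
print(plan)
```

If a query touches only hot or only cold partitions, no Union is needed and a single scan node is generated.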
We conducted a stress test to measure performance after tiered storage was applied. The results showed that decoupled storage and compute performed 50–100x worse than unified storage and compute in typical SQL scenarios. After analyzing the query profiles, we made the following optimizations:
- When an FE generates execution plans, it calls Iceberg’s planFiles operation multiple times to obtain statistics for the cost-based optimizer (CBO) and the range locations of HDFS nodes. These operations fetch Iceberg metadata and interact with remote storage, increasing the time consumed by plan generation. To address this issue, we added an Iceberg FileIO cache mechanism with minimal changes to the dependency packages. The data of Iceberg metadata.json, manifest, and manifest-list files is cached to accelerate the generation of execution plans.
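The FileIO cache can be sketched as a thin read-through wrapper: metadata.json, manifest-list, and manifest files are immutable once written, so their bytes can be cached safely by path. The class below is an illustrative sketch, not the actual Iceberg FileIO interface; `FakeRemoteIO` stands in for the real HDFS/COS reader:

```python
class CachingFileIO:
    """Read-through cache for immutable Iceberg metadata files.

    Assumes the wrapped IO exposes a read(path) -> bytes method.
    """

    def __init__(self, inner):
        self.inner = inner
        self.cache = {}
        self.remote_reads = 0

    def read(self, path):
        if path not in self.cache:
            self.remote_reads += 1  # only the first read hits remote storage
            self.cache[path] = self.inner.read(path)
        return self.cache[path]

class FakeRemoteIO:
    """Stand-in for a remote store; returns dummy bytes for any path."""
    def read(self, path):
        return b"metadata for " + path.encode()

io = CachingFileIO(FakeRemoteIO())
io.read("snap1/metadata.json")
io.read("snap1/metadata.json")  # served from cache, no remote round trip
print(io.remote_reads)          # 1
```

Because plan generation reads the same metadata files repeatedly, even this simple cache removes most of the remote round trips.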
- Frequent refreshes of Iceberg tables add a 100-ms delay. We ensure that a table is refreshed only once per SQL query and not refreshed otherwise.
- When debugging an SQL statement that queries both local and external data, we found the query was more time-consuming than querying the data separately, with a 5x to 8x performance gap. This is because the generated Union plan was not optimal. Specifically, if an execution plan contains only one scan node that involves an Agg or Project operator, the operator is pushed down into the scan node’s fragment. When a Union plan is generated, however, an exchange node is added to the physical execution plan, and full data is transmitted for each execution, which consumes a lot of time. To optimize the execution plan, we pushed the first stage of aggregation from the Agg operator down below the Union operator. This significantly reduces the amount of data transmitted by the exchange node and achieves performance close to directly querying external tables.
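The effect of pushing the first aggregation stage below the Union can be seen by comparing the number of rows the exchange node must carry. The toy model below assumes a simple SUM aggregation and made-up data sizes:

```python
from collections import Counter

local_rows = [("a", 1)] * 1000 + [("b", 2)] * 1000     # hot data on BEs
external_rows = [("a", 3)] * 1000 + [("c", 4)] * 1000  # cold data on COS/HDFS

def partial_agg(rows):
    """First-stage aggregation: collapse rows to one partial sum per key."""
    sums = Counter()
    for key, value in rows:
        sums[key] += value
    return list(sums.items())

# Without pushdown: the Union forwards every raw row through the exchange.
rows_exchanged_naive = len(local_rows) + len(external_rows)

# With pushdown: each side is pre-aggregated before the Union, so the
# exchange carries only one partial result per key per side.
pushed = partial_agg(local_rows) + partial_agg(external_rows)
rows_exchanged_pushed = len(pushed)

final = Counter()
for key, value in pushed:
    final[key] += value  # second-stage merge after the exchange

print(rows_exchanged_naive, rows_exchanged_pushed)  # 4000 4
print(dict(final))  # {'a': 4000, 'b': 2000, 'c': 4000}
```

The final result is identical either way; only the volume of data crossing the exchange node changes, which is where the observed speedup comes from.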
After these optimizations, we ran a comparison test on a single table containing 1 TB of data and 26 billion rows, with 12 CNs and 12 BEs allocated to the decoupled and the unified solutions respectively. Typical SQL statements were used in the comparison. The results show the performance gap narrowed from 50–100x to 5–10x, achieving our initial goal of balancing performance and cost.
The Future of StarRocks’ Cloud-Native Architecture
Separation of storage and compute marks the first step for StarRocks moving towards a cloud-native architecture.
- The independent, stateless CNs support scalability.
- Storage resources can be scaled on top of object storage.
- CNs can run queries on either hot storage (BEs) or cold storage (object storage).
- The data sinking mechanism supports data dumps between hot and cold storage.
In the long run, StarRocks will continuously improve the cloud-native architecture based on the community roadmap.
- Implement the multi-cluster solution to enable clusters to fulfill specific tasks. For example, a dedicated ETL cluster can be deployed to run large-scale extract, transform, load (ETL) jobs at night.
- Improve separation of storage and compute based on primary keys to power real-time updates within seconds.
- Improve the caching mechanism of decoupled storage and compute to deliver a query performance comparable to that of the unified one. Local caches are to be added on CNs to reduce the latency of remote data access.
- BEs will gradually evolve into a global cache shared by multiple clusters. This provides a universal query acceleration layer with complete operator pushdown capability for the serverless architecture.
- Separation of storage and compute will be applied to FEs to achieve a metadata management architecture more suitable for large, cloud-native data warehouses.
The preceding figure shows two types of architectures with decoupled storage and compute that will be supported by StarRocks:
- The left architecture is similar to the architecture of Snowflake. In this architecture, local caches exist at the compute layer, which can ensure high performance for cache hits. However, when a scaling operation is performed on the cluster, the cache data will be redistributed and remotely loaded, which affects cluster performance. This architecture is more suitable for business scenarios where high performance rather than high flexibility takes priority.
- In the right architecture, a shared global cache is added between the compute and storage layers. This provides compute capabilities, such as the operator pushdown capability, for all CNs at the upper layer. This way, scale-out can be realized within seconds, and stable performance can be ensured during scaling. In addition, compute resources can be allocated for each request in time and then released immediately after the request is complete, which helps realize auto-scaling and cuts costs. This architecture is suitable for business scenarios in which flexibility, as well as high performance, is required.
The two architectures help StarRocks meet all business requirements and achieve a new paradigm of unified data analytics.
Contributing to the StarRocks Community
We chose StarRocks as the OLAP analytics engine of Tencent Games’ public data platform. We also benefited from the open-source culture of the StarRocks community, which embraces openness, inclusiveness, and collaboration. While achieving service goals within Tencent Games, we have actively participated in the joint development of many features of StarRocks, such as the development of UDFs, query optimization of Iceberg external tables, and the development of StarRocks Operator.
As more businesses move onto StarRocks and our understanding of StarRocks deepens, we will invest more effort in improving execution planning, materialized views, the grouping logic of CNs, and cloud-native data warehouse services, fostering long-term collaboration with the StarRocks community.
A Solution Built for Enterprises
As an open-source solution, StarRocks can be downloaded for free from starrocks.io and used in production environments without any performance or capacity limit. We also encourage database developers and users to join the hundreds of contributors worldwide on our GitHub repository and participate in discussions in our Slack channel.
But if you’re in need of a more enterprise-ready solution that can deliver the fantastic price-performance numbers StarRocks promises, you’ll want to check out CelerData.
CelerData was founded by the original creators of StarRocks to help businesses enjoy the performance benefits StarRocks has to offer, accompanied by enterprise-scale features and support.
When it comes to deployment, CelerData provides two options:
- CelerData Enterprise — which is deployed in your data center or private VPC in the cloud
- CelerData Cloud — which is a SaaS cloud offering managed by CelerData.
You can learn more about CelerData by visiting celerdata.com.