Tencent Games X StarRocks: The Road to Cloud Native

Background

StarRocks’ Solution for Separation of Storage and Compute

How Did the Solution Come About?

How It Works

  1. StarRocks reads data from the BEs where the data resides and performs an initial (partial) aggregation. Data in external tables is also read and pre-aggregated by these BEs.
  2. StarRocks generates a partition ID for each BE, groups the pre-aggregated data by the hash values of the aggregation keys, and sends each group to the corresponding BE based on its partition ID.
  3. The BEs receive the grouped data through exchange nodes and aggregate it a second time.
  4. The aggregated data is sent to the BE elected as the result sink for final aggregation, as sketched in the example below.
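The following is a minimal, self-contained sketch of this two-stage aggregation flow, simulated in a single Java process: each "BE" pre-aggregates its local rows, groups are routed to buckets by hashing the aggregation key (standing in for partition IDs and exchange nodes), and a result sink collects the per-bucket results. All class and method names here are illustrative only; they are not StarRocks internals.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoStageAggSketch {

    // Stage 1: each BE pre-aggregates its local rows (here, a SUM per key).
    static Map<String, Long> partialAggregate(List<Map.Entry<String, Long>> localRows) {
        Map<String, Long> partial = new HashMap<>();
        for (Map.Entry<String, Long> row : localRows) {
            partial.merge(row.getKey(), row.getValue(), Long::sum);
        }
        return partial;
    }

    // Shuffle + stage 2: route every partially aggregated group to a bucket chosen by
    // hashing the key (the "partition ID"), merging on arrival. This mirrors the
    // exchange nodes plus second-stage aggregation in the real plan.
    static List<Map<String, Long>> shuffleAndMerge(List<Map<String, Long>> partials, int numBackends) {
        List<Map<String, Long>> buckets = new ArrayList<>();
        for (int i = 0; i < numBackends; i++) {
            buckets.add(new HashMap<>());
        }
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> group : partial.entrySet()) {
                int target = Math.floorMod(group.getKey().hashCode(), numBackends);
                buckets.get(target).merge(group.getKey(), group.getValue(), Long::sum);
            }
        }
        return buckets;
    }

    // Result sink: the buckets hold disjoint key sets, so the final step just collects them.
    static Map<String, Long> resultSink(List<Map<String, Long>> buckets) {
        Map<String, Long> result = new HashMap<>();
        for (Map<String, Long> bucket : buckets) {
            result.putAll(bucket);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> be1 = List.of(Map.entry("game_a", 3L), Map.entry("game_b", 1L));
        List<Map.Entry<String, Long>> be2 = List.of(Map.entry("game_a", 2L), Map.entry("game_c", 5L));
        List<Map<String, Long>> partials = List.of(partialAggregate(be1), partialAggregate(be2));
        // Prints the per-key sums: game_a=5, game_b=1, game_c=5 (map order may vary).
        System.out.println(resultSink(shuffleAndMerge(partials, 2)));
    }
}

Because the hash routing sends every group with the same key to the same bucket, the second-stage merge already produces the final value for each key; the result sink only needs to combine disjoint key sets.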
  1. Create a dummy storage engine to store only cluster IDs of StarRocks.
  2. Mock InternalService and HttpService of BEs for CNs, and allow CNs to start from empty storage paths.
  3. Modify startup parameters and add startup scripts to allow CNs to run simultaneously with BEs.
  1. An OLAP scan node (OlapScanNode) reads data from the BEs where the tablets reside and performs initial aggregation. An HDFS scan node (HdfsScanNode) distributes the data evenly across the CNs for reading and initial aggregation (a toy assignment sketch follows this list).
  2. StarRocks generates a partition ID for each CN, groups the data pre-aggregated by BEs and CNs based on the hash values of the aggregation keys, and sends the grouped data to the corresponding CNs based on partition IDs. The CNs receive the grouped data through exchange nodes (ExchangeNode) and aggregate it a second time.
  3. A CN is elected as the result sink to perform the final aggregation.
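As a toy illustration of how scan ranges might be spread evenly across CNs in step 1, the sketch below assigns file splits round-robin. ScanRange, assignScanRanges, and the file paths are hypothetical names for this write-up; the real scheduler also weighs data size and locality.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ScanRangeAssignmentSketch {

    // One split of an external file to be read by a CN.
    record ScanRange(String filePath, long startOffset, long length) {}

    // Round-robin assignment: each CN receives roughly the same number of ranges.
    static Map<String, List<ScanRange>> assignScanRanges(List<ScanRange> ranges, List<String> cnHosts) {
        Map<String, List<ScanRange>> assignment = new TreeMap<>();
        cnHosts.forEach(cn -> assignment.put(cn, new ArrayList<>()));
        for (int i = 0; i < ranges.size(); i++) {
            String cn = cnHosts.get(i % cnHosts.size());
            assignment.get(cn).add(ranges.get(i));
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<ScanRange> ranges = List.of(
                new ScanRange("hdfs://warehouse/dt=2022-10-01/part-0.orc", 0, 64_000_000),
                new ScanRange("hdfs://warehouse/dt=2022-10-01/part-1.orc", 0, 64_000_000),
                new ScanRange("hdfs://warehouse/dt=2022-10-01/part-2.orc", 0, 64_000_000));
        // Prints which CN reads which splits.
        System.out.println(assignScanRanges(ranges, List.of("cn-0", "cn-1")));
    }
}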
  • StarRocks CRD: The CRD defines CN groups as a custom resource type so that CNs can be deployed and managed in K8s clusters through declarative configurations. This simplifies the management of resource types and CN states.
  • StarRocks Controller: The controller creates Deployments from the declarative configuration and keeps the cluster state in line with it. When nodes are scaled by changing the declarative configuration, the StarRocks Controller transitions the cluster to the desired state, as the reconciliation sketch below illustrates.
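Conceptually, the controller runs a reconciliation loop: it compares the CN replica count declared in the CRD with what is actually running and converges the two. The sketch below shows only this idea; CnGroupSpec and CnCluster are hypothetical stand-ins for the CRD object and the Kubernetes API, not the actual StarRocks Operator code.

public class CnReconcilerSketch {

    // Desired state declared in the StarRocks CRD (e.g. spec.replicas).
    record CnGroupSpec(String name, int desiredReplicas) {}

    // Minimal view of the live cluster; a real controller would query the Kubernetes API.
    interface CnCluster {
        int runningReplicas(String groupName);
        void scaleTo(String groupName, int replicas);
    }

    // One reconcile pass: drive the observed state toward the declared state.
    static void reconcile(CnGroupSpec spec, CnCluster cluster) {
        int observed = cluster.runningReplicas(spec.name());
        if (observed != spec.desiredReplicas()) {
            // Scaling up adds CN pods; scaling down removes them. Either way the
            // declarative configuration, not an imperative command, is the source of truth.
            cluster.scaleTo(spec.name(), spec.desiredReplicas());
        }
    }
}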

Tiered Storage of Cold and Hot Data

The Sinking of Cold Data

  • In BI scenarios, hot data is frequently accessed. The data is stored in BEs, and query results can be returned within seconds.
  • Custom SQL queries are usually performed on cold data. We can use StarRocks Operator to generate a CN cluster for such queries.

Tiered Storage

  1. Generate an ORC file on a BE and upload the file to COS or HDFS by using Broker.
  2. Extract statistics from the ORC file and add a partition by using the Iceberg API (see the sketch after this list).
  3. Create a scheduled task that generates the information about the partition into which the data will sink. Once the data arrives in the partition, it can be queried and analyzed by StarRocks.
  4. When a query is executed, the system checks the partition metadata to determine whether the data resides in local or external storage and generates the corresponding scan nodes. If the data spans both local and external storage, a Union operator combines the results to return the full dataset.
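Step 2 can be pictured with the Iceberg Java API roughly as follows: a file that has already been uploaded to COS or HDFS is registered as a data file of the target partition and committed, after which the partition becomes visible to queries. This is a simplified sketch, not the production code; the catalog configuration, table name, file path, partition value, and statistics are placeholders, and error handling is omitted.

import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;

import java.util.Map;

public class SinkColdPartition {
    public static void main(String[] args) {
        // Placeholder catalog setup; a real deployment would carry full Hive/Hadoop configuration.
        HiveCatalog catalog = new HiveCatalog();
        catalog.initialize("hive", Map.of("uri", "thrift://metastore-host:9083"));

        Table table = catalog.loadTable(TableIdentifier.of("cold_db", "events"));

        // Statistics extracted from the ORC file (step 2); hard-coded here for illustration.
        DataFile dataFile = DataFiles.builder(table.spec())
                .withPath("hdfs://warehouse/cold_db/events/dt=2022-10-01/part-0001.orc")
                .withFormat(FileFormat.ORC)
                .withPartitionPath("dt=2022-10-01")      // the partition the cold data sinks into
                .withFileSizeInBytes(128L * 1024 * 1024)
                .withRecordCount(1_000_000L)
                .build();

        // Commit the new data file; the partition then becomes queryable through the external table.
        table.newAppend().appendFile(dataFile).commit();
    }
}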

Performance Enhancements

  1. When an FE generates execution plans, it calls Iceberg's planFiles operation multiple times to obtain statistics for the cost-based optimizer (CBO) and the locations of the file scan ranges on HDFS. These calls read Iceberg metadata from remote storage, which slows down plan generation. To address this, we add an Iceberg FileIO cache with minimal changes to the dependency packages: the contents of the metadata.json, manifest, and manifest-list files are cached to accelerate plan generation (a sketch follows this list).
  2. Frequent refreshes of Iceberg tables each add a delay of about 100 ms. We ensure that a table is refreshed only once per SQL query and is not refreshed otherwise.
  3. When we debugged an SQL statement that queries both local and external data, we found it 5x to 8x slower than querying either source separately, because the generated Union plan is not optimal. Specifically, when an execution plan contains only one scan node, its Agg or Project operator is pushed down into the scan node's fragment; but when a Union plan is generated, an exchange node is added to the physical plan, so the full data set is transmitted on every execution, which is time-consuming. To optimize the plan, we push the first-stage aggregation of the Agg operator down below the Union operator. This greatly reduces the data transmitted through the exchange node and achieves performance close to directly querying external tables.
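A minimal sketch of what such a FileIO cache could look like, assuming the standard Iceberg FileIO and InputFile interfaces: metadata files are fetched from remote storage once and then served from memory. CachingFileIO, CachedInputFile, and the path heuristic are illustrative names for this write-up, not the actual implementation described above.

import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.io.SeekableInputStream;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingFileIO implements FileIO {
    private final FileIO delegate;
    // Raw bytes of small metadata files (metadata.json, manifest lists, manifests), keyed by path.
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    public CachingFileIO(FileIO delegate) {
        this.delegate = delegate;
    }

    @Override
    public InputFile newInputFile(String path) {
        if (!isMetadataPath(path)) {
            return delegate.newInputFile(path);   // data files are never cached
        }
        byte[] bytes = cache.computeIfAbsent(path, p -> readFully(delegate.newInputFile(p)));
        return new CachedInputFile(path, bytes);
    }

    @Override
    public OutputFile newOutputFile(String path) {
        return delegate.newOutputFile(path);
    }

    @Override
    public void deleteFile(String path) {
        cache.remove(path);
        delegate.deleteFile(path);
    }

    // Heuristic: only files under the table's metadata directory are cached.
    private static boolean isMetadataPath(String path) {
        return path.endsWith("metadata.json") || path.contains("/metadata/");
    }

    private static byte[] readFully(InputFile file) {
        try (SeekableInputStream in = file.newStream()) {
            return in.readAllBytes();             // SeekableInputStream extends java.io.InputStream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Serves previously fetched bytes without another round trip to remote storage.
    private static final class CachedInputFile implements InputFile {
        private final String location;
        private final byte[] bytes;

        CachedInputFile(String location, byte[] bytes) {
            this.location = location;
            this.bytes = bytes;
        }

        @Override public long getLength() { return bytes.length; }
        @Override public String location() { return location; }
        @Override public boolean exists() { return true; }

        @Override
        public SeekableInputStream newStream() {
            return new SeekableInputStream() {
                private final ByteArrayInputStream in = new ByteArrayInputStream(bytes);
                private long pos = 0;

                @Override public long getPos() { return pos; }

                @Override
                public void seek(long newPos) {
                    in.reset();                   // back to offset 0
                    in.skip(newPos);              // then forward to the requested position
                    pos = newPos;
                }

                @Override
                public int read() {
                    int b = in.read();
                    if (b >= 0) pos++;
                    return b;
                }
            };
        }
    }
}

In practice such a cache would also need size limits and invalidation when new snapshots are committed; those concerns are omitted from the sketch.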

The Future of StarRocks’ Cloud-Native Architecture

  • The independent, stateless CNs support scalability.
  • Storage resources can be scaled on top of object storage.
  • CNs can run queries on either hot storage (BEs) or cold storage (object storage).
  • The data sinking mechanism supports data dumps between hot and cold storage.
  • Implement the multi-cluster solution so that individual clusters can be dedicated to specific tasks. For example, a dedicated cluster can be deployed to run large-scale extract, transform, and load (ETL) jobs at night.
  • Improve separation of storage and compute for primary key tables to power real-time updates within seconds.
  • Improve the caching mechanism of the decoupled storage-and-compute architecture to deliver query performance comparable to that of the coupled architecture. Local caches will be added on CNs to reduce the latency of remote data access.
  • BEs will gradually evolve into a global cache shared by multiple clusters. This provides a universal query acceleration layer with complete operator pushdown capability for the serverless architecture.
  • Separation of storage and compute will be applied to FEs to achieve a metadata management architecture more suitable for large, cloud-native data warehouses.
  • The architecture on the left is similar to that of Snowflake. Local caches exist at the compute layer, which ensures high performance when the cache is hit. However, when the cluster is scaled, the cached data has to be redistributed and reloaded from remote storage, which affects cluster performance. This architecture is better suited to business scenarios where high performance matters more than elasticity.
  • In the architecture on the right, a shared global cache is added between the compute and storage layers. It provides compute capabilities, such as operator pushdown, to all CNs at the upper layer. Scale-out can then be completed within seconds, and performance remains stable during scaling. In addition, compute resources can be allocated for each request and released immediately after the request completes, which enables auto-scaling and cuts costs. This architecture suits business scenarios that require both elasticity and high performance.

Contributing to the StarRocks Community

A Solution Built for Enterprises
