Using Apache Paimon + StarRocks High-speed Batch and Streaming Lakehouse Analysis

11 min readMay 19, 2024

Originally published on Apache Flink Community April 30, 2024

This article is organized from the sharing of FFA 2023 by Riyu Wang, the senior development engineer of Alibaba Cloud Intelligence. The content is divided into the following four parts:

Development history of StarRocks + Apache Paimon lakehouse analysis
Main scenarios and technical principles of StarRocks + Apache Paimon lakehouse analysis
Performance test of StarRocks + Apache Paimon lakehouse analysis capabilities
Future planning for StarRocks + Apache Paimon lakehouse analysis capabilities

1. Development History of StarRocks + Apache Paimon Lakehouse Analysis

The development of StarRocks can be outlined into three key stages:

In the 1.x stage, the primary focus was on performance, which played a crucial role in establishing StarRocks as a popular choice. Today, in the majority of external table analysis scenarios, various formats like Hive, Hudi, and Iceberg, experience an improvement of over 3 to 5 times.
The 2.x stage was mainly concentrated on centralizing speed. This version sees all external table formats being accessed within StarRocks.
The 3.x stage emphasizes the fusion of lakehouse characteristics.

The evolution of StarRocks across these three major versions primarily consists of continual optimization of underlying performance and the significant characteristics of a lakehouse.

Initially, StarRocks gained popularity for its high-speed performance, boasting a vectorized execution engine, exceptional CBO optimizers, and diverse runtime filters.

StarRocks 1.x, which was open-sourced in 2021, supports external tables from Hive, Elasticsearch, and MySQL, as well as its own StarRocks external tables for query purposes.

StarRocks 2.x has made breakthroughs in the field of StarRocks lakehouse analysis, including the following points:

Catalog data directories are supported. In StarRocks 1.x, to query external tables, you need to create external tables based on the corresponding tables, just like ordinary tables which specify various columns and types. However, in the big data field, there are many databases and tables. If you create external tables one by one, the O&M costs and labor costs are high and it is not easy to maintain. Therefore, to introduce Catalog directories in StarRocks 2.x, you only need to specify one metadata address, such as Hive Metastore, DLF on Alibaba Cloud. StarRocks can directly synchronize all database and table information based on metadata.
StarRocks 2.x has supported three of the four types of data lakes, including Iceberg, Hudi, and Delta Lake. At that time, Apache Paimon was still incubating.
StarRocks 2.x has introduced JNI Connector to solve the problem of reading Hudi MOR table at the beginning because the big data ecology is basically by Java, but the main language of StarRocks is C++. The interaction between C++ and Java requires data conversion in memory, so it encapsulates abstract JNI Connector and comprehensively encapsulates all memory conversion and communication between C++ and Java.
Materialized external table views are supported. Many scenarios are based on StarRocks materialized views.
The reading of JSON, Map, Struct, Array, and other complex data types is supported. The performance has also been improved, including a large number of IO performance optimizations, such as Reader merging, Rowgroup merging, Data Cache, and delayed materialization.
The execution engine has been greatly modified to support the Pipeline execution engine, making its execution sequence more pipelined and having a very good balance between CPU and IO.
More complex statistical types are supported, including histograms.

In the StarRocks 3.x era, the main breakthroughs are:

It introduces support for the Apache Paimon data lake format and further optimizes support for the JNI Connector complex types. Apache Paimon is also developed in Java, so it uses the JNI Connector.
It supports the Spill of the operator dropping disk, which provides the basis for using StarRocks to perform ETL directly in the future. The amount of ETL data is usually large. Spill allows you to use only a small number of resources, such as a few nodes, to process large amounts of data when the memory is limited.
It improves the compatibility of Trino. Previously, there were certain differences between Presto or Trino and StarRocks MySQL. If you want to migrate to StarRocks, you need to modify the SQL job. After StarRocks 3.x introduces the compatibility of Trino, you only need to set the Session variable to directly run the original Trino job on StarRocks.
It supports partition refresh of materialized views. Before this, the entire table was refreshed. Now, the refreshed partitions can be automatically sensed, and only changed partitions can be refreshed. This reduces the resource consumption of materialized view refresh.
In addition to reading, StarRocks now supports Hive and Iceberg data writing.

2. Main Scenarios and Technical Principles of StarRocks + Apache Paimon Lakehouse Analysis

2.1 Main Scenarios

Compatibility of Trino

It supports the keywords, syntax, and function escaping of Trino SQL. During the usage, you can directly convert Trino SQL to StarRocks AST and generate an execution plan. You only need to execute the statement set sql_dialect = “trino” in StarRocks to enable it. In this way, all jobs can be directly moved to StarRocks without modifying any line of code. This feature has been officially implemented in the actual production environment, including partner and customer cases, and has realized more than 90% compatibility.

Federated Analytics

Since different lines of business use different lake formats and even have their own data sources, for some complex queries, you need to join different data sources for joint queries. Taking the use of Apache Paimon as an example, you can directly use CREATE EXTERNAL CATALOG Apache Paimon_fs_catalog to specify the address. Then, you can directly query the Apache Paimon data source table. Other data formats of tables are similar. When federated queries are required, you can directly join different types of tables, including StarRocks internal tables and external tables.

Transparent Acceleration

It mainly depends on the materialized view function. The reason for using materialized views is that StarRocks has its format, and many specific optimizations are performed based on its format. These performance optimizations depend on specific indexes and storage formats which need to be managed by StarRocks. With transparent acceleration, you can use the materialized view function of external tables to build materialized views of tables and partitions that need to be accelerated into StarRocks format to accelerate queries. In the case on the right, a materialized view of external tables is created and three tables are joined. However, in actual business, only one or two tables may be queried instead of all three tables. StarRocks can automatically rewrite SQL statements to hit materialized views to accelerate queries. This solves the pain points that BI statements are not easy to modify and SQL statements are not easy to tune. This process is imperceptible to users.

Data Modeling

It relies on the nested function of StarRocks materialized views, because the materialized views of StarRocks are not only accelerated for the source table but also they can create materialized views based on materialized views. That is the so-called nested form. This is the same as usual data modeling. The bottom layer is the object storage services, including HDFS, OSS, and HDFS-OSS. In data lake scenarios, the ODS layer mostly uses the lake format, such as Apache Paimon. After that, all layers above such as DWD, DWS, and ADS can continue to use materialized views to build. DWD is a direct materialized external table view, and DWS and ADS are nested materialized views. In this way, all layers can be connected and a system can be used to complete the entire data modeling work. In addition, you can directly use the transparent acceleration feature of materialized views regardless of the query layer.

Hot and Cold Fusion

Although the business is accumulating data all the time, what is often needed is only to query the data in the last few days or months, which is called hot data. The earlier and infrequently queried data is called cold data. When you make data houses, most of them are partitioned by time, such as day, month, and year. When creating a StarRocks materialized view, you can specify a TTL which is the effective time of the materialized view. For example, TTL = 3 months indicates that the partitions in the last three months are materialized, while other partitions are not materialized. All processes are automatic. In this way, cold data can be stored in low-cost OSS for a long time, and hot data can be queried in StarRocks more quickly. Of course, not all jobs only query hot data. Many jobs also query cold data. When you write SQL statements for your business, you do not need to pay attention to whether the data is cold or hot. For example, if you specify that the data for five months is queried, taking TTL = 3 months as the example, StarRocks automatically reads the data for three months from the materialized view. The data for the remaining two months is queried from the external table. You can write query statements for the external table in the same way as a normal query.

2.2 Technical Principles

The JNI Connector is a major feature contributed to the community by Alibaba Cloud in 2022. It now supports not only Apache Paimon, but also Hudi, AVRO, and RCFile formats. Since the BE side of StarRocks is written in C++ and all lake formats are written in Java, the JNI Connector can encapsulate the conversion processes between C++ Memory and Java Memory. Before this, it was necessary to write a layer of JNI program for each new format, which was difficult to debug and involved a significant workload. The JNI Connector abstracts all common functions and exposes three Java Reader interfaces: open, getNext, and close. When connecting to a new format, you only need to write a piece of Java code to implement these three interfaces and provide the basic and necessary information to the JNI Connector. The conversion of the remaining field types or the conversion between various memories, such as String and int, will be completed automatically. The main purpose of the JNI Connector is to quickly access various Java data sources to facilitate developer programming. Today, the JNI Connector has supported many formats, ranging from the original Hudi MOR Table to the current Apache Paimon, AVR0, Sequence, RCfile, and all complex types including Struct, Map, and Array. Access to new data sources can achieve zero intrusion of BE code without considering the specific implementation of C++.

Principles and process of specific memory conversion:

The core principle is that in the Java off-heap memory, for different field types, data is constructed in Java, which refers to the layout of StarRocks in BE storage. For example, fixed-length fields (4 bytes) such as int fields are stored in sequence in BE, and the Null indicator is used to specify whether the corresponding row is null. If it is null, it is not necessary to read. If it is not null, another 4 bytes are read, and each indicator is a byte. In this way, in the Java off-heap memory, after being constructed according to this layout, the block memory can be directly used by C++ by directly calling memory copy at the BE side.

The variable-length field has more offset indicators than the fixed-length field. Because the variable-length field cannot know the length of each row of fields in advance, the offset indicator needs to be used to store the starting address of each field. If the starting address of the first field is offset 0 and the starting address of the second field is offset 1, the total length of the first data is offset 1 — offset 0+1. Two links are given below, including the implementation of the latest Apache Paimon read and Hudi MOR read support.

3. Performance Test of StarRocks + Apache Paimon Lakehouse Analysis Capabilities

Test environment:

The EMR version of the test environment is EMR-5.15.1, with 1 master node and 3 core nodes. The configurations are the same, all of which are 16 vCPU 64 GiB. The Trino version is the 422 version newly released two months ago, and the StarRocks version is the main branch.

Test software configuration:

For the configuration of the test software, only the memory is modified because the memory machine is 64 GiB. Trino is changed to -Xmx50G, and StarRocks also uses -Xmx50G.

Test procedures:

When testing TPCH 100 GB, each query is tested three times. Take the average value without any analysis in advance or preheat, and put the data on HDFS.

Test results:

Looking at the chart, we can see that the comparison is particularly obvious. With the joint efforts of StarRocks kernel optimization and Apache Paimon, the StarRocks + Apache Paimon Lakehouse analysis capability is 15 times the query performance of the previous version.4. Future Planning for StarRocks + Apache Paimon Lakehouse Analysis Capabilities

4. Future Planning for StarRocks + Apache Paimon Lakehouse Analysis Capabilities

System table queries are supported:

Apache Paimon includes system tables with various metadata such as snapshots, files, and tugs, which are provided through system tables. Basic system table queries will be supported.

Time travel and snapshot queries are supported:

Data backtracking capability is supported.

The sink capability of the Apache Paimon external tables is supported:

Apache Paimon external tables are written using StarRocks.

The processing efficiency of date and datetime types is optimized:

Falls under the scope of the JNI Connector, as Apache Paimon is now read through the JNI Connector. JNI Connector currently needs to process a layer of string conversion between date and datetime, incurring some unnecessary overhead.

Access to a unified catalog:

By accessing a unified catalog, even if many user libraries have tables with different types, only one catalog needs to be created, reducing the need for separate catalogs for each type of table.

The metadata cache is used to accelerate queries:

Currently, all FEs do not perform metadata cache when processing Apache Paimon. Metadata cache can significantly reduce the number of IO requests and speed up the process.

Join Us on Slack

If you’re interested in the StarRocks project, have questions, or simply seek to discover solutions or best practices, join our StarRocks community on Slack. It’s a great place to connect with project experts and peers from your industry. You can also visit the StarRocks forum for more information.