StarRocks Best Practice Guide 2/5: Data Modeling

5 min readJun 7, 2024

Over my years as a DBA and StarRocks contributor, I’ve gained a lot of experience working alongside a diverse array of community members and picked up plenty of best practices. In this time, I’ve found five specific models that stand out as absolutely critical: deployment, data modeling, data ingestion, querying, and monitoring.

In my previous article I shared some tips on StarRocks deployment, in this one, I’ll be explaining the finer points of data modeling.

PART 02: Data Modeling

Table Creation Standards

Only UTF8 encoding is supported.
Column renaming is not supported (support coming soon).
Maximum VARCHAR length: 1048576
KEY columns cannot use FLOAT, DOUBLE types.
Data directory names, database names, table names, view names, usernames, role names are case-sensitive; column names and partition names are not case-sensitive.
In the primary key table, the total size of primary key columns should not exceed 128 bytes.

What table type should I use?

Use the duplicated key table to retain detailed records (fact table), .
use the primary key table for tables that have a unique column that is non-null, have frequent write and read, and non-primary key columns utilize indexes,
Use the unique key table for tables with a unique column that may be null, have frequent write, and infrequent read.
use the aggregate table to retain only aggregated data.
See the doc for a detailed comparison between different table types.

Choosing Sorting Columns and Prefix Indexes

For versions before StarRocks 3.0, the primary key table specifies sorting columns through PRIMARY KEY. For StarRocks 3.0 and later, sorting columns in primary key models are specified through ORDER BY.
Prefix index is a sparse index introduced on top of sorting columns to further improve query efficiency. All prefix indexes are loaded into memory for optimal performance. When using prefix indexes, consider:
Columns frequently used as query conditions should be sorting columns. For instance, placing user_id first if queries often filter by user_id.
Sorting columns should ideally be 3–5 columns. Too many can increase sorting overhead and slow down data ingestion.
Prefix indexes should not exceed 36 bytes and must not include more than 3 columns. VARCHAR columns will be truncated in prefix indexes, and FLOAT or DOUBLE types are not allowed.

Therefore, in combination with actual business query scenarios, when determining the Key columns and the order of fields, the advantages of prefix indexes should be fully considered. Order the Key columns that are frequently queried at the front as much as possible, and try to choose data types like date or integer types such as int for these fields

Example:

CREATE TABLE site_access (
    site_id BIGINT DEFAULT '10',
    city_code INT,
    site_name VARCHAR(50),
    pv BIGINT DEFAULT '0'
)
DUPLICATE KEY(site_id, city_code, site_name)
DISTRIBUTED BY HASH(site_id);

In the ‘site_access’ table, the prefix index consists of site_id (8 Bytes) + city_code (4 Bytes) + site_name (first 24 Bytes).

select sum(pv) from site_access where site_id = 123 and city_code = 2;

If the query condition only includes the columns site_id and city_code, as shown below, it can significantly reduce the number of data rows that need to be scanned:

select sum(pv) from site_access where site_id = 123 and city_code = 2;

If the query condition only includes the column site_id, as shown below, it can specifically target rows containing only site_id:

select sum(pv) from site_access where site_id = 123;

If the query condition only includes the column city_code, as shown below, all data rows need to be scanned, and the sorting effect is greatly reduced:

select sum(pv) from site_access where city_code = 2;

If the ratio of combined queries of site_id and city_code to queries of city_code alone is about the same, consider creating a synchronized materialized view to adjust the column order to enhance query performance. In this case, we place the city_code column first in the materialized view.

create materialized view site_access_city_code_mv asselect
    city_code,
    site_id,
    site_name,
    pv
from
    site_access;

Bad case:

CREATE TABLE site_access_bad
(
    site_name VARCHAR(20),
    site_id BIGINT DEFAULT '10',
    city_code INT,
    pv BIGINT DEFAULT '0'
)
PRIMARY KEY(site_id)
DISTRIBUTED BY HASH(site_id)
ORDER BY(site_id,city_code);

In the ‘site_access_bad’ table, the prefix index only includes site_name.

Partitioning Choices

Recommended: Use a non-changing time column for WHERE filtering for partition creation.
Recommended: Choose dynamic partitions for scenarios with data expiration needs.
Required: Create partitions for data with distinct hot and cold characteristics, e.g., frequently updated data from the past week could be partitioned daily.
Required: Each partition must not exceed 100GB.
Required: Tables over 50GB or 5 million rows should have partitions.
Recommended: Create partitions as needed; avoid creating a large number of empty partitions to prevent excessive FE memory usage from metadata.
Currently supports time (Range, Expression partitions), string (List partitions), and number (Range, List partitions).
By default, up to 1024 partitions are supported, and adjustable via settings, though typically unnecessary.

Bucketing Choices

Required: Production must use 3 replicas.

Bucket Count Decision:

Required: Estimate each bucket as 1GB, with raw data estimated at 10GB (compression ratio of 7:1 to 10:1). If the calculated bucket count is less than the number of BE nodes, the final bucket count should match the number of BE nodes, e.g., 6 BE nodes would result in 6 buckets.
Required: Do not use dynamic bucketing for non-partitioned tables; estimate bucket count based on actual data volume.
Required: Avoid dynamic bucketing if partitions within a partitioned table vary significantly in data size.

Bucketing and how to avoid data skew :

Recommended: If a bucketing column is frequently used in WHERE conditions and has low duplication (e.g., user IDs, transaction IDs), then it can be used as a bucketing column.
Recommended: If queries often include city_id and site_id together, and city_id has low cardinality, simply using city_id as a bucketing column could lead to significant data skew. In this case, consider combining city_id and site_id as bucketing fields. However, this approach may compromise on bucketing benefits when queries only include city_id.
Recommended: If no suitable fields are available for dispersing data as bucketing fields, consider using random bucketing, although this eliminates the benefits of bucket trimming.
Required: For joins involving two or more tables each with over a thousand rows, consider using Colocate join.

Field Types

Recommended: Avoid using null properties.
Required: Ensure the correct data types for time and numeric columns are used. Using inappropriate data types, such as storing time data like “2024–01–01 00:00:00” in VARCHAR, can significantly increase computational costs and hinder internal optimizations like Zonemap indexing.

Indexing Choices

Bitmap Index:

Suitable for columns with cardinality between 10,000 and 100,000.
Appropriate for columns queried with equality conditions (=) or within a specific range ([NOT] IN).
Not supported for FLOAT, DOUBLE, BOOLEAN, and DECIMAL types.

Bloom Filter Index:

Suitable for columns with a cardinality over 100,000 with low duplication.
Appropriate for queries with in and = conditions.
Not supported for TINYINT, FLOAT, DOUBLE, and DECIMAL types.
In detail and primary key models, all columns can have Bitmap indexes; in aggregate and update models, only Key columns support Bitmap indexing.

This sums up my advice for data modeling, but there’s a lot more to share. Head on over to my third article in this series that will take a look at data ingestion with StarRocks.

StarRocks Best Practices 1/5: Capacity Planning and Deployment

StarRocks Best Practice Guide 3/5: Data Ingestion

StarRocks Best Practice Guide 4/5: Query

StarRocks Best Practice Guide 5/5: Monitoring

Join Us on Slack

If you’re interested in the StarRocks project, have questions, or simply seek to discover solutions or best practices, join our StarRocks community on Slack. It’s a great place to connect with project experts and peers from your industry. You can also visit the StarRocks forum for more information.