StarRocks Best Practice Guide 3/5: Data Ingestion
Over my years as a DBA and StarRocks contributor, I’ve gained a lot of experience working alongside a diverse array of community members and picked up plenty of best practices. In this time, I’ve found five specific models that stand out as absolutely critical: deployment, data modeling, data ingestion, querying, and monitoring.
In my previous article, I shared some tips on StarRocks data modeling. In this one, I'll cover data ingestion.
PART 03: Data Ingestion
Usage Recommendations
- Required: Do not use INSERT INTO VALUES() for production data ingestion.
- Recommended: A minimum interval of 5 seconds between ingestion batches.
- Recommended: For update scenarios in primary key tables, consider enabling the persistent index. Note: use this only if you have high-performance storage such as NVMe SSDs.
- Recommended: For scenarios with frequent ETL operations (INSERT INTO SELECT), consider enabling the spill-to-disk feature to avoid exceeding memory limits.
- Recommended: When batch-ingesting into a partitioned table, especially when loading a large volume of historical data from Iceberg, Hudi, or Hive, ingest partition by partition to avoid generating small files.
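The partition-by-partition recommendation above can be sketched as a small Python helper that builds one INSERT INTO ... SELECT statement per day instead of a single statement covering the whole history. This is a minimal sketch, not a definitive implementation: it assumes the target table is partitioned by day on a date column, and the table names (sales_fact, hive_catalog.ods.sales) and the column name (dt) are hypothetical placeholders.

```python
from datetime import date, timedelta


def daily_partitions(start: date, end: date):
    """Yield each day from start (inclusive) to end (exclusive)."""
    day = start
    while day < end:
        yield day
        day += timedelta(days=1)


def backfill_statements(target: str, source: str, date_col: str,
                        start: date, end: date) -> list:
    """Build one INSERT INTO ... SELECT statement per daily partition.

    Assumption: the target table is partitioned by day on date_col, so each
    statement writes into exactly one partition.
    """
    statements = []
    for day in daily_partitions(start, end):
        next_day = day + timedelta(days=1)
        statements.append(
            f"INSERT INTO {target} "
            f"SELECT * FROM {source} "
            f"WHERE {date_col} >= '{day.isoformat()}' "
            f"AND {date_col} < '{next_day.isoformat()}'"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for sql in backfill_statements(
        "sales_fact", "hive_catalog.ods.sales", "dt",
        date(2024, 1, 1), date(2024, 1, 4),
    ):
        print(sql + ";")
```

Running the statements one after another, rather than all at once, keeps each load confined to a single partition, which is the point of the recommendation.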
Data Lifecycle
- Recommended: Use TRUNCATE to delete data rather than DELETE.
- Required: Full update syntax is only available in version 3.0 and later of the primary key model; high-concurrency updates are prohibited, and it is recommended that each update operation be spaced by at least one minute.
- Required: If using DELETE to remove data, it must include a WHERE clause, and concurrent deletes are prohibited. For example, avoid executing 1000 separate DELETE FROM tbl1 WHERE id=1 statements; instead, use DELETE FROM tbl1 WHERE id IN (1,2,3,...,1000).
- Required: By default, a DROP operation moves the dropped object to the FE trash, where it is retained for 86400 seconds (1 day) and can be recovered during that time, which guards against accidental deletions. This behavior is controlled by the catalog_trash_expire_second parameter. After one day, files move to the BE's trash directory, where they are retained for 259200 seconds (3 days) by default.
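To illustrate the DELETE guidance above, here is a minimal Python sketch that collapses many single-row deletes into a few statements with an IN list. The function name build_batched_deletes and the batch size of 1000 are assumptions chosen to mirror the example in the text, not StarRocks defaults, and the sketch assumes integer keys.

```python
def build_batched_deletes(table: str, key: str, ids, batch_size: int = 1000) -> list:
    """Group key values into DELETE ... WHERE key IN (...) statements.

    Assumptions: keys are integers, and batch_size (1000 here, matching the
    example in the text) is an illustrative value rather than a StarRocks limit.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # int() rejects non-numeric input, so arbitrary text cannot reach the SQL.
    values = [int(v) for v in ids]
    statements = []
    for i in range(0, len(values), batch_size):
        chunk = values[i:i + batch_size]
        id_list = ", ".join(str(v) for v in chunk)
        statements.append(f"DELETE FROM {table} WHERE {key} IN ({id_list})")
    return statements


if __name__ == "__main__":
    # 2500 ids become 3 statements instead of 2500 separate deletes.
    for sql in build_batched_deletes("tbl1", "id", range(1, 2501)):
        print(sql[:60] + " ...")
```

The resulting statements should still be executed one at a time, since the text prohibits concurrent deletes.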
This sums up my advice for data ingestion, but there's a lot more to share. Head over to the fourth article in this series, which takes a look at queries with StarRocks.
Join Us on Slack
If you’re interested in the StarRocks project, have questions, or simply seek to discover solutions or best practices, join our StarRocks community on Slack. It’s a great place to connect with project experts and peers from your industry. You can also visit the StarRocks forum for more information.