StarRocks Best Practice Guide 3/5: Data Ingestion

StarRocks Engineering
2 min read · Jun 9, 2024


Over my years as a DBA and StarRocks contributor, I’ve gained a lot of experience working alongside a diverse array of community members and picked up plenty of best practices. In that time, I’ve found that five specific areas stand out as absolutely critical: deployment, data modeling, data ingestion, querying, and monitoring.

In my previous article I shared some tips on StarRocks data modeling; in this one, I’ll be covering data ingestion.

PART 03: Data Ingestion

Usage Recommendations

  • Required: Do not use INSERT INTO VALUES() for production data ingestion.
  • Recommended: Keep a minimum interval of at least 5 seconds between ingestion batches.
  • Recommended: For update scenarios on Primary Key tables, consider enabling the persistent index. Note: use this only if you have high-performance storage such as NVMe SSDs (a minimal setup is sketched after this list).
  • Recommended: For scenarios with frequent ETL operations (INSERT INTO SELECT), consider enabling the spill-to-disk feature to avoid exceeding memory limits.
  • Recommended: When batch ingesting into a partitioned table, especially when loading a large volume of historical data from Iceberg/Hudi/Hive, ingest partition by partition to avoid generating small files.
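
To make these recommendations more concrete, here is a minimal sketch in StarRocks SQL. The table, database, and catalog names (orders, sales_db, iceberg_catalog) are hypothetical, and the property and variable names (enable_persistent_index, enable_spill) should be verified against the documentation for your StarRocks version.

    -- Hypothetical Primary Key table with the persistent index enabled
    -- (recommended only on fast storage such as NVMe SSDs).
    CREATE TABLE orders (
        order_id BIGINT NOT NULL,
        dt DATE NOT NULL,
        amount DECIMAL(10, 2)
    )
    PRIMARY KEY (order_id, dt)
    PARTITION BY RANGE (dt) (
        PARTITION p20240101 VALUES [("2024-01-01"), ("2024-01-02"))
    )
    DISTRIBUTED BY HASH (order_id)
    PROPERTIES (
        "enable_persistent_index" = "true"
    );

    -- Let large INSERT INTO SELECT jobs spill intermediate results to disk
    -- instead of failing on the memory limit.
    SET enable_spill = true;

    -- Load historical data from an external (e.g. Iceberg) table one
    -- partition at a time to avoid generating many small files.
    INSERT INTO orders
    SELECT order_id, dt, amount
    FROM iceberg_catalog.sales_db.orders_history
    WHERE dt = '2024-01-01';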

Data Lifecycle

  • Recommended: Use TRUNCATE rather than DELETE to remove data.
  • Required: Full update syntax is only available for Primary Key tables in version 3.0 and later; high-concurrency updates are prohibited, and it is recommended that update operations be spaced at least one minute apart.
  • Required: If using DELETE to remove data, it must include a WHERE clause, and concurrent deletes are prohibited. For example, avoid executing 1,000 separate statements such as DELETE FROM tbl1 WHERE id = 1; instead, use a single DELETE FROM tbl1 WHERE id IN (1, 2, 3, ..., 1000) (see the sketch after this list).
  • Required: By default, a DROP operation moves the object to the FE trash, where it is retained for 86400 seconds (1 day) and can be recovered to guard against accidental deletion. This retention is controlled by the catalog_trash_expire_second parameter. After one day, the files move to the BE's trash directory, where they are retained for 259200 seconds (3 days) by default.
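
The lifecycle rules above map onto statements along these lines. This is a hedged sketch rather than a full recipe: it reuses the hypothetical orders table from the earlier example and the tbl1 name from the list, and the configuration names should be checked against your deployment.

    -- Prefer TRUNCATE when clearing a whole table (or a single partition).
    TRUNCATE TABLE orders;

    -- When DELETE is unavoidable: always include a WHERE clause and batch
    -- the keys into one statement instead of many concurrent deletes.
    DELETE FROM tbl1 WHERE id IN (1, 2, 3, 1000);

    -- Check and, if needed, adjust how long dropped objects stay in the
    -- FE trash (default 86400 seconds = 1 day).
    ADMIN SHOW FRONTEND CONFIG LIKE "%catalog_trash_expire_second%";
    ADMIN SET FRONTEND CONFIG ("catalog_trash_expire_second" = "86400");

    -- Within the retention window, an accidentally dropped table can be
    -- restored from the trash.
    RECOVER TABLE orders;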

This sums up my advice for data ingestion, but there’s a lot more to share. Head over to the fourth article in this series, which takes a look at queries with StarRocks.

Join Us on Slack

If you’re interested in the StarRocks project, have questions, or simply want to find solutions or best practices, join our StarRocks community on Slack. It’s a great place to connect with project experts and peers from your industry. You can also visit the StarRocks forum for more information.
