Data Cache warmup preloads remote data into the local cache, minimizing the performance impact of cold-start queries. This feature works with both shared-data clusters and data lake integrations.
Prerequisites
The Data Cache feature is enabled, and the current user has the SELECT privilege on the target table.
Use cases
Consider using Data Cache warmup in the following scenarios:
-
Sufficient cache disk capacity: The cache disk has more capacity than the data to warm up. If disk capacity is insufficient, newly loaded data continuously evicts already cached data, undermining the warmup.
-
Stable cache access patterns: Avoid sudden bursts of cold queries or large-scale data writes during warmup. These activities can cause warmed-up data to be evicted prematurely.
-
POC performance testing: Warm up test data to isolate StarRocks query performance from external storage I/O.
-
Scheduled BI reports: Schedule a warmup task before users access BI reports at a fixed time each day to ensure stable query performance.
Usage
StarRocks provides the CACHE SELECT statement to warm up the Data Cache. CACHE SELECT is a synchronous operation that warms up one table at a time.
Syntax:
CACHE SELECT <column_name> [, ...]
FROM [<catalog_name>.][<db_name>.]<table_name> [WHERE <boolean_expression>]
[PROPERTIES("verbose"="true")]
Parameters:
|
Parameter |
Description |
|
|
The columns to cache. Use an asterisk ( |
|
|
The name of the external catalog. Required only for external tables in a data lake. Can be omitted if you have switched to the target catalog by using |
|
|
The name of the database. Can be omitted if you have switched to the target database. |
|
|
The name of the target table. |
|
|
The filter condition that specifies the warmup scope. |
|
|
Currently, only the |
A successful execution returns the following metrics:
|
Metric |
Description |
|
|
Total data read from the Data Cache across all nodes. |
|
|
Total data written to the Data Cache across all nodes. |
|
|
The average time taken by each node to write remote data to the Data Cache. |
|
|
The percentage of Data Cache space used across the cluster after the warmup. Use this metric to determine if the cache space is sufficient. |
Warm up an entire table
Warm up all data in the external table lineitem:
CACHE SELECT * FROM hive_catalog.test_db.lineitem;
Warm up by column and filter
Specify column names and WHERE predicates for fine-grained warmups. This reduces unnecessary data fetching and lowers disk I/O and CPU consumption.
Example: Warm up specific columns in a specific partition of an external table:
mysql> CACHE SELECT l_orderkey FROM hive_catalog.test_db.lineitem WHERE l_shipdate='1994-10-28';
+-----------------+------------------+----------------------+-------------------+
| READ_CACHE_SIZE | WRITE_CACHE_SIZE | AVG_WRITE_CACHE_TIME | TOTAL_CACHE_USAGE |
+-----------------+------------------+----------------------+-------------------+
| 957MB | 713.5MB | 3.6ms | 97.33% |
+-----------------+------------------+----------------------+-------------------+
1 row in set (9.07 sec)
Example: Warm up specific columns in the internal table lineorder for a shared-data architecture:
mysql> CACHE SELECT lo_orderkey FROM ssb.lineorder;
+-----------------+------------------+----------------------+-------------------+
| READ_CACHE_SIZE | WRITE_CACHE_SIZE | AVG_WRITE_CACHE_TIME | TOTAL_CACHE_USAGE |
+-----------------+------------------+----------------------+-------------------+
| 118MB | 558.9MB | 200.6ms | 4.66% |
+-----------------+------------------+----------------------+-------------------+
1 row in set (29.88 sec)
Verbose mode
By default, CACHE SELECT returns aggregated summary metrics. Add PROPERTIES("verbose"="true") to get detailed metrics for each BE node. This also returns the extra field AVG_READ_CACHE_TIME, which is the average read time for cache hits on each node.
mysql> CACHE SELECT * FROM hive_catalog.test_db.lineitem PROPERTIES("verbose"="true");
+---------------+-----------------+---------------------+------------------+----------------------+-------------------+
| IP | READ_CACHE_SIZE | AVG_READ_CACHE_TIME | WRITE_CACHE_SIZE | AVG_WRITE_CACHE_TIME | TOTAL_CACHE_USAGE |
+---------------+-----------------+---------------------+------------------+----------------------+-------------------+
| 172.26.80.233 | 376MB | 127.8micros | 0B | 0s | 3.85% |
| 172.26.80.231 | 272.5MB | 121.8micros | 20.7MB | 146.5micros | 3.91% |
| 172.26.80.232 | 355.5MB | 147.7micros | 0B | 0s | 3.91% |
+---------------+-----------------+---------------------+------------------+----------------------+-------------------+
3 rows in set (0.54 sec)
Schedule CACHE SELECT
Use CACHE SELECT with SUBMIT TASK to perform periodic warmups.
Example: Warm up the lineitem table every 5 minutes:
mysql> SUBMIT TASK always_cache SCHEDULE EVERY(INTERVAL 5 MINUTE) AS CACHE SELECT l_orderkey
FROM hive_catalog.test_db.lineitem
WHERE l_shipdate='1994-10-28';
+--------------+-----------+
| TaskName | Status |
+--------------+-----------+
| always_cache | SUBMITTED |
+--------------+-----------+
1 row in set (0.03 sec)
Manage CACHE SELECT tasks
View tasks
SELECT * FROM default_catalog.information_schema.tasks;
View task run history
SELECT * FROM default_catalog.information_schema.task_runs;
The EXTRA_MESSAGE field records the warmup metrics for the CACHE SELECT task.
Delete a task
DROP TASK <task_name>;
Best practices
Scenario 1: POC performance testing
Preload test tables into the Data Cache with CACHE SELECT to establish a more accurate performance baseline by eliminating external storage I/O interference.
Scenario 2: Scheduled warmup for BI reports
If users view BI reports at 08:00 every day, submit a warmup task that runs at 07:00 daily:
SUBMIT TASK BI SCHEDULE START('2024-02-03 07:00:00') EVERY(INTERVAL 1 DAY)
AS CACHE SELECT * FROM hive_catalog.test_db.lineitem
WHERE l_shipdate='1994-10-28';
Scenario 3: Reducing the system impact of warmup tasks
Specify a resource group and the degree of parallelism (DOP) in the properties parameter of SUBMIT TASK to minimize the impact on regular business queries:
SUBMIT TASK cache_select PROPERTIES("pipeline_dop"="1", "resource_group"="warmup") SCHEDULE EVERY(INTERVAL 1 DAY)
AS CACHE SELECT * FROM hive_catalog.test_db.lineitem WHERE l_shipdate>='1994-10-28';
Limitations
-
CACHE SELECTwarms up only one table at a time. It does not support operators such asORDER BY,LIMIT, orGROUP BY. -
This feature works with external tables in both shared-data and shared-nothing architectures and supports remote files in TEXT, ORC, and Parquet formats.
-
To check the remaining Data Cache capacity for data lake scenarios, run
SHOW BACKENDS\GorSHOW COMPUTE NODES\Gto determine if LRU eviction might occur. For a shared-data architecture, check cache usage via monitoring metrics.