Data Cache warmup

更新时间:
复制 MD 格式

Data Cache warmup preloads remote data into the local cache, minimizing the performance impact of cold-start queries. This feature works with both shared-data clusters and data lake integrations.

Prerequisites

The Data Cache feature is enabled, and the current user has the SELECT privilege on the target table.

Use cases

Consider using Data Cache warmup in the following scenarios:

  • Sufficient cache disk capacity: The cache disk has more capacity than the data to warm up. If disk capacity is insufficient, newly loaded data continuously evicts already cached data, undermining the warmup.

  • Stable cache access patterns: Avoid sudden bursts of cold queries or large-scale data writes during warmup. These activities can cause warmed-up data to be evicted prematurely.

  • POC performance testing: Warm up test data to isolate StarRocks query performance from external storage I/O.

  • Scheduled BI reports: Schedule a warmup task before users access BI reports at a fixed time each day to ensure stable query performance.

Usage

StarRocks provides the CACHE SELECT statement to warm up the Data Cache. CACHE SELECT is a synchronous operation that warms up one table at a time.

Syntax:

CACHE SELECT <column_name> [, ...]
FROM [<catalog_name>.][<db_name>.]<table_name> [WHERE <boolean_expression>]
[PROPERTIES("verbose"="true")]

Parameters:

Parameter

Description

column_name

The columns to cache. Use an asterisk (*) to specify all columns.

catalog_name

The name of the external catalog. Required only for external tables in a data lake. Can be omitted if you have switched to the target catalog by using SET CATALOG.

db_name

The name of the database. Can be omitted if you have switched to the target database.

table_name

The name of the target table.

boolean_expression

The filter condition that specifies the warmup scope.

PROPERTIES

Currently, only the verbose property is supported. When set to true, the command returns detailed warmup metrics for each BE node.

A successful execution returns the following metrics:

Metric

Description

READ_CACHE_SIZE

Total data read from the Data Cache across all nodes.

WRITE_CACHE_SIZE

Total data written to the Data Cache across all nodes.

AVG_WRITE_CACHE_TIME

The average time taken by each node to write remote data to the Data Cache.

TOTAL_CACHE_USAGE

The percentage of Data Cache space used across the cluster after the warmup. Use this metric to determine if the cache space is sufficient.

Warm up an entire table

Warm up all data in the external table lineitem:

CACHE SELECT * FROM hive_catalog.test_db.lineitem;

Warm up by column and filter

Specify column names and WHERE predicates for fine-grained warmups. This reduces unnecessary data fetching and lowers disk I/O and CPU consumption.

Example: Warm up specific columns in a specific partition of an external table:

mysql> CACHE SELECT l_orderkey FROM hive_catalog.test_db.lineitem WHERE l_shipdate='1994-10-28';
+-----------------+------------------+----------------------+-------------------+
| READ_CACHE_SIZE | WRITE_CACHE_SIZE | AVG_WRITE_CACHE_TIME | TOTAL_CACHE_USAGE |
+-----------------+------------------+----------------------+-------------------+
| 957MB           | 713.5MB          | 3.6ms                | 97.33%            |
+-----------------+------------------+----------------------+-------------------+
1 row in set (9.07 sec)

Example: Warm up specific columns in the internal table lineorder for a shared-data architecture:

mysql> CACHE SELECT lo_orderkey FROM ssb.lineorder;
+-----------------+------------------+----------------------+-------------------+
| READ_CACHE_SIZE | WRITE_CACHE_SIZE | AVG_WRITE_CACHE_TIME | TOTAL_CACHE_USAGE |
+-----------------+------------------+----------------------+-------------------+
| 118MB           | 558.9MB          | 200.6ms              | 4.66%             |
+-----------------+------------------+----------------------+-------------------+
1 row in set (29.88 sec)

Verbose mode

By default, CACHE SELECT returns aggregated summary metrics. Add PROPERTIES("verbose"="true") to get detailed metrics for each BE node. This also returns the extra field AVG_READ_CACHE_TIME, which is the average read time for cache hits on each node.

mysql> CACHE SELECT * FROM hive_catalog.test_db.lineitem PROPERTIES("verbose"="true");
+---------------+-----------------+---------------------+------------------+----------------------+-------------------+
| IP            | READ_CACHE_SIZE | AVG_READ_CACHE_TIME | WRITE_CACHE_SIZE | AVG_WRITE_CACHE_TIME | TOTAL_CACHE_USAGE |
+---------------+-----------------+---------------------+------------------+----------------------+-------------------+
| 172.26.80.233 | 376MB           | 127.8micros         | 0B               | 0s                   | 3.85%             |
| 172.26.80.231 | 272.5MB         | 121.8micros         | 20.7MB           | 146.5micros          | 3.91%             |
| 172.26.80.232 | 355.5MB         | 147.7micros         | 0B               | 0s                   | 3.91%             |
+---------------+-----------------+---------------------+------------------+----------------------+-------------------+
3 rows in set (0.54 sec)

Schedule CACHE SELECT

Use CACHE SELECT with SUBMIT TASK to perform periodic warmups.

Example: Warm up the lineitem table every 5 minutes:

mysql> SUBMIT TASK always_cache SCHEDULE EVERY(INTERVAL 5 MINUTE) AS CACHE SELECT l_orderkey
FROM hive_catalog.test_db.lineitem
WHERE l_shipdate='1994-10-28';
+--------------+-----------+
| TaskName     | Status    |
+--------------+-----------+
| always_cache | SUBMITTED |
+--------------+-----------+
1 row in set (0.03 sec)

Manage CACHE SELECT tasks

View tasks

SELECT * FROM default_catalog.information_schema.tasks;

View task run history

SELECT * FROM default_catalog.information_schema.task_runs;

The EXTRA_MESSAGE field records the warmup metrics for the CACHE SELECT task.

Delete a task

DROP TASK <task_name>;

Best practices

Scenario 1: POC performance testing

Preload test tables into the Data Cache with CACHE SELECT to establish a more accurate performance baseline by eliminating external storage I/O interference.

Scenario 2: Scheduled warmup for BI reports

If users view BI reports at 08:00 every day, submit a warmup task that runs at 07:00 daily:

SUBMIT TASK BI SCHEDULE START('2024-02-03 07:00:00') EVERY(INTERVAL 1 DAY)
AS CACHE SELECT * FROM hive_catalog.test_db.lineitem
WHERE l_shipdate='1994-10-28';

Scenario 3: Reducing the system impact of warmup tasks

Specify a resource group and the degree of parallelism (DOP) in the properties parameter of SUBMIT TASK to minimize the impact on regular business queries:

SUBMIT TASK cache_select PROPERTIES("pipeline_dop"="1", "resource_group"="warmup") SCHEDULE EVERY(INTERVAL 1 DAY)
AS CACHE SELECT * FROM hive_catalog.test_db.lineitem WHERE l_shipdate>='1994-10-28';

Limitations

  • CACHE SELECT warms up only one table at a time. It does not support operators such as ORDER BY, LIMIT, or GROUP BY.

  • This feature works with external tables in both shared-data and shared-nothing architectures and supports remote files in TEXT, ORC, and Parquet formats.

  • To check the remaining Data Cache capacity for data lake scenarios, run SHOW BACKENDS\G or SHOW COMPUTE NODES\G to determine if LRU eviction might occur. For a shared-data architecture, check cache usage via monitoring metrics.