Data Cache stores hot data from external storage (HDFS, object storage) on local BE nodes, reducing redundant remote I/O and accelerating data lake queries.
Overview
In data lake analytics, StarRocks frequently scans data files on HDFS or object storage. For ad-hoc queries and other scenarios that repeatedly access the same data, remote I/O can become a performance bottleneck.
Available since v2.5, Data Cache splits remote data into blocks and caches them on local BE nodes. This eliminates redundant remote reads and accelerates hot data queries.
Data Cache applies only to external table queries (excluding JDBC external tables) and External Catalog queries. It does not apply to native table queries. Since v3.4.0, storage-compute separation tables and data lake queries share the same Data Cache instance. For v3.4.0 or later, refer to the Data Cache for storage-compute separation documentation.
Limitations
-
Data Cache for data lakes is supported in StarRocks v2.5 and later, and is enabled by default starting from v3.3.0.
-
It applies only to queries on external tables and External Catalogs, not to queries on native tables.
-
Supported External Catalog types: those using the StarRocks Native File Reader (Hive, Iceberg, Hudi, Delta Lake, and Paimon). Catalogs that access data via JNI (such as JDBC Catalog) are not supported.
Some catalogs automatically switch access methods based on conditions. For example, a Paimon Catalog may fall back to JNI access depending on compaction status, in which case Data Cache is not used.
Cache media
StarRocks uses the memory and disks of BE/CN nodes as cache media and supports the following caching modes:
-
Memory-only cache: Fastest speed, but limited by memory capacity.
-
Two-tier cache (memory + disk): Expands cache capacity, balancing performance and cost.
Disk cache performance depends directly on disk I/O speed. Use a high-performance cloud disk (such as ESSD PL1). For moderate-performance disks, add multiple disks to distribute the I/O load.
Cache eviction mechanism
The memory and disk tiers evict data independently. In a two-tier cache, eviction works as follows:
-
The system reads from memory first. On a memory miss, it reads from disk and promotes the data into memory.
-
Data evicted from memory is written to disk. Data evicted from disk is discarded.
StarRocks Data Cache supports two eviction policies: LRU and SLRU (Segmented LRU). SLRU is the default policy.
The SLRU policy requires StarRocks v3.4 or later.
SLRU divides cache space into an eviction segment and a probationary segment, both using LRU internally. New data enters the eviction segment. If accessed again, it is promoted to the probationary segment. Data evicted from the probationary segment moves back to the eviction segment; data evicted from the eviction segment is removed from cache. Compared to LRU, SLRU better protects frequently accessed data from being flushed by large one-time scans.
Enable and disable Data Cache
Data Cache is controlled by the system variable enable_scan_datacache and the BE parameter datacache_enable, both enabled by default since v3.3.0. Check the current status:
SHOW VARIABLES LIKE 'enable_scan_datacache';
Key configuration parameters:
|
Parameter |
Type |
Default |
Description |
|
|
BE parameter |
|
Enables or disables Data Cache. |
|
|
System variable |
|
Enables Data Cache for query acceleration. |
|
|
BE parameter |
|
Maximum memory cache size. Recommended: 10%–20% of total memory. |
|
|
BE parameter |
|
Maximum disk cache size per disk. Accepts a percentage (e.g., |
|
|
BE parameter |
|
Enables automatic disk capacity adjustment for Data Cache. When enabled, cache capacity adapts to current disk usage. Even if |
To disable Data Cache, run the following command:
SET GLOBAL enable_scan_datacache = false;
Cache population rules
Since v3.3.2, StarRocks applies the following rules to improve cache hit rates:
-
Non-
SELECTqueries (such asANALYZE TABLEandINSERT INTO SELECT) do not populate the cache. -
Queries that scan all partitions of a table do not populate the cache (unless the table has only a single partition).
-
Queries that scan all columns of a table do not populate the cache (unless the table has only a single column).
-
Queries on tables other than Hive, Paimon, Delta Lake, Hudi, or Iceberg do not populate the cache.
Use EXPLAIN VERBOSE to check whether a query populates the cache:
EXPLAIN VERBOSE SELECT col1 FROM hudi_table;
If the query plan shows dataCacheOptions={populate: false}, the query will not populate the cache. To force population:
SET populate_datacache_mode = 'always';
I/O adaptiveness
When I/O load on the cache disk is high, Data Cache automatically routes some requests to remote storage, using both local and remote I/O to increase throughput and reduce long-tail latency. This feature is enabled by default. To enable it manually:
SET GLOBAL enable_datacache_io_adaptor = true;
Monitor Data Cache
Query information_schema.be_datacache_metrics to view cache capacity and usage per BE node:
SELECT * FROM information_schema.be_datacache_metrics;
|
Field |
Description |
|
|
BE node ID. |
|
|
BE node status. |
|
|
Configured disk cache capacity in bytes. |
|
|
Current disk cache usage in bytes. |
|
|
Configured memory cache capacity in bytes. |
|
|
Current memory cache usage in bytes. |
|
|
Memory used by system metadata in bytes. |
|
|
Disk cache paths and their sizes. |
Monitor SQL cache hits
Monitor these query profile metrics to evaluate Data Cache performance:
|
Metric |
Description |
|
|
Data read from memory and disk cache. |
|
|
Data loaded from external storage into cache. |
|
|
Total data read from all sources (memory, disk, and external storage). |