Data Cache FAQ for data lakes

This topic answers common questions about Data Cache for data lakes — including how to verify cache hits, diagnose cache misses, control population behavior, and clear cached data.

Use Data Cache

Supported catalog types

Data Cache supports external catalog types that read data through the StarRocks Native File Reader (such as the Parquet, ORC, and CSV readers). These include Hive, Iceberg, Hudi, Delta Lake, and Paimon catalogs. Catalogs that access data through the Java Native Interface (JNI), such as JDBC catalogs, are not supported.

Note

Some catalogs choose between the StarRocks Native File Reader and JNI access based on conditions such as file type and data state. For example, for a Paimon catalog, StarRocks automatically selects the access method based on the data's compaction state. Data Cache acceleration is unavailable for queries that use JNI access.

Verify cache hits

Check the following metrics in the Query Profile:

  • DataCacheReadBytes: Amount of data read from the local cache.

  • DataCacheReadCounter: Number of times the local cache was hit.

If both metrics are greater than 0, the query has hit the local cache. The following sample profile excerpt shows a query that hit the cache:

- DataCacheReadBytes: 518.73 MB
  - __MAX_OF_DataCacheReadBytes: 4.73 MB
  - __MIN_OF_DataCacheReadBytes: 16.00 KB
- DataCacheReadCounter: 684
- DataCacheWriteBytes: 7.65 GB
- DataCacheWriteCounter: 7.887K (7887)
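
As a hedged sketch of how such a profile might be collected: the statements below assume a StarRocks version in which the enable_profile session variable and the last_query_id() and get_query_profile() functions are available. hudi_table and col1 are example names, and <query_id> is a placeholder for the ID returned by last_query_id().

```sql
-- Assumption: enable_profile is available as a session variable in your version.
SET enable_profile = true;

-- Run the query whose cache behavior you want to inspect.
SELECT col1 FROM hudi_table;

-- Look up the ID of the query that just ran in this session.
SELECT last_query_id();

-- Fetch the profile text, then search it for
-- DataCacheReadBytes and DataCacheReadCounter.
SELECT get_query_profile('<query_id>');
```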

Troubleshoot cache misses

Follow these steps to diagnose cache misses:

  1. Check that the external catalog type supports Data Cache. Only catalogs that use the StarRocks Native File Reader — such as Hive, Iceberg, Hudi, Delta Lake, and Paimon — are supported.

  2. Run EXPLAIN VERBOSE to check whether the query meets the cache population rules:

EXPLAIN VERBOSE SELECT col1 FROM hudi_table;

If the output contains dataCacheOptions={populate: false}, the query does not populate the cache. This happens, for example, when the query scans all partitions of the table.

To force cache population for such queries, set the following session variable:

SET populate_datacache_mode = 'always';
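
To confirm that the setting took effect, one possible check is to rerun the plan in the same session and then restore the previous mode. This is a sketch only: it assumes that 'auto' is the default value of populate_datacache_mode, which this topic does not state.

```sql
-- Force cache population for the current session.
SET populate_datacache_mode = 'always';

-- Re-check the plan: dataCacheOptions should no longer report populate: false.
EXPLAIN VERBOSE SELECT col1 FROM hudi_table;

-- Assumption: 'auto' is the default mode. Restore it once the check is done.
SET populate_datacache_mode = 'auto';
```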

Cache hits

Cache population behavior

By default, Data Cache uses asynchronous population to minimize impact on query latency. In this mode, cache writing happens in the background after data is read, so a single query may cache only part of the accessed data. Run the query multiple times to fully populate the cache.

To change this behavior, use one of the following options:

  • Switch to synchronous population by setting enable_datacache_async_populate_mode=false. Data is cached immediately during the first read, so a single query achieves full cache coverage — at the cost of higher first-query latency.

  • Pre-warm the target data using CACHE SELECT. This loads data into the cache before queries run, with no latency impact on live queries, but requires a separate warm-up step.
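
The two options above might look like the following in a session. This is a minimal sketch that assumes enable_datacache_async_populate_mode can be set as a session variable and reuses the hudi_table example from this topic. Adjust the column list and table name to your own data.

```sql
-- Option A: synchronous population for the current session.
-- The first run of the query writes the data it reads into the cache.
SET enable_datacache_async_populate_mode = false;
SELECT col1 FROM hudi_table;

-- Option B: pre-warm the cache ahead of time, without changing
-- the population mode used by live queries.
CACHE SELECT col1 FROM hudi_table;
```

Option A trades first-query latency for immediate coverage, whereas Option B moves that cost to a separate step that can run during off-peak hours.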

Remote access with a full cache

The I/O adaption feature is enabled by default. Even when the requested data is already cached, the system automatically routes some cache requests to remote storage when disk I/O load is high. This prevents long-tail disk latency from degrading overall query performance. This behavior is by design.

To disable I/O adaption, run:

SET enable_datacache_io_adaptor = false;

Miscellaneous

Clear cached data

Data Cache has no built-in command to clear the cache. Use one of the following workarounds:

Option 1 (Recommended): Delete all data — including block files and metadata directories — from the datacache directory on the Backend (BE) or Compute Node (CN) nodes, then restart those nodes.

Option 2: Clear the cache without restarting nodes by temporarily shrinking the disk cache quota to 0. After the system clears the data automatically, restore the original quota:

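-- Step 1: Shrink the disk cache quota to 0 so that the system clears the cached data.
UPDATE be_configs SET VALUE="0" WHERE NAME="datacache_disk_size" AND BE_ID=10005;
-- Step 2: After the data is cleared, restore the original quota (2T in this example).
UPDATE be_configs SET VALUE="2T" WHERE NAME="datacache_disk_size" AND BE_ID=10005;

Warning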

When running these statements, verify the WHERE clause carefully to avoid affecting other nodes or parameters.
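
Before running the UPDATE statements, it can help to record the current quota on each node so that the correct value and BE_ID are used when restoring it. The query below is a sketch that assumes be_configs is exposed in the information_schema database, as in open-source StarRocks. The column names are the ones used in the UPDATE statements above.

```sql
-- List the current disk cache quota for every node.
-- Note the BE_ID and VALUE of the node you plan to clear.
SELECT BE_ID, NAME, VALUE
FROM information_schema.be_configs
WHERE NAME = 'datacache_disk_size';
```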

Improve Data Cache performance

Data Cache replaces remote storage access with local memory or disk access, so performance depends directly on the local cache media. If you experience high cache access latency, consider the following options:

  • Use high-performance NVMe (Non-Volatile Memory Express) disks as cache disks.

  • Add more disks to distribute the I/O load if high-performance disks are not available.

  • Increase server memory on the BE or CN nodes to leverage the operating system's page cache, reducing direct disk reads.