File cache

更新时间:
复制 MD 格式

File cache accelerates repeated queries on external data sources by storing recently accessed remote files on local backend (BE) nodes. Instead of fetching the same data from Hadoop Distributed File System (HDFS) or object storage on every query, ApsaraDB for SelectDB reads it from the local cache — reducing remote I/O costs and improving hot data query performance.

How it works

When a query accesses a remote file, ApsaraDB for SelectDB splits it into blocks by I/O size and writes them to the local BE node at the path cache_path/hash(filepath).substr(0, 3)/hash(filepath)/offset. Block metadata is stored on the BE node.

On the next query for the same data, ApsaraDB for SelectDB checks whether the required blocks are already in the local cache. It reads matching blocks locally and fetches only the remaining blocks remotely — then caches those new blocks as well.

If the cache reaches its size limit, LRU eviction removes the least recently accessed blocks. If the BE node restarts, ApsaraDB for SelectDB scans the cache_path directory and restores block metadata automatically.

Enable file cache

File cache is disabled by default.

Important

Configuring only the FE node or only the BE node silently disables the feature. Both sides must be configured for caching to take effect.

File cache takes effect only on file-based external tables, such as Hive and Hudi. It has no effect on internal tables or non-file external tables such as Java Database Connectivity (JDBC)-connected databases and Elasticsearch.

Configure the FE node

Enable file cache for the current session:

SET enable_file_cache = true;

Or enable it globally across all sessions:

SET GLOBAL enable_file_cache = true;

Configure the BE node

Add the following parameters to conf/be.conf on each BE node, then restart the BE node for the changes to take effect.

ParameterDefaultDescription
enable_file_cachetrueEnables file cache on the BE node.
file_cache_pathCache directory configuration in JSON format. Specify path, total_size (maximum cache size in bytes), and query_limit (maximum cache bytes per query).
file_cache_min_file_segment_size1048576Minimum block size in bytes. Must be greater than 4096.
file_cache_max_file_segment_size4194304Maximum block size in bytes. Must be greater than 4096.
enable_file_cache_query_limitfalseEnforces the per-query cache limit defined in file_cache_path.
clear_file_cachefalseDeletes existing cached data when the BE node restarts.

Example `file_cache_path` configuration:

[
    {
        "path": "/path/to/file_cache1",
        "total_size": 53687091200,
        "query_limit": "10737418240"
    },
    {
        "path": "/path/to/file_cache2",
        "total_size": 53687091200,
        "query_limit": "10737418240"
    },
    {
        "path": "/path/to/file_cache3",
        "total_size": 53687091200,
        "query_limit": "10737418240"
    }
]

Configure multiple cache directories to distribute cache storage across different disks.

Monitor cache effectiveness

Enable profiling in the current session to view cache metrics for a query:

SET enable_profile = true;

Run your query, then open the Queries tab on the FE node web page to view the job profile. The FileCache section reports the following metrics:

- FileCache:
  - IOHitCacheNum:           552
  - IOTotalNum:              835
  - ReadFromFileCacheBytes:  19.98 MB
  - ReadFromWriteCacheBytes: 0.00
  - ReadTotalBytes:          29.52 MB
  - WriteInFileCacheBytes:   915.77 MB
  - WriteInFileCacheNum:     283
MetricDescription
IOTotalNumTotal number of remote storage access attempts.
IOHitCacheNumNumber of accesses served from the local cache.
ReadFromFileCacheBytesBytes read from the local cache.
ReadFromWriteCacheBytesBytes read from the write cache (data cached during the current query).
ReadTotalBytesTotal bytes read (cache + remote).
SkipCacheBytesBytes re-fetched remotely because the cache file failed to be created or was deleted.
WriteInFileCacheBytesBytes written to the local cache.
WriteInFileCacheNumNumber of blocks written to the cache. Average block size = WriteInFileCacheBytes / WriteInFileCacheNum.

The cache is fully hit when both conditions are true:

  • IOHitCacheNum / IOTotalNum = 1

  • ReadFromFileCacheBytes / ReadTotalBytes = 1

If your hit rate is low, check that:

  • The queried data is in a file-based external table (Hive or Hudi).

  • Both the FE and BE nodes have file cache enabled.

  • The cache storage is large enough to hold your hot data set without frequent LRU evictions.