File cache accelerates repeated queries on external data sources by storing recently accessed remote files on local backend (BE) nodes. Instead of fetching the same data from Hadoop Distributed File System (HDFS) or object storage on every query, ApsaraDB for SelectDB reads it from the local cache — reducing remote I/O costs and improving hot data query performance.
How it works
When a query accesses a remote file, ApsaraDB for SelectDB splits it into blocks by I/O size and writes them to the local BE node at the path cache_path/hash(filepath).substr(0, 3)/hash(filepath)/offset. Block metadata is stored on the BE node.
On the next query for the same data, ApsaraDB for SelectDB checks whether the required blocks are already in the local cache. It reads matching blocks locally and fetches only the remaining blocks remotely — then caches those new blocks as well.
If the cache reaches its size limit, LRU eviction removes the least recently accessed blocks. If the BE node restarts, ApsaraDB for SelectDB scans the cache_path directory and restores block metadata automatically.
Enable file cache
File cache is disabled by default.
Configuring only the FE node or only the BE node silently disables the feature. Both sides must be configured for caching to take effect.
File cache takes effect only on file-based external tables, such as Hive and Hudi. It has no effect on internal tables or non-file external tables such as Java Database Connectivity (JDBC)-connected databases and Elasticsearch.
Configure the FE node
Enable file cache for the current session:
SET enable_file_cache = true;Or enable it globally across all sessions:
SET GLOBAL enable_file_cache = true;Configure the BE node
Add the following parameters to conf/be.conf on each BE node, then restart the BE node for the changes to take effect.
| Parameter | Default | Description |
|---|---|---|
enable_file_cache | true | Enables file cache on the BE node. |
file_cache_path | — | Cache directory configuration in JSON format. Specify path, total_size (maximum cache size in bytes), and query_limit (maximum cache bytes per query). |
file_cache_min_file_segment_size | 1048576 | Minimum block size in bytes. Must be greater than 4096. |
file_cache_max_file_segment_size | 4194304 | Maximum block size in bytes. Must be greater than 4096. |
enable_file_cache_query_limit | false | Enforces the per-query cache limit defined in file_cache_path. |
clear_file_cache | false | Deletes existing cached data when the BE node restarts. |
Example `file_cache_path` configuration:
[
{
"path": "/path/to/file_cache1",
"total_size": 53687091200,
"query_limit": "10737418240"
},
{
"path": "/path/to/file_cache2",
"total_size": 53687091200,
"query_limit": "10737418240"
},
{
"path": "/path/to/file_cache3",
"total_size": 53687091200,
"query_limit": "10737418240"
}
]Configure multiple cache directories to distribute cache storage across different disks.
Monitor cache effectiveness
Enable profiling in the current session to view cache metrics for a query:
SET enable_profile = true;Run your query, then open the Queries tab on the FE node web page to view the job profile. The FileCache section reports the following metrics:
- FileCache:
- IOHitCacheNum: 552
- IOTotalNum: 835
- ReadFromFileCacheBytes: 19.98 MB
- ReadFromWriteCacheBytes: 0.00
- ReadTotalBytes: 29.52 MB
- WriteInFileCacheBytes: 915.77 MB
- WriteInFileCacheNum: 283| Metric | Description |
|---|---|
IOTotalNum | Total number of remote storage access attempts. |
IOHitCacheNum | Number of accesses served from the local cache. |
ReadFromFileCacheBytes | Bytes read from the local cache. |
ReadFromWriteCacheBytes | Bytes read from the write cache (data cached during the current query). |
ReadTotalBytes | Total bytes read (cache + remote). |
SkipCacheBytes | Bytes re-fetched remotely because the cache file failed to be created or was deleted. |
WriteInFileCacheBytes | Bytes written to the local cache. |
WriteInFileCacheNum | Number of blocks written to the cache. Average block size = WriteInFileCacheBytes / WriteInFileCacheNum. |
The cache is fully hit when both conditions are true:
IOHitCacheNum / IOTotalNum = 1ReadFromFileCacheBytes / ReadTotalBytes = 1
If your hit rate is low, check that:
The queried data is in a file-based external table (Hive or Hudi).
Both the FE and BE nodes have file cache enabled.
The cache storage is large enough to hold your hot data set without frequent LRU evictions.