Lake cache
AnalyticDB for MySQL provides the lake cache feature, which caches frequently accessed files from OSS on high-performance NVMe SSDs to accelerate OSS data reads. This feature is ideal for scenarios that require high bandwidth and involve repetitive data reads, for example, when multiple analysts need to query the same dataset. This topic describes the benefits, use cases, and usage of lake cache.
Prerequisites
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
Overview
How it works
Lake cache works as follows:
-
The lake cache client forwards read requests for OSS data to the lake cache accelerator. The client then connects to a lake cache master node to request file metadata.
-
The lake cache master node returns the file metadata to the lake cache client.
-
Based on the metadata, the lake cache client sends a request to a lake cache worker to retrieve OSS data:
-
If the target file is in the cache space of the lake cache worker, the file is returned directly to the client.
-
If the target file is not in the cache space, the lake cache accelerator retrieves the file from OSS. The accelerator then returns the file to the client and caches it for future requests.
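The read path above can be sketched as a small in-memory simulation. This is an illustrative model only, not the service's implementation: the `Master`, `Worker`, and `client_read` names are hypothetical, a plain dictionary stands in for OSS, and assigning files to workers by a CRC32 hash of the path is an assumption made for the example.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker node that keeps cached file contents in memory."""
    oss: dict                                   # stand-in for the OSS bucket
    cache: dict = field(default_factory=dict)   # path -> cached bytes
    oss_reads: int = 0                          # times this worker went back to OSS

    def read(self, path: str) -> bytes:
        if path in self.cache:       # cache hit: return the file directly
            return self.cache[path]
        data = self.oss[path]        # cache miss: retrieve the file from OSS
        self.oss_reads += 1
        self.cache[path] = data      # keep a copy for future requests
        return data


@dataclass
class Master:
    """Hypothetical master node that answers metadata lookups."""
    workers: list

    def locate(self, path: str) -> Worker:
        # Assumption: a file is assigned to a worker by hashing its path.
        return self.workers[zlib.crc32(path.encode()) % len(self.workers)]


def client_read(master: Master, path: str) -> bytes:
    worker = master.locate(path)     # ask the master for metadata
    return worker.read(path)         # ask the worker for the data


path = "oss://testBucketName/data/readme.txt"
oss = {path: b"hello lake cache"}
master = Master(workers=[Worker(oss), Worker(oss)])

first = client_read(master, path)    # miss: the worker loads the file from OSS
second = client_read(master, path)   # hit: served from the worker's cache
oss_reads = sum(w.oss_reads for w in master.workers)
print(first == second, oss_reads)
```

In this model, the second read of the same path never touches the OSS stand-in, which is the behavior the cache is meant to provide for repeated reads.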
Benefits
-
Millisecond-level latency
The lake cache accelerator uses NVMe SSDs to deliver millisecond-level read latency.
-
High throughput
The accelerator's bandwidth scales linearly with the cache space size, offering burst throughput of up to several hundred GB/s.
-
High throughput density
The accelerator delivers high throughput for small datasets, meeting burst read demands for hot data.
-
Elastic scaling
You can manually scale the cache space up or down based on your business needs to prevent resource waste and reduce costs. The cache space can be scaled from 10 GB to 200,000 GB.
-
Decoupled storage and computing
Unlike cache on compute nodes, the lake cache accelerator is an independent component, and its size can be adjusted online.
-
Data consistency
When a file in OSS is updated, the lake cache accelerator automatically detects the change and caches the new version, ensuring that the compute engine always reads the latest data.
Performance metrics and cache eviction policy
| Parameter | Description |
| --- | --- |
| Accelerator bandwidth | Bandwidth is calculated with the formula: read bandwidth (GB/s) = 5 × cache space (TB). For example, if the lake cache accelerator has a 10 TB cache space, the read bandwidth is (5 × 10) GB/s = 50 GB/s. |
| Lake cache accelerator cache space | The value can range from 10 GB to 200,000 GB. To request a larger capacity, submit a ticket. The lake cache accelerator provides throughput for cached data based on the configured cache space size. Each terabyte (TB) of cache space provides 5 GB/s of bandwidth. The throughput provided by the accelerator is in addition to, and not limited by, the standard OSS bandwidth. |
| Cache eviction policy | When the cache is full, the system uses the Least Recently Used (LRU) policy to evict data. This policy removes the least recently accessed data first to maximize cache efficiency. |
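The bandwidth rule in the table can be checked with a few lines of Python. This is a minimal sketch: the function and constant names are hypothetical, and the conversion of 1 TB = 1,000 GB is an assumption, chosen because it reproduces the 10 TB to 50 GB/s example.

```python
BANDWIDTH_PER_TB_GBPS = 5     # from the table: each TB of cache space gives 5 GB/s
MIN_CACHE_GB = 10             # lower bound of the configurable cache space
MAX_CACHE_GB = 200_000        # upper bound of the configurable cache space


def read_bandwidth_gbps(cache_space_gb: float) -> float:
    """Estimate accelerator read bandwidth in GB/s for a cache size in GB."""
    if not MIN_CACHE_GB <= cache_space_gb <= MAX_CACHE_GB:
        raise ValueError("cache space must be between 10 GB and 200,000 GB")
    # Assumption: 1 TB = 1,000 GB.
    return BANDWIDTH_PER_TB_GBPS * cache_space_gb / 1000


print(read_bandwidth_gbps(10_000))   # a 10 TB cache space
```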
Performance
AnalyticDB for MySQL was tested with the TPC-H benchmark to compare two access methods: reading through the lake cache and reading directly from OSS. With lake cache enabled, the queries finished about 2.7 times faster. The following table shows the detailed results:
| Type | Cache space | Dataset size | Spark resources | Query duration |
| --- | --- | --- | --- | --- |
| Lake cache enabled | 12 TB | TPC-H 10 TB dataset | Medium (2 cores, 8 GB) | 7,219 s |
| Direct OSS access | N/A | TPC-H 10 TB dataset | Medium (2 cores, 8 GB) | 19,578 s |
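The 2.7 times figure follows directly from the two query durations in the table. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
cached_seconds = 7219     # query duration with lake cache enabled
direct_seconds = 19578    # query duration with direct OSS access

speedup = direct_seconds / cached_seconds
time_saved_fraction = 1 - cached_seconds / direct_seconds

print(f"speedup: {speedup:.2f}x, time saved: {time_saved_fraction:.0%}")
```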
Billing
The lake cache space is billed on a pay-as-you-go basis. For more information, see Pricing for Enterprise Edition and Basic Edition and Pricing for Data Lakehouse Edition.
Usage notes
-
Lake cache is available only in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), Singapore, US (Silicon Valley), US (Virginia), and Indonesia (Jakarta).
Important: If you want to use the lake cache feature in other regions, submit a ticket.
-
If a cache hardware failure occurs, data queries continue to run without interruption or errors, although performance may degrade. Cached data is reloaded from OSS, and query speed is restored after the process is complete.
-
When the configured cache space is full, the accelerator evicts the least recently accessed files to make room for newly read files, based on the LRU cache eviction policy. To prevent files from being evicted, scale up the cache space.
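The effect of the LRU eviction policy, and of scaling up the cache space, can be illustrated with a small model. This is a sketch under stated assumptions rather than the accelerator's actual logic: the `LRUFileCache` class is hypothetical, it tracks only file sizes, and it evicts whole files.

```python
from collections import OrderedDict


class LRUFileCache:
    """Toy LRU cache that tracks file sizes against a byte capacity."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._files = OrderedDict()   # path -> size, least recently used first

    def access(self, path: str, size: int) -> bool:
        """Record a read of `path` and return True on a cache hit."""
        if path in self._files:
            self._files.move_to_end(path)   # mark as most recently used
            return True
        if size > self.capacity_bytes:
            return False                    # file can never fit, so skip caching
        # Evict least recently used files until the new file fits.
        while self.used_bytes + size > self.capacity_bytes:
            _, evicted_size = self._files.popitem(last=False)
            self.used_bytes -= evicted_size
        self._files[path] = size
        self.used_bytes += size
        return False

    def cached_paths(self) -> list:
        return list(self._files)


def replay(capacity_bytes: int) -> list:
    """Read files a, b, a, c (40 bytes each) and return what stays cached."""
    cache = LRUFileCache(capacity_bytes)
    for path in ["a", "b", "a", "c"]:
        cache.access(path, 40)
    return cache.cached_paths()


small = replay(100)   # "b" is evicted because "a" was read more recently
large = replay(120)   # all three files fit, so nothing is evicted
print(small, large)
```

With the smaller capacity the least recently read file is dropped when the third file arrives, while the larger capacity keeps all three, which mirrors the advice to scale up the cache space when eviction is unwanted.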
Enable lake cache
-
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
-
Go to the Cluster Information page. In the Configuration Information section, click Configure next to Lake Cache.
-
In the Lake Cache dialog box, turn on the Lake Cache switch and configure the Disk Cache Settings.
-
Click OK.
After you enable Lake Cache, you can follow the same steps to view the configured cache space size.
Use lake cache
After you enable lake cache, set the spark.adb.lakecache.enabled parameter to true in your Spark job configuration to accelerate OSS data reads. The following examples show the setting for Spark SQL and Spark JAR jobs:
Spark SQL
-- This is an example of using lake cache. Modify the code to run your Spark program.
SET spark.adb.lakecache.enabled=true;
-- Add your SQL statements here.
SHOW databases;
Spark JAR
{
  "comments": [
    "-- This is an example of using lake cache. Modify the code to run your Spark program."
  ],
  "args": ["oss://testBucketName/data/readme.txt"],
  "name": "spark-oss-test",
  "file": "oss://testBucketName/data/example.py",
  "conf": {
    "spark.adb.lakecache.enabled": "true"
  }
}
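If job configurations are generated by a script, the same setting can be added programmatically. The helper below is a hypothetical convenience function, not part of any AnalyticDB for MySQL SDK; it only builds the JSON document shown above and does not submit the job.

```python
import json


def with_lake_cache(job: dict, enabled: bool = True) -> dict:
    """Return a copy of a Spark job configuration with the lake cache flag set."""
    conf = dict(job.get("conf", {}))   # copy so the caller's dict is not modified
    # The value is written as a string, matching the example configuration.
    conf["spark.adb.lakecache.enabled"] = "true" if enabled else "false"
    return {**job, "conf": conf}


job = {
    "args": ["oss://testBucketName/data/readme.txt"],
    "name": "spark-oss-test",
    "file": "oss://testBucketName/data/example.py",
}

print(json.dumps(with_lake_cache(job), indent=2))
```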
If you want to use the lake cache accelerator with the XIHE engine, submit a ticket.
Monitor lake cache
After you enable lake cache, you can use the CloudMonitor console to verify that your Spark applications are using the cache and to view metrics such as data read volume. To do so, perform the following steps:
-
Log on to the CloudMonitor console.
-
In the left-side navigation pane, choose .
-
Hover over the AnalyticDB for MySQL card and click AnalyticDB for MySQL 3.0 - Data Lakehouse Edition.
-
Find the target cluster and click Monitoring Charts in the Actions column.
-
Click the LakeCache Metrics tab to view the details.
The following table describes the monitoring metrics.
| Metric | Description |
| --- | --- |
| LakeCache Cache Hit Ratio (%) | The percentage of read requests fulfilled by the cache. Formula: (Cache Hits / Total Read Requests) × 100%. |
| LakeCache Cache Usage (B) | The amount of used cache space, in bytes. |
| Total Data Read from LakeCache (B) | The total amount of data read from the cache space, in bytes. |
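The hit ratio formula can be applied to raw counters as follows. This is a minimal sketch: the function name is illustrative, and returning 0 when no read requests have been made is an assumption, since the source does not define that case.

```python
def cache_hit_ratio(cache_hits: int, total_read_requests: int) -> float:
    """Return the cache hit ratio as a percentage."""
    if total_read_requests == 0:
        return 0.0   # assumption: report 0% when nothing has been read yet
    return cache_hits / total_read_requests * 100


print(cache_hit_ratio(750, 1000))   # 750 of 1,000 reads served from the cache
```

A ratio that stays near zero after repeated runs of the same job may indicate that the job is not reading through the cache, for example because spark.adb.lakecache.enabled is not set.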