Accelerate Small File Reads with OSS Connector for AI/ML and OSS Accelerator


Random I/O across millions of small files (such as ImageNet images) is a major bottleneck for GPU utilization during AI/ML training. Combining OSS Connector for AI/ML with OSS Accelerator speeds up data loading: in the ImageNet test described in this topic, total read time dropped from 246.89 s to 101.49 s.

How the acceleration works

OSS Connector for AI/ML and OSS Accelerator optimize at different layers — application and storage, respectively — and produce a compounding effect when combined:

  • OSS Connector for AI/ML (application layer): Uses asynchronous I/O and multi-threaded prefetching to convert serial file requests into highly concurrent parallel streams. The next batch's data is prefetched in the background, which reduces the time the CPU and GPU spend idle waiting on I/O.

  • OSS Accelerator (storage layer): Caches hot data on high-performance media using a cold/warm model. The first read (cold cache) loads data from the OSS origin into the cache, so performance is comparable to standard OSS access. Subsequent reads (warm cache) are served directly from the cache at roughly 2.8x lower P50 latency (approximately 12 ms vs. 35 ms). In model training, where the same dataset is read across multiple epochs, acceleration takes effect from the second epoch onward.

  • Combined effect: The Accelerator serves the high volume of concurrent requests from the Connector with millisecond-level latency, so OSS origin latency is no longer the bottleneck. The cache also smooths traffic bursts during training startup and epoch transitions, which helps keep throughput steady.
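The application-layer idea described above, turning serial requests into concurrent ones, can be illustrated with a minimal sketch. This is not the Connector's implementation: `fetch` is a hypothetical stand-in that sleeps for 10 ms to mimic one small-object read, and the concurrency of 16 is chosen only to mirror the `prefetchConcurrency` value used later in this topic.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(key, latency_s=0.01):
    """Hypothetical stand-in for one small-object read; sleeps to mimic latency."""
    time.sleep(latency_s)
    return key, b"x" * 16

def read_serial(keys):
    # One request at a time: total time grows with len(keys) * latency.
    return [fetch(k) for k in keys]

def read_concurrent(keys, concurrency=16):
    # Up to `concurrency` requests in flight; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fetch, keys))

keys = [f"img_{i}.jpg" for i in range(64)]
for reader in (read_serial, read_concurrent):
    t0 = time.perf_counter()
    reader(keys)
    print(f"{reader.__name__}: {time.perf_counter() - t0:.3f} s")
```

With a fixed per-request latency, the concurrent reader finishes in roughly `len(keys) / concurrency` round trips instead of `len(keys)`, which is why lower per-request latency at the storage layer compounds with higher concurrency at the application layer.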

Prerequisites

  • An OSS bucket with your training dataset uploaded.

  • An AccessKey ID and AccessKey Secret. For more information, see Create an AccessKey.

  • An ECS instance. Compute-optimized or network-optimized instances (such as ecs.g7.32xlarge) are recommended. Alibaba Cloud Linux 3/4 is the recommended operating system. For more information, see Create an instance with Custom Launch.

Enable OSS Accelerator for your bucket

  1. Log on to the OSS console and click the target bucket name.

  2. In the left-side navigation pane, choose Bucket Settings > OSS Accelerator.

  3. Click Create OSS Accelerator and configure the following parameters:

    | Parameter | Description |
    |---|---|
    | Zone | Select the same zone as your ECS instance. In this example, China (Beijing) Zone H is used. |
    | Capacity | Must be greater than or equal to the total size of your dataset. In this example, 20 TB is used. |
    | Acceleration Policy | Select Accelerate Specified Path and enter the dataset prefix, or select Accelerate Entire Bucket. |

    Important

    The ECS instance and OSS Accelerator should be in the same zone. Cross-zone access introduces additional network latency that degrades acceleration performance.

  4. After creation, the endpoint to use with OSS Connector for AI/ML follows the format oss-cache-<zone>.aliyuncs.com. For example, if the zone is cn-beijing-h, the corresponding endpoint is oss-cache-cn-beijing-h.aliyuncs.com.
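As a small illustration of the endpoint format in step 4, the following sketch builds the endpoint string from a zone ID. The helper name `accelerator_endpoint` is hypothetical and not part of any SDK; it only applies the format stated above.

```python
def accelerator_endpoint(zone_id: str) -> str:
    """Build the accelerator endpoint from a zone ID (hypothetical helper)."""
    zone_id = zone_id.strip().lower()
    if not zone_id:
        raise ValueError("zone_id must not be empty")
    # Format taken from this topic: oss-cache-<zone>.aliyuncs.com
    return f"oss-cache-{zone_id}.aliyuncs.com"

print(accelerator_endpoint("cn-beijing-h"))
```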

Install and configure OSS Connector for AI/ML

  1. Log on to the ECS instance and install OSS Connector for AI/ML (PyTorch edition).

    pip install osstorchconnector

  2. Configure access credentials. Replace <yourAccessKeyId> and <yourAccessKeySecret> with your actual AccessKey information.

    mkdir -p /root/.alibabacloud
    cat > /root/.alibabacloud/credentials << 'EOF'
    {
        "AccessKeyId": "<yourAccessKeyId>",
        "AccessKeySecret": "<yourAccessKeySecret>"
    }
    EOF

    The credentials file must use JSON format. For more configuration options, see Configure OSS Connector for AI/ML.

  3. Create a configuration file for OSS Connector to tune prefetch parameters for small file reads.

    mkdir -p /etc/oss-connector/
    cat > /etc/oss-connector/config.json << 'EOF'
    {
        "logLevel": 1,
        "logPath": "/var/log/oss-connector/connector.log",
        "auditPath": "/var/log/oss-connector/audit.log",
        "datasetConfig": {
            "prefetchMB": 1024,
            "prefetchConcurrency": 16,
            "prefetchWorker": 2,
            "prefetchUnitMB": 1,
            "timeoutMs": 10000
        },
        "checkpointConfig": {
            "prefetchConcurrency": 24,
            "prefetchWorker": 4,
            "uploadConcurrency": 64
        }
    }
    EOF

    Key parameters:

    | Parameter | Recommended value | Description |
    |---|---|---|
    | prefetchMB | 1024 | Prefetch buffer size in MB. A 1 GB buffer can hold approximately 8,700 files at 115 KB each. Increase this value for larger files. |
    | prefetchConcurrency | 16 | Number of concurrent prefetch operations. Takes full advantage of high-bandwidth instances. |
    | prefetchUnitMB | 1 | Size of each prefetch unit in MB. Set to 1 MB for small files. Increase to match the file size for larger files. |
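The buffer-size estimate for prefetchMB is simple division, and a short sketch makes it easy to redo for a different average file size. The function name `files_per_buffer` is hypothetical, and how the Connector rounds or accounts for per-file overhead is not covered here, so treat the output as a rough estimate only.

```python
def files_per_buffer(buffer_bytes: int, avg_file_bytes: int) -> int:
    """Whole files of the given average size that fit in the buffer."""
    if avg_file_bytes <= 0:
        raise ValueError("avg_file_bytes must be positive")
    return buffer_bytes // avg_file_bytes

# Decimal units: 1 GB = 10**9 bytes, 115 KB = 115,000 bytes.
decimal_estimate = files_per_buffer(10**9, 115_000)

# Binary units: 1024 MiB buffer, 115 KiB files.
binary_estimate = files_per_buffer(1024 * 1024**2, 115 * 1024)

print(decimal_estimate, binary_estimate)
```

The decimal reading gives about 8,700 files and the binary reading about 9,100, so either way a 1 GB buffer holds several thousand files of this size.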

Run a performance comparison test

Create a test script named test_accelerator.py to read the dataset through both the OSS internal endpoint and the OSS Accelerator endpoint, then compare the results.

from osstorchconnector import OssMapDataset
import torch
from torch.utils.data import DataLoader
import time
import numpy as np

# === Configuration ===
# First run: use OSS internal endpoint. Second run: switch to OSS Accelerator endpoint.
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
# ENDPOINT = "http://oss-cache-cn-beijing-h.aliyuncs.com"

CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://<yourBucketName>/<yourDatasetPrefix>/"
REGION = "cn-beijing"

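def collate_fn(batch):
    # Read every object in the batch inside the worker process so that the
    # test measures real data transfer, and record each object's size and read time.
    results = []
    for item in batch:
        start = time.perf_counter()
        item.read()  # the content is discarded; only size and timing are kept
        read_time = time.perf_counter() - start
        results.append({'size': item.size, 'read_time': read_time})
    return results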

dataset = OssMapDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
    region=REGION
)

NUM_WORKERS = 8
BATCH_SIZE = 256
dataloader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    collate_fn=collate_fn,
    pin_memory=False,
    shuffle=True
)

# === Run test ===
all_batch_durations = []
total_size = 0
file_count = 0
start_time = time.perf_counter()
last_receive_time = start_time

print(f"Starting test: {NUM_WORKERS} workers, batch_size={BATCH_SIZE}")
print(f"Endpoint: {ENDPOINT}\n")

try:
    for batch in dataloader:
        current_receive_time = time.perf_counter()
        batch_duration = current_receive_time - last_receive_time
        all_batch_durations.append(batch_duration)
        last_receive_time = current_receive_time

        batch_total_size = sum(data['size'] for data in batch)
        total_size += batch_total_size
        file_count += len(batch)

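        # Report whenever the running count crosses a multiple of 10,000;
        # with 256-file batches it almost never lands on one exactly.
        if file_count // 10000 > (file_count - len(batch)) // 10000: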
            elapsed = current_receive_time - start_time
            throughput = total_size / elapsed / (1024 ** 2)
            print(f"  {file_count:,} files processed | Throughput: {throughput:.2f} MB/s")

except KeyboardInterrupt:
    print("\nTest interrupted")

# === Results ===
end_time = time.perf_counter()
total_elapsed = end_time - start_time
num_batches = len(all_batch_durations)

if num_batches > 0:
    durations_ms = np.array(all_batch_durations) * 1000
    print(f"\n{'='*50}")
    print(f"Test Results")
    print(f"{'='*50}")
    print(f"Total files:      {file_count:,}")
    print(f"Total data:       {total_size / (1024**2):,.2f} MB")
    print(f"Total time:       {total_elapsed:.2f} s")
    print(f"Avg throughput:   {total_size / total_elapsed / (1024**2):,.2f} MB/s")
    print(f"\nBatch latency (ms): Avg={np.mean(durations_ms):.2f}  P50={np.percentile(durations_ms, 50):.2f}  P95={np.percentile(durations_ms, 95):.2f}")
else:
    print("No batches received")

Replace OSS_URI and REGION in the script with your actual values, make sure both ENDPOINT values match your region and zone, then follow the steps below.

1. Run with the OSS internal endpoint (baseline)

Ensure ENDPOINT in the script is set to the OSS internal address, then run the test and record the total time:

python test_accelerator.py

2. Switch to the OSS Accelerator endpoint

Change ENDPOINT to the accelerator address:

ENDPOINT = "http://oss-cache-cn-beijing-h.aliyuncs.com"

The first read through the accelerator loads data from the OSS origin into the cache (cold cache), so performance is similar to direct OSS access. Run the test twice: the first run warms the cache, and the second run shows the actual acceleration.

# First run: warm up the cache
python test_accelerator.py

# Second run: cache hits, acceleration takes effect
python test_accelerator.py

Compare the total time of the baseline run against the second accelerator run to see the speedup for small file workloads.
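Because the cold and warm passes differ only in whether the data is already cached, the same comparison can also be made inside one process by timing consecutive epochs. The sketch below assumes a hypothetical helper `time_epochs` that takes any function returning an iterable; in practice that function would build the DataLoader from the test script, and the list-based loader here is only a stand-in so the example runs on its own.

```python
import time

def time_epochs(make_loader, epochs=2):
    """Iterate a freshly built loader once per epoch; return seconds per epoch."""
    durations = []
    for _ in range(epochs):
        start = time.perf_counter()
        for _batch in make_loader():
            pass  # in the test script, the reads happen inside collate_fn
        durations.append(time.perf_counter() - start)
    return durations

def make_fake_loader():
    # Stand-in for a function that returns the real DataLoader.
    return [list(range(256)) for _ in range(100)]

for epoch, seconds in enumerate(time_epochs(make_fake_loader, epochs=3), start=1):
    print(f"epoch {epoch}: {seconds:.4f} s")
```

With the accelerator endpoint, the first epoch would correspond to the cold-cache run and later epochs to warm-cache runs.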

Performance comparison results

The following results were measured on an ecs.g7.32xlarge instance using the ImageNet dataset (1.28 million files, 137 GB total, ~115 KB average file size):

| Metric | Without Accelerator | With Accelerator |
|---|---|---|
| Total time | 246.89 s | 101.49 s |
| Average throughput | 567.44 MB/s | 1,380.41 MB/s |
| Speedup | Baseline | 2.43x |

Actual performance varies by instance type, dataset size, and file size. Run your own tests for accurate benchmarks.
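To compute the speedup for your own runs, divide the baseline total time by the accelerated total time (or the accelerated throughput by the baseline throughput). A minimal sketch using the figures reported above, with a hypothetical helper named `speedup`:

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Ratio of baseline time to accelerated time; above 1 means faster."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds

time_ratio = speedup(246.89, 101.49)   # total time, without vs. with Accelerator
throughput_ratio = 1380.41 / 567.44    # average throughput, with vs. without

print(f"time: {time_ratio:.2f}x  throughput: {throughput_ratio:.2f}x")
```

Both ratios come out at about 2.43, which is expected because the two runs read the same amount of data.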