Use an OSS accelerator to speed up model training

Performance benefits

An OSS accelerator offers significant performance advantages over a standard OSS bucket. Its low latency helps achieve higher throughput with fewer workers. In our tests, the OSS accelerator boosted training efficiency by 40% to 400%, substantially reducing compute resource consumption and costs.

Performance test results

Important

The following performance test results are for reference only. The actual acceleration may vary depending on factors such as dataset size, hardware configuration, model complexity, and other hyperparameter settings.

We compared model training performance when loading data from a standard OSS bucket and from an OSS bucket with an accelerator. The dataset consists of a training set with 1.28 million images and a validation set with 50,000 images. The test machine had 4 vCPUs, 15 GiB of memory, and one NVIDIA Tesla T4 GPU. On this machine, we varied the batch size and the number of data loader workers and measured the time per epoch for both standard OSS and the accelerator.

Batch size	Workers	Standard OSS epoch time (min)	OSS accelerator epoch time (min)
64	6	63.18	34.70
64	4	54.96	34.68
64	2	146.05	34.66
32	6	82.19	37.11
32	4	108.33	37.13
32	2	137.87	37.30
16	6	68.93	41.58
16	4	132.97	41.69
16	2	206.32	41.69
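
The relative speedup for each configuration follows directly from the table. The sketch below is only a convenience for reading the numbers: it divides the standard OSS epoch time by the accelerator epoch time. The names `EPOCH_MINUTES` and `speedups` are illustrative and not part of any OSS tooling.

```python
# Epoch times in minutes, copied from the table above:
# (batch size, workers) -> (standard OSS, OSS accelerator)
EPOCH_MINUTES = {
    (64, 6): (63.18, 34.70),
    (64, 4): (54.96, 34.68),
    (64, 2): (146.05, 34.66),
    (32, 6): (82.19, 37.11),
    (32, 4): (108.33, 37.13),
    (32, 2): (137.87, 37.30),
    (16, 6): (68.93, 41.58),
    (16, 4): (132.97, 41.69),
    (16, 2): (206.32, 41.69),
}


def speedups(table):
    """Return {(batch, workers): standard_time / accelerator_time}."""
    return {cfg: std / acc for cfg, (std, acc) in table.items()}


if __name__ == "__main__":
    for (batch, workers), ratio in sorted(speedups(EPOCH_MINUTES).items(), reverse=True):
        print(f"batch={batch:<3} workers={workers}  {ratio:.2f}x faster")
```

The ratios range from roughly 1.6x (batch size 64, 4 workers) to roughly 4.9x (batch size 16, 2 workers). For a given batch size, the accelerator's epoch time barely changes with the number of workers, while the standard OSS epoch time is generally much higher with 2 workers.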

Solution overview

This solution involves three steps:

1. Create an Elastic GPU Service instance: Create an Elastic GPU Service instance appropriate for your model training workload.
2. Create an OSS bucket and enable the OSS accelerator: Create an OSS bucket to store your data, enable the OSS accelerator, and retrieve the bucket's internal and accelerated endpoints for the training task.
3. Train the model: After preparing the resources, preprocess the original dataset and upload it to OSS. During training, use the OSS accelerator to load the dataset and start training.

Procedure

Step 1: Create an Elastic GPU Service instance

The following steps show how to create and connect to an Elastic GPU Service instance for model training. The instance uses the ecs.gn6i-c4g1.xlarge instance type, the Ubuntu 22.04 operating system, and CUDA version 12.4.1. When you customize the instance configuration, make sure to select the latest CUDA version.

1. Create an Elastic GPU Service instance

Go to the Elastic Compute Service (ECS) instance buy page.
Click the Custom Launch tab.
Configure the billing method, region, network, zone, instance type, image, and other parameters as needed, and then create the instance. For more information about each parameter, see Parameters.

Important
- The OSS accelerator feature is in public preview in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), and Singapore. Make sure your Elastic GPU Service instance is in one of these regions. This guide uses the China (Hangzhou) region as an example.
- In this guide, the ECS instance type is ecs.gn6i-c4g1.xlarge. This is for reference only.
  
  On the instance type selection page, click the All Generations tab, enter the instance type in the search box, and select GPU-accelerated Compute gn6i. This instance type provides 4 vCPUs, 15 GiB of memory, and 1 * NVIDIA T4 GPU (16 GB VRAM).
- Select an operating system image. This guide uses Ubuntu 22.04 as an example. Select the Auto-install GPU Driver checkbox and select CUDA Version 12.4.1. This automatically installs the CUDA environment at startup, eliminating the need for manual configuration.
  
  For the driver version, select 550.90.07. For the CUDNN version, select 9.2.0.82.

2. Connect to the Elastic GPU Service instance

In the ECS console, go to the Instance page. Find the ECS instance you created by region and instance ID, and click Connect in the Actions column.
In the Remote connection dialog box, click Sign in now for Workbench.
In the Instance Login dialog box, set Authentication to the method you chose when creating the instance. For example, select SSH Key Authentication, enter the username, and upload the private key file from the key pair you created. Click Log On to log on to the ECS instance.

Note
The private key file is automatically downloaded to your local machine when you create a key pair. Check your browser's download history to find the .pem private key file.

A successful logon displays the following output, and the CUDA driver installation starts automatically. Wait for the installation to complete.

```
Welcome to Alibaba Cloud Elastic Compute Service !
Last login: Fri Dec 13 14:15:04 2024 from 100.104.86.0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    20  100    20    0     0    645      0 --:--:-- --:--:-- --:--:--   645
CHECKING AUTO INSTALL, DRIVER VERSION=550.90.07 CUDA_VERSION=12.4.1 CUDNN VERSION=9.2.0.82 , INSTALL RDMA=FALSE, INSTALL eRDMA=FALSE, PLEASE WAIT ......
The script automatically downloads and installs a NVIDIA GPU driver and CUDA, CUDNN library. if you choose install RDMA or ERDMA, RDMA or ERDMA software will install.
if you choose install perseus, perseus environment will install as well.
1. The installation takes 15 to 20 minutes, depending on the intranet bandwidth and the quantity of vCPU cores of the instance. Please do not operate the GPU or install any GPU-related software until the GPU driver is installed successfully.
2. After the GPU is installed successfully, the instance will restarts automatically.
CUDA-12.4.1 downloading, it takes 3 minutes or more. Remaining installation time 14 - 19 minutes!
```

Step 2: Prepare the OSS bucket and accelerator

These steps show how to create a bucket in the same region as the Elastic GPU Service instance to store datasets and enable the OSS accelerator to improve data access speed. Note that no traffic fees are incurred if the ECS instance and the bucket are in the same region and you access the bucket through its internal endpoint.

Create a bucket and get the internal endpoint

Important
The OSS accelerator feature is in public preview in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), and Singapore. Make sure that the bucket you create is in one of these regions and is in the same region as your Elastic GPU Service instance. This guide uses the China (Hangzhou) region as an example.
1. In the OSS console, go to the Buckets page and click Create Bucket.
2. In the Create Bucket panel, follow the on-screen prompts to create the bucket.
3. On the Overview page of the bucket, find the Port section and copy the Endpoint for Access from ECS over the VPC (internal network). You will use this later to upload datasets and checkpoints to the bucket.
Enable the OSS accelerator and get its endpoint
1. In the OSS console, go to the Buckets page, select the target bucket, and then in the left-side navigation pane, choose Bucket Settings > OSS Accelerator. This takes you to the OSS Accelerator page.
2. Click Create Accelerator. In the Create Accelerator panel, set the accelerator capacity. This example uses 500 GB. Click Next.
3. Select Paths for the acceleration policy, configure the acceleration path to the dataset directory, and then click OK. Follow the prompts to finish creating the accelerator.
  
  Enter dataset/ for the accelerated path and select synchronous pre-warming. When synchronous pre-warming is enabled, data written by the client through the accelerated endpoint is simultaneously written to both the OSS bucket and the OSS accelerator, ensuring lower latency for subsequent reads.
4. On the OSS accelerator page, copy the accelerated endpoint, which is used to download the dataset during training.
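
In the rest of this guide, the dataset under dataset/ is read through the accelerated endpoint during training, while checkpoints are written through the internal endpoint. A minimal sketch of that choice for reads is shown below. The helper `read_endpoint_for` is hypothetical, and it assumes that the Paths policy behaves like a plain key-prefix match; the endpoint values are the examples used in this guide.

```python
# Example values taken from this guide; replace them with your own endpoints.
INTERNAL_ENDPOINT = "oss-cn-hangzhou-internal.aliyuncs.com"
ACCELERATED_ENDPOINT = "cn-hangzhou-j-internal.oss-data-acc.aliyuncs.com"
ACCELERATED_PATHS = ("dataset/",)  # the acceleration path configured above


def read_endpoint_for(object_key, accelerated_paths=ACCELERATED_PATHS):
    """Pick the accelerated endpoint for keys under an accelerated path.

    Assumption: the Paths policy is treated as a plain key-prefix match.
    """
    if object_key.startswith(tuple(accelerated_paths)):
        return ACCELERATED_ENDPOINT
    return INTERNAL_ENDPOINT
```

For example, `read_endpoint_for("dataset/imagenet/ILSVRC/Data/train/n04487081/a.pt")` returns the accelerated endpoint, and `read_endpoint_for("checkpoints/resnet18.pt")` returns the internal endpoint.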

Step 3: Train the model

The following steps guide you through configuring the training environment, uploading the dataset, and using the accelerated endpoint to accelerate model training on the Elastic GPU Service instance.

Note

For the complete code project, see demo.tar.gz.
All of the following steps must be executed with root privileges. Switch to the root user before you begin training.

Configure the training environment
1. Prepare the conda environment and configure dependencies.
  1. Run the following command to download and install conda.
```
curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh && bash /tmp/miniconda.sh -b -p /opt/conda/ && rm /tmp/miniconda.sh && /opt/conda/bin/conda clean -tipy && export PATH=/opt/conda/bin:$PATH  && conda init bash && source ~/.bashrc && conda update conda 
```
  2. Run the vim environment.yaml command to create and open a conda environment file named environment.yaml. Add the following configuration and save the file.
```
name: py312
channels:
  - defaults
  - conda-forge
  - pytorch
dependencies:
  - python=3.12
  - pytorch>=2.5.0
  - torchvision 
  - torchaudio 
  - transformers 
  - torchdata
  - oss2
```
  3. Run the following command to create a conda environment named py312 from the environment file.
```
conda env create -f environment.yaml
```
  4. Run the conda activate py312 command to activate the conda environment named py312.
    
    Important
    Perform the following steps in the activated conda environment.
2. Configure environment variables.
  
  Run the following command to configure the credentials required for uploading the dataset. Replace <ACCESS_KEY_ID> and <ACCESS_KEY_SECRET> with the AccessKey ID and AccessKey secret of your RAM user. For information about how to create an AccessKey ID and AccessKey secret, see Create an AccessKey.
```
export OSS_ACCESS_KEY_ID=<ACCESS_KEY_ID>
export OSS_ACCESS_KEY_SECRET=<ACCESS_KEY_SECRET>
```
3. Configure the OSS Connector.
  1. Run the following command to install the OSS Connector.
```
pip install osstorchconnector
```
  2. Run the following command to create a credentials configuration file.
```
mkdir -p /root/.alibabacloud && touch /root/.alibabacloud/credentials
```
  3. Run the vim /root/.alibabacloud/credentials command to open the configuration file. Add the following configuration and save the file. For more information about OSS Connector configurations, see Configure OSS Connector for AI/ML.
    
    The following is an example configuration that uses an AccessKey ID and AccessKey secret as credentials. Replace the masked values of AccessKeyId and AccessKeySecret with the AccessKey ID and AccessKey secret of your RAM user. For information about how to create an AccessKey ID and AccessKey secret, see Create an AccessKey.
```
{
  "AccessKeyId": "LTAI************************",
  "AccessKeySecret": "At32************************"
}
```
  4. Run the following command to set read-only permissions on the credentials file to protect your AccessKey ID and AccessKey secret.
```
chmod 400 /root/.alibabacloud/credentials
```
  5. Run the following command to create an OSS Connector configuration file.
```
mkdir -p /etc/oss-connector/ && touch /etc/oss-connector/config.json
```
  6. Run the vim /etc/oss-connector/config.json command to open the configuration file. Add the following configuration and save the file. The default configuration is sufficient in most cases.
```
{
    "logLevel": 1,
    "logPath": "/var/log/oss-connector/connector.log",
    "auditPath": "/var/log/oss-connector/audit.log",
    "datasetConfig": {
        "prefetchConcurrency": 24,
        "prefetchWorker": 2
    },
    "checkpointConfig": {
        "prefetchConcurrency": 24,
        "prefetchWorker": 4,
        "uploadConcurrency": 64
    }
}
```
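
Before starting a long training run, it can help to confirm that the credentials file is well-formed and not readable by other users. The following is a minimal sketch, not part of the OSS Connector: `check_credentials_file` is a hypothetical helper, the field names come from the example credentials file above, and the permission check assumes a Linux file system.

```python
import json
import os
import stat


def check_credentials_file(path):
    """Return a list of problems found in an OSS Connector credentials file."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        creds = json.load(f)
    # Both fields from the example configuration must be present and non-empty.
    for field in ("AccessKeyId", "AccessKeySecret"):
        if not creds.get(field):
            problems.append(f"missing or empty field: {field}")
    # Any permission bit for group or others means the secret is exposed.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"file is accessible by group/others (mode {mode:o})")
    return problems


if __name__ == "__main__":
    path = "/root/.alibabacloud/credentials"
    if os.path.exists(path):
        print(check_credentials_file(path) or "credentials file looks fine")
```

An empty list means that both fields are set and that the file mode, such as the 400 set by the chmod command above, grants no access to group or others.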

Prepare the data

Upload the training and validation datasets to the target bucket.

Run the following commands to download the training and validation datasets to the ECS instance. Note that this is a test dataset and not for production use.

```
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20241216/jsnenr/n04487081.tar
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20241218/dxrciv/n10148035.tar
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20241216/senwji/val.tar
```

Run the following commands to extract the downloaded datasets and place them in a dataset directory in the current path.

```
tar -xvf n10148035.tar && tar -xvf n04487081.tar && tar -xvf val.tar
mkdir -p ./dataset/train ./dataset/val
mv n04487081 ./dataset/train/ && mv n10148035 ./dataset/train/ && mv IL*.JPEG ./dataset/val/
```

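Create a file named upload_dataset.py with the following content, and then run the python3 upload_dataset.py command to upload the extracted datasets to the specified bucket. The script preprocesses each image into a 3 x 224 x 224 float32 tensor and uploads the raw bytes of that tensor as a .pt object.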

```
# upload_dataset.py
import os

import oss2
from oss2.credentials import EnvironmentVariableCredentialsProvider
from PIL import Image
from torchvision import transforms

# The internal endpoint for the China (Hangzhou) region is used as an example.
OSS_ENDPOINT = "oss-cn-hangzhou-internal.aliyuncs.com"    # The internal endpoint for OSS access.
OSS_BUCKET_NAME = "<YourBucketName>"    # The name of the target bucket.
BUCKET_REGION = "cn-hangzhou"    # The region of the target bucket.
# OSS_URI_BASE: A custom storage prefix in the OSS bucket.
OSS_URI_BASE = "dataset/imagenet/ILSVRC/Data"


def to_tensor(img_path):
    """Preprocess one image and return it as raw float32 bytes (3 x 224 x 224)."""
    IMG_DIM_224 = 224
    # The random crop and flip run once, at upload time, so every epoch
    # reads the same preprocessed tensor.
    compose = transforms.Compose([
            transforms.RandomResizedCrop(IMG_DIM_224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
    img = Image.open(img_path).convert('RGB')
    img_tensor = compose(img)
    numpy_data = img_tensor.numpy()
    binary_data = numpy_data.tobytes()
    return binary_data


def list_dir(directory):
    """Yield the path of every file under directory, relative to directory."""
    for root, _, files in os.walk(directory):
        rel_root = os.path.relpath(root, start=directory)
        for file in files:
            rel_filepath = os.path.join(rel_root, file) if rel_root != '.' else file
            yield rel_filepath


IMG_DIR_BASE = "./dataset"
"""
    IMG_DIR_BASE is the local path where images are stored. You can use an absolute or relative path.
    The directory structure under this path should match the actual dataset structure, as shown below:
    {IMG_DIR_BASE}/
        train/
            n10148035/
                n10148035_10034.JPEG
                n10148035_10217.JPEG
                ...
            n11879895/
                n11879895_10016.JPEG
                n11879895_10019.JPEG
                ...
            ...
        val/
            ILSVRC2012_val_00000001.JPEG
            ILSVRC2012_val_00000002.JPEG
            ...
"""

# Credentials are read from OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET.
bucket_api = oss2.Bucket(oss2.ProviderAuthV4(EnvironmentVariableCredentialsProvider()), OSS_ENDPOINT, OSS_BUCKET_NAME, region=BUCKET_REGION)

for phase in ["val", "train"]:
    IMG_DIR = "%s/%s" % (IMG_DIR_BASE, phase)
    for img_relative_path in list_dir(IMG_DIR):
        # Keep the relative path as the object key, with a .pt extension.
        img_bin_name = img_relative_path.replace(".JPEG", ".pt")
        object_key = "%s/%s/%s" % (OSS_URI_BASE, phase, img_bin_name)
        bucket_api.put_object(object_key, to_tensor("%s/%s" % (IMG_DIR, img_relative_path)))
```
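
The mapping from a local image path to its object key can be checked without touching OSS. The sketch below reproduces only the naming logic of upload_dataset.py; `object_key_for` is an illustrative name, not a function from the script.

```python
import posixpath

OSS_URI_BASE = "dataset/imagenet/ILSVRC/Data"  # same prefix as upload_dataset.py


def object_key_for(phase, rel_path, base=OSS_URI_BASE):
    """Map a local image path (relative to dataset/<phase>/) to its OSS object key."""
    return posixpath.join(base, phase, rel_path.replace(".JPEG", ".pt"))


if __name__ == "__main__":
    print(object_key_for("train", "n10148035/n10148035_10034.JPEG"))
    print(object_key_for("val", "ILSVRC2012_val_00000001.JPEG"))
```

Every key produced this way starts with dataset/, so the uploaded objects fall under the acceleration path configured in Step 2.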

Download the image dataset label files to build the classification mapping for the dataset.

```
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20241220/izpskr/imagenet_class_index.json
wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20241220/lfilrp/ILSVRC2012_val_labels.json
```
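
Judging from how oss_dataloader.py reads these files later in this guide, imagenet_class_index.json maps a class index to a [synset ID, label] pair, and ILSVRC2012_val_labels.json maps a validation image name to a synset ID. The sketch below shows how the two combine into an integer class for a training or validation object. The sample entries are illustrative and only mirror the expected shape; they are not copied from the downloaded files, and the validation entry is not the real label of that image.

```python
import json

# Illustrative sample entries that mirror the shape oss_dataloader.py expects;
# they are not copied from the downloaded files.
CLASS_INDEX_JSON = '{"0": ["n01440764", "tench"], "1": ["n01443537", "goldfish"]}'
VAL_LABELS_JSON = '{"ILSVRC2012_val_00000001.JPEG": "n01443537"}'

class_index = json.loads(CLASS_INDEX_JSON)
val_labels = json.loads(VAL_LABELS_JSON)

# synset ID -> integer class, built the same way as in ImageCls
syn_to_class = {syn: int(cls) for cls, (syn, _label) in class_index.items()}


def train_class(object_key):
    """Training objects: the synset ID is the parent directory of the key."""
    return syn_to_class[object_key.split("/")[-2]]


def val_class(object_key):
    """Validation objects: look the synset ID up by the original .JPEG name."""
    image_name = object_key.split("/")[-1].split(".")[0] + ".JPEG"
    return syn_to_class[val_labels[image_name]]
```

Training labels therefore come from the directory name in the object key, while validation labels require the extra lookup through the validation label file.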

Training process

Build a utility module to process the ImageNet dataset. This module primarily uses the accelerated endpoint to download the dataset from the OSS accelerator and build a data loader.

oss_dataloader.py

```
# oss_dataloader.py
import json

import numpy as np
import torch
from torch.utils.data import DataLoader


class ImageCls():
    """Maps a synset ID (for example, n10148035) to its integer class index."""

    def __init__(self):
        self.__syn_to_class = {}
        self.__syn_to_label = {}
        with open("imagenet_class_index.json", "rb") as f:
            cls_list = json.load(f)
            for cls, v in cls_list.items():
                syn = v[0]
                label = v[1]
                self.__syn_to_class[syn] = int(cls)
                self.__syn_to_label[int(cls)] = label

    def __len__(self):
        return len(self.__syn_to_label)

    def __getitem__(self, syn):
        cls = self.__syn_to_class[syn]
        return cls


class ImageValSet():
    """Maps a validation image name to its synset ID."""

    def __init__(self):
        self.__val_to_syn = {}
        with open("ILSVRC2012_val_labels.json", "rb") as f:
            val_syn_list = json.load(f)
            for val, syn in val_syn_list.items():
                self.__val_to_syn[val] = syn

    def __getitem__(self, val):
        return self.__val_to_syn[val]


imageCls = ImageCls()
imageValSet = ImageValSet()
IMG_DIM_224 = 224
OSS_URI_BASE = "oss://<YourBucketName>/dataset/imagenet/ILSVRC/Data"
# The accelerated endpoint of the OSS accelerator, used to download the dataset. Replace this with your accelerated endpoint.
ENDPOINT = "cn-hangzhou-j-internal.oss-data-acc.aliyuncs.com"


def obj_to_tensor(object):
    # Each object holds the raw float32 bytes of a 3 x 224 x 224 tensor.
    data = object.read()
    numpy_array_from_binary = np.frombuffer(data, dtype=np.float32).reshape([3, IMG_DIM_224, IMG_DIM_224])
    return torch.from_numpy(numpy_array_from_binary)


def train_tensor_transform(object):
    tensor_from_binary = obj_to_tensor(object)
    # The synset ID is the parent directory in the object key.
    key = object.key
    syn = key.split('/')[-2]
    return tensor_from_binary, imageCls[syn]


def val_tensor_transform(object):
    tensor_from_binary = obj_to_tensor(object)
    # Recover the original image name to look up its synset ID.
    key = object.key
    image_name = key.split('/')[-1].split('.')[0] + ".JPEG"
    return tensor_from_binary, imageCls[imageValSet[image_name]]


def make_oss_dataloader(dataset, batch_size, num_worker, shuffle):
    image_datasets = {
        'train': dataset.from_prefix(OSS_URI_BASE + "/train/", endpoint=ENDPOINT, transform=train_tensor_transform),
        'val': dataset.from_prefix(OSS_URI_BASE + "/val/", endpoint=ENDPOINT, transform=val_tensor_transform),
    }
    dataloaders = {
        'train': DataLoader(image_datasets['train'], batch_size=batch_size, shuffle=shuffle, num_workers=num_worker),
        'val': DataLoader(image_datasets['val'], batch_size=batch_size, shuffle=shuffle, num_workers=num_worker)
    }
    return dataloaders
```
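
Because obj_to_tensor reshapes the raw bytes into a [3, 224, 224] float32 array, every object must have exactly the matching size. The sketch below works out that size with the standard library only; `is_valid_payload` is a hypothetical helper and is not part of oss_dataloader.py.

```python
from array import array

IMG_DIM_224 = 224
CHANNELS = 3
BYTES_PER_FLOAT32 = 4

# Size of one uploaded object: a 3 x 224 x 224 float32 tensor as raw bytes.
EXPECTED_BYTES = CHANNELS * IMG_DIM_224 * IMG_DIM_224 * BYTES_PER_FLOAT32


def is_valid_payload(data):
    """True if `data` has exactly the size obj_to_tensor can reshape."""
    return len(data) == EXPECTED_BYTES


if __name__ == "__main__":
    # Build a zero-filled float32 payload of the right shape and check it.
    payload = array("f", [0.0]) * (CHANNELS * IMG_DIM_224 * IMG_DIM_224)
    print(EXPECTED_BYTES, is_valid_payload(payload.tobytes()))
```

Each preprocessed image is therefore 602,112 bytes. An object of any other size, for example a truncated upload or an image preprocessed at a different resolution, makes the reshape call raise a ValueError.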

Build a utility module for initializing the pre-trained ResNet18 model.

pre_trained_model.py

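```
# pre_trained_model.py
import torch
import torch.nn as nn
from torchvision import models


def make_resnet_model(cls_count=1000):
    device = torch.device("cuda:0")
    # pretrained=True is deprecated since torchvision 0.13 and triggers the
    # UserWarning shown in the training output. It is equivalent to
    # weights=models.ResNet18_Weights.IMAGENET1K_V1.
    model = models.resnet18(pretrained=True)
    # Replace the final fully connected layer to match the number of classes.
    num_ftrs = model.fc.in_features
    model.fc = nn.Linear(num_ftrs, cls_count)
    model = model.to(device)
    # Use all available GPUs when more than one is present.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model, device
```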

Build a utility module for training the ResNet model. This module trains the model based on the given model, data loaders, and number of training epochs.

resnet_train.py

```
# resnet_train.py
import torch
import torch.nn as nn
import torch.optim as optim
from osstorchconnector import OssCheckpoint

OSS_CHECKPOINT_URI = "oss://<YourBucketName>/checkpoints/resnet18.pt"
# The internal endpoint for OSS.
ENDPOINT = "oss-cn-hangzhou-internal.aliyuncs.com"


def train(model, dataloaders, device, epoch_num):
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    criterion = nn.CrossEntropyLoss()
    best_acc = 0.0
    for epoch in range(epoch_num):
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()
            else:
                model.eval()
            running_loss = 0.0
            running_corrects = 0
            # Iterate over the data.
            dataset_size = 0
            for (inputs, labels) in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)
                    # Backpropagate and optimize only in the training phase.
                    if phase == 'train':
                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()
                # Record statistics.
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
                dataset_size += inputs.size(0)
            if phase == 'train':
                exp_lr_scheduler.step()
            epoch_loss = running_loss / dataset_size
            epoch_acc = running_corrects / dataset_size
            print(f'[Epoch {epoch}/{epoch_num - 1}][{phase}] {dataset_size} imgs {epoch_acc}')
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                # Upload the checkpoint to OSS.
                checkpoint = OssCheckpoint(endpoint=ENDPOINT)
                with checkpoint.writer(OSS_CHECKPOINT_URI) as checkpoint_writer:
                    torch.save(model.state_dict(), checkpoint_writer)
```
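
The StepLR scheduler in resnet_train.py multiplies the learning rate by gamma once every step_size epochs. As a worked example of what this means for the run, the sketch below computes the learning rate for each epoch in plain Python, without PyTorch. `lr_at_epoch` is an illustrative helper that restates the schedule's closed form; it is not part of the training code.

```python
BASE_LR = 0.001   # lr passed to optim.SGD in resnet_train.py
STEP_SIZE = 7     # StepLR step_size
GAMMA = 0.1       # StepLR gamma


def lr_at_epoch(epoch, base_lr=BASE_LR, step_size=STEP_SIZE, gamma=GAMMA):
    """Learning rate used during `epoch` (0-based) under a StepLR schedule."""
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    for epoch in range(0, 30, 7):
        print(f"epoch {epoch:2d}: lr = {lr_at_epoch(epoch):.0e}")
```

With the 30 epochs configured in main.py, the learning rate starts at 1e-3, drops by a factor of 10 at epochs 7, 14, 21, and 28, and ends at 1e-7 for the last two epochs.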

Build the main script file, which integrates the utility modules to start model training.

main.py

```
# main.py
from osstorchconnector import OssMapDataset

from oss_dataloader import make_oss_dataloader
from pre_trained_model import make_resnet_model
from resnet_train import train

# Basic training parameters
NUM_EPOCHS = 30 # epoch number
BATCH_SIZE = 64 # batch size
NUM_WORKER = 4 # data loader worker number

# Use the pre-trained resnet18 model.
model, device = make_resnet_model()
# Use the OssMapDataset to construct a data loader.
dataloaders = make_oss_dataloader(OssMapDataset, BATCH_SIZE, NUM_WORKER, True)
# Call the main training process.
train(model, dataloaders, device, NUM_EPOCHS)
```

Run the python3 main.py command to start model training. The following output indicates a successful start:

(py312) root@xxx :~# python3 main.py
/opt/conda/envs/py312/lib/python3.12/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn()
/opt/conda/envs/py312/lib/python3.12/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44.7M/44.7M [00:10<00:00, 4.40MB/s]
2024/12/20 13:36:34.853415| INFO |th=00000000000000000|dataset.cpp:222|new_oss_dataset:new oss dataset, uuid: 92251a5f-7f00-45b5-aaba-b8894ab15b0d id: 0, total: 1, pid: 20655, endpoint: cn-hangzhou-internal.oss-data-acc.aliyuncs.com
2024/12/20 13:36:34.853437| INFO |th=00000000000000000|dataset.cpp:230|new_oss_dataset:[cred_path=/root/.alibabacloud/credentials][config_path=/etc/oss-connector/config.json]
2024/12/20 13:36:34.853518| INFO |th=00000000000000000|dataset.cpp:255|new_oss_dataset:set log level: 1
2024/12/20 13:36:34.853591| INFO |th=00000000000000000|dataset.cpp:262|new_oss_dataset:set log path: /var/log/oss-connector/connector_log_0
/root/oss_dataloader.py:51: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/conda/feedstock_root/build_artifacts/libtorch_1733624414920/work/torch/csrc/utils/tensor_numpy.cpp:206.)
  return torch.from_numpy(numpy_array_from_binary)
/root/oss_dataloader.py:51: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/conda/feedstock_root/build_artifacts/libtorch_1733624414920/work/torch/csrc/utils/tensor_numpy.cpp:206.)
  return torch.from_numpy(numpy_array_from_binary)
/root/oss_dataloader.py:51: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/conda/feedstock_root/build_artifacts/libtorch_1733624414920/work/torch/csrc/utils/tensor_numpy.cpp:206.)
  return torch.from_numpy(numpy_array_from_binary)

Verify the result

Go to the Buckets page, select the target bucket, and click Object Management > Objects. Verify that the resnet18.pt file exists in the checkpoints directory, which confirms that the checkpoint was successfully uploaded to OSS.
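
The same check can be scripted from the ECS instance instead of the console. The sketch below assumes the oss2 package installed earlier and the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables. `checkpoint_exists` and `make_bucket` are hypothetical helpers; the client is built the same way as in upload_dataset.py.

```python
CHECKPOINT_KEY = "checkpoints/resnet18.pt"  # key part of OSS_CHECKPOINT_URI


def checkpoint_exists(bucket, key=CHECKPOINT_KEY):
    """Return True if `key` exists; `bucket` is an oss2.Bucket-like object."""
    return bool(bucket.object_exists(key))


def make_bucket(bucket_name, endpoint="oss-cn-hangzhou-internal.aliyuncs.com",
                region="cn-hangzhou"):
    """Build a bucket client the same way upload_dataset.py does."""
    import oss2  # imported lazily so checkpoint_exists has no hard dependency
    from oss2.credentials import EnvironmentVariableCredentialsProvider

    auth = oss2.ProviderAuthV4(EnvironmentVariableCredentialsProvider())
    return oss2.Bucket(auth, endpoint, bucket_name, region=region)


# Usage on the ECS instance (requires oss2 and the OSS_ACCESS_KEY_* variables):
#   print(checkpoint_exists(make_bucket("<YourBucketName>")))
```

The check uses the internal endpoint because checkpoints are stored under checkpoints/, outside the dataset/ acceleration path.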