An OSS accelerator significantly improves dataset loading speed, which in turn accelerates overall model training. This article compares training performance with and without an OSS accelerator and shows that data loading efficiency limits training speed until GPU utilization becomes the bottleneck. It also provides a tutorial on fine-tuning a pre-trained ResNet-18 model on the ImageNet ILSVRC dataset, showing how to use an OSS accelerator with an Elastic GPU Service instance to speed up model training.
Performance benefits
An OSS accelerator offers significant performance advantages over a standard OSS bucket. Its low latency helps achieve higher throughput with fewer workers. In our tests, the OSS accelerator boosted training efficiency by 40% to 400%, substantially reducing compute resource consumption and costs.
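To put the 40% to 400% range in concrete terms, the following sketch converts an efficiency gain into wall-clock time and GPU-hour savings. It assumes that a gain of X% means throughput rises to (1 + X/100) times the baseline, which is only one reading of the figures above, and the 10-hour baseline job is hypothetical.

```python
def accelerated_hours(baseline_hours: float, gain_percent: float) -> float:
    """Training time after a throughput gain of gain_percent.

    Assumption: a gain of X% means throughput becomes (1 + X/100) times
    the baseline, so wall-clock time shrinks by the same factor.
    """
    if gain_percent < 0:
        raise ValueError("gain_percent must be non-negative")
    return baseline_hours / (1 + gain_percent / 100)


baseline = 10.0  # hypothetical job: 10 GPU-hours without the accelerator
for gain in (40, 400):
    hours = accelerated_hours(baseline, gain)
    saved = 100 * (1 - hours / baseline)
    print(f"+{gain}% throughput: {hours:.2f} h ({saved:.0f}% fewer GPU-hours)")
```

Under this reading, a 40% gain shortens the hypothetical 10-hour job to roughly 7.1 hours, and a 400% gain shortens it to 2 hours.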
Solution overview
This solution involves three steps:
- Create an Elastic GPU Service instance: Create an Elastic GPU Service instance appropriate for your model training workload.
- Create an OSS bucket and enable the OSS accelerator: Create an OSS bucket to store your data, enable the OSS accelerator, and retrieve the bucket's internal and accelerated endpoints for the training task.
- Train the model: After preparing the resources, preprocess the original dataset and upload it to OSS. During training, use the OSS accelerator to load the dataset and start training.
Procedure
Step 1: Create an Elastic GPU Service instance
The following steps show how to create and connect to an Elastic GPU Service instance for model training. The instance uses the ecs.gn6i-c4g1.xlarge instance type, the Ubuntu 22.04 operating system, and CUDA version 12.4.1. When you customize the instance configuration, make sure to select the latest CUDA version.
1. Create an Elastic GPU Service instance
- Click the Custom Launch tab.
- Configure the billing method, region, network, zone, instance type, image, and other parameters as needed, and then create the instance. For more information about each parameter, see Parameters.
  Important: The OSS accelerator feature is in public preview in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), and Singapore. Make sure your Elastic GPU Service instance is in one of these regions. This guide uses the China (Hangzhou) region as an example.
- In this guide, the ECS instance type is ecs.gn6i-c4g1.xlarge. This is for reference only.
  On the instance type selection page, click the All Generations tab, enter the instance type in the search box, and select GPU-accelerated Compute gn6i. This instance type provides 4 vCPUs, 15 GiB of memory, and one NVIDIA T4 GPU (16 GB VRAM).
- Select an operating system image. This guide uses Ubuntu 22.04 as an example. Select the Auto-install GPU Driver checkbox and select CUDA Version 12.4.1. This automatically installs the CUDA environment at startup, eliminating the need for manual configuration.
  For the driver version, select 550.90.07. For the CUDNN version, select 9.2.0.82.
2. Connect to the Elastic GPU Service instance
- In the ECS console, go to the Instance page. Find the ECS instance you created by region and instance ID, and click Connect in the Actions column.
- In the Remote connection dialog box, click Sign in now for Workbench.
- In the Instance Login dialog box, set Authentication to the method you chose when creating the instance. For example, select SSH Key Authentication, enter the username, and upload the private key file from the key pair you created. Click Log On to log on to the ECS instance.
  Note: The private key file is automatically downloaded to your local machine when you create a key pair. Check your browser's download history to find the .pem private key file.
- A successful logon displays the following output, and the CUDA driver installation starts automatically. Wait for the installation to complete.
    Welcome to Alibaba Cloud Elastic Compute Service !

    Last login: Fri Dec 13 14:15:04 2024 from 100.104.86.0
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100    20  100    20    0     0    645      0 --:--:-- --:--:-- --:--:--   645
    CHECKING AUTO INSTALL, DRIVER VERSION=550.90.07 CUDA_VERSION=12.4.1 CUDNN VERSION=9.2.0.82 , INSTALL RDMA=FALSE, INSTALL eRDMA=FALSE, PLEASE WAIT ......
    The script automatically downloads and installs a NVIDIA GPU driver and CUDA, CUDNN library. if you choose install RDMA or ERDMA, RDMA or ERDMA software will install. if you choose install perseus, perseus environment will install as well.
    1. The installation takes 15 to 20 minutes, depending on the intranet bandwidth and the quantity of vCPU cores of the instance. Please do not operate the GPU or install any GPU-related software until the GPU driver is installed successfully.
    2. After the GPU is installed successfully, the instance will restarts automatically.
    CUDA-12.4.1 downloading, it takes 3 minutes or more. Remaining installation time 14 - 19 minutes!
Step 2: Prepare the OSS bucket and accelerator
These steps show how to create a bucket in the same region as the Elastic GPU Service instance to store datasets and enable the OSS accelerator to improve data access speed. Note that no traffic fees are incurred if the ECS instance and the bucket are in the same region and you access the bucket through its internal endpoint.
- Create a bucket and get the internal endpoint
  Important: The OSS accelerator feature is in public preview in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Ulanqab), China (Shenzhen), and Singapore. Make sure that the bucket you create is in one of these regions and is in the same region as your Elastic GPU Service instance. This guide uses the China (Hangzhou) region as an example.
  - In the OSS console, go to the Buckets page and click Create Bucket.
  - In the Create Bucket panel, follow the on-screen prompts to create the bucket.
  - On the Overview page of the bucket, find the Port section and copy the Endpoint for Access from ECS over the VPC (internal network). You will use this later to upload datasets and checkpoints to the bucket.
- Enable the OSS accelerator and get its endpoint
  - In the OSS console, go to the Buckets page, select the target bucket, and then open the OSS Accelerator page from the left-side navigation pane.
  - Click Create Accelerator. In the Create Accelerator panel, set the accelerator capacity. This example uses 500 GB. Click Next.
  - Select Paths for the acceleration policy, configure the acceleration path to the dataset directory, and then click OK. Follow the prompts to finish creating the accelerator.
    Enter dataset/ for the accelerated path and select synchronous pre-warming. When synchronous pre-warming is enabled, data written by the client through the accelerated endpoint is simultaneously written to both the OSS bucket and the OSS accelerator, ensuring lower latency for subsequent reads.
  - On the OSS accelerator page, copy the accelerated endpoint, which is used to download the dataset during training.
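The two endpoints collected in this step differ only in their host name pattern. The sketch below derives both from a region ID. The patterns are generalized from the China (Hangzhou) values that appear in this guide, and the region IDs are an assumed mapping of the listed regions, so treat the values copied from the console as authoritative.

```python
# Region IDs assumed for the regions this guide lists as supporting the
# OSS accelerator public preview (the mapping is added for illustration).
ACCELERATOR_REGIONS = {
    "cn-hangzhou", "cn-shanghai", "cn-beijing",
    "cn-wulanchabu", "cn-shenzhen", "ap-southeast-1",
}


def internal_endpoint(region_id: str) -> str:
    """Internal (VPC) endpoint, following the Hangzhou example in this guide."""
    return f"oss-{region_id}-internal.aliyuncs.com"


def accelerated_endpoint(region_id: str) -> str:
    """Accelerated endpoint, following the Hangzhou value in the training log."""
    if region_id not in ACCELERATOR_REGIONS:
        raise ValueError(f"OSS accelerator is not listed for region {region_id}")
    return f"{region_id}-internal.oss-data-acc.aliyuncs.com"


print(internal_endpoint("cn-hangzhou"))
print(accelerated_endpoint("cn-hangzhou"))
```

Keeping the two host names in one place makes it easy to switch a training run between the standard and accelerated paths when comparing them.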
Step 3: Train the model
The following steps guide you through configuring the training environment, uploading the dataset, and using the accelerated endpoint to accelerate model training on the Elastic GPU Service instance.
- For the complete code project, see demo.tar.gz.
- All of the following steps must be executed with root privileges. Switch to the root user before you begin training.
- Configure the training environment
  - Prepare the conda environment and configure dependencies.
    - Run the following command to download and install conda.
        curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh \
          && bash /tmp/miniconda.sh -b -p /opt/conda/ \
          && rm /tmp/miniconda.sh \
          && /opt/conda/bin/conda clean -tipy \
          && export PATH=/opt/conda/bin:$PATH \
          && conda init bash \
          && source ~/.bashrc \
          && conda update conda
    - Run the vim environment.yaml command to create and open a conda environment file named environment.yaml. Add the following configuration and save the file.
        name: py312
        channels:
          - defaults
          - conda-forge
          - pytorch
        dependencies:
          - python=3.12
          - pytorch>=2.5.0
          - torchvision
          - torchaudio
          - transformers
          - torchdata
          - oss2
    - Run the following command to create a conda environment named py312 from the environment file.
        conda env create -f environment.yaml
    - Run the conda activate py312 command to activate the conda environment named py312.
      Important: Perform the following steps in the activated conda environment.
Configure environment variables.
Run the following command to configure the credentials required for uploading the dataset. Replace
<ACCESS_KEY_ID>and<ACCESS_KEY_SECRET>with the AccessKey ID and AccessKey secret of your RAM user. For information about how to create an AccessKey ID and AccessKey secret, see Create an AccessKey.export OSS_ACCESS_KEY_ID=<ACCESS_KEY_ID> export OSS_ACCESS_KEY_SECRET=<ACCESS_KEY_SECRET> -
Configure the OSS Connector.
-
Run the following command to install the OSS Connector.
pip install osstorchconnector -
Run the following command to create a credentials configuration file.
mkdir -p /root/.alibabacloud && touch /root/.alibabacloud/credentials -
Run the
vim /root/.alibabacloud/credentialscommand to open the configuration file. Add the following configuration and save the file. For more information about OSS Connector configurations, see Configure OSS Connector for AI/ML.The following is an example configuration that uses an AccessKey ID and AccessKey secret as credentials. Replace
<AccessKeyId>and<AccessKeySecret>with the AccessKey ID and AccessKey secret of your RAM user. For information about how to create an AccessKey ID and AccessKey secret, see Create an AccessKey.{ "AccessKeyId": "LTAI************************", "AccessKeySecret": "At32************************" } -
Run the following command to set read-only permissions on the credentials file to protect your AccessKey ID and AccessKey secret.
chmod 400 /root/.alibabacloud/credentials -
Run the following command to create an OSS Connector configuration file.
mkdir -p /etc/oss-connector/ && touch /etc/oss-connector/config.json -
Run the
vim /etc/oss-connector/config.jsoncommand to open the configuration file. Add the following configuration and save the file. The default configuration is sufficient in most cases.{ "logLevel": 1, "logPath": "/var/log/oss-connector/connector.log", "auditPath": "/var/log/oss-connector/audit.log", "datasetConfig": { "prefetchConcurrency": 24, "prefetchWorker": 2 }, "checkpointConfig": { "prefetchConcurrency": 24, "prefetchWorker": 4, "uploadConcurrency": 64 } }
-
-
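Before moving on, it can help to confirm that the environment set up above is complete. The following is a minimal preflight sketch, not part of the demo project: it checks the two environment variables, the packages used in this guide, the credentials file (its keys and permission bits), and the two sections of the connector configuration shown above. The function names are hypothetical, and the checks reflect only what this guide configures rather than what the OSS Connector itself validates.

```python
import importlib.util
import json
import os
import stat

CRED_PATH = "/root/.alibabacloud/credentials"
CONFIG_PATH = "/etc/oss-connector/config.json"


def missing_env(names=("OSS_ACCESS_KEY_ID", "OSS_ACCESS_KEY_SECRET")):
    """Return the environment variables that are unset or empty."""
    return [name for name in names if not os.environ.get(name)]


def missing_modules(names=("torch", "torchvision", "oss2", "osstorchconnector")):
    """Return the modules that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def _load_json(path):
    """Return (data, problems) for a file that should hold a JSON object."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, ValueError) as exc:
        return None, [f"{path}: {exc}"]
    if not isinstance(data, dict):
        return None, [f"{path}: expected a JSON object"]
    return data, []


def check_credentials(path=CRED_PATH):
    """Check that both keys are present and only the owner can access the file."""
    data, problems = _load_json(path)
    if data is None:
        return problems
    for key in ("AccessKeyId", "AccessKeySecret"):
        if not data.get(key):
            problems.append(f"{path}: missing {key}")
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"{path}: mode {oct(mode)} lets group or others access the file")
    return problems


def check_connector_config(path=CONFIG_PATH):
    """Check that the two sections from the example configuration are present."""
    data, problems = _load_json(path)
    if data is None:
        return problems
    for section in ("datasetConfig", "checkpointConfig"):
        if not isinstance(data.get(section), dict):
            problems.append(f"{path}: missing section {section}")
    return problems


def preflight():
    """Collect every problem found; an empty list means all checks passed."""
    problems = [f"environment variable {n} is not set" for n in missing_env()]
    problems += [f"module {n} is not installed" for n in missing_modules()]
    return problems + check_credentials() + check_connector_config()


if __name__ == "__main__":
    for line in preflight() or ["all checks passed"]:
        print(line)
```

Run the script inside the activated py312 environment as the root user, so that it sees the same packages and files as the training job.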
- Prepare the data
  - Upload the training and validation datasets to the target bucket.
    - Run the following commands to download the training and validation datasets to the ECS instance. Note that this is a test dataset and not for production use.
        wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20241216/jsnenr/n04487081.tar
        wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20241218/dxrciv/n10148035.tar
        wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20241216/senwji/val.tar
    - Run the following commands to extract the downloaded datasets and place them in a dataset directory in the current path.
        tar -zxvf n10148035.tar && tar -zxvf n04487081.tar && tar -zxvf val.tar
        mkdir dataset && mkdir ./dataset/train && mkdir ./dataset/val
        mv n04487081 ./dataset/train/ && mv n10148035 ./dataset/train/ && mv IL*.JPEG ./dataset/val/
    - Run the python3 upload_dataset.py script to upload the extracted datasets to the specified bucket.
        # upload_dataset.py
        from torchvision import transforms
        from PIL import Image
        import oss2
        import os
        from oss2.credentials import EnvironmentVariableCredentialsProvider

        # The internal endpoint for the China (Hangzhou) region is used as an example.
        OSS_ENDPOINT = "oss-cn-hangzhou-internal.aliyuncs.com"  # The internal endpoint for OSS access.
        OSS_BUCKET_NAME = "<YourBucketName>"  # The name of the target bucket.
        BUCKET_REGION = "cn-hangzhou"  # The region of the target bucket.
        # OSS_URI_BASE: A custom storage prefix in the OSS bucket.
        OSS_URI_BASE = "dataset/imagenet/ILSVRC/Data"

        def to_tensor(img_path):
            IMG_DIM_224 = 224
            compose = transforms.Compose([
                transforms.RandomResizedCrop(IMG_DIM_224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
            ])
            img = Image.open(img_path).convert('RGB')
            img_tensor = compose(img)
            numpy_data = img_tensor.numpy()
            binary_data = numpy_data.tobytes()
            return binary_data

        def list_dir(directory):
            for root, _, files in os.walk(directory):
                rel_root = os.path.relpath(root, start=directory)
                for file in files:
                    rel_filepath = os.path.join(rel_root, file) if rel_root != '.' else file
                    yield rel_filepath

        IMG_DIR_BASE = "./dataset"
        """
        IMG_DIR_BASE is the local path where images are stored. You can use an absolute or relative path.
        The directory structure under this path should match the actual dataset structure, as shown below:
        {IMG_DIR_BASE}/
            train/
                n10148035/
                    n10148035_10034.JPEG
                    n10148035_10217.JPEG
                    ...
                n11879895/
                    n11879895_10016.JPEG
                    n11879895_10019.JPEG
                    ...
                ...
            val/
                ILSVRC2012_val_00000001.JPEG
                ILSVRC2012_val_00000002.JPEG
                ...
        """
        bucket_api = oss2.Bucket(oss2.ProviderAuthV4(EnvironmentVariableCredentialsProvider()), OSS_ENDPOINT, OSS_BUCKET_NAME, region=BUCKET_REGION)

        for phase in ["val", "train"]:
            IMG_DIR = "%s/%s" % (IMG_DIR_BASE, phase)
            for _, img_relative_path in enumerate(list_dir(IMG_DIR)):
                img_bin_name = img_relative_path.replace(".JPEG", ".pt")
                object_key = "%s/%s/%s" % (OSS_URI_BASE, phase, img_bin_name)
                bucket_api.put_object(object_key, to_tensor("%s/%s" % (IMG_DIR, img_relative_path)))
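The upload script stores each image as the raw bytes of a normalized tensor, so the objects carry a .pt suffix but are not torch.save archives. The sketch below mirrors the script's key layout and computes the size each object should have, which is a quick way to spot truncated uploads. It assumes float32 values in little-endian order (what ToTensor produces on an x86 instance), and the helper names are hypothetical.

```python
import struct

OSS_URI_BASE = "dataset/imagenet/ILSVRC/Data"
TENSOR_SHAPE = (3, 224, 224)  # channels, height, width after RandomResizedCrop(224)
BYTES_PER_FLOAT32 = 4         # assumption: the tensor dtype is float32


def object_key(phase: str, relative_path: str) -> str:
    """Mirror the key layout used by upload_dataset.py."""
    return "%s/%s/%s" % (OSS_URI_BASE, phase, relative_path.replace(".JPEG", ".pt"))


def expected_object_size() -> int:
    """Number of bytes each uploaded object should contain."""
    channels, height, width = TENSOR_SHAPE
    return channels * height * width * BYTES_PER_FLOAT32


def decode_floats(payload: bytes) -> tuple:
    """Decode raw little-endian float32 bytes into a flat tuple of floats."""
    if len(payload) % BYTES_PER_FLOAT32:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return struct.unpack("<%df" % (len(payload) // BYTES_PER_FLOAT32), payload)


print(object_key("train", "n10148035/n10148035_10034.JPEG"))
print(expected_object_size(), "bytes per object")
```

A loader that reads these objects has to reshape the decoded values back to 3 x 224 x 224 itself, because the raw bytes carry no shape or dtype metadata.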
  - Download the image dataset label files to build the classification mapping for the dataset.
        wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20241220/izpskr/imagenet_class_index.json
        wget https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/en-US/20241220/lfilrp/ILSVRC2012_val_labels.json
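As a rough illustration of how the two label files can yield a class index for each object, the sketch below resolves a label from an object key. It assumes that imagenet_class_index.json uses the common layout of index mapped to a [WordNet ID, class name] pair, and that ILSVRC2012_val_labels.json maps a validation file name to a WordNet ID. Neither layout is confirmed by this guide, so inspect the downloaded files first. The validation entry below is made up for the example.

```python
def wnid_to_index(class_index: dict) -> dict:
    """Invert {"874": ["n04487081", "trolleybus"], ...} into {"n04487081": 874, ...}."""
    return {wnid: int(idx) for idx, (wnid, _name) in class_index.items()}


def label_for_key(key: str, wnid_index: dict, val_labels: dict) -> int:
    """Resolve the class index for an object key written by upload_dataset.py."""
    parts = key.split("/")
    stem = parts[-1].rsplit(".", 1)[0]
    if "train" in parts[:-1]:
        wnid = parts[-2]                    # train/<wnid>/<file>.pt
    else:
        wnid = val_labels[stem + ".JPEG"]   # val/<file>.pt, looked up by file name
    return wnid_index[wnid]


# Excerpt in the assumed layout of imagenet_class_index.json.
class_index = {"874": ["n04487081", "trolleybus"], "982": ["n10148035", "groom"]}
# Hypothetical entry in the assumed layout of ILSVRC2012_val_labels.json.
val_labels = {"example_val.JPEG": "n10148035"}

index = wnid_to_index(class_index)
base = "dataset/imagenet/ILSVRC/Data"
print(label_for_key(base + "/train/n04487081/n04487081_1.pt", index, val_labels))
print(label_for_key(base + "/val/example_val.pt", index, val_labels))
```

In practice, load both dictionaries with json.load from the downloaded files instead of the inline samples.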
- Training process
  - Build a utility module to process the ImageNet dataset. This module primarily uses the accelerated endpoint to download the dataset from the OSS accelerator and build a data loader.
  - Build a utility module for initializing the pre-trained ResNet-18 model.
  - Build a utility module for training the ResNet model. This module trains the model based on the given model, data loaders, and number of training epochs.
  - Build the main script file, which integrates the utility modules to start model training.
  - Run the python3 main.py command to start model training. The following output indicates a successful start:
        (py312) root@xxx :~# python3 main.py
        /opt/conda/envs/py312/lib/python3.12/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
          warnings.warn()
        /opt/conda/envs/py312/lib/python3.12/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
          warnings.warn(msg)
        Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
        100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44.7M/44.7M [00:10<00:00, 4.40MB/s]
        2024/12/20 13:36:34.853415| INFO |th=00000000000000000|dataset.cpp:222|new_oss_dataset:new oss dataset, uuid: 92251a5f-7f00-45b5-aaba-b8894ab15b0d id: 0, total: 1, pid: 20655, endpoint: cn-hangzhou-internal.oss-data-acc.aliyuncs.com
        2024/12/20 13:36:34.853437| INFO |th=00000000000000000|dataset.cpp:230|new_oss_dataset:[cred_path=/root/.alibabacloud/credentials][config_path=/etc/oss-connector/config.json]
        2024/12/20 13:36:34.853518| INFO |th=00000000000000000|dataset.cpp:255|new_oss_dataset:set log level: 1
        2024/12/20 13:36:34.853591| INFO |th=00000000000000000|dataset.cpp:262|new_oss_dataset:set log path: /var/log/oss-connector/connector_log_0
        /root/oss_dataloader.py:51: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/conda/feedstock_root/build_artifacts/libtorch_1733624414920/work/torch/csrc/utils/tensor_numpy.cpp:206.)
          return torch.from_numpy(numpy_array_from_binary)
        /root/oss_dataloader.py:51: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/conda/feedstock_root/build_artifacts/libtorch_1733624414920/work/torch/csrc/utils/tensor_numpy.cpp:206.)
          return torch.from_numpy(numpy_array_from_binary)
        /root/oss_dataloader.py:51: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /home/conda/feedstock_root/build_artifacts/libtorch_1733624414920/work/torch/csrc/utils/tensor_numpy.cpp:206.)
          return torch.from_numpy(numpy_array_from_binary)
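To see the effect of the accelerator on your own workload, you can time the data loader against the internal endpoint and then against the accelerated endpoint. The following is a minimal, framework-independent sketch: measure_throughput accepts any iterable of batches, and fake_loader is a stand-in that simulates per-batch latency so the example runs without a bucket. Both names are hypothetical, and the harness measures loading only, not GPU compute.

```python
import time
from typing import Iterable, Iterator


def measure_throughput(batches: Iterable, batch_size: int, max_batches: int = 100) -> float:
    """Return the samples per second observed while iterating over a loader."""
    start = time.perf_counter()
    seen = 0
    for seen, _batch in enumerate(batches, start=1):
        if seen >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return seen * batch_size / elapsed if elapsed > 0 else float("inf")


def fake_loader(num_batches: int, delay_seconds: float) -> Iterator[None]:
    """Stand-in loader that sleeps to simulate per-batch fetch latency."""
    for _ in range(num_batches):
        time.sleep(delay_seconds)
        yield None


slow = measure_throughput(fake_loader(5, 0.02), batch_size=32)
fast = measure_throughput(fake_loader(5, 0.005), batch_size=32)
print(f"higher latency: {slow:.0f} samples/s, lower latency: {fast:.0f} samples/s")
```

In a real comparison, replace fake_loader with the data loader built for each endpoint, and keep the batch size and number of workers identical across the two runs.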
- Verify the result
  Go to the Buckets page, select the target bucket, and open its object list. Verify that the resnet18.pt file exists in the checkpoints directory, which confirms that the checkpoint was successfully uploaded to OSS.
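The same check can be scripted. The sketch below relies only on an object_exists(key) method, which oss2.Bucket provides, so it can be tried locally with a stand-in object. The helper and the stand-in class are hypothetical, and the commented lines reuse the example endpoint, bucket placeholder, and region from upload_dataset.py.

```python
def checkpoint_uploaded(bucket, key: str = "checkpoints/resnet18.pt") -> bool:
    """Return True if the checkpoint object exists in the bucket.

    `bucket` is any object exposing object_exists(key), such as oss2.Bucket.
    """
    return bool(bucket.object_exists(key))


class FakeBucket:
    """Local stand-in so the check can be tried without OSS access."""

    def __init__(self, keys):
        self._keys = set(keys)

    def object_exists(self, key):
        return key in self._keys


print(checkpoint_uploaded(FakeBucket({"checkpoints/resnet18.pt"})))

# On the ECS instance, build a real bucket handle instead:
# import oss2
# from oss2.credentials import EnvironmentVariableCredentialsProvider
# bucket = oss2.Bucket(oss2.ProviderAuthV4(EnvironmentVariableCredentialsProvider()),
#                      "oss-cn-hangzhou-internal.aliyuncs.com", "<YourBucketName>",
#                      region="cn-hangzhou")
# print(checkpoint_uploaded(bucket))
```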