AI acceleration: Configure a staging environment

Before using PAI-TorchAcc for training acceleration, you must prepare a training environment that meets the required specifications. You can create a Data Science Workshop (DSW) instance in PAI or use an existing Elastic Compute Service (ECS) instance for this purpose. This topic describes the environment requirements for TorchAcc training acceleration.

Environment requirements

To use TorchAcc for training acceleration, you must use a GPU-accelerated instance. The required versions and specifications are listed below.

Version requirements
Driver	Version
CUDA driver	11.3 or later
NVIDIA driver	470 or later

Specifications
Instance type	Supported
V100M16	Supported
V100M32	Supported
GU50	Supported
GU100	Supported
GU108	Supported
A10M24	Supported
For more information about instance types, see Appendix: List of public resource specifications.
Image requirements
To use TorchAcc for training acceleration, you must use the specified TorchAcc test runtime image: registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219.
Note
This image environment is available only in the China (Hangzhou) region.
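
The version requirements above can be checked on a GPU instance before you pull the image. The following is a minimal sketch, not an official check: the helper name `version_ge` is hypothetical, it relies on GNU `sort -V`, and it assumes that the CUDA version the driver supports appears in the `nvidia-smi` banner as `CUDA Version: x.y`.

```shell
#!/usr/bin/env bash
# Sketch: compare the installed NVIDIA driver and CUDA versions with the minimums above.

# version_ge A B: succeed if dotted version A is greater than or equal to B.
# Relies on GNU `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

MIN_DRIVER=470   # minimum NVIDIA driver version from the table above
MIN_CUDA=11.3    # minimum CUDA driver version from the table above

if command -v nvidia-smi >/dev/null 2>&1; then
    # Driver version of the first GPU, for example 470.161.03.
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    # CUDA version as printed in the nvidia-smi banner (assumed format).
    cuda=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p' | head -n 1)

    if version_ge "$driver" "$MIN_DRIVER"; then
        echo "NVIDIA driver $driver: OK"
    else
        echo "NVIDIA driver $driver is older than $MIN_DRIVER"
    fi
    if [ -n "$cuda" ] && version_ge "$cuda" "$MIN_CUDA"; then
        echo "CUDA $cuda: OK"
    else
        echo "CUDA version '${cuda:-unknown}' does not meet $MIN_CUDA"
    fi
else
    echo "nvidia-smi not found; is the NVIDIA driver installed?"
fi
```

The check only prints a report and changes nothing on the instance, so it can be rerun after a driver upgrade.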

Prepare the staging environment

Use a DSW environment

To test TorchAcc on the PAI platform, you can create a DSW instance and perform tests in a Jupyter Notebook as follows:

1. In the China (Hangzhou) region, create a resource group and add a resource quota that meets the environment requirements. For more information, see Create a resource group and purchase general computing resources and General computing resource quotas.

2. Go to the workspace that is associated with the resource quota and create a DSW instance. Configure the key parameters as follows. For more information, see Create and manage DSW instances.

Parameter	Description
Resource Type	Select Resource Quota.
Resource Quota	Select the resource quota that you created in Step 1.
Resource Specifications	Set CPU (Cores) to 30. Set Memory (GiB) to 180. Set Shared Memory (GiB) to 100. Set GPU (Cards) to 1.
Image	On the Registry Address tab, set the registry address to `registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219`. Note: This image is available only in the China (Hangzhou) region.
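
After the instance starts, you can compare what it reports with the values configured above. The following sketch is meant to be run in a DSW terminal. The helper `check_min` is hypothetical, and the thresholds (100 GiB of shared memory, 1 GPU) are copied from the Resource Specifications row. CPU and memory are left out because the values reported inside a container may not match the configured quota.

```shell
#!/usr/bin/env bash
# Sketch: compare what the instance reports with the configured resource specifications.

# check_min LABEL ACTUAL REQUIRED: report whether integer ACTUAL is at least REQUIRED.
check_min() {
    if [ "$2" -ge "$3" ]; then
        echo "$1: $2 (OK, need >= $3)"
    else
        echo "$1: $2 (LOW, need >= $3)"
        return 1
    fi
}

# Shared memory size in whole GiB, derived from the 1 KiB block total that df reports.
if [ -d /dev/shm ]; then
    shm_kib=$(df -Pk /dev/shm | awk 'NR==2 {print $2}')
    check_min "Shared Memory (GiB)" "$((shm_kib / 1024 / 1024))" 100 || true
fi

# Number of visible GPUs: `nvidia-smi -L` prints one line per GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
    check_min "GPU (Cards)" "$(nvidia-smi -L | wc -l)" 1 || true
fi
```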

Use your own ECS instance

To use your own ECS resources for TorchAcc training acceleration, you can prepare an ECS instance as follows:

1. In the China (Hangzhou) region, purchase an ECS instance that meets the environment requirements and install the required versions of the NVIDIA driver and CUDA. For more information about how to purchase an instance, see Create an instance. The key parameters are as follows:
   - Set Instance to ecs.gn6v-c8g1.2xlarge.
   - For Image, select Public Image > Alibaba Cloud Linux > Alibaba Cloud Linux 3.2104 LTS 64-bit. Select the Install GPU Driver check box and then select the following versions: CUDA Version 11.4.1 > Driver Version 470.161.03 > CUDNN Version 8.2.4.
   - For System Disk, allocate at least 80 GiB of storage capacity.
2. Install Docker on the ECS instance. For more information, see Install and use Docker and Docker Compose.
3. Install the NVIDIA Container Toolkit. For more information, see Installing the NVIDIA Container Toolkit. Select the installation command for your operating system. The steps in this topic use yum or dnf. After the installation is complete, restart the Docker daemon.
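
The Container Toolkit installation described above might look like the following on a dnf-based system. This is a sketch under assumptions rather than the documented procedure: the repository URL and the `nvidia-ctk` call follow NVIDIA's installation guide at the time of writing, and whether that repository supports Alibaba Cloud Linux 3 should be confirmed in the guide. The `SUDO` variable and the function name are hypothetical conveniences.

```shell
#!/usr/bin/env bash
# Sketch only: install the NVIDIA Container Toolkit with dnf and restart Docker.
# SUDO is a hypothetical knob: set SUDO=echo to print the commands instead of running them.
SUDO=${SUDO:-sudo}

# Repository definition published by NVIDIA (verify against the current installation guide).
REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
REPO_FILE=/etc/yum.repos.d/nvidia-container-toolkit.repo

install_nvidia_container_toolkit() {
    # Register NVIDIA's package repository.
    $SUDO curl -s -L -o "$REPO_FILE" "$REPO_URL" || return 1
    # Install the toolkit package.
    $SUDO dnf install -y nvidia-container-toolkit || return 1
    # Add the NVIDIA runtime to Docker's daemon configuration.
    $SUDO nvidia-ctk runtime configure --runtime=docker || return 1
    # Restart the Docker daemon so that the new runtime takes effect.
    $SUDO systemctl restart docker
}
```

Calling `install_nvidia_container_toolkit` performs the installation. Running it once with `SUDO=echo` first shows the exact commands without changing the system.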

You can run the following script to start the TorchAcc runtime image.

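# TorchAcc runtime image (available only in the China (Hangzhou) region).
DOCKER=registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219
name=TorchAcc_Tutorials

# Print each command before it runs.
set -x
# Start an interactive container with access to all GPUs.
# Replace /path/to/code with the directory that contains your training code.
docker run \
    --name "$name" \
    --rm -it \
    --privileged \
    --ulimit memlock=-1:-1 \
    --gpus all \
    --shm-size 10G \
    -v /dev/shm:/dev/shm \
    --ipc host \
    --network host \
    --cap-add=CAP_SYS_ADMIN \
    -v /path/to/code:/workspace \
    -w /workspace \
    "${DOCKER}" bash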
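
Before you start an interactive session, a short non-interactive run can confirm that containers created from the image can see the GPU. The following is a sketch, assuming that `python` is on the image's PATH and that the image ships PyTorch, as its tag suggests. `DOCKER_BIN` and `gpu_smoke_test` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a container started from the TorchAcc image can see the GPU.
# DOCKER_BIN is a hypothetical override: set DOCKER_BIN=echo to print the commands only.
DOCKER_BIN=${DOCKER_BIN:-docker}
IMAGE=registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219

gpu_smoke_test() {
    # The driver tools should be visible inside the container.
    "$DOCKER_BIN" run --rm --gpus all "$IMAGE" nvidia-smi || return 1
    # PyTorch should report its version and whether CUDA is available.
    "$DOCKER_BIN" run --rm --gpus all "$IMAGE" \
        python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
}
```

Call `gpu_smoke_test` on the ECS instance. If the first command fails, the NVIDIA Container Toolkit or the Docker runtime configuration is the likely cause rather than the image.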
