Before using PAI-TorchAcc for training acceleration, you must prepare a training environment that meets the required specifications. You can create a Data Science Workshop (DSW) instance in PAI or use an existing Elastic Compute Service (ECS) instance for this purpose. This topic describes the environment requirements for TorchAcc training acceleration.
Environment requirements
To use TorchAcc for training acceleration, you must use a GPU-accelerated instance. The required versions and specifications are listed below.
Version requirements
Component        Version
CUDA             11.3 or later
NVIDIA driver    470 or later
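The driver requirement above can be checked from a terminal. The following is a minimal sketch, not an official tool: `version_ge` and `MIN_DRIVER` are hypothetical names introduced here, and the comparison assumes GNU `sort -V` is available.

```shell
#!/bin/sh
# Sketch: compare the installed NVIDIA driver version with the minimum above.

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Hypothetical helper; relies on version sorting (sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

MIN_DRIVER=470   # minimum NVIDIA driver version from the table above

if command -v nvidia-smi >/dev/null 2>&1; then
  # Driver version reported for the first GPU, for example 470.161.03
  driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  if version_ge "$driver" "$MIN_DRIVER"; then
    echo "NVIDIA driver $driver meets the minimum ($MIN_DRIVER)"
  else
    echo "NVIDIA driver $driver is older than $MIN_DRIVER"
  fi
else
  echo "nvidia-smi not found; the NVIDIA driver may not be installed"
fi
```

The CUDA version that the driver supports is shown in the header of the plain `nvidia-smi` output and can be compared with 11.3 in the same way.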
Specifications
Instance type    Status
V100M16          Supported
V100M32          Supported
GU50             Supported
GU100            Supported
GU108            Supported
A10M24           Supported
For more information about instance types, see Appendix: List of public resource specifications.
Image requirements
To use TorchAcc for training acceleration, you must use the specified TorchAcc test runtime image:
registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219
Note: This image is available only in the China (Hangzhou) region.
Prepare the staging environment
Use a DSW instance
To test TorchAcc on the PAI platform, you can create a DSW instance and perform tests in a Jupyter Notebook as follows:
1. In the China (Hangzhou) region, create a resource group and add a resource quota that meets the environment requirements. For more information, see Create a resource group and purchase general computing resources and General computing resource quotas.
2. Go to the workspace that is associated with the resource quota and create a DSW instance. Configure the key parameters as follows. For more information, see Create and manage DSW instances.

Parameter                  Description
Resource Type              Select Resource Quota.
Resource Quota             Select the resource quota that you created in Step 1.
Resource Specifications    Set CPU (Cores) to 30, Memory (GiB) to 180, Shared Memory (GiB) to 100, and GPU (Cards) to 1.
Image                      On the Registry Address tab, set the registry address to registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219

Note: This image is available only in the China (Hangzhou) region.
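After the DSW instance starts, a quick check in its terminal can confirm that the environment matches the image tag. This is a sketch under assumptions: the variable names are hypothetical, `python3` is assumed to be on the PATH in the image, and the tag is assumed to follow the `torch-<version>-cuda<version>-...` pattern seen above.

```shell
#!/bin/sh
# Sketch: derive the expected PyTorch and CUDA versions from the image tag,
# then print what the running environment actually reports.

IMAGE=registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219

tag=${IMAGE##*:}                               # torch-1.12-cuda11.3-py38-acc-230219
want_torch=$(echo "$tag" | cut -d- -f2)        # second field: PyTorch version
want_cuda=$(echo "$tag" | cut -d- -f3)         # third field: cuda<version>
want_cuda=${want_cuda#cuda}                    # strip the "cuda" prefix
echo "Expected from tag: PyTorch $want_torch, CUDA $want_cuda"

if command -v python3 >/dev/null 2>&1 && python3 -c 'import torch' 2>/dev/null; then
  # Report the installed versions and whether a GPU is visible to PyTorch.
  python3 -c 'import torch; print("Found:", torch.__version__, "CUDA", torch.version.cuda, "GPU available:", torch.cuda.is_available())'
else
  echo "PyTorch is not importable here; run this check inside the DSW instance"
fi
```

If the reported versions differ from the tag, or no GPU is visible, recheck the Image and Resource Specifications settings of the instance.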
Use your own ECS instance
To use your own ECS resources for TorchAcc training acceleration, you can prepare an ECS instance as follows:
1. In the China (Hangzhou) region, purchase an ECS instance that meets the environment requirements, and install the required versions of the NVIDIA driver and CUDA. For more information about how to purchase an instance, see Create an instance. The key parameters are as follows:
   - Set Instance to ecs.gn6v-c8g1.2xlarge.
   - For Image, select Public Image > Alibaba Cloud Linux > Alibaba Cloud Linux 3.2104 LTS 64-bit. Select the Install GPU Driver check box and then select the following versions: CUDA Version 11.4.1 > Driver Version 470.161.03 > CUDNN Version 8.2.4.
   - For System Disk, allocate at least 80 GiB of storage capacity.
2. Install Docker on the ECS instance. For more information, see Install and use Docker and Docker Compose.
3. Install the NVIDIA Container Toolkit. For more information, see Installing the NVIDIA Container Toolkit. Select the installation command for your operating system. The steps in this topic use yum or dnf. After the installation is complete, restart the Docker daemon.
4. Run the following script to start the TorchAcc runtime image.
# TorchAcc runtime image and container name
DOCKER=registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219
name=TorchAcc_Tutorials

# Print each command before it runs
set -x

# Replace /path/to/code with the directory that contains your training code
docker run \
    --name $name \
    --rm -it \
    --privileged \
    --ulimit memlock=-1:-1 \
    --gpus all \
    --shm-size 10G \
    -v /dev/shm:/dev/shm \
    --ipc host \
    --network host \
    --cap-add=CAP_SYS_ADMIN \
    -v /path/to/code:/workspace \
    -w /workspace \
    ${DOCKER} bash
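The NVIDIA Container Toolkit step above might look like the following sketch on a yum or dnf based system. It assumes that the NVIDIA package repository has already been configured as described in the NVIDIA installation guide and that the commands need `sudo`. The `run` and `setup_gpu_docker` helpers are hypothetical and only print the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: assumes the NVIDIA Container Toolkit repository is already
# configured for yum/dnf on this host.

IMAGE=registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219

# run: hypothetical helper. Prints the command when DRY_RUN=1 (the default)
# and executes it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

setup_gpu_docker() {
  # Prefer dnf when it is available, otherwise fall back to yum.
  if command -v dnf >/dev/null 2>&1; then pkg=dnf; else pkg=yum; fi

  # Install the toolkit, register the NVIDIA runtime with Docker, restart the
  # daemon, then run nvidia-smi in the image as a smoke test. Stop at the
  # first command that fails.
  run sudo "$pkg" install -y nvidia-container-toolkit &&
    run sudo nvidia-ctk runtime configure --runtime=docker &&
    run sudo systemctl restart docker &&
    run sudo docker run --rm --gpus all "$IMAGE" nvidia-smi
}

setup_gpu_docker
```

Run it once as is to review the commands, then rerun with `DRY_RUN=0` to execute them. If the smoke test lists the GPUs of the host, the script above should be able to start the TorchAcc container with `--gpus all`.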