Prepare the resource environment for acceleration

Before using PAI-TorchAcc for training acceleration, you must prepare a training environment that meets the required specifications. You can create a Data Science Workshop (DSW) instance in PAI or use an existing Elastic Compute Service (ECS) instance for this purpose. This topic describes the environment requirements for TorchAcc training acceleration.

Environment requirements

To use TorchAcc for training acceleration, you must use a GPU-accelerated instance. The required versions and specifications are listed below.

  • Version requirements

    | Driver | Version |
    | --- | --- |
    | CUDA driver | 11.3 or later |
    | NVIDIA driver | 470 or later |

  • Specifications

    | Instance type | Support status |
    | --- | --- |
    | V100M16 | Supported |
    | V100M32 | Supported |
    | GU50 | Supported |
    | GU100 | Supported |
    | GU108 | Supported |
    | A10M24 | Supported |

    For more information about instance types, see Appendix: List of public resource specifications.

  • Image requirements

    To use TorchAcc for training acceleration, you must use the specified TorchAcc test runtime image: registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219.

    Note

    This image environment is available only in the China (Hangzhou) region.
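
The version requirements above can be checked from a terminal on the GPU instance. The following is a minimal sketch rather than an official PAI tool: `version_ge` and `check` are hypothetical helper names, the comparison relies on `sort -V` (GNU coreutils), and the CUDA version is read from the human-readable `nvidia-smi` banner, whose layout is assumed here and may differ between driver releases.

```shell
#!/usr/bin/env bash
# Sketch: compare the installed NVIDIA driver and CUDA versions
# with the documented minimums (driver 470, CUDA 11.3).

# version_ge A B: succeeds when dotted version A is >= B.
# Hypothetical helper; assumes `sort -V` is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check LABEL FOUND MINIMUM: print a one-line verdict.
check() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: OK (>= $3)"
    else
        echo "$1 '$2': below required $3"
    fi
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Driver version via the machine-readable query interface.
    driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
    # CUDA version parsed from the banner text (assumed format "CUDA Version: X.Y").
    cuda="$(nvidia-smi | sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p' | head -n 1)"
    check "Driver" "$driver" "470"
    check "CUDA" "$cuda" "11.3"
else
    echo "nvidia-smi not found; the NVIDIA driver does not appear to be installed"
fi
```

If either line reports a version below the minimum, update the driver before you pull the TorchAcc image.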

Prepare the test environment

Use a DSW environment

To test TorchAcc on the PAI platform, you can create a DSW instance and perform tests in a Jupyter Notebook as follows:

  1. In the China (Hangzhou) region, create a resource group and add a resource quota that meets the environment requirements. For more information, see Create a resource group and purchase general computing resources and General computing resource quotas.

  2. Go to the workspace that is associated with the resource quota and create a DSW instance. Configure the key parameters as follows. For more information, see Create and manage DSW instances.

    | Parameter | Description |
    | --- | --- |
    | Resource Type | Select Resource Quota. |
    | Resource Quota | Select the resource quota that you created in Step 1. |
    | Resource Specifications | Set CPU (Cores) to 30, Memory (GiB) to 180, Shared Memory (GiB) to 100, and GPU (Cards) to 1. |
    | Image | On the Registry Address tab, set the registry address to registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219. Note: This image environment is available only in the China (Hangzhou) region. |
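
After the DSW instance starts, a quick check from its terminal can confirm that the GPU and shared memory match the configured values. The following is a minimal sketch, not an official PAI procedure: `fs_size_gib` is a hypothetical helper, and the script assumes that `df` and `awk` are present in the image and that the GPU is not partitioned (each visible GPU prints one line in `nvidia-smi -L`).

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a DSW instance against the configured specifications.

# fs_size_gib PATH: total size of the filesystem that holds PATH, in whole GiB.
# Hypothetical helper; relies on the POSIX `df -Pk` output layout.
fs_size_gib() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $2 / 1048576 }'
}

# Shared memory was set to 100 GiB in the instance configuration.
if [ -d /dev/shm ]; then
    echo "Shared memory: $(fs_size_gib /dev/shm) GiB"
fi

# One GPU was requested; nvidia-smi -L prints one line per visible GPU.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "Visible GPUs: $(nvidia-smi -L | wc -l)"
else
    echo "nvidia-smi not found; no GPU driver is visible in this environment"
fi
```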

Use your own ECS instance

To use your own ECS resources for TorchAcc training acceleration, you can prepare an ECS instance as follows:

  1. In the China (Hangzhou) region, purchase an ECS instance that meets the environment requirements and install the required versions of the NVIDIA driver and CUDA driver. For more information about how to purchase an instance, see Create an instance. The key parameters are as follows:

    • Set Instance to ecs.gn6v-c8g1.2xlarge.

    • For Image, select Public Image > Alibaba Cloud Linux > Alibaba Cloud Linux 3.2104 LTS 64-bit. Select the Install GPU Driver check box and then select the following versions: CUDA Version 11.4.1 > Driver Version 470.161.03 > CUDNN Version 8.2.4.

    • For the System Disk, allocate at least 80 GiB of storage capacity.

  2. Install Docker on the ECS instance. For more information, see Install and use Docker and Docker Compose.

  3. Install the NVIDIA Container Toolkit. For more information, see Installing the NVIDIA Container Toolkit.

    Select the installation command for your operating system. The steps in this topic use yum or dnf. After the installation is complete, restart the Docker daemon.

  4. Run the following script to start a container from the TorchAcc runtime image.

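    #!/usr/bin/env bash
    # TorchAcc runtime image (available only in the China (Hangzhou) region).
    DOCKER=registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219
    # Container name.
    name=TorchAcc_Tutorials

    set -x  # Echo each command before it runs.

    # Replace /path/to/code with the host directory that contains your training code.
    # The directory is mounted at /workspace, which is also the working directory.
    docker run \
        --name "$name" \
        --rm -it \
        --privileged \
        --ulimit memlock=-1:-1 \
        --gpus all \
        --shm-size 10G \
        -v /dev/shm:/dev/shm \
        --ipc host \
        --network host \
        --cap-add=CAP_SYS_ADMIN \
        -v /path/to/code:/workspace \
        -w /workspace \
        "${DOCKER}" bash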
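
Before you start a long training job, it can help to confirm that Docker can reach the GPU after Step 3. The following sketch registers the NVIDIA runtime with Docker, restarts the daemon, and runs `nvidia-smi` in a throwaway container. It assumes a recent NVIDIA Container Toolkit that ships the `nvidia-ctk` command and a host that manages Docker with systemd. `run` and `DRY_RUN` are hypothetical helpers introduced for this example: by default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: configure Docker for the NVIDIA runtime, restart the daemon,
# then confirm that a container can see the GPU.
IMAGE="registry.cn-hangzhou.aliyuncs.com/pai-dlc/pai-pytorch-training:torch-1.12-cuda11.3-py38-acc-230219"

# run CMD...: print the command, and execute it only when DRY_RUN=0.
# DRY_RUN is a hypothetical safety switch; it defaults to dry-run mode.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Register the NVIDIA runtime in the Docker daemon configuration.
run sudo nvidia-ctk runtime configure --runtime=docker
# Restart the Docker daemon so that the new configuration takes effect.
run sudo systemctl restart docker
# Smoke test: the GPU table should appear if the toolkit is set up correctly.
run docker run --rm --gpus all "$IMAGE" nvidia-smi
```

Save the script and run it with `DRY_RUN=0` to execute the commands instead of printing them. If the last command fails, recheck the toolkit installation before you start the TorchAcc container.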