Install and use Deepytorch Training to accelerate model training

Deepytorch Training is an AI accelerator developed by Alibaba Cloud. It provides significant training acceleration for both traditional and generative AI workloads.

Note

For more information about Deepytorch Training, see What is Deepytorch Training (training acceleration)?.

Prerequisites

You have created an Alibaba Cloud GPU instance that meets the following requirements:

The operating system is Alibaba Cloud Linux, CentOS 7.x or later, or Ubuntu 18.04 or later.
An NVIDIA GPU driver and CUDA are installed and meet the required versions.

When you create a GPU instance, we recommend selecting the Install GPU driver option after you select an image. Then, select the versions of CUDA, the driver, and cuDNN. For more information, see Create a GPU instance.
PyTorch is installed and meets the version requirements.

Supported versions

Deepytorch Training supports multiple versions of PyTorch, CUDA, and Python. The following table lists the version compatibility.

PyTorch version	CUDA runtime version	Python version
1.10.x	11.1/11.3	3.8/3.9
1.11.x	11.3	3.8/3.9/3.10
1.12.x	11.3/11.6	3.8/3.9/3.10
1.13.x	11.6/11.7	3.8/3.9/3.10
2.0.x	11.7/11.8	3.8/3.9/3.10/3.11
2.1.x	11.8/12.1	3.8/3.9/3.10/3.11
2.2.x	11.8/12.1	3.8/3.9/3.10/3.11
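
The table can also be checked programmatically before installation. The following is a minimal sketch that transcribes the table into a dictionary; the names `SUPPORTED`, `minor_series`, and `is_supported` are illustrative and are not part of Deepytorch Training or DeepGPU.

```python
# Compatibility matrix transcribed from the table above.
# Keys are PyTorch minor series; values are (CUDA runtimes, Python versions).
SUPPORTED = {
    "1.10": ({"11.1", "11.3"}, {"3.8", "3.9"}),
    "1.11": ({"11.3"}, {"3.8", "3.9", "3.10"}),
    "1.12": ({"11.3", "11.6"}, {"3.8", "3.9", "3.10"}),
    "1.13": ({"11.6", "11.7"}, {"3.8", "3.9", "3.10"}),
    "2.0": ({"11.7", "11.8"}, {"3.8", "3.9", "3.10", "3.11"}),
    "2.1": ({"11.8", "12.1"}, {"3.8", "3.9", "3.10", "3.11"}),
    "2.2": ({"11.8", "12.1"}, {"3.8", "3.9", "3.10", "3.11"}),
}


def minor_series(version):
    """Reduce a version such as '2.2.0+cu121' or '3.10.12' to 'major.minor'."""
    core = version.split("+")[0]  # drop a local build suffix such as '+cu121'
    return ".".join(core.split(".")[:2])


def is_supported(torch_version, cuda_version, python_version):
    """Return True if the combination appears in the compatibility table."""
    entry = SUPPORTED.get(minor_series(torch_version))
    if entry is None:
        return False
    cuda_ok, python_ok = entry
    return (minor_series(cuda_version) in cuda_ok
            and minor_series(python_version) in python_ok)


print(is_supported("2.2.0+cu121", "12.1", "3.10.12"))
print(is_supported("1.10.2", "11.6", "3.8.18"))
```

In a live environment, the three arguments would typically come from `torch.__version__`, `torch.version.cuda`, and `platform.python_version()`.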

Install Deepytorch Training

Run the following command to install Deepytorch Training. This example installs version 2.1.0 of the deepgpu package.

Note

Deepytorch Training is part of the DeepGPU toolkit. DeepGPU automatically selects a compatible Deepytorch Training package based on your software environment.

pip3 install deepgpu==2.1.0

Use Deepytorch Training

To enable Deepytorch Training optimizations, add a single line at the beginning of your training script:

import deepytorch  # Import the deepytorch library

Note

Add import deepytorch before import torch.
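
The import order can be checked statically before a long training run. The following sketch uses the standard `ast` module to compare the line numbers of the first `deepytorch` and `torch` imports in a script. The helper names are hypothetical, and the check only sees direct imports in the file itself: it does not detect `torch` being pulled in earlier by another package.

```python
import ast


def first_import_line(source, module):
    """Return the line number of the first import of `module`, or None."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        # Match the module itself or any of its submodules (e.g. torch.nn).
        if any(n == module or n.startswith(module + ".") for n in names):
            lines.append(node.lineno)
    return min(lines) if lines else None


def deepytorch_before_torch(source):
    """True if deepytorch is imported on an earlier line than torch."""
    deepy_line = first_import_line(source, "deepytorch")
    torch_line = first_import_line(source, "torch")
    if deepy_line is None:
        return False
    return torch_line is None or deepy_line < torch_line


script = "import deepytorch\nimport torch\nfrom torch import nn\n"
print(deepytorch_before_torch(script))
```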

Verify the training performance

This example uses the ResNet50 model to demonstrate the acceleration performance of Deepytorch Training.

In this example, the PyTorch version is 2.2.0, and the GPU instance type is ecs.ebmgn7vx.32xlarge.

Run the following command to navigate to the example code directory:

cd `echo $(python -c "import deepytorch; print(deepytorch)") | cut -d\' -f 4 | sed "s/__init__\.py//"`examples/DDPBenchmark
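
The shell pipeline above extracts the package path from the module's printed representation. As a rough equivalent, the directory can be located from Python with `importlib.util.find_spec`. This is a sketch only: `package_dir` and `benchmark_dir` are hypothetical helpers, and unlike the shell command, `find_spec` locates a top-level package without importing it.

```python
import importlib.util
import os


def package_dir(package_name):
    """Return the directory that contains the package's __init__.py."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"cannot locate package {package_name!r}")
    return os.path.dirname(spec.origin)


def benchmark_dir(package_name="deepytorch"):
    """Return the examples/DDPBenchmark path inside the installed package."""
    return os.path.join(package_dir(package_name), "examples", "DDPBenchmark")


if __name__ == "__main__":
    try:
        print(benchmark_dir())
    except ModuleNotFoundError as err:
        # deepytorch is not installed in this environment.
        print(err)
```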

Train the ResNet50 model.

This example uses a single instance with 8 GPUs and a batch size of 512.
- Train the model with native PyTorch
```
bash run_benchmark.sh 1 0 8
```
  Result with native PyTorch: The training throughput is 1,571 images/second.
```
8 GPUs --      1M/8G:  p50:  0.326s    1572/s  p75:  0.326s       1571/s  p90:  0.326s      1571/s  p95:  0.326s      1571/s
```
- Accelerate model training with Deepytorch Training
```
bash run_benchmark_deepgpu.sh 1 0 8
```
  Result with Deepytorch Training: The training throughput is 2,908 images/second.
```
8 GPUs --    1M/8G:  p50:  0.176s    2912/s  p75:  0.176s    2911/s  p90:  0.176s    2909/s  p95:  0.176s    2908/s
```
Note
- If your instance does not have eight GPUs, change the last number in the command to your GPU count. For example, for an instance with two GPUs:
```
bash run_benchmark_deepgpu.sh 1 0 2
```
- If the training log shows an out-of-memory (OOM) error, reduce the --batch-size value in the run_benchmark.sh and run_benchmark_deepgpu.sh scripts to 256 or 128.
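
The command for a given GPU count can be assembled as in the following sketch. `benchmark_command` is a hypothetical helper. The first two arguments (`1 0`) are copied unchanged from the commands above because this page does not describe their meaning.

```python
def benchmark_command(gpu_count, use_deepytorch=True):
    """Build the benchmark command line for the given number of GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    script = "run_benchmark_deepgpu.sh" if use_deepytorch else "run_benchmark.sh"
    # "1 0" is kept verbatim from the documented commands.
    return f"bash {script} 1 0 {gpu_count}"


print(benchmark_command(2))
print(benchmark_command(8, use_deepytorch=False))
```
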
Compare the performance results.

Deepytorch Training increases the training throughput from 1,571 images/second (with native PyTorch) to 2,908 images/second, an 85% performance increase.
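
The 85% figure can be reproduced from the two log lines shown earlier. The following sketch parses the percentile fields with a regular expression and compares the p95 throughput values. The regular expression assumes the log layout shown on this page, and the helper names are illustrative.

```python
import re

# Matches fields such as "p95:  0.326s      1571/s".
PERCENTILE = re.compile(r"(p\d+):\s+([\d.]+)s\s+(\d+)/s")


def parse_throughput(line):
    """Map each percentile label to its images/second figure."""
    return {label: int(rate) for label, _latency, rate in PERCENTILE.findall(line)}


def percent_increase(baseline, accelerated):
    """Relative throughput gain over the baseline, in percent."""
    return (accelerated - baseline) / baseline * 100


native = parse_throughput(
    "8 GPUs --      1M/8G:  p50:  0.326s    1572/s  p75:  0.326s       1571/s"
    "  p90:  0.326s      1571/s  p95:  0.326s      1571/s"
)
accelerated = parse_throughput(
    "8 GPUs --    1M/8G:  p50:  0.176s    2912/s  p75:  0.176s    2911/s"
    "  p90:  0.176s    2909/s  p95:  0.176s    2908/s"
)

gain = percent_increase(native["p95"], accelerated["p95"])
print(f"{gain:.1f}% increase")  # prints "85.1% increase"
```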