Deepytorch Training is an AI accelerator developed by Alibaba Cloud. It accelerates training for both traditional and generative AI workloads. In the ResNet50 benchmark later in this topic, it raises training throughput by about 85% over native PyTorch.
For more information about Deepytorch Training, see "What is Deepytorch Training (training acceleration)?"
Prerequisites
You have created an Alibaba Cloud GPU instance that meets the following requirements:
- The operating system is Alibaba Cloud Linux, CentOS 7.x, Ubuntu 18.04, or later.
- An NVIDIA GPU driver and CUDA are installed and meet the required versions.
  When you create a GPU instance, we recommend selecting the Install GPU driver option after you select an image. Then, select the versions of CUDA, the driver, and cuDNN. For more information, see Create a GPU instance.
- PyTorch is installed and meets the version requirements.
Supported versions
Deepytorch Training supports multiple versions of PyTorch, CUDA, and Python. The following table lists the version compatibility.
| PyTorch version | CUDA runtime version | Python version |
| --- | --- | --- |
| 1.10.x | 11.1/11.3 | 3.8/3.9 |
| 1.11.x | 11.3 | 3.8/3.9/3.10 |
| 1.12.x | 11.3/11.6 | 3.8/3.9/3.10 |
| 1.13.x | 11.6/11.7 | 3.8/3.9/3.10 |
| 2.0.x | 11.7/11.8 | 3.8/3.9/3.10/3.11 |
| 2.1.x | 11.8/12.1 | 3.8/3.9/3.10/3.11 |
| 2.2.x | 11.8/12.1 | 3.8/3.9/3.10/3.11 |
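The compatibility table can also be checked programmatically. The following is a minimal sketch, not part of Deepytorch Training: the helper names (`SUPPORTED`, `minor_version`, `is_supported`, `check_current_environment`) are hypothetical, and the matrix is transcribed by hand from the table above. It only uses `torch.__version__` and `torch.version.cuda`, which are standard PyTorch attributes.

```python
import sys

# Compatibility matrix transcribed from the table above (hypothetical helper data):
# PyTorch minor version -> (CUDA runtime versions, Python versions)
SUPPORTED = {
    "1.10": ({"11.1", "11.3"}, {"3.8", "3.9"}),
    "1.11": ({"11.3"}, {"3.8", "3.9", "3.10"}),
    "1.12": ({"11.3", "11.6"}, {"3.8", "3.9", "3.10"}),
    "1.13": ({"11.6", "11.7"}, {"3.8", "3.9", "3.10"}),
    "2.0": ({"11.7", "11.8"}, {"3.8", "3.9", "3.10", "3.11"}),
    "2.1": ({"11.8", "12.1"}, {"3.8", "3.9", "3.10", "3.11"}),
    "2.2": ({"11.8", "12.1"}, {"3.8", "3.9", "3.10", "3.11"}),
}


def minor_version(version):
    """Reduce a version such as '2.2.0+cu121' to 'major.minor' ('2.2')."""
    core = version.split("+")[0]
    return ".".join(core.split(".")[:2])


def is_supported(torch_version, cuda_version, python_version):
    """Return True if the combination appears in the table above."""
    entry = SUPPORTED.get(minor_version(torch_version))
    if entry is None or cuda_version is None:
        return False
    cuda_versions, python_versions = entry
    return (minor_version(cuda_version) in cuda_versions
            and minor_version(python_version) in python_versions)


def check_current_environment():
    """Check the running interpreter; requires PyTorch to be installed."""
    import torch  # imported lazily so the helpers above work without PyTorch
    python_version = "{}.{}".format(*sys.version_info[:2])
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    return is_supported(torch.__version__, torch.version.cuda, python_version)
```

On a GPU instance with PyTorch installed, calling `check_current_environment()` returns `True` only when the installed PyTorch, CUDA runtime, and Python versions match a row of the table.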
Install Deepytorch Training
Deepytorch Training is part of the DeepGPU toolkit. DeepGPU automatically selects a compatible Deepytorch Training package based on your software environment.
This example installs version 2.1.0 of the deepgpu package. Run the following command:
pip3 install deepgpu==2.1.0
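To confirm that the installation succeeded, you can query the installed package version from Python. This is a minimal sketch that relies only on the standard-library `importlib.metadata` module (Python 3.8 or later). The helper name `installed_version` is hypothetical, and `deepgpu` is the package name taken from the pip3 command above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a pip package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "deepgpu" is the package name used in the pip3 command above.
version = installed_version("deepgpu")
if version is None:
    print("deepgpu is not installed in this Python environment")
else:
    print("deepgpu version:", version)
```

Run the check with the same Python interpreter that you use for training, because pip3 may install into a different environment.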
Use Deepytorch Training
To enable Deepytorch Training optimizations, add the following line at the beginning of your training script, before import torch:
import deepytorch # Import the deepytorch library before torch
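Because the import order matters, it can be useful to check a training script before you launch a long job. The sketch below is an illustration only, not a Deepytorch Training feature: it uses the standard-library `ast` module to inspect the top-level imports of a script, and the function names are hypothetical. It only sees direct, module-level imports, so it does not detect torch being imported indirectly by another library.

```python
import ast


def top_level_imports(source):
    """List module names imported at module level, in source order."""
    names = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.append(node.module)
    return names


def first_index(names, module):
    """Index of the first import of `module` or one of its submodules."""
    for index, name in enumerate(names):
        if name == module or name.startswith(module + "."):
            return index
    return None


def deepytorch_imported_first(source):
    """Return True if deepytorch is imported before any torch import."""
    names = top_level_imports(source)
    deepytorch_at = first_index(names, "deepytorch")
    torch_at = first_index(names, "torch")
    if deepytorch_at is None:
        return False
    return torch_at is None or deepytorch_at < torch_at


# Example: a script header that follows the required order.
script = "import deepytorch\nimport torch\n"
print(deepytorch_imported_first(script))
```

To check a real script, read the file contents and pass them to `deepytorch_imported_first`.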
Verify the training performance
This example uses the ResNet50 model to demonstrate the acceleration performance of Deepytorch Training.
The example runs PyTorch 2.2.0 on a GPU instance of type ecs.ebmgn7vx.32xlarge.
- Run the following command to navigate to the example code directory:
  cd `echo $(python -c "import deepytorch; print(deepytorch)") | cut -d\' -f 4 | sed "s/__init__\.py//"`examples/DDPBenchmark
- Train the ResNet50 model.
  This example uses a single instance with 8 GPUs and a batch size of 512.
  - Train the model with native PyTorch:
    bash run_benchmark.sh 1 0 8
    Result with native PyTorch: The training throughput is 1,571 images/second.
    8 GPUs -- 1M/8G: p50: 0.326s 1572/s p75: 0.326s 1571/s p90: 0.326s 1571/s p95: 0.326s 1571/s
  - Accelerate model training with Deepytorch Training:
    bash run_benchmark_deepgpu.sh 1 0 8
    Result with Deepytorch Training: The training throughput is 2,908 images/second.
    8 GPUs -- 1M/8G: p50: 0.176s 2912/s p75: 0.176s 2911/s p90: 0.176s 2909/s p95: 0.176s 2908/s
  Note
  - If your instance does not have eight GPUs, change the last number in the command to your GPU count. For example, for an instance with two GPUs:
    bash run_benchmark_deepgpu.sh 1 0 2
  - If the training log shows an out-of-memory (OOM) error, reduce the --batch-size value in the run_benchmark.sh and run_benchmark_deepgpu.sh scripts to 256 or 128.
- Compare the performance results.
  Deepytorch Training increases the training throughput from 1,571 images/second (with native PyTorch) to 2,908 images/second, an 85% performance increase.
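The comparison can be reproduced from the benchmark summary lines. The following sketch assumes the summary format shown in the results above (percentile label, step time, then images per second). The regular expression and the helper names `parse_throughput` and `percent_increase` are illustrative and are not part of the benchmark scripts.

```python
import re

# Benchmark summary lines copied from the results above.
NATIVE = "8 GPUs -- 1M/8G: p50: 0.326s 1572/s p75: 0.326s 1571/s p90: 0.326s 1571/s p95: 0.326s 1571/s"
DEEPYTORCH = "8 GPUs -- 1M/8G: p50: 0.176s 2912/s p75: 0.176s 2911/s p90: 0.176s 2909/s p95: 0.176s 2908/s"


def parse_throughput(line):
    """Map each percentile label (e.g. 'p95') to its images/second value."""
    pattern = r"(p\d+):\s*[\d.]+s\s+(\d+)/s"
    return {label: int(rate) for label, rate in re.findall(pattern, line)}


def percent_increase(baseline, accelerated):
    """Relative throughput gain of `accelerated` over `baseline`, in percent."""
    return (accelerated / baseline - 1) * 100


# The reported figures correspond to the p95 column of each summary line.
native = parse_throughput(NATIVE)["p95"]
accelerated = parse_throughput(DEEPYTORCH)["p95"]
print(f"{native}/s -> {accelerated}/s: +{percent_increase(native, accelerated):.1f}%")
# prints: 1571/s -> 2908/s: +85.1%
```

The same helpers can be applied to the summary lines from your own runs to measure the gain on a different instance type or GPU count.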