Deepytorch Training (training acceleration): introduction, benefits, and features

Deepytorch Training is an AI training accelerator from Alibaba Cloud that accelerates training for traditional and generative AI scenarios. This topic describes the concepts, benefits, and features of Deepytorch Training.

Introduction to Deepytorch Training

Deepytorch Training accelerates training for traditional and generative AI scenarios. It integrates performance optimizations for distributed communication and computation graph compilation to significantly improve end-to-end training performance while maintaining accuracy. This improvement leads to lower costs and faster iterations. Deepytorch Training is also fully compatible with the open-source ecosystem, which allows AI developers to easily integrate the accelerator into their business code and benefit from the acceleration.

Benefits

Significant improvement in training performance

Deepytorch Training integrates performance optimizations for distributed communication and computation graph compilation. These optimizations significantly improve end-to-end training performance and shorten model training iterations, which lowers resource costs.

Example: This example shows the end-to-end training performance improvement for generative AI and traditional AI scenarios when using the ecs.ebmgn7vx.32xlarge instance type.

Note

The training performance improvements listed in the following table are measured against native PyTorch.

Scenario	Model name	Nodes × GPUs	Configuration	Training performance improvement
Generative AI	Llama2-13B	2 × 8	ZeRO stage 2; micro batch size=4; finetune alpaca_gpt4_en	48%
Generative AI	Qwen2-14B	2 × 8	ZeRO stage 2; micro batch size=4; finetune alpaca_gpt4_en	21%
Generative AI	LLaMa-65B	2 × 8	ZeRO stage 3; micro batch size=8; activation recomputing; params offload	30%
Generative AI	stable diffusion v2.1	1 × 1	dreambooth; batch size=5; fp16	22%
Traditional AI	ResNet50	2 × 8	micro batch size=512; mixed precision=amp	89%
Traditional AI	BERT	2 × 8	micro batch size=32; mixed precision=amp	42%
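
To put these percentages in terms of wall-clock time, the following sketch converts an improvement rate into the fraction of training time saved. It assumes the reported rate is a throughput increase (for example, samples per second), which this topic does not state explicitly. The function name `time_saved_fraction` is hypothetical.

```python
def time_saved_fraction(improvement_pct: float) -> float:
    """Fraction of wall-clock time saved if throughput rises by improvement_pct percent.

    Assumption: the improvement rate is a throughput gain, so the same amount
    of work finishes in 1 / (1 + rate) of the original time.
    """
    speedup = 1.0 + improvement_pct / 100.0
    return 1.0 - 1.0 / speedup


# Improvement rates copied from the table above.
reported = {
    "Llama2-13B": 48,
    "Qwen2-14B": 21,
    "LLaMa-65B": 30,
    "stable diffusion v2.1": 22,
    "ResNet50": 89,
    "BERT": 42,
}

for model, pct in reported.items():
    print(f"{model}: about {time_saved_fraction(pct):.1%} less training time")
```

Under this assumption, a 100% improvement would halve the training time, and smaller rates save proportionally less.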

Easy to use
- Deepytorch Training is fully compatible with the open-source ecosystem, including mainstream PyTorch versions and distributed training frameworks, such as DeepSpeed, PyTorch FSDP, and Megatron-LM.
- You can use Deepytorch Training by adding one line of code to the beginning of your Python training script.
```
import deepytorch
```
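
If the same training script must also run on machines where the accelerator is not installed, the import can be guarded. The following is a minimal sketch of that pattern, not part of the documented usage: the documented usage is the plain `import deepytorch` line above, and the helper `is_installed` is a hypothetical name.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the named top-level module can be found on this machine."""
    return importlib.util.find_spec(module_name) is not None


# Documented usage: a plain `import deepytorch` at the top of the script.
# Assumption: the guard below is only for scripts shared with machines
# that do not have the package installed.
if is_installed("deepytorch"):
    import deepytorch  # noqa: F401
    print("Deepytorch Training enabled")
else:
    print("deepytorch not found; running with native PyTorch")
```

Keep the guarded import at the beginning of the script so that it still runs before the rest of the training code.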

Features

Deepytorch Training significantly accelerates both communication and computation in AI training.

Communication optimization features

Single-node optimization
Single-node optimization focuses on communication for instance types with different hardware topologies, such as instances with PCIe and NVLink interconnects.
- PCIe interconnect topology optimization: On instances with this topology, multiple GPU cards share PCIe bandwidth, and communication is often limited by the physical bandwidth. The CPU-Reduce algorithm optimizes communication for PCIe interconnect topologies. This algorithm is a pipelined gradient reduction algorithm based on the parameter server (PS) pattern that reduces communication time. It builds a GPU-to-CPU-to-GPU pipeline and distributes the gradient reduction computation across multiple devices to reduce communication bottlenecks.
  For example, in scenarios where the communication data volume exceeds 4 MB, the PCIe interconnect topology optimization improves performance by more than 20% compared to native NCCL.
- NVLink interconnect topology optimization: The default Binary-Tree algorithm used by NCCL does not fully utilize multi-channel performance on V100 instances. To optimize communication for NVLink interconnect topologies, Deepytorch Training adds different N-Trees topology combinations within a single node. This allows the topology to be tuned so that multi-channel performance is fully utilized.
  For example, in scenarios where the communication data volume exceeds 128 MB, the NVLink interconnect topology optimization improves performance by more than 20% compared to native NCCL.
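
The GPU-to-CPU-to-GPU pipeline described for CPU-Reduce can be illustrated with a small simulation. This is a sketch of the general idea only, written in plain Python with lists standing in for GPU buffers. It is not the Deepytorch implementation, and the names `cpu_reduce` and `pipeline_steps`, the chunk size, and the three-stage model are all assumptions.

```python
def chunked(values, chunk_size):
    """Split a flat gradient list into fixed-size chunks."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def cpu_reduce(per_gpu_grads, chunk_size):
    """Sum gradients from all GPUs chunk by chunk, as a host-side reducer would.

    Each loop iteration stands for one trip through the pipeline:
    copy the chunk to the host, reduce it on the CPU, copy the result back.
    """
    per_gpu_chunks = [chunked(g, chunk_size) for g in per_gpu_grads]
    reduced = []
    for chunk_group in zip(*per_gpu_chunks):  # same chunk index on every GPU
        reduced.extend(sum(vals) for vals in zip(*chunk_group))
    return [list(reduced) for _ in per_gpu_grads]  # every GPU gets the full sum


def pipeline_steps(num_chunks, num_stages=3):
    """Time steps needed without and with overlapping the pipeline stages."""
    sequential = num_chunks * num_stages
    pipelined = num_chunks + num_stages - 1
    return sequential, pipelined


grads = [[1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]]
print(cpu_reduce(grads, chunk_size=2)[0])
print(pipeline_steps(num_chunks=8))
```

In this simplified model, where every stage takes one step per chunk, 8 chunks through 3 stages take 24 steps when run back to back and 10 steps when the stages overlap. This is why splitting the gradients into chunks helps.
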
Multi-node optimization
Multi-node optimization includes communication operator compilation optimization, TCP multi-stream optimization, and multi-node CPU-Reduce optimization.
- Communication operator compilation optimization: This optimization is designed for different Alibaba Cloud instance types and their various network interface card (NIC) to GPU topologies. Unlike AllReduce, AllGather, or ReduceScatter algorithms that are based on a global topology, the Hybrid+ algorithm supports hierarchical communication across single and multiple nodes. It fully utilizes the high-speed internal bandwidth of a single node while reducing the amount of communication between nodes. This optimization improves performance by more than 50% compared to native NCCL.
- TCP multi-stream optimization: If network bandwidth is not fully utilized, the cross-node performance of upper-layer collective communication algorithms is not optimal. The TCP/IP-based multi-stream feature enhances the concurrent communication capability of distributed training. This can improve multi-node training performance by 5% to 20%.
- Multi-node CPU-Reduce: This optimization inherits the efficient asynchronous pipeline of the single-node CPU-Reduce and applies a pipelined design to cross-node Socket communication. This approach pipelines the entire multi-node communication process, which effectively reduces communication latency and improves overall training performance.
  For example, in multi-node training scenarios for Transformer-based models with large communication volumes, the multi-node CPU-Reduce optimization can further improve end-to-end performance by more than 20%.
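
The hierarchical communication idea mentioned above (reduce inside each node first, then exchange between nodes) can be sketched as a generic three-phase all-reduce. This is a textbook hierarchical all-reduce simulated with plain Python lists. It is not the Hybrid+ algorithm itself, whose details this topic does not describe, and the function names are hypothetical.

```python
def flat_allreduce(values_by_node):
    """Reference result: every GPU ends up with the global sum."""
    total = sum(v for node in values_by_node for v in node)
    return [[total] * len(node) for node in values_by_node]


def hierarchical_allreduce(values_by_node):
    """Three-phase all-reduce: intra-node reduce, inter-node exchange, intra-node broadcast."""
    # Phase 1: each node sums its own GPUs over the fast local interconnect.
    node_sums = [sum(node) for node in values_by_node]
    # Phase 2: only one partial sum per node crosses the network.
    total = sum(node_sums)
    inter_node_values = len(node_sums)
    # Phase 3: each node broadcasts the result to its local GPUs.
    result = [[total] * len(node) for node in values_by_node]
    return result, inter_node_values


# 2 nodes x 8 GPUs, one scalar gradient per GPU for brevity.
cluster = [[1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40, 50, 60, 70, 80]]
result, crossed = hierarchical_allreduce(cluster)
print(result[0][0], crossed)
```

The result matches a flat all-reduce over all 16 GPUs, but only one partial sum per node is exchanged between nodes instead of one value per GPU.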

Computation optimization features

Deepytorch Training significantly optimizes computation in various AI scenarios. Its features are described in the following sections:

- For large language model (LLM) fine-tuning with non-fixed-length sequences, Deepytorch Training effectively reduces the model's computational load and seamlessly improves training performance under various ZeRO configurations.
- For Stable Diffusion training scenarios, Deepytorch Training provides customized performance optimization solutions that seamlessly improve training performance under various training configurations.
- For the PyTorch compilation module, Deepytorch Training enhances performance and robustness. It automatically selects the optimal policy, which significantly accelerates the training of traditional AI models.
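
To see why non-fixed-length sequences leave room to reduce computation, the following sketch counts how many token positions are processed under two padding strategies. This topic does not say how Deepytorch Training reduces the load, so the sketch only illustrates the general problem. The sample lengths, the batch size, and the function name `padded_tokens` are hypothetical.

```python
def padded_tokens(lengths, batch_size, pad_to=None):
    """Total token positions processed across all batches.

    Each batch is padded to pad_to, or to its own longest sequence
    when pad_to is None.
    """
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        target = pad_to if pad_to is not None else max(batch)
        total += target * len(batch)
    return total


lengths = [12, 512, 40, 33, 128, 64, 20, 256]  # hypothetical sequence lengths
real = sum(lengths)
fixed = padded_tokens(lengths, batch_size=4, pad_to=512)
dynamic = padded_tokens(sorted(lengths), batch_size=4)

print(f"real tokens: {real}")
print(f"padded to a fixed length of 512: {fixed}")
print(f"sorted and padded per batch: {dynamic}")
```

Every position beyond the real token count is computation spent on padding, so any technique that shrinks this gap lowers the computational load for the same data.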

Installation and usage

Using the Deepytorch Training tool to optimize model training significantly improves training performance. For more information, see Install and use Deepytorch Training.

References

- DeepNCCL is installed by default when you install Deepytorch. DeepNCCL is an AI communication acceleration library for multi-GPU interconnects. It enables more efficient multi-GPU communication and seamlessly accelerates tasks such as distributed training and multi-card inference. For more information, see What is DeepNCCL AI communication acceleration library?
- In addition to the Deepytorch Training accelerator, DeepGPU also provides the Deepytorch Inference accelerator, which delivers high-performance inference acceleration for PyTorch models. It uses just-in-time (JIT) compilation to optimize models for inference. For more information, see What is Deepytorch Inference (inference acceleration)? and Install and use Deepytorch Inference.