DeepNCCL is an AI communication acceleration library developed for Alibaba Cloud's heterogeneous computing products. It improves the communication efficiency of multi-GPU interconnects in distributed training or multi-GPU inference tasks. This topic describes the architecture, optimization principles, and performance of DeepNCCL.
Product introduction
DeepNCCL is based on the NVIDIA Collective Communications Library (NCCL). It calls NCCL communication operators to provide more efficient communication for multi-GPU interconnects and seamlessly accelerate tasks such as distributed training or multi-GPU inference.
The following figure shows the architecture of DeepNCCL.
Architecture layer | Sublayer | Description |
AI models | | DeepNCCL delivers universal performance for AI scenarios. It is applicable to AI models such as large language models (LLMs) and Stable Diffusion text-to-image models. |
AI framework layer | | The AI framework layer supports the following AI frameworks and features: |
DeepNCCL communication acceleration | Interface layer | The interface layer uses DeepncclWrapper to encapsulate nccl-base functions. This provides universal support for communication algorithms. Supported NCCL communication algorithms include all-reduce, reduce-scatter, and all-gather. |
DeepNCCL communication acceleration | Collective algorithm layer | The collective algorithm layer uses collective communication compilation technology to build adaptive topology algorithms for different instance types. This achieves full NCCL Runtime compatibility and seamless optimization of the collective communication topology. |
DeepNCCL communication acceleration | Network layer | The network layer provides seamless communication optimization by adapting to Alibaba Cloud's network infrastructure, such as VPC, RDMA, or elastic Remote Direct Memory Access (eRDMA). |
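For readers unfamiliar with the three collective operations named for the interface layer, the following minimal sketch shows what each one computes. It is a pure-Python reference model of the semantics only, not DeepNCCL or NCCL code. The function names are illustrative, each inner list stands for one GPU's buffer, and the buffer length is assumed to be divisible by the number of GPUs.

```python
from typing import List


def all_reduce(buffers: List[List[int]]) -> List[List[int]]:
    """Every rank ends up with the element-wise sum of all input buffers."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]


def reduce_scatter(buffers: List[List[int]]) -> List[List[int]]:
    """Rank r ends up with only the r-th shard of the element-wise sum."""
    n = len(buffers)
    total = [sum(vals) for vals in zip(*buffers)]
    shard = len(total) // n  # assumption: length is divisible by the rank count
    return [total[r * shard:(r + 1) * shard] for r in range(n)]


def all_gather(shards: List[List[int]]) -> List[List[int]]:
    """Every rank ends up with the concatenation of all ranks' shards."""
    gathered = [x for s in shards for x in s]
    return [list(gathered) for _ in shards]


if __name__ == "__main__":
    bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]
    print(all_reduce(bufs))
    print(reduce_scatter(bufs))
    print(all_gather(reduce_scatter(bufs)))
```

Note that a reduce-scatter followed by an all-gather produces the same result as an all-reduce, which is why the three operations are usually optimized together.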
Optimization principles
The DeepNCCL communication acceleration library optimizes communication for AI distributed training or multi-GPU inference tasks at two levels: within a single node and across multiple nodes.
Single-node optimization
Single-node optimization focuses on improving communication for instance types with different hardware topologies. The following sections describe optimizations for PCIe and NVLink interconnects.
PCIe interconnect topology optimization: On instance types with a PCIe interconnect topology, multiple GPUs share PCIe bandwidth, so communication is often limited by the physical bandwidth. For these topologies, the CPU-Reduce algorithm, a pipelined gradient reduction algorithm in parameter server (PS) mode, can theoretically reduce communication time. The algorithm builds a pipeline from the GPU to the CPU and then back to the GPU, and distributes the gradient reduction computation across multiple devices to reduce communication bottlenecks.
For example, in scenarios where the data volume exceeds 4 MB, the PCIe interconnect topology optimization provides a performance improvement of over 20% compared to native NCCL.
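The GPU-to-CPU-to-GPU pipeline described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the DeepNCCL implementation: the function names and the chunk size are hypothetical, Python lists stand in for device and host buffers, and the chained generators only interleave the three stages chunk by chunk, whereas a real implementation would overlap them on copy engines and CPU threads.

```python
from typing import Iterator, List, Tuple


def d2h_stage(gpu_grads: List[List[int]], chunk: int) -> Iterator[Tuple[int, List[List[int]]]]:
    """Stage 1: stream one chunk of every GPU's gradient to host memory."""
    length = len(gpu_grads[0])
    for start in range(0, length, chunk):
        yield start, [g[start:start + chunk] for g in gpu_grads]


def reduce_stage(chunks: Iterator[Tuple[int, List[List[int]]]]) -> Iterator[Tuple[int, List[int]]]:
    """Stage 2: the CPU sums the staged chunk across all GPUs."""
    for start, parts in chunks:
        yield start, [sum(vals) for vals in zip(*parts)]


def h2d_stage(reduced: Iterator[Tuple[int, List[int]]], num_gpus: int, length: int) -> List[List[int]]:
    """Stage 3: copy each reduced chunk back to every GPU."""
    out = [[0] * length for _ in range(num_gpus)]
    for start, summed in reduced:
        for buf in out:
            buf[start:start + len(summed)] = summed
    return out


def cpu_reduce(gpu_grads: List[List[int]], chunk: int = 2) -> List[List[int]]:
    """Chain the three stages so each chunk flows through them in turn."""
    pipeline = reduce_stage(d2h_stage(gpu_grads, chunk))
    return h2d_stage(pipeline, len(gpu_grads), len(gpu_grads[0]))


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
    print(cpu_reduce(grads, chunk=2))
```

Because the data is processed in chunks, the copy of a later chunk to the host can in principle proceed while an earlier chunk is still being reduced or copied back, which is where the theoretical time saving comes from.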
NVLink interconnect topology optimization: The default Binary-Tree algorithm used by NCCL does not fully utilize the multi-channel performance of V100 instance types. For NVLink interconnect topologies, this optimization extends the set of N-Trees topology combinations that can be used within a single node. This enables topology tuning and makes full use of multi-channel performance.
For example, in scenarios where the data volume exceeds 128 MB, the NVLink interconnect topology optimization provides a performance improvement of over 20% compared to native NCCL.
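The idea of using several trees at once can be sketched as follows. The source does not specify how DeepNCCL constructs its N-Trees combinations, so this example makes its own assumptions: each tree is a binary tree over a rotated rank order, each tree carries one chunk of the buffer (one "channel"), and all names are hypothetical. It simulates the data flow in plain Python and does not model NVLink bandwidth.

```python
from typing import Dict, List, Tuple


def tree_parents(num_ranks: int, root: int) -> Tuple[List[int], Dict[int, int]]:
    """Binary tree over ranks rotated so that `root` sits at position 0."""
    order = [(root + i) % num_ranks for i in range(num_ranks)]
    parent = {order[i]: order[(i - 1) // 2] for i in range(1, num_ranks)}
    return order, parent


def tree_all_reduce_chunk(values: List[List[int]], root: int) -> List[List[int]]:
    """Reduce one chunk up the tree to `root`, then broadcast the result."""
    n = len(values)
    order, parent = tree_parents(n, root)
    partial = [list(v) for v in values]
    # Visit positions in reverse so children are folded in before parents.
    for rank in reversed(order[1:]):
        p = parent[rank]
        partial[p] = [a + b for a, b in zip(partial[p], partial[rank])]
    return [list(partial[root]) for _ in range(n)]


def multi_tree_all_reduce(buffers: List[List[int]], num_trees: int) -> List[List[int]]:
    """Split the buffer into chunks and reduce each chunk on its own tree."""
    n, length = len(buffers), len(buffers[0])
    size = -(-length // num_trees)  # ceiling division
    out: List[List[int]] = [[] for _ in range(n)]
    for c in range(num_trees):
        lo, hi = c * size, min((c + 1) * size, length)
        result = tree_all_reduce_chunk([b[lo:hi] for b in buffers], root=c % n)
        for r in range(n):
            out[r].extend(result[r])
    return out


if __name__ == "__main__":
    bufs = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60],
            [100, 200, 300, 400, 500, 600], [1000, 2000, 3000, 4000, 5000, 6000]]
    print(multi_tree_all_reduce(bufs, num_trees=2))
```

Rooting each tree at a different rank spreads the reduction work and link usage across GPUs instead of concentrating it on a single root.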
Multi-node optimization
Multi-node optimization includes communication operator compilation optimization, communication multi-stream optimization (TCP/IP-based), and multi-node CPU-Reduce optimization.
Communication operator compilation optimization: The Hybrid+ algorithm supports hierarchical communication within a single node and across multiple nodes. This differs from standard implementations of operations such as Allreduce, Allgather, or Reduce-scatter, which are based on a global topology structure. The Hybrid+ algorithm is designed for the characteristics of different Alibaba Cloud instance types and for the various topologies that connect network interface cards and GPUs. It fully utilizes the high internal bandwidth of a single node while reducing the amount of communication between nodes. This optimization improves performance by over 50% compared to native NCCL.
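The source does not describe the internals of Hybrid+, but the general pattern of hierarchical communication it refers to can be sketched with a generic three-step all-reduce: reduce-scatter inside each node, all-reduce across nodes on the shards, then all-gather inside each node. The following pure-Python simulation is an illustration of that generic pattern only. The function names are hypothetical, and the buffer length is assumed to be divisible by the number of GPUs per node.

```python
from typing import List

Node = List[List[int]]  # one buffer per GPU in the node


def hierarchical_all_reduce(data: List[Node]) -> List[Node]:
    """All-reduce over data[node][gpu] using intra-node and inter-node steps."""
    num_nodes, gpus = len(data), len(data[0])
    shard = len(data[0][0]) // gpus  # assumption: divisible by GPUs per node

    # Step 1 (intra-node reduce-scatter): GPU i keeps shard i of the node sum.
    node_shards = []
    for node in data:
        total = [sum(vals) for vals in zip(*node)]
        node_shards.append([total[i * shard:(i + 1) * shard] for i in range(gpus)])

    # Step 2 (inter-node all-reduce): GPUs with the same local index
    # exchange only their shard, not the full buffer.
    global_shards = [
        [sum(vals) for vals in zip(*(node_shards[n][i] for n in range(num_nodes)))]
        for i in range(gpus)
    ]

    # Step 3 (intra-node all-gather): every GPU reassembles the full result.
    full = [x for s in global_shards for x in s]
    return [[list(full) for _ in range(gpus)] for _ in range(num_nodes)]


def inter_node_elements_per_gpu(length: int, gpus_per_node: int) -> int:
    """Elements each GPU contributes to cross-node traffic in step 2."""
    return length // gpus_per_node


if __name__ == "__main__":
    data = [[[1, 2, 3, 4], [10, 20, 30, 40]],
            [[100, 200, 300, 400], [1000, 2000, 3000, 4000]]]
    print(hierarchical_all_reduce(data))
    print(inter_node_elements_per_gpu(4, 2))
```

In this sketch each GPU sends only a fraction of its buffer across the slower inter-node network, while the bulk of the data movement stays on the faster intra-node links.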
Communication multi-stream optimization: Network bandwidth is often underutilized. This prevents the cross-node performance of upper-layer collective communication algorithms from reaching its full potential. Using the TCP/IP-based multi-stream feature improves the concurrent communication capability of distributed training. This can increase multi-node training performance by 5% to 20%.
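The multi-stream idea can be illustrated by striping one payload across several connections that transmit concurrently. This is a minimal sketch, not DeepNCCL code: the function names and the stream count are hypothetical, and `socket.socketpair()` is used as a local stand-in for real TCP connections between nodes so that the example runs on a single machine.

```python
import socket
import threading
from typing import List


def stripe(payload: bytes, num_streams: int) -> List[bytes]:
    """Split the payload into contiguous stripes, one per stream."""
    size = -(-len(payload) // num_streams) if payload else 0  # ceiling division
    return [payload[i * size:(i + 1) * size] for i in range(num_streams)]


def send_multi_stream(payload: bytes, num_streams: int = 4) -> bytes:
    """Send stripes concurrently over separate sockets and reassemble them."""
    parts = stripe(payload, num_streams)
    # Stand-in for one TCP connection per stream (assumption for local testing).
    pairs = [socket.socketpair() for _ in range(num_streams)]

    # One sender thread per stream so all stripes are in flight at once.
    senders = []
    for (tx, _), part in zip(pairs, parts):
        t = threading.Thread(target=tx.sendall, args=(part,))
        t.start()
        senders.append(t)

    # Receive each stripe until its known length has arrived.
    received = []
    for (_, rx), part in zip(pairs, parts):
        buf = bytearray()
        while len(buf) < len(part):
            data = rx.recv(65536)
            if not data:
                break
            buf.extend(data)
        received.append(bytes(buf))

    for t in senders:
        t.join()
    for tx, rx in pairs:
        tx.close()
        rx.close()
    return b"".join(received)


if __name__ == "__main__":
    message = bytes(range(256)) * 1000
    print(send_multi_stream(message, num_streams=4) == message)
```

On a real network, several concurrent streams can help fill the available bandwidth when a single connection cannot, which is the effect the optimization relies on.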
Multi-node CPU-Reduce: This optimization inherits the efficient asynchronous pipeline of the single-node CPU-Reduce and also pipelines the cross-node socket communication. As a result, all multi-node communication runs as a single pipelined process, which reduces communication latency and improves overall training performance.
For example, in multi-node training scenarios for Transformer-based models with large data volumes, the multi-node CPU-Reduce optimization can improve end-to-end performance by over 20%.
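The benefit of pipelining every stage, including the cross-node socket transfer, can be estimated with a standard pipeline timing model. The sketch below assumes equal-size chunks and stages that each process one chunk at a time. The stage names and durations are hypothetical values chosen for illustration, not measurements of DeepNCCL.

```python
from typing import Sequence


def serial_time(stage_times: Sequence[float], num_chunks: int) -> float:
    """Each chunk passes through every stage before the next chunk starts."""
    return num_chunks * sum(stage_times)


def pipelined_time(stage_times: Sequence[float], num_chunks: int) -> float:
    """Stages overlap across chunks; the slowest stage sets the steady rate."""
    return sum(stage_times) + (num_chunks - 1) * max(stage_times)


if __name__ == "__main__":
    # Hypothetical per-chunk stage times in milliseconds:
    # GPU-to-CPU copy, CPU reduce, cross-node socket transfer, CPU-to-GPU copy.
    stages = [1, 1, 2, 1]
    print(serial_time(stages, num_chunks=8))
    print(pipelined_time(stages, num_chunks=8))
```

Under this model the pipelined total approaches the number of chunks multiplied by the slowest stage, so the gain is largest when the stage times are balanced and the data is split into many chunks.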
Performance
The DeepNCCL communication acceleration library improves performance for single-node Allreduce, multi-node Allreduce, multi-node Reduce-scatter, and multi-node Allgather operations.
Feature | Supported scope | Performance improvement |
Single-node Allreduce optimization | Supports 8-card A10 instance types, such as ecs.ebmgn7ix.32xlarge. | Compared to native NCCL, DeepNCCL improves single-node Allreduce communication performance by 10% to 100% when the data volume is between 512 B and 2 MB. |
Multi-node Allreduce optimization | Supports V100 or A10 instance types, such as ecs.gn6v-c10g1.20xlarge or ecs.ebmgn7ix.32xlarge. | |
Multi-node Reduce-scatter optimization | Supports V100 or A10 instance types, such as ecs.gn6v-c10g1.20xlarge or ecs.ebmgn7ix.32xlarge. | |
Multi-node Allgather optimization | Supports V100 or A10 instance types, such as ecs.gn6v-c10g1.20xlarge or ecs.ebmgn7ix.32xlarge. | |
References
You can install the DeepNCCL communication library on Elastic GPU Service instances to accelerate distributed training or multi-GPU inference. For more information, see Install and use DeepNCCL.