What is the DeepNCCL AI communication acceleration library?

DeepNCCL is an AI communication acceleration library developed for Alibaba Cloud's heterogeneous computing products. It improves the communication efficiency of multi-GPU interconnects in distributed training or multi-GPU inference tasks. This topic describes the architecture, optimization principles, and performance of DeepNCCL.

Product introduction

DeepNCCL is based on the NVIDIA Collective Communications Library (NCCL). It calls NCCL communication operators to provide more efficient communication for multi-GPU interconnects and seamlessly accelerate tasks such as distributed training or multi-GPU inference.

The following figure shows the architecture of DeepNCCL.

[Figure: DeepNCCL architecture]

Architecture layer

Description

AI models

DeepNCCL provides general-purpose acceleration for AI scenarios. It applies to AI models such as large language models (LLMs) and Stable Diffusion text-to-image models.

AI framework layer

The AI framework layer supports the following AI frameworks and features:

  • Common AI frameworks, such as PyTorch, TensorFlow, and MXNet.

  • Parallel frameworks built on AI frameworks, such as Megatron, DeepSpeed, and Colossal-AI.

  • DeepNCCL uses the underlying Deepytorch to ensure full compatibility with the PyTorch framework and provide seamless performance optimization for distributed training. It also adds extra tuning methods such as fusion optimization.

    Note

    For more information about Deepytorch, see What is Deepytorch Training (training acceleration)?.

DeepNCCL communication acceleration

Interface layer

The interface layer uses DeepncclWrapper to encapsulate nccl-base functions, which provides general support for collective communication. The supported NCCL collective operations include Allreduce, Reduce-scatter, and Allgather.

Collective algorithm layer

The collective algorithm layer uses collective communication compilation technology to build adaptive topology algorithms for different instance types. This achieves full NCCL Runtime compatibility and seamless optimization of the collective communication topology.

Network layer

The network layer provides seamless communication optimization by adapting to Alibaba Cloud's network infrastructure, such as Virtual Private Cloud (VPC), Remote Direct Memory Access (RDMA), or elastic RDMA (eRDMA).
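To make the three collective operations named in the interface layer concrete, the following is a minimal pure-Python sketch of their semantics. It simulates ranks as plain lists and does not call NCCL or DeepNCCL; the function names and the sample data are illustrative assumptions, not part of either library.

```python
from typing import List

Vector = List[float]


def reduce_scatter(inputs: List[Vector]) -> List[Vector]:
    """Rank r ends up with the element-wise sum of chunk r from every rank."""
    n = len(inputs)
    length = len(inputs[0])
    assert length % n == 0, "vector length must divide evenly across ranks"
    chunk = length // n
    return [
        [sum(vec[i] for vec in inputs) for i in range(r * chunk, (r + 1) * chunk)]
        for r in range(n)
    ]


def all_gather(shards: List[Vector]) -> List[Vector]:
    """Every rank ends up with the concatenation of all ranks' shards."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]


def all_reduce(inputs: List[Vector]) -> List[Vector]:
    """Every rank ends up with the element-wise sum over all ranks."""
    # An all-reduce can be composed from a reduce-scatter and an all-gather.
    return all_gather(reduce_scatter(inputs))


# Two simulated ranks, each holding a 4-element gradient vector.
ranks = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
print(reduce_scatter(ranks))  # [[11.0, 22.0], [33.0, 44.0]]
print(all_reduce(ranks))      # [[11.0, 22.0, 33.0, 44.0], [11.0, 22.0, 33.0, 44.0]]
```

The composition in `all_reduce` is why optimizing Reduce-scatter and Allgather also matters for Allreduce-heavy workloads.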

Optimization principles

The DeepNCCL communication acceleration library optimizes communication for AI distributed training or multi-GPU inference tasks at both the single-node and multi-node levels.

Single-node optimization

Single-node optimization focuses on improving communication for instance types with different hardware topologies. The following sections describe optimizations for PCIe and NVLink interconnects.

  • PCIe interconnect topology optimization: On instance types that use a PCIe interconnect, multiple GPUs share PCIe bandwidth, so communication is often limited by the physical bandwidth. For these topologies, the CPU-Reduce algorithm, a pipelined gradient reduction algorithm in parameter server (PS) mode, can theoretically reduce communication time. The algorithm builds a pipeline from the GPU to the CPU and back to the GPU, and distributes the gradient reduction computation across multiple devices to relieve communication bottlenecks.

    For example, in scenarios where the data volume exceeds 4 MB, the PCIe interconnect topology optimization provides a performance improvement of over 20% compared to native NCCL.

  • NVLink interconnect topology optimization: The default Binary-Tree algorithm used by NCCL does not fully utilize the multi-channel performance on V100 instance types. For NVLink interconnect topologies, this optimization adds combinations of different N-Tree topologies within a single node, which enables topology tuning and makes better use of multiple channels.

    For example, in scenarios where the data volume exceeds 128 MB, the NVLink interconnect topology optimization provides a performance improvement of over 20% compared to native NCCL.
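The GPU-to-CPU-to-GPU pipeline described for CPU-Reduce can be illustrated with a small simulation. This is a sketch under stated assumptions, not DeepNCCL's implementation: GPUs are modeled as Python lists, the three stages run one after another instead of overlapping, and `cpu_reduce`, `pipeline_steps`, and `serial_steps` are hypothetical names. The step counts assume that every stage takes one time step per chunk.

```python
from typing import List


def cpu_reduce(gpu_grads: List[List[float]], chunk_size: int) -> List[List[float]]:
    """Parameter-server style gradient sum through the host, chunk by chunk."""
    length = len(gpu_grads[0])
    outputs = [[0.0] * length for _ in gpu_grads]
    for start in range(0, length, chunk_size):
        end = min(start + chunk_size, length)
        # Stage 1: copy this chunk from every GPU to host memory.
        host_chunks = [grad[start:end] for grad in gpu_grads]
        # Stage 2: sum the chunk on the CPU.
        reduced = [sum(vals) for vals in zip(*host_chunks)]
        # Stage 3: copy the reduced chunk back to every GPU.
        for out in outputs:
            out[start:end] = reduced
    return outputs


def serial_steps(num_chunks: int, num_stages: int = 3) -> int:
    """Time steps when each chunk finishes all stages before the next starts."""
    return num_chunks * num_stages


def pipeline_steps(num_chunks: int, num_stages: int = 3) -> int:
    """Time steps when stages of consecutive chunks overlap (ideal pipeline)."""
    return num_chunks + num_stages - 1


grads = [[1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]]
print(cpu_reduce(grads, chunk_size=2))
print(serial_steps(8), pipeline_steps(8))  # 24 10
```

In this idealized model, the benefit of pipelining grows with the number of chunks, which is consistent with the improvement being reported for larger data volumes.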

Multi-node optimization

Multi-node optimization includes communication operator compilation optimization, communication multi-stream optimization, and multi-node CPU-Reduce optimization.

  • Communication operator compilation optimization: The Hybrid+ algorithm supports hierarchical communication within a single node and across multiple nodes. This differs from Allreduce, Allgather, or Reduce-scatter algorithms that are based on a global topology structure. The Hybrid+ algorithm is designed for the characteristics of different Alibaba Cloud instance types and the various topologies that connect network interface cards and GPUs. It fully utilizes the high internal bandwidth of a single node while reducing the amount of communication between nodes. This optimization improves performance by over 50% compared to native NCCL.

  • Communication multi-stream optimization: Network bandwidth is often underutilized. This prevents the cross-node performance of upper-layer collective communication algorithms from reaching its full potential. Using the TCP/IP-based multi-stream feature improves the concurrent communication capability of distributed training. This can increase multi-node training performance by 5% to 20%.

  • Multi-node CPU-Reduce: This optimization inherits the efficient asynchronous pipeline of the single-node CPU-Reduce and also organizes the cross-node socket communication as a pipeline. As a result, all multi-node communication runs as a pipelined process, which reduces communication latency and improves overall training performance.

    For example, in multi-node training scenarios for Transformer-based models with large data volumes, the multi-node CPU-Reduce optimization can improve end-to-end performance by over 20%.
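The idea of hierarchical communication, reducing inside each node first and exchanging data across nodes only afterward, can be sketched as follows. This is a generic hierarchical all-reduce written for illustration, not the Hybrid+ algorithm itself, whose details this topic does not describe. The function names are hypothetical, and the traffic estimate assumes a simple ring all-reduce in which each participant sends 2(p-1)/p times the buffer size.

```python
from typing import List

Vector = List[float]


def hierarchical_all_reduce(nodes: List[List[Vector]]) -> List[List[Vector]]:
    """nodes[n][g] is the gradient vector held by GPU g on node n."""
    # Step 1: reduce inside each node over the fast local interconnect.
    node_sums = [[sum(vals) for vals in zip(*gpus)] for gpus in nodes]
    # Step 2: all-reduce across nodes, one partial sum per node.
    total = [sum(vals) for vals in zip(*node_sums)]
    # Step 3: broadcast the final result to every GPU inside each node.
    return [[list(total) for _ in gpus] for gpus in nodes]


def ring_bytes_per_link(participants: int, size_bytes: float) -> float:
    """Bytes each participant sends in a ring all-reduce of size_bytes."""
    return 2 * (participants - 1) / participants * size_bytes


cluster = [
    [[1.0, 2.0], [3.0, 4.0]],      # node 0 with 2 GPUs
    [[10.0, 20.0], [30.0, 40.0]],  # node 1 with 2 GPUs
]
print(hierarchical_all_reduce(cluster))

# Cross-node traffic per network link for a 100-byte buffer on 2 nodes x 8 GPUs:
flat = ring_bytes_per_link(2 * 8, 100.0)     # one ring over all 16 GPUs
hierarchical = ring_bytes_per_link(2, 100.0)  # ring over 2 node leaders only
print(flat, hierarchical)  # 187.5 100.0
```

Under this cost model, the hierarchical scheme sends less data over each cross-node link because only one participant per node joins the inter-node step.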

Performance

The DeepNCCL communication acceleration library improves performance for single-node Allreduce, multi-node Allreduce, multi-node Reduce-scatter, and multi-node Allgather operations.

Feature

Supported scope

Performance improvement

Single-node Allreduce optimization

Supports 8-card A10 instance types, such as ecs.ebmgn7ix.32xlarge.

Compared to native NCCL, DeepNCCL improves single-node Allreduce communication performance by 10% to 100% when the data volume is between 512 B and 2 MB.

Multi-node Allreduce optimization

Supports V100 or A10 instance types, such as ecs.gn6v-c10g1.20xlarge or ecs.ebmgn7ix.32xlarge.

  • Compared to native NCCL, DeepNCCL improves dual-node Allreduce communication performance by 10% to 20% when the data volume is between 128 MB and 256 MB.

  • DeepNCCL currently supports distributed training on up to 8 nodes.

Multi-node Reduce-scatter optimization

Supports V100 or A10 instance types, such as ecs.gn6v-c10g1.20xlarge or ecs.ebmgn7ix.32xlarge.

  • Compared to native NCCL, DeepNCCL improves dual-node Reduce-scatter communication performance by 30%.

  • DeepNCCL currently supports the following multi-node configurations: 2 nodes with 2 cards, 2 nodes with 4 cards, 2 nodes with 8 cards, 4 nodes with 8 cards, 7 nodes with 8 cards, or 8 nodes with 8 cards.

Multi-node Allgather optimization

Supports V100 or A10 instance types, such as ecs.gn6v-c10g1.20xlarge or ecs.ebmgn7ix.32xlarge.

  • Compared to native NCCL, DeepNCCL improves dual-node Allgather communication performance by 80%.

  • DeepNCCL currently supports distributed training on up to 20 nodes.
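When you compare your own measurements against the figures above, a small helper like the following may be useful. It is a sketch, not part of DeepNCCL. It assumes that "improvement" means the relative gain in throughput for the same data volume, and it uses the algorithm-bandwidth and bus-bandwidth conventions of the open source nccl-tests benchmarks. The function names and the sample timings are hypothetical.

```python
# Bus-bandwidth correction factors as used by the nccl-tests benchmarks.
BUS_FACTORS = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def alg_bandwidth_gbs(size_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth in GB/s: data volume divided by elapsed time."""
    return size_bytes / seconds / 1e9


def bus_bandwidth_gbs(op: str, size_bytes: float, seconds: float, ranks: int) -> float:
    """Bus bandwidth in GB/s, comparable across different rank counts."""
    return alg_bandwidth_gbs(size_bytes, seconds) * BUS_FACTORS[op](ranks)


def improvement_percent(baseline_seconds: float, optimized_seconds: float) -> float:
    """Throughput gain of the optimized run over the baseline, in percent."""
    return (baseline_seconds / optimized_seconds - 1) * 100


# Hypothetical timings for a 256 MB Allreduce on 16 ranks (2 nodes x 8 GPUs).
size = 256 * 1000 * 1000
print(bus_bandwidth_gbs("all_reduce", size, 0.012, ranks=16))
print(improvement_percent(baseline_seconds=0.012, optimized_seconds=0.010))
```

Comparing bus bandwidth rather than raw time makes results from different node counts easier to read side by side.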

References

You can install the DeepNCCL communication library on Elastic GPU Service instances to accelerate distributed training or multi-GPU inference. For more information, see Install and use DeepNCCL.