RDMA: High-performance networking for distributed training


In parallel training of large AI models, you can improve performance by reducing communication overhead, overlapping computation with communication, and improving communication efficiency. Configuring the high-performance network on Lingjun resources helps you achieve these goals.

Limitations

These configurations apply only to training jobs that use Lingjun resources.

Configure high-performance network variables

PAI on Lingjun resources has RDMA enabled and optimal NCCL environment variables pre-configured. You can use these defaults or adjust them based on your training framework, communication framework, and model.
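Because the defaults are ordinary environment variables, one way to adjust them is to export a new value at the top of the job's start command. The following is a minimal sketch, assuming the start command runs in a POSIX shell and that exported values are inherited by the training processes. The helper name show_nccl_env and the script name train.py are hypothetical, and WARN is only an illustrative value, not a recommendation.

```shell
# Print the NCCL variables currently set in the environment, sorted by name.
show_nccl_env() {
    env | grep '^NCCL_' | sort
}

# Illustrative override: lower the log level from the pre-configured INFO.
export NCCL_DEBUG=WARN

# Record the effective settings in the job log before training starts.
show_nccl_env

# Launch training afterwards (hypothetical script name):
# python train.py
```

Printing the effective values at startup makes it easier to confirm later which settings a given run actually used.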

Default variables

The following table lists the pre-configured default variables for different Lingjun specifications.

Lingjun specifications:

  • ml.gu7xf.c96m1600.8-gu108

  • ml.gu7xf.8xlarge-gu108

  • ml.gu7ef.c96m1600.8-gu100

  • ml.gu8xf.8xlarge-gu108

NCCL variables pre-configured for these specifications:

export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5
export NCCL_IB_TIMEOUT=22
export NCCL_IB_QPS_PER_CONNECTION=8
export NCCL_NET_PLUGIN=none

For details about the key NCCL environment variables, see the Environment variables section.
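The default values NCCL_SOCKET_IFNAME=eth and NCCL_IB_HCA=mlx5 are name prefixes: NCCL selects the interfaces and RDMA devices whose names start with them. When troubleshooting a connection failure, it can help to check which devices on the node actually match. The following is a minimal sketch, assuming a Linux node that exposes the standard /sys/class/net and /sys/class/infiniband directories. The function name list_with_prefix is hypothetical.

```shell
# List the entries in a directory whose names start with a given prefix.
list_with_prefix() {
    dir=$1
    prefix=$2
    for path in "$dir"/"$prefix"*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$path" ] && basename "$path"
    done
    return 0
}

# Network interfaces that NCCL_SOCKET_IFNAME=eth would match.
list_with_prefix /sys/class/net eth

# RDMA devices that NCCL_IB_HCA=mlx5 would match.
list_with_prefix /sys/class/infiniband mlx5
```

If either call prints nothing, the corresponding variable probably needs a different value on that node.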

Environment variables

The following table describes key NCCL environment variables. For other variables, see the NCCL documentation.

Variable

Description

NCCL_IB_TC

The InfiniBand traffic class. The value must match the Alibaba Cloud network mapping rules. An incorrect or missing value degrades performance.

NCCL_IB_GID_INDEX

The Global ID (GID) index used in RoCE mode. A missing or incorrect value causes an NCCL error.

NCCL_SOCKET_IFNAME

The network interface used to establish connections, which varies by Lingjun specification. Missing or incorrect values can cause NCCL connection failures.

NCCL_DEBUG

Sets the log level. Set this to INFO for detailed NCCL logs that simplify troubleshooting.

NCCL_IB_HCA

The network interface controller (NIC) for RDMA communication. The required value varies by compute node. Incorrect or missing values degrade performance.

NCCL_IB_TIMEOUT

The RDMA connection timeout. A larger value improves fault tolerance. An incorrect or missing value can interrupt training jobs.

NCCL_IB_QPS_PER_CONNECTION

The number of Queue Pairs (QPs) used per connection. Increasing this number can improve throughput.
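As a worked example for NCCL_IB_TIMEOUT: the NCCL documentation gives the timeout as 4.096 µs × 2^value, so the value is an exponent, not a duration. The sketch below does the arithmetic for the pre-configured value of 22. The function name nccl_ib_timeout_seconds is hypothetical, and awk is assumed to be available in the image.

```shell
# Convert an NCCL_IB_TIMEOUT value to seconds: 4.096 microseconds * 2^value.
nccl_ib_timeout_seconds() {
    awk -v t="$1" 'BEGIN { printf "%.2f\n", 4.096e-6 * 2^t }'
}

# The pre-configured value on Lingjun resources.
nccl_ib_timeout_seconds 22   # prints 17.18
```

Each increment of the value doubles the timeout, so small changes have a large effect.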

Configure an image

For training jobs on Lingjun resources, you can use an official DLC image or a custom image.

Official images

Three official GPU training images are available: deepspeed-training:23.06-gpu-py310-cu121-ubuntu22.04, megatron-training:23.06-gpu-py310-cu121-ubuntu22.04, and nemo-training:23.06-gpu-py310-cu121-ubuntu22.04. All are hosted in the China (Ulanqab) region and use Ubuntu 22.04, Python 3.10, and CUDA 12.1. They include PyTorch 2.1, Megatron-LM 23.06, DeepSpeed 0.9.5, Transformers 4.29.2, and Nemo 1.19.0.

Custom image

Environment requirements

  • CUDA >= 11.2

  • NCCL >= 2.12.10

  • Python 3
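The minimum versions above can be checked in a custom image before a job is submitted. The following is a minimal sketch of the comparison only, assuming GNU sort with the -V (version sort) option is available. The function name version_ge is hypothetical, and the version strings passed in are placeholders, because how the installed CUDA and NCCL versions are obtained depends on the image.

```shell
# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; the smaller of the two must be $2.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Placeholder values; replace them with the versions found in your image.
cuda_version=12.1
nccl_version=2.18.3

version_ge "$cuda_version" 11.2 || echo "CUDA $cuda_version is older than 11.2"
version_ge "$nccl_version" 2.12.10 || echo "NCCL $nccl_version is older than 2.12.10"
```

A plain string comparison would rank 2.9 above 2.12, which is why the sketch uses a version-aware sort.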

Install RDMA library

For custom images, you must manually install the RDMA library. Add the following commands to your Dockerfile:

# Install the packages required by the RDMA library.
# wget is added here on the assumption that the base image may not provide it.
RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends libnl-3-dev libnl-route-3-dev libnl-3-200 libnl-route-3-200 iproute2 udev dmidecode ethtool wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Download and run the RDMA library installer, then remove the downloaded files.
RUN cd /tmp/ && \
    wget http://pythonrun.oss-cn-zhangjiakou.aliyuncs.com/rdma/nic-libs-mellanox-rdma-5.2-2/nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    tar xzvf nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    cd nic-lib-rdma-core-installer-ubuntu && \
    echo Y | /bin/bash install.sh && \
    cd .. && \
    rm -rf nic-lib-rdma-core-installer-ubuntu && \
    rm -f nic-lib-rdma-core-installer-ubuntu.tar.gz

Related topics

To learn how to submit training jobs on Lingjun resources, see Create a training job.