eRDMA: High-performance networking for distributed training

Elastic Remote Direct Memory Access (eRDMA) is a proprietary elastic RDMA network developed by Alibaba Cloud. Some GPU instance types within PAI general computing resources support eRDMA. When you submit a DLC job with a specific image on these GPU instance types, the system automatically mounts an eRDMA network interface card in the container to accelerate distributed training.

Limitations

  • This feature is available only for training jobs on subscription-based general computing resources.

  • NCCL version 2.19 or later is required.

  • The following table lists the GPU instance types that support eRDMA on the PAI DLC platform and the corresponding number of eRDMA network interface cards.

    GPU instance type         eRDMA NICs
    ecs.ebmgn7v.32xlarge      2
    ecs.ebmgn8v.48xlarge      2
    ecs.ebmgn8is.32xlarge     2
    ecs.ebmgn8i.32xlarge      4
    ecs.gn8is.2xlarge         1
    ecs.gn8is.4xlarge         1
    ecs.gn8is-2x.8xlarge      1
    ecs.gn8is-4x.16xlarge     1
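If a launch script needs to know how many eRDMA NICs to expect, the table above can be expressed as a simple lookup. The following is a minimal sketch: the name erdma_nic_count is hypothetical and not part of any PAI SDK, and the values are copied from the table.

```python
# eRDMA NIC counts per GPU instance type, copied from the table above.
ERDMA_NICS = {
    "ecs.ebmgn7v.32xlarge": 2,
    "ecs.ebmgn8v.48xlarge": 2,
    "ecs.ebmgn8is.32xlarge": 2,
    "ecs.ebmgn8i.32xlarge": 4,
    "ecs.gn8is.2xlarge": 1,
    "ecs.gn8is.4xlarge": 1,
    "ecs.gn8is-2x.8xlarge": 1,
    "ecs.gn8is-4x.16xlarge": 1,
}


def erdma_nic_count(instance_type: str) -> int:
    """Return the number of eRDMA NICs, or 0 if the type is not in the table."""
    return ERDMA_NICS.get(instance_type.strip(), 0)
```

A return value of 0 means only that the instance type does not appear in the table on this page.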

Platform-preset environment variables

When you run jobs on eRDMA-supported instance types in general computing resources, PAI automatically enables eRDMA and sets default NCCL environment variables. You can adjust these variables based on your training framework, communication framework, and model characteristics. However, for optimal performance, we strongly recommend using the preset default variables.

Common environment variables

Variable            Value
PYTHONUNBUFFERED    1
TZ                  The time zone of the region where the job runs. A typical value is "Asia/Shanghai".

eRDMA network variables

Note

A hyphen (-) indicates that the environment variable is not applicable in this environment.

Variable                      Value
NCCL_DEBUG                    INFO
NCCL_SOCKET_IFNAME            eth0
NCCL_IB_TC                    -
NCCL_IB_SL                    -
NCCL_IB_GID_INDEX             1
NCCL_IB_HCA                   erdma
NCCL_IB_TIMEOUT               -
NCCL_IB_QPS_PER_CONNECTION    8
NCCL_MIN_NCHANNELS            16
NCCL_NET_PLUGIN               none
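Because the presets are recommended, it can be useful to check from inside a job whether any of them were overridden. The following sketch compares an environment against the preset values listed above. The helper name diff_from_presets is hypothetical, and variables marked with a hyphen are left out because no preset value applies to them.

```python
import os
from typing import Mapping, Optional

# Preset eRDMA network variables, copied from the table above.
# Variables marked "-" (not applicable) are intentionally omitted.
PRESET_NCCL_ENV = {
    "NCCL_DEBUG": "INFO",
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_IB_GID_INDEX": "1",
    "NCCL_IB_HCA": "erdma",
    "NCCL_IB_QPS_PER_CONNECTION": "8",
    "NCCL_MIN_NCHANNELS": "16",
    "NCCL_NET_PLUGIN": "none",
}


def diff_from_presets(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return {name: (preset, actual)} for every variable that differs.

    `actual` is None when the variable is not set at all.
    """
    env = os.environ if env is None else env
    return {
        name: (preset, env.get(name))
        for name, preset in PRESET_NCCL_ENV.items()
        if env.get(name) != preset
    }
```

Calling diff_from_presets() with no argument inspects the current process environment. An empty result means that all listed presets are in effect.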

Configure a custom image

When you submit a training job on eRDMA-supported general computing resources, you can build and use a custom image.

Environment requirements

  • CUDA >= 12.1

  • NCCL >= 2.19

  • Python 3
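The version requirements above can be checked with a numeric comparison rather than a string comparison (as strings, "2.9" would sort after "2.19"). The following is a minimal sketch. The helper names are hypothetical, and how you obtain the version strings depends on your image.

```python
import sys


def parse_version(text: str) -> tuple:
    """Turn '2.19.3-1' into (2, 19, 3); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.strip().split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version: str, minimum: str) -> bool:
    """Compare two dotted versions numerically, padding the shorter one with zeros."""
    have, need = parse_version(version), parse_version(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_image(cuda_version: str, nccl_version: str) -> bool:
    """Check the three requirements listed above: CUDA >= 12.1, NCCL >= 2.19, Python 3."""
    return (
        meets_minimum(cuda_version, "12.1")
        and meets_minimum(nccl_version, "2.19")
        and sys.version_info.major == 3
    )
```

For example, check_image("12.2.2", "2.19.3-1") corresponds to the versions in the name of the NCCL test image mentioned later on this page. If PyTorch is installed in the image, torch.version.cuda and torch.cuda.nccl.version() are one possible source for these values.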

Install the eRDMA library

For example, to install the eRDMA library on Ubuntu 22.04:

# Add the PGP key.
wget -qO - http://mirrors.cloud.aliyuncs.com/erdma/GPGKEY | sudo gpg --dearmour -o /etc/apt/trusted.gpg.d/erdma.gpg
# Add the apt source.
sudo mkdir -p /etc/apt/sources.list.d
echo "deb [ ] http://mirrors.cloud.aliyuncs.com/erdma/apt/ubuntu jammy/erdma main" | sudo tee /etc/apt/sources.list.d/erdma.list
# Update and install the eRDMA user-space driver packages.
sudo apt update
sudo apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1

For installation instructions on other distributions, see Use eRDMA in Docker containers.
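After the user-space driver is installed and the container runs on a supported instance, one way to confirm that an eRDMA device is visible is to list the RDMA devices that the Linux kernel exposes under /sys/class/infiniband. The sketch below makes two assumptions that this page does not state: that the function name list_erdma_devices is only illustrative, and that eRDMA device names start with "erdma" (inferred from the preset NCCL_IB_HCA=erdma).

```python
from pathlib import Path


def list_erdma_devices(sysfs_root: str = "/sys/class/infiniband",
                       prefix: str = "erdma") -> list:
    """Return the sorted names of RDMA devices whose name starts with `prefix`.

    Returns an empty list when the sysfs directory does not exist,
    for example when no RDMA driver is loaded.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.name.startswith(prefix))
```

An empty list on a supported instance type suggests that the NIC was not mounted in the container or that the driver packages are missing. The ibv_devinfo tool from the ibverbs-utils package installed above reports similar information.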

Sample Dockerfile

# Replace ${user_docker_image_url} with your existing Docker image URL.
FROM ${user_docker_image_url}
# If the RDMA library is already installed in the image, you must uninstall it first.
# rm -f keeps the build from failing when the Mellanox OFED source list is not present.
RUN rm -f /etc/apt/sources.list.d/mellanox_mlnx_ofed.list && \
    apt remove -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1
RUN wget -qO - http://mirrors.aliyun.com/erdma/GPGKEY | gpg --dearmour -o /etc/apt/trusted.gpg.d/erdma.gpg && \
    echo "deb [ ] http://mirrors.aliyun.com/erdma/apt/ubuntu jammy/erdma main" | tee /etc/apt/sources.list.d/erdma.list && \
    apt update && apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1

Run an NCCL test with an MPIJob

Submit a training job with the MPIJob framework and configure the following key parameters. For more information about other parameters, see Quickly submit an MPIJob training job.

Environment Information

  • Node Image: On the Image URL tab, enter the URL of your custom image. You can use the NCCL test image provided by PAI-DLC, which has the eRDMA dependencies pre-installed: dsw-registry-vpc.<RegionID>.cr.aliyuncs.com/pai/nccl-tests:12.2.2-cudnn8-devel-ubuntu22.04-nccl2.19.3-1-85f9143. Replace <RegionID> with your region ID. For example, the region ID for China (Ulanqab) is cn-wulanchabu. For more information about region IDs, see Regions and availability zones.

  • Start Command:

# The -np 16 and -npernode 8 flags specify two nodes with eight GPUs each, for a total of 16 GPUs.
mpirun --allow-run-as-root -np 16 -npernode 8 --bind-to none \
    -mca btl_tcp_if_include eth0 \
    -x UCX_TLS=tcp -x UCX_NET_DEVICES=eth0 \
    -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=1 \
    -x NCCL_IB_QPS_PER_CONNECTION=8 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
    -x NCCL_MIN_NCHANNELS=16 -x NCCL_ALGO=Ring -x PATH \
    /opt/nccl-tests/build/all_reduce_perf -b 32K -e 4G -f 2 -g 1 -t 1 -n 20

Resource Information

  • Resource Source: Select Resource Quota.

  • Resource Quota: Select a general computing resource quota, such as one with the ecs.ebmgn8v.48xlarge specification. For more information about how to create a resource quota, see General computing resource quotas.

  • Framework: Select MPIJob.

  • Task Resources: Configure the following parameters:

      • Number of nodes: 2

      • GPU (cards): 8

      • CPU (cores): 64

      • Memory (GiB): 256

      • Shared Memory (GiB): 256
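The -np value in the start command is the number of nodes multiplied by the number of GPUs per node (2 × 8 = 16), and -npernode is the number of GPUs per node. If you change the task resources, both flags must change with them. The following sketch derives the command from those two numbers. build_mpirun_command is a hypothetical helper, not part of PAI or Open MPI, and it only reassembles the flags already shown in the start command.

```python
import shlex

# Environment variables exported to every rank, copied from the start command.
NCCL_TEST_ENV = {
    "UCX_TLS": "tcp",
    "UCX_NET_DEVICES": "eth0",
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_IB_DISABLE": "0",
    "NCCL_IB_GID_INDEX": "1",
    "NCCL_IB_QPS_PER_CONNECTION": "8",
    "NCCL_DEBUG": "INFO",
    "NCCL_MIN_NCHANNELS": "16",
    "NCCL_ALGO": "Ring",
}


def build_mpirun_command(nodes: int, gpus_per_node: int,
                         binary: str = "/opt/nccl-tests/build/all_reduce_perf") -> str:
    """Assemble the NCCL test command for `nodes` nodes with `gpus_per_node` GPUs each."""
    args = [
        "mpirun", "--allow-run-as-root",
        "-np", str(nodes * gpus_per_node),   # total number of ranks (one per GPU)
        "-npernode", str(gpus_per_node),     # ranks started on each node
        "--bind-to", "none",
        "-mca", "btl_tcp_if_include", "eth0",
    ]
    for name, value in NCCL_TEST_ENV.items():
        args += ["-x", f"{name}={value}"]
    # Forward the caller's library and executable search paths unchanged.
    args += ["-x", "LD_LIBRARY_PATH", "-x", "PATH"]
    # One GPU per rank (-g 1), message sizes from 32 KiB to 4 GiB, doubling each step.
    args += [binary, "-b", "32K", "-e", "4G", "-f", "2", "-g", "1", "-t", "1", "-n", "20"]
    return shlex.join(args)
```

build_mpirun_command(2, 8) yields a command equivalent to the start command above, with the -x options in a different order.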

The following output shows a sample NCCL test result for eRDMA network bandwidth:

dlci              -worker-0:27:100 [2] NCCL INFO comm 0x5604ef4799e0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId a2000 commId 0xbbe9b8703193e2e4 - Init COMPLETE
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                                (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
   32768         8192     float     sum      -1    169.2    0.19    0.36      0    165.1    0.20    0.37      0
   65536        16384     float     sum      -1    181.7    0.36    0.68      0    170.0    0.39    0.72      0
  131072        32768     float     sum      -1    196.6    0.67    1.25      0    199.5    0.66    1.23      0
  262144        65536     float     sum      -1    227.8    1.15    2.16      0    226.0    1.16    2.17      0
  524288       131072     float     sum      -1    345.6    1.52    2.84      0    346.8    1.51    2.83      0
 1048576       262144     float     sum      -1   1258.2    0.83    1.56      0   1317.5    0.80    1.49      0
 2097152       524288     float     sum      -1   1367.7    1.53    2.88      0   1313.0    1.60    2.99      0
 4194304      1048576     float     sum      -1   1502.1    2.79    5.24      0   1504.8    2.79    5.23      0
 8388608      2097152     float     sum      -1   1763.1    4.76    8.92      0   1793.4    4.68    8.77      0
16777216      4194304     float     sum      -1   2838.3    5.91   11.08      0   2867.2    5.85   10.97      0
33554432      8388608     float     sum      -1   4952.3    6.78   12.70      0   4962.5    6.76   12.68      0
67108864     16777216     float     sum      -1   9027.4    7.43   13.94      0   8976.9    7.48   14.02      0
134217728     33554432     float     sum      -1    17641    7.61   14.27      0    17664    7.60   14.25      0
268435456     67108864     float     sum      -1    34935    7.68   14.41      0    34922    7.69   14.41      0
536870912    134217728     float     sum      -1    69392    7.74   14.51      0    69274    7.75   14.53      0
1073741824    268435456     float     sum      -1   138739    7.74   14.51      0   138833    7.73   14.50      0
2147483648    536870912     float     sum      -1   277591    7.74   14.51      0   277373    7.74   14.52      0
4294967296   1073741824     float     sum      -1   554176    7.75   14.53      0   554452    7.75   14.52      0
dlcinqih2e19f6nn-worker-0:28:28 [3] NCCL INFO comm 0x55bc7e54fe90 rank 3 nranks 16 cudaDev 3 busId c6000 - Destroy COMPLETE
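To read the table, it helps to know how the two bandwidth columns relate. As described in the nccl-tests performance notes (not on this page), algbw is the message size divided by the elapsed time, and for all-reduce busbw is algbw multiplied by 2(n-1)/n, where n is the number of ranks. The following sketch reproduces the reported values from the size and time columns. The function name is illustrative.

```python
def allreduce_bandwidth(size_bytes: int, time_us: float, nranks: int) -> tuple:
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    algbw = size / time; 1 byte per microsecond equals 1e-3 GB/s.
    busbw = algbw * 2 * (nranks - 1) / nranks for all-reduce.
    """
    algbw = size_bytes / time_us / 1e3
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


# Last out-of-place row of the sample output: 4 GiB in 554176 us on 16 ranks.
algbw, busbw = allreduce_bandwidth(4294967296, 554176, 16)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

For the last row this gives an algbw of about 7.75 GB/s and a busbw of about 14.53 GB/s, which matches the sample output. The busbw column levels off at about 14.5 GB/s for messages of 256 MiB and larger.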