Elastic Remote Direct Memory Access (eRDMA) is a proprietary elastic RDMA network developed by Alibaba Cloud. Some GPU instance types within PAI general computing resources support eRDMA. When you submit a DLC job with a specific image on these GPU instance types, the system automatically mounts an eRDMA network interface card in the container to accelerate distributed training.
Limitations
- This feature is available only for training jobs on subscription-based general computing resources.
- NCCL version 2.19 or later is required.
- The following table lists the GPU instance types that support eRDMA on the PAI DLC platform and the corresponding number of eRDMA network interface cards.
| GPU instance type | eRDMA NICs |
| --- | --- |
| ecs.ebmgn7v.32xlarge | 2 |
| ecs.ebmgn8v.48xlarge | 2 |
| ecs.ebmgn8is.32xlarge | 2 |
| ecs.ebmgn8i.32xlarge | 4 |
| ecs.gn8is.2xlarge | 1 |
| ecs.gn8is.4xlarge | 1 |
| ecs.gn8is-2x.8xlarge | 1 |
| ecs.gn8is-4x.16xlarge | 1 |
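The table above can be encoded as a small lookup for use in job scripts, for example to sanity-check how many eRDMA devices a container should see. This is a minimal sketch: the function name `expected_erdma_nics` is hypothetical, and the values simply mirror the table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the expected number of eRDMA NICs for an
# instance type, using the values from the table above.
# Instance types that are not listed print 0.
expected_erdma_nics() {
  case "$1" in
    ecs.ebmgn8i.32xlarge)
      echo 4 ;;
    ecs.ebmgn7v.32xlarge|ecs.ebmgn8v.48xlarge|ecs.ebmgn8is.32xlarge)
      echo 2 ;;
    ecs.gn8is.2xlarge|ecs.gn8is.4xlarge|ecs.gn8is-2x.8xlarge|ecs.gn8is-4x.16xlarge)
      echo 1 ;;
    *)
      echo 0 ;;
  esac
}

# Example: look up one of the listed instance types.
expected_erdma_nics ecs.ebmgn8v.48xlarge
```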
Platform-preset environment variables
When you run jobs on eRDMA-supported instance types in general computing resources, PAI automatically enables eRDMA and sets default NCCL environment variables. You can adjust these variables based on your training framework, communication framework, and model characteristics. However, for optimal performance, we strongly recommend using the preset default variables.
Common environment variables
| Variable | Value |
| --- | --- |
| PYTHONUNBUFFERED | 1 |
| TZ | Set to the time zone of the region where the job runs. A typical value is "Asia/Shanghai". |
eRDMA network variables
A hyphen (-) indicates that the environment variable is not applicable in this environment.
| Variable | Value |
| --- | --- |
| NCCL_DEBUG | INFO |
| NCCL_SOCKET_IFNAME | eth0 |
| NCCL_IB_TC | - |
| NCCL_IB_SL | - |
| NCCL_IB_GID_INDEX | 1 |
| NCCL_IB_HCA | erdma |
| NCCL_IB_TIMEOUT | - |
| NCCL_IB_QPS_PER_CONNECTION | 8 |
| NCCL_MIN_NCHANNELS | 16 |
| NCCL_NET_PLUGIN | none |
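If you do adjust a variable, one cautious pattern is to keep every platform preset that is already set and change only the variable you need. The sketch below assumes it runs at the top of a job start command. The function name `apply_erdma_defaults` is hypothetical, the fallback values are copied from the table above, and lowering `NCCL_DEBUG` to `WARN` is only an illustrative override, not a recommendation from the platform.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep any value the platform already set, and fall
# back to the documented preset only when a variable is unset or empty.
apply_erdma_defaults() {
  export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-eth0}"
  export NCCL_IB_GID_INDEX="${NCCL_IB_GID_INDEX:-1}"
  export NCCL_IB_HCA="${NCCL_IB_HCA:-erdma}"
  export NCCL_IB_QPS_PER_CONNECTION="${NCCL_IB_QPS_PER_CONNECTION:-8}"
  export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-16}"
  export NCCL_NET_PLUGIN="${NCCL_NET_PLUGIN:-none}"
}

apply_erdma_defaults

# Illustrative override: reduce NCCL log volume from the preset INFO level.
export NCCL_DEBUG=WARN

# Print the effective NCCL settings so that they appear in the job log.
env | grep '^NCCL_' | sort
```

Printing the effective values at startup makes it easier to confirm later which settings a given run actually used.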
Configure a custom image
When you submit a training job on eRDMA-supported general computing resources, you can build and use a custom image.
Environment requirements
- CUDA >= 12.1
- NCCL >= 2.19
- Python 3
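The requirements above can be checked inside a candidate image before you submit a job. The following is a rough sketch under several assumptions: `nvcc` is on `PATH`, NCCL was installed from the `libnccl2` Debian package, and GNU `sort -V` is available. The helper name `version_ge` is hypothetical. If any of these assumptions does not hold for your image, the corresponding check is skipped rather than failed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when version $1 >= version $2.
# Uses GNU sort -V so that, for example, 2.9 sorts before 2.19.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# CUDA: parse "release X.Y" from the nvcc banner, if nvcc is present.
if command -v nvcc >/dev/null 2>&1; then
  cuda_ver="$(nvcc --version | sed -n 's/.*release \([0-9.]*\).*/\1/p')"
  if version_ge "$cuda_ver" 12.1; then
    echo "CUDA $cuda_ver: OK"
  else
    echo "CUDA $cuda_ver: older than 12.1"
  fi
else
  echo "nvcc not found; skipping CUDA check"
fi

# NCCL: read the package version, assuming a libnccl2 .deb installation.
if command -v dpkg-query >/dev/null 2>&1 \
   && nccl_pkg="$(dpkg-query -W -f='${Version}' libnccl2 2>/dev/null)" \
   && [ -n "$nccl_pkg" ]; then
  nccl_ver="${nccl_pkg%%-*}"   # strip a suffix such as "-1+cuda12.2"
  if version_ge "$nccl_ver" 2.19; then
    echo "NCCL $nccl_ver: OK"
  else
    echo "NCCL $nccl_ver: older than 2.19"
  fi
else
  echo "libnccl2 package not found; skipping NCCL check"
fi

# Python 3: only confirm that an interpreter is available.
if command -v python3 >/dev/null 2>&1; then
  echo "Python 3: OK"
else
  echo "python3 not found"
fi
```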
Install the eRDMA library
For example, to install the eRDMA library on Ubuntu 22.04:
# Add the PGP key.
wget -qO - http://mirrors.cloud.aliyuncs.com/erdma/GPGKEY | sudo gpg --dearmour -o /etc/apt/trusted.gpg.d/erdma.gpg
# Add the apt source.
sudo mkdir -p /etc/apt/sources.list.d
echo "deb [ ] http://mirrors.cloud.aliyuncs.com/erdma/apt/ubuntu jammy/erdma main" | sudo tee /etc/apt/sources.list.d/erdma.list
# Update and install the eRDMA user-space driver packages.
sudo apt update
sudo apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1
For installation instructions on other distributions, see Use eRDMA in Docker containers.
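After installation, you can check whether the container sees any eRDMA devices. The sketch below uses `ibv_devices` from the `ibverbs-utils` package installed above and counts the devices whose name starts with `erdma`. That prefix is an assumption based on the preset `NCCL_IB_HCA=erdma` value, and the function name `count_erdma_devices` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count devices whose name starts with "erdma" in
# ibv_devices-style output read from stdin. Header lines are ignored
# because their first field does not match the prefix.
count_erdma_devices() {
  awk '$1 ~ /^erdma/ { n++ } END { print n + 0 }'
}

if command -v ibv_devices >/dev/null 2>&1; then
  echo "eRDMA devices visible: $(ibv_devices | count_erdma_devices)"
else
  echo "ibv_devices not found; install ibverbs-utils first"
fi
```

On a supported instance type, the count would be expected to match the number of eRDMA NICs listed for that type.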
Sample Dockerfile
# Replace ${user_docker_image_url} with your existing Docker image URL.
FROM ${user_docker_image_url}
# If an RDMA library is already installed in the image, uninstall it first.
# rm -f keeps the build from failing when the Mellanox OFED source list is absent.
RUN rm -f /etc/apt/sources.list.d/mellanox_mlnx_ofed.list && \
    apt remove -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1
# Add the eRDMA PGP key and apt source, then install the eRDMA user-space driver packages.
RUN wget -qO - http://mirrors.aliyun.com/erdma/GPGKEY | gpg --dearmour -o /etc/apt/trusted.gpg.d/erdma.gpg && \
    echo "deb [ ] http://mirrors.aliyun.com/erdma/apt/ubuntu jammy/erdma main" | tee /etc/apt/sources.list.d/erdma.list && \
    apt update && apt install -y libibverbs1 ibverbs-providers ibverbs-utils librdmacm1
Run an NCCL test with an MPIJob
Submit a training job with the MPIJob framework and configure the following key parameters. For more information about other parameters, see Quickly submit an MPIJob training job.
| Parameter | | Description |
| --- | --- | --- |
| Environment Information | Node Image | On the Image URL tab, enter the URL of your custom image. You can use the NCCL test image provided by PAI-DLC, which has the eRDMA dependencies pre-installed: Replace |
| | Start Command | |
| Resource Information | Resource Source | Select Resource Quota. |
| | Resource Quota | Select a general computing resource quota, such as one with the ecs.ebmgn8v.48xlarge specification. For more information about how to create a resource quota, see General computing resource quotas. |
| Framework | | Select MPIJob. |
| Task Resources | | Configure the following parameters: |
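For orientation only, a start command for an all-reduce bandwidth test might be assembled as sketched below. This is not the command from the PAI-DLC test image. It assumes Open MPI's `mpirun` and the `all_reduce_perf` binary from nccl-tests. The binary path, the rank count of 16, and the helper name `build_nccl_test_cmd` are all hypothetical. Host-list arguments are omitted because they depend on how the job exposes its worker hosts.

```shell
#!/usr/bin/env bash
# Hypothetical values: adjust the binary path and rank count to your image
# and resource quota.
NCCL_TESTS_BIN="${NCCL_TESTS_BIN:-/opt/nccl-tests/build/all_reduce_perf}"  # assumed path
NP="${NP:-16}"  # total ranks, for example 2 nodes x 8 GPUs

# Build the mpirun command line (Open MPI syntax assumed).
#   -x VAR      forwards an environment variable to every rank
#   -b/-e/-f    message sizes from 32 KB to 4 GB, doubling at each step
#   -g 1        one GPU per rank
build_nccl_test_cmd() {
  echo "mpirun --allow-run-as-root -np $NP" \
       "-x NCCL_DEBUG -x NCCL_IB_HCA -x NCCL_SOCKET_IFNAME" \
       "$NCCL_TESTS_BIN -b 32K -e 4G -f 2 -g 1"
}

# Dry run: print the command instead of executing it.
build_nccl_test_cmd
```

Printing the command first makes it easy to review before you paste it into the Start Command field.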
The following output shows a sample NCCL test result for eRDMA network bandwidth:
dlci -worker-0:27:100 [2] NCCL INFO comm 0x5604ef4799e0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId a2000 commId 0xbbe9b8703193e2e4 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
32768 8192 float sum -1 169.2 0.19 0.36 0 165.1 0.20 0.37 0
65536 16384 float sum -1 181.7 0.36 0.68 0 170.0 0.39 0.72 0
131072 32768 float sum -1 196.6 0.67 1.25 0 199.5 0.66 1.23 0
262144 65536 float sum -1 227.8 1.15 2.16 0 226.0 1.16 2.17 0
524288 131072 float sum -1 345.6 1.52 2.84 0 346.8 1.51 2.83 0
1048576 262144 float sum -1 1258.2 0.83 1.56 0 1317.5 0.80 1.49 0
2097152 524288 float sum -1 1367.7 1.53 2.88 0 1313.0 1.60 2.99 0
4194304 1048576 float sum -1 1502.1 2.79 5.24 0 1504.8 2.79 5.23 0
8388608 2097152 float sum -1 1763.1 4.76 8.92 0 1793.4 4.68 8.77 0
16777216 4194304 float sum -1 2838.3 5.91 11.08 0 2867.2 5.85 10.97 0
33554432 8388608 float sum -1 4952.3 6.78 12.70 0 4962.5 6.76 12.68 0
67108864 16777216 float sum -1 9027.4 7.43 13.94 0 8976.9 7.48 14.02 0
134217728 33554432 float sum -1 17641 7.61 14.27 0 17664 7.60 14.25 0
268435456 67108864 float sum -1 34935 7.68 14.41 0 34922 7.69 14.41 0
536870912 134217728 float sum -1 69392 7.74 14.51 0 69274 7.75 14.53 0
1073741824 268435456 float sum -1 138739 7.74 14.51 0 138833 7.73 14.50 0
2147483648 536870912 float sum -1 277591 7.74 14.51 0 277373 7.74 14.52 0
4294967296 1073741824 float sum -1 554176 7.75 14.53 0 554452 7.75 14.52 0
dlcinqih2e19f6nn-worker-0:28:28 [3] NCCL INFO comm 0x55bc7e54fe90 rank 3 nranks 16 cudaDev 3 busId c6000 - Destroy COMPLETE
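To compare runs, it can help to reduce output like the above to one number per mode. The sketch below extracts the peak out-of-place and in-place bus bandwidth from nccl-tests output on stdin. It assumes the 13-column layout shown above, with `busbw` in columns 8 and 12, and the function name `peak_busbw` is hypothetical. The sample rows are the last two data rows of the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the peak bus bandwidth (GB/s) from nccl-tests
# output on stdin. Data rows have 13 fields and start with the message size
# in bytes; log lines and "#" header lines are skipped.
peak_busbw() {
  awk 'NF == 13 && $1 ~ /^[0-9]+$/ {
         if ($8  > oop) oop = $8    # out-of-place busbw
         if ($12 > ip)  ip  = $12   # in-place busbw
       }
       END { printf "out-of-place=%.2f in-place=%.2f\n", oop, ip }'
}

# Example with the last two data rows of the sample output above.
peak_busbw <<'EOF'
  2147483648     536870912     float     sum      -1   277591    7.74   14.51      0   277373    7.74   14.52      0
  4294967296    1073741824     float     sum      -1   554176    7.75   14.53      0   554452    7.75   14.52      0
EOF
# prints: out-of-place=14.53 in-place=14.52
```

For a real run, pipe the job log into the function, for example `peak_busbw < nccl_test.log`, where the log file name is a placeholder.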