Elastic Remote Direct Memory Access (eRDMA) lets containerized applications bypass the OS kernel and access physical eRDMA devices directly for faster data transfer. Use a prebuilt container image to configure eRDMA in Docker on a GPU-accelerated instance.
If your services require large-scale RDMA networking, create and attach an Elastic RDMA Interface to a supported GPU-accelerated instance type. For more information, see eRDMA overview.
Before you begin
Obtain the eRDMA container image details, including compatible GPU-accelerated instance types and the image address.
-
Log on to the Container Registry console.
-
In the left-side navigation pane, click Artifact Center.
-
In the Repository Name search box, search for erdma and select the egs/erdma image.
The eRDMA container image is updated approximately every three months.
Image name: eRDMA
Version information: Python 3.10.12, CUDA 12.4.1, cuDNN 9.1.0.70, NCCL 2.21.5, base image Ubuntu 22.04
Image address: egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.4.1-cudnn9-ubuntu22.04

Image name: eRDMA
Version information: Python 3.10.12, CUDA 12.1.1, cuDNN 8.9.0.131, NCCL 2.17.1, base image Ubuntu 22.04
Image address: egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04

Supported instances: The eRDMA container image is available only for ebmgn7ex, ebmgn7ix, and all 8th-generation GPU-accelerated instances (such as ebmgn8is and gn8is).
Benefits:
-
Directly access the Alibaba Cloud eRDMA network from within a container.
-
Alibaba Cloud provides compatible eRDMA, drivers, and CUDA to ensure the feature works out of the box.
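The two image variants differ only in their CUDA, cuDNN, and NCCL versions. As a minimal sketch, the lookup below maps a CUDA version to the matching image address from the listing above. The function name erdma_image_address is hypothetical and not part of any Alibaba Cloud tooling.

```shell
#!/bin/sh
# Hypothetical helper: return the eRDMA image address for a given CUDA version.
# The two addresses are copied from the image listing; nothing else is assumed.
erdma_image_address() {
    registry="egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma"
    case "$1" in
        12.4.1) printf '%s\n' "${registry}:cuda12.4.1-cudnn9-ubuntu22.04" ;;
        12.1.1) printf '%s\n' "${registry}:cuda12.1.1-cudnn8-ubuntu22.04" ;;
        *)
            # Any other version is not listed, so report it and fail.
            printf 'no eRDMA image listed for CUDA %s\n' "$1" >&2
            return 1
            ;;
    esac
}
```

For example, `sudo docker pull "$(erdma_image_address 12.1.1)"` would pull the CUDA 12.1.1 variant.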
Procedure
Install Docker on a GPU-accelerated instance and enable eRDMA to access eRDMA devices from containers. This example uses Ubuntu 20.04.
-
Create a GPU-accelerated instance and configure eRDMA.
For more information, see Enable eRDMA on a GPU-accelerated instance.
Create a GPU-accelerated instance with an Elastic RDMA Interface in the ECS console and select the Install GPU Driver and Install eRDMA software stack options.
Note: The Tesla driver, CUDA, cuDNN library, and eRDMA software stack are installed automatically, which is faster than manual installation.

-
Connect to the GPU-accelerated instance.
-
Install Docker on the instance.
sudo apt-get update
sudo apt-get -y install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
-
Verify the Docker installation:
docker -v
-
Install nvidia-container-toolkit.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
-
Enable Docker to start at boot and restart the service.
sudo systemctl enable docker
sudo systemctl restart docker
-
Pull the eRDMA container image.
sudo docker pull egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04
-
Start the eRDMA container.
sudo docker run -d -t --network=host --gpus all \
  --privileged \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --name erdma \
  -v /root:/root \
  egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04
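Before moving on to verification, it can help to confirm that the container actually started. The sketch below is one way to do that. The function name container_listed is hypothetical, and it only inspects a list of container names supplied on standard input, such as the output of docker ps --format '{{.Names}}'.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the container name given as $1 appears
# as a whole line in the list of names read from standard input.
container_listed() {
    # -F: fixed string, -x: match the whole line, -q: no output.
    grep -Fxq -- "$1"
}

# Example usage on the host (commented out so the sketch has no side effects):
#   sudo docker ps --format '{{.Names}}' | container_listed erdma \
#       && echo "container erdma is running"
```

Matching the whole line avoids a false positive from a container whose name merely contains "erdma".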
Verification
This example uses two GPU-accelerated instances (host1 and host2), both with Docker installed and an eRDMA container running.
-
Verify the eRDMA devices in the containers on both host1 and host2.
-
Enter the container:
sudo docker exec -it erdma bash
-
Check the eRDMA devices in the container.
ibv_devinfo
If both eRDMA devices show PORT_ACTIVE, they are working correctly.
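The PORT_ACTIVE check can also be scripted. The following is a minimal sketch that assumes ibv_devinfo prints one "state:" line per port in its default output. The function name all_ports_active is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read ibv_devinfo output on standard input, print
# "<active>/<total> ports active", and succeed only if every port that was
# found reports PORT_ACTIVE (and at least one port was found).
all_ports_active() {
    awk '
        $1 == "state:" {
            total++
            if ($2 == "PORT_ACTIVE") active++
        }
        END {
            printf "%d/%d ports active\n", active, total
            exit !(total > 0 && active == total)
        }
    '
}

# Example usage inside the container (commented out):
#   ibv_devinfo | all_ports_active
```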
-
Run nccl-tests from the containers on host1 and host2.
-
Download the nccl-tests source code.
git clone https://github.com/NVIDIA/nccl-tests.git
-
Compile nccl-tests.
apt update
apt install openmpi-bin libopenmpi-dev -y
cd nccl-tests && make MPI=1 CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/cuda MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
-
Set up passwordless SSH between host1 and host2 on port 12345.
After setup, run ssh -p 12345 ip from a container to test connectivity.
-
On host1, generate an SSH key and copy the public key to host2.
ssh-keygen
ssh-copy-id -i ~/.ssh/id_rsa.pub ${host2}
-
On host2, install SSH and start the server on port 12345.
apt-get update && apt-get install ssh -y
mkdir /run/sshd
/usr/sbin/sshd -p 12345
-
On host1, test the passwordless connection to host2.
ssh root@${host2} -p 12345
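Because mpirun launches remote processes over SSH without a terminal, a password prompt would make the launch hang or fail. As a minimal sketch, the check below uses the standard OpenSSH options BatchMode and ConnectTimeout so that the test fails immediately instead of prompting. The function name check_passwordless and the 5-second timeout are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: verify that root@<host> accepts key-based login on
# the given port (default 12345) without any interactive prompt.
#   $1: address of the peer host
#   $2: SSH port (optional)
check_passwordless() {
    # BatchMode=yes disables password prompts, so a missing key is an error.
    # ConnectTimeout=5 gives up after 5 seconds if the port is unreachable.
    ssh -o BatchMode=yes -o ConnectTimeout=5 -p "${2:-12345}" "root@$1" true
}

# Example usage from the container on host1 (commented out):
#   check_passwordless "${host2}" && echo "passwordless SSH works"
```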
-
On host1, run the all_reduce_perf test.
mpirun --allow-run-as-root -np 16 -npernode 8 -H 172.16.15.237:8,172.16.15.235:8 \
  --bind-to none -mca btl_tcp_if_include eth0 \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_IB_GID_INDEX=1 \
  -x NCCL_NET_GDR_LEVEL=5 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_ALGO=Ring -x NCCL_P2P_LEVEL=3 \
  -x LD_LIBRARY_PATH -x PATH \
  -mca plm_rsh_args "-p 12345" \
  /workspace/nccl-tests/build/all_reduce_perf -b 1G -e 1G -f 2 -g 1 -n 20
Expected output:

-
On the host (outside the container), monitor eRDMA network traffic.
eadm stat -d erdma_0 -l
Traffic on the eRDMA network confirms that eRDMA is active.
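In the mpirun command used for the all_reduce_perf test, -np 16 is the total process count (2 hosts with 8 GPUs each), -npernode 8 is the number of processes per host, and -H lists each host with its slot count. The sketch below derives these three arguments from a GPU count and a list of host addresses. The function name mpi_host_args is hypothetical, and it assumes every host has the same number of GPUs.

```shell
#!/bin/sh
# Hypothetical helper: build the -np, -npernode, and -H arguments for mpirun.
#   $1:      GPUs (processes) per host
#   $2...:   host addresses
mpi_host_args() {
    gpus_per_node="$1"
    shift
    hosts=""
    count=0
    for ip in "$@"; do
        # Append "ip:slots", separated by commas after the first entry.
        hosts="${hosts:+${hosts},}${ip}:${gpus_per_node}"
        count=$((count + 1))
    done
    printf '%s\n' "-np $((count * gpus_per_node)) -npernode ${gpus_per_node} -H ${hosts}"
}

# Example: mpi_host_args 8 172.16.15.237 172.16.15.235
```

With the two addresses from the example, the function prints the same -np, -npernode, and -H values that appear in the mpirun command.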

Related documents
-
To enable high-speed RDMA interconnectivity between GPU-accelerated instances in a VPC, see Enable eRDMA on a GPU-accelerated instance.
-
To configure eRDMA in Docker for high-performance networking on GPU-accelerated instances, see Enable eRDMA in a Docker container.