Configure eRDMA in Docker on a GPU instance

Elastic Remote Direct Memory Access (eRDMA) lets containerized applications bypass the OS kernel and access physical eRDMA devices directly for faster data transfer. Use a prebuilt container image to configure eRDMA in Docker on a GPU-accelerated instance.

Note

If your services require large-scale RDMA networking, create and attach an Elastic RDMA Interface to a supported GPU-accelerated instance type. For more information, see eRDMA overview.

Before you begin

Obtain the eRDMA container image details, including compatible GPU-accelerated instance types and the image address.

  1. Log on to the Container Registry console.

  2. In the left-side navigation pane, click Artifact Center.

  3. In the Repository Name search box, search for erdma and select the egs/erdma image.

    The eRDMA container image is updated approximately every three months.

    Image name: eRDMA

    • Version information: Python 3.10.12, CUDA 12.4.1, cuDNN 9.1.0.70, NCCL 2.21.5, base image Ubuntu 22.04

    • Image address: egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.4.1-cudnn9-ubuntu22.04

    Image name: eRDMA

    • Version information: Python 3.10.12, CUDA 12.1.1, cuDNN 8.9.0.131, NCCL 2.17.1, base image Ubuntu 22.04

    • Image address: egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04

    Supported instances: The eRDMA container image is available only for ebmgn7ex, ebmgn7ix, and all 8th-generation GPU-accelerated instances (such as ebmgn8is and gn8is).

    Benefits:

    • Directly access the Alibaba Cloud eRDMA network from within a container.

    • Alibaba Cloud provides compatible eRDMA, drivers, and CUDA to ensure the feature works out of the box.
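If you script the setup, it can help to pick the image address from the CUDA version instead of hard-coding it. The following is a minimal sketch, assuming only the two image addresses listed above; the function name erdma_image is hypothetical and is not part of any Alibaba Cloud tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a CUDA version to one of the two eRDMA image
# addresses listed in this document. Any other version is rejected.
erdma_image() {
  local registry="egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma"
  case "$1" in
    12.4.1) echo "${registry}:cuda12.4.1-cudnn9-ubuntu22.04" ;;
    12.1.1) echo "${registry}:cuda12.1.1-cudnn8-ubuntu22.04" ;;
    *)
      echo "unsupported CUDA version: $1" >&2
      return 1
      ;;
  esac
}

# Example usage (commented out so that sourcing this file pulls nothing):
#   sudo docker pull "$(erdma_image 12.1.1)"
```

Rejecting unknown versions keeps a typo from turning into a pull of a tag that does not exist.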

Procedure

Install Docker on a GPU-accelerated instance and enable eRDMA to access eRDMA devices from containers. This example uses Ubuntu 20.04.

  1. Create a GPU-accelerated instance and configure eRDMA.

    For more information, see Enable eRDMA on a GPU-accelerated instance.

    Create a GPU-accelerated instance with an Elastic RDMA Interface in the ECS console and select the Install GPU Driver and Install eRDMA software stack options.

    Note

    The Tesla driver, CUDA, cuDNN library, and eRDMA software stack are installed automatically, which is faster than manual installation.


  2. Connect to the GPU-accelerated instance.

    For more information, see Connect to a Linux instance by using a password or key.

  3. Install Docker on the instance.

    sudo apt-get update
    sudo apt-get -y install ca-certificates curl
    
    sudo install -m 0755 -d /etc/apt/keyrings
    sudo curl -fsSL http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
    sudo chmod a+r /etc/apt/keyrings/docker.asc
    
    echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu \
      $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
      sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    
    sudo apt-get update
    sudo apt-get install -y docker-ce docker-ce-cli containerd.io
  4. Verify the Docker installation:

    docker -v
  5. Install nvidia-container-toolkit.

    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
  6. Enable Docker to start at boot and restart the service.

    sudo systemctl enable docker
    sudo systemctl restart docker
  7. Pull the eRDMA container image.

    sudo docker pull egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04
  8. Start the eRDMA container.

    sudo docker run -d -t --network=host --gpus all \
      --privileged \
      --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
      --name erdma \
      -v /root:/root \
      egs-registry.cn-hangzhou.cr.aliyuncs.com/egs/erdma:cuda12.1.1-cudnn8-ubuntu22.04
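Before moving on to verification, you may want to confirm that the container from step 8 is actually running. The following is a minimal sketch that wraps docker inspect. The function name container_state and the DOCKER override variable are hypothetical, added only so that the function can be exercised without sudo or a running Docker daemon.

```shell
#!/usr/bin/env bash
# Sketch: print "running", "stopped", or "missing" for a container name.
# DOCKER is a hypothetical override; by default the function calls "sudo docker".
container_state() {
  local name="${1:-erdma}" running
  # docker inspect prints "true" or "false" for .State.Running and exits
  # non-zero when the container does not exist.
  if ! running="$(${DOCKER:-sudo docker} inspect -f '{{.State.Running}}' "$name" 2>/dev/null)"; then
    echo "missing"
  elif [ "$running" = "true" ]; then
    echo "running"
  else
    echo "stopped"
  fi
}

# Example usage (commented out so that sourcing this file runs nothing):
#   container_state erdma
```

If the state is running, sudo docker exec erdma nvidia-smi -L should list the GPUs that are visible inside the container.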

Verification

This example uses two GPU-accelerated instances (host1 and host2), both with Docker installed and an eRDMA container running.

  1. Verify the eRDMA devices in the containers on both host1 and host2.

    1. Enter the container:

      sudo docker exec -it erdma bash
    2. Check the eRDMA devices in the container.

      ibv_devinfo

      If both eRDMA devices show PORT_ACTIVE, they are working correctly.

      Figure: eRDMA device status

  2. Run nccl-tests from the containers on host1 and host2.

    1. Download the nccl-tests source code.

      git clone https://github.com/NVIDIA/nccl-tests.git
    2. Compile nccl-tests.

      apt update
      apt install openmpi-bin libopenmpi-dev -y
      cd nccl-tests && make MPI=1 CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local/cuda MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
    3. Set up passwordless SSH between host1 and host2 on port 12345.

      After setup, run ssh -p 12345 <peer IP address> from a container to test connectivity.

      1. On host1, generate an SSH key and copy the public key to host2.

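        # Generate an RSA key pair so that ~/.ssh/id_rsa.pub exists
        ssh-keygen -t rsa
        # ${host2} is a placeholder for the IP address of host2
        ssh-copy-id -i ~/.ssh/id_rsa.pub ${host2}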
      2. On host2, install SSH and start the server on port 12345.

        apt-get update && apt-get install ssh -y
        mkdir -p /run/sshd
        /usr/sbin/sshd -p 12345
      3. On host1, test the passwordless connection to host2.

        ssh root@${host2} -p 12345
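    4. On host1, run the all_reduce_perf test. Replace 172.16.15.237 and 172.16.15.235 with the private IP addresses of your two instances. The example assumes 8 GPUs per instance (-npernode 8), for 16 processes in total (-np 16).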

      mpirun --allow-run-as-root -np 16 -npernode 8 -H 172.16.15.237:8,172.16.15.235:8 \
       --bind-to none -mca btl_tcp_if_include eth0 \
       -x NCCL_SOCKET_IFNAME=eth0 \
       -x NCCL_IB_DISABLE=0 \
       -x NCCL_IB_GID_INDEX=1 \
       -x NCCL_NET_GDR_LEVEL=5 \
       -x NCCL_DEBUG=INFO \
       -x NCCL_ALGO=Ring -x NCCL_P2P_LEVEL=3 \
       -x LD_LIBRARY_PATH -x PATH \
       -mca plm_rsh_args "-p 12345" \
       /workspace/nccl-tests/build/all_reduce_perf -b 1G -e 1G -f 2 -g 1 -n 20

      Expected output:

      Figure: nccl-tests output

  3. On the host (outside the container), monitor eRDMA network traffic.

    eadm stat -d erdma_0 -l

    Traffic on the eRDMA network confirms that eRDMA is active.

    Figure: eRDMA traffic monitoring
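Two of the checks above can be scripted. The following is a minimal sketch, not an official tool: count_active_ports counts the ports that ibv_devinfo reports as PORT_ACTIVE, and mpirun_host_args builds the -np, -npernode, and -H arguments from a GPU count per node and a host list. Both function names are hypothetical. The sketch assumes that ibv_devinfo prints one "state: PORT_ACTIVE (4)" line per active port.

```shell
#!/usr/bin/env bash
# Sketch: read ibv_devinfo-style text on stdin and print the number of
# ports whose state line contains PORT_ACTIVE.
count_active_ports() {
  # grep -c prints 0 and exits non-zero when nothing matches, so ignore the status.
  grep -c 'state:.*PORT_ACTIVE' || true
}

# Sketch: build the process-count and host arguments for mpirun.
# Usage: mpirun_host_args <gpus_per_node> <host>...
mpirun_host_args() {
  local gpus_per_node="$1"
  shift
  local hosts="" count=0 h
  for h in "$@"; do
    # Append "host:slots", comma-separated after the first entry.
    hosts="${hosts:+${hosts},}${h}:${gpus_per_node}"
    count=$((count + 1))
  done
  # Total processes = number of hosts x GPUs per node.
  echo "-np $((count * gpus_per_node)) -npernode ${gpus_per_node} -H ${hosts}"
}

# Example usage (commented out so that sourcing this file runs nothing):
#   sudo docker exec erdma ibv_devinfo | count_active_ports
#   mpirun_host_args 8 172.16.15.237 172.16.15.235
```

With 8 GPUs per node and the two example addresses, mpirun_host_args produces the same -np 16 -npernode 8 -H arguments used in the all_reduce_perf command above.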

Related documents