Run GDR applications on eRDMA nodes in ACK

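GPU Direct RDMA (GDR) is an NVIDIA technology for high-performance computing and deep learning. It lets a GPU exchange data directly with other devices that support RDMA (Remote Direct Memory Access), such as other GPUs or certain accelerators, without relaying the data through the CPU. This topic describes how to run GDR applications on eRDMA nodes in ACK.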

Prerequisites

  • Arena is installed in hostNetwork mode. For more information, see Configure the Arena client.

  • kubectl is connected to the cluster. For more information, see Obtain the cluster KubeConfig and connect to the cluster by using kubectl.

  • The eRDMA Device Plugin is deployed in the cluster. For more information, see Subsequent operations.

    YAML file for deploying the eRDMA Device Plugin:

    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rdma-devices
      namespace: kube-system
    data:
      config.json: |
        {
            "mode" : "hca",
            "deviceType" : "eRDMA"
        }
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: rdma-sriov-dp-ds
      namespace: kube-system
      labels:
        app: rdma-device-plugin
    spec:
      selector:
        matchLabels:
          app: rdma-device-plugin
      template:
        metadata:
          labels:
            app: rdma-device-plugin
            name: rdma-sriov-dp-ds
        spec:
          hostNetwork: true
          nodeSelector:
            aliyun.accelerator/erdma: "true"
          tolerations:
          - key: CriticalAddonsOnly
            operator: Exists
          containers:
          - image: registry-cn-beijing.ack.aliyuncs.com/acs/k8s-rdma-sriov-dev-plugin:v1.0.0-b3dcbc5-aliyun
            name: k8s-rdma-sriov-dp-ds
            imagePullPolicy: Always
            resources:
              limits:
                memory: "300Mi"
                cpu: "300m"
              requests:
                memory: "300Mi"
                cpu: "300m"
            securityContext:
              privileged: true
            volumeMounts:
              - name: device-plugin
                mountPath: /var/lib/kubelet/device-plugins
              - name: config
                mountPath: /k8s-rdma-sriov-dev-plugin
          volumes:
            - name: device-plugin
              hostPath:
                path: /var/lib/kubelet/device-plugins
            - name: config
              configMap:
                name: rdma-devices
                items:
                - key: config.json
                  path: config.json
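
After the manifest above is applied, it can help to confirm that the DaemonSet is fully rolled out before submitting a job. The following is a minimal sketch, not an official procedure: it reads the standard DaemonSet status fields through kubectl, and the function name erdma_plugin_status is hypothetical. The namespace and DaemonSet name are taken from the manifest.

```shell
#!/usr/bin/env bash
# Sketch: report whether the eRDMA Device Plugin DaemonSet is fully ready.
# kube-system and rdma-sriov-dp-ds come from the manifest above;
# the function name is illustrative only.

# Prints "ready" when every scheduled plugin Pod is ready, "not-ready" otherwise.
erdma_plugin_status() {
  local desired ready
  # Number of nodes the DaemonSet should run on (nodes labeled for eRDMA).
  desired=$(kubectl -n kube-system get daemonset rdma-sriov-dp-ds \
    -o jsonpath='{.status.desiredNumberScheduled}') || return 1
  # Number of plugin Pods that are currently ready.
  ready=$(kubectl -n kube-system get daemonset rdma-sriov-dp-ds \
    -o jsonpath='{.status.numberReady}') || return 1
  if [ "${desired:-0}" -gt 0 ] && [ "$desired" = "$ready" ]; then
    echo "ready"
  else
    echo "not-ready"
  fi
}
```

Calling erdma_plugin_status from a shell that is connected to the cluster prints a single word. A result of not-ready with zero desired Pods usually means that no node carries the aliyun.accelerator/erdma: "true" label that the nodeSelector requires; kubectl get nodes -l aliyun.accelerator/erdma=true lists the matching nodes.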
    

Procedure

  1. Submit a job by using Arena.

    arena submit mpijob \
      --name=mpi-allreduce-sync-erdma \
      --rdma \
      -e NCCL_DEBUG=TRACE \
      -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
      -e OMPI_ALLOW_RUN_AS_ROOT=1 \
      --gpus=8 \
      --memory=16Gi \
      --hostNetwork true \
      --cpu=4 \
      --workers=2 \
      --image=registry.cn-beijing.aliyuncs.com/acs/horovod:0.28.1-tf2.9.2-torch1.12.1-py3.8-erdma \
      --toleration all \
      "mpirun -np 2 \
      --allow-run-as-root \
      --mca btl_tcp_if_include bond0 \
      --mca oob_tcp_if_include bond0 \
      --mca pml ob1 \
      --mca btl ^openib \
      python /examples/pytorch/pytorch_synthetic_benchmark.py"

    Expected output:

    iZ2zeg0kcgyxepyc5r63kgZ:17:28 [0] NCCL INFO NET/IB : Using [0]rocep26s0:1/RoCE ; OOB eth0:192.168.8.128<0>
    iZ2zeg0kcgyxepyc5r63kgZ:17:28 [0] NCCL INFO Using network IB
    iZ2zeg0kcgyxepyc5r63kgZ:18:27 [1] NCCL INFO NET/IB : Using [0]rocep26s0:1/RoCE ; OOB eth0:192.168.8.128<0>
    iZ2zeg0kcgyxepyc5r63kgZ:18:27 [1] NCCL INFO Using network IB

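    The NCCL initialization log of the job shows that NCCL detected the eRDMA device (rocep26s0) in RoCE mode and uses the eRDMA IB device for network communication during training.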
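
The check described above can be scripted. The following sketch reads NCCL log lines from standard input, succeeds only if NCCL reports the IB transport, and prints the RDMA device names found in the NET/IB lines. The function name nccl_uses_rdma is hypothetical, and the patterns are based only on the log lines shown in the expected output, so other NCCL versions may format these lines differently.

```shell
#!/usr/bin/env bash
# Sketch: verify from NCCL logs that the IB (RDMA) transport was selected.
# The function name is illustrative; patterns match the sample output above.

# Reads log lines on stdin. Returns non-zero if "Using network IB" is absent;
# otherwise prints each distinct device name reported on a NET/IB line.
nccl_uses_rdma() {
  local log
  log=$(cat)
  # NCCL prints this line once per rank when the IB transport is chosen.
  printf '%s\n' "$log" | grep -q 'NCCL INFO Using network IB' || return 1
  # Extract the first device on each line, for example rocep26s0 from
  # "NET/IB : Using [0]rocep26s0:1/RoCE".
  printf '%s\n' "$log" \
    | sed -n 's/.*NET\/IB : Using \[[0-9]*\]\([^:]*\):.*/\1/p' \
    | sort -u
}
```

Assuming the job logs are available through Arena, one way to use it is arena logs mpi-allreduce-sync-erdma | nccl_uses_rdma. If the function fails, NCCL most likely fell back to another transport such as sockets, and the job is not using eRDMA.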

  2. Check the status of the eRDMA NICs on the host.

    $ ibv_devinfo
    hca_id:	rocep156s0
    	transport:			eRDMA
    	fw_ver:				0.2.0
    	node_guid:			0216:3eff:fe2c:b8f3
    	sys_image_guid:			0216:3eff:fe2c:b8f3
    	vendor_id:			0x1ded
    	vendor_part_id:			4223
    	hw_ver:				0x0
    	phys_port_cnt:			1
    		port:	1
    			state:			PORT_DOWN (1)
    			max_mtu:		1024 (3)
    			active_mtu:		1024 (3)
    			sm_lid:			0
    			port_lid:		0
    			port_lmc:		0x00
    			link_layer:		Ethernet
    
    hca_id:	rocep26s0
    	transport:			eRDMA
    	fw_ver:				0.2.0
    	node_guid:			0216:3eff:fe10:f8b0
    	sys_image_guid:			0216:3eff:fe10:f8b0
    	vendor_id:			0x1ded
    	vendor_part_id:			4223
    	hw_ver:				0x0
    	phys_port_cnt:			1
    		port:	1
    			state:			PORT_ACTIVE (4)
    			max_mtu:		1024 (3)
    			active_mtu:		1024 (3)
    			sm_lid:			0
    			port_lid:		0
    			port_lmc:		0x00
    			link_layer:		Ethernet
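
In the output above, only rocep26s0 has a port in the PORT_ACTIVE state, and it is the device that NCCL selected in step 1. The following sketch filters ibv_devinfo output down to the devices with an active port. The function name active_rdma_devices is hypothetical, and the parsing assumes the field layout shown above.

```shell
#!/usr/bin/env bash
# Sketch: list RDMA devices that have at least one active port.
# The function name is illustrative; parsing follows the output shown above.

# Reads ibv_devinfo output on stdin and prints the hca_id of every device
# whose port state is PORT_ACTIVE, one name per line.
active_rdma_devices() {
  awk '
    $1 == "hca_id:" { dev = $2 }                         # remember the current device
    $1 == "state:" && $2 == "PORT_ACTIVE" { print dev }  # report it if a port is up
  ' | sort -u
}
```

On the host, run ibv_devinfo | active_rdma_devices. For the sample output in this step, the result is the single line rocep26s0.
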
  3. Monitor eRDMA traffic on the host by using the eadm tool.

    $ eadm stat -d rocep26s0 -l
    Monitoring rocep26s0...    (press CTRL-C to stop)
    
     15:59:56  rx:           0 B/s     0 p/s          tx:           0 B/s     0 p/s
    
    
     rocep26s0  /  traffic statistics
    
                               rx         |       tx
    --------------------------------------+------------------
      bytes                    11.06 KiB  |       11.18 KiB
    --------------------------------------+------------------
              max            15.43 KiB/s  |     15.10 KiB/s
          average             4.03 KiB/s  |      4.07 KiB/s
              min                  0 B/s  |           0 B/s
    --------------------------------------+------------------
      packets                    8406769  |         8546764
    --------------------------------------+------------------
              max              38990 p/s  |       37488 p/s
          average               2988 p/s  |        3038 p/s
              min                  0 p/s  |           0 p/s
    --------------------------------------+------------------
      time                 33.78 minutes

    The output above shows that eRDMA traffic can be monitored in real time.