To further optimize the network performance of GPU servers built on the X-Dragon architecture, Alibaba Cloud provides the GPU-accelerated Super Computing Cluster (SCC) instance family, the sccgn series. These instances combine powerful compute capabilities with high-performance network communication. This topic describes how to use sccgn series instances.

Prerequisites

You have selected Install RDMA software stack in the image selection step when creating the GPU-accelerated SCC instance. If your workload needs GPUDirect RDMA, you must also select Install GPU driver so that the required software stack and toolkits are installed automatically.

GPUDirect RDMA is a data path technology that NVIDIA introduced with Kepler-class GPUs and CUDA 5.0. It uses standard features of the PCIe bus to provide a direct data path between a GPU and third-party peer devices. Examples include GPU-to-GPU communication, network interfaces, video acquisition devices, and storage adapters. For more information, see the NVIDIA official documentation.

If the NIC driver you installed is the open source OFED release (download link), you can install the GPU driver and CUDA after installing the NIC driver.
Note: CUDA 11.4 and driver releases after R470 already include the nvidia_peermem module, so you do not need to install the nv_peer_mem module separately. For more information, see nv_peer_memory.
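
If you want to confirm the stack before running workloads, the following shell sketch checks the installed OFED version and the GPUDirect RDMA kernel module. It is a minimal sketch: it assumes a Mellanox OFED installation (which provides ofed_info) and CUDA 11.4/driver R470 or later (which ship nvidia_peermem); on older stacks, substitute nv_peer_mem.

#!/bin/sh
# Print the installed OFED version (assumes Mellanox OFED, which provides ofed_info).
ofed_info -s
# Check whether the GPUDirect RDMA module is loaded; try to load it if not.
# nvidia_peermem ships with CUDA 11.4/driver R470+; older stacks use nv_peer_mem.
lsmod | grep -i peermem || modprobe nvidia_peermem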

Background information

sccgn instances are equipped with both GPU compute cards and Mellanox high-performance NICs, providing powerful compute and network communication capabilities. They are suitable for workloads that combine intensive computation with dense communication, such as deep learning and high-performance computing.

Functional verification and bandwidth verification

  • Functional verification

    This check verifies that the RDMA software stack is installed correctly and configured properly. Run the following command to print the check results. For issues that may occur during the check, see FAQ.

    rdma_qos_check -V

    If the output is similar to the following, the RDMA software stack is installed correctly.

    ===========================================================
    *    rdma_qos_check
    -----------------------------------------------------------
    * ITEM          DETAIL                               RESULT
    ===========================================================
    * link_up       eth1: yes                                ok
    * mlnx_device   eth1: 1                                  ok
    * drv_ver       eth1: 5.2-2.2.3                          ok
    ...
    * pci           0000:c5:00.1                             ok
    * pci           0000:e1:00.0                             ok
    * pci           0000:e1:00.1                             ok
    ===========================================================
  • Bandwidth verification

    This check verifies whether the RDMA network bandwidth matches the expected performance of the corresponding hardware. A wrapper sketch that automates the server and client steps follows this section.

    • Server-side command
      ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0

      Sample output:

      ---------------------------------------------------------------------------------------
                          RDMA_Read BW Test
       Dual-port       : OFF        Device         : mlx5_bond_0
       Number of qps   : 20        Transport type : IB
       Connection type : RC        Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       CQ Moderation   : 100
       Mtu             : 1024[B]
       Link type       : Ethernet
       GID index       : 3
       Outstand reads  : 16
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0000 QPN 0x11ca PSN 0x6302b0 OUT 0x10 RKey 0x17fddc VAddr 0x007f88e1e5d000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
       local address: LID 0000 QPN 0x11cb PSN 0x99aeda OUT 0x10 RKey 0x17fddc VAddr 0x007f88e265d000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
       local address: LID 0000 QPN 0x11cc PSN 0xf0d01c OUT 0x10 RKey 0x17fddc VAddr 0x007f88e2e5d000
       ...
        remote address: LID 0000 QPN 0x11dd PSN 0x8efe92 OUT 0x10 RKey 0x17fddc VAddr 0x007f672004b000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
       8388608    20000            165.65             165.63            0.002468
      ---------------------------------------------------------------------------------------
    • Client-side command
      ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0 <server_ip>  # replace <server_ip> with the server's IP address

      Sample output:

      ---------------------------------------------------------------------------------------
                          RDMA_Read BW Test
       Dual-port       : OFF        Device         : mlx5_bond_0
       Number of qps   : 20        Transport type : IB
       Connection type : RC        Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       TX depth        : 128
       CQ Moderation   : 100
       Mtu             : 1024[B]
       Link type       : Ethernet
       GID index       : 3
       Outstand reads  : 16
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0000 QPN 0x11ca PSN 0x787f05 OUT 0x10 RKey 0x17fddc VAddr 0x007f671684b000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
       local address: LID 0000 QPN 0x11cb PSN 0x467042 OUT 0x10 RKey 0x17fddc VAddr 0x007f671704b000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:45:14
       local address: LID 0000 QPN 0x11cc PSN 0xac262e OUT 0x10 RKey 0x17fddc VAddr 0x007f671784b000
       ...
        remote address: LID 0000 QPN 0x11dd PSN 0xeb1c3f OUT 0x10 RKey 0x17fddc VAddr 0x007f88eb65d000
       GID: 00:00:00:00:00:00:00:00:00:00:255:255:200:00:46:14
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
      Conflicting CPU frequency values detected: 800.000000 != 3177.498000. CPU Frequency is not max.
       2          20000           0.058511            0.058226            3.639132
      Conflicting CPU frequency values detected: 799.996000 != 3384.422000. CPU Frequency is not max.
      ...
      Conflicting CPU frequency values detected: 800.000000 != 3166.731000. CPU Frequency is not max.
       4194304    20000            165.55             165.55            0.004934
      Conflicting CPU frequency values detected: 800.000000 != 2967.226000. CPU Frequency is not max.
       8388608    20000            165.65             165.63            0.002468
      ---------------------------------------------------------------------------------------
    While the preceding commands run, you can use the rdma_monitor -s -t -G command in another console to observe the actual bandwidth on each port of the corresponding NIC.

    Sample output:

    ------
    2022-2-18 09:48:59 CST
    tx_rate: 81.874 (40.923/40.951)
    rx_rate: 0.092 (0.055/0.037)
    tx_pause: 0 (0/0)
    rx_pause: 0 (0/0)
    tx_pause_duration: 0 (0/0)
    rx_pause_duration: 0 (0/0)
    np_cnp_sent: 0
    rp_cnp_handled: 4632
    num_of_qp: 22
    np_ecn_marked: 0
    rp_cnp_ignored: 0
    out_of_buffer: 0
    out_of_seq: 0
    packet_seq_err: 0
    tx_rate_prio0: 0.000 (0.000/0.000)
    rx_rate_prio0: 0.000 (0.000/0.000)
    tcp_segs_retrans: 0
    tcp_retrans_rate: 0
    cpu_usage: 0.35%
    free_mem: 1049633300 kB
    
    ------
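
    The following sketch automates the server and client steps above from the client host. It is a minimal sketch, not part of the product tooling: it assumes passwordless SSH to the server as root, perftest installed on both hosts, and SERVER_IP is a placeholder you must replace.

    #!/bin/sh
    SERVER_IP=192.0.2.10   # placeholder: replace with the server's bond IP address
    # Start ib_read_bw in listen mode on the server (assumes passwordless SSH as root).
    ssh root@"$SERVER_IP" "nohup ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0 >/tmp/ib_server.log 2>&1 &"
    sleep 2                # give the server side a moment to start listening
    # Run the client side against the server.
    ib_read_bw -a -q 20 --report_gbits -d mlx5_bond_0 "$SERVER_IP"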

nccl-tests example

To test and verify the real-world performance of RDMA-equipped instance types in applications, the following example uses nccl-tests to show how to accelerate your applications with RDMA. For more information about nccl-tests, see nccl-tests.

#!/bin/sh
# The operating system used in this example is Alibaba Cloud Linux 2
# Install Open MPI and the compiler
wget https://ali-perseus-release.oss-cn-huhehaote.aliyuncs.com/openmpi-4.0.3-1.x86_64.rpm
rpm -ivh --force openmpi-4.0.3-1.x86_64.rpm --nodeps
yum install -y gcc-c++

# Add the following to ~/.bashrc
export PATH=/usr/local/cuda-11.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/lib:/usr/local/lib/openmpi:/usr/local/cuda-11.0/lib64:$LD_LIBRARY_PATH

# Download the test code and build it
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests/
make MPI=1 CUDA_HOME=/usr/local/cuda

# Replace host1 and host2 with the IP addresses of your hosts
mpirun --allow-run-as-root -np 16 -npernode 8 -H {host1},{host2}  \
  --bind-to none \
  -mca btl_tcp_if_include bond0 \
  -x PATH \
  -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -x NCCL_SOCKET_IFNAME=bond0 \
  -x NCCL_IB_HCA=mlx5 \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_NSOCKS_PERTHREAD=8 \
  -x NCCL_SOCKET_NTHREADS=8 \
  -x NCCL_IB_GID_INDEX=3 \
  -x NCCL_DEBUG_SUBSYS=NET,GRAPH \
  -x NCCL_IB_QPS_PER_CONNECTION=4 \
  ./build/all_reduce_perf -b 4M -e 4M -f 2 -g 1 -t 1 -n 20

Sample output:

# Sample output
# nThread 1 nGpus 1 minBytes 4194304 maxBytes 4194304 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  57655 on iZ2ze58t*****3vnehjdZ device  0 [0x54] NVIDIA A100-SXM-80GB
#   Rank  1 Pid  57656 on iZ2ze58t*****3vnehjdZ device  1 [0x5a] NVIDIA A100-SXM-80GB
#   Rank  2 Pid  57657 on iZ2ze58t*****3vnehjdZ device  2 [0x6b] NVIDIA A100-SXM-80GB
#   Rank  3 Pid  57658 on iZ2ze58t*****3vnehjdZ device  3 [0x70] NVIDIA A100-SXM-80GB
#   Rank  4 Pid  57659 on iZ2ze58t*****3vnehjdZ device  4 [0xbe] NVIDIA A100-SXM-80GB
#   Rank  5 Pid  57660 on iZ2ze58t*****3vnehjdZ device  5 [0xc3] NVIDIA A100-SXM-80GB
#   Rank  6 Pid  57661 on iZ2ze58t*****3vnehjdZ device  6 [0xda] NVIDIA A100-SXM-80GB
#   Rank  7 Pid  57662 on iZ2ze58t*****3vnehjdZ device  7 [0xe0] NVIDIA A100-SXM-80GB
#   Rank  8 Pid  58927 on iZ2ze58t*****3vnehjeZ device  0 [0x54] NVIDIA A100-SXM-80GB
#   Rank  9 Pid  58928 on iZ2ze58t*****3vnehjeZ device  1 [0x5a] NVIDIA A100-SXM-80GB
#   Rank 10 Pid  58929 on iZ2ze58t*****3vnehjeZ device  2 [0x6b] NVIDIA A100-SXM-80GB
#   Rank 11 Pid  58930 on iZ2ze58t*****3vnehjeZ device  3 [0x70] NVIDIA A100-SXM-80GB
#   Rank 12 Pid  58931 on iZ2ze58t*****3vnehjeZ device  4 [0xbe] NVIDIA A100-SXM-80GB
#   Rank 13 Pid  58932 on iZ2ze58t*****3vnehjeZ device  5 [0xc3] NVIDIA A100-SXM-80GB
#   Rank 14 Pid  58933 on iZ2ze58t*****3vnehjeZ device  6 [0xda] NVIDIA A100-SXM-80GB
#   Rank 15 Pid  58934 on iZ2ze58t*****3vnehjeZ device  7 [0xe0] NVIDIA A100-SXM-80GB
iZ2ze6t9*****ssopZ:57655:57655 [0] NCCL INFO NCCL_SOCKET_IFNAME set to bond0
...
iZ2ze58t*****3vnehjeZ:58929:59248 [2] NCCL INFO NET/IB: Dev 1 Port 1 qpn 4573 mtu 3 GID 3 (0/22D00C8FFFF0000)
iZ2ze58t*****3vnehjdZ:57657:58004 [2] NCCL INFO NET/IB: Dev 1 Port 1 qpn 4573 mtu 3 GID 3 (0/22E00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO Channel 04 : 0[54000] -> 8[54000] [receive] via NET/IB/0/GDRDMA
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 54000 / HCA 0 (distance 4 <= 4), read 1
iZ2ze58t*****3vnehjeZ:58931:59227 [4] NCCL INFO NET/IB: Dev 2 Port 1 qpn 4573 mtu 3 GID 3 (0/62D00C8FFFF0000)
iZ2ze58t*****3vnehjdZ:57659:58012 [4] NCCL INFO NET/IB: Dev 2 Port 1 qpn 4573 mtu 3 GID 3 (0/62E00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58933:59183 [6] NCCL INFO NET/IB: Dev 3 Port 1 qpn 4573 mtu 3 GID 3 (0/A2D00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO Channel 00 : 8[54000] -> 0[54000] [send] via NET/IB/0/GDRDMA
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO GPU Direct RDMA Enabled for GPU 54000 / HCA 0 (distance 4 <= 4), read 1
iZ2ze58t*****3vnehjdZ:57661:58000 [6] NCCL INFO NET/IB: Dev 3 Port 1 qpn 4573 mtu 3 GID 3 (0/A2E00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO Channel 04 : 8[54000] -> 0[54000] [send] via NET/IB/0/GDRDMA
iZ2ze58t*****3vnehjdZ:57655:57848 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 4660 mtu 3 GID 3 (0/E2E00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 4660 mtu 3 GID 3 (0/E2D00C8FFFF0000)
iZ2ze58t*****3vnehjeZ:58927:59225 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 4661 mtu 3 GID 3 (0/E2D00C8FFFF0000)
iZ2ze58t*****3vnehjdZ:57655:57848 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 4661 mtu 3 GID 3 (0/E2E00C8FFFF0000)
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     4194304       1048576     float     sum    241.5   17.37   32.56  4e-07    235.2   17.84   33.44  4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 33.002
#
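
Before the two-node run above, you can optionally sanity-check NCCL on a single node. The sketch below runs all_reduce_perf across the eight local GPUs in one process, without MPI or RDMA; the parameters mirror the multi-node example and are only a suggestion.

# Optional single-node sanity check, run from the nccl-tests directory.
# -g 8 uses eight GPUs in the local process; no mpirun is needed.
./build/all_reduce_perf -b 4M -e 4M -f 2 -g 8 -n 20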

FAQ

  • Issue 1

    When you run the rdma_qos_check -V verification command, the system reports: drv_fw_ver eth1: 5.2-2.2.3/22.29.1016 fail

    Solution:

    This error indicates that the Mellanox NIC firmware has not been updated. You can:
    • On Alibaba Cloud Linux 2 or CentOS 8.3, run the /usr/share/nic-drivers-mellanox-rdma/sources/alifwmanager-22292302 --force --yes command to flash the NIC firmware.
    • On Debian-based systems, download the firmware flashing tool (download link), and then run ./alifwmanager-22292302 --force --yes to flash the NIC firmware.
  • Issue 2

    When you run the rdma_qos_check -V verification command, the system reports: * roce_ver : 0 fail

    Solution:

    This error indicates that kernel modules such as configfs and rdma_cm are missing. Run the modprobe mlx5_ib && modprobe configfs && modprobe rdma_cm command to load them. To persist these modules across reboots, see the sketch after the last issue.

  • Issue 3

    On Debian, when you run the systemctl start networking command to start the network service, the system reports that the bond cannot be found.

    Solution:

    This error may occur because the mlx5_ib kernel module is not loaded. Run modprobe mlx5_ib to load it.

  • Issue 4

    When you run the rdma_qos_check -V verification command or the ib_read_bw bandwidth command, the system reports: ERROR: RoCE tos isn't correct on mlx5_bond_3

    Solution:

    Run the rdma_qos_init command to initialize the network.

  • Issue 5

    On Alibaba Cloud Linux 2, after the server restarts, running the rdma_qos_check -V verification command reports: cm_tos mlx5_bond_1: 0 fail

    Solution:

    Run the rdma_qos_init command to initialize the network.

  • Issue 6

    On CentOS 8.3, after the server restarts, running the rdma_qos_check -V verification command reports: trust_mod eth1: pcp fail

    Solution:

    Run the rdma_qos_init command to initialize the network.

  • Issue 7

    The RDMA network interface bond* fails to obtain a bond IP address.

    Solution:

    Run ifdown bond* followed by ifup bond* to obtain the bond IP address.
    Note: Replace * with the index of the corresponding network interface.
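
For Issue 2 and Issue 3, the modprobe commands last only until the next reboot. The following sketch persists the modules across reboots. It is an assumption-laden sketch: the file path relies on a systemd-based distribution (Alibaba Cloud Linux 2, CentOS 8.3, and recent Debian qualify), and the module list simply mirrors the modprobe command in Issue 2.

#!/bin/sh
# Load the RDMA-related modules at every boot via systemd-modules-load
# (assumes a systemd-based distribution; module list mirrors Issue 2).
cat <<'EOF' > /etc/modules-load.d/rdma.conf
mlx5_ib
configfs
rdma_cm
EOF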