PCCL: Multi-Card P2P Bandwidth and Latency Test Guide (v2.1)


1. Overview

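This guide describes how to use "p2pBandwidthLatencyTest", the P2P bandwidth and latency test tool in pccl_tools. The sections below cover its typical modes under various multi-card and multi-machine configurations. The figures shown are for reference only; actual performance for a given PPU SDK release depends on the specific machine environment and bandwidth configuration.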

Starting with the PPU SDK v2.1.0 release, pccl_tools is removed from the SDK, released independently, and renamed comm_tools. The P2P test tool is included in comm_tools.

Important

The comm_tools package released independently with PPU SDK v2.1.0 is not compatible with earlier versions and cannot run on an earlier PPU SDK.

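Standalone release link: https://art-pub.eng.t-head.cn/artifactory/generic-local/SAIL/v2.1.0/COMM . Packages are named comm_tools_<hggcrt_version>_<os_version>.tar.gz; download the one that matches the OS version and hggc runtime version you need. The P2P test tool comes in two builds: multi_process, compiled with MPI, and single_process, compiled without MPI. The former depends on MPI and supports multi-machine tests; the latter is for single-machine tests.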

├── multi_process
│   ├── p2pBandwidthLatencyTest
└── single_process
    ├── p2pBandwidthLatencyTest

2. Parameters

Run p2pBandwidthLatencyTest --help to list all parameters:

This is bandwidth/latency tests of GPU pairs using P2P.
Usage:
  p2pBandwidthLatencyTest [OPTION...]

  -p, --perf arg                   P2P perf type. uniBw: unidirectional bandwidth; biBw: bidirectional bandwidth; latency: latency. (default: uniBw)
  -d, --p2p_direction arg          P2P transmission direction. write: root device write to peer device; read: root device read from peer device. (default: write)
  -n, --num_elems arg              Number of transmission integers. (default: 10000000)
  -t, --p2p_type arg               P2P transmission type. CE: command level; SM: kernel level. (default: SM)
  -k, --kernel arg                 P2P transmission kernel type. combined with p2pType=SM. typical cacheCopy/ncCopy/bulkCopy, or normal format: ppu1.0: <ld_cp>-<st_cp>-<rmt_cp>, ppu1.5: <ld_cp>-<st_cp>-<rmt_cp>-<rtmd> (default: bulkCopy)
  -b, --num_thread_blocks arg      Number of thread blocks to be used in kernel copy. (default: -1)
  -s, --block_size arg             Thread block size to be used in kernel copy. (default: -1)
  -g, --num_gpus arg               Total number of devices in single process test. it will always be '1' in multi-process test. (default: -1)
  -l, --dev_list arg               Device list for single process p2p bandwidth test. List length should equal to num_gpus. (default: -1,)
  -m, --is_multi_proc arg          Multiple processes with single device per process(0 or 1). (default: 0)
  -a, --disable_p2p arg            Disable kernel level p2p transmission(0 or 1). (default: 0)
  -e, --enable_local_D2D arg       Enable D2D copy on one device(0 or 1). (default: 0)
  -j, --eval_P2P_perf_tri arg      Eval p2p copy perf on perf triangle way(0 or 1). (default: 0)
  -v, --verify arg                 Whether to enable data mismatch verification. No verify if not set(0 or 1). (default: 0)
  -u, --display_ur arg             Whether to show bandwidth matrix in terms of utilization ratio(0 or 1). (default: 0)
  -w, --warmup arg                 Whethre to do command engine bi-bw warmup before p2p perf test(0 or 1). (default: 1)
  -i, --num_iterations arg         Number of iterations for random test. (default: 1)
  -r, --repeat arg                 Number of repetitions of p2p transmission. (default: 15)
  -y, --min_num_elems arg          Minimum number of transmission integers in random test. (default: 8)
  -x, --max_num_elems arg          Maximum number of transmission integers in random test. (default: 8)
      --buffer_offset arg          P2P transmission offset from start address of buffer. (default: 0)
  -c, --bench_type arg             Bench output type. all: evaluate all kinds of perf kinds; p2p_write: p2p bench test with write type; p2p_read: p2p bench test with read type. (default: none)
  -f, --max_display_num_elems arg  Limit the maximum number of elements displayed. (default: 32)
      --disable_vm arg             Disable virtual mode(0 or 1). (default: 0)
      --is_single_die arg          Single die. (default: 0)
      --use_fake_driver arg        run on fake driver. (default: 0)
      --ignore_icn_conn_check arg  ignore icn connectivity check. (default: 0)
  -h, --help                       Print usage. (default: false)

2.1. Common parameters

performance type

  • perf=uniBw unidirectional bandwidth.

  • perf=biBw bidirectional bandwidth.

  • perf=latency unidirectional latency of transferring 4 integers.

p2p transmission direction

  • p2p_direction=write root device write to peer device.

  • p2p_direction=read root device read from peer device.

p2p transmission size

  • num_elems=number of integers.

p2p transmission way

  • p2p_type=CE through cudaMemcpyPeerAsync.

  • p2p_type=SM through kernels.

p2p transmission kernel type if p2p_type=SM

  • kernel=cacheCopy normal load/store.

  • kernel=ncCopy normal v4 load/store.

  • kernel=bulkCopy bulk v4 load/store.
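
The common parameters above combine into a single command line. The following is a minimal sketch of a helper that assembles one; the flag names and allowed values come from the --help output, while the helper itself (build_p2p_command) is hypothetical and not part of comm_tools.

```python
import shlex

# Allowed values copied from the --help text; other options are passed through unchecked.
ALLOWED = {
    "perf": {"uniBw", "biBw", "latency"},
    "p2p_direction": {"write", "read"},
    "p2p_type": {"CE", "SM"},
}


def build_p2p_command(binary="p2pBandwidthLatencyTest", **options):
    """Return the argv list for one run (hypothetical convenience wrapper)."""
    argv = [binary]
    for name, value in options.items():
        allowed = ALLOWED.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
        argv.append(f"--{name}={value}")
    return argv


cmd = build_p2p_command(
    num_gpus=8,
    num_elems=536870912,
    p2p_type="SM",
    kernel="cacheCopy",
    perf="uniBw",
    p2p_direction="write",
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run as-is; printing it with shlex.join gives a string that can be pasted into a shell.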

2.2. Optional parameters

  • dev_list: when given, only P2P transmission between the PPUs listed in dev_list is tested. This parameter must be used together with num_gpus, and the length of dev_list must equal num_gpus.

  • display_ur: when set, the P2P bandwidth test prints a bandwidth utilization ratio matrix in addition to the raw bandwidth matrix.

  • disable_p2p: when set, P2P access between PPUs is turned off; unidirectional bandwidth and latency tests then support WRITE mode only.
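
Because dev_list and num_gpus must agree, one way to avoid a mismatch is to derive both arguments from the same list. The sketch below assumes dev_list is passed as a comma-separated list of device ids (the help text shows a default of "-1,", but the exact format is not spelled out here); the function name dev_list_args is hypothetical.

```python
def dev_list_args(devices):
    """Build matching --num_gpus and --dev_list arguments for a subset of PPUs."""
    devices = list(devices)
    if not devices:
        raise ValueError("dev_list must name at least one device")
    # Assumption: device ids are joined with commas on the command line.
    joined = ",".join(str(dev) for dev in devices)
    # num_gpus is derived from the list, so the two values cannot disagree.
    return [f"--num_gpus={len(devices)}", f"--dev_list={joined}"]


print(dev_list_args([0, 1, 4, 5]))
```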

In addition, the P2P bandwidth and latency test tool provides the bench_type parameter for consolidated batch output.

  • bench_type=all: outputs all type combinations under the enableP2P condition.

  • bench_type=p2p_write: outputs CE WRITE bandwidth and latency under both the enableP2p and disableP2p conditions.

  • bench_type=p2p_read: outputs (part of) the CE READ bandwidth and latency under both the enableP2p and disableP2p conditions.

3. Examples

3.1 Single machine, 8 cards

3.1.1 Actual topology

         PPU0    PPU1    PPU2    PPU3    PPU4    PPU5    PPU6    PPU7    CPU Affinity    NUMA Affinity
 PPU0    X       ICN1    ICN1    ICN1    SYS     SYS     SYS     ICN1    0-47,96-143     0
 PPU1    ICN1    X       ICN1    ICN1    SYS     SYS     ICN1    SYS     0-47,96-143     0
 PPU2    ICN1    ICN1    X       ICN1    SYS     ICN1    SYS     SYS     0-47,96-143     0
 PPU3    ICN1    ICN1    ICN1    X       ICN1    SYS     SYS     SYS     0-47,96-143     0
 PPU4    SYS     SYS     SYS     ICN1    X       ICN1    ICN1    ICN1    48-95,144-191   1
 PPU5    SYS     SYS     ICN1    SYS     ICN1    X       ICN1    ICN1    48-95,144-191   1
 PPU6    SYS     ICN1    SYS     SYS     ICN1    ICN1    X       ICN1    48-95,144-191   1
 PPU7    ICN1    SYS     SYS     SYS     ICN1    ICN1    ICN1    X       48-95,144-191   1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  ICN# = Connection traversing a bonded set of # ICN links
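
To read the topology table programmatically, the rows can be split on whitespace and filtered by link type. The sketch below copies the eight rows shown above and lists, for each PPU, the peers reached over ICN links; the function name icn_neighbors is hypothetical.

```python
# Rows copied from the topology table above (header line omitted).
TOPOLOGY = """\
PPU0    X       ICN1    ICN1    ICN1    SYS     SYS     SYS     ICN1    0-47,96-143     0
PPU1    ICN1    X       ICN1    ICN1    SYS     SYS     ICN1    SYS     0-47,96-143     0
PPU2    ICN1    ICN1    X       ICN1    SYS     ICN1    SYS     SYS     0-47,96-143     0
PPU3    ICN1    ICN1    ICN1    X       ICN1    SYS     SYS     SYS     0-47,96-143     0
PPU4    SYS     SYS     SYS     ICN1    X       ICN1    ICN1    ICN1    48-95,144-191   1
PPU5    SYS     SYS     ICN1    SYS     ICN1    X       ICN1    ICN1    48-95,144-191   1
PPU6    SYS     ICN1    SYS     SYS     ICN1    ICN1    X       ICN1    48-95,144-191   1
PPU7    ICN1    SYS     SYS     SYS     ICN1    ICN1    ICN1    X       48-95,144-191   1
"""


def icn_neighbors(text):
    """Map each PPU name to the list of peers whose link type starts with ICN."""
    rows = [line.split() for line in text.strip().splitlines()]
    names = [row[0] for row in rows]
    neighbors = {}
    for row in rows:
        # The first len(names) columns after the name are link types;
        # the trailing CPU/NUMA affinity columns are ignored.
        links = row[1:1 + len(names)]
        neighbors[row[0]] = [peer for peer, link in zip(names, links) if link.startswith("ICN")]
    return neighbors


for ppu, peers in icn_neighbors(TOPOLOGY).items():
    print(ppu, "->", ", ".join(peers))
```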

3.1.2 p2p min path

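Every P2P bandwidth test first prints the topology connectivity matrix. An element value of 0 means there is no icnlink path (direct/route) between the two PPUs. In multi-process tests, elements can only be 0 or 1, where 1 means an icnlink path exists between the two PPUs. In single-process tests, elements may be greater than 1 and give the min path between the two PPUs.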

P2P Connectivity & minPath Matrix
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0         1         1         1         2         2         1         2         
DEV1      1         0         1         1         2         2         2         1
DEV2      1         1         0         1         1         2         2         2         
DEV3      1         1         1         0         2         1         2         2
DEV4      2         2         1         2         0         1         1         1         
DEV5      2         2         2         1         1         0         1         1
DEV6      1         2         2         2         1         1         0         1         
DEV7      2         1         2         2         1         1         1         0
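
As a small illustration of how the minPath matrix can be consumed, the sketch below parses the matrix printed above and groups device pairs by min path value. The helper names (parse_matrix, pairs_by_hops) are hypothetical.

```python
from collections import defaultdict

# Values copied from the minPath matrix above.
MIN_PATH = """\
      DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0  0    1    1    1    2    2    1    2
DEV1  1    0    1    1    2    2    2    1
DEV2  1    1    0    1    1    2    2    2
DEV3  1    1    1    0    2    1    2    2
DEV4  2    2    1    2    0    1    1    1
DEV5  2    2    2    1    1    0    1    1
DEV6  1    2    2    2    1    1    0    1
DEV7  2    1    2    2    1    1    1    0
"""


def parse_matrix(text):
    """Parse a printed matrix into {row_name: {column_name: int_value}}."""
    lines = text.strip().splitlines()
    names = lines[0].split()
    matrix = {}
    for line in lines[1:]:
        name, *values = line.split()
        matrix[name] = dict(zip(names, (int(v) for v in values)))
    return matrix


def pairs_by_hops(matrix):
    """Group ordered (src, dst) pairs by their min path value, skipping the diagonal."""
    groups = defaultdict(list)
    for src, row in matrix.items():
        for dst, hops in row.items():
            if src != dst:
                groups[hops].append((src, dst))
    return groups


groups = pairs_by_hops(parse_matrix(MIN_PATH))
for hops in sorted(groups):
    print(f"min path {hops}: {len(groups[hops])} ordered pairs")
```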

3.1.3 Bandwidth utilization display

The following example uses the display_ur parameter to print the bandwidth utilization ratio matrix of a P2P bandwidth test (CE unidirectional write):

p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=CE --perf=uniBw --p2p_direction=write --display_ur=1
(EnabledP2P CE WRITE Unidirectional)
P2P Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7      
DEV0       x        47.50     46.34     45.13     90.37     90.35     45.61     92.17     
DEV1      48.22      x        46.03     45.29     93.40     95.59     95.50     46.29     
DEV2      45.76     47.52      x        45.64     47.67     91.48     90.10     94.32     
DEV3      47.55     47.54     47.53      x        95.01     47.53     94.99     94.97
DEV4      92.95     94.84     48.54     94.23      x        46.04     44.93     45.96     
DEV5      95.58     95.39     92.79     45.17     46.41      x        48.24     46.06     
DEV6      46.66     90.65     90.25     90.30     45.24     46.47      x        47.86     
DEV7      94.93     48.54     96.31     95.87     47.56     46.91     45.75      x   
(EnabledP2P CE WRITE Unidirectional) 
P2P Transport Bandwidth Utilization Ratio Matrix (2147483648 Bytes) (%)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7      
DEV0       x        86.44     86.51     86.54     85.21     86.98     86.50     84.28     
DEV1      87.66      x        84.76     88.01     84.86     83.89     86.87     87.01 
DEV3      84.20     88.93     84.41      x        85.03     88.97     85.01     83.72     
DEV4      88.84     87.90     86.82     84.64      x        83.19     83.83     86.27     
DEV5      83.62     83.62     83.47     84.75     86.69      x        88.55     90.04
DEV6      85.84     88.30     85.86     83.96     86.26     87.42      x        84.01     
DEV7      88.63     88.33     86.27     85.30     84.71     83.41     83.06      x        
Total:  31.10 s
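
This guide does not state which reference value the tool divides by. Assuming the utilization ratio is simply measured bandwidth divided by a reference peak for the link, the relation and its inverse can be sketched as follows; both function names are hypothetical, and the implied peak is only an estimate under that assumption.

```python
def utilization_percent(measured_gbps, peak_gbps):
    """Utilization ratio in percent, assuming ratio = measured / peak."""
    return 100.0 * measured_gbps / peak_gbps


def implied_peak_gbps(measured_gbps, utilization_pct):
    """Back out the reference peak implied by a bandwidth/utilization pair."""
    return measured_gbps * 100.0 / utilization_pct


# DEV0 -> DEV1 entries from the two matrices above: 47.50 GB/s and 86.44 %.
peak = implied_peak_gbps(47.50, 86.44)
print(f"implied reference peak: {peak:.2f} GB/s")
```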

3.1.4 Measured data

CE unidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=CE --perf=uniBw --p2p_direction=write
(EnabledP2P CE WRITE Unidirectional)
P2P Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7      
DEV0       x        47.50     46.34     45.13     90.37     90.35     45.61     92.17     
DEV1      48.22      x        46.03     45.29     93.40     95.59     95.50     46.29     
DEV2      45.76     47.52      x        45.64     47.67     91.48     90.10     94.32     
DEV3      47.55     47.54     47.53      x        95.01     47.53     94.99     94.97
DEV4      92.95     94.84     48.54     94.23      x        46.04     44.93     45.96     
DEV5      95.58     95.39     92.79     45.17     46.41      x        48.24     46.06     
DEV6      46.66     90.65     90.25     90.30     45.24     46.47      x        47.86     
DEV7      94.93     48.54     96.31     95.87     47.56     46.91     45.75      x   
Total:  31.10 s
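
The byte count in the matrix header follows from num_elems. The sketch below assumes each transmitted integer occupies 4 bytes, which matches the header above (536870912 elements, 2147483648 bytes); the function name transfer_bytes is hypothetical.

```python
BYTES_PER_ELEM = 4  # assumption: each transmitted integer is 4 bytes wide


def transfer_bytes(num_elems, bidirectional=False):
    """Bytes moved per transfer; bidirectional runs move the payload in both directions."""
    total = num_elems * BYTES_PER_ELEM
    return total * 2 if bidirectional else total


print(transfer_bytes(536870912))        # unidirectional payload in bytes
print(transfer_bytes(536870912, True))  # bidirectional payload in bytes
```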
CE unidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=CE --perf=uniBw --p2p_direction=read
(EnabledP2P CE READ Unidirectional) 
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        47.16     47.69     48.15     48.35     45.06     46.15     47.09     
DEV1      46.81      x        45.82     48.12     44.87     44.89     45.71     46.99     
DEV2      45.52     45.81      x        48.31     46.94     45.16     45.40     44.61     
DEV3      44.79     45.79     45.72      x        46.12     46.07     45.18     47.22     
DEV4      44.72     46.06     45.89     48.70      x        46.61     44.94     46.70     
DEV5      45.56     46.64     47.41     48.18     46.19      x        44.71     44.64     
DEV6      47.01     47.10     45.69     47.50     46.12     46.70      x        47.48     
DEV7      48.18     47.85     46.25     47.17     47.52     46.17     44.87      x      
Total:  39.59 s
CE bidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=CE --perf=biBw --p2p_direction=write
(EnabledP2P CE WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        89.37     88.92     89.14     177.20    177.50    90.48     175.34    
DEV1      91.71      x        89.01     90.52     180.07    173.93    175.32    88.96     
DEV2      89.68     92.75      x        89.45     89.24     174.26    177.67    176.64    
DEV3      89.01     89.34     89.17      x        174.04    89.08     174.11    174.32  
DEV4      182.74    180.31    89.66     173.86     x        89.19     90.68     90.71     
DEV5      177.61    174.12    177.32    88.67     91.84      x        88.70     90.31     
DEV6      88.89     176.75    178.35    176.07    92.75     90.58      x        89.08     
DEV7      179.27    93.35     179.39    178.68    91.19     90.64     89.90      x      
Total:  32.46 s
CE bidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=CE --perf=biBw --p2p_direction=read
(EnabledP2P CE READ Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        91.56     90.07     89.09     89.36     90.06     90.23     92.59     
DEV1      92.95      x        91.08     89.05     89.09     89.15     89.75     90.07     
DEV2      90.88     90.06      x        90.16     92.50     89.24     92.01     88.95     
DEV3      92.48     90.82     92.49      x        89.00     92.23     93.14     91.78  
DEV4      89.56     89.43     92.02     94.55      x        91.01     88.82     90.29     
DEV5      89.34     90.77     89.21     90.73     91.11      x        89.01     89.24     
DEV6      88.70     88.83     89.44     89.90     89.79     89.61      x        90.00     
DEV7      89.01     88.74     89.07     88.96     91.68     92.37     92.33      x     
Total:  40.76 s
CE unidirectional write latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=CE --perf=latency --p2p_direction=write
(EnabledP2P CE WRITE) GPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      1.27      1.27      1.27      1.65      1.65      1.22      1.67      
DEV1      1.28      0.00      1.34      1.34      1.69      1.69      1.69      1.31      
DEV2      1.20      1.21      0.00      1.21      1.20      1.60      1.61      1.61      
DEV3      1.20      1.21      1.21      0.00      1.60      1.19      1.61      1.61   
DEV4      1.67      1.70      1.31      1.69      0.00      1.34      1.34      1.34      
DEV5      1.68      1.69      1.70      1.30      1.34      0.00      1.34      1.34      
DEV6      1.18      1.60      1.61      1.61      1.21      1.21      0.00      1.21      
DEV7      1.57      1.20      1.61      1.61      1.21      1.21      1.22      0.00 
(EnabledP2P CE WRITE) CPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      7.17      6.99      7.18      7.12      7.11      7.15      6.95      
DEV1      7.20      0.00      7.04      7.01      6.95      7.00      7.09      7.16      
DEV2      7.04      7.01      0.00      6.98      6.95      7.01      7.03      7.01      
DEV3      7.18      6.88      7.14      0.00      6.97      7.15      6.88      7.02   
DEV4      7.63      7.76      7.61      7.48      0.00      7.49      7.59      7.56      
DEV5      7.58      7.55      7.61      7.56      7.48      0.00      7.51      7.49      
DEV6      7.73      7.49      7.38      7.52      7.47      7.51      0.00      7.51      
DEV7      7.32      7.23      7.30      7.30      7.24      7.18      7.33      0.00  
Total:   0.51 s
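
One way to read a latency matrix is to bucket its entries by the min path value of each device pair. The sketch below does this for the GPU latency matrix above, using the minPath matrix from section 3.1.2; the function name mean_latency_by_hops is hypothetical.

```python
from statistics import mean

# minPath matrix from section 3.1.2 (rows and columns are DEV0..DEV7).
MIN_PATH = [
    [0, 1, 1, 1, 2, 2, 1, 2],
    [1, 0, 1, 1, 2, 2, 2, 1],
    [1, 1, 0, 1, 1, 2, 2, 2],
    [1, 1, 1, 0, 2, 1, 2, 2],
    [2, 2, 1, 2, 0, 1, 1, 1],
    [2, 2, 2, 1, 1, 0, 1, 1],
    [1, 2, 2, 2, 1, 1, 0, 1],
    [2, 1, 2, 2, 1, 1, 1, 0],
]

# GPU P2P transmission latency (us) from the CE WRITE run above.
GPU_LATENCY_US = [
    [0.00, 1.27, 1.27, 1.27, 1.65, 1.65, 1.22, 1.67],
    [1.28, 0.00, 1.34, 1.34, 1.69, 1.69, 1.69, 1.31],
    [1.20, 1.21, 0.00, 1.21, 1.20, 1.60, 1.61, 1.61],
    [1.20, 1.21, 1.21, 0.00, 1.60, 1.19, 1.61, 1.61],
    [1.67, 1.70, 1.31, 1.69, 0.00, 1.34, 1.34, 1.34],
    [1.68, 1.69, 1.70, 1.30, 1.34, 0.00, 1.34, 1.34],
    [1.18, 1.60, 1.61, 1.61, 1.21, 1.21, 0.00, 1.21],
    [1.57, 1.20, 1.61, 1.61, 1.21, 1.21, 1.22, 0.00],
]


def mean_latency_by_hops(min_path, latency):
    """Average the off-diagonal latency entries for each min path value."""
    buckets = {}
    for i, row in enumerate(min_path):
        for j, hops in enumerate(row):
            if i != j:
                buckets.setdefault(hops, []).append(latency[i][j])
    return {hops: mean(values) for hops, values in sorted(buckets.items())}


means = mean_latency_by_hops(MIN_PATH, GPU_LATENCY_US)
for hops, value in means.items():
    print(f"min path {hops}: mean GPU latency {value:.2f} us")
```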
CE unidirectional read latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=CE --perf=latency --p2p_direction=read
(EnabledP2P CE READ) GPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      1.26      1.25      1.23      1.63      1.64      1.22      1.63      
DEV1      1.23      0.00      1.26      1.25      1.63      1.65      1.66      1.23      
DEV2      1.23      1.26      0.00      1.24      1.22      1.65      1.65      1.63      
DEV3      1.13      1.17      1.17      0.00      1.54      1.15      1.56      1.54    
DEV4      1.50      1.56      1.15      1.55      0.00      1.17      1.17      1.16      
DEV5      1.59      1.65      1.65      1.23      1.25      0.00      1.25      1.25      
DEV6      1.19      1.65      1.65      1.63      1.24      1.26      0.00      1.24      
DEV7      1.51      1.15      1.56      1.54      1.16      1.18      1.17      0.00  
(EnabledP2P CE READ) CPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      7.23      1.57      1.51      1.51      1.60      1.58      1.49      
DEV1      6.99      0.00      1.50      1.53      1.59      1.52      1.50      1.53      
DEV2      6.83      1.51      0.00      1.51      1.47      1.48      1.46      1.53      
DEV3      7.08      1.54      1.57      0.00      1.53      1.55      1.59      1.54  
DEV4      7.61      1.97      1.96      1.95      0.00      1.94      2.03      1.97      
DEV5      7.86      2.05      2.01      2.08      2.00      0.00      2.04      2.04      
DEV6      8.12      1.80      1.84      1.86      1.79      1.81      0.00      1.78      
DEV7      7.65      1.60      1.61      1.62      1.61      1.62      1.62      0.00
Total:   0.49 s
SM (cache copy) unidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=uniBw --p2p_direction=write
(EnabledP2P SM WRITE Unidirectional) 
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        33.46     33.46     33.47     66.60     66.60     33.75     66.35     
DEV1      33.46      x        33.40     33.41     66.45     66.39     66.37     33.75     
DEV2      33.42     33.46      x        33.41     33.77     66.39     66.43     66.41     
DEV3      33.40     33.42     33.46      x        66.41     33.77     66.38     66.39  
DEV4      66.36     66.45     33.81     66.59      x        33.46     33.40     33.40     
DEV5      66.39     66.33     66.43     33.76     33.46      x        33.41     33.39     
DEV6      33.76     66.35     66.45     66.41     33.40     33.46      x        33.39     
DEV7      66.44     33.76     66.35     66.43     33.43     33.41     33.39      x     
Total:  43.08 s
SM (cache copy) unidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=uniBw --p2p_direction=read
(EnabledP2P SM READ Unidirectional) 
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        36.73     36.73     36.73     73.41     73.39     36.89     73.06     
DEV1      36.73      x        36.64     36.66     73.06     73.12     73.15     36.82     
DEV2      36.64     36.73      x        36.66     36.80     73.06     73.14     73.07     
DEV3      36.67     36.64     36.73      x        73.06     36.82     73.13     73.08
DEV4      73.13     73.40     36.89     73.40      x        36.73     36.73     36.65     
DEV5      73.04     73.03     73.18     36.83     36.73      x        36.65     36.65     
DEV6      36.83     73.17     73.06     73.09     36.65     36.73      x        36.65     
DEV7      73.19     36.82     73.07     73.11     36.65     36.64     36.65      x     
Total:  39.42 s
SM (cache copy) bidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=biBw --p2p_direction=write
(EnabledP2P SM WRITE Bidirectional) 
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        62.09     62.09     62.10     123.73    123.73    62.46     123.69    
DEV1      62.08      x        62.09     62.10     123.68    123.71    123.68    62.45     
DEV2      62.09     62.09      x        62.09     62.44     123.71    123.72    123.70    
DEV3      62.12     62.11     62.08      x        123.71    62.43     123.70    123.68 
DEV4      123.72    123.71    62.44     123.72     x        62.10     62.11     62.10     
DEV5      123.72    123.70    123.71    62.46     62.11      x        62.10     62.09     
DEV6      62.46     123.69    123.70    123.72    62.07     62.11      x        62.07     
DEV7      123.68    62.45     123.70    123.68    62.09     62.08     62.08      x     
Total:  46.50 s
SM (cache copy) bidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=biBw --p2p_direction=read
(EnabledP2P SM READ Bidirectional) 
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        48.61     48.62     48.61     97.01     96.98     48.90     96.97     
DEV1      48.61      x        48.57     48.58     96.94     96.90     96.90     48.87     
DEV2      48.58     48.61      x        48.58     48.88     96.94     96.92     96.94     
DEV3      48.58     48.57     48.61      x        96.91     48.88     96.90     96.90  
DEV4      96.94     96.94     48.88     97.02      x        48.57     48.58     48.59     
DEV5      96.94     96.94     96.93     48.88     48.61      x        48.58     48.57     
DEV6      48.88     96.89     96.90     96.95     48.58     48.57      x        48.57     
DEV7      96.95     48.87     96.90     96.90     48.58     48.56     48.56      x   
Total:  59.17 s
SM (cache copy) unidirectional write latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=cacheCopy --perf=latency --p2p_direction=write
(EnabledP2P SM WRITE) GPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      2.10      2.10      2.10      2.56      2.56      2.10      2.57      
DEV1      2.01      0.00      1.95      1.95      2.43      2.43      2.43      1.95      
DEV2      2.12      2.11      0.00      2.11      2.11      2.58      2.58      2.58      
DEV3      2.12      2.11      2.11      0.00      2.58      2.11      2.58      2.58  
DEV4      2.44      2.28      1.95      2.41      0.00      1.95      1.95      1.95      
DEV5      2.44      2.28      2.31      1.96      1.95      0.00      1.95      1.95      
DEV6      2.12      2.35      2.40      2.50      2.11      2.11      0.00      2.11      
DEV7      2.58      2.11      2.40      2.52      2.11      2.11      2.11      0.00  
(EnabledP2P SM WRITE) CPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      4.22      1.54      1.56      1.55      1.63      1.53      1.56      
DEV1      4.25      0.00      1.56      1.61      1.54      1.54      1.54      1.54      
DEV2      4.29      1.66      0.00      1.58      1.56      1.63      1.60      1.60      
DEV3      4.19      1.56      1.57      0.00      1.58      1.57      1.59      1.56  
DEV4      4.69      2.15      2.29      2.17      0.00      2.19      2.20      2.16      
DEV5      4.68      2.14      2.14      2.12      2.14      0.00      2.14      2.21      
DEV6      4.30      1.91      1.89      1.90      1.96      1.91      0.00      1.90      
DEV7      4.29      1.91      1.85      1.86      1.85      1.89      1.84      0.00  
Total:   0.49 s
SM (cache copy) unidirectional read latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=cacheCopy --perf=latency --p2p_direction=read
(EnabledP2P SM READ) GPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      1.86      1.86      1.94      2.23      2.09      1.79      2.29      
DEV1      1.66      0.00      1.88      2.09      2.11      2.12      2.12      1.87      
DEV2      1.78      1.79      0.00      1.78      1.78      2.23      2.16      2.15      
DEV3      1.78      1.79      1.79      0.00      2.12      1.78      2.11      2.15 
DEV4      2.12      2.34      1.89      2.11      0.00      1.80      1.87      1.88      
DEV5      2.11      2.34      2.34      1.73      1.84      0.00      1.87      1.87      
DEV6      1.78      2.11      2.16      2.14      1.77      1.78      0.00      1.78      
DEV7      2.12      1.79      2.14      2.15      1.78      1.78      1.79      0.00  
(EnabledP2P SM READ) CPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      4.19      1.54      1.54      1.56      1.55      1.53      1.54      
DEV1      4.27      0.00      1.63      1.62      1.66      1.61      1.63      1.62      
DEV2      4.13      1.60      0.00      1.57      1.59      1.56      1.57      1.57      
DEV3      4.16      1.60      1.58      0.00      1.58      1.59      1.67      1.59  
DEV4      4.64      2.17      2.16      2.25      0.00      2.20      2.15      2.14      
DEV5      4.60      2.10      2.09      2.19      2.12      0.00      2.09      2.10      
DEV6      4.24      1.86      1.89      1.84      1.87      1.89      0.00      1.85      
DEV7      4.26      1.89      1.95      1.98      1.91      1.96      1.98      0.00  
Total:   0.48 s
SM (non-cache copy) unidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=uniBw --p2p_direction=write
(EnabledP2P SM WRITE Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        33.46     33.46     33.47     66.74     66.73     33.75     66.50     
DEV1      33.46      x        33.41     33.41     66.53     66.51     66.59     33.74     
DEV2      33.43     33.46      x        33.40     33.76     66.57     66.57     66.47     
DEV3      33.42     33.41     33.46      x        66.47     33.76     66.55     66.49 
DEV4      66.49     66.57     33.81     66.73      x        33.46     33.40     33.41     
DEV5      66.52     66.54     66.47     33.75     33.46      x        33.40     33.40     
DEV6      33.74     66.46     66.50     66.50     33.41     33.46      x        33.40     
DEV7      66.55     33.76     66.55     66.46     33.41     33.41     33.39      x      
Total:  43.07 s
SM (non-cache copy) unidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=uniBw --p2p_direction=read
(EnabledP2P SM READ Unidirectional) 
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        36.74     36.74     36.74     73.44     73.44     36.90     73.13     
DEV1      36.73      x        36.68     36.68     73.15     73.11     73.15     36.81     
DEV2      36.66     36.74      x        36.66     36.83     73.14     73.19     73.23     
DEV3      36.66     36.67     36.74      x        73.13     36.81     73.09     73.17  
DEV4      73.09     73.43     36.89     73.44      x        36.74     36.74     36.66     
DEV5      73.09     73.07     73.16     36.84     36.74      x        36.68     36.68     
DEV6      36.82     73.09     73.16     73.19     36.67     36.74      x        36.67     
DEV7      73.14     36.80     73.08     73.18     36.67     36.66     36.66      x    
Total:  39.30 s
SM (non-cache copy) bidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=biBw --p2p_direction=write
(EnabledP2P SM WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        62.09     62.10     62.09     123.98    123.98    62.48     123.95    
DEV1      62.08      x        62.09     62.09     123.91    123.93    123.90    62.45     
DEV2      62.08     62.08      x        62.08     62.45     123.93    123.96    123.94    
DEV3      62.10     62.08     62.09      x        123.94    62.43     123.97    123.95  
DEV4      123.96    123.94    62.45     123.96     x        62.09     62.10     62.10     
DEV5      123.96    123.96    123.95    62.44     62.10      x        62.09     62.07     
DEV6      62.46     123.92    123.96    123.94    62.10     62.10      x        62.09     
DEV7      123.95    62.43     123.94    123.95    62.10     62.08     62.09      x    
Total:  46.51 s
SM (non-cache copy) bidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=biBw --p2p_direction=read
(EnabledP2P SM READ Bidirectional) 
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        48.62     48.63     48.62     97.01     97.05     48.93     97.04     
DEV1      48.62      x        48.58     48.60     96.98     96.95     96.95     48.90     
DEV2      48.59     48.62      x        48.59     48.91     96.92     96.96     96.97     
DEV3      48.60     48.59     48.62      x        96.98     48.91     96.96     96.95  
DEV4      96.98     96.94     48.90     97.07      x        48.59     48.59     48.60     
DEV5      96.96     96.92     96.95     48.90     48.62      x        48.59     48.58     
DEV6      48.92     96.97     96.95     96.94     48.59     48.58      x        48.58     
DEV7      96.93     48.91     96.92     96.92     48.59     48.59     48.58      x      
Total:  59.13 s
SM (non-cache copy) unidirectional write latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=ncCopy --perf=latency --p2p_direction=write
(EnabledP2P SM WRITE) GPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      1.87      1.83      1.87      2.24      2.24      1.65      2.16      
DEV1      1.80      0.00      1.79      1.79      2.03      1.96      1.95      1.53      
DEV2      1.69      1.71      0.00      1.73      1.90      2.24      2.15      2.12      
DEV3      1.68      1.66      1.73      0.00      2.24      1.78      2.14      2.27 
DEV4      2.16      2.27      1.78      2.02      0.00      1.55      1.55      1.62      
DEV5      2.16      2.25      2.11      1.54      1.55      0.00      1.55      1.62      
DEV6      1.65      2.11      2.11      2.35      1.91      1.83      0.00      1.75      
DEV7      2.12      1.64      2.11      2.19      1.91      1.78      1.72      0.00   
(EnabledP2P SM WRITE) CPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      4.29      1.82      1.84      1.89      1.87      1.82      1.83      
DEV1      4.27      0.00      1.85      1.89      1.87      1.85      1.85      1.86      
DEV2      4.65      2.21      0.00      2.15      2.12      2.19      2.15      2.14      
DEV3      4.62      2.11      2.17      0.00      2.15      2.14      2.13      2.12 
DEV4      3.78      1.32      1.31      1.32      0.00      1.32      1.31      1.33      
DEV5      4.13      1.52      1.53      1.54      1.56      0.00      1.53      1.50      
DEV6      4.13      1.49      1.56      1.47      1.56      1.55      0.00      1.48      
DEV7      3.86      1.37      1.32      1.31      1.31      1.31      1.30      0.00  
Total:   0.47 s
SM (non-cache copy) unidirectional read latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=ncCopy --perf=latency --p2p_direction=read
(EnabledP2P SM READ) GPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      1.65      1.64      1.63      2.10      2.15      1.65      2.05      
DEV1      1.62      0.00      1.65      1.66      2.11      2.11      2.11      1.64      
DEV2      1.61      1.64      0.00      1.64      1.64      2.11      2.11      2.11      
DEV3      1.40      1.46      1.46      0.00      1.95      1.46      1.95      1.95  
DEV4      1.79      1.95      1.46      1.95      0.00      1.46      1.46      1.47      
DEV5      2.04      2.11      2.11      1.64      1.64      0.00      1.64      1.69      
DEV6      1.52      2.11      2.11      2.11      1.64      1.64      0.00      1.65      
DEV7      1.79      1.46      1.95      1.95      1.46      1.46      1.47      0.00   
(EnabledP2P SM READ) CPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      4.37      1.86      1.85      1.85      1.89      1.86      1.90      
DEV1      4.41      0.00      1.85      1.88      1.85      1.84      1.84      1.84      
DEV2      4.71      2.17      0.00      2.20      2.17      2.21      2.17      2.17      
DEV3      4.71      2.13      2.13      0.00      2.13      2.15      2.26      2.12 
DEV4      3.81      1.31      1.31      1.30      0.00      1.31      1.31      1.30      
DEV5      4.27      1.59      1.60      1.57      1.63      0.00      1.58      1.57      
DEV6      4.14      1.65      1.55      1.51      1.56      1.55      0.00      1.57      
DEV7      3.98      1.37      1.35      1.34      1.37      1.39      1.35      0.00   
Total:   0.49 s
SM (bulk copy) unidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=write
(EnabledP2P SM WRITE Unidirectional) 
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        45.00     45.00     45.00     89.73     89.72     45.45     89.54     
DEV1      45.00      x        45.00     45.00     89.50     89.54     89.72     45.44     
DEV2      45.00     45.00      x        45.00     45.38     89.45     89.27     89.44     
DEV3      45.00     45.00     45.00      x        89.71     45.44     89.39     89.45    
DEV4      89.70     89.72     45.44     89.73      x        45.00     45.00     45.00     
DEV5      89.51     89.53     89.72     45.44     45.00      x        45.00     45.00     
DEV6      45.41     89.25     89.26     89.32     45.00     45.00      x        45.00     
DEV7      89.72     45.44     89.34     89.60     45.00     45.00     45.00      x    
Total:  32.15 s
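The byte count in the matrix title follows from --num_elems. The check below is a small sketch, assuming each element is a 4-byte integer. That size is inferred from 536870912 elements being reported as 2147483648 bytes; the help text does not state it. `transfer_bytes` is a hypothetical helper name.

```python
BYTES_PER_ELEM = 4  # assumption: inferred from 536870912 elems -> 2147483648 bytes


def transfer_bytes(num_elems, bidirectional=False):
    """Bytes moved per measurement; a bidirectional run moves the payload both ways."""
    size = num_elems * BYTES_PER_ELEM
    return size * 2 if bidirectional else size


print(transfer_bytes(536870912))                      # size used in the runs above
print(transfer_bytes(10000000))                       # size for the default --num_elems
print(transfer_bytes(536870912, bidirectional=True))  # "2147483648 *2 Bytes"
```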
SM (bulk copy) unidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=read
(EnabledP2P SM READ Unidirectional) 
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        46.48     46.48     46.49     93.48     93.41     46.62     93.47     
DEV1      46.48      x        46.47     46.49     93.39     93.47     93.50     46.64     
DEV2      46.50     46.47      x        46.50     46.64     92.71     92.75     93.52     
DEV3      46.49     46.49     46.48      x        93.48     46.62     93.51     93.51 
DEV4      93.45     93.34     46.62     93.46      x        46.49     46.49     46.52     
DEV5      93.49     93.44     93.47     46.64     46.51      x        46.51     46.51     
DEV6      46.61     92.80     92.58     93.48     46.49     46.49      x        46.50     
DEV7      93.47     46.62     93.46     93.49     46.50     46.50     46.51      x     
Total:  31.11 s
SM (bulk copy) bidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=biBw --p2p_direction=write
(EnabledP2P SM WRITE Bidirectional) 
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        84.83     84.85     84.85     169.60    169.57    85.25     169.56    
DEV1      84.85      x        84.83     84.84     169.54    169.55    169.56    85.24     
DEV2      84.84     84.82      x        84.83     85.25     169.55    169.53    169.54    
DEV3      84.84     84.83     84.83      x        169.58    85.26     169.58    169.52  
DEV4      169.58    169.56    85.25     169.57     x        84.84     84.85     84.85     
DEV5      169.53    169.57    169.57    85.25     84.84      x        84.83     84.83     
DEV6      85.25     169.55    169.56    169.57    84.85     84.84      x        84.83     
DEV7      169.56    85.23     169.52    169.50    84.85     84.82     84.82      x      
Total:  34.18 s
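One way to read the bidirectional figures is to compare them with the unidirectional ones for the same device pair. The sketch below does this for the DEV0 rows of the two bulk-copy write matrices; the values are copied from those matrices, and `scaling` is a hypothetical helper name.

```python
# DEV0 rows of the bulk-copy write matrices, peers DEV1..DEV7 (GB/s).
uni_write = [45.00, 45.00, 45.00, 89.73, 89.72, 45.45, 89.54]
bi_write = [84.83, 84.85, 84.85, 169.60, 169.57, 85.25, 169.56]


def scaling(uni, bi):
    """Ratio of bidirectional to unidirectional bandwidth for each peer."""
    return [b / u for u, b in zip(uni, bi)]


ratios = scaling(uni_write, bi_write)
for peer, ratio in enumerate(ratios, start=1):
    print(f"DEV0 <-> DEV{peer}: {ratio:.2f}x")
```

For this sample the bidirectional figure comes out at roughly 1.9 times the unidirectional one for every peer.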
SM (bulk copy) bidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=biBw --p2p_direction=read
(EnabledP2P SM READ Bidirectional) 
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0       x        78.52     78.52     78.53     157.98    157.96    78.87     157.91    
DEV1      78.53      x        78.51     78.49     157.81    157.71    157.73    78.84     
DEV2      78.50     78.51      x        78.50     78.87     157.83    157.80    157.80    
DEV3      78.49     78.50     78.52      x        157.83    78.86     157.75    157.78  
DEV4      157.80    157.96    78.88     157.95     x        78.52     78.53     78.50     
DEV5      157.75    157.76    157.77    78.84     78.52      x        78.51     78.47     
DEV6      78.84     157.61    157.75    157.75    78.51     78.52      x        78.50     
DEV7      157.85    78.82     157.78    157.67    78.50     78.47     78.49      x   
Total:  36.86 s
SM (bulk copy) unidirectional write latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=bulkCopy --perf=latency --p2p_direction=write
(EnabledP2P SM WRITE) GPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      1.69      1.69      1.83      2.17      2.33      1.69      2.10      
DEV1      1.67      0.00      1.66      1.66      2.11      2.19      2.12      1.88      
DEV2      1.88      1.65      0.00      1.78      1.64      2.12      2.24      2.11      
DEV3      1.55      1.62      1.55      0.00      2.17      1.56      2.14      2.27 
DEV4      1.96      2.03      1.54      1.95      0.00      1.64      1.67      1.63      
DEV5      2.27      2.22      2.23      1.87      1.76      0.00      1.86      1.86      
DEV6      1.65      2.28      2.11      2.11      1.86      1.84      0.00      1.87      
DEV7      2.09      1.55      2.17      2.13      1.66      1.64      1.79      0.00   
(EnabledP2P SM WRITE) CPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      4.35      1.84      1.83      1.84      1.87      1.85      1.86      
DEV1      4.20      0.00      1.83      1.93      1.85      1.85      1.83      1.82      
DEV2      4.66      2.17      0.00      2.17      2.25      2.18      2.16      2.15      
DEV3      4.68      2.11      2.20      0.00      2.14      2.23      2.18      2.12  
DEV4      3.82      1.32      1.32      1.40      0.00      1.32      1.34      1.33      
DEV5      3.84      1.32      1.32      1.33      1.31      0.00      1.32      1.38      
DEV6      3.99      1.43      1.47      1.50      1.46      1.46      0.00      1.44      
DEV7      3.75      1.35      1.38      1.36      1.36      1.34      1.35      0.00  
Total:   0.50 s
SM (bulk copy) unidirectional read latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=bulkCopy --perf=latency --p2p_direction=read
(EnabledP2P SM READ) GPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      1.81      1.66      1.84      2.31      2.32      1.86      2.33      
DEV1      1.51      0.00      1.63      1.79      2.12      2.11      2.22      1.64      
DEV2      1.51      1.63      0.00      1.65      1.63      2.28      2.15      2.11      
DEV3      1.66      1.68      1.70      0.00      2.11      1.87      2.30      2.34  
DEV4      2.12      2.12      1.66      2.12      0.00      1.83      1.87      1.87      
DEV5      1.96      2.03      2.09      1.62      1.62      0.00      1.65      1.72      
DEV6      1.48      2.03      2.09      2.14      1.65      1.68      0.00      1.82      
DEV7      2.12      1.67      2.34      2.11      1.83      1.96      1.88      0.00 
(EnabledP2P SM READ) CPU P2P Transmission Latency Matrix (US)
          DEV0      DEV1      DEV2      DEV3      DEV4      DEV5      DEV6      DEV7
DEV0      0.00      4.22      1.56      1.56      1.60      1.54      1.60      1.54      
DEV1      4.13      0.00      1.64      1.58      1.61      1.58      1.61      1.62      
DEV2      3.77      1.33      0.00      1.33      1.33      1.37      1.31      1.32      
DEV3      3.69      1.33      1.32      0.00      1.31      1.35      1.34      1.32  
DEV4      4.70      2.17      2.31      2.20      0.00      2.25      2.20      2.19      
DEV5      4.69      2.19      2.19      2.11      2.31      0.00      2.16      2.16      
DEV6      4.38      1.86      1.89      1.85      1.84      1.83      0.00      1.85      
DEV7      4.61      2.17      2.16      2.15      2.16      2.15      2.16      0.00   
Total:   0.49 s

3.2 Multi-node (two machines, four cards)

3.2.1 Actual topology

Figure: topology of the two-machine, four-card setup used in the example runs

3.2.2 Measured data

Cross-node tests need the IP addresses of the hosts that run the p2p test. List them in a hosts.cfg file and pass that file to mpirun with the --hostfile option. An example hosts.cfg file:

# usage: mpirun --hostfile hosts.cfg
# tell mpi the hosts' ip addresses and slots of each host

# format
# $ip_address slots=$slots

# e.g.
# 30.21.220.4 slots=2
# 30.21.220.5 slots=2
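A hostfile in this format can also be generated by script. The sketch below is a minimal example: `render_hostfile` and `total_slots` are hypothetical helper names, and the addresses are the sample values from the file above. The commands in this section launch with -np 4 -npernode 2, so the hostfile is expected to provide at least four slots in total.

```python
def render_hostfile(hosts):
    """Render (ip, slots) pairs in the `$ip_address slots=$slots` format."""
    return "".join(f"{ip} slots={slots}\n" for ip, slots in hosts)


def total_slots(hosts):
    """Total number of processes the hostfile has room for."""
    return sum(slots for _, slots in hosts)


# Sample addresses from the hosts.cfg example; replace with the real hosts.
hosts = [("30.21.220.4", 2), ("30.21.220.5", 2)]

print(render_hostfile(hosts), end="")
print("total slots:", total_slots(hosts))
```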
SM (cache copy) unidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=uniBw --p2p_direction=write
(EnabledP2P CE WRITE Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   16.91               33.67               16.95
HOST1:DEV0          16.93               x                   16.94               33.66
HOST2:DEV0          33.60               16.94               x                   16.95
HOST3:DEV0          16.94               33.66               16.94               x
Total:  18.72 s
SM (cache copy) unidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=uniBw --p2p_direction=read
(EnabledP2P CE READ Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   18.95               37.92               18.98
HOST1:DEV0          18.96               x                   18.98               37.92
HOST2:DEV0          37.86               18.98               x                   18.98
HOST3:DEV0          18.97               37.92               18.98               x
Total:  18.72 s
SM (cache copy) bidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=biBw --p2p_direction=write
(EnabledP2P CE WRITE Bidirectional) 
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   31.63               62.77               31.63
HOST1:DEV0          31.63               x                   31.63               62.74
HOST2:DEV0          62.77               31.63               x                   31.63
HOST3:DEV0          31.63               62.74               31.63               x
Total:  25.58 s
SM (cache copy) bidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=biBw --p2p_direction=read
(EnabledP2P CE READ Bidirectional) 
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   25.58               50.73               25.58
HOST1:DEV0          25.58               x                   25.61               50.65
HOST2:DEV0          50.73               25.61               x                   25.57
HOST3:DEV0          25.58               50.63               25.58               x      
Total:  30.58 s
SM (cache copy) unidirectional write latency
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=cacheCopy --perf=latency --p2p_direction=write
(EnabledP2P CE WRITE) GPU P2P Transmission Latency Matrix (US)
              HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
DEV0          0.00                37.30               5.58                5.57
DEV0          41.91               0.00                4.90                4.90
DEV0          37.09               4.68                0.00                4.27
DEV0          36.07               4.43                4.41                0.00
Total:   0.47 s
SM (cache copy) unidirectional read latency
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=cacheCopy --perf=latency --p2p_direction=read
(EnabledP2P CE READ) GPU P2P Transmission Latency Matrix (US)
              HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
DEV0          0.00                42.19               5.77                5.77
DEV0          44.16               0.00                6.42                6.81
DEV0          36.71               4.13                0.00                4.13
DEV0          36.35               4.77                4.74                0.00
Total:   0.44 s
SM (non-cache copy) unidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=uniBw --p2p_direction=write
(EnabledP2P CE WRITE Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   16.91               33.74               16.95
HOST1:DEV0          16.94               x                   16.94               33.73
HOST2:DEV0          33.68               16.94               x                   16.95
HOST3:DEV0          16.94               33.74               16.94               x
Total:  20.79 s
SM (non-cache copy) unidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=uniBw --p2p_direction=read
(EnabledP2P CE READ Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   18.95               37.96               18.98
HOST1:DEV0          18.96               x                   18.98               37.96
HOST2:DEV0          37.89               18.89               x                   18.98
HOST3:DEV0          18.97               37.96               18.99               x      
Total:  18.72 s
SM (non-cache copy) bidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=biBw --p2p_direction=write
(EnabledP2P CE WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   31.63               62.91               31.63
HOST1:DEV0          31.63               x                   31.63               62.88
HOST2:DEV0          62.90               31.63               x                   31.63
HOST3:DEV0          31.63               62.89               31.63               x
Total:  25.59 s
SM (non-cache copy) bidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=biBw --p2p_direction=read
(EnabledP2P CE READ Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   25.58               50.72               25.58
HOST1:DEV0          25.58               x                   25.61               50.64
HOST2:DEV0          50.73               25.61               x                   25.58
HOST3:DEV0          25.58               50.64               25.58               x
Total:  30.58 s
SM (non-cache copy) unidirectional write latency
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=ncCopy --perf=latency --p2p_direction=write
(EnabledP2P CE WRITE) GPU P2P Transmission Latency Matrix (US)
              HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
DEV0          0.00                39.61               5.76                5.86
DEV1          40.75               0.00                4.86                4.97
DEV0          35.78               4.74                0.00                4.71
DEV1          34.23               3.90                3.87                0.00 
Total:   0.44 s
SM (non-cache copy) unidirectional read latency
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=ncCopy --perf=latency --p2p_direction=read
(EnabledP2P CE READ) GPU P2P Transmission Latency Matrix (US)
              HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
DEV0          0.00                40.95               5.57                5.55
DEV1          41.17               0.00                5.28                5.28
DEV0          35.48               4.68                0.00                4.53
DEV1          36.23               4.34                4.34                0.00 
Total:   0.43 s
SM (bulk copy) unidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=write
(EnabledP2P CE WRITE Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   23.33               46.47               23.39
HOST1:DEV0          23.36               x                   23.39               46.47
HOST2:DEV0          46.38               23.39               x                   23.39
HOST3:DEV0          23.36               46.47               23.39               x      
Total:  15.12 s
SM (bulk copy) unidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=read
(EnabledP2P CE READ Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   24.05               48.22               24.11
HOST1:DEV0          24.08               x                   24.11               48.22
HOST2:DEV0          48.11               24.11               x                   24.11
HOST3:DEV0          24.08               48.22               24.11               x
Total:  14.68 s
SM (bulk copy) bidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=biBw --p2p_direction=write
(EnabledP2P CE WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   44.12               87.59               44.13
HOST1:DEV0          44.12               x                   44.17               87.55
HOST2:DEV0          87.60               44.17               x                   44.12
HOST3:DEV0          44.13               87.55               44.12               x
Total:  18.99 s
SM (bulk copy) bidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=biBw --p2p_direction=read
(EnabledP2P CE READ Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   41.77               83.52               41.75
HOST1:DEV0          41.77               x                   41.80               83.43
HOST2:DEV0          83.51               41.80               x                   41.75
HOST3:DEV0          41.75               83.42               41.77               x      
Total:  19.90 s
SM (bulk copy) unidirectional write bandwidth at the default --num_elems (40000000 bytes)
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=write
(EnabledP2P CE WRITE) P2P Unidirectional Bandwidth Transport Bandwidth Matrix (40000000 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   20.49               46.23               23.36
HOST1:DEV0          20.90               x                   23.36               46.31
HOST2:DEV0          38.20               23.37               x                   23.36
HOST3:DEV0          22.04               46.28               23.37               x
Total:   0.70 s
SM (bulk copy) unidirectional read bandwidth at the default --num_elems (40000000 bytes)
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=read
(EnabledP2P CE READ) P2P Unidirectional Bandwidth Transport Bandwidth Matrix (40000000 Bytes) (GB/s)
                    HOST0:DEV0          HOST1:DEV0          HOST2:DEV0          HOST3:DEV0
HOST0:DEV0          x                   20.54               46.30               23.36
HOST1:DEV0          20.85               x                   23.36               46.23
HOST2:DEV0          38.43               23.37               x                   23.36
HOST3:DEV0          21.09               46.27               23.37               x
Total:   0.69 s
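Most of the multi-node commands in this section follow one pattern: bandwidth runs pass --num_elems=536870912 and latency runs leave it at the default. The sketch below generates that sweep. `build_cmd` is a hypothetical helper name; the interface name ens81f1 and the hostfile path are taken from the examples above and will differ on other machines.

```python
from itertools import product

# Launcher prefix copied from the examples; adjust interface and hostfile as needed.
MPI_PREFIX = ("mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg "
              "-mca btl_tcp_if_include ens81f1")
BW_ELEMS = 536870912  # element count used for the bandwidth runs above


def build_cmd(kernel, perf, direction):
    """Assemble one multi-process command line in the same form as the examples."""
    args = ["p2pBandwidthLatencyTest", "--is_multi_proc=1"]
    if perf != "latency":
        args.append(f"--num_elems={BW_ELEMS}")  # latency runs keep the default size
    args += ["--p2p_type=SM", f"--kernel={kernel}",
             f"--perf={perf}", f"--p2p_direction={direction}"]
    return MPI_PREFIX + " " + " ".join(args)


commands = [
    build_cmd(kernel, perf, direction)
    for kernel, perf, direction in product(
        ["cacheCopy", "ncCopy", "bulkCopy"],
        ["uniBw", "biBw", "latency"],
        ["write", "read"],
    )
]

for cmd in commands:
    print(cmd)
```

The script only prints the command lines, so they can be reviewed before being run on the cluster.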