PCCL: Multi-GPU P2P Bandwidth and Latency Test Guide (v2.1)
1. Overview
This document describes the usage of the p2p bandwidth and latency test tool "p2pBandwidthLatencyTest" in pccl_tools. The sections below cover its typical usage modes under various multi-GPU and multi-node configurations. The data values shown are for reference only; the actual perf numbers for a given PPU SDK release depend on the specific machine environment and bandwidth configuration.
Starting with the PPU SDK v2.1.0 release, pccl_tools has been removed from the SDK, renamed comm_tools, and published independently; the p2p test tools are included in comm_tools.
The independently released comm_tools for PPU SDK v2.1.0 is not compatible with earlier versions and cannot run against earlier PPU SDK releases.
Standalone release link: https://art-pub.eng.t-head.cn/artifactory/generic-local/SAIL/v2.1.0/COMM . Packages are named comm_tools_<hggcrt_version>_<os_version>.tar.gz; download the package matching the OS version and hggc runtime version required by your environment. The P2P test tool ships in two builds: multi_process (built with MPI) and single_process (built without MPI). The former depends on MPI and supports multi-node tests; the latter is for single-node tests.
├── multi_process
│ ├── p2pBandwidthLatencyTest
└── single_process
├── p2pBandwidthLatencyTest
2. Parameters
All parameters can be listed via p2pBandwidthLatencyTest --help:
This is bandwidth/latency tests of GPU pairs using P2P.
Usage:
p2pBandwidthLatencyTest [OPTION...]
-p, --perf arg P2P perf type. uniBw: unidirectional bandwidth; biBw: bidirectional bandwidth; latency: latency. (default: uniBw)
-d, --p2p_direction arg P2P transmission direction. write: root device write to peer device; read: root device read from peer device. (default: write)
-n, --num_elems arg Number of transmission integers. (default: 10000000)
-t, --p2p_type arg P2P transmission type. CE: command level; SM: kernel level. (default: SM)
-k, --kernel arg P2P transmission kernel type. combined with p2pType=SM. typical cacheCopy/ncCopy/bulkCopy, or normal format: ppu1.0: <ld_cp>-<st_cp>-<rmt_cp>, ppu1.5: <ld_cp>-<st_cp>-<rmt_cp>-<rtmd> (default: bulkCopy)
-b, --num_thread_blocks arg Number of thread blocks to be used in kernel copy. (default: -1)
-s, --block_size arg Thread block size to be used in kernel copy. (default: -1)
-g, --num_gpus arg Total number of devices in single process test. it will always be '1' in multi-process test. (default: -1)
-l, --dev_list arg Device list for single process p2p bandwidth test. List length should equal to num_gpus. (default: -1,)
-m, --is_multi_proc arg Multiple processes with single device per process(0 or 1). (default: 0)
-a, --disable_p2p arg Disable kernel level p2p transmission(0 or 1). (default: 0)
-e, --enable_local_D2D arg Enable D2D copy on one device(0 or 1). (default: 0)
-j, --eval_P2P_perf_tri arg Eval p2p copy perf on perf triangle way(0 or 1). (default: 0)
-v, --verify arg Whether to enable data mismatch verification. No verify if not set(0 or 1). (default: 0)
-u, --display_ur arg Whether to show bandwidth matrix in terms of utilization ratio(0 or 1). (default: 0)
-w, --warmup arg Whether to do command engine bi-bw warmup before p2p perf test(0 or 1). (default: 1)
-i, --num_iterations arg Number of iterations for random test. (default: 1)
-r, --repeat arg Number of repetitions of p2p transmission. (default: 15)
-y, --min_num_elems arg Minimum number of transmission integers in random test. (default: 8)
-x, --max_num_elems arg Maximum number of transmission integers in random test. (default: 8)
--buffer_offset arg P2P transmission offset from start address of buffer. (default: 0)
-c, --bench_type arg Bench output type. all: evaluate all kinds of perf kinds; p2p_write: p2p bench test with write type; p2p_read: p2p bench test with read type. (default: none)
-f, --max_display_num_elems arg Limit the maximum number of elements displayed. (default: 32)
--disable_vm arg Disable virtual mode(0 or 1). (default: 0)
--is_single_die arg Single die. (default: 0)
--use_fake_driver arg run on fake driver. (default: 0)
--ignore_icn_conn_check arg ignore icn connectivity check. (default: 0)
-h, --help Print usage. (default: false)
2.1. Common parameters
performance type
perf=uniBw unidirectional bandwidth.
perf=biBw bidirectional bandwidth.
perf=latency unidirectional latency of transferring 4 integers.
p2p transmission direction
p2p_direction=write root device write to peer device.
p2p_direction=read root device read from peer device.
p2p transmission size
num_elems=number of integers.
p2p transmission way
p2p_type=CE through cudaMemcpyPeerAsync.
p2p_type=SM through kernels.
p2p transmission kernel type if p2pType=SM
kernel=cacheCopy normal load/store.
kernel=ncCopy normal v4 load/store.
kernel=bulkCopy bulk v4 load/store.
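As a quick reference for the result sections below: num_elems counts 32-bit integers, so the payload size shown in the matrix titles is num_elems × 4 bytes, and bandwidth is payload divided by transfer time. A minimal conversion sketch (the elapsed time here is a made-up illustration value, not a measurement):

```python
# num_elems counts 32-bit integers, i.e. 4 bytes per element.
num_elems = 536870912              # the --num_elems value used in the examples
num_bytes = num_elems * 4          # 2147483648 bytes, as shown in matrix titles
elapsed_s = 0.045                  # hypothetical per-transfer time in seconds
bw_gbps = num_bytes / elapsed_s / 1e9
print(num_bytes)                   # 2147483648
print(round(bw_gbps, 2))           # 47.72
```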
2.2. Optional parameters
dev_list= When the dev_list option is provided, p2p transmission tests run only between the PPUs specified in dev_list. Note that this parameter must be used together with num_gpus, and the length of dev_list must equal num_gpus.
display_ur When this parameter is used, the p2p bandwidth test outputs a bandwidth utilization ratio matrix in addition to the raw bandwidth matrix.
disable_p2p When this parameter is used, p2p access between PPUs is disabled; unidirectional bandwidth / latency tests then support WRITE mode only.
In addition, the p2p bandwidth and latency test tool provides a bench parameter for comprehensive batch output.
bench_type=all Output all type combinations under the enableP2P condition.
bench_type=p2p_write Output CE WRITE bandwidth and latency under both enableP2p and disableP2p conditions.
bench_type=p2p_read Output (partial) CE READ bandwidth and latency under both enableP2p and disableP2p conditions.
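The utilization ratio reported by display_ur is simply measured bandwidth over a peak reference bandwidth. A minimal sketch of the computation; the peak value used here is a hypothetical placeholder, not a published figure for any PPU:

```python
# Utilization ratio = measured bandwidth / peak reference bandwidth * 100.
peak_gbps = 55.0        # hypothetical per-path peak; not a published spec
measured_gbps = 47.50   # one entry from a bandwidth matrix
ur_percent = measured_gbps / peak_gbps * 100
print(round(ur_percent, 2))   # 86.36
```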
3. Examples
3.1 Single node, 8 GPUs
3.1.1 Physical topology
PPU0 PPU1 PPU2 PPU3 PPU4 PPU5 PPU6 PPU7 CPU Affinity NUMA Affinity
PPU0 X ICN1 ICN1 ICN1 SYS SYS SYS ICN1 0-47,96-143 0
PPU1 ICN1 X ICN1 ICN1 SYS SYS ICN1 SYS 0-47,96-143 0
PPU2 ICN1 ICN1 X ICN1 SYS ICN1 SYS SYS 0-47,96-143 0
PPU3 ICN1 ICN1 ICN1 X ICN1 SYS SYS SYS 0-47,96-143 0
PPU4 SYS SYS SYS ICN1 X ICN1 ICN1 ICN1 48-95,144-191 1
PPU5 SYS SYS ICN1 SYS ICN1 X ICN1 ICN1 48-95,144-191 1
PPU6 SYS ICN1 SYS SYS ICN1 ICN1 X ICN1 48-95,144-191 1
PPU7 ICN1 SYS SYS SYS ICN1 ICN1 ICN1 X 48-95,144-191 1
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
ICN# = Connection traversing a bonded set of # ICN links
3.1.2 p2p min path
Every p2p bandwidth test first prints the topology connectivity matrix. An element value of 0 means there is no icnlink path (direct or routed) between the two PPUs. In multi-process tests, element values can only be 0/1, where 1 means an icnlink path exists between the two PPUs; in single-process tests, element values may exceed 1, in which case the value is the min path between the two PPUs.
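Conceptually, the min path is the hop count over direct icnlink connections. The sketch below (an illustration, not the tool's actual implementation) derives such a matrix by BFS over a hypothetical 4-device adjacency; note that the device numbering printed by the test may differ from the PPU numbering in the topology dump above:

```python
from collections import deque

def min_path_matrix(adj):
    # BFS from each source; matrix entry = hop count of the shortest path,
    # 0 on the diagonal (and for unreachable pairs).
    n = len(adj)
    dist = [[0] * n for _ in range(n)]
    for src in range(n):
        seen = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in range(n):
                if adj[u][v] and v not in seen:
                    seen[v] = seen[u] + 1
                    q.append(v)
        for v in range(n):
            dist[src][v] = seen.get(v, 0)
    return dist

# Hypothetical chain topology: 0-1, 1-2, 2-3 are direct links.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
for row in min_path_matrix(adj):
    print(row)   # [0, 1, 2, 3] / [1, 0, 1, 2] / [2, 1, 0, 1] / [3, 2, 1, 0]
```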
P2P Connectivity & minPath Matrix
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0 1 1 1 2 2 1 2
DEV1 1 0 1 1 2 2 2 1
DEV2 1 1 0 1 1 2 2 2
DEV3 1 1 1 0 2 1 2 2
DEV4 2 2 1 2 0 1 1 1
DEV5 2 2 2 1 1 0 1 1
DEV6 1 2 2 2 1 1 0 1
DEV7 2 1 2 2 1 1 1 0
3.1.3 Bandwidth utilization ratio display
With the display_ur parameter, the p2p bandwidth test prints a bandwidth utilization ratio matrix; an example (CE unidirectional write):
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=CE --perf=uniBw --p2p_direction=write --display_ur=1
(EnabledP2P CE WRITE Unidirectional)
P2P Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 47.50 46.34 45.13 90.37 90.35 45.61 92.17
DEV1 48.22 x 46.03 45.29 93.40 95.59 95.50 46.29
DEV2 45.76 47.52 x 45.64 47.67 91.48 90.10 94.32
DEV3 47.55 47.54 47.53 x 95.01 47.53 94.99 94.97
DEV4 92.95 94.84 48.54 94.23 x 46.04 44.93 45.96
DEV5 95.58 95.39 92.79 45.17 46.41 x 48.24 46.06
DEV6 46.66 90.65 90.25 90.30 45.24 46.47 x 47.86
DEV7 94.93 48.54 96.31 95.87 47.56 46.91 45.75 x
(EnabledP2P CE WRITE Unidirectional)
P2P Transport Bandwidth Utilization Ratio Matrix (2147483648 Bytes) (%)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 86.44 86.51 86.54 85.21 86.98 86.50 84.28
DEV1 87.66 x 84.76 88.01 84.86 83.89 86.87 87.01
DEV3 84.20 88.93 84.41 x 85.03 88.97 85.01 83.72
DEV4 88.84 87.90 86.82 84.64 x 83.19 83.83 86.27
DEV5 83.62 83.62 83.47 84.75 86.69 x 88.55 90.04
DEV6 85.84 88.30 85.86 83.96 86.26 87.42 x 84.01
DEV7 88.63 88.33 86.27 85.30 84.71 83.41 83.06 x
Total: 31.10 s
3.1.4 Measured data
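The bandwidth matrices in this section can be post-processed directly from captured stdout, for example to average the off-diagonal entries. A minimal parsing sketch over a few excerpted rows (the averaging itself is purely illustrative):

```python
# Average the off-diagonal entries of a captured bandwidth matrix.
# 'x' marks the diagonal (self) and is skipped.
sample = """\
DEV0 x 47.50 46.34
DEV1 48.22 x 46.03
DEV2 45.76 47.52 x
"""
vals = [float(tok)
        for line in sample.splitlines()
        for tok in line.split()[1:]
        if tok != "x"]
mean = sum(vals) / len(vals)
print(round(mean, 3))
```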
CE unidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=CE --perf=uniBw --p2p_direction=write
(EnabledP2P CE WRITE Unidirectional)
P2P Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 47.50 46.34 45.13 90.37 90.35 45.61 92.17
DEV1 48.22 x 46.03 45.29 93.40 95.59 95.50 46.29
DEV2 45.76 47.52 x 45.64 47.67 91.48 90.10 94.32
DEV3 47.55 47.54 47.53 x 95.01 47.53 94.99 94.97
DEV4 92.95 94.84 48.54 94.23 x 46.04 44.93 45.96
DEV5 95.58 95.39 92.79 45.17 46.41 x 48.24 46.06
DEV6 46.66 90.65 90.25 90.30 45.24 46.47 x 47.86
DEV7 94.93 48.54 96.31 95.87 47.56 46.91 45.75 x
Total: 31.10 s
CE unidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=CE --perf=uniBw --p2p_direction=read
(EnabledP2P CE READ Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 47.16 47.69 48.15 48.35 45.06 46.15 47.09
DEV1 46.81 x 45.82 48.12 44.87 44.89 45.71 46.99
DEV2 45.52 45.81 x 48.31 46.94 45.16 45.40 44.61
DEV3 44.79 45.79 45.72 x 46.12 46.07 45.18 47.22
DEV4 44.72 46.06 45.89 48.70 x 46.61 44.94 46.70
DEV5 45.56 46.64 47.41 48.18 46.19 x 44.71 44.64
DEV6 47.01 47.10 45.69 47.50 46.12 46.70 x 47.48
DEV7 48.18 47.85 46.25 47.17 47.52 46.17 44.87 x
Total: 39.59 s
CE bidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=CE --perf=biBw --p2p_direction=write
(EnabledP2P CE WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 89.37 88.92 89.14 177.20 177.50 90.48 175.34
DEV1 91.71 x 89.01 90.52 180.07 173.93 175.32 88.96
DEV2 89.68 92.75 x 89.45 89.24 174.26 177.67 176.64
DEV3 89.01 89.34 89.17 x 174.04 89.08 174.11 174.32
DEV4 182.74 180.31 89.66 173.86 x 89.19 90.68 90.71
DEV5 177.61 174.12 177.32 88.67 91.84 x 88.70 90.31
DEV6 88.89 176.75 178.35 176.07 92.75 90.58 x 89.08
DEV7 179.27 93.35 179.39 178.68 91.19 90.64 89.90 x
Total: 32.46 s
CE bidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=CE --perf=biBw --p2p_direction=read
(EnabledP2P CE READ Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 91.56 90.07 89.09 89.36 90.06 90.23 92.59
DEV1 92.95 x 91.08 89.05 89.09 89.15 89.75 90.07
DEV2 90.88 90.06 x 90.16 92.50 89.24 92.01 88.95
DEV3 92.48 90.82 92.49 x 89.00 92.23 93.14 91.78
DEV4 89.56 89.43 92.02 94.55 x 91.01 88.82 90.29
DEV5 89.34 90.77 89.21 90.73 91.11 x 89.01 89.24
DEV6 88.70 88.83 89.44 89.90 89.79 89.61 x 90.00
DEV7 89.01 88.74 89.07 88.96 91.68 92.37 92.33 x
Total: 40.76 s
CE unidirectional write latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=CE --perf=latency --p2p_direction=write
(EnabledP2P CE WRITE) GPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 1.27 1.27 1.27 1.65 1.65 1.22 1.67
DEV1 1.28 0.00 1.34 1.34 1.69 1.69 1.69 1.31
DEV2 1.20 1.21 0.00 1.21 1.20 1.60 1.61 1.61
DEV3 1.20 1.21 1.21 0.00 1.60 1.19 1.61 1.61
DEV4 1.67 1.70 1.31 1.69 0.00 1.34 1.34 1.34
DEV5 1.68 1.69 1.70 1.30 1.34 0.00 1.34 1.34
DEV6 1.18 1.60 1.61 1.61 1.21 1.21 0.00 1.21
DEV7 1.57 1.20 1.61 1.61 1.21 1.21 1.22 0.00
(EnabledP2P CE WRITE) CPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 7.17 6.99 7.18 7.12 7.11 7.15 6.95
DEV1 7.20 0.00 7.04 7.01 6.95 7.00 7.09 7.16
DEV2 7.04 7.01 0.00 6.98 6.95 7.01 7.03 7.01
DEV3 7.18 6.88 7.14 0.00 6.97 7.15 6.88 7.02
DEV4 7.63 7.76 7.61 7.48 0.00 7.49 7.59 7.56
DEV5 7.58 7.55 7.61 7.56 7.48 0.00 7.51 7.49
DEV6 7.73 7.49 7.38 7.52 7.47 7.51 0.00 7.51
DEV7 7.32 7.23 7.30 7.30 7.24 7.18 7.33 0.00
Total: 0.51 s
CE unidirectional read latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=CE --perf=latency --p2p_direction=read
(EnabledP2P CE READ) GPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 1.26 1.25 1.23 1.63 1.64 1.22 1.63
DEV1 1.23 0.00 1.26 1.25 1.63 1.65 1.66 1.23
DEV2 1.23 1.26 0.00 1.24 1.22 1.65 1.65 1.63
DEV3 1.13 1.17 1.17 0.00 1.54 1.15 1.56 1.54
DEV4 1.50 1.56 1.15 1.55 0.00 1.17 1.17 1.16
DEV5 1.59 1.65 1.65 1.23 1.25 0.00 1.25 1.25
DEV6 1.19 1.65 1.65 1.63 1.24 1.26 0.00 1.24
DEV7 1.51 1.15 1.56 1.54 1.16 1.18 1.17 0.00
(EnabledP2P CE READ) CPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 7.23 1.57 1.51 1.51 1.60 1.58 1.49
DEV1 6.99 0.00 1.50 1.53 1.59 1.52 1.50 1.53
DEV2 6.83 1.51 0.00 1.51 1.47 1.48 1.46 1.53
DEV3 7.08 1.54 1.57 0.00 1.53 1.55 1.59 1.54
DEV4 7.61 1.97 1.96 1.95 0.00 1.94 2.03 1.97
DEV5 7.86 2.05 2.01 2.08 2.00 0.00 2.04 2.04
DEV6 8.12 1.80 1.84 1.86 1.79 1.81 0.00 1.78
DEV7 7.65 1.60 1.61 1.62 1.61 1.62 1.62 0.00
Total: 0.49 s
SM (cache copy) unidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=uniBw --p2p_direction=write
(EnabledP2P SM WRITE Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 33.46 33.46 33.47 66.60 66.60 33.75 66.35
DEV1 33.46 x 33.40 33.41 66.45 66.39 66.37 33.75
DEV2 33.42 33.46 x 33.41 33.77 66.39 66.43 66.41
DEV3 33.40 33.42 33.46 x 66.41 33.77 66.38 66.39
DEV4 66.36 66.45 33.81 66.59 x 33.46 33.40 33.40
DEV5 66.39 66.33 66.43 33.76 33.46 x 33.41 33.39
DEV6 33.76 66.35 66.45 66.41 33.40 33.46 x 33.39
DEV7 66.44 33.76 66.35 66.43 33.43 33.41 33.39 x
Total: 43.08 s
SM (cache copy) unidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=uniBw --p2p_direction=read
(EnabledP2P SM READ Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 36.73 36.73 36.73 73.41 73.39 36.89 73.06
DEV1 36.73 x 36.64 36.66 73.06 73.12 73.15 36.82
DEV2 36.64 36.73 x 36.66 36.80 73.06 73.14 73.07
DEV3 36.67 36.64 36.73 x 73.06 36.82 73.13 73.08
DEV4 73.13 73.40 36.89 73.40 x 36.73 36.73 36.65
DEV5 73.04 73.03 73.18 36.83 36.73 x 36.65 36.65
DEV6 36.83 73.17 73.06 73.09 36.65 36.73 x 36.65
DEV7 73.19 36.82 73.07 73.11 36.65 36.64 36.65 x
Total: 39.42 s
SM (cache copy) bidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=biBw --p2p_direction=write
(EnabledP2P SM WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 62.09 62.09 62.10 123.73 123.73 62.46 123.69
DEV1 62.08 x 62.09 62.10 123.68 123.71 123.68 62.45
DEV2 62.09 62.09 x 62.09 62.44 123.71 123.72 123.70
DEV3 62.12 62.11 62.08 x 123.71 62.43 123.70 123.68
DEV4 123.72 123.71 62.44 123.72 x 62.10 62.11 62.10
DEV5 123.72 123.70 123.71 62.46 62.11 x 62.10 62.09
DEV6 62.46 123.69 123.70 123.72 62.07 62.11 x 62.07
DEV7 123.68 62.45 123.70 123.68 62.09 62.08 62.08 x
Total: 46.50 s
SM (cache copy) bidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=biBw --p2p_direction=read
(EnabledP2P SM READ Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 48.61 48.62 48.61 97.01 96.98 48.90 96.97
DEV1 48.61 x 48.57 48.58 96.94 96.90 96.90 48.87
DEV2 48.58 48.61 x 48.58 48.88 96.94 96.92 96.94
DEV3 48.58 48.57 48.61 x 96.91 48.88 96.90 96.90
DEV4 96.94 96.94 48.88 97.02 x 48.57 48.58 48.59
DEV5 96.94 96.94 96.93 48.88 48.61 x 48.58 48.57
DEV6 48.88 96.89 96.90 96.95 48.58 48.57 x 48.57
DEV7 96.95 48.87 96.90 96.90 48.58 48.56 48.56 x
Total: 59.17 s
SM (cache copy) unidirectional write latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=cacheCopy --perf=latency --p2p_direction=write
(EnabledP2P SM WRITE) GPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 2.10 2.10 2.10 2.56 2.56 2.10 2.57
DEV1 2.01 0.00 1.95 1.95 2.43 2.43 2.43 1.95
DEV2 2.12 2.11 0.00 2.11 2.11 2.58 2.58 2.58
DEV3 2.12 2.11 2.11 0.00 2.58 2.11 2.58 2.58
DEV4 2.44 2.28 1.95 2.41 0.00 1.95 1.95 1.95
DEV5 2.44 2.28 2.31 1.96 1.95 0.00 1.95 1.95
DEV6 2.12 2.35 2.40 2.50 2.11 2.11 0.00 2.11
DEV7 2.58 2.11 2.40 2.52 2.11 2.11 2.11 0.00
(EnabledP2P SM WRITE) CPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 4.22 1.54 1.56 1.55 1.63 1.53 1.56
DEV1 4.25 0.00 1.56 1.61 1.54 1.54 1.54 1.54
DEV2 4.29 1.66 0.00 1.58 1.56 1.63 1.60 1.60
DEV3 4.19 1.56 1.57 0.00 1.58 1.57 1.59 1.56
DEV4 4.69 2.15 2.29 2.17 0.00 2.19 2.20 2.16
DEV5 4.68 2.14 2.14 2.12 2.14 0.00 2.14 2.21
DEV6 4.30 1.91 1.89 1.90 1.96 1.91 0.00 1.90
DEV7 4.29 1.91 1.85 1.86 1.85 1.89 1.84 0.00
Total: 0.49 s
SM (cache copy) unidirectional read latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=cacheCopy --perf=latency --p2p_direction=read
(EnabledP2P SM READ) GPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 1.86 1.86 1.94 2.23 2.09 1.79 2.29
DEV1 1.66 0.00 1.88 2.09 2.11 2.12 2.12 1.87
DEV2 1.78 1.79 0.00 1.78 1.78 2.23 2.16 2.15
DEV3 1.78 1.79 1.79 0.00 2.12 1.78 2.11 2.15
DEV4 2.12 2.34 1.89 2.11 0.00 1.80 1.87 1.88
DEV5 2.11 2.34 2.34 1.73 1.84 0.00 1.87 1.87
DEV6 1.78 2.11 2.16 2.14 1.77 1.78 0.00 1.78
DEV7 2.12 1.79 2.14 2.15 1.78 1.78 1.79 0.00
(EnabledP2P SM READ) CPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 4.19 1.54 1.54 1.56 1.55 1.53 1.54
DEV1 4.27 0.00 1.63 1.62 1.66 1.61 1.63 1.62
DEV2 4.13 1.60 0.00 1.57 1.59 1.56 1.57 1.57
DEV3 4.16 1.60 1.58 0.00 1.58 1.59 1.67 1.59
DEV4 4.64 2.17 2.16 2.25 0.00 2.20 2.15 2.14
DEV5 4.60 2.10 2.09 2.19 2.12 0.00 2.09 2.10
DEV6 4.24 1.86 1.89 1.84 1.87 1.89 0.00 1.85
DEV7 4.26 1.89 1.95 1.98 1.91 1.96 1.98 0.00
Total: 0.48 s
SM (non-cache copy) unidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=uniBw --p2p_direction=write
(EnabledP2P SM WRITE Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 33.46 33.46 33.47 66.74 66.73 33.75 66.50
DEV1 33.46 x 33.41 33.41 66.53 66.51 66.59 33.74
DEV2 33.43 33.46 x 33.40 33.76 66.57 66.57 66.47
DEV3 33.42 33.41 33.46 x 66.47 33.76 66.55 66.49
DEV4 66.49 66.57 33.81 66.73 x 33.46 33.40 33.41
DEV5 66.52 66.54 66.47 33.75 33.46 x 33.40 33.40
DEV6 33.74 66.46 66.50 66.50 33.41 33.46 x 33.40
DEV7 66.55 33.76 66.55 66.46 33.41 33.41 33.39 x
Total: 43.07 s
SM (non-cache copy) unidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=uniBw --p2p_direction=read
(EnabledP2P SM READ Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 36.74 36.74 36.74 73.44 73.44 36.90 73.13
DEV1 36.73 x 36.68 36.68 73.15 73.11 73.15 36.81
DEV2 36.66 36.74 x 36.66 36.83 73.14 73.19 73.23
DEV3 36.66 36.67 36.74 x 73.13 36.81 73.09 73.17
DEV4 73.09 73.43 36.89 73.44 x 36.74 36.74 36.66
DEV5 73.09 73.07 73.16 36.84 36.74 x 36.68 36.68
DEV6 36.82 73.09 73.16 73.19 36.67 36.74 x 36.67
DEV7 73.14 36.80 73.08 73.18 36.67 36.66 36.66 x
Total: 39.30 s
SM (non-cache copy) bidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=biBw --p2p_direction=write
(EnabledP2P SM WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 62.09 62.10 62.09 123.98 123.98 62.48 123.95
DEV1 62.08 x 62.09 62.09 123.91 123.93 123.90 62.45
DEV2 62.08 62.08 x 62.08 62.45 123.93 123.96 123.94
DEV3 62.10 62.08 62.09 x 123.94 62.43 123.97 123.95
DEV4 123.96 123.94 62.45 123.96 x 62.09 62.10 62.10
DEV5 123.96 123.96 123.95 62.44 62.10 x 62.09 62.07
DEV6 62.46 123.92 123.96 123.94 62.10 62.10 x 62.09
DEV7 123.95 62.43 123.94 123.95 62.10 62.08 62.09 x
Total: 46.51 s
SM (non-cache copy) bidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=biBw --p2p_direction=read
(EnabledP2P SM READ Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 48.62 48.63 48.62 97.01 97.05 48.93 97.04
DEV1 48.62 x 48.58 48.60 96.98 96.95 96.95 48.90
DEV2 48.59 48.62 x 48.59 48.91 96.92 96.96 96.97
DEV3 48.60 48.59 48.62 x 96.98 48.91 96.96 96.95
DEV4 96.98 96.94 48.90 97.07 x 48.59 48.59 48.60
DEV5 96.96 96.92 96.95 48.90 48.62 x 48.59 48.58
DEV6 48.92 96.97 96.95 96.94 48.59 48.58 x 48.58
DEV7 96.93 48.91 96.92 96.92 48.59 48.59 48.58 x
Total: 59.13 s
SM (non-cache copy) unidirectional write latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=ncCopy --perf=latency --p2p_direction=write
(EnabledP2P SM WRITE) GPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 1.87 1.83 1.87 2.24 2.24 1.65 2.16
DEV1 1.80 0.00 1.79 1.79 2.03 1.96 1.95 1.53
DEV2 1.69 1.71 0.00 1.73 1.90 2.24 2.15 2.12
DEV3 1.68 1.66 1.73 0.00 2.24 1.78 2.14 2.27
DEV4 2.16 2.27 1.78 2.02 0.00 1.55 1.55 1.62
DEV5 2.16 2.25 2.11 1.54 1.55 0.00 1.55 1.62
DEV6 1.65 2.11 2.11 2.35 1.91 1.83 0.00 1.75
DEV7 2.12 1.64 2.11 2.19 1.91 1.78 1.72 0.00
(EnabledP2P SM WRITE) CPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 4.29 1.82 1.84 1.89 1.87 1.82 1.83
DEV1 4.27 0.00 1.85 1.89 1.87 1.85 1.85 1.86
DEV2 4.65 2.21 0.00 2.15 2.12 2.19 2.15 2.14
DEV3 4.62 2.11 2.17 0.00 2.15 2.14 2.13 2.12
DEV4 3.78 1.32 1.31 1.32 0.00 1.32 1.31 1.33
DEV5 4.13 1.52 1.53 1.54 1.56 0.00 1.53 1.50
DEV6 4.13 1.49 1.56 1.47 1.56 1.55 0.00 1.48
DEV7 3.86 1.37 1.32 1.31 1.31 1.31 1.30 0.00
Total: 0.47 s
SM (non-cache copy) unidirectional read latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=ncCopy --perf=latency --p2p_direction=read
(EnabledP2P SM READ) GPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 1.65 1.64 1.63 2.10 2.15 1.65 2.05
DEV1 1.62 0.00 1.65 1.66 2.11 2.11 2.11 1.64
DEV2 1.61 1.64 0.00 1.64 1.64 2.11 2.11 2.11
DEV3 1.40 1.46 1.46 0.00 1.95 1.46 1.95 1.95
DEV4 1.79 1.95 1.46 1.95 0.00 1.46 1.46 1.47
DEV5 2.04 2.11 2.11 1.64 1.64 0.00 1.64 1.69
DEV6 1.52 2.11 2.11 2.11 1.64 1.64 0.00 1.65
DEV7 1.79 1.46 1.95 1.95 1.46 1.46 1.47 0.00
(EnabledP2P SM READ) CPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 4.37 1.86 1.85 1.85 1.89 1.86 1.90
DEV1 4.41 0.00 1.85 1.88 1.85 1.84 1.84 1.84
DEV2 4.71 2.17 0.00 2.20 2.17 2.21 2.17 2.17
DEV3 4.71 2.13 2.13 0.00 2.13 2.15 2.26 2.12
DEV4 3.81 1.31 1.31 1.30 0.00 1.31 1.31 1.30
DEV5 4.27 1.59 1.60 1.57 1.63 0.00 1.58 1.57
DEV6 4.14 1.65 1.55 1.51 1.56 1.55 0.00 1.57
DEV7 3.98 1.37 1.35 1.34 1.37 1.39 1.35 0.00
Total: 0.49 s
SM (bulk copy) unidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=write
(EnabledP2P SM WRITE Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 45.00 45.00 45.00 89.73 89.72 45.45 89.54
DEV1 45.00 x 45.00 45.00 89.50 89.54 89.72 45.44
DEV2 45.00 45.00 x 45.00 45.38 89.45 89.27 89.44
DEV3 45.00 45.00 45.00 x 89.71 45.44 89.39 89.45
DEV4 89.70 89.72 45.44 89.73 x 45.00 45.00 45.00
DEV5 89.51 89.53 89.72 45.44 45.00 x 45.00 45.00
DEV6 45.41 89.25 89.26 89.32 45.00 45.00 x 45.00
DEV7 89.72 45.44 89.34 89.60 45.00 45.00 45.00 x
Total: 32.15 s
SM (bulk copy) unidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=read
(EnabledP2P SM READ Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 46.48 46.48 46.49 93.48 93.41 46.62 93.47
DEV1 46.48 x 46.47 46.49 93.39 93.47 93.50 46.64
DEV2 46.50 46.47 x 46.50 46.64 92.71 92.75 93.52
DEV3 46.49 46.49 46.48 x 93.48 46.62 93.51 93.51
DEV4 93.45 93.34 46.62 93.46 x 46.49 46.49 46.52
DEV5 93.49 93.44 93.47 46.64 46.51 x 46.51 46.51
DEV6 46.61 92.80 92.58 93.48 46.49 46.49 x 46.50
DEV7 93.47 46.62 93.46 93.49 46.50 46.50 46.51 x
Total: 31.11 s
SM (bulk copy) bidirectional write bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=biBw --p2p_direction=write
(EnabledP2P SM WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 84.83 84.85 84.85 169.60 169.57 85.25 169.56
DEV1 84.85 x 84.83 84.84 169.54 169.55 169.56 85.24
DEV2 84.84 84.82 x 84.83 85.25 169.55 169.53 169.54
DEV3 84.84 84.83 84.83 x 169.58 85.26 169.58 169.52
DEV4 169.58 169.56 85.25 169.57 x 84.84 84.85 84.85
DEV5 169.53 169.57 169.57 85.25 84.84 x 84.83 84.83
DEV6 85.25 169.55 169.56 169.57 84.85 84.84 x 84.83
DEV7 169.56 85.23 169.52 169.50 84.85 84.82 84.82 x
Total: 34.18 s
SM (bulk copy) bidirectional read bandwidth
p2pBandwidthLatencyTest --num_gpus=8 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=biBw --p2p_direction=read
(EnabledP2P SM READ Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 x 78.52 78.52 78.53 157.98 157.96 78.87 157.91
DEV1 78.53 x 78.51 78.49 157.81 157.71 157.73 78.84
DEV2 78.50 78.51 x 78.50 78.87 157.83 157.80 157.80
DEV3 78.49 78.50 78.52 x 157.83 78.86 157.75 157.78
DEV4 157.80 157.96 78.88 157.95 x 78.52 78.53 78.50
DEV5 157.75 157.76 157.77 78.84 78.52 x 78.51 78.47
DEV6 78.84 157.61 157.75 157.75 78.51 78.52 x 78.50
DEV7 157.85 78.82 157.78 157.67 78.50 78.47 78.49 x
Total: 36.86 s
SM (bulk copy) unidirectional write latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=bulkCopy --perf=latency --p2p_direction=write
(EnabledP2P SM WRITE) GPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 1.69 1.69 1.83 2.17 2.33 1.69 2.10
DEV1 1.67 0.00 1.66 1.66 2.11 2.19 2.12 1.88
DEV2 1.88 1.65 0.00 1.78 1.64 2.12 2.24 2.11
DEV3 1.55 1.62 1.55 0.00 2.17 1.56 2.14 2.27
DEV4 1.96 2.03 1.54 1.95 0.00 1.64 1.67 1.63
DEV5 2.27 2.22 2.23 1.87 1.76 0.00 1.86 1.86
DEV6 1.65 2.28 2.11 2.11 1.86 1.84 0.00 1.87
DEV7 2.09 1.55 2.17 2.13 1.66 1.64 1.79 0.00
(EnabledP2P SM WRITE) CPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 4.35 1.84 1.83 1.84 1.87 1.85 1.86
DEV1 4.20 0.00 1.83 1.93 1.85 1.85 1.83 1.82
DEV2 4.66 2.17 0.00 2.17 2.25 2.18 2.16 2.15
DEV3 4.68 2.11 2.20 0.00 2.14 2.23 2.18 2.12
DEV4 3.82 1.32 1.32 1.40 0.00 1.32 1.34 1.33
DEV5 3.84 1.32 1.32 1.33 1.31 0.00 1.32 1.38
DEV6 3.99 1.43 1.47 1.50 1.46 1.46 0.00 1.44
DEV7 3.75 1.35 1.38 1.36 1.36 1.34 1.35 0.00
Total: 0.50 s
SM (bulk copy) unidirectional read latency
p2pBandwidthLatencyTest --num_gpus=8 --p2p_type=SM --kernel=bulkCopy --perf=latency --p2p_direction=read
(EnabledP2P SM READ) GPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 1.81 1.66 1.84 2.31 2.32 1.86 2.33
DEV1 1.51 0.00 1.63 1.79 2.12 2.11 2.22 1.64
DEV2 1.51 1.63 0.00 1.65 1.63 2.28 2.15 2.11
DEV3 1.66 1.68 1.70 0.00 2.11 1.87 2.30 2.34
DEV4 2.12 2.12 1.66 2.12 0.00 1.83 1.87 1.87
DEV5 1.96 2.03 2.09 1.62 1.62 0.00 1.65 1.72
DEV6 1.48 2.03 2.09 2.14 1.65 1.68 0.00 1.82
DEV7 2.12 1.67 2.34 2.11 1.83 1.96 1.88 0.00
(EnabledP2P SM READ) CPU P2P Transmission Latency Matrix (US)
DEV0 DEV1 DEV2 DEV3 DEV4 DEV5 DEV6 DEV7
DEV0 0.00 4.22 1.56 1.56 1.60 1.54 1.60 1.54
DEV1 4.13 0.00 1.64 1.58 1.61 1.58 1.61 1.62
DEV2 3.77 1.33 0.00 1.33 1.33 1.37 1.31 1.32
DEV3 3.69 1.33 1.32 0.00 1.31 1.35 1.34 1.32
DEV4 4.70 2.17 2.31 2.20 0.00 2.25 2.20 2.19
DEV5 4.69 2.19 2.19 2.11 2.31 0.00 2.16 2.16
DEV6 4.38 1.86 1.89 1.85 1.84 1.83 0.00 1.85
DEV7 4.61 2.17 2.16 2.15 2.16 2.15 2.16 0.00
Total: 0.49 s
3.2 Multi-node (two nodes, four GPUs)
3.2.1 Physical topology

[Figure: the two-node, four-GPU topology used in the examples below]
3.2.2 Measured data
Cross-node tests require specifying the IP addresses of the hosts running the p2p test. They can be given in a hosts.cfg file whose path is passed to mpi at runtime via the --hostfile argument. Example hosts.cfg contents:
# usage: mpirun --hostfile hosts.cfg
# tell mpi the hosts' ip addresses and slots of each host
# format
# $ip_address slots=$slots
# e.g.
# 30.21.220.4 slots=2
# 30.21.220.5 slots=2
SM (cache copy) unidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=uniBw --p2p_direction=write
(EnabledP2P CE WRITE Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 16.91 33.67 16.95
HOST1:DEV0 16.93 x 16.94 33.66
HOST2:DEV0 33.60 16.94 x 16.95
HOST3:DEV0 16.94 33.66 16.94 x
Total: 18.72 s
SM (cache copy) unidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=uniBw --p2p_direction=read
(EnabledP2P CE READ Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 18.95 37.92 18.98
HOST1:DEV0 18.96 x 18.98 37.92
HOST2:DEV0 37.86 18.98 x 18.98
HOST3:DEV0 18.97 37.92 18.98 x
Total: 18.72 s
SM (cache copy) bidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=biBw --p2p_direction=write
(EnabledP2P CE WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 31.63 62.77 31.63
HOST1:DEV0 31.63 x 31.63 62.74
HOST2:DEV0 62.77 31.63 x 31.63
HOST3:DEV0 31.63 62.74 31.63 x
Total: 25.58 s
SM (cache copy) bidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=cacheCopy --perf=biBw --p2p_direction=read
(EnabledP2P CE READ Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 25.58 50.73 25.58
HOST1:DEV0 25.58 x 25.61 50.65
HOST2:DEV0 50.73 25.61 x 25.57
HOST3:DEV0 25.58 50.63 25.58 x
Total: 30.58 s
SM (cache copy) unidirectional write latency
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=cacheCopy --perf=latency --p2p_direction=write
(EnabledP2P CE WRITE) GPU P2P Transmission Latency Matrix (US)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
DEV0 0.00 37.30 5.58 5.57
DEV0 41.91 0.00 4.90 4.90
DEV0 37.09 4.68 0.00 4.27
DEV0 36.07 4.43 4.41 0.00
Total: 0.47 s
SM (cache copy) unidirectional read latency
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=cacheCopy --perf=latency --p2p_direction=read
(EnabledP2P CE READ) GPU P2P Transmission Latency Matrix (US)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
DEV0 0.00 42.19 5.77 5.77
DEV0 44.16 0.00 6.42 6.81
DEV0 36.71 4.13 0.00 4.13
DEV0 36.35 4.77 4.74 0.00
Total: 0.44 s
SM (non-cache copy) unidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=uniBw --p2p_direction=write
(EnabledP2P CE WRITE Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 16.91 33.74 16.95
HOST1:DEV0 16.94 x 16.94 33.73
HOST2:DEV0 33.68 16.94 x 16.95
HOST3:DEV0 16.94 33.74 16.94 x
Total: 20.79 s
SM (non-cache copy) unidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=uniBw --p2p_direction=read
(EnabledP2P CE READ Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 18.95 37.96 18.98
HOST1:DEV0 18.96 x 18.98 37.96
HOST2:DEV0 37.89 18.89 x 18.98
HOST3:DEV0 18.97 37.96 18.99 x
Total: 18.72 s
SM (non-cache copy) bidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=biBw --p2p_direction=write
(EnabledP2P CE WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 31.63 62.91 31.63
HOST1:DEV0 31.63 x 31.63 62.88
HOST2:DEV0 62.90 31.63 x 31.63
HOST3:DEV0 31.63 62.89 31.63 x
Total: 25.59 s
SM (non-cache copy) bidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=ncCopy --perf=biBw --p2p_direction=read
(EnabledP2P CE READ Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 25.58 50.72 25.58
HOST1:DEV0 25.58 x 25.61 50.64
HOST2:DEV0 50.73 25.61 x 25.58
HOST3:DEV0 25.58 50.64 25.58 x
Total: 30.58 s
SM (non-cache copy) unidirectional write latency
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=ncCopy --perf=latency --p2p_direction=write
(EnabledP2P CE WRITE)
GPU P2P Transmission Latency Matrix (US)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
DEV0 0.00 39.61 5.76 5.86
DEV1 40.75 0.00 4.86 4.97
DEV0 35.78 4.74 0.00 4.71
DEV1 34.23 3.90 3.87 0.00
Total: 0.44 s
SM (non-cache copy) unidirectional read latency
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=ncCopy --perf=latency --p2p_direction=read
(EnabledP2P CE READ)
GPU P2P Transmission Latency Matrix (US)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
DEV0 0.00 40.95 5.57 5.55
DEV1 41.17 0.00 5.28 5.28
DEV0 35.48 4.68 0.00 4.53
DEV1 36.23 4.34 4.34 0.00
Total: 0.43 s
SM (bulk copy) unidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=write
(EnabledP2P CE WRITE Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 23.33 46.47 23.39
HOST1:DEV0 23.36 x 23.39 46.47
HOST2:DEV0 46.38 23.39 x 23.39
HOST3:DEV0 23.36 46.47 23.39 x
Total: 15.12 s
SM (bulk copy) unidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=read
(EnabledP2P CE READ Unidirectional)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (2147483648 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 24.05 48.22 24.11
HOST1:DEV0 24.08 x 24.11 48.22
HOST2:DEV0 48.11 24.11 x 24.11
HOST3:DEV0 24.08 48.22 24.11 x
Total: 14.68 s
SM (bulk copy) bidirectional write bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=biBw --p2p_direction=write
(EnabledP2P CE WRITE Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 44.12 87.59 44.13
HOST1:DEV0 44.12 x 44.17 87.55
HOST2:DEV0 87.60 44.17 x 44.12
HOST3:DEV0 44.13 87.55 44.12 x
Total: 18.99 s
SM (bulk copy) bidirectional read bandwidth
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --num_elems=536870912 --p2p_type=SM --kernel=bulkCopy --perf=biBw --p2p_direction=read
(EnabledP2P CE READ Bidirectional)
P2P Bidirectional Bandwidth Transport Bandwidth Matrix (2147483648 *2 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 41.77 83.52 41.75
HOST1:DEV0 41.77 x 41.80 83.43
HOST2:DEV0 83.51 41.80 x 41.75
HOST3:DEV0 41.75 83.42 41.77 x
Total: 19.90 s
SM (bulk copy) unidirectional write bandwidth (default transfer size)
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=write
(EnabledP2P CE WRITE)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (40000000 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 20.49 46.23 23.36
HOST1:DEV0 20.90 x 23.36 46.31
HOST2:DEV0 38.20 23.37 x 23.36
HOST3:DEV0 22.04 46.28 23.37 x
Total: 0.70 s
SM (bulk copy) unidirectional read bandwidth (default transfer size)
mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg -mca btl_tcp_if_include ens81f1 p2pBandwidthLatencyTest --is_multi_proc=1 --p2p_type=SM --kernel=bulkCopy --perf=uniBw --p2p_direction=read
(EnabledP2P CE READ)
P2P Unidirectional Bandwidth Transport Bandwidth Matrix (40000000 Bytes) (GB/s)
HOST0:DEV0 HOST1:DEV0 HOST2:DEV0 HOST3:DEV0
HOST0:DEV0 x 20.54 46.30 23.36
HOST1:DEV0 20.85 x 23.36 46.23
HOST2:DEV0 38.43 23.37 x 23.36
HOST3:DEV0 21.09 46.27 23.37 x
Total: 0.69 s
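A note on the byte counts in the matrix headers: the tool transmits 32-bit integers, so the transfer size is `num_elems × 4` bytes. The default `num_elems=10000000` gives the 40000000-byte runs, and `--num_elems=536870912` gives the 2147483648-byte (2 GiB) runs. A minimal sketch, assuming only this 4-byte relationship, of picking `num_elems` for a target size and parsing a matrix row from the output:

```python
# Helpers for p2pBandwidthLatencyTest runs. The tool transmits 32-bit
# integers, so bytes = num_elems * 4, matching the matrix headers:
# 10000000 elems -> 40000000 bytes, 536870912 elems -> 2147483648 bytes.

INT_BYTES = 4  # sizeof a transmitted integer


def num_elems_for(target_bytes: int) -> int:
    """--num_elems value that transfers exactly target_bytes."""
    assert target_bytes % INT_BYTES == 0, "target must be a multiple of 4"
    return target_bytes // INT_BYTES


def parse_matrix_row(line: str):
    """Split one matrix row into (label, values).

    The diagonal 'x' in bandwidth matrices becomes None; latency
    matrices, which print 0.00 on the diagonal, parse unchanged.
    """
    label, *cells = line.split()
    return label, [None if c == "x" else float(c) for c in cells]


if __name__ == "__main__":
    print(num_elems_for(2 * 1024**3))  # the 2 GiB runs above
    print(parse_matrix_row("HOST0:DEV0 x 23.33 46.47 23.39"))
```

This makes it easy to check that an intended transfer size will be reported exactly in the matrix header before launching a long sweep.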
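The runs above differ only in `--kernel`, `--perf`, and `--p2p_direction`, so the whole sweep can be generated rather than typed by hand. A sketch using only the flags from the `--help` output in section 2; the `mpirun` prefix (hostfile path, `btl_tcp_if_include` interface name) is site-specific and must be adjusted to your cluster:

```python
# Generate mpirun command lines for the SM-kernel P2P sweep shown above.
# Flag names follow `p2pBandwidthLatencyTest --help`; the mpirun prefix
# (hostfile, network interface) is an example from this guide's test
# environment and will differ per site.
from itertools import product

MPI_PREFIX = ("mpirun -np 4 -npernode 2 --hostfile ./hosts.cfg "
              "-mca btl_tcp_if_include ens81f1 ")


def sweep(kernels=("cacheCopy", "ncCopy", "bulkCopy"),
          perfs=("uniBw", "biBw", "latency"),
          directions=("write", "read"),
          bw_elems=536870912):
    """Return one command line per (kernel, perf, direction) combination."""
    cmds = []
    for kernel, perf, direction in product(kernels, perfs, directions):
        cmd = MPI_PREFIX + "p2pBandwidthLatencyTest --is_multi_proc=1"
        if perf != "latency":  # latency runs keep the default num_elems
            cmd += f" --num_elems={bw_elems}"
        cmd += (f" --p2p_type=SM --kernel={kernel}"
                f" --perf={perf} --p2p_direction={direction}")
        cmds.append(cmd)
    return cmds


if __name__ == "__main__":
    for cmd in sweep():
        print(cmd)
```

Piping the printed lines through a shell (or `subprocess.run`) reproduces the eighteen SM-kernel measurements in this section in one pass.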