PCCL: sailbandwidth User Guide (v2.1)
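1. Overview
This document describes how to use sailbandwidth, the D2D/H2D/D2H bandwidth test tool in pccl_tools. sailbandwidth is the PPU port of nvbandwidth, NVIDIA's open-source bandwidth test tool. Its interface is compatible with nvbandwidth v0.8, and it has been evaluated on the 真武810E platform and optimized for PPU hardware characteristics. The upstream source code is available at: https://github.com/NVIDIA/nvbandwidth
The following sections describe its parameters and how to run common test cases. The data values shown are for reference only; actual performance depends on the machine environment, the bandwidth configuration, and the PPU SDK release version.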
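2. Usage
2.1 Using the prebuilt package
PPU SDK v2.1.0 release and later
Starting with the PPU SDK v2.1.0 release, sailbandwidth is no longer part of the SDK and is released independently at https://art-pub.eng.t-head.cn/artifactory/generic-local/SAIL/v2.1.0/COMM. Within the comm_tools package it is located at: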
// Single-process version, for single-node tests
comm_tools/single_process/sailbandwidth
// Multi-process version, for cross-node tests
comm_tools/multi_process/sailbandwidth
Before the PPU SDK v2.1.0 release
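The sailbandwidth tool ships together with the PPU SDK and is located in the SDK directory at:
// Single-process version, for single-node tests
PPU_SDK/pccl_tools/sailbandwidth
// Multi-process version, for cross-node tests, supported since the PPU SDK v2.0.0 release
PPU_SDK/pccl_tools/mp/sailbandwidth
Note: comm_tools cannot be used with PPU SDK v2.0 or earlier versions; doing so results in an error.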
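The two layouts described in this section can be summarized as a small lookup. The sketch below is illustrative only: the helper names `parse_version` and `sailbandwidth_path` are assumptions made for this example and are not part of the PPU SDK. The paths and version boundaries are the ones listed in this section.

```python
def parse_version(version: str) -> tuple:
    """Turn 'v2.1.0' or '2.1' into a comparable tuple such as (2, 1, 0)."""
    nums = [int(part) for part in version.lstrip("vV").split(".")]
    while len(nums) < 3:
        nums.append(0)
    return tuple(nums[:3])


def sailbandwidth_path(sdk_version: str, multi_process: bool = False) -> str:
    """Hypothetical helper: relative location of sailbandwidth for a PPU SDK version."""
    version = parse_version(sdk_version)
    if version >= (2, 1, 0):
        # From v2.1.0 on, the tool is released separately in the comm_tools package.
        sub_dir = "multi_process" if multi_process else "single_process"
        return f"comm_tools/{sub_dir}/sailbandwidth"
    if multi_process:
        # The multi-process build is only shipped from PPU SDK v2.0.0 onward.
        if version < (2, 0, 0):
            raise ValueError("multi-process sailbandwidth requires PPU SDK v2.0.0 or later")
        return "PPU_SDK/pccl_tools/mp/sailbandwidth"
    return "PPU_SDK/pccl_tools/sailbandwidth"


print(sailbandwidth_path("v2.1.0"))
print(sailbandwidth_path("v2.0.0", multi_process=True))
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "2.10.0" would sort before "2.9.0".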
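2.2 Building from source
sailbandwidth is located in the pccl-tests directory. It is currently open-sourced internally within Alibaba Group; those who have signed the T-Head (平头哥) NDA can apply for access to the source code.
The source code is published at:
https://code.alibaba-inc.com/ppu_open_source/pccl-tests
After sourcing the SDK, enter the sailbandwidth directory and build with the following commands: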
// Build the single-process version
./build.sh --use_sdk
// Build the multi-process version
./build.sh --use_sdk --use_mpi
3. Parameters
Run ./sailbandwidth -h to list all parameters:
sailbandwidth Version: v2.1.0
sailbandwidth CLI:
// Parameters compatible with nvbandwidth:
-h [ --help ]                     Produce help message
-b [ --bufferSize ] arg (=512)    Memcpy buffer size in MiB
-l [ --list ]                     List available testcases
-t [ --testcase ] arg             Testcase(s) to run (by name or index)
-p [ --testcasePrefixes ] arg     Testcase(s) to run (by prefix))
-v [ --verbose ]                  Verbose output
-s [ --skipVerification ]         Skips data verification after copy
-d [ --disableAffinity ]          Disable automatic CPU affinity control
-i [ --testSamples ] arg (=3)     Iterations of the benchmark
-m [ --useMean ]                  Use mean instead of median for results
-j [ --json ]                     Print output in json format instead of plain text.
// Parameters added by sailbandwidth:
--useNormalCopy                   Use normal copy for sm test
--nBlocks arg (=-1)               nBlocks for sm bulk copy test
--nThreads arg (=-1)              nThreads for sm bulk copy test
--isSingleDie                     Use single die for ce test
--testLocalCopy                   Test local copy in D2D Tests
--displayUR                       Show utilization ratio matrixs in D2D Tests
--disableVm                       Disable virtual mode
4. Examples
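Each test case in this section prints a matrix of per-pair values followed by a SUM line. As a rough aid for post-processing such logs, the sketch below parses one plain-text result block and compares the SUM line with the sum of the cells. The function name `parse_matrix` and the parsing rules are assumptions based on the output shown in this document, not an official sailbandwidth interface. It handles a single matrix per block; bidirectional outputs with several matrices would need to be split first. The sample is the host_to_device_memcpy_ce output from section 4.1.1.

```python
SAMPLE = """\
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 44.74 45.08 44.65 44.98 45.17 45.05 45.00 44.95
SUM host_to_device_memcpy_ce 359.62
"""


def parse_matrix(output: str):
    """Parse one result block into (rows, reported_sum).

    rows maps a row index to a list of floats, with None for N/A cells.
    """
    rows = {}
    reported_sum = None
    n_cols = None
    for line in output.strip().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "SUM":
            reported_sum = float(tokens[-1])
        elif n_cols is None:
            # The column header is the first line made only of integers.
            if all(tok.isdigit() for tok in tokens):
                n_cols = len(tokens)
        elif len(tokens) == n_cols + 1 and tokens[0].isdigit():
            rows[int(tokens[0])] = [
                None if tok == "N/A" else float(tok) for tok in tokens[1:]
            ]
    return rows, reported_sum


rows, reported = parse_matrix(SAMPLE)
computed = sum(v for row in rows.values() for v in row if v is not None)
print(f"computed={computed:.2f} reported={reported:.2f}")
```

For machine consumption, the -j option listed in section 3 is likely a more robust choice than parsing plain text; its JSON schema is not described in this document, so it is not used here.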
4.1 Single-process cases
4.1.1 Copy Engine (CE) Copy
D2D unidirectional Read bandwidth test
./sailbandwidth -t device_to_device_memcpy_read_ce -s true -i 1
Running device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 48.87 47.22 47.69 48.43 45.11 45.07 48.22
1 45.19 N/A 48.92 46.46 45.00 45.28 47.31 47.39
2 48.88 45.62 N/A 45.38 48.48 47.57 48.86 46.14
3 46.03 44.84 46.26 N/A 45.27 48.70 46.56 45.29
4 45.36 45.93 44.87 48.00 N/A 46.31 44.95 44.84
5 48.88 47.64 45.81 48.76 48.21 N/A 45.84 44.99
6 46.01 48.87 47.93 48.47 45.44 45.17 N/A 45.59
7 45.57 46.83 48.56 47.71 47.64 47.62 48.88 N/A
SUM device_to_device_memcpy_read_ce 2620.72
D2D bidirectional Read bandwidth test
./sailbandwidth -t device_to_device_bidirectional_memcpy_read_ce -s true -i 1
Running device_to_device_bidirectional_memcpy_read_ce.
memcpy CE GPU(row) <-> GPU(column) Read1 bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 46.24 44.67 44.25 47.57 44.49 44.04 44.92
1 43.68 N/A 44.83 44.93 46.02 44.30 44.33 46.76
2 44.35 44.11 N/A 45.45 47.03 44.16 44.42 45.41
3 46.48 44.96 44.58 N/A 46.42 44.63 44.71 45.29
4 44.46 46.86 44.30 45.60 N/A 46.47 46.58 44.13
5 44.28 45.28 44.04 44.03 47.42 N/A 46.61 43.64
6 45.59 44.35 44.43 45.05 46.14 46.56 N/A 44.37
7 46.10 44.40 44.82 45.55 45.39 45.78 45.91 N/A
SUM device_to_device_bidirectional_memcpy_read_ce_read1 2531.15
memcpy CE GPU(row) <-> GPU(column) Read2 bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 44.81 44.70 46.81 47.57 44.93 44.07 44.08
1 43.71 N/A 44.84 44.59 44.21 44.30 44.34 44.13
2 44.11 44.12 N/A 44.96 47.00 46.93 45.71 44.83
3 45.84 45.40 44.59 N/A 46.43 45.21 44.11 45.46
4 44.28 46.42 44.78 44.64 N/A 46.53 44.19 45.67
5 44.29 44.35 44.49 45.62 47.42 N/A 46.30 43.64
6 45.59 44.38 44.44 44.06 44.59 45.74 N/A 46.23
7 44.45 44.40 44.87 45.18 45.17 45.56 45.89 N/A
SUM device_to_device_bidirectional_memcpy_read_ce_read2 2524.94
memcpy CE GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 91.06 89.38 91.06 95.14 89.42 88.11 89.00
1 87.38 N/A 89.67 89.52 90.23 88.60 88.67 90.90
2 88.45 88.22 N/A 90.40 94.03 91.09 90.13 90.24
3 92.32 90.36 89.17 N/A 92.85 89.84 88.82 90.75
4 88.74 93.27 89.08 90.24 N/A 93.00 90.77 89.80
5 88.57 89.63 88.53 89.65 94.84 N/A 92.92 87.27
6 91.18 88.73 88.87 89.11 90.73 92.30 N/A 90.59
7 90.56 88.80 89.69 90.73 90.56 91.35 91.80 N/A
SUM device_to_device_bidirectional_memcpy_read_ce_total 5056.09
D2D unidirectional Write bandwidth test
./sailbandwidth -t device_to_device_memcpy_write_ce -s true -i 1
Running device_to_device_memcpy_write_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 46.11 46.97 48.07 96.90 97.65 48.60 95.29
1 46.44 N/A 46.67 46.85 94.09 94.44 94.85 47.58
2 45.82 46.94 N/A 48.38 48.90 95.74 94.97 93.26
3 46.10 48.78 45.99 N/A 90.39 46.81 97.31 95.06
4 93.30 90.51 46.70 97.19 N/A 45.79 46.20 48.26
5 91.44 90.75 90.36 44.99 45.77 N/A 46.84 48.17
6 46.54 94.74 95.56 96.83 48.88 47.58 N/A 46.38
7 90.39 44.84 90.40 90.32 44.92 44.99 45.06 N/A
SUM device_to_device_memcpy_write_ce 3748.65
D2D bidirectional Write bandwidth test
./sailbandwidth -t device_to_device_bidirectional_memcpy_write_ce -s true -i 1
Running device_to_device_bidirectional_memcpy_write_ce.
memcpy CE GPU(row) <-> GPU(column) Write1 bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 46.48 45.11 45.45 86.87 87.96 45.57 91.42
1 45.89 N/A 44.44 45.29 89.64 90.34 89.51 44.58
2 44.10 44.29 N/A 45.36 44.98 89.66 87.56 88.32
3 44.54 44.11 44.09 N/A 89.57 43.92 86.25 87.00
4 88.64 87.24 46.77 90.35 N/A 45.24 46.75 44.01
5 86.21 90.83 87.64 45.94 46.71 N/A 46.74 44.32
6 46.97 89.54 87.69 89.79 44.17 44.28 N/A 45.51
7 87.95 43.98 86.17 91.33 46.82 44.05 44.06 N/A
SUM device_to_device_bidirectional_memcpy_write_ce_write1 3571.98
memcpy CE GPU(row) <-> GPU(column) Write2 bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 44.63 43.96 43.99 86.89 87.26 45.45 90.44
1 45.38 N/A 44.47 44.75 88.64 88.19 87.81 44.05
2 44.10 44.29 N/A 44.21 44.99 89.75 88.71 90.57
3 44.04 44.41 45.21 N/A 87.08 44.02 90.27 88.86
4 88.42 86.38 46.43 89.49 N/A 44.05 46.85 44.01
5 86.20 86.07 86.57 44.82 46.55 N/A 46.77 45.17
6 46.97 89.57 87.71 87.19 44.16 44.24 N/A 45.48
7 88.15 44.84 88.43 91.16 46.71 46.13 46.43 N/A
SUM device_to_device_bidirectional_memcpy_write_ce_write2 3561.37
memcpy CE GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 91.11 89.07 89.43 173.76 175.22 91.02 181.85
1 91.26 N/A 88.91 90.04 178.29 178.53 177.32 88.63
2 88.20 88.59 N/A 89.58 89.97 179.41 176.26 178.89
3 88.58 88.52 89.30 N/A 176.66 87.94 176.52 175.86
4 177.06 173.62 93.21 179.83 N/A 89.28 93.60 88.02
5 172.41 176.90 174.21 90.76 93.26 N/A 93.51 89.49
6 93.94 179.11 175.41 176.98 88.33 88.52 N/A 90.99
7 176.10 88.81 174.59 182.49 93.53 90.18 90.49 N/A
SUM device_to_device_bidirectional_memcpy_write_ce_total 7133.35
H2D unidirectional bandwidth test
./sailbandwidth -t host_to_device_memcpy_ce -s true -i 1
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 44.74 45.08 44.65 44.98 45.17 45.05 45.00 44.95
SUM host_to_device_memcpy_ce 359.62
H2D bidirectional bandwidth test
./sailbandwidth -t host_to_device_bidirectional_memcpy_ce -s true -i 1
Running host_to_device_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 34.47 34.73 34.61 34.44 34.76 34.62 34.78 34.65
SUM host_to_device_bidirectional_memcpy_ce 277.06
D2H unidirectional bandwidth test
./sailbandwidth -t device_to_host_memcpy_ce -s true -i 1
Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 56.86 56.76 56.75 56.75 56.75 56.76 56.81 56.77
SUM device_to_host_memcpy_ce 454.21
D2H bidirectional bandwidth test
./sailbandwidth -t device_to_host_bidirectional_memcpy_ce -s true -i 1
Running device_to_host_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 41.58 41.66 41.60 41.64 41.55 41.67 41.68 41.64
SUM device_to_host_bidirectional_memcpy_ce 333.02
ALL2H unidirectional bandwidth test
./sailbandwidth -t all_to_host_memcpy_ce -s true -i 1
Running all_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 35.73 35.74 35.36 35.28 34.50 34.60 34.70 34.44
SUM all_to_host_memcpy_ce 280.34
ALL2H bidirectional bandwidth test
./sailbandwidth -t all_to_host_bidirectional_memcpy_ce -s true -i 1
Running all_to_host_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 32.84 32.72 32.92 32.70 32.13 32.22 32.18 32.10
SUM all_to_host_bidirectional_memcpy_ce 259.81
H2ALL unidirectional bandwidth test
./sailbandwidth -t host_to_all_memcpy_ce -s true -i 1
Running host_to_all_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 41.03 41.59 41.46 41.38 41.54 41.41 41.61 41.94
SUM host_to_all_memcpy_ce 331.97
H2ALL bidirectional bandwidth test
./sailbandwidth -t host_to_all_bidirectional_memcpy_ce -s true -i 1
Running host_to_all_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 13.36 13.44 13.40 13.43 13.33 13.33 13.36 13.34
SUM host_to_all_bidirectional_memcpy_ce 107.01
ALL2D unidirectional Read bandwidth test
./sailbandwidth -t all_to_one_read_ce -s true -i 1
Running all_to_one_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 47.85 51.28 49.56 49.25 48.04 49.41 50.00 46.47
SUM all_to_one_read_ce 391.88
ALL2D unidirectional Write bandwidth test
./sailbandwidth -t all_to_one_write_ce -s true -i 1
Running all_to_one_write_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 125.83 125.11 126.47 125.63 131.94 131.80 139.10 131.74
SUM all_to_one_write_ce 1037.62
D2ALL unidirectional Read bandwidth test
./sailbandwidth -t one_to_all_read_ce -s true -i 1
Running one_to_all_read_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 124.36 125.04 125.84 126.31 133.50 132.97 140.04 131.60
SUM one_to_all_read_ce 1039.66
D2ALL unidirectional Write bandwidth test
./sailbandwidth -t one_to_all_write_ce -s true -i 1
Running one_to_all_write_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 105.36 106.22 103.95 107.19 113.33 112.38 119.61 112.42
SUM one_to_all_write_ce 880.45
Local Copy bandwidth test
./sailbandwidth -t device_local_copy -s true -i 1
Running device_local_copy.
memcpy local GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 1055.36 1055.80 1056.31 1056.58 1055.00 1056.70 1055.45 1055.33
SUM device_local_copy 8446.53
4.1.2 Streaming Multiprocessor (SM) Copy
D2D unidirectional Read bandwidth test
./sailbandwidth -t device_to_device_memcpy_read_sm -s true -i 1
Running device_to_device_memcpy_read_sm.
memcpy SM GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 48.17 48.17 48.17 96.31 96.31 48.17 96.31
1 48.17 N/A 48.17 48.17 96.31 96.31 96.31 48.17
2 48.17 48.17 N/A 48.17 48.17 96.31 96.31 96.31
3 48.17 48.17 48.17 N/A 96.31 48.17 96.31 96.31
4 96.31 96.31 48.17 96.31 N/A 48.17 48.17 48.17
5 96.31 96.31 96.31 48.17 48.17 N/A 48.17 48.17
6 48.17 96.31 96.31 96.31 48.17 48.17 N/A 48.17
7 96.31 48.17 96.31 96.31 48.17 48.17 48.17 N/A
SUM device_to_device_memcpy_read_sm 3852.87
D2D bidirectional Read bandwidth test
./sailbandwidth -t device_to_device_bidirectional_memcpy_read_sm -s true -i 1
Running device_to_device_bidirectional_memcpy_read_sm.
memcpy SM GPU(row) <-> GPU(column) Read1 bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 41.86 41.86 41.86 83.70 83.70 41.86 83.70
1 41.86 N/A 41.86 41.86 83.70 83.70 83.70 41.86
2 41.86 41.86 N/A 41.86 41.86 83.70 83.70 83.70
3 41.86 41.86 41.86 N/A 83.70 41.86 83.69 83.69
4 83.70 83.70 41.86 83.70 N/A 41.86 41.86 41.86
5 83.70 83.70 83.70 41.86 41.86 N/A 41.86 41.86
6 41.86 83.70 83.69 83.70 41.86 41.86 N/A 41.86
7 83.70 41.86 83.70 83.70 41.86 41.86 41.86 N/A
SUM device_to_device_bidirectional_memcpy_read_sm_read1 3348.32
memcpy SM GPU(row) <-> GPU(column) Read2 bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 41.86 41.86 41.86 83.70 83.70 41.86 83.71
1 41.86 N/A 41.86 41.86 83.70 83.70 83.70 41.86
2 41.86 41.86 N/A 41.86 41.86 83.70 83.70 83.70
3 41.86 41.86 41.86 N/A 83.70 41.86 83.70 83.70
4 83.70 83.70 41.86 83.70 N/A 41.86 41.86 41.86
5 83.70 83.70 83.70 41.86 41.86 N/A 41.86 41.86
6 41.86 83.70 83.70 83.70 41.86 41.86 N/A 41.86
7 83.70 41.86 83.70 83.70 41.86 41.86 41.86 N/A
SUM device_to_device_bidirectional_memcpy_read_sm_read2 3348.37
memcpy SM GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 83.72 83.72 83.72 167.40 167.40 83.72 167.41
1 83.72 N/A 83.72 83.72 167.40 167.40 167.40 83.72
2 83.72 83.72 N/A 83.72 83.72 167.40 167.39 167.39
3 83.72 83.72 83.72 N/A 167.40 83.72 167.39 167.39
4 167.40 167.40 83.72 167.40 N/A 83.72 83.72 83.72
5 167.39 167.40 167.40 83.72 83.72 N/A 83.72 83.72
6 83.72 167.40 167.39 167.40 83.72 83.72 N/A 83.72
7 167.41 83.72 167.40 167.40 83.72 83.72 83.72 N/A
SUM device_to_device_bidirectional_memcpy_read_sm_total 6696.68
D2D unidirectional Write bandwidth test
./sailbandwidth -t device_to_device_memcpy_write_sm -s true -i 1
Running device_to_device_memcpy_write_sm.
memcpy SM GPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 46.80 46.80 46.80 93.25 93.23 46.80 93.25
1 46.80 N/A 46.80 46.80 93.24 93.25 93.25 46.80
2 46.80 46.80 N/A 46.80 46.80 93.25 93.24 93.25
3 46.80 46.80 46.80 N/A 93.25 46.80 93.23 93.24
4 93.24 93.26 46.80 93.24 N/A 46.80 46.80 46.80
5 93.24 93.25 93.24 46.80 46.80 N/A 46.80 46.80
6 46.80 93.25 93.25 93.23 46.80 46.80 N/A 46.80
7 93.25 46.80 93.25 93.25 46.80 46.80 46.80 N/A
SUM device_to_device_memcpy_write_sm 3735.50
D2D bidirectional Write bandwidth test
./sailbandwidth -t device_to_device_bidirectional_memcpy_write_sm -s true -i 1
Running device_to_device_bidirectional_memcpy_write_sm.
memcpy SM GPU(row) <-> GPU(column) Write1 bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 45.27 45.27 45.26 89.99 89.99 45.27 89.99
1 45.26 N/A 45.26 45.27 89.99 89.99 89.98 45.27
2 45.26 45.27 N/A 45.27 45.26 89.99 89.99 89.99
3 45.26 45.26 45.26 N/A 89.99 45.26 89.98 89.99
4 89.99 89.99 45.26 89.97 N/A 45.27 45.27 45.27
5 89.98 89.98 89.98 45.26 45.26 N/A 45.27 45.27
6 45.26 90.00 89.99 89.97 45.27 45.27 N/A 45.27
7 89.98 45.27 89.99 89.99 45.27 45.27 45.27 N/A
SUM device_to_device_bidirectional_memcpy_write_sm_write1 3608.14
memcpy SM GPU(row) <-> GPU(column) Write2 bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 45.26 45.26 45.27 89.99 89.98 45.26 89.99
1 45.26 N/A 45.26 45.27 89.99 89.99 89.98 45.27
2 45.26 45.26 N/A 45.27 45.27 89.98 90.02 89.97
3 45.27 45.26 45.27 N/A 89.97 45.26 89.98 89.99
4 89.99 89.97 45.26 89.99 N/A 45.26 45.26 45.26
5 89.99 89.99 89.98 45.27 45.27 N/A 45.27 45.27
6 45.27 89.98 89.97 89.99 45.27 45.26 N/A 45.26
7 89.98 45.27 90.00 89.98 45.27 45.27 45.27 N/A
SUM device_to_device_bidirectional_memcpy_write_sm_write2 3608.13
memcpy SM GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 N/A 90.53 90.53 90.53 179.98 179.97 90.53 179.98
1 90.53 N/A 90.53 90.53 179.98 179.98 179.96 90.53
2 90.53 90.53 N/A 90.53 90.53 179.98 180.00 179.96
3 90.53 90.53 90.53 N/A 179.96 90.53 179.96 179.99
4 179.98 179.97 90.53 179.96 N/A 90.53 90.53 90.53
5 179.97 179.97 179.97 90.53 90.53 N/A 90.53 90.53
6 90.53 179.98 179.97 179.95 90.53 90.53 N/A 90.53
7 179.96 90.53 179.99 179.96 90.53 90.53 90.53 N/A
SUM device_to_device_bidirectional_memcpy_write_sm_total 7216.27
H2D unidirectional bandwidth test
./sailbandwidth -t host_to_device_memcpy_sm -s true -i 1
Running host_to_device_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 32.31 32.65 32.26 32.47 32.65 32.58 32.45 32.60
SUM host_to_device_memcpy_sm 259.96
H2D bidirectional bandwidth test
./sailbandwidth -t host_to_device_bidirectional_memcpy_sm -s true -i 1
Running host_to_device_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 27.07 27.37 27.21 27.25 27.28 27.32 27.30 27.31
SUM host_to_device_bidirectional_memcpy_sm 218.12
D2H unidirectional bandwidth test
./sailbandwidth -t device_to_host_memcpy_sm -s true -i 1
Running device_to_host_memcpy_sm.
memcpy SM CPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 52.69 52.60 52.62 52.61 52.60 52.61 52.61 52.61
SUM device_to_host_memcpy_sm 420.95
D2H bidirectional bandwidth test
./sailbandwidth -t device_to_host_bidirectional_memcpy_sm -s true -i 1
Running device_to_host_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 27.07 27.37 27.18 27.23 27.31 27.36 27.33 27.32
SUM device_to_host_bidirectional_memcpy_sm 218.16
ALL2H unidirectional bandwidth test
./sailbandwidth -t all_to_host_memcpy_sm -s true -i 1
Running all_to_host_memcpy_sm.
memcpy SM CPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 34.59 34.93 34.91 34.81 34.50 34.69 34.72 34.51
SUM all_to_host_memcpy_sm 277.68
ALL2H bidirectional bandwidth test
./sailbandwidth -t all_to_host_bidirectional_memcpy_sm -s true -i 1
Running all_to_host_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 25.33 25.63 25.48 25.51 25.57 25.48 25.50 25.58
SUM all_to_host_bidirectional_memcpy_sm 204.07
H2ALL unidirectional bandwidth test
./sailbandwidth -t host_to_all_memcpy_sm -s true -i 1
Running host_to_all_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 30.99 31.29 31.04 31.20 31.32 31.22 31.19 31.31
SUM host_to_all_memcpy_sm 249.56
H2ALL bidirectional bandwidth test
./sailbandwidth -t host_to_all_bidirectional_memcpy_sm -s true -i 1
Running host_to_all_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 26.19 26.41 26.26 26.39 26.44 26.43 26.40 26.46
SUM host_to_all_bidirectional_memcpy_sm 210.98
ALL2D unidirectional Read bandwidth test
./sailbandwidth -t all_to_one_read_sm -s true -i 1
Running all_to_one_read_sm.
memcpy SM GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 59.86 59.52 58.54 59.96 58.62 59.74 57.47 58.42
SUM all_to_one_read_sm 472.13
ALL2D unidirectional Write bandwidth test
./sailbandwidth -t all_to_one_write_sm -s true -i 1
Running all_to_one_write_sm.
memcpy SM GPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 117.35 117.35 117.31 117.35 124.83 124.83 131.07 124.83
SUM all_to_one_write_sm 974.92
D2ALL unidirectional Read bandwidth test
./sailbandwidth -t one_to_all_read_sm -s true -i 1
Running one_to_all_read_sm.
memcpy SM GPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 117.35 117.35 117.31 117.35 124.83 124.83 131.07 124.83
SUM one_to_all_read_sm 974.92
D2ALL unidirectional Write bandwidth test
./sailbandwidth -t one_to_all_write_sm -s true -i 1
Running one_to_all_write_sm.
memcpy SM GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3 4 5 6 7
0 61.98 59.85 59.38 59.13 58.58 56.35 59.03 61.80
SUM one_to_all_write_sm 476.10
H2D latency test
./sailbandwidth -t host_device_latency_sm -s true -i 1
Running host_device_latency_sm.
memory latency SM CPU(row) <-> GPU(column) (ns)
0 1 2 3 4 5 6 7
0 1047.14 1045.69 1051.88 1046.35 1013.99 944.99 990.09 982.71
SUM host_device_latency_sm 8122.84
D2D latency test
./sailbandwidth -t device_to_device_latency_sm -s true -i 1
Running device_to_device_latency_sm.
Device to Device Latency SM GPU(row) <-> GPU(column) (ns)
0 1 2 3 4 5 6 7
0 N/A 742.70 744.00 744.41 1143.64 1142.41 711.45 1142.57
1 742.64 N/A 739.88 740.14 1141.27 1141.64 1140.24 709.46
2 740.55 739.95 N/A 742.10 710.87 1140.07 1140.09 1140.00
3 740.59 740.14 742.28 N/A 1140.01 711.41 1140.17 1141.28
4 1140.44 1141.36 709.26 1141.55 N/A 741.84 741.95 741.55
5 1140.50 1141.37 1141.25 711.22 742.00 N/A 740.08 741.95
6 709.85 1141.80 1140.12 1141.51 741.69 741.95 N/A 742.49
7 1140.71 711.63 1141.20 1141.26 741.51 742.24 740.29 N/A
SUM device_to_device_latency_sm 50870.51
4.2 Multi-process cases
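The multi-process commands in this section share one shape: the total process count passed to -np equals the number of hosts multiplied by the per-node process count, and each host entry carries the same slot count. The sketch below assembles such a command line. The helper name `build_mpirun_argv` and the host names `node-a` and `node-b` are hypothetical; the mpirun options are copied from the commands in this section, and `bond0` is simply the interface used in those examples, so it may need to change on other machines.

```python
import shlex


def build_mpirun_argv(hosts, procs_per_node, testcase,
                      interface="bond0", binary="./sailbandwidth"):
    """Hypothetical helper: build the mpirun argument list for a multi-process test."""
    if not hosts or procs_per_node < 1:
        raise ValueError("need at least one host and one process per node")
    # Each host is listed as name:slots, matching the "$host1:2,$host2:2" form.
    host_spec = ",".join(f"{host}:{procs_per_node}" for host in hosts)
    return [
        "mpirun",
        "-np", str(len(hosts) * procs_per_node),
        "-npernode", str(procs_per_node),
        "--host", host_spec,
        "-mca", "btl_tcp_if_include", interface,
        binary,
        "-t", testcase,
    ]


argv = build_mpirun_argv(
    ["node-a", "node-b"], 2, "multinode_device_to_device_memcpy_read_ce"
)
# shlex.join quotes each argument as needed, so the line can be pasted into a shell.
print(shlex.join(argv))
```

Keeping the command as a list rather than a single string means it can be passed directly to `subprocess.run` without any shell quoting issues.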
4.2.1 Copy Engine (CE) Copy
D2D unidirectional Read bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_memcpy_read_ce
Running multinode_device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3
0 N/A 91.68 91.27 95.83
1 90.91 N/A 95.75 95.80
2 91.86 91.12 N/A 91.05
3 91.06 95.83 91.49 N/A
SUM multinode_device_to_device_memcpy_read_ce 1113.66
D2D bidirectional Read bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_bidirectional_memcpy_read_ce
Running multinode_device_to_device_bidirectional_memcpy_read_ce.
memcpy CE GPU(row) <-> GPU(column) Read1 bandwidth (GB/s)
0 1 2 3
0 N/A 94.05 96.31 96.31
1 94.05 N/A 96.31 96.31
2 96.31 96.31 N/A 94.05
3 96.31 96.31 94.04 N/A
SUM multinode_device_to_device_bidirectional_memcpy_read_ce_read1 1146.66
memcpy CE GPU(row) <-> GPU(column) Read2 bandwidth (GB/s)
0 1 2 3
0 N/A 94.04 96.30 96.31
1 94.05 N/A 96.29 96.31
2 96.31 96.31 N/A 94.04
3 96.31 96.31 94.04 N/A
SUM multinode_device_to_device_bidirectional_memcpy_read_ce_read2 1146.63
memcpy CE GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0 1 2 3
0 N/A 188.09 192.61 192.62
1 188.10 N/A 192.60 192.62
2 192.62 192.62 N/A 188.09
3 192.62 192.62 188.09 N/A
SUM multinode_device_to_device_bidirectional_memcpy_read_ce_total 2293.29
D2D unidirectional Write bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_memcpy_write_ce
Running multinode_device_to_device_memcpy_write_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3
0 N/A 383.86 383.87 383.86
1 383.85 N/A 383.87 383.86
2 383.80 383.82 N/A 383.82
3 383.82 383.82 383.84 N/A
SUM multinode_device_to_device_memcpy_write_ce 4606.09
D2D bidirectional Write bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_bidirectional_memcpy_write_ce
Running multinode_device_to_device_bidirectional_memcpy_write_ce.
memcpy CE GPU(row) <-> GPU(column) Read1 bandwidth (GB/s)
0 1 2 3
0 N/A 376.64 376.25 376.26
1 376.64 N/A 376.22 376.23
2 376.24 376.24 N/A 376.63
3 376.25 376.26 376.68 N/A
SUM multinode_device_to_device_bidirectional_memcpy_write_ce_write1 4516.53
memcpy CE GPU(row) <-> GPU(column) Read2 bandwidth (GB/s)
0 1 2 3
0 N/A 376.64 376.42 376.44
1 376.63 N/A 376.43 376.46
2 376.30 376.30 N/A 376.55
3 376.24 376.31 376.65 N/A
SUM multinode_device_to_device_bidirectional_memcpy_write_ce_write2 4517.37
memcpy CE GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0 1 2 3
0 N/A 753.28 752.67 752.70
1 753.27 N/A 752.65 752.68
2 752.55 752.54 N/A 753.18
3 752.50 752.56 753.33 N/A
SUM multinode_device_to_device_bidirectional_memcpy_write_ce_total 9033.90
4.2.2 Streaming Multiprocessor (SM) Copy
D2D unidirectional Read bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_memcpy_read_sm
Running multinode_device_to_device_memcpy_read_sm.
memcpy SM GPU(row) -> GPU(column) bandwidth (GB/s)
0 1 2 3
0 N/A 351.95 351.66 351.90
1 351.94 N/A 351.67 351.88
2 351.73 351.90 N/A 352.01
3 351.70 351.87 351.75 N/A
SUM multinode_device_to_device_memcpy_read_sm 4221.96
D2D bidirectional Read bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_bidirectional_memcpy_read_sm
Running multinode_device_to_device_bidirectional_memcpy_read_sm.
memcpy SM GPU(row) <-> GPU(column) Read1 bandwidth (GB/s)
0 1 2 3
0 N/A 310.79 311.02 311.40
1 311.10 N/A 311.12 311.18
2 311.35 310.65 N/A 310.67
3 311.19 311.05 310.94 N/A
SUM multinode_device_to_device_bidirectional_memcpy_read_sm_read1 3732.47
memcpy SM GPU(row) <-> GPU(column) Read2 bandwidth (GB/s)
0 1 2 3
0 N/A 311.08 312.20 311.96
1 311.16 N/A 311.24 311.35
2 311.14 311.06 N/A 311.16
3 311.79 311.24 310.99 N/A
SUM multinode_device_to_device_bidirectional_memcpy_read_sm_read2 3736.38
memcpy SM GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0 1 2 3
0 N/A 621.88 623.22 623.37
1 622.26 N/A 622.36 622.53
2 622.50 621.72 N/A 621.83
3 622.97 622.29 621.93 N/A
SUM multinode_device_to_device_bidirectional_memcpy_read_sm_total 7468.85
D2D unidirectional Write bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_memcpy_write_sm
Running multinode_device_to_device_memcpy_write_sm.
memcpy SM GPU(row) <- GPU(column) bandwidth (GB/s)
0 1 2 3
0 N/A 341.81 341.80 341.69
1 341.73 N/A 341.87 341.63
2 342.30 341.87 N/A 342.02
3 342.05 342.32 342.17 N/A
SUM multinode_device_to_device_memcpy_write_sm 4103.26
D2D bidirectional Write bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_bidirectional_memcpy_write_sm
Running multinode_device_to_device_bidirectional_memcpy_write_sm.
memcpy SM GPU(row) <-> GPU(column) Write1 bandwidth (GB/s)
0 1 2 3
0 N/A 331.61 331.43 331.65
1 331.67 N/A 331.81 331.77
2 332.03 332.10 N/A 332.01
3 331.85 331.84 331.76 N/A
SUM multinode_device_to_device_bidirectional_memcpy_write_sm_write1 3981.53
memcpy SM GPU(row) <-> GPU(column) Write2 bandwidth (GB/s)
0 1 2 3
0 N/A 331.75 332.11 332.07
1 331.81 N/A 331.99 332.20
2 331.48 331.77 N/A 332.05
3 331.47 331.85 332.10 N/A
SUM multinode_device_to_device_bidirectional_memcpy_write_sm_write2 3982.65
memcpy SM GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0 1 2 3
0 N/A 663.37 663.54 663.72
1 663.48 N/A 663.80 663.97
2 663.51 663.86 N/A 664.05
3 663.32 663.69 663.86 N/A
SUM multinode_device_to_device_bidirectional_memcpy_write_sm_total 7964.18
ALL2One unidirectional Write bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_all_to_one_write_sm
Running multinode_device_to_device_all_to_one_write_sm.
memcpy SM All Gpus -> GPU(column) total bandwidth (GB/s)
0 1 2 3
0 352.59 352.37 353.21 352.58
SUM multinode_device_to_device_all_to_one_write_sm 1410.75
ALLFromOne unidirectional Read bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_all_from_one_read_sm
Running multinode_device_to_device_all_from_one_read_sm.
memcpy SM All Gpus <- GPU(column) total bandwidth (GB/s)
0 1 2 3
0 354.36 354.40 354.32 354.43
SUM multinode_device_to_device_all_from_one_read_sm 1417.51
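In the bidirectional outputs shown in this document, the printed values are consistent with each Total cell being the sum of the two per-direction cells, up to rounding of the two-decimal figures. The sketch below checks that relationship for one row. The function name `totals_match` and the 0.02 GB/s tolerance are assumptions chosen for this example; the numbers are row 0 of the multinode_device_to_device_bidirectional_memcpy_read_ce output in section 4.2.1, with None standing for N/A.

```python
def totals_match(dir1, dir2, total, tol=0.02):
    """Return True when every cell satisfies |dir1 + dir2 - total| <= tol.

    None marks an N/A cell and must appear at the same position in all three rows.
    """
    if not (len(dir1) == len(dir2) == len(total)):
        return False
    for a, b, t in zip(dir1, dir2, total):
        missing = [x is None for x in (a, b, t)]
        if any(missing):
            if not all(missing):
                return False
            continue
        if abs(a + b - t) > tol:
            return False
    return True


read1 = [None, 94.05, 96.31, 96.31]
read2 = [None, 94.04, 96.30, 96.31]
total = [None, 188.09, 192.61, 192.62]
print(totals_match(read1, read2, total))
```

The tolerance allows for each of the three printed values being rounded independently to two decimals.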