PCCL: sailbandwidth User Guide (v2.1)

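1. Overview

This document describes how to use sailbandwidth, the D2D/H2D/D2H bandwidth test tool in pccl_tools. sailbandwidth is the PPU port of nvbandwidth, NVIDIA's open-source bandwidth test tool. Its interface stays compatible with nvbandwidth v0.8, and it has been evaluated on 真武810E machines and optimized for PPU hardware characteristics. The upstream open-source code is available at: https://github.com/NVIDIA/nvbandwidth

The following sections describe its parameters and how to run common test cases. The data values shown are for reference only; actual performance depends on the machine environment, the bandwidth configuration, and the PPU SDK release version.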

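2. Usage

2.1 Using the prebuilt package

PPU SDK v2.1.0 release and later

Starting with the PPU SDK v2.1.0 release, sailbandwidth has been removed from the SDK and is published separately. The download link is https://art-pub.eng.t-head.cn/artifactory/generic-local/SAIL/v2.1.0/COMM and the binaries are located at the following paths in the comm_tools package:

// Single-process version, for single-node tests
comm_tools/single_process/sailbandwidth

// Multi-process version, for cross-node tests
comm_tools/multi_process/sailbandwidth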

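Before the PPU SDK v2.1.0 release

The sailbandwidth tool is released together with the PPU SDK and is located at the following paths in the SDK directory:

// Single-process version, for single-node tests
PPU_SDK/pccl_tools/sailbandwidth

// Multi-process version, for cross-node tests; supported since the PPU SDK v2.0.0 release
PPU_SDK/pccl_tools/mp/sailbandwidth

Note: comm_tools cannot be used with PPU SDK v2.0 and earlier; doing so results in an error.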
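The version rules above can be summarized in a small helper. This is only a sketch: the function names `parse_version` and `sailbandwidth_path` are hypothetical and not part of any SDK, and the returned paths are simply the relative paths listed in this section.

```python
def parse_version(text):
    """Turn 'v2.1.0' or '2.0' into a comparable tuple such as (2, 1, 0)."""
    nums = [int(part) for part in text.lstrip("vV").split(".")]
    while len(nums) < 3:
        nums.append(0)
    return tuple(nums)


def sailbandwidth_path(sdk_version, multi_process=False):
    """Return the relative path of the binary for a given PPU SDK release."""
    version = parse_version(sdk_version)
    if version >= (2, 1, 0):
        # v2.1.0 and later: standalone comm_tools package.
        sub = "multi_process" if multi_process else "single_process"
        return f"comm_tools/{sub}/sailbandwidth"
    if multi_process:
        # The multi-process build ships with the SDK from v2.0.0 onward.
        if version < (2, 0, 0):
            raise ValueError("multi-process sailbandwidth needs PPU SDK v2.0.0 or later")
        return "PPU_SDK/pccl_tools/mp/sailbandwidth"
    return "PPU_SDK/pccl_tools/sailbandwidth"


print(sailbandwidth_path("v2.1.0"))
print(sailbandwidth_path("v2.0.0", multi_process=True))
```

Comparing version tuples rather than strings avoids ordering mistakes such as "2.10" sorting before "2.9".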

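2.2 Building from source

sailbandwidth lives in the pccl-tests directory. The code is currently open-sourced inside Alibaba Group; colleagues who have signed the T-Head (平头哥) NDA can request access to the source.

The source code is published at:

https://code.alibaba-inc.com/ppu_open_source/pccl-tests

After sourcing the SDK, enter the sailbandwidth directory and build with the following commands:

// Build the single-process version
./build.sh --use_sdk

// Build the multi-process version
./build.sh --use_sdk --use_mpi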

3. Parameters

Run ./sailbandwidth -h to list all parameters:

sailbandwidth Version: v2.1.0
sailbandwidth CLI:
// Parameters compatible with nvbandwidth:
  -h [ --help ]                  Produce help message
  -b [ --bufferSize ] arg (=512) Memcpy buffer size in MiB
  -l [ --list ]                  List available testcases
  -t [ --testcase ] arg          Testcase(s) to run (by name or index)
  -p [ --testcasePrefixes ] arg  Testcase(s) to run (by prefix))
  -v [ --verbose ]               Verbose output
  -s [ --skipVerification ]      Skips data verification after copy
  -d [ --disableAffinity ]       Disable automatic CPU affinity control
  -i [ --testSamples ] arg (=3)  Iterations of the benchmark
  -m [ --useMean ]               Use mean instead of median for results
  -j [ --json ]                  Print output in json format instead of plain text.

// Parameters added by sailbandwidth:
  --useNormalCopy                Use normal copy for sm test
  --nBlocks arg (=-1)            nBlocks for sm bulk copy test
  --nThreads arg (=-1)           nThreads for sm bulk copy test
  --isSingleDie                  Use single die for ce test
  --testLocalCopy                Test local copy in D2D Tests
  --displayUR                    Show utilization ratio matrixs in D2D Tests
  --disableVm                    Disable virtual mode
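When tests are scripted, it can help to assemble the command line from named options instead of hand-written strings. The sketch below uses only flags that appear in the help output above (-t, -i, -b, -m, -j); the function name `build_command` and its keyword names are hypothetical.

```python
import shlex


def build_command(testcase, samples=None, buffer_mib=None,
                  use_mean=False, json_output=False,
                  binary="./sailbandwidth"):
    """Assemble an argument list for one sailbandwidth run."""
    cmd = [binary, "-t", str(testcase)]   # -t accepts a testcase name or index
    if samples is not None:
        cmd += ["-i", str(samples)]       # iterations of the benchmark
    if buffer_mib is not None:
        cmd += ["-b", str(buffer_mib)]    # memcpy buffer size in MiB
    if use_mean:
        cmd.append("-m")                  # report mean instead of median
    if json_output:
        cmd.append("-j")                  # JSON instead of plain text
    return cmd


cmd = build_command("device_to_device_memcpy_read_ce", samples=1)
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run` without going through a shell, which avoids quoting problems.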

4. Examples

4.1 Single-process cases

4.1.1 Copy Engine (CE) Copy

D2D unidirectional read bandwidth test
./sailbandwidth -t device_to_device_memcpy_read_ce -s true -i 1
Running device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     48.87     47.22     47.69     48.43     45.11     45.07     48.22
1     45.19       N/A     48.92     46.46     45.00     45.28     47.31     47.39
2     48.88     45.62       N/A     45.38     48.48     47.57     48.86     46.14
3     46.03     44.84     46.26       N/A     45.27     48.70     46.56     45.29
4     45.36     45.93     44.87     48.00       N/A     46.31     44.95     44.84
5     48.88     47.64     45.81     48.76     48.21       N/A     45.84     44.99
6     46.01     48.87     47.93     48.47     45.44     45.17       N/A     45.59
7     45.57     46.83     48.56     47.71     47.64     47.62     48.88       N/A
SUM device_to_device_memcpy_read_ce 2620.72
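For post-processing, the plain-text report can be parsed into a dictionary keyed by (row, column). The sketch below assumes the layout shown in this guide (a header row of column indices, data rows starting with the row index, N/A on the diagonal, and a trailing SUM line) and handles one matrix per report. The name `parse_report` is hypothetical, and the sample text is the host_to_device_memcpy_ce output from this guide.

```python
def parse_report(text):
    """Parse one plain-text report into ({(row, col): value}, reported_sum)."""
    matrix = {}
    columns = None
    reported_sum = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "SUM":
            reported_sum = float(tokens[-1])
        elif all(t.isdigit() for t in tokens):
            columns = [int(t) for t in tokens]      # header row of column indices
        elif columns and tokens[0].isdigit() and len(tokens) == len(columns) + 1:
            row = int(tokens[0])
            for col, cell in zip(columns, tokens[1:]):
                if cell != "N/A":                   # skip the diagonal
                    matrix[(row, col)] = float(cell)
    return matrix, reported_sum


SAMPLE = """\
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     44.74     45.08     44.65     44.98     45.17     45.05     45.00     44.95

SUM host_to_device_memcpy_ce 359.62
"""

matrix, reported = parse_report(SAMPLE)
# In the samples in this guide the SUM line matches the sum of the entries,
# up to rounding of the printed values.
print(round(sum(matrix.values()), 2), reported)
```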
D2D bidirectional read bandwidth test
./sailbandwidth -t device_to_device_bidirectional_memcpy_read_ce -s true -i 1
Running device_to_device_bidirectional_memcpy_read_ce.
memcpy CE GPU(row) <-> GPU(column) Read1 bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     46.24     44.67     44.25     47.57     44.49     44.04     44.92
1     43.68       N/A     44.83     44.93     46.02     44.30     44.33     46.76
2     44.35     44.11       N/A     45.45     47.03     44.16     44.42     45.41
3     46.48     44.96     44.58       N/A     46.42     44.63     44.71     45.29
4     44.46     46.86     44.30     45.60       N/A     46.47     46.58     44.13
5     44.28     45.28     44.04     44.03     47.42       N/A     46.61     43.64
6     45.59     44.35     44.43     45.05     46.14     46.56       N/A     44.37
7     46.10     44.40     44.82     45.55     45.39     45.78     45.91       N/A

SUM device_to_device_bidirectional_memcpy_read_ce_read1 2531.15

memcpy CE GPU(row) <-> GPU(column) Read2 bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     44.81     44.70     46.81     47.57     44.93     44.07     44.08
1     43.71       N/A     44.84     44.59     44.21     44.30     44.34     44.13
2     44.11     44.12       N/A     44.96     47.00     46.93     45.71     44.83
3     45.84     45.40     44.59       N/A     46.43     45.21     44.11     45.46
4     44.28     46.42     44.78     44.64       N/A     46.53     44.19     45.67
5     44.29     44.35     44.49     45.62     47.42       N/A     46.30     43.64
6     45.59     44.38     44.44     44.06     44.59     45.74       N/A     46.23
7     44.45     44.40     44.87     45.18     45.17     45.56     45.89       N/A

SUM device_to_device_bidirectional_memcpy_read_ce_read2 2524.94

memcpy CE GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     91.06     89.38     91.06     95.14     89.42     88.11     89.00
1     87.38       N/A     89.67     89.52     90.23     88.60     88.67     90.90
2     88.45     88.22       N/A     90.40     94.03     91.09     90.13     90.24
3     92.32     90.36     89.17       N/A     92.85     89.84     88.82     90.75
4     88.74     93.27     89.08     90.24       N/A     93.00     90.77     89.80
5     88.57     89.63     88.53     89.65     94.84       N/A     92.92     87.27
6     91.18     88.73     88.87     89.11     90.73     92.30       N/A     90.59
7     90.56     88.80     89.69     90.73     90.56     91.35     91.80       N/A

SUM device_to_device_bidirectional_memcpy_read_ce_total 5056.09
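In the sample above, each entry of the Total matrix appears to be the sum of the matching Read1 and Read2 entries, up to rounding of the printed values. This is an observation from the sample, not a documented guarantee. A quick check on row 0 (values copied from the output above; the helper name `max_total_mismatch` is hypothetical):

```python
# Row 0 of the Read1, Read2 and Total matrices, diagonal (N/A) omitted.
read1 = [46.24, 44.67, 44.25, 47.57, 44.49, 44.04, 44.92]
read2 = [44.81, 44.70, 46.81, 47.57, 44.93, 44.07, 44.08]
total = [91.06, 89.38, 91.06, 95.14, 89.42, 88.11, 89.00]


def max_total_mismatch(first, second, combined):
    """Largest absolute gap between first + second and the reported total."""
    return max(abs((a + b) - t) for a, b, t in zip(first, second, combined))


mismatch = max_total_mismatch(read1, read2, total)
# A gap of about 0.01 GB/s is consistent with two-decimal rounding.
print(f"largest mismatch: {mismatch:.2f} GB/s")
```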
D2D unidirectional write bandwidth test
./sailbandwidth -t device_to_device_memcpy_write_ce -s true -i 1
Running device_to_device_memcpy_write_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     46.11     46.97     48.07     96.90     97.65     48.60     95.29
1     46.44       N/A     46.67     46.85     94.09     94.44     94.85     47.58
2     45.82     46.94       N/A     48.38     48.90     95.74     94.97     93.26
3     46.10     48.78     45.99       N/A     90.39     46.81     97.31     95.06
4     93.30     90.51     46.70     97.19       N/A     45.79     46.20     48.26
5     91.44     90.75     90.36     44.99     45.77       N/A     46.84     48.17
6     46.54     94.74     95.56     96.83     48.88     47.58       N/A     46.38
7     90.39     44.84     90.40     90.32     44.92     44.99     45.06       N/A

SUM device_to_device_memcpy_write_ce 3748.65
D2D bidirectional write bandwidth test
./sailbandwidth -t device_to_device_bidirectional_memcpy_write_ce -s true -i 1
Running device_to_device_bidirectional_memcpy_write_ce.
memcpy CE GPU(row) <-> GPU(column) Write1 bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     46.48     45.11     45.45     86.87     87.96     45.57     91.42
1     45.89       N/A     44.44     45.29     89.64     90.34     89.51     44.58
2     44.10     44.29       N/A     45.36     44.98     89.66     87.56     88.32
3     44.54     44.11     44.09       N/A     89.57     43.92     86.25     87.00
4     88.64     87.24     46.77     90.35       N/A     45.24     46.75     44.01
5     86.21     90.83     87.64     45.94     46.71       N/A     46.74     44.32
6     46.97     89.54     87.69     89.79     44.17     44.28       N/A     45.51
7     87.95     43.98     86.17     91.33     46.82     44.05     44.06       N/A

SUM device_to_device_bidirectional_memcpy_write_ce_write1 3571.98

memcpy CE GPU(row) <-> GPU(column) Write2 bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     44.63     43.96     43.99     86.89     87.26     45.45     90.44
1     45.38       N/A     44.47     44.75     88.64     88.19     87.81     44.05
2     44.10     44.29       N/A     44.21     44.99     89.75     88.71     90.57
3     44.04     44.41     45.21       N/A     87.08     44.02     90.27     88.86
4     88.42     86.38     46.43     89.49       N/A     44.05     46.85     44.01
5     86.20     86.07     86.57     44.82     46.55       N/A     46.77     45.17
6     46.97     89.57     87.71     87.19     44.16     44.24       N/A     45.48
7     88.15     44.84     88.43     91.16     46.71     46.13     46.43       N/A

SUM device_to_device_bidirectional_memcpy_write_ce_write2 3561.37

memcpy CE GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     91.11     89.07     89.43    173.76    175.22     91.02    181.85
1     91.26       N/A     88.91     90.04    178.29    178.53    177.32     88.63
2     88.20     88.59       N/A     89.58     89.97    179.41    176.26    178.89
3     88.58     88.52     89.30       N/A    176.66     87.94    176.52    175.86
4    177.06    173.62     93.21    179.83       N/A     89.28     93.60     88.02
5    172.41    176.90    174.21     90.76     93.26       N/A     93.51     89.49
6     93.94    179.11    175.41    176.98     88.33     88.52       N/A     90.99
7    176.10     88.81    174.59    182.49     93.53     90.18     90.49       N/A

SUM device_to_device_bidirectional_memcpy_write_ce_total 7133.35
H2D unidirectional bandwidth test
./sailbandwidth -t host_to_device_memcpy_ce -s true -i 1
Running host_to_device_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     44.74     45.08     44.65     44.98     45.17     45.05     45.00     44.95

SUM host_to_device_memcpy_ce 359.62
H2D bidirectional bandwidth test
./sailbandwidth -t host_to_device_bidirectional_memcpy_ce -s true -i 1
Running host_to_device_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     34.47     34.73     34.61     34.44     34.76     34.62     34.78     34.65

SUM host_to_device_bidirectional_memcpy_ce 277.06
D2H unidirectional bandwidth test
./sailbandwidth -t device_to_host_memcpy_ce -s true -i 1
Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     56.86     56.76     56.75     56.75     56.75     56.76     56.81     56.77

SUM device_to_host_memcpy_ce 454.21
D2H bidirectional bandwidth test
./sailbandwidth -t device_to_host_bidirectional_memcpy_ce -s true -i 1
Running device_to_host_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     41.58     41.66     41.60     41.64     41.55     41.67     41.68     41.64

SUM device_to_host_bidirectional_memcpy_ce 333.02
ALL2H unidirectional bandwidth test
./sailbandwidth -t all_to_host_memcpy_ce -s true -i 1
Running all_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     35.73     35.74     35.36     35.28     34.50     34.60     34.70     34.44

SUM all_to_host_memcpy_ce 280.34
ALL2H bidirectional bandwidth test
./sailbandwidth -t all_to_host_bidirectional_memcpy_ce -s true -i 1
Running all_to_host_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     32.84     32.72     32.92     32.70     32.13     32.22     32.18     32.10

SUM all_to_host_bidirectional_memcpy_ce 259.81
H2ALL unidirectional bandwidth test
./sailbandwidth -t host_to_all_memcpy_ce -s true -i 1
Running host_to_all_memcpy_ce.
memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     41.03     41.59     41.46     41.38     41.54     41.41     41.61     41.94

SUM host_to_all_memcpy_ce 331.97
H2ALL bidirectional bandwidth test
./sailbandwidth -t host_to_all_bidirectional_memcpy_ce -s true -i 1
Running host_to_all_bidirectional_memcpy_ce.
memcpy CE CPU(row) <-> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     13.36     13.44     13.40     13.43     13.33     13.33     13.36     13.34

SUM host_to_all_bidirectional_memcpy_ce 107.01
ALL2D unidirectional read bandwidth test
./sailbandwidth -t all_to_one_read_ce -s true -i 1
Running all_to_one_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     47.85     51.28     49.56     49.25     48.04     49.41     50.00     46.47

SUM all_to_one_read_ce 391.88
ALL2D unidirectional write bandwidth test
./sailbandwidth -t all_to_one_write_ce -s true -i 1
Running all_to_one_write_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0    125.83    125.11    126.47    125.63    131.94    131.80    139.10    131.74

SUM all_to_one_write_ce 1037.62
D2ALL unidirectional read bandwidth test
./sailbandwidth -t one_to_all_read_ce -s true -i 1
Running one_to_all_read_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0    124.36    125.04    125.84    126.31    133.50    132.97    140.04    131.60

SUM one_to_all_read_ce 1039.66
D2ALL unidirectional write bandwidth test
./sailbandwidth -t one_to_all_write_ce -s true -i 1
Running one_to_all_write_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0    105.36    106.22    103.95    107.19    113.33    112.38    119.61    112.42

SUM one_to_all_write_ce 880.45
Local copy bandwidth test
./sailbandwidth -t device_local_copy -s true -i 1
Running device_local_copy.
memcpy local GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0   1055.36   1055.80   1056.31   1056.58   1055.00   1056.70   1055.45   1055.33

SUM device_local_copy 8446.53

4.1.2 Streaming Multiprocessor (SM) Copy

D2D unidirectional read bandwidth test
./sailbandwidth -t device_to_device_memcpy_read_sm -s true -i 1
Running device_to_device_memcpy_read_sm.
memcpy SM GPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     48.17     48.17     48.17     96.31     96.31     48.17     96.31
1     48.17       N/A     48.17     48.17     96.31     96.31     96.31     48.17
2     48.17     48.17       N/A     48.17     48.17     96.31     96.31     96.31
3     48.17     48.17     48.17       N/A     96.31     48.17     96.31     96.31
4     96.31     96.31     48.17     96.31       N/A     48.17     48.17     48.17
5     96.31     96.31     96.31     48.17     48.17       N/A     48.17     48.17
6     48.17     96.31     96.31     96.31     48.17     48.17       N/A     48.17
7     96.31     48.17     96.31     96.31     48.17     48.17     48.17       N/A

SUM device_to_device_memcpy_read_sm 3852.87
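The sample matrix above shows two distinct bandwidth levels, about 48 GB/s and about 96 GB/s. This guide does not state the cause, so the sketch below only groups device pairs by level. It assumes a simple midpoint threshold between the smallest and largest value, and the name `split_by_level` is hypothetical.

```python
def split_by_level(matrix):
    """Split (row, col) pairs into a lower and an upper bandwidth group."""
    values = list(matrix.values())
    threshold = (min(values) + max(values)) / 2   # assumed midpoint split
    low = sorted(pair for pair, v in matrix.items() if v < threshold)
    high = sorted(pair for pair, v in matrix.items() if v >= threshold)
    return low, high


# Row 0 of the sample output above, diagonal (N/A) omitted.
row0 = {
    (0, 1): 48.17, (0, 2): 48.17, (0, 3): 48.17, (0, 4): 96.31,
    (0, 5): 96.31, (0, 6): 48.17, (0, 7): 96.31,
}
low, high = split_by_level(row0)
print("lower level:", low)
print("upper level:", high)
```

A midpoint split only makes sense when the values fall into two well-separated clusters, as they do in this sample; noisier data would need a different grouping rule.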
D2D bidirectional read bandwidth test
./sailbandwidth -t device_to_device_bidirectional_memcpy_read_sm -s true -i 1
Running device_to_device_bidirectional_memcpy_read_sm.
memcpy SM GPU(row) <-> GPU(column) Read1 bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     41.86     41.86     41.86     83.70     83.70     41.86     83.70
1     41.86       N/A     41.86     41.86     83.70     83.70     83.70     41.86
2     41.86     41.86       N/A     41.86     41.86     83.70     83.70     83.70
3     41.86     41.86     41.86       N/A     83.70     41.86     83.69     83.69
4     83.70     83.70     41.86     83.70       N/A     41.86     41.86     41.86
5     83.70     83.70     83.70     41.86     41.86       N/A     41.86     41.86
6     41.86     83.70     83.69     83.70     41.86     41.86       N/A     41.86
7     83.70     41.86     83.70     83.70     41.86     41.86     41.86       N/A

SUM device_to_device_bidirectional_memcpy_read_sm_read1 3348.32

memcpy SM GPU(row) <-> GPU(column) Read2 bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     41.86     41.86     41.86     83.70     83.70     41.86     83.71
1     41.86       N/A     41.86     41.86     83.70     83.70     83.70     41.86
2     41.86     41.86       N/A     41.86     41.86     83.70     83.70     83.70
3     41.86     41.86     41.86       N/A     83.70     41.86     83.70     83.70
4     83.70     83.70     41.86     83.70       N/A     41.86     41.86     41.86
5     83.70     83.70     83.70     41.86     41.86       N/A     41.86     41.86
6     41.86     83.70     83.70     83.70     41.86     41.86       N/A     41.86
7     83.70     41.86     83.70     83.70     41.86     41.86     41.86       N/A

SUM device_to_device_bidirectional_memcpy_read_sm_read2 3348.37

memcpy SM GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     83.72     83.72     83.72    167.40    167.40     83.72    167.41
1     83.72       N/A     83.72     83.72    167.40    167.40    167.40     83.72
2     83.72     83.72       N/A     83.72     83.72    167.40    167.39    167.39
3     83.72     83.72     83.72       N/A    167.40     83.72    167.39    167.39
4    167.40    167.40     83.72    167.40       N/A     83.72     83.72     83.72
5    167.39    167.40    167.40     83.72     83.72       N/A     83.72     83.72
6     83.72    167.40    167.39    167.40     83.72     83.72       N/A     83.72
7    167.41     83.72    167.40    167.40     83.72     83.72     83.72       N/A

SUM device_to_device_bidirectional_memcpy_read_sm_total 6696.68
D2D unidirectional write bandwidth test
./sailbandwidth -t device_to_device_memcpy_write_sm -s true -i 1
Running device_to_device_memcpy_write_sm.
memcpy SM GPU(row) <- GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     46.80     46.80     46.80     93.25     93.23     46.80     93.25
1     46.80       N/A     46.80     46.80     93.24     93.25     93.25     46.80
2     46.80     46.80       N/A     46.80     46.80     93.25     93.24     93.25
3     46.80     46.80     46.80       N/A     93.25     46.80     93.23     93.24
4     93.24     93.26     46.80     93.24       N/A     46.80     46.80     46.80
5     93.24     93.25     93.24     46.80     46.80       N/A     46.80     46.80
6     46.80     93.25     93.25     93.23     46.80     46.80       N/A     46.80
7     93.25     46.80     93.25     93.25     46.80     46.80     46.80       N/A

SUM device_to_device_memcpy_write_sm 3735.50
D2D bidirectional write bandwidth test
./sailbandwidth -t device_to_device_bidirectional_memcpy_write_sm -s true -i 1
Running device_to_device_bidirectional_memcpy_write_sm.
memcpy SM GPU(row) <-> GPU(column) Write1 bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     45.27     45.27     45.26     89.99     89.99     45.27     89.99
1     45.26       N/A     45.26     45.27     89.99     89.99     89.98     45.27
2     45.26     45.27       N/A     45.27     45.26     89.99     89.99     89.99
3     45.26     45.26     45.26       N/A     89.99     45.26     89.98     89.99
4     89.99     89.99     45.26     89.97       N/A     45.27     45.27     45.27
5     89.98     89.98     89.98     45.26     45.26       N/A     45.27     45.27
6     45.26     90.00     89.99     89.97     45.27     45.27       N/A     45.27
7     89.98     45.27     89.99     89.99     45.27     45.27     45.27       N/A

SUM device_to_device_bidirectional_memcpy_write_sm_write1 3608.14

memcpy SM GPU(row) <-> GPU(column) Write2 bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     45.26     45.26     45.27     89.99     89.98     45.26     89.99
1     45.26       N/A     45.26     45.27     89.99     89.99     89.98     45.27
2     45.26     45.26       N/A     45.27     45.27     89.98     90.02     89.97
3     45.27     45.26     45.27       N/A     89.97     45.26     89.98     89.99
4     89.99     89.97     45.26     89.99       N/A     45.26     45.26     45.26
5     89.99     89.99     89.98     45.27     45.27       N/A     45.27     45.27
6     45.27     89.98     89.97     89.99     45.27     45.26       N/A     45.26
7     89.98     45.27     90.00     89.98     45.27     45.27     45.27       N/A

SUM device_to_device_bidirectional_memcpy_write_sm_write2 3608.13

memcpy SM GPU(row) <-> GPU(column) Total bandwidth (GB/s)
0         1         2         3         4         5         6         7
0       N/A     90.53     90.53     90.53    179.98    179.97     90.53    179.98
1     90.53       N/A     90.53     90.53    179.98    179.98    179.96     90.53
2     90.53     90.53       N/A     90.53     90.53    179.98    180.00    179.96
3     90.53     90.53     90.53       N/A    179.96     90.53    179.96    179.99
4    179.98    179.97     90.53    179.96       N/A     90.53     90.53     90.53
5    179.97    179.97    179.97     90.53     90.53       N/A     90.53     90.53
6     90.53    179.98    179.97    179.95     90.53     90.53       N/A     90.53
7    179.96     90.53    179.99    179.96     90.53     90.53     90.53       N/A

SUM device_to_device_bidirectional_memcpy_write_sm_total 7216.27
H2D unidirectional bandwidth test
./sailbandwidth -t host_to_device_memcpy_sm -s true -i 1
Running host_to_device_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     32.31     32.65     32.26     32.47     32.65     32.58     32.45     32.60

SUM host_to_device_memcpy_sm 259.96
H2D bidirectional bandwidth test
./sailbandwidth -t host_to_device_bidirectional_memcpy_sm -s true -i 1
Running host_to_device_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     27.07     27.37     27.21     27.25     27.28     27.32     27.30     27.31

SUM host_to_device_bidirectional_memcpy_sm 218.12
D2H unidirectional bandwidth test
./sailbandwidth -t device_to_host_memcpy_sm -s true -i 1
Running device_to_host_memcpy_sm.
memcpy SM CPU(row) <- GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     52.69     52.60     52.62     52.61     52.60     52.61     52.61     52.61

SUM device_to_host_memcpy_sm 420.95
D2H bidirectional bandwidth test
./sailbandwidth -t device_to_host_bidirectional_memcpy_sm -s true -i 1
Running device_to_host_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     27.07     27.37     27.18     27.23     27.31     27.36     27.33     27.32

SUM device_to_host_bidirectional_memcpy_sm 218.16
ALL2H unidirectional bandwidth test
./sailbandwidth -t all_to_host_memcpy_sm -s true -i 1
Running all_to_host_memcpy_sm.
memcpy SM CPU(row) <- GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     34.59     34.93     34.91     34.81     34.50     34.69     34.72     34.51

SUM all_to_host_memcpy_sm 277.68
ALL2H bidirectional bandwidth test
./sailbandwidth -t all_to_host_bidirectional_memcpy_sm -s true -i 1
Running all_to_host_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     25.33     25.63     25.48     25.51     25.57     25.48     25.50     25.58

SUM all_to_host_bidirectional_memcpy_sm 204.07
H2ALL unidirectional bandwidth test
./sailbandwidth -t host_to_all_memcpy_sm -s true -i 1
Running host_to_all_memcpy_sm.
memcpy SM CPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     30.99     31.29     31.04     31.20     31.32     31.22     31.19     31.31

SUM host_to_all_memcpy_sm 249.56
H2ALL bidirectional bandwidth test
./sailbandwidth -t host_to_all_bidirectional_memcpy_sm -s true -i 1
Running host_to_all_bidirectional_memcpy_sm.
memcpy SM CPU(row) <-> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     26.19     26.41     26.26     26.39     26.44     26.43     26.40     26.46

SUM host_to_all_bidirectional_memcpy_sm 210.98
ALL2D unidirectional read bandwidth test
./sailbandwidth -t all_to_one_read_sm -s true -i 1
Running all_to_one_read_sm.
memcpy SM GPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     59.86     59.52     58.54     59.96     58.62     59.74     57.47     58.42

SUM all_to_one_read_sm 472.13
ALL2D unidirectional write bandwidth test
./sailbandwidth -t all_to_one_write_sm -s true -i 1
Running all_to_one_write_sm.
memcpy SM GPU(row) <- GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0    117.35    117.35    117.31    117.35    124.83    124.83    131.07    124.83

SUM all_to_one_write_sm 974.92
D2ALL unidirectional read bandwidth test
./sailbandwidth -t one_to_all_read_sm -s true -i 1
Running one_to_all_read_sm.
memcpy SM GPU(row) <- GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0    117.35    117.35    117.31    117.35    124.83    124.83    131.07    124.83

SUM one_to_all_read_sm 974.92
D2ALL unidirectional write bandwidth test
./sailbandwidth -t one_to_all_write_sm -s true -i 1
Running one_to_all_write_sm.
memcpy SM GPU(row) -> GPU(column) bandwidth (GB/s)
0         1         2         3         4         5         6         7
0     61.98     59.85     59.38     59.13     58.58     56.35     59.03     61.80

SUM one_to_all_write_sm 476.10
H2D latency test
./sailbandwidth -t host_device_latency_sm -s true -i 1
Running host_device_latency_sm.
memory latency SM CPU(row) <-> GPU(column) (ns)
0         1         2         3         4         5         6         7
0   1047.14   1045.69   1051.88   1046.35   1013.99    944.99    990.09    982.71

SUM host_device_latency_sm 8122.84
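For latency tests the SUM line adds up all entries just as it does for bandwidth, so a per-device mean, minimum and maximum are usually easier to interpret. A minimal sketch using the row from the sample output above (the variable names are illustrative only):

```python
from statistics import mean

# host_device_latency_sm sample row from above, in nanoseconds.
latencies_ns = [1047.14, 1045.69, 1051.88, 1046.35,
                1013.99, 944.99, 990.09, 982.71]

summary = {
    "sum": sum(latencies_ns),    # corresponds to the SUM line (8122.84)
    "mean": mean(latencies_ns),  # SUM divided by the number of devices
    "min": min(latencies_ns),
    "max": max(latencies_ns),
}
for name, value in summary.items():
    print(f"{name:>4}: {value:.2f} ns")
```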
D2D latency test
./sailbandwidth -t device_to_device_latency_sm -s true -i 1
Running device_to_device_latency_sm.
Device to Device Latency SM GPU(row) <-> GPU(column) (ns)
0         1         2         3         4         5         6         7
0       N/A    742.70    744.00    744.41   1143.64   1142.41    711.45   1142.57
1    742.64       N/A    739.88    740.14   1141.27   1141.64   1140.24    709.46
2    740.55    739.95       N/A    742.10    710.87   1140.07   1140.09   1140.00
3    740.59    740.14    742.28       N/A   1140.01    711.41   1140.17   1141.28
4   1140.44   1141.36    709.26   1141.55       N/A    741.84    741.95    741.55
5   1140.50   1141.37   1141.25    711.22    742.00       N/A    740.08    741.95
6    709.85   1141.80   1140.12   1141.51    741.69    741.95       N/A    742.49
7   1140.71    711.63   1141.20   1141.26    741.51    742.24    740.29       N/A

SUM device_to_device_latency_sm 50870.51

4.2 Multi-process cases

4.2.1 Copy Engine (CE) Copy

D2D unidirectional read bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_memcpy_read_ce
Running multinode_device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
           0         1         2         3
 0       N/A     91.68     91.27     95.83
 1     90.91       N/A     95.75     95.80
 2     91.86     91.12       N/A     91.05
 3     91.06     95.83     91.49       N/A

SUM multinode_device_to_device_memcpy_read_ce 1113.66
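The mpirun command line used in this section can be generated from a host list, which keeps -np consistent with -npernode and the --host slots. This is a sketch only: the function name `build_mpirun` and the host names `node-a` and `node-b` are hypothetical, and the bond0 interface is simply the one used in the examples of this guide.

```python
import shlex


def build_mpirun(hosts, procs_per_node, testcase,
                 interface="bond0", binary="./sailbandwidth"):
    """Assemble an mpirun argument list for a multi-process sailbandwidth run."""
    host_arg = ",".join(f"{host}:{procs_per_node}" for host in hosts)
    total_procs = len(hosts) * procs_per_node
    return [
        "mpirun", "-np", str(total_procs), "-npernode", str(procs_per_node),
        "--host", host_arg,
        "-mca", "btl_tcp_if_include", interface,
        binary, "-t", testcase,
    ]


cmd = build_mpirun(["node-a", "node-b"], 2,
                   "multinode_device_to_device_memcpy_read_ce")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run` avoids shell quoting altogether, so the host argument never needs quotation marks.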
D2D bidirectional read bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_bidirectional_memcpy_read_ce
Running multinode_device_to_device_bidirectional_memcpy_read_ce.
memcpy CE GPU(row) <-> GPU(column) Read1 bandwidth (GB/s)
           0         1         2         3
 0       N/A     94.05     96.31     96.31
 1     94.05       N/A     96.31     96.31
 2     96.31     96.31       N/A     94.05
 3     96.31     96.31     94.04       N/A

SUM multinode_device_to_device_bidirectional_memcpy_read_ce_read1 1146.66

memcpy CE GPU(row) <-> GPU(column) Read2 bandwidth (GB/s)
           0         1         2         3
 0       N/A     94.04     96.30     96.31
 1     94.05       N/A     96.29     96.31
 2     96.31     96.31       N/A     94.04
 3     96.31     96.31     94.04       N/A

SUM multinode_device_to_device_bidirectional_memcpy_read_ce_read2 1146.63

memcpy CE GPU(row) <-> GPU(column) Total bandwidth (GB/s)
           0         1         2         3
 0       N/A    188.09    192.61    192.62
 1    188.10       N/A    192.60    192.62
 2    192.62    192.62       N/A    188.09
 3    192.62    192.62    188.09       N/A

SUM multinode_device_to_device_bidirectional_memcpy_read_ce_total 2293.29
D2D unidirectional write bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_memcpy_write_ce
Running multinode_device_to_device_memcpy_write_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
           0         1         2         3
 0       N/A    383.86    383.87    383.86
 1    383.85       N/A    383.87    383.86
 2    383.80    383.82       N/A    383.82
 3    383.82    383.82    383.84       N/A

SUM multinode_device_to_device_memcpy_write_ce 4606.09
D2D bidirectional write bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_bidirectional_memcpy_write_ce
Running multinode_device_to_device_bidirectional_memcpy_write_ce.
memcpy CE GPU(row) <-> GPU(column) Read1 bandwidth (GB/s)
           0         1         2         3
 0       N/A    376.64    376.25    376.26
 1    376.64       N/A    376.22    376.23
 2    376.24    376.24       N/A    376.63
 3    376.25    376.26    376.68       N/A

SUM multinode_device_to_device_bidirectional_memcpy_write_ce_write1 4516.53

memcpy CE GPU(row) <-> GPU(column) Read2 bandwidth (GB/s)
           0         1         2         3
 0       N/A    376.64    376.42    376.44
 1    376.63       N/A    376.43    376.46
 2    376.30    376.30       N/A    376.55
 3    376.24    376.31    376.65       N/A

SUM multinode_device_to_device_bidirectional_memcpy_write_ce_write2 4517.37

memcpy CE GPU(row) <-> GPU(column) Total bandwidth (GB/s)
           0         1         2         3
 0       N/A    753.28    752.67    752.70
 1    753.27       N/A    752.65    752.68
 2    752.55    752.54       N/A    753.18
 3    752.50    752.56    753.33       N/A

SUM multinode_device_to_device_bidirectional_memcpy_write_ce_total 9033.90

4.2.2 Streaming Multiprocessor (SM) Copy

D2D unidirectional read bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_memcpy_read_sm
Running multinode_device_to_device_memcpy_read_sm.
memcpy SM GPU(row) -> GPU(column) bandwidth (GB/s)
           0         1         2         3
 0       N/A    351.95    351.66    351.90
 1    351.94       N/A    351.67    351.88
 2    351.73    351.90       N/A    352.01
 3    351.70    351.87    351.75       N/A

SUM multinode_device_to_device_memcpy_read_sm 4221.96
D2D bidirectional read bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_bidirectional_memcpy_read_sm
Running multinode_device_to_device_bidirectional_memcpy_read_sm.
memcpy SM GPU(row) <-> GPU(column) Read1 bandwidth (GB/s)
           0         1         2         3
 0       N/A    310.79    311.02    311.40
 1    311.10       N/A    311.12    311.18
 2    311.35    310.65       N/A    310.67
 3    311.19    311.05    310.94       N/A

SUM multinode_device_to_device_bidirectional_memcpy_read_sm_read1 3732.47

memcpy SM GPU(row) <-> GPU(column) Read2 bandwidth (GB/s)
           0         1         2         3
 0       N/A    311.08    312.20    311.96
 1    311.16       N/A    311.24    311.35
 2    311.14    311.06       N/A    311.16
 3    311.79    311.24    310.99       N/A

SUM multinode_device_to_device_bidirectional_memcpy_read_sm_read2 3736.38

memcpy SM GPU(row) <-> GPU(column) Total bandwidth (GB/s)
           0         1         2         3
 0       N/A    621.88    623.22    623.37
 1    622.26       N/A    622.36    622.53
 2    622.50    621.72       N/A    621.83
 3    622.97    622.29    621.93       N/A

SUM multinode_device_to_device_bidirectional_memcpy_read_sm_total 7468.85
D2D unidirectional write bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_memcpy_write_sm
Running multinode_device_to_device_memcpy_write_sm.
memcpy SM GPU(row) <- GPU(column) bandwidth (GB/s)
           0         1         2         3
 0       N/A    341.81    341.80    341.69
 1    341.73       N/A    341.87    341.63
 2    342.30    341.87       N/A    342.02
 3    342.05    342.32    342.17       N/A

SUM multinode_device_to_device_memcpy_write_sm 4103.26
D2D bidirectional write bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_bidirectional_memcpy_write_sm
Running multinode_device_to_device_bidirectional_memcpy_write_sm.
memcpy SM GPU(row) <-> GPU(column) Write1 bandwidth (GB/s)
           0         1         2         3
 0       N/A    331.61    331.43    331.65
 1    331.67       N/A    331.81    331.77
 2    332.03    332.10       N/A    332.01
 3    331.85    331.84    331.76       N/A

SUM multinode_device_to_device_bidirectional_memcpy_write_sm_write1 3981.53

memcpy SM GPU(row) <-> GPU(column) Write2 bandwidth (GB/s)
           0         1         2         3
 0       N/A    331.75    332.11    332.07
 1    331.81       N/A    331.99    332.20
 2    331.48    331.77       N/A    332.05
 3    331.47    331.85    332.10       N/A

SUM multinode_device_to_device_bidirectional_memcpy_write_sm_write2 3982.65

memcpy SM GPU(row) <-> GPU(column) Total bandwidth (GB/s)
           0         1         2         3
 0       N/A    663.37    663.54    663.72
 1    663.48       N/A    663.80    663.97
 2    663.51    663.86       N/A    664.05
 3    663.32    663.69    663.86       N/A

SUM multinode_device_to_device_bidirectional_memcpy_write_sm_total 7964.18
ALL2One unidirectional write bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_all_to_one_write_sm
Running multinode_device_to_device_all_to_one_write_sm.
memcpy SM All Gpus -> GPU(column) total bandwidth (GB/s)
           0         1         2         3
 0    352.59    352.37    353.21    352.58

SUM multinode_device_to_device_all_to_one_write_sm 1410.75
ALLFromOne unidirectional read bandwidth test
mpirun -np 4 -npernode 2 --host "$host1:2,$host2:2" -mca btl_tcp_if_include bond0 ./sailbandwidth -t multinode_device_to_device_all_from_one_read_sm
Running multinode_device_to_device_all_from_one_read_sm.
memcpy SM All Gpus <- GPU(column) total bandwidth (GB/s)
           0         1         2         3
 0    354.36    354.40    354.32    354.43

SUM multinode_device_to_device_all_from_one_read_sm 1417.51