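PerfTracker is an online performance analysis and diagnosis tool for large model training, built on high-precision online monitoring of the full software and hardware stack. When a job's performance degrades, PerfTracker collects the execution records of all CUDA kernels and Python functions on every worker, together with hardware monitoring records, while the job is still running. It then generates an analysis report and automatically diagnoses the cause of the performance loss, for example by locating slow nodes, identifying bottleneck functions or functions with abnormal execution time, and analyzing hangs. This topic describes how to use PerfTracker.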
Limits
PerfTracker currently has the following limits:
Limit type | Description |
Region | China (Ulanqab) |
Framework | PyTorch |
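PerfTracker features
Key features
Online collection of job execution records: When a job's performance degrades, PerfTracker collects the function execution records of all workers online (CUDA kernels, Python functions, and all other function types), together with high-precision monitoring data (100-microsecond granularity) for hardware such as GPUs, NVLink, PCIe, and DRAM.
Function-level performance analysis: PerfTracker processes this high-precision software and hardware monitoring data centrally, generates a performance report for each function, and automatically diagnoses the cause of the performance loss, including slow nodes and bottleneck functions or functions with abnormal execution time. The reports also give you a basis for deeper manual analysis and help you decide where to optimize.
Solution
Fine-grained collection of function execution records from live training jobs: Information is captured online at fine granularity instead of being reproduced offline, which improves timeliness and accuracy.
Efficient performance analysis algorithms based on multi-node function execution records: The experience of manual performance analysis is encoded into automated diagnosis algorithms, which makes performance analysis and problem localization efficient.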
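How it works
PerfTracker consists of two components: the Collector and the Analyzers. The Collector runs in the customer's job container and is fully independent of the training processes. The Analyzers run in the AIMaster container and also provide a visualization page. The following figure shows how PerfTracker works:
PerfTracker Collector: Monitors full-stack information online at very high precision and uses the Torch profiler API and nsys to collect the raw monitoring data. It collects the following types of data:
Execution records of CUDA kernels at job run time (including compute kernels and communication kernels), kernel launches to the GPU, GPU memory operations, Python functions, and all other functions. These records are used for code-level performance analysis and record program behavior with 100% accuracy.
Monitoring metrics for hardware such as GPUs, NVLink, PCIe, and DRAM at 100-microsecond precision.
The following figures show the collected data:
CUDA kernels and GPU memory operations
Python functions and GPU kernel launches
Hardware monitoring information
PerfTracker Analyzers: Aggregate and analyze the data above, and generate the performance analysis report and its visualization.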
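Use PerfTracker
Preparations
If you download and use the AIMaster installation package, you can view the analysis results either in AI Master or through local visualization. Otherwise, only local visualization is available.
Before you submit the training job, download the PerfTracker installation package and, optionally, the AIMaster installation package to your local environment. This avoids slow downloads caused by many concurrent download requests.
Note: You can download the packages directly from the command line, or open the links in the commands in a browser.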
Download PerfTracker to a directory of your choice (for example, /cpfs01/perftracker; replace it with a directory that exists in your environment):
    wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/c4d_perftracker_collector-1.4.0-py3-none-any.whl
(Optional) Download the AIMaster package that matches the Python version of your training environment to a directory of your choice (for example, /cpfs01/perftracker; replace it with a directory that exists in your environment). This topic uses Python 3.10 as an example:
Python version | AIMaster download command |
3.6 | wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/aimaster_wheel/pai_aimaster-1.5.0-cp36-cp36m-linux_x86_64.whl |
3.8 | wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/aimaster_wheel/pai_aimaster-1.5.0-cp38-cp38-linux_x86_64.whl |
3.9 | wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/aimaster_wheel/pai_aimaster-1.5.0-cp39-cp39-linux_x86_64.whl |
3.10 | wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/aimaster_wheel/pai_aimaster-1.5.0-cp310-cp310-linux_x86_64.whl |
3.11 | wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/aimaster_wheel/pai_aimaster-1.5.0-cp311-cp311-linux_x86_64.whl |
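The AIMaster download links follow a regular naming pattern, so the link for a given Python version can be derived programmatically. The following is a minimal sketch for illustration only: `aimaster_wheel_url`, `BASE_URL`, and `ABI_TAGS` are names introduced here, not part of PerfTracker or AIMaster, and the mapping covers only the five Python versions listed in this topic.

```python
# Base location and version copied from the AIMaster download commands in this topic.
BASE_URL = (
    "https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com"
    "/files/aimaster_wheel"
)
AIMASTER_VERSION = "1.5.0"

# ABI tags exactly as they appear in the listed wheel file names.
# Python 3.6 uses "cp36m"; the other versions repeat the interpreter tag.
ABI_TAGS = {
    (3, 6): "cp36m",
    (3, 8): "cp38",
    (3, 9): "cp39",
    (3, 10): "cp310",
    (3, 11): "cp311",
}


def aimaster_wheel_url(major: int, minor: int) -> str:
    """Return the AIMaster wheel URL for a listed Python version."""
    abi_tag = ABI_TAGS.get((major, minor))
    if abi_tag is None:
        raise ValueError(f"No AIMaster wheel is listed for Python {major}.{minor}")
    python_tag = f"cp{major}{minor}"
    file_name = (
        f"pai_aimaster-{AIMASTER_VERSION}-{python_tag}-{abi_tag}-linux_x86_64.whl"
    )
    return f"{BASE_URL}/{file_name}"


# Python 3.10 is the example version used in this topic.
print(aimaster_wheel_url(3, 10))
```

Raising an error for unlisted versions is deliberate: guessing a file name for a version without a published wheel would only fail later, at download time.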
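Prepare the training code: import the PerfTracker module and mark steps.
Import the PerfTracker module at the top of the training code. For example:
    try:
        from c4d_perftracker_collector.PerfTracker import PerfTracker
        my_tracer = PerfTracker()
    except Exception:
        # Keep training without tracing if PerfTracker cannot be loaded
        my_tracer = None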
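To have PerfTracker start a diagnosis automatically when it detects a performance problem, instead of relying on a manual trigger, set enable_auto_trigger=True when you create the tracer:
    try:
        from c4d_perftracker_collector.PerfTracker import PerfTracker
        my_tracer = PerfTracker(enable_auto_trigger=True)
    except Exception:
        # Keep training without tracing if PerfTracker cannot be loaded
        my_tracer = None
When enable_auto_trigger=True is set, PerfTracker automatically triggers a diagnosis in the following cases:
When it detects that training performance has dropped by more than 20%.
When it detects that the interval between two steps exceeds 28 minutes. As a result, any other activity that takes more than 28 minutes between two iterations, such as saving a checkpoint or running validation, also triggers a diagnosis.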
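The two trigger conditions can be expressed as a small decision function. This is a sketch under stated assumptions, not PerfTracker's implementation: `should_trigger` and its parameters are hypothetical, and "performance" is assumed here to mean steps per second measured against a baseline step time. Only the 20% and 28-minute thresholds come from this topic.

```python
from typing import Optional

DEGRADATION_THRESHOLD = 0.20    # performance drop of more than 20%
MAX_STEP_GAP_SECONDS = 28 * 60  # more than 28 minutes between two steps


def should_trigger(
    baseline_step_s: float,
    current_step_s: float,
    gap_since_last_step_s: float,
) -> Optional[str]:
    """Return the reason for an automatic diagnosis, or None if none applies."""
    if baseline_step_s <= 0 or current_step_s <= 0:
        raise ValueError("step times must be positive")
    if gap_since_last_step_s > MAX_STEP_GAP_SECONDS:
        return "step gap"
    # Assumption: performance is steps per second, so the relative drop is
    # 1 - (baseline step time / current step time).
    drop = 1.0 - baseline_step_s / current_step_s
    if drop > DEGRADATION_THRESHOLD:
        return "performance drop"
    return None


print(should_trigger(1.0, 1.3, 1.3))      # step time grew from 1.0 s to 1.3 s
print(should_trigger(1.0, 1.0, 30 * 60))  # 30 minutes since the last step
```

The gap rule explains why a long checkpoint save or validation pass between two iterations is reported even when the training steps themselves are healthy.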
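Mark steps in the training code.
PerfTracker requires that you mark steps in the training code. PerfTracker records every call to my_tracer.step() and uses these marks in the background to control how many iterations are collected.
    while iteration < args.train_iters:
        ...  # training code
        if my_tracer is not None:
            my_tracer.step()  # mark one step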
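If you prefer not to repeat the None check at every call site, one option is a no-op fallback object. The following sketch assumes only what this topic shows about the PerfTracker API (the constructor and `step()`); `NoOpTracer` and `make_tracer` are hypothetical helpers introduced for illustration.

```python
class NoOpTracer:
    """Hypothetical stand-in used when PerfTracker cannot be loaded."""

    def step(self) -> None:
        # Intentionally does nothing, so training runs unchanged.
        pass


def make_tracer(**kwargs):
    """Return a PerfTracker instance if available, otherwise a NoOpTracer."""
    try:
        from c4d_perftracker_collector.PerfTracker import PerfTracker
        return PerfTracker(**kwargs)
    except Exception:
        return NoOpTracer()


my_tracer = make_tracer()

for iteration in range(3):
    pass  # training code goes here
    my_tracer.step()  # always safe to call, with or without PerfTracker
```

With this pattern the training loop stays identical whether or not the collector package is installed in the image.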
A minimal training.py that combines the import and the step marking looks as follows:
    import torch
    import time
    import torch.distributed as dist
    import argparse

    try:
        from c4d_perftracker_collector.PerfTracker import PerfTracker
        my_tracer = PerfTracker()
    except Exception:
        # Keep training without tracing if PerfTracker cannot be loaded
        my_tracer = None

    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())

    # Check whether CUDA is available
    if torch.cuda.is_available():
        print("CUDA is available!")
        device = torch.device('cuda')  # use the default CUDA device
    else:
        print("CUDA is not available.")
        device = torch.device('cpu')  # fall back to the CPU without CUDA

    def matmul():
        matrix_a = torch.randn(1000, 1000)
        matrix_b = torch.randn(1000, 1000)
        # Move the matrices to the selected device
        matrix_a = matrix_a.to(device)
        matrix_b = matrix_b.to(device)
        # Run the matrix multiplication
        result = torch.matmul(matrix_a, matrix_b)
        result_cpu = result.to('cpu')
        print(result_cpu)
        del matrix_a, matrix_b, result
        torch.cuda.empty_cache()

    for i in range(1000):
        matmul()
        time.sleep(dist.get_rank())  # each rank sleeps for a different time
        print("Epoch:", i)
        if my_tracer is not None:
            my_tracer.step()  # mark one step
        dist.barrier()
Upload the training code file (training.py) and the installation packages (c4d_perftracker_collector-1.4.0-py3-none-any.whl and, optionally, pai_aimaster-1.5.0-cp310-cp310-linux_x86_64.whl) to an Object Storage Service (OSS) bucket.
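Create a training job
Enable PerfTracker when you create the training job. The key parameters are described below. After you configure the parameters, click OK.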
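Parameter | Description |
Environment information |
Node image | Select an image with PyTorch 2.0 or later. This example uses easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04. |
Direct mount | Click OSS, select the OSS directory that contains the training code file and the installation packages, and configure the mount path. This example uses /mnt/data/. |
Startup command | Enter the following commands: |
    # Install PerfTracker
    pip3 install /mnt/data/c4d_perftracker_collector-1.4.0-py3-none-any.whl -i https://mirrors.aliyun.com/pypi/simple/
    # (Optional) Install the AIMaster client (Python 3.10 in this example)
    pip3 install /mnt/data/pai_aimaster-1.5.0-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple/
    pip install -U protobuf==3.20.0
    # Run the training code (training.py in this example)
    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=4 /mnt/data/training.py
Here, /mnt/data/ is the dataset mount path. You can also put the commands that download the PerfTracker and AIMaster installation packages directly in the startup command, but downloads may be slow because of the high concurrency. In this topic, the packages are downloaded during the preparation stage.
Resource information |
Framework | Select PyTorch. |
Job resource | Select a resource specification with at least four GPUs, for example ecs.gn6e-c12g1.12xlarge. |
Fault tolerance and diagnosis |
Automatic fault tolerance | Turn on the automatic fault tolerance switch and set the other configuration parameter to --enable-perftracker=True to enable PerfTracker automatic diagnosis. |
While the job is running, click the name of the job. In the instance section of the overview tab, click Enter Container in the actions column of the master instance, and then run the following commands as needed.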
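Analysis mode: diagnoses the cause when job performance does not meet expectations.
The following command generates and saves the analysis results without saving the raw traces:
    c4d_perftracker --trigger-on --auto-analyze --output-dir /path/to/trace
After the analysis results are saved, you can view the analysis report through local visualization.
If you have enough storage space (for example, on CPFS or OSS), the following command is recommended instead. It archives the raw traces in addition to the analysis results, so that you can manually verify a diagnosis afterwards. Note that the trace of a single worker is usually larger than several hundred MB. You can delete the traces manually after the diagnosis is complete.
    c4d_perftracker --trigger-on --auto-analyze --output-dir /path/to/trace --save-raw-trace all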
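Here, /path/to/trace is the directory in which the analysis results and raw traces are saved. If you omit it, the default is /tmp/perftracker/output. You can set it to the dataset mount directory of the job (for example, /mnt/data/) so that the raw traces are saved to the dataset. The .json files can be visualized in Perfetto, and the .nsys-rep files can be visualized with Nsight Systems, which is free software.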
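Because raw traces are large, it can help to check what was written before copying anything. The following sketch lists trace files by extension, together with the viewer named in this topic and the file size. `list_traces` and `VIEWERS` are hypothetical names introduced here. The layout below the output directory is not documented in this topic, so the sketch simply searches recursively, and any other .json file in the directory is listed as well.

```python
from pathlib import Path
from typing import List, Tuple

# File extension -> viewer, as described in this topic.
VIEWERS = {".json": "Perfetto", ".nsys-rep": "Nsight Systems"}


def list_traces(output_dir: str) -> List[Tuple[str, str, int]]:
    """Return (path, viewer, size in bytes) for each trace-like file."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    rows = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        for extension, viewer in VIEWERS.items():
            if path.name.endswith(extension):
                rows.append((str(path), viewer, path.stat().st_size))
    return rows


# /tmp/perftracker/output is the default output directory.
for path, viewer, size in list_traces("/tmp/perftracker/output"):
    print(f"{size / 1e6:10.1f} MB  {viewer:15s} {path}")
```

Summing the size column gives a quick estimate of how much storage the raw traces occupy before you decide whether to keep or delete them.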
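Snapshot mode
Snapshot mode records which function (including the call stack) each worker is executing at a given instant, automatically identifies the few workers that differ most from the majority, and outputs what those workers are doing at that moment. It is mostly used to analyze job hangs.
    c4d_perftracker --trigger-on --auto-analyze --hang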
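The idea behind snapshot mode, comparing each worker against what most workers are doing, can be illustrated with a simple majority vote over call stacks. This is a simplified sketch and not PerfTracker's algorithm: `find_outliers`, the worker names, and the call stacks are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

CallStack = Tuple[str, ...]


def find_outliers(stacks: Dict[str, CallStack]) -> List[str]:
    """Return the workers whose call stack differs from the most common one."""
    if not stacks:
        return []
    majority_stack, _ = Counter(stacks.values()).most_common(1)[0]
    return sorted(w for w, stack in stacks.items() if stack != majority_stack)


# Hypothetical snapshot: three workers wait in a collective, one is still
# loading data, which would stall the other three.
snapshot = {
    "worker0": ("train_step", "backward_step", "all_reduce"),
    "worker1": ("train_step", "backward_step", "all_reduce"),
    "worker2": ("train_step", "backward_step", "all_reduce"),
    "worker3": ("train_step", "get_batch", "broadcast_data"),
}
print(find_outliers(snapshot))
```

In a hang, the workers in the majority are typically blocked waiting, so the minority workers are the natural place to start looking.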
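In addition, PerfTracker provides many more parameters and supports interactive queries on the diagnosis results, which helps you locate the root cause of a performance problem.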
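View the analysis results
AI Master
If you downloaded and installed the optional AI Master components in the preparation steps, the analysis results are written to the log of the aimaster instance. Follow the steps in the following figure to view the log.
PerfTracker outputs a performance report for every function that has a noticeable impact on job performance and indicates whether its performance is abnormal. Functions are presented in the following categories:
GPU compute functions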
GPU Compute:
[2025-03-04 06:04:00,046 PerfTracker] (compute_functions.py 131) INFO: {
"min/median/max GPU utilization (in [0,1])": [
0.27586059769318555,
0.28605496203987174,
0.2945494558115959
],
"workers with abnormal GPU utilization": {},
"major_kernel_executions": {
"void multi_tensor_apply_kernel<TensorListMetadata<4>, AdamFunctor<float, float, int>, float, float, float, float, float, float, adamMode_t, float>(long, int volatile*, TensorListMetadata<4>, AdamFunctor<float, float, int>, float, float, float, float, float, float, adamMode_t, float)320_1_1|512_1_1": {
"median cost per execution (ms)": 403.7,
"bottleneck ratio (in [0,1])": 0.01608086667957405
},
"sm80_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize160x128x32_stage4_warpsize2x2x1_tensor16x8x16_kernel7_16_1|128_1_1": {
"median cost per execution (ms)": 130.0,
"bottleneck ratio (in [0,1])": 0.015779752711771233
},
"ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nt16_32_1|128_1_1": {
"median cost per execution (ms)": 132.60000000000002,
"bottleneck ratio (in [0,1])": 0.013880912782219888
},
"void (anonymous namespace)::indexing_backward_kernel<c10::Half, 4>(long const*, long const*, c10::Half const*, c10::Half*, long, long, long, long, bool)256_16_1|32_4_1": {
"median cost per execution (ms)": 1202.25,
"bottleneck ratio (in [0,1])": 0.012148757934008617
},
"ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nt16_24_1|128_1_1": {
"median cost per execution (ms)": 105.6,
"bottleneck ratio (in [0,1])": 0.005656117080836238
}
},
"workers with potential GPU issues": [],
"detailed report": {}
}
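Report interpretation:
"min/median/max GPU utilization (in [0,1])" shows that across all workers of the job, the GPU utilization is at most 29.5% and at least 27.6%, with a median of 28.6%.
"workers with abnormal GPU utilization" is empty, which means that no worker is a significant outlier in GPU utilization. If it were not empty, it would list the outlier workers and their GPU utilization.
"major_kernel_executions" lists the GPU kernels with the longest total execution time, including the median time per execution (median cost per execution) and the share of end-to-end time (bottleneck ratio).
"workers with potential GPU issues" lists the workers on which GPU kernels run slowly. An empty list means that all workers are normal.
"detailed report" is filled in when "workers with potential GPU issues" is not empty. It lists which kernel on which worker is slower than on normal workers, and by how much.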
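Once the JSON body of the GPU compute report has been copied out of the log, the fields can be read programmatically. The following sketch assumes the JSON has already been extracted into a string. The numbers are taken from the sample report, the kernel names are shortened for readability, and only three of the listed kernels are included.

```python
import json

# JSON body of the GPU compute report; kernel names shortened here.
report = json.loads("""
{
  "min/median/max GPU utilization (in [0,1])": [
    0.27586059769318555, 0.28605496203987174, 0.2945494558115959
  ],
  "workers with abnormal GPU utilization": {},
  "major_kernel_executions": {
    "multi_tensor_apply_kernel (Adam)": {
      "median cost per execution (ms)": 403.7,
      "bottleneck ratio (in [0,1])": 0.01608086667957405
    },
    "sm80_xmma_gemm": {
      "median cost per execution (ms)": 130.0,
      "bottleneck ratio (in [0,1])": 0.015779752711771233
    },
    "indexing_backward_kernel": {
      "median cost per execution (ms)": 1202.25,
      "bottleneck ratio (in [0,1])": 0.012148757934008617
    }
  },
  "workers with potential GPU issues": []
}
""")

# Convert the utilization triple to percentages, rounded to one decimal.
utilization_pct = [
    round(value * 100, 1)
    for value in report["min/median/max GPU utilization (in [0,1])"]
]
print("min/median/max GPU utilization (%):", utilization_pct)

# Rank kernels by their share of end-to-end time.
ranked = sorted(
    report["major_kernel_executions"].items(),
    key=lambda item: item[1]["bottleneck ratio (in [0,1])"],
    reverse=True,
)
for name, stats in ranked:
    print(f'{stats["bottleneck ratio (in [0,1])"]:.2%}  {name}')

print("slow workers:", report["workers with potential GPU issues"] or "none")
```

Ranking by bottleneck ratio rather than by time per execution matters here: the kernel with the longest single execution in this sample is not the one with the largest end-to-end share.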
GPU memory operation functions
GPU memory operations:
[2025-03-04 06:04:00,048 PerfTracker] (gpu_mem.py 37) INFO: {
"Memcpy DtoD (Device -> Device)": {
"avg bottleneck ratio (in [0,1])": 0.010486858246092,
"abnormal_workers": {
"job_x08j11173.cloud.sqa.na131_2_122482.json": 0.010614755325049817,
"job_x08j11173.cloud.sqa.na131_8_122483.json": 0.0105935370201344,
"job_x08j11173.cloud.sqa.na131_1_122484.json": 0.010571838319204434,
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.010551186610995748,
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.010408514784026183,
"job_x08j11173.cloud.sqa.na131_5_122489.json": 0.010394903160689894,
"job_x08j11173.cloud.sqa.na131_8_122486.json": 0.010387693451926115,
"job_x08j11173.cloud.sqa.na131_9_122488.json": 0.010372437296709398
}
}
}
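Report interpretation:
"avg bottleneck ratio (in [0,1])" shows that Memcpy DtoD accounts for about 1.05% of the time on average during the monitoring period.
"abnormal_workers" shows that the Memcpy DtoD time is abnormal on eight workers. A GPU memory operation function is considered abnormal when its bottleneck ratio (its run time excluding the part that overlaps with computation) exceeds 0.01 (1%).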
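The 1% rule can be applied to any set of per-worker bottleneck ratios. The sketch below uses two values from the sample report plus one hypothetical worker below the threshold. `flag_abnormal` is a name introduced here for illustration, not a PerfTracker function.

```python
from typing import Dict

ABNORMAL_THRESHOLD = 0.01  # bottleneck ratio above 1% counts as abnormal


def flag_abnormal(
    ratios: Dict[str, float], threshold: float = ABNORMAL_THRESHOLD
) -> Dict[str, float]:
    """Return workers above the threshold, largest ratio first."""
    flagged = {worker: r for worker, r in ratios.items() if r > threshold}
    return dict(sorted(flagged.items(), key=lambda item: item[1], reverse=True))


memcpy_dtod = {
    "job_x08j11173.cloud.sqa.na131_9_122488.json": 0.010372437296709398,
    "job_x08j11173.cloud.sqa.na131_2_122482.json": 0.010614755325049817,
    # Hypothetical worker, added only to show a value below the threshold.
    "hypothetical_worker.json": 0.004,
}

for worker, ratio in flag_abnormal(memcpy_dtod).items():
    print(f"{ratio:.3%}  {worker}")
```

Note that in the sample report all eight workers sit just above 1%, which points to a uniform cost across the job rather than to a single slow worker.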
Collective communication
Communication:
{
"nvlink ring send": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
"different from other workers": 0,
"features": {
"bottleneck ratio (in [0,1])": 0.2743985289797289,
"avg throughput (%)": 73.75921390374332,
"throughput std (%)": 11.384679144385027
}
},
"abnormal_workers": []
}
},
"nvlink ring recv": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_3_66933.json",
"different from other workers": 2,
"features": {
"bottleneck ratio (in [0,1])": 0.27346865947352955,
"avg throughput (%)": 72.70337362637363,
"throughput std (%)": 12.658093406593407
}
},
"abnormal_workers": []
}
},
"pcie sendrecv send": {
"ncclDevKernel_SendRecv(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
"different from other workers": 3,
"features": {
"bottleneck ratio (in [0,1])": 0.07248997985478658,
"avg throughput (%)": 46.667,
"throughput std (%)": 14.636000000000001
}
},
"abnormal_workers": []
}
},
"pcie sendrecv recv": {
"ncclDevKernel_SendRecv(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_7_66936.json",
"different from other workers": 1,
"features": {
"bottleneck ratio (in [0,1])": 0.0643436909425455,
"avg throughput (%)": 54.833333333333336,
"throughput std (%)": 14.166666666666666
}
},
"abnormal_workers": []
}
},
"pcie ring send": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
"different from other workers": 0,
"features": {
"bottleneck ratio (in [0,1])": 0.2743985289797289,
"avg throughput (%)": 41.36698734177215,
"throughput std (%)": 14.653768987341774
}
},
"abnormal_workers": []
}
},
"pcie ring recv": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"example_of_normal_worker": {
"worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
"different from other workers": 0,
"features": {
"bottleneck ratio (in [0,1])": 0.2743985289797289,
"avg throughput (%)": 41.5311475409836,
"throughput std (%)": 15.282721311475411
}
},
"abnormal_workers": []
}
}
}
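The report groups collective communication functions by communication type and then outputs a performance analysis for each function, where:
"example_of_normal_worker" lists the typical performance of the function, including the bottleneck ratio (its share of end-to-end time, excluding the time that overlaps with computation), avg throughput, and throughput std.
"abnormal_workers", if not empty, lists every worker on which this communication function performs abnormally, together with its performance metrics.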
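The communication report is nested by communication type and then by function, so a short traversal can flatten it into one row per entry. The sketch below is illustrative: `summarize` is a hypothetical helper, and the input contains two entries from the sample report with values rounded for brevity.

```python
from typing import Any, Dict, List


def summarize(comm_report: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten the report into one row per (communication type, function)."""
    rows = []
    for channel, functions in comm_report.items():
        for function, entry in functions.items():
            features = entry["example_of_normal_worker"]["features"]
            rows.append({
                "channel": channel,
                # Drop the argument list to keep the kernel name readable.
                "function": function.split("(")[0],
                "bottleneck": features["bottleneck ratio (in [0,1])"],
                "avg_throughput_pct": features["avg throughput (%)"],
                "abnormal_count": len(entry["abnormal_workers"]),
            })
    return rows


sample = {
    "nvlink ring send": {
        "ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
            "example_of_normal_worker": {
                "features": {
                    "bottleneck ratio (in [0,1])": 0.2744,
                    "avg throughput (%)": 73.76,
                    "throughput std (%)": 11.38,
                }
            },
            "abnormal_workers": [],
        }
    },
    "pcie sendrecv send": {
        "ncclDevKernel_SendRecv(ncclDevComm*, unsigned long, ncclWork*)": {
            "example_of_normal_worker": {
                "features": {
                    "bottleneck ratio (in [0,1])": 0.0725,
                    "avg throughput (%)": 46.67,
                    "throughput std (%)": 14.64,
                }
            },
            "abnormal_workers": [],
        }
    },
}

rows = summarize(sample)
for row in rows:
    print(row)
```

A non-zero `abnormal_count` on a single communication type is a quick hint about which link (NVLink or PCIe) to inspect first.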
CUDA runtime
CUDA Runtime:
[2025-03-04 06:04:00,047 PerfTracker] (cuda_runtimes.py 43) INFO: {
"cudaLaunchKernel": {
"avg bottleneck ratio (in [0,1])": 0.039727736621541394,
"avg execution time / monitoring duration (in [0,1])": 0.06956947111288565,
"abnormal_workers": {
"job_x08j11173.cloud.sqa.na131_5_122489.json": 0.05342638907019616,
"job_x08j11173.cloud.sqa.na131_8_122483.json": 0.05125160206973098,
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.04770049253555521,
"job_x08j11173.cloud.sqa.na131_8_122486.json": 0.04358845044879828,
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.042635952262081556,
"job_x08j11173.cloud.sqa.na131_9_122488.json": 0.0354174573296689,
"job_x08j11173.cloud.sqa.na131_1_122484.json": 0.023585242093250733,
"job_x08j11173.cloud.sqa.na131_2_122482.json": 0.02021630716304934
}
}
}
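Report interpretation:
"avg bottleneck ratio (in [0,1])" shows that cudaLaunchKernel accounts for 3.97% of the time on average during the monitoring period, excluding the part that overlaps with computation.
"avg execution time / monitoring duration (in [0,1])" shows that cudaLaunchKernel accounts for 6.96% of the time on average when the part that overlaps with computation is included.
"abnormal_workers" shows that the cudaLaunchKernel time is abnormal on eight workers. A CUDA runtime function is considered abnormal when its bottleneck ratio (its run time excluding the part that overlaps with computation) exceeds 0.01 (1%).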
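The two averages differ only in whether overlap with computation is excluded, so together they indicate how much of the cudaLaunchKernel time is hidden behind GPU work. The following sketch derives that share. `hidden_share` is a hypothetical helper, and because both inputs are averages across workers, the result is an approximation.

```python
def hidden_share(bottleneck_ratio: float, execution_ratio: float) -> float:
    """Fraction of a function's execution time that overlaps with computation."""
    if execution_ratio <= 0:
        return 0.0
    return (execution_ratio - bottleneck_ratio) / execution_ratio


# Values from the sample cudaLaunchKernel report.
share = hidden_share(
    bottleneck_ratio=0.039727736621541394,
    execution_ratio=0.06956947111288565,
)
print(f"overlapped with computation: {share:.1%}")
```

For the sample report this comes out at about 43%, meaning that a bit more than half of the launch time is not hidden and contributes to the end-to-end time.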
Python functions
Python functions:
[2025-03-04 06:04:00,048 PerfTracker] (python_functions.py 43) INFO: {
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: backward_step|megatron/core/pipeline_parallel/schedules.py: custom_backward|<built-in method run_backward of torch._C._EngineBase object at 0x>": {
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.16970858578301054,
"job_x08j11173.cloud.sqa.na131_5_122489.json": 0.16821543761561655,
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.16787961852913025,
"job_x08j11173.cloud.sqa.na131_8_122483.json": 0.16769273336153187,
"job_x08j11173.cloud.sqa.na131_8_122486.json": 0.14482595694389258,
"job_x08j11173.cloud.sqa.na131_9_122488.json": 0.10359829140378449,
"job_x08j11173.cloud.sqa.na131_1_122484.json": 0.06543764774209325,
"job_x08j11173.cloud.sqa.na131_2_122482.json": 0.06217541348063737
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|nn.Module: DistributedDataParallel_0|torch/nn/modules/module.py: _call_impl|megatron/core/distributed/distributed_data_parallel.py: forward|nn.Module: Float16Module_0|torch/nn/modules/module.py: _call_impl|megatron/model/module.py: forward|nn.Module: GPTModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/gpt_model.py: forward|nn.Module: TransformerLanguageModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/language_model.py: forward|nn.Module: ParallelTransformer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward": {
"job_x08j11173.cloud.sqa.na131_9_122488.json": 0.02471835416438489,
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.02022024568555683,
"job_x08j11173.cloud.sqa.na131_2_122482.json": 0.015394834126935101,
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.011625367332189284
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step": {
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.012272193902698852
},
"autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward|LinearWithGradAccumulationAndAsyncCommunicationBackward|torch/autograd/function.py: apply|torch/cuda/amp/autocast_mode.py: decorate_bwd|megatron/core/tensor_parallel/layers.py: backward|<built-in method matmul of Tensor object at 0x>|aten::matmul|aten::mm": {
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.014066713574814782,
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.013168949365116213,
"job_x08j11173.cloud.sqa.na131_8_122483.json": 0.013000378454189552,
"job_x08j11173.cloud.sqa.na131_5_122489.json": 0.012500119397472594,
"job_x08j11173.cloud.sqa.na131_8_122486.json": 0.012470581043494208
},
"autograd::engine::evaluate_function: FastLayerNormFNBackward|FastLayerNormFNBackward|torch/autograd/function.py: apply|apex/contrib/layer_norm/layer_norm.py: backward|<built-in method ln_bwd of PyCapsule object at 0x>": {
"job_x08j11173.cloud.sqa.na131_0_122485.json": 0.010127612754279463
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving": {
"job_x08j11173.cloud.sqa.na131_2_122487.json": 0.01041679269251709
},
"autograd::engine::evaluate_function: torch::autograd::AccumulateGrad": {
"job_x08j11173.cloud.sqa.na131_8_122486.json": 0.013633967050768714
}
}
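The report lists every Python function whose share of execution time exceeds 1%, excluding the time that overlaps with GPU computation, communication, and so on. Entries are grouped by function name. Under each function, the report lists every worker on which the share exceeds 1%, together with the function's share of execution time on that worker.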
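The keys in the Python function report are full call stacks joined with "|", which makes them hard to scan. The following sketch shortens each key to its innermost frame and picks the worker with the largest share. `innermost_frame` and `worst_worker` are hypothetical helpers, and the input is a subset of the sample report.

```python
from typing import Dict, Tuple


def innermost_frame(stack_key: str) -> str:
    """Return the last frame of a '|'-separated call stack."""
    return stack_key.split("|")[-1]


def worst_worker(per_worker: Dict[str, float]) -> Tuple[str, float]:
    """Return the worker with the largest execution-time share."""
    worker = max(per_worker, key=per_worker.get)
    return worker, per_worker[worker]


report = {
    "autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward|LinearWithGradAccumulationAndAsyncCommunicationBackward|torch/autograd/function.py: apply|torch/cuda/amp/autocast_mode.py: decorate_bwd|megatron/core/tensor_parallel/layers.py: backward|<built-in method matmul of Tensor object at 0x>|aten::matmul|aten::mm": {
        "job_x08j11173.cloud.sqa.na131_2_122487.json": 0.014066713574814782,
        "job_x08j11173.cloud.sqa.na131_0_122485.json": 0.013168949365116213,
        "job_x08j11173.cloud.sqa.na131_8_122483.json": 0.013000378454189552,
    },
    "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step": {
        "job_x08j11173.cloud.sqa.na131_0_122485.json": 0.012272193902698852,
    },
}

for stack_key, per_worker in report.items():
    worker, ratio = worst_worker(per_worker)
    print(f"{innermost_frame(stack_key)}: {ratio:.2%} on {worker}")
```

Keeping only the innermost frame loses context when the same function is reached through different call paths, so treat the short name as a label and go back to the full key when two entries collide.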
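In addition, PerfTracker outputs the detailed function-level performance metrics of the rank 0 worker as an example. This shows which functions run in the current job and which performance features PerfTracker collects, so that you can query further details (see "Run interactive queries on the performance report" in the appendix).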
{
"pattern": {
"GPU memory operation": {
"details": {
"Memcpy HtoD (Pageable -> Device)": {
"bottleneck": 1.8938920771644163e-05
},
"Memcpy DtoH (Device -> Pageable)": {
"bottleneck": 7.615863884767547e-05
},
"Memcpy DtoD (Device -> Device)": {
"bottleneck": 0.007481276660560544
},
"Memset (Device)": {
"bottleneck": 0.0005826740305488821
}
},
"summary": {
"bottleneck": 0.008159048250728745
}
},
"communication": {
"details": {
"nvlink sendrecv send": {},
"nvlink sendrecv recv": {},
"nvlink ring send": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"bottleneck": 0.3221236413339125,
"throughput_avg": 72.84106601466992,
"throughput_std": 11.46433251833741,
"comm_duration": 409
}
},
"nvlink ring recv": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"bottleneck": 0.3221236413339125,
"throughput_avg": 72.09130120481929,
"throughput_std": 14.103665060240965,
"comm_duration": 415
}
},
"pcie sendrecv send": {},
"pcie sendrecv recv": {},
"pcie ring send": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"bottleneck": 0.3221236413339125,
"throughput_avg": 43.22107518796992,
"throughput_std": 15.588227067669173,
"comm_duration": 665
}
},
"pcie ring recv": {
"ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
"bottleneck": 0.3221236413339125,
"throughput_avg": 42.52839670658683,
"throughput_std": 17.446902694610777,
"comm_duration": 668
}
},
"others": {}
},
"summary": {
"bottleneck": 0.6444608492212074
}
},
"Python_function": {
"thread71813": {
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: backward_step|megatron/core/pipeline_parallel/schedules.py: custom_backward|<built-in method run_backward of torch._C._EngineBase object at 0x>": {
"bottleneck ratio (in [0,1])": 0.04466643725052002,
"execution time / monitoring duration (in [0,1])": 0.27638775949343625
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|nn.Module: DistributedDataParallel_0|torch/nn/modules/module.py: _call_impl|megatron/core/distributed/distributed_data_parallel.py: forward|nn.Module: Float16Module_0|torch/nn/modules/module.py: _call_impl|megatron/model/module.py: forward|nn.Module: GPTModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/gpt_model.py: forward|nn.Module: TransformerLanguageModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/language_model.py: forward|nn.Module: ParallelTransformer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward": {
"bottleneck ratio (in [0,1])": 0.023676874610643994,
"execution time / monitoring duration (in [0,1])": 0.17463498252380863
},
"megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|nn.Module: DistributedDataParallel_0|torch/nn/modules/module.py: _call_impl|megatron/core/distributed/distributed_data_parallel.py: forward|nn.Module: Float16Module_0|torch/nn/modules/module.py: _call_impl|megatron/model/module.py: forward|nn.Module: GPTModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/gpt_model.py: forward|nn.Module: TransformerLanguageModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/language_model.py: forward|nn.Module: ParallelTransformer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward|nn.Module: ParallelTransformerLayer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward": {
"bottleneck ratio (in [0,1])": 0.007685575231437642,
"execution time / monitoring duration (in [0,1])": 0.01774576876303058
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step": {
"bottleneck ratio (in [0,1])": 0.007029966208129874,
"execution time / monitoring duration (in [0,1])": 0.48153616407069133
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step": {
"bottleneck ratio (in [0,1])": 0.00670236317435654,
"execution time / monitoring duration (in [0,1])": 0.010474029098241421
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|torch/utils/_contextlib.py: decorate_context|megatron/optimizer/optimizer.py: step|megatron/optimizer/optimizer.py: clip_grad_norm": {
"bottleneck ratio (in [0,1])": 0.0061998773402663215,
"execution time / monitoring duration (in [0,1])": 0.01125777805357861
},
"megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|nn.Module: DistributedDataParallel_0|torch/nn/modules/module.py: _call_impl|megatron/core/distributed/distributed_data_parallel.py: forward|nn.Module: Float16Module_0|torch/nn/modules/module.py: _call_impl|megatron/model/module.py: forward|nn.Module: GPTModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/gpt_model.py: forward|nn.Module: TransformerLanguageModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/language_model.py: forward|nn.Module: ParallelTransformer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward|nn.Module: ParallelTransformerLayer_1|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward": {
"bottleneck ratio (in [0,1])": 0.005476168793332855,
"execution time / monitoring duration (in [0,1])": 0.017565244582058315
},
"pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|pretrain_gpt.py: get_batch|megatron/core/tensor_parallel/data.py: broadcast_data": {
"bottleneck ratio (in [0,1])": 0.005102225847033157,
"execution time / monitoring duration (in [0,1])": 0.005930299936091217
}
},
"thread74140": {
"autograd::engine::evaluate_function: torch::autograd::AccumulateGrad": {
"bottleneck ratio (in [0,1])": 0.00576267033947411,
"execution time / monitoring duration (in [0,1])": 0.028585681692349724
},
"autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward|LinearWithGradAccumulationAndAsyncCommunicationBackward|torch/autograd/function.py: apply|torch/cuda/amp/autocast_mode.py: decorate_bwd|megatron/core/tensor_parallel/layers.py: backward|<built-in method matmul of Tensor object at 0x>": {
"bottleneck ratio (in [0,1])": 0.005251722434400817,
"execution time / monitoring duration (in [0,1])": 0.01923912281366278
},
"autograd::engine::evaluate_function: GeLUFunctionBackward": {
"bottleneck ratio (in [0,1])": 0.005025261296663285,
"execution time / monitoring duration (in [0,1])": 0.04228174505633724
}
}
},
"GPU Compute": {
"summary": {
"bottleneck": 0.22080323589594392
},
"details": {
"sm80_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize160x128x32_stage4_warpsize2x2x1_tensor16x8x16_kernel7_16_1|128_1_1": {
"avg_cost": 129.8,
"bottleneck": 0.011231585929107188
},
"ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nt16_32_1|128_1_1": {
"avg_cost": 133.4,
"bottleneck": 0.010950725763621315
},
"ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nt16_24_1|128_1_1": {
"avg_cost": 109.5,
"bottleneck": 0.007726273763308622
},
"void at::native::vectorized_elementwise_kernel<4, at::native::CUDAFunctor_add<c10::Half>, at::detail::Array<char*, 3> >(int, at::native::CUDAFunctor_add<c10::Half>, at::detail::Array<char*, 3>)100864_1_1|128_1_1": {
"avg_cost": 176.0,
"bottleneck": 0.0012749520281166412
},
"void at::native::(anonymous namespace)::multi_tensor_apply_kernel<at::native::(anonymous namespace)::TensorListMetadata<1>, at::native::(anonymous namespace)::UnaryOpFunctor<float, 1, 1, 0>, at::native::_amp_foreach_non_finite_check_and_unscale_cuda_(c10::ArrayRef<at::Tensor>, at::Tensor&, at::Tensor const&)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(float)#1}>(at::native::(anonymous namespace)::TensorListMetadata<1>, at::native::(anonymous namespace)::UnaryOpFunctor<float, 1, 1, 0>, at::native::_amp_foreach_non_finite_check_and_unscale_cuda_(c10::ArrayRef<at::Tensor>, at::Tensor&, at::Tensor const&)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(float)#1})320_1_1|512_1_1": {
"avg_cost": 116.4,
"bottleneck": 0.006687856766956983
},
"void multi_tensor_apply_kernel<TensorListMetadata<4>, AdamFunctor<float, float, int>, float, float, float, float, float, float, adamMode_t, float>(long, int volatile*, TensorListMetadata<4>, AdamFunctor<float, float, int>, float, float, float, float, float, float, adamMode_t, float)320_1_1|512_1_1": {
"avg_cost": 407.1,
"bottleneck": 0.022835502981469677
}
}
},
"cuda_runtime": {
"details": {
"cudaLaunchKernel": {
"bottleneck": 0.011437093367267582,
"proportion": 0.047973495181857964
},
"cudaEventQuery": {
"bottleneck": 0.0003695104329276106,
"proportion": 0.0016428506379998565
},
"cudaMemcpyAsync": {
"bottleneck": 0.0031047741392663465,
"proportion": 0.1264205197968136
},
"cudaStreamSynchronize": {
"bottleneck": 0.0002381468548093979,
"proportion": 0.00025708577558104204
},
"cudaStreamIsCapturing": {
"bottleneck": 8.180001950305883e-05,
"proportion": 0.00041746216849836924
},
"cudaEventRecord": {
"bottleneck": 0.0002449971027480777,
"proportion": 0.0009364691887936391
},
"cudaStreamWaitEvent": {
"bottleneck": 0.00022323749164874186,
"proportion": 0.0010831450858336067
},
"cudaStreamGetCaptureInfo_v2": {
"bottleneck": 3.7877841543288325e-05,
"proportion": 6.326405449251348e-05
},
"cudaLaunchKernelExC": {
"bottleneck": 0.0010468790673347137,
"proportion": 0.002914981975788806
},
"cudaPointerGetAttributes": {
"bottleneck": 1.1685717071865549e-05,
"proportion": 1.6521186205051293e-05
},
"cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags": {
"bottleneck": 6.165223144811824e-05,
"proportion": 0.00035742176009464626
},
"cudaOccupancyMaxActiveBlocksPerMultiprocessor": {
"bottleneck": 0.00017649462336127966,
"proportion": 0.0006084631992592061
},
"cudaFuncSetAttribute": {
"bottleneck": 0.0001652118620505129,
"proportion": 0.0003554069812891522
},
"cudaMemsetAsync": {
"bottleneck": 0.0006604444924409529,
"proportion": 0.0027936922916980637
},
"cudaDeviceSynchronize": {
"bottleneck": 0.0009050386394279318,
"proportion": 0.3238394269646713
},
"cudaPeekAtLastError": {
"bottleneck": 0.0,
"proportion": 0.0
},
"cudaDeviceGetAttribute": {
"bottleneck": 1.611823044395248e-06,
"proportion": 2.01477880549406e-06
}
},
"summary": {
"bottleneck": 0.018766455705893872
}
}
}
}
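Several categories in the rank 0 report carry a "summary" entry with an overall bottleneck value, which gives a quick view of where the time goes. The sketch below collects and ranks those values. `category_bottlenecks` is a hypothetical helper, and the input keeps only the summary values from the sample output. Categories without a summary, such as "Python_function", are skipped.

```python
from typing import Any, Dict, List, Tuple


def category_bottlenecks(pattern: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Return (category, summary bottleneck) pairs, largest first."""
    totals = {}
    for category, body in pattern.items():
        summary = body.get("summary")
        if isinstance(summary, dict) and "bottleneck" in summary:
            totals[category] = summary["bottleneck"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Summary values copied from the sample rank 0 report; details omitted.
pattern = {
    "GPU memory operation": {"summary": {"bottleneck": 0.008159048250728745}},
    "communication": {"summary": {"bottleneck": 0.6444608492212074}},
    "Python_function": {"thread71813": {}},
    "GPU Compute": {"summary": {"bottleneck": 0.22080323589594392}},
    "cuda_runtime": {"summary": {"bottleneck": 0.018766455705893872}},
}

ranked = category_bottlenecks(pattern)
for category, ratio in ranked:
    print(f"{ratio:7.2%}  {category}")
```

In this sample, communication has by far the largest summary value, followed by GPU compute, so communication is where a closer look at the per-function details is most likely to pay off.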
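Local visualization
Generate and save the analysis results to the directory specified by the --output-dir parameter, as described for analysis mode. After the command succeeds, the system creates a <timestamp>/PerfDisplay folder in that directory.
In the container, copy the PerfDisplay folder to the data source mount directory /mnt/data, and then download the PerfDisplay directory to your local machine by following the documentation of the ossutil 2.0 command-line tool.
In a terminal, go to the PerfDisplay folder and run sudo python3 app.py (sudo may not be needed on Linux). Then open http://127.0.0.1:5000/ in a browser to view the job performance report on the visualization page, as shown in the following figure.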
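When several diagnoses have been run, the output directory contains one timestamped folder per run. The following sketch finds the most recent PerfDisplay folder so that you know which one to copy. `latest_perfdisplay` is a hypothetical helper. Because the exact timestamp format of the folder name is not described in this topic, the sketch compares modification times instead of parsing names.

```python
from pathlib import Path
from typing import Optional


def latest_perfdisplay(output_dir: str) -> Optional[Path]:
    """Return the most recently modified <timestamp>/PerfDisplay folder."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = [p for p in root.glob("*/PerfDisplay") if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# /tmp/perftracker/output is the default output directory.
folder = latest_perfdisplay("/tmp/perftracker/output")
if folder is None:
    print("No PerfDisplay folder found")
else:
    print(f"Copy this folder to the mounted directory: {folder}")
```

Modification time can change when a folder is copied, so run the check inside the container before you copy the folder to /mnt/data.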