PerfTracker: an online performance profiling and diagnosis tool

PerfTracker is an online performance profiling and diagnosis tool for large-model training, built on high-precision, full-stack (software and hardware) online monitoring. When a job shows performance problems, it can collect, online, every Worker's CUDA kernel and Python function execution records together with hardware monitoring data, generate an analysis report, and automatically diagnose the cause of the performance loss, such as locating slow nodes, identifying bottleneck or abnormally slow functions, and analyzing hangs. This topic describes how to use PerfTracker.

Limits

Currently, the following limits apply when you use PerfTracker:

| Limit | Description |
| --- | --- |
| Region | China North 6 (Ulanqab) |
| Framework | PyTorch |

PerfTracker features

Main features

  • Online collection of job execution records: when job performance degrades, PerfTracker can collect, online, the execution records of every kind of function (CUDA kernels, Python functions, and so on) on all Workers, together with high-precision (100-microsecond granularity) monitoring data for GPU, NVLink, PCIe, DRAM, and other hardware.

  • Function-level performance analysis: by centrally processing this high-precision software and hardware monitoring data, PerfTracker generates a performance report for each function and automatically diagnoses the causes of performance loss, including slow-node localization and bottleneck or abnormally slow functions. The reports also give deeper manual analysis a solid basis and help determine the direction of performance optimization.

Solution

  • Fine-grained collection of function execution records from live training jobs: information is captured online at fine granularity instead of being reproduced offline, improving timeliness and accuracy.

  • An efficient performance-analysis algorithm over multi-node function execution records: manual performance-analysis experience is codified into automated diagnosis algorithms, enabling efficient performance analysis and problem localization.

How PerfTracker works

PerfTracker consists of two components, the Collector and the Analyzers. The Collector runs inside the customer's job containers and is fully independent of the training processes; the Analyzers run inside the AIMaster container and also serve a visualization page. The following figure shows how PerfTracker works:

image
  • PerfTracker Collector: performs ultra-high-precision, full-stack online monitoring, using the Torch profiler API and nsys to collect the raw monitoring data. It can collect the following kinds of data:

    • Execution records of the job's CUDA kernels (compute kernels, communication kernels, and so on), GPU kernel-launch functions, GPU memory operations, Python functions, and all other functions, used for code-level performance analysis; program behavior is recorded with 100% accuracy.

    • Monitoring metrics at 100-microsecond precision for GPU, NVLink, PCIe, DRAM, and other hardware.

    The collected data looks as follows:

    • CUDA kernels and GPU memory operations

      image

    • Python functions and GPU kernel launches

      image

    • Hardware monitoring data

      image

  • PerfTracker Analyzers: aggregates and analyzes all of the data above, generating the performance analysis report and visualized output.

    image

Use PerfTracker

Preparations

If you download and install the AIMaster package, you can view the analysis results either through AIMaster or through local visualization; otherwise, only local visualization is available.

  1. Before you submit the training job, download the PerfTracker and (optionally) AIMaster packages to local storage, to avoid slow downloads caused by high download concurrency.

    Note

    You can download directly from the command line, or open the links from the commands in a browser.

    • Download PerfTracker to a target directory (for example /cpfs01/perftracker; change it to a directory that exists locally):

      wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/c4d_perftracker_collector-1.4.0-py3-none-any.whl
    • (Optional) Download the AIMaster package that matches the Python version of your training environment to a target directory (for example /cpfs01/perftracker; change it to a directory that exists locally). This topic uses Python 3.10 as an example; if you script the download, see the sketch after the table:

      | Python version | AIMaster download command |
      | --- | --- |
      | 3.6 | wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/aimaster_wheel/pai_aimaster-1.5.0-cp36-cp36m-linux_x86_64.whl |
      | 3.8 | wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/aimaster_wheel/pai_aimaster-1.5.0-cp38-cp38-linux_x86_64.whl |
      | 3.9 | wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/aimaster_wheel/pai_aimaster-1.5.0-cp39-cp39-linux_x86_64.whl |
      | 3.10 | wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/aimaster_wheel/pai_aimaster-1.5.0-cp310-cp310-linux_x86_64.whl |
      | 3.11 | wget -t 5 -w 2 -P /cpfs01/perftracker https://network-research-lingjun-open-oss.oss-cn-hangzhou.aliyuncs.com/files/aimaster_wheel/pai_aimaster-1.5.0-cp311-cp311-linux_x86_64.whl |
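      If you script the download instead of copying a command from the table, the cpXY tag in the wheel filename must match the running interpreter. The following is a minimal Python sketch; the URLs and filenames come from the table above, and the target directory is the example path, which you should adjust to one that exists:

        import sys
        import urllib.request

        # Wheel filenames from the table above; note that cp36 uses the "cp36m" ABI tag.
        BASE = ("https://network-research-lingjun-open-oss.oss-cn-hangzhou"
                ".aliyuncs.com/files/aimaster_wheel/")
        WHEELS = {
            (3, 6): "pai_aimaster-1.5.0-cp36-cp36m-linux_x86_64.whl",
            (3, 8): "pai_aimaster-1.5.0-cp38-cp38-linux_x86_64.whl",
            (3, 9): "pai_aimaster-1.5.0-cp39-cp39-linux_x86_64.whl",
            (3, 10): "pai_aimaster-1.5.0-cp310-cp310-linux_x86_64.whl",
            (3, 11): "pai_aimaster-1.5.0-cp311-cp311-linux_x86_64.whl",
        }

        version = sys.version_info[:2]
        if version not in WHEELS:
            raise SystemExit(f"no AIMaster wheel listed for Python {version[0]}.{version[1]}")

        wheel = WHEELS[version]
        # Download into the example directory used throughout this topic.
        urllib.request.urlretrieve(BASE + wheel, "/cpfs01/perftracker/" + wheel)
        print("downloaded", wheel)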

  2. Prepare the training code: import the PerfTracker module and mark the steps.

    • Import the PerfTracker module at the top of the training code. Example:

      try:
          from c4d_perftracker_collector.PerfTracker import PerfTracker
          my_tracer = PerfTracker()
      except:
          my_tracer = None

      If you want PerfTracker to start a diagnosis automatically when it detects a performance problem (instead of relying on a manually triggered diagnosis), set enable_auto_trigger=True when creating the tracer, as follows:

      try:
          from c4d_perftracker_collector.PerfTracker import PerfTracker
          my_tracer = PerfTracker(enable_auto_trigger=True)
      except:
          my_tracer = None

      With enable_auto_trigger=True, PerfTracker will (see the sketch after this list):

      • Automatically trigger a diagnosis when it detects that training performance has dropped by more than 20%.

      • Automatically trigger a diagnosis when the time between two steps exceeds 28 minutes. Consequently, if activities other than training take more than 28 minutes between two iterations (for example, saving a checkpoint or running validation), they also trigger a diagnosis.
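      For intuition, the two rules can be pictured as checks over step timestamps. The following is an illustrative sketch only; PerfTracker evaluates these rules internally, and its exact baseline logic may differ:

        import time

        PERF_DROP = 0.20          # rule 1: >20% training-performance drop
        MAX_STEP_GAP_S = 28 * 60  # rule 2: >28 minutes between two steps

        gaps = []                 # observed step-to-step durations
        last_step_at = None

        def on_step():
            """Call once per training step, like my_tracer.step()."""
            global last_step_at
            now = time.monotonic()
            if last_step_at is not None:
                gap = now - last_step_at
                if gap > MAX_STEP_GAP_S:
                    print("trigger: more than 28 minutes between steps")
                if gaps:
                    baseline = sorted(gaps)[len(gaps) // 2]  # median step time
                    # A 20% throughput drop means step time rises to 1/(1-0.2)x.
                    if gap > baseline / (1.0 - PERF_DROP):
                        print("trigger: >20% performance drop vs. baseline")
                gaps.append(gap)
            last_step_at = now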

    • Mark the steps in the training code.

      Using PerfTracker requires step markers in the training code. Each time my_tracer.step() executes, PerfTracker records it; this lets the backend control how many iterations are collected.

      while iteration < args.train_iters:
      
          ...  # training code
          if my_tracer is not None:
              my_tracer.step()  # mark one step

    A minimal training.py that puts the pieces above together (the import and the step marker) looks like this:

    import torch
    import time
    import torch.distributed as dist
    try:
        from c4d_perftracker_collector.PerfTracker import PerfTracker
        my_tracer = PerfTracker()
    except:
        my_tracer = None
    
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    # Check whether CUDA is available
    if torch.cuda.is_available():
        print("CUDA is available!")
        device = torch.device('cuda')  # use the default CUDA device
    else:
        print("CUDA is not available.")
        device = torch.device('cpu')  # fall back to the CPU if CUDA is unavailable
    
    def matmul():
        matrix_a = torch.randn(1000, 1000)
        matrix_b = torch.randn(1000, 1000)
    
        # Move the matrices to the CUDA device
        matrix_a = matrix_a.to(device)
        matrix_b = matrix_b.to(device)
    
        # Run the matrix multiplication
        result = torch.matmul(matrix_a, matrix_b)
        result_cpu = result.to('cpu')
        print(result_cpu)
    
        del matrix_a, matrix_b, result
        torch.cuda.empty_cache()
    
    for i in range(1000):
        matmul()
        time.sleep(dist.get_rank())
        print("Epoch:", i)
        if my_tracer is not None:
            my_tracer.step()  # mark one step
        dist.barrier()
    
    
  3. Upload the prepared training code file (training.py) and the packages (c4d_perftracker_collector-1.4.0-py3-none-any.whl and, optionally, pai_aimaster-1.5.0-cp310-cp310-linux_x86_64.whl) to an Object Storage Service (OSS) bucket.

Create a training job

  1. When you create the training job, enable PerfTracker. The key settings are described below. After you complete the configuration, click OK.

    image

    Environment

    • Node image: select PyTorch 2.0 or later. This example uses easyanimate:1.1.5-pytorch2.2.0-gpu-py310-cu118-ubuntu22.04.

    • Direct mount: click OSS, select the OSS directory that holds the training code file and the packages, and configure the mount path. This example uses /mnt/data/.

    • Startup command:

      # Install PerfTracker
      pip3 install /mnt/data/c4d_perftracker_collector-1.4.0-py3-none-any.whl -i https://mirrors.aliyun.com/pypi/simple/
      # (Optional) Install the AIMaster client (Python 3.10 in this example)
      pip3 install /mnt/data/pai_aimaster-1.5.0-cp310-cp310-linux_x86_64.whl -i https://mirrors.aliyun.com/pypi/simple/
      pip install -U protobuf==3.20.0
      # Run the training code (training.py in this example)
      CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=4 /mnt/data/training.py

      Here /mnt/data/ is the dataset mount path.

      You can also put the commands that download the PerfTracker and AIMaster packages directly into the startup command, but high concurrency can make the download slow; that is why this walkthrough downloads the packages during the preparations.

    Resources

    • Framework: select PyTorch.

    • Job resources: select a resource type with at least 4 GPUs, for example ecs.gn6e-c12g1.12xlarge.

    Fault tolerance and diagnosis

    • Automatic fault tolerance: turn on the switch and set the additional parameter --enable-perftracker=True to enable PerfTracker automatic diagnosis.

  2. While the job is running, click the job name; on the Overview tab, in the Instances section, click Enter Container in the Actions column of the master instance, and run the following commands as needed.

    • Analysis mode: diagnose the cause when job performance falls short of expectations.

      • The following command produces and saves the analysis results without keeping the raw traces:

        c4d_perftracker --trigger-on --auto-analyze --output-dir /path/to/trace

        After the analysis results are saved, you can view the analysis report through local visualization.

      • When storage space (such as CPFS or OSS) is plentiful, the following command is recommended: besides the analysis results, it archives the raw traces so that findings can be verified manually after diagnosis. Note that a single Worker's trace usually exceeds several hundred MB; you can delete the traces manually once the diagnosis is done.

        c4d_perftracker --trigger-on --auto-analyze --output-dir /path/to/trace --save-raw-trace all

        Here /path/to/trace is the directory where the raw traces are saved; if omitted, it defaults to /tmp/perftracker/output. You can set it to the job's dataset mount directory (for example /mnt/data/) so that the raw traces are saved into the dataset; the .json files can be visualized on the Perfetto page, and the .nsys-rep files can be visualized with the free Nsight Systems tool.

    • Snapshot mode

      Snapshot mode records which function (including the full call stack) every Worker is executing at one instant, automatically finds the few Workers that differ most from the majority, and prints what they are doing at that moment. It is mostly used to analyze hangs.

      c4d_perftracker --trigger-on --auto-analyze --hang

    Beyond this, PerfTracker offers a rich set of parameters and interactive queries over the diagnosis results, which help locate the root cause of performance problems.

View the analysis results

AIMaster

If the optional AIMaster pieces from the preparations were downloaded and installed, the analysis results are written to the aimaster instance log. View the log as shown below.

image

PerfTracker outputs a performance report for every function that has a noticeable impact on job performance and indicates whether there is a performance anomaly. The functions are grouped into the following categories:

GPU compute functions

GPU Compute:
[2025-03-04 06:04:00,046 PerfTracker] (compute_functions.py 131) INFO: {
    "min/median/max GPU utilization (in [0,1])": [
        0.27586059769318555,
        0.28605496203987174,
        0.2945494558115959
    ],
    "workers with abnormal GPU utilization": {},
    "major_kernel_executions": {
        "void multi_tensor_apply_kernel<TensorListMetadata<4>, AdamFunctor<float, float, int>, float, float, float, float, float, float, adamMode_t, float>(long, int volatile*, TensorListMetadata<4>, AdamFunctor<float, float, int>, float, float, float, float, float, float, adamMode_t, float)320_1_1|512_1_1": {
            "median cost per execution (ms)": 403.7,
            "bottleneck ratio (in [0,1])": 0.01608086667957405
        },
        "sm80_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize160x128x32_stage4_warpsize2x2x1_tensor16x8x16_kernel7_16_1|128_1_1": {
            "median cost per execution (ms)": 130.0,
            "bottleneck ratio (in [0,1])": 0.015779752711771233
        },
        "ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nt16_32_1|128_1_1": {
            "median cost per execution (ms)": 132.60000000000002,
            "bottleneck ratio (in [0,1])": 0.013880912782219888
        },
        "void (anonymous namespace)::indexing_backward_kernel<c10::Half, 4>(long const*, long const*, c10::Half const*, c10::Half*, long, long, long, long, bool)256_16_1|32_4_1": {
            "median cost per execution (ms)": 1202.25,
            "bottleneck ratio (in [0,1])": 0.012148757934008617
        },
        "ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nt16_24_1|128_1_1": {
            "median cost per execution (ms)": 105.6,
            "bottleneck ratio (in [0,1])": 0.005656117080836238
        }
    },
    "workers with potential GPU issues": [],
    "detailed report": {}
}

Reading the report:

  • "min/median/max GPU utilization (in [0,1])" shows that across all Workers in this job, GPU utilization is at most about 29.5%, at least about 27.6%, with a median of 28.6%.

  • "workers with abnormal GPU utilization" is empty, meaning no Worker's GPU utilization is a significant outlier (if it were non-empty, it would list the outlier Worker IDs and their GPU utilization).

  • "major_kernel_executions" lists the GPU kernels with the largest total time, including the median per-execution cost ("median cost per execution") and the share of end-to-end time each accounts for ("bottleneck ratio").

  • "workers with potential GPU issues" lists the Workers whose GPU kernels execute slowly; an empty list means all Workers are normal.

  • "detailed report" is populated when "workers with potential GPU issues" is non-empty: it shows which kernel on which Worker is slower than on the normal Workers, and by how much.

GPU memory operation functions

GPU memory operations:
[2025-03-04 06:04:00,048 PerfTracker] (gpu_mem.py 37) INFO: {
    "Memcpy DtoD (Device -> Device)": {
        "avg bottleneck ratio (in [0,1])": 0.010486858246092,
        "abnormal_workers": {
            "job_x08j11173.cloud.sqa.na131_2_122482.json": 0.010614755325049817,
            "job_x08j11173.cloud.sqa.na131_8_122483.json": 0.0105935370201344,
            "job_x08j11173.cloud.sqa.na131_1_122484.json": 0.010571838319204434,
            "job_x08j11173.cloud.sqa.na131_0_122485.json": 0.010551186610995748,
            "job_x08j11173.cloud.sqa.na131_2_122487.json": 0.010408514784026183,
            "job_x08j11173.cloud.sqa.na131_5_122489.json": 0.010394903160689894,
            "job_x08j11173.cloud.sqa.na131_8_122486.json": 0.010387693451926115,
            "job_x08j11173.cloud.sqa.na131_9_122488.json": 0.010372437296709398
        }
    }
}

Reading the report:

  • "avg bottleneck ratio (in [0,1])" shows that during the monitoring window, Memcpy DtoD took about 1.048% of the time on average.

  • "abnormal_workers" shows that the Memcpy DtoD operation took abnormally long on 8 Workers. For GPU memory operations, a bottleneck ratio (the run time minus the portion that overlaps compute) above 0.01 (1%) is considered abnormal; the sketch below shows how such a ratio can be computed.

Collective communication

Communication:
{
    "nvlink ring send": {
        "ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
            "example_of_normal_worker": {
                "worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
                "different from other workers": 0,
                "features": {
                    "bottleneck ratio (in [0,1])": 0.2743985289797289,
                    "avg throughput (%)": 73.75921390374332,
                    "throughput std (%)": 11.384679144385027
                }
            },
            "abnormal_workers": []
        }
    },
    "nvlink ring recv": {
        "ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
            "example_of_normal_worker": {
                "worker": "job_x08j11173.cloud.sqa.na131_3_66933.json",
                "different from other workers": 2,
                "features": {
                    "bottleneck ratio (in [0,1])": 0.27346865947352955,
                    "avg throughput (%)": 72.70337362637363,
                    "throughput std (%)": 12.658093406593407
                }
            },
            "abnormal_workers": []
        }
    },
    "pcie sendrecv send": {
        "ncclDevKernel_SendRecv(ncclDevComm*, unsigned long, ncclWork*)": {
            "example_of_normal_worker": {
                "worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
                "different from other workers": 3,
                "features": {
                    "bottleneck ratio (in [0,1])": 0.07248997985478658,
                    "avg throughput (%)": 46.667,
                    "throughput std (%)": 14.636000000000001
                }
            },
            "abnormal_workers": []
        }
    },
    "pcie sendrecv recv": {
        "ncclDevKernel_SendRecv(ncclDevComm*, unsigned long, ncclWork*)": {
            "example_of_normal_worker": {
                "worker": "job_x08j11173.cloud.sqa.na131_7_66936.json",
                "different from other workers": 1,
                "features": {
                    "bottleneck ratio (in [0,1])": 0.0643436909425455,
                    "avg throughput (%)": 54.833333333333336,
                    "throughput std (%)": 14.166666666666666
                }
            },
            "abnormal_workers": []
        }
    },
    "pcie ring send": {
        "ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
            "example_of_normal_worker": {
                "worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
                "different from other workers": 0,
                "features": {
                    "bottleneck ratio (in [0,1])": 0.2743985289797289,
                    "avg throughput (%)": 41.36698734177215,
                    "throughput std (%)": 14.653768987341774
                }
            },
            "abnormal_workers": []
        }
    },
    "pcie ring recv": {
        "ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
            "example_of_normal_worker": {
                "worker": "job_x08j11173.cloud.sqa.na131_0_66930.json",
                "different from other workers": 0,
                "features": {
                    "bottleneck ratio (in [0,1])": 0.2743985289797289,
                    "avg throughput (%)": 41.5311475409836,
                    "throughput std (%)": 15.282721311475411
                }
            },
            "abnormal_workers": []
        }
    }
}

This report groups the collective-communication functions by communication type and then reports the performance of each function, where:

  • "example_of_normal_worker" lists the function's typical performance profile, including the bottleneck ratio (its share of end-to-end time, with the portion that overlaps compute removed), avg throughput, and throughput std.

  • "abnormal_workers", if non-empty, lists every Worker on which this communication function performed abnormally, together with its metrics. The sketch below illustrates one way to read the "different from other workers" field.

CUDA runtime

CUDA Runtime:
[2025-03-04 06:04:00,047 PerfTracker] (cuda_runtimes.py 43) INFO: {
    "cudaLaunchKernel": {
        "avg bottleneck ratio (in [0,1])": 0.039727736621541394,
        "avg execution time / monitoring duration (in [0,1])": 0.06956947111288565,
        "abnormal_workers": {
            "job_x08j11173.cloud.sqa.na131_5_122489.json": 0.05342638907019616,
            "job_x08j11173.cloud.sqa.na131_8_122483.json": 0.05125160206973098,
            "job_x08j11173.cloud.sqa.na131_2_122487.json": 0.04770049253555521,
            "job_x08j11173.cloud.sqa.na131_8_122486.json": 0.04358845044879828,
            "job_x08j11173.cloud.sqa.na131_0_122485.json": 0.042635952262081556,
            "job_x08j11173.cloud.sqa.na131_9_122488.json": 0.0354174573296689,
            "job_x08j11173.cloud.sqa.na131_1_122484.json": 0.023585242093250733,
            "job_x08j11173.cloud.sqa.na131_2_122482.json": 0.02021630716304934
        }
    }
}

Reading the report:

  • "avg bottleneck ratio (in [0,1])" shows that during monitoring, cudaLaunchKernel took about 3.97% of the time on average (with the portion that overlaps compute excluded).

  • "avg execution time / monitoring duration (in [0,1])" shows that cudaLaunchKernel took about 6.95% of the time on average (without excluding the overlap with compute).

  • "abnormal_workers" shows that cudaLaunchKernel took abnormally long on 8 Workers. For CUDA runtime functions, a bottleneck ratio (the run time minus the portion that overlaps compute) above 0.01 (1%) is considered abnormal.

Python functions

Python functions:
[2025-03-04 06:04:00,048 PerfTracker] (python_functions.py 43) INFO: {
    "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: backward_step|megatron/core/pipeline_parallel/schedules.py: custom_backward|<built-in method run_backward of torch._C._EngineBase object at 0x>": {
        "job_x08j11173.cloud.sqa.na131_2_122487.json": 0.16970858578301054,
        "job_x08j11173.cloud.sqa.na131_5_122489.json": 0.16821543761561655,
        "job_x08j11173.cloud.sqa.na131_0_122485.json": 0.16787961852913025,
        "job_x08j11173.cloud.sqa.na131_8_122483.json": 0.16769273336153187,
        "job_x08j11173.cloud.sqa.na131_8_122486.json": 0.14482595694389258,
        "job_x08j11173.cloud.sqa.na131_9_122488.json": 0.10359829140378449,
        "job_x08j11173.cloud.sqa.na131_1_122484.json": 0.06543764774209325,
        "job_x08j11173.cloud.sqa.na131_2_122482.json": 0.06217541348063737
    },
    "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|nn.Module: DistributedDataParallel_0|torch/nn/modules/module.py: _call_impl|megatron/core/distributed/distributed_data_parallel.py: forward|nn.Module: Float16Module_0|torch/nn/modules/module.py: _call_impl|megatron/model/module.py: forward|nn.Module: GPTModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/gpt_model.py: forward|nn.Module: TransformerLanguageModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/language_model.py: forward|nn.Module: ParallelTransformer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward": {
        "job_x08j11173.cloud.sqa.na131_9_122488.json": 0.02471835416438489,
        "job_x08j11173.cloud.sqa.na131_0_122485.json": 0.02022024568555683,
        "job_x08j11173.cloud.sqa.na131_2_122482.json": 0.015394834126935101,
        "job_x08j11173.cloud.sqa.na131_2_122487.json": 0.011625367332189284
    },
    "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step": {
        "job_x08j11173.cloud.sqa.na131_0_122485.json": 0.012272193902698852
    },
    "autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward|LinearWithGradAccumulationAndAsyncCommunicationBackward|torch/autograd/function.py: apply|torch/cuda/amp/autocast_mode.py: decorate_bwd|megatron/core/tensor_parallel/layers.py: backward|<built-in method matmul of Tensor object at 0x>|aten::matmul|aten::mm": {
        "job_x08j11173.cloud.sqa.na131_2_122487.json": 0.014066713574814782,
        "job_x08j11173.cloud.sqa.na131_0_122485.json": 0.013168949365116213,
        "job_x08j11173.cloud.sqa.na131_8_122483.json": 0.013000378454189552,
        "job_x08j11173.cloud.sqa.na131_5_122489.json": 0.012500119397472594,
        "job_x08j11173.cloud.sqa.na131_8_122486.json": 0.012470581043494208
    },
    "autograd::engine::evaluate_function: FastLayerNormFNBackward|FastLayerNormFNBackward|torch/autograd/function.py: apply|apex/contrib/layer_norm/layer_norm.py: backward|<built-in method ln_bwd of PyCapsule object at 0x>": {
        "job_x08j11173.cloud.sqa.na131_0_122485.json": 0.010127612754279463
    },
    "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving": {
        "job_x08j11173.cloud.sqa.na131_2_122487.json": 0.01041679269251709
    },
    "autograd::engine::evaluate_function: torch::autograd::AccumulateGrad": {
        "job_x08j11173.cloud.sqa.na131_8_122486.json": 0.013633967050768714
    }
}

This report lists every Python function whose execution-time share exceeds 1% (after removing the time that overlaps GPU compute, communication, and so on), clustered by function name. Under each function it lists every Worker on which that share exceeds 1%, together with the function's execution-time share on each of those Workers.

In addition, PerfTracker outputs the detailed per-function performance metrics of the rank0 Worker as an example of which functions ran in the job and which performance features PerfTracker collected, so that you can query further details (see "Interactive queries over performance reports" in the appendix). A sketch for summarizing this structure follows the example.

{
    "pattern": {
        "GPU memory operation": {
            "details": {
                "Memcpy HtoD (Pageable -> Device)": {
                    "bottleneck": 1.8938920771644163e-05
                },
                "Memcpy DtoH (Device -> Pageable)": {
                    "bottleneck": 7.615863884767547e-05
                },
                "Memcpy DtoD (Device -> Device)": {
                    "bottleneck": 0.007481276660560544
                },
                "Memset (Device)": {
                    "bottleneck": 0.0005826740305488821
                }
            },
            "summary": {
                "bottleneck": 0.008159048250728745
            }
        },
        "communication": {
            "details": {
                "nvlink sendrecv send": {},
                "nvlink sendrecv recv": {},
                "nvlink ring send": {
                    "ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
                        "bottleneck": 0.3221236413339125,
                        "throughput_avg": 72.84106601466992,
                        "throughput_std": 11.46433251833741,
                        "comm_duration": 409
                    }
                },
                "nvlink ring recv": {
                    "ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
                        "bottleneck": 0.3221236413339125,
                        "throughput_avg": 72.09130120481929,
                        "throughput_std": 14.103665060240965,
                        "comm_duration": 415
                    }
                },
                "pcie sendrecv send": {},
                "pcie sendrecv recv": {},
                "pcie ring send": {
                    "ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
                        "bottleneck": 0.3221236413339125,
                        "throughput_avg": 43.22107518796992,
                        "throughput_std": 15.588227067669173,
                        "comm_duration": 665
                    }
                },
                "pcie ring recv": {
                    "ncclDevKernel_AllReduce_Sum_f16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)": {
                        "bottleneck": 0.3221236413339125,
                        "throughput_avg": 42.52839670658683,
                        "throughput_std": 17.446902694610777,
                        "comm_duration": 668
                    }
                },
                "others": {}
            },
            "summary": {
                "bottleneck": 0.6444608492212074
            }
        },
        "Python_function": {
            "thread71813": {
                "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: backward_step|megatron/core/pipeline_parallel/schedules.py: custom_backward|<built-in method run_backward of torch._C._EngineBase object at 0x>": {
                    "bottleneck ratio (in [0,1])": 0.04466643725052002,
                    "execution time / monitoring duration (in [0,1])": 0.27638775949343625
                },
                "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|nn.Module: DistributedDataParallel_0|torch/nn/modules/module.py: _call_impl|megatron/core/distributed/distributed_data_parallel.py: forward|nn.Module: Float16Module_0|torch/nn/modules/module.py: _call_impl|megatron/model/module.py: forward|nn.Module: GPTModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/gpt_model.py: forward|nn.Module: TransformerLanguageModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/language_model.py: forward|nn.Module: ParallelTransformer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward": {
                    "bottleneck ratio (in [0,1])": 0.023676874610643994,
                    "execution time / monitoring duration (in [0,1])": 0.17463498252380863
                },
                "megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|nn.Module: DistributedDataParallel_0|torch/nn/modules/module.py: _call_impl|megatron/core/distributed/distributed_data_parallel.py: forward|nn.Module: Float16Module_0|torch/nn/modules/module.py: _call_impl|megatron/model/module.py: forward|nn.Module: GPTModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/gpt_model.py: forward|nn.Module: TransformerLanguageModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/language_model.py: forward|nn.Module: ParallelTransformer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward|nn.Module: ParallelTransformerLayer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward": {
                    "bottleneck ratio (in [0,1])": 0.007685575231437642,
                    "execution time / monitoring duration (in [0,1])": 0.01774576876303058
                },
                "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step": {
                    "bottleneck ratio (in [0,1])": 0.007029966208129874,
                    "execution time / monitoring duration (in [0,1])": 0.48153616407069133
                },
                "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step": {
                    "bottleneck ratio (in [0,1])": 0.00670236317435654,
                    "execution time / monitoring duration (in [0,1])": 0.010474029098241421
                },
                "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|torch/utils/_contextlib.py: decorate_context|megatron/optimizer/optimizer.py: step|megatron/optimizer/optimizer.py: clip_grad_norm": {
                    "bottleneck ratio (in [0,1])": 0.0061998773402663215,
                    "execution time / monitoring duration (in [0,1])": 0.01125777805357861
                },
                "megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|nn.Module: DistributedDataParallel_0|torch/nn/modules/module.py: _call_impl|megatron/core/distributed/distributed_data_parallel.py: forward|nn.Module: Float16Module_0|torch/nn/modules/module.py: _call_impl|megatron/model/module.py: forward|nn.Module: GPTModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/gpt_model.py: forward|nn.Module: TransformerLanguageModel_0|torch/nn/modules/module.py: _call_impl|megatron/model/language_model.py: forward|nn.Module: ParallelTransformer_0|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward|nn.Module: ParallelTransformerLayer_1|torch/nn/modules/module.py: _call_impl|megatron/model/transformer.py: forward": {
                    "bottleneck ratio (in [0,1])": 0.005476168793332855,
                    "execution time / monitoring duration (in [0,1])": 0.017565244582058315
                },
                "pretrain_gpt.py: <module>|megatron/training.py: pretrain|megatron/training.py: train|megatron/training.py: train_step|megatron/core/pipeline_parallel/schedules.py: forward_backward_pipelining_without_interleaving|megatron/core/pipeline_parallel/schedules.py: forward_step|pretrain_gpt.py: forward_step|pretrain_gpt.py: get_batch|megatron/core/tensor_parallel/data.py: broadcast_data": {
                    "bottleneck ratio (in [0,1])": 0.005102225847033157,
                    "execution time / monitoring duration (in [0,1])": 0.005930299936091217
                }
            },
            "thread74140": {
                "autograd::engine::evaluate_function: torch::autograd::AccumulateGrad": {
                    "bottleneck ratio (in [0,1])": 0.00576267033947411,
                    "execution time / monitoring duration (in [0,1])": 0.028585681692349724
                },
                "autograd::engine::evaluate_function: LinearWithGradAccumulationAndAsyncCommunicationBackward|LinearWithGradAccumulationAndAsyncCommunicationBackward|torch/autograd/function.py: apply|torch/cuda/amp/autocast_mode.py: decorate_bwd|megatron/core/tensor_parallel/layers.py: backward|<built-in method matmul of Tensor object at 0x>": {
                    "bottleneck ratio (in [0,1])": 0.005251722434400817,
                    "execution time / monitoring duration (in [0,1])": 0.01923912281366278
                },
                "autograd::engine::evaluate_function: GeLUFunctionBackward": {
                    "bottleneck ratio (in [0,1])": 0.005025261296663285,
                    "execution time / monitoring duration (in [0,1])": 0.04228174505633724
                }
            }
        },
        "GPU Compute": {
            "summary": {
                "bottleneck": 0.22080323589594392
            },
            "details": {
                "sm80_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize160x128x32_stage4_warpsize2x2x1_tensor16x8x16_kernel7_16_1|128_1_1": {
                    "avg_cost": 129.8,
                    "bottleneck": 0.011231585929107188
                },
                "ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nt16_32_1|128_1_1": {
                    "avg_cost": 133.4,
                    "bottleneck": 0.010950725763621315
                },
                "ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nt16_24_1|128_1_1": {
                    "avg_cost": 109.5,
                    "bottleneck": 0.007726273763308622
                },
                "void at::native::vectorized_elementwise_kernel<4, at::native::CUDAFunctor_add<c10::Half>, at::detail::Array<char*, 3> >(int, at::native::CUDAFunctor_add<c10::Half>, at::detail::Array<char*, 3>)100864_1_1|128_1_1": {
                    "avg_cost": 176.0,
                    "bottleneck": 0.0012749520281166412
                },
                "void at::native::(anonymous namespace)::multi_tensor_apply_kernel<at::native::(anonymous namespace)::TensorListMetadata<1>, at::native::(anonymous namespace)::UnaryOpFunctor<float, 1, 1, 0>, at::native::_amp_foreach_non_finite_check_and_unscale_cuda_(c10::ArrayRef<at::Tensor>, at::Tensor&, at::Tensor const&)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(float)#1}>(at::native::(anonymous namespace)::TensorListMetadata<1>, at::native::(anonymous namespace)::UnaryOpFunctor<float, 1, 1, 0>, at::native::_amp_foreach_non_finite_check_and_unscale_cuda_(c10::ArrayRef<at::Tensor>, at::Tensor&, at::Tensor const&)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(float)#1})320_1_1|512_1_1": {
                    "avg_cost": 116.4,
                    "bottleneck": 0.006687856766956983
                },
                "void multi_tensor_apply_kernel<TensorListMetadata<4>, AdamFunctor<float, float, int>, float, float, float, float, float, float, adamMode_t, float>(long, int volatile*, TensorListMetadata<4>, AdamFunctor<float, float, int>, float, float, float, float, float, float, adamMode_t, float)320_1_1|512_1_1": {
                    "avg_cost": 407.1,
                    "bottleneck": 0.022835502981469677
                }
            }
        },
        "cuda_runtime": {
            "details": {
                "cudaLaunchKernel": {
                    "bottleneck": 0.011437093367267582,
                    "proportion": 0.047973495181857964
                },
                "cudaEventQuery": {
                    "bottleneck": 0.0003695104329276106,
                    "proportion": 0.0016428506379998565
                },
                "cudaMemcpyAsync": {
                    "bottleneck": 0.0031047741392663465,
                    "proportion": 0.1264205197968136
                },
                "cudaStreamSynchronize": {
                    "bottleneck": 0.0002381468548093979,
                    "proportion": 0.00025708577558104204
                },
                "cudaStreamIsCapturing": {
                    "bottleneck": 8.180001950305883e-05,
                    "proportion": 0.00041746216849836924
                },
                "cudaEventRecord": {
                    "bottleneck": 0.0002449971027480777,
                    "proportion": 0.0009364691887936391
                },
                "cudaStreamWaitEvent": {
                    "bottleneck": 0.00022323749164874186,
                    "proportion": 0.0010831450858336067
                },
                "cudaStreamGetCaptureInfo_v2": {
                    "bottleneck": 3.7877841543288325e-05,
                    "proportion": 6.326405449251348e-05
                },
                "cudaLaunchKernelExC": {
                    "bottleneck": 0.0010468790673347137,
                    "proportion": 0.002914981975788806
                },
                "cudaPointerGetAttributes": {
                    "bottleneck": 1.1685717071865549e-05,
                    "proportion": 1.6521186205051293e-05
                },
                "cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags": {
                    "bottleneck": 6.165223144811824e-05,
                    "proportion": 0.00035742176009464626
                },
                "cudaOccupancyMaxActiveBlocksPerMultiprocessor": {
                    "bottleneck": 0.00017649462336127966,
                    "proportion": 0.0006084631992592061
                },
                "cudaFuncSetAttribute": {
                    "bottleneck": 0.0001652118620505129,
                    "proportion": 0.0003554069812891522
                },
                "cudaMemsetAsync": {
                    "bottleneck": 0.0006604444924409529,
                    "proportion": 0.0027936922916980637
                },
                "cudaDeviceSynchronize": {
                    "bottleneck": 0.0009050386394279318,
                    "proportion": 0.3238394269646713
                },
                "cudaPeekAtLastError": {
                    "bottleneck": 0.0,
                    "proportion": 0.0
                },
                "cudaDeviceGetAttribute": {
                    "bottleneck": 1.611823044395248e-06,
                    "proportion": 2.01477880549406e-06
                }
            },
            "summary": {
                "bottleneck": 0.018766455705893872
            }
        }
    }
}
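To get a quick overview from this rank0 detail, you can walk the pattern dict and compare each category's summary bottleneck. A minimal sketch, with the dict trimmed to the summary fields of the report above:

    # Trimmed to the "summary" fields of the rank0 detail report above.
    detail = {
        "pattern": {
            "GPU memory operation": {"summary": {"bottleneck": 0.00816}},
            "communication": {"summary": {"bottleneck": 0.64446}},
            "GPU Compute": {"summary": {"bottleneck": 0.22080}},
            "cuda_runtime": {"summary": {"bottleneck": 0.01877}},
        }
    }

    # Print the categories ranked by their share of end-to-end time. Categories
    # without a summary (such as Python_function in the full report) are skipped.
    ranked = sorted(detail["pattern"].items(),
                    key=lambda kv: kv[1].get("summary", {}).get("bottleneck", 0.0),
                    reverse=True)
    for category, body in ranked:
        bottleneck = body.get("summary", {}).get("bottleneck")
        if bottleneck is not None:
            print(f"{bottleneck:7.1%}  {category}")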

Local visualization

  1. Generate and save the analysis results into the directory given by --output-dir, as described in Analysis mode.

    After the command succeeds, a <timestamp>/PerfDisplay folder is created in that directory.

  2. In the container, copy the PerfDisplay folder into the data source mount directory /mnt/data, and download the PerfDisplay directory to your local machine, for example with the ossutil 2.0 command-line tool.

  3. In a local terminal, enter the PerfDisplay folder and run sudo python3 app.py (on Linux, sudo may not be necessary); then open http://127.0.0.1:5000/ in a browser to view the job performance report on the visualization page, as shown below:

    image

Appendix: more PerfTracker parameters

PerfTracker usage: collection mode

  • Parameters

    • --wait: the number of steps to wait before collection starts; if omitted, collection starts from the next step.

    • --steps: the number of steps of data to collect; the default is 2, and 2-3 is recommended.

    • --output-dir: the default directory where traces are stored; the default is /tmp/perftracker/rawtrace.

  • Examples

    The following examples enable PerfTracker collection mode; once it is enabled, the following message is printed:

    image.png

    • With the default settings, PerfTracker collects 2 steps and, when collection finishes, saves the traces under /tmp/perftracker/output:

      c4d_perftracker --trigger-on
    • With --steps set to 10, PerfTracker collects 10 steps and saves the traces under /tmp/perftracker/output:

      c4d_perftracker --trigger-on --steps 10
    • With --wait set to 10, PerfTracker waits 10 steps before starting collection, then saves the traces under /tmp/perftracker/output:

      c4d_perftracker --trigger-on --wait 10
    • With --output-dir, PerfTracker saves the collected data to the directory you specify:

      c4d_perftracker --trigger-on --output-dir /path/to/trace
    • A combined example:

      c4d_perftracker --trigger-on --wait 3 --steps 5 --output-dir /path/to/trace
  • Output example

    The output files are JSON, one per worker; each file typically ranges from a few hundred MB to 2-3 GB, depending on the monitoring duration and job complexity. They can be visualized at https://ui.perfetto.dev/, as shown below; a summarizing sketch follows the figure.

    image
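    Since these per-worker .json files open in Perfetto, they can be treated as Chrome trace event files (an assumption based on the Perfetto compatibility noted above). A minimal sketch that totals the duration per event name from one such file; the path is a placeholder:

      import json
      from collections import Counter

      # Summarize one per-worker trace, assuming the Chrome trace event format
      # that Perfetto reads: complete events ("ph" == "X") carry "dur" in microseconds.
      with open("/tmp/perftracker/output/worker0.json") as f:  # placeholder path
          trace = json.load(f)

      events = trace["traceEvents"] if isinstance(trace, dict) else trace
      totals = Counter()
      for ev in events:
          if ev.get("ph") == "X" and "dur" in ev:
              totals[ev.get("name", "?")] += ev["dur"]

      # Top 10 event names by total duration.
      for name, dur in totals.most_common(10):
          print(f"{dur / 1e6:10.3f} s  {name[:80]}")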

PerfTracker usage: analysis mode

  • Parameters

    • --steps: the number of steps of data to collect; the default is 2, and 2-3 is recommended.

    • --wait: the number of steps to wait before collection starts; the default is 5.

    • --output-dir: the directory where the visualization page and trace files are saved; the default is /tmp/perftracker/output. If this parameter is omitted, the raw trace files are deleted after the analysis report is produced.

    • --save-raw-trace: whether to save the workers' raw traces (one file per worker, from a few hundred MB to 2-3 GB each); valid values are none (save nothing), master (save only the rank0 node's trace), and all (save everything). If omitted, the default is none.

  • Examples

    • With all defaults, PerfTracker, after receiving the command, waits 5 steps, collects 2 steps of monitoring traces, and deletes the raw traces after generating the performance report:

      c4d_perftracker --trigger-on --auto-analyze
    • With --wait specified, PerfTracker waits the given number of steps before starting a collection round:

      c4d_perftracker --trigger-on --auto-analyze --wait 10
    • With --steps set to 3, PerfTracker collects 3 steps of data:

      c4d_perftracker --trigger-on --auto-analyze --steps 3
    • If --save-raw-trace is present, the traces are saved automatically, by default under /tmp/perftracker/output:

      c4d_perftracker --trigger-on --auto-analyze --save-raw-trace all
    • Save the master's trace automatically to the path you specify, /path/to/trace:

      c4d_perftracker --trigger-on --auto-analyze --save-raw-trace master --output-dir /path/to/trace
    • If --save-raw-trace is all, the traces of all workers are saved:

      c4d_perftracker --trigger-on --auto-analyze --output-dir /path/to/trace --save-raw-trace all
    • A combined example:

      c4d_perftracker --trigger-on --auto-analyze --wait 5 --steps 10 --output-dir /path/to/trace --save-raw-trace all

      With this command, PerfTracker waits 5 steps, collects the next 10 steps of data, saves the traces of all workers under /path/to/trace, and shows the trace analysis report in the AIMaster container.

  • Interactive queries over performance reports

    To make more detailed information easy to retrieve, PerfTracker provides interactive commands that you run on the master-0 node. The supported commands are listed below.

    • List the existing trace batches

      c4d_perftracker --ls

      This command shows how many performance analysis reports have been produced so far. Example output:

      image.png

    • Query a performance analysis report over rank intervals; the result is shown in the aimaster node log.

      c4d_perftracker --ana-cmd "--analyzers performance_diagnosis --ranks <ranks>" --collection-id <id>

      Parameters:

      | Parameter | Description |
      | --- | --- |
      | --ranks | The rank intervals to analyze; multiple intervals are allowed. The default is all ranks. |
      | --collection-id | The ID of the trace batch to analyze; if omitted, the latest batch is used. |

      The following example runs an interval analysis on ranks 1, 2, 4, and 5 of the latest collection batch (the default):

      c4d_perftracker --ana-cmd "--analyzers performance_diagnosis --ranks [1,2] [4,5]"
    • Show a function's performance features Worker by Worker, with support for selecting a subset of Workers and sorting by an arbitrary key; the result is shown in the aimaster node log. For the available functions and keys, see the rank0 example that accompanies the AIMaster output.

      c4d_perftracker --ana-cmd "--analyzers get_details --ranks <ranks> --target-key <key>" --collection-id <id>

      Parameters:

      | Parameter | Description |
      | --- | --- |
      | --target-key | The key to sort by. It can be a top-level category key, such as GPU compute or Python_function, or a single key from a subcategory, such as ncclDevKernel_ReduceScatter_Sum_bf16_RING_LL(ncclDevComm*, unsigned long, ncclWork*) or Memcpy DtoH (Device -> Pinned). When sorting within a subcategory, the key can be a substring of a function or event name; if the substring matches several Python function names, sorting follows the first match. |
      | --ranks | The rank intervals to show; multiple intervals are allowed. The default is all ranks. |
      | --collection-id | The ID of the trace batch to analyze; if omitted, the latest batch is used. |
      The following example shows the reduce-scatter performance on ranks 0, 1, 5, 7, and 8 of the first collection batch, sorted by the default key:

      c4d_perftracker --ana-cmd "--analyzers get_details --ranks [0,1] 5 [7,8] --target-key ncclDevKernel_ReduceScatter_Sum_bf16_RING_LL(ncclDevComm*, unsigned long, ncclWork*)" --collection-id 1

      The following example shows the Python function megatron/model/transformer.py: forward on ranks 0, 1, 2, 3, 8, 9, 10, and 11 of the first collection batch, sorted by the default key:

      c4d_perftracker --ana-cmd "--analyzers get_details --ranks [0,3] [8,11] --target-key megatron/model/transformer.py: forward" --collection-id 1

PerfTracker usage: snapshot mode

  • Example

    Snapshot mode takes no parameters; the following is the only way to invoke it:

    c4d_perftracker --trigger-on --auto-analyze --hang
  • Output example

    The output has a single item, Top 10 unusual stacks. It shows up to 10 call stacks that differ most from the other workers, together with the number of workers that share each stack and their identifiers.

    In the following example, job_x08j11***.cloud.sqa.na131_2_5364.json is the only worker with this call stack, making it the most unusual worker (a frequency-count sketch of this ranking idea follows the example):

    {
        "Top 10 unusual stacks": [
            [
                {
                    "5364": [
                        "    (Python) File \"/workspace/Megatron-LM-core_v0.4.0/pretrain_gpt.py\", line 248, in <module>: pretrain(train_valid_test_datasets_provider,",
                        "    (Python) File \"/workspace/Megatron-LM-core_v0.4.0/megatron/training.py\", line 167, in pretrain: iteration = train(forward_step_func,",
                        "    (Python) File \"/workspace/Megatron-LM-core_v0.4.0/megatron/training.py\", line 772, in train: time.sleep(10000)"
                    ]
                },
                1,
                [
                    "job_x08j11173.cloud.sqa.na131_2_5364.json"
                ]
            ]
        ]
    }
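    The ranking idea behind this output can be reproduced with a simple frequency count: group workers by identical stacks and rank the rarest groups first. A sketch with hypothetical stack data (PerfTracker's actual comparison may be more involved):

      from collections import defaultdict

      # Hypothetical snapshot: worker name -> call stack (outermost frame first).
      snapshot = {
          "worker_0": ("pretrain(...)", "train(...)", "train_step(...)"),
          "worker_1": ("pretrain(...)", "train(...)", "train_step(...)"),
          "worker_2": ("pretrain(...)", "train(...)", "time.sleep(10000)"),
          "worker_3": ("pretrain(...)", "train(...)", "train_step(...)"),
      }

      # Group workers that share an identical stack, then rank rare stacks first.
      groups = defaultdict(list)
      for worker, stack in snapshot.items():
          groups[stack].append(worker)

      for stack, members in sorted(groups.items(), key=lambda kv: len(kv[1]))[:10]:
          print(f"{len(members)} worker(s): {members}")
          for frame in stack:
              print("   ", frame)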