Use AI Profiling in the console

更新时间:
复制 MD 格式

The rise of large language models (LLMs) has increased demand for fine-grained performance monitoring and tuning of AI training and inference workloads. Many GPU-based workloads require online performance analysis of GPU containers. AI Profiling is a non-intrusive performance analysis tool built on eBPF and dynamic process injection. It is designed natively for Kubernetes container environments and supports online monitoring of container processes running GPU workloads. It collects data across multiple dimensions and enables you to start or stop data collection dynamically on live GPU workloads. For production services, a profiling tool that can attach and detach dynamically enables real-time, detailed analysis without modifying application code. This topic explains how to trigger AI Profiling from the console and view the results.

Procedure

Important

AI Profiling in the console uses a whitelist mechanism. Existing user UIDs are automatically added to the whitelist. New users must submit a ticket to join the whitelist before using AI Profiling.

Step 1: Start diagnosis

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Inspections and Diagnostics > Diagnostics.

  3. On the Fault and Performance Diagnosis page, click AI Profiling to open the AI Profiling page.

  4. Click Diagnose to open the AI Profiling Configuration page. Complete the configuration as prompted. The available check items are listed in the table below.

    Collection item

    Description

    Supported environment

    Example chart

    gpu

    GPU kernel information

    • CUDA version: 12.0 to 12.8

    torch

    Inject Torch Profiler to collect Torch-related information

    • Torch version: >= 2.1.0

    • Python version: 3.9 to 3.12

    pystack

    Python method call information

    • Python version: 3.9 to 3.12

    image.png

    snapshot

    GPU memory allocation, fragmentation, and allocation stacks

    • Torch version: >= 2.1.0

    • Python version: 3.9 to 3.12

    rdma

    RDMA monitoring information

    • An RDMA network interface controller (NIC) must be present

    image.png

    dcgm

    DCGM monitoring information

    nvtx

    Custom NVTX markers

    • Torch version: >= 2.1.0

    • NVTX markers are added in your code

    • Python version: 3.9 to 3.12

    N/A

    flops

    FLOPs information in Torch

    • Torch version: >= 2.1.0

    • Python version: 3.9 to 3.12

    N/A

    shapes

    FLOPs information in Torch

    • Torch version: >= 2.1.0

    • Python version: 3.9 to 3.12

    N/A

    net

    TCP send and receive network packet information

    • Kernel version >= 5.10

    N/A

  5. After you configure the parameters for Profiling configuration according to the table below, click Start Diagnosis.

    Configuration item

    Description

    Example value

    Profiling Configuration

    Check items

    Select one or more check items. See the descriptions in step 4 above.

    gpu, torch, pystack, snapshot, rdma, dcgm, nvtx, flops, shapes, net

    Duration

    Keep the duration under 30 seconds. AI Profiling collects data in real time. Long durations may increase memory and disk usage.

    5

Step 2: View results

  1. Wait until the Status column shows Success. Then click Diagnostic Details to view the information.

    image

  2. AI Profiling uses SysOM by default to analyze collected data and display charts. Click Diagnostic Results, then click View Charts to open the SysOM console and view the analysis results.image

    You will see the following views:

    • Multi-process aggregation and analysis

    • Data analysis overview

      image

    • Detailed analysis of GPU CUDA kernel functionsimage

    • CPU/GPU tracing analysis

Step 3: Analyze results

For help analyzing results, see AI Profiling performance analysis examples.