AI Profiling

AI Profiling is an advanced analytics tool that helps you observe, diagnose, and optimize the performance of AI applications across their entire lifecycle. It deeply traces the call paths of AI models during training and inference across the full software stack, including the Python stack, Torch layer, GPU memory, CUDA Runtime, and GPU kernels. It combines this data with fine-grained, operator-level performance metrics (such as FLOPs and the time spent on computation, communication, GPU memory operations, and idling) and resource consumption data to provide an end-to-end solution for developers and operations teams.

Limitations

This feature is currently available on an allowlist basis. To request a trial, please submit a ticket.
  • Region availability

    This feature is currently available only in the Chinese mainland and China (Hong Kong).

  • Operating system limitations

    Architecture: x86

    Operating systems:

    • Alibaba Cloud Linux 3

    • Alibaba Cloud Linux 2

    • Ubuntu 22.04

    • Ubuntu 24.04

  • Other limitations

    Important

    AI Profiling consumes memory and CPU resources of the target process. The amount of overhead depends on the collection duration and the number of iterations. If the target process has less than 0.5 GB of available memory, data collection is aborted.

    • Machines must be onboarded before you can use AI Profiling.

    • Supported GPUs: A-series, L-series, and T-series.

    • AI jobs are supported in virtual environments or containers, such as ACK or self-managed Kubernetes. When using containers, the job container must not mount the host's /proc directory. Running AI jobs directly in the host's Python environment is not recommended. For support in other scenarios, submit a ticket.

    • Only Python processes that use a GPU can be profiled. Supported Python versions range from 3.9 to 3.12.

    • Supported Torch versions: 2.1.0 or later.

    • Supported CUDA versions range from 12.0 to 12.8, excluding version 12.7.

    • If your AI job uses NVIDIA Nsight Systems (nsys) by default, GPU kernel data cannot be collected.

    • The Python interpreter for the target process must have pip installed.

    Verified environments:

    • Alibaba Cloud Linux 2 + (conda) Python 3.11.13 + pip 25.1.1 + NVIDIA A10 + CUDA Version: 12.4 + 4.19.91-28.2.al7.x86_64

    • Alibaba Cloud Linux 3 + (conda) Python 3.10.18 + pip 25.1 + Tesla T4 + CUDA Version: 12.4 + 5.10.134-19.1.al8.x86_64

    • Alibaba Cloud Linux 3 + (conda) Python 3.9.23 + pip 21.2.4 + Tesla T4 + CUDA Version: 12.4 + 5.10.134-19.1.al8.x86_64

    • Ubuntu 24.04 + (conda) Python 3.12.1 + pip 22.0.4 + NVIDIA A10 + CUDA Version: 12.8 + 6.8.0-63-generic

    • Ubuntu 22.04 + (conda) Python 3.12.11 + pip 22.0.4 + Tesla P4 + CUDA Version: 12.8 + 5.15.0-142-generic

    • Ubuntu 22.04.3 LTS + Python 3.10.12 + NVIDIA A10 + CUDA Version: 12.2 + 5.10.134-18.al8.x86_64 (ACK container)

  • Data richness

    Default metric collection

      Description: The following metrics are enabled by default:
        • GPU kernel
        • Torch
        • Python stack

      Supported environments:
        • The union of the supported environments for GPU kernel, Torch, and Python stack.

    GPU kernel

      Description: Information about GPU operators.

      Supported environments:
        • Supported hardware: NVIDIA, AMD
        • CUDA versions: 12.0 to 12.8

    Torch

      Description: Information from the Torch layer.

      Supported environments:
        • Supported hardware: NVIDIA, AMD
        • Torch version: 2.1.0 or later
        • Python versions: 3.9 to 3.12
        • pip is installed in the Python interpreter.

    Python stack

      Description: Information from the Python stack.

      Supported environments:
        • Supported hardware: NVIDIA, AMD
        • Python versions: 3.9 to 3.12
        • pip is installed in the Python interpreter.

    Profile memory

      Description: Information about Torch GPU memory usage.

      Supported environments:
        • Note: TensorRT jobs are not supported. If this option is enabled for a TensorRT job, Torch metric collection fails.

    GPU memory snapshot

      Description: Collects data on GPU memory allocation, fragmentation, and allocation stack traces.

      Supported environments:
        • Supported hardware: NVIDIA
        • Torch version: 2.1.0 or later
        • Python versions: 3.9 to 3.12
        • pip is installed in the Python interpreter.

    RDMA monitor

      Description: RDMA monitoring information. The following metrics are collected by default:
        • vport_rx_write_requests
        • vport_rx_read_requests

      Supported environments:
        • An RDMA network interface card (NIC) must be available.

    DCGM monitor

      Description: DCGM monitoring information. The following metrics are collected by default:
        • DCGM_FI_DEV_FB_FREE
        • DCGM_FI_DEV_FB_USED
        • DCGM_FI_DEV_GPU_UTIL

      Supported environments:
        • Supported hardware: NVIDIA
        • Tesla driver

    NVTX

      Description: Custom NVTX markers.

      Supported environments:
        • Supported hardware: NVIDIA, AMD
        • Torch version: 2.1.0 or later
        • Your code contains NVTX markers.
        • Python versions: 3.9 to 3.12
        • pip is installed in the Python interpreter.

    FLOPs

      Description: FLOPs information from Torch.

      Supported environments:
        • Supported hardware: NVIDIA, AMD
        • Torch version: 2.1.0 or later
        • Python versions: 3.9 to 3.12
        • pip is installed in the Python interpreter.

    Record shapes

      Description: Record shapes information from Torch.

      Supported environments:
        • Supported hardware: NVIDIA, AMD
        • Torch version: 2.1.0 or later
        • Python versions: 3.9 to 3.12
        • pip is installed in the Python interpreter.

    TCP network metrics

      Description: Information about sending and receiving TCP network packets.

      Supported environments:
        • Kernel version 5.10 or later
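
The version limits listed under Other limitations can be checked before you submit a profiling request. The following is a minimal sketch and not part of AI Profiling: `parse_version`, `check_environment`, and `describe_current_interpreter` are hypothetical helper names. It also assumes that the CUDA limit can be compared against the CUDA version that Torch was built with (`torch.version.cuda`), which the limitations do not state explicitly.

```python
import importlib.util
import re
import sys

# Ranges copied from the limitations in this topic.
PYTHON_MIN, PYTHON_MAX = (3, 9), (3, 12)
TORCH_MIN = (2, 1, 0)
CUDA_MIN, CUDA_MAX = (12, 0), (12, 8)
CUDA_EXCLUDED = {(12, 7)}


def parse_version(text, parts):
    """Turn a string such as '2.1.0+cu121' into a tuple such as (2, 1, 0)."""
    numbers = []
    for piece in text.split("+")[0].split(".")[:parts]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    numbers += [0] * (parts - len(numbers))  # pad missing components with 0
    return tuple(numbers)


def check_environment(python_version, torch_version, cuda_version, has_pip):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    if not PYTHON_MIN <= parse_version(python_version, 2) <= PYTHON_MAX:
        problems.append(f"Python {python_version} is outside 3.9 to 3.12")
    if torch_version is None:
        problems.append("Torch is not installed")
    elif parse_version(torch_version, 3) < TORCH_MIN:
        problems.append(f"Torch {torch_version} is older than 2.1.0")
    if cuda_version is None:
        problems.append("CUDA version is unknown")
    else:
        cuda = parse_version(cuda_version, 2)
        if not CUDA_MIN <= cuda <= CUDA_MAX or cuda in CUDA_EXCLUDED:
            problems.append(f"CUDA {cuda_version} is not in 12.0 to 12.8 (12.7 excluded)")
    if not has_pip:
        problems.append("pip is not installed for this interpreter")
    return problems


def describe_current_interpreter():
    """Collect the inputs for check_environment from the running interpreter."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    has_pip = importlib.util.find_spec("pip") is not None
    torch_version = cuda_version = None
    try:
        import torch
        torch_version = torch.__version__
        cuda_version = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        pass
    return python_version, torch_version, cuda_version, has_pip


if __name__ == "__main__":
    for problem in check_environment(*describe_current_interpreter()) or ["no problems found"]:
        print(problem)
```

Run the script with the same Python interpreter that runs the AI job, because the pip and Torch checks apply to that interpreter only.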

Benefits

  • Zero-instrumentation: Uses a non-intrusive profiling technique that requires no changes to your containers.

  • Rich data collection: Collect a wide range of metrics on demand, including Python call stacks, CPU information, GPU operators, Torch data, GPU memory, and FLOPs. You can also collect RDMA, GPU, and CPU monitoring metrics.

  • Centralized analysis: Raw data is automatically sent to a centralized service for statistical analysis, providing a multi-dimensional view of your job's performance.

  • Ease of use: The entire profiling process is automated and completes in minutes. To trigger AI Profiling, simply configure the instance ID on the console. After collection, the system automatically uploads and analyzes the data, and then displays the report on the console. The built-in timeline view eliminates the need to export data to external tools like Chrome Tracing or Perfetto.

  • Stability: The feature has been deployed and verified in clusters with thousands of GPUs and does not affect the stability of AI jobs.

  • Flexible collection modes: Supports data collection by duration or by iteration. Advanced options also allow you to define custom iteration entry points and skip a specified number of initial iterations.

  • Low overhead: You can configure the richness of the collected data to manage performance impact. The overhead during collection ranges from 5% to 40%.

  • Multi-process support: Supports simultaneous profiling of multiple processes.

Use cases

AI Profiling helps you diagnose issues such as the following and act on the resulting recommendations:

  • Troubleshoot failures that occur after you deploy an AI application.

  • Investigate why an AI job is running slower than expected.

  • Identify server bottlenecks, such as a time-consuming operator.

Prerequisites

  • If you are a RAM user, ensure that the primary Alibaba Cloud account grants the AliyunECSReadOnlyAccess and AliyunSysomFullAccess system policies to your RAM user.

  • Enable the console service.

    When you log on to the Operating System Console for the first time, click Enable Service.

  • Install the SysOM component. For installation instructions, see Component Management.

Procedure

  1. Go to the Operating System Console > AI Profiling page.

  2. Select or enter the required parameters and click Start Analysis.

    Parameters

    • Instance ID: Choose the ID of an onboarded instance under your account. The instance must be equipped with a GPU and be running an AI job.

    • AI job PID and AI job process name: Specify the AI job to analyze by its PID or by its process name. To specify multiple jobs, separate the PIDs or process names with commas (,). If you enter both PIDs and process names, the system analyzes the union of the matching processes.

      Note

      You can run the top command to find the process ID (in the PID column) and process name (in the COMMAND column) of your AI job.

      You can also run the nvidia-smi command to view this information.
    • Data richness: Select the types of data to collect based on your needs. Options include GPU operators, Python call stacks, CPU information, Torch GPU memory, and FLOPs. You can select multiple options.

    • Analysis mode:

      • duration mode

        Collects data for the time interval specified in Collection duration.

      • iteration mode

        • Iteration range: By default, iterations 0 to 10 are collected. You can configure the profiler to skip a specified number of initial iterations. This iteration count is specific to the data collection module and is independent of the iteration count within the AI job.

        • Iteration entry module: For example, transformers.trainer.

        • Iteration entry function:

          • For vLLM inference scenarios, the default is LLMEngine.step.

          • For training scenarios, the default is Optimizer.step.

    • Collection duration: The duration for data collection in duration mode. The default is 2,000 ms, and the supported range is 1,000 ms to 5,000 ms.

  3. In the Analysis Records section, click View Report for the desired record.
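
For the AI job PID field in step 2, the PIDs of GPU processes can also be gathered with a script instead of being read from top or nvidia-smi by hand. The following sketch assumes that `nvidia-smi` is on the PATH and relies on its `--query-compute-apps` CSV output. The helper names and the `python` name filter are illustrative assumptions, not console behavior.

```python
import subprocess

# Ask nvidia-smi for one "pid, process_name" line per GPU compute process.
QUERY = ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name' lines into a list of (pid, name) tuples."""
    apps = []
    for line in csv_text.splitlines():
        pid_text, _, name = line.partition(",")
        pid_text = pid_text.strip()
        if pid_text.isdigit():  # skips blank or malformed lines
            apps.append((int(pid_text), name.strip()))
    return apps


def pid_field(apps, name_filter="python"):
    """Build the comma-separated PID string for processes whose name contains name_filter."""
    # A process that uses several GPUs is listed once per GPU, so deduplicate.
    pids = sorted({pid for pid, name in apps if name_filter in name})
    return ",".join(str(pid) for pid in pids)


def list_gpu_processes():
    """Run nvidia-smi and return the parsed (pid, name) tuples."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

On a GPU instance, `print(pid_field(list_gpu_processes()))` prints a string that can be pasted into the AI job PID field. The name filter reflects the limitation that only Python processes that use a GPU can be profiled; adjust it if your interpreter binary has a different name.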

Interpreting the results

  • Analysis suggestions

    This section provides recommendations based on the analysis of your AI job.

    These suggestions typically include an assessment of overall GPU utilization, identification of the most time-consuming functions or modules, and recommendations for further analysis.

  • CPU/GPU summary

    Displays device information, function call time statistics (in microseconds), and GPU utilization, as shown in the following figure.

    image

  • GPU kernel analysis

    This section displays charts of GPU kernel call times and Tensor Core usage (in microseconds), along with detailed statistics for each GPU kernel, as shown in the following figure.

    image

  • Iteration statistics and differential analysis

    • AI iteration statistics

      This feature uses iteration markers to anchor the training or inference process and perform separate statistical analysis for each iteration. It calculates the loss value and the time spent on computation, storage, and communication for each iteration. The data is summarized in a bar chart to help you visually identify iterations with abnormal gradients or unusually high communication times. You can drag the slider to view data for specific iterations, as shown in the following figure.

      image

      • For training jobs, the system uses Optimizer.step as the default iteration delimiter.

      • For inference jobs, the system uses the LLMEngine.step function as the default delimiter.
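
The idea of anchoring statistics on a delimiter call can be illustrated with a short example. This is a conceptual sketch, not the implementation that AI Profiling uses: the event fields (`name`, `category`, `start_us`, `duration_us`), the function names, and the sample values are hypothetical.

```python
from collections import defaultdict


def split_by_delimiter(events, delimiter="Optimizer.step"):
    """Group events into iterations; each call to the delimiter closes one iteration."""
    iterations, current = [], []
    for event in sorted(events, key=lambda e: e["start_us"]):
        current.append(event)
        if event["name"] == delimiter:
            iterations.append(current)
            current = []
    # Events after the last delimiter belong to an unfinished iteration and are dropped.
    return iterations


def time_per_category(iteration):
    """Sum the event durations (in microseconds) of one iteration by category."""
    totals = defaultdict(int)
    for event in iteration:
        totals[event["category"]] += event["duration_us"]
    return dict(totals)


# Hypothetical timeline: the second iteration spends far longer on communication.
events = [
    {"name": "matmul", "category": "computation", "start_us": 0, "duration_us": 400},
    {"name": "all_reduce", "category": "communication", "start_us": 400, "duration_us": 100},
    {"name": "Optimizer.step", "category": "computation", "start_us": 500, "duration_us": 50},
    {"name": "matmul", "category": "computation", "start_us": 550, "duration_us": 400},
    {"name": "all_reduce", "category": "communication", "start_us": 950, "duration_us": 900},
    {"name": "Optimizer.step", "category": "computation", "start_us": 1850, "duration_us": 50},
]

stats = [time_per_category(iteration) for iteration in split_by_delimiter(events)]
for index, totals in enumerate(stats):
    print(index, totals)
```

Comparing the per-iteration totals side by side is what makes an outlier, such as the communication time in the second iteration here, stand out.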

    • AI differential analysis

      The timeline data collected by AI Profiling can be complex and large (gigabyte-scale), making manual analysis difficult. In performance comparison or anomaly analysis scenarios, traditional methods struggle to quickly identify differences between two datasets. AI differential analysis compares the kernel function duration and call count differences between an abnormal and a normal iteration. This allows you to precisely locate the root-cause functions responsible for performance differences across iterations, enabling rapid bottleneck identification and optimization.

      Select a Baseline step and a Comparison step, and then click Submit. For example, you can select a normal iteration as the baseline and an abnormal iteration for comparison, as shown in the following figure.

      image

      Parameter descriptions:

      • Name: The root-cause function.

      • Baseline step and Comparison step: The duration, call count, and their respective percentages for the selected iteration.

      • Duration difference: The difference in duration, and its percentage, between the baseline step and the comparison step.

      • Call count difference: The difference in call count, and its percentage, between the baseline step and the comparison step.
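
The comparison between a baseline step and a comparison step can be sketched in a few lines. This is an illustration under stated assumptions, not the algorithm that AI Profiling runs: each step is modeled as a hypothetical mapping from kernel name to `(duration_us, call_count)`, and percentages are assumed to be relative to the baseline step.

```python
def percent_change(base, new):
    """Return the change relative to base in percent, or None if base is 0."""
    if base == 0:
        return None
    return 100.0 * (new - base) / base


def diff_steps(baseline, comparison):
    """Compare per-kernel duration and call count; largest duration change first."""
    rows = []
    for name in sorted(set(baseline) | set(comparison)):
        base_us, base_calls = baseline.get(name, (0, 0))
        comp_us, comp_calls = comparison.get(name, (0, 0))
        rows.append({
            "name": name,
            "duration_diff_us": comp_us - base_us,
            "duration_diff_pct": percent_change(base_us, comp_us),
            "call_diff": comp_calls - base_calls,
            "call_diff_pct": percent_change(base_calls, comp_calls),
        })
    rows.sort(key=lambda row: abs(row["duration_diff_us"]), reverse=True)
    return rows


# Hypothetical per-kernel totals for a normal and an abnormal iteration.
baseline = {"gemm_kernel": (4000, 10), "nccl_all_reduce": (1000, 2)}
comparison = {"gemm_kernel": (4100, 10), "nccl_all_reduce": (9000, 2), "memcpy_h2d": (300, 1)}

rows = diff_steps(baseline, comparison)
for row in rows:
    print(row)
```

Sorting by the absolute duration difference puts the kernel that changed the most at the top, which in this sample is `nccl_all_reduce`. A kernel that appears in only one step has no baseline to compare against, so its percentage is reported as `None`.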

  • CPU/GPU tracing analysis

    The built-in timeline view eliminates the need for external tools like Chrome Tracing or Perfetto. You can analyze configurable data types, including Python call stacks, CPU information, GPU operators, Torch data, GPU memory, and FLOPs.

    image.png

    image.png