Use AI Profiling in the console-Container Service for Kubernetes(ACK)-阿里云帮助中心

The rise of large language models (LLMs) has increased demand for fine-grained performance monitoring and tuning of AI training and inference workloads. Many GPU-based workloads require online performance analysis of GPU containers. AI Profiling is a non-intrusive performance analysis tool built on eBPF and dynamic process injection. It is designed natively for Kubernetes container environments and supports online monitoring of container processes running GPU workloads. It collects data across multiple dimensions and enables you to start or stop data collection dynamically on live GPU workloads. For production services, a profiling tool that can attach and detach dynamically enables real-time, detailed analysis without modifying application code. This topic explains how to trigger AI Profiling from the console and view the results.

Procedure

Important

AI Profiling in the console uses a whitelist mechanism. Existing user UIDs are automatically added to the whitelist. New users must submit a ticket to join the whitelist before using AI Profiling.

Step 1: Start diagnosis

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Inspections and Diagnostics > Diagnostics.
On the Fault and Performance Diagnosis page, click AI Profiling to open the AI Profiling page.

Click Diagnose to open the AI Profiling Configuration page. Complete the configuration as prompted. The available check items are listed in the table below.

Collection item	Description	Supported environment	Example chart
gpu	GPU kernel information	CUDA version: 12.0 to 12.8
torch	Inject Torch Profiler to collect Torch-related information	Torch version: >= 2.1.0 Python version: 3.9 to 3.12
pystack	Python method call information	Python version: 3.9 to 3.12
snapshot	GPU memory allocation, fragmentation, and allocation stacks	Torch version: >= 2.1.0 Python version: 3.9 to 3.12
rdma	RDMA monitoring information	An RDMA network interface controller (NIC) must be present
dcgm	DCGM monitoring information
nvtx	Custom NVTX markers	Torch version: >= 2.1.0 NVTX markers are added in your code Python version: 3.9 to 3.12	N/A
flops	FLOPs information in Torch	Torch version: >= 2.1.0 Python version: 3.9 to 3.12	N/A
shapes	FLOPs information in Torch	Torch version: >= 2.1.0 Python version: 3.9 to 3.12	N/A
net	TCP send and receive network packet information	Kernel version >= 5.10	N/A

After you configure the parameters for Profiling configuration according to the table below, click Start Diagnosis.

Configuration item		Description	Example value
Profiling Configuration	Check items	Select one or more check items. See the descriptions in step 4 above.	gpu, torch, pystack, snapshot, rdma, dcgm, nvtx, flops, shapes, net
Profiling Configuration	Duration	Keep the duration under 30 seconds. AI Profiling collects data in real time. Long durations may increase memory and disk usage.	5

Step 2: View results

Wait until the Status column shows Success. Then click Diagnostic Details to view the information.
AI Profiling uses SysOM by default to analyze collected data and display charts. Click Diagnostic Results, then click View Charts to open the SysOM console and view the analysis results.

You will see the following views:
- Multi-process aggregation and analysis
- Data analysis overview
- Detailed analysis of GPU CUDA kernel functions
- CPU/GPU tracing analysis

Step 3: Analyze results

For help analyzing results, see AI Profiling performance analysis examples.