This topic describes how to use the SanityCheck feature in DLC.
Overview
In AI training scenarios, you may encounter the following issues:
Resource failures that interrupt jobs and waste GPU resources: A job might spend significant time on initialization, such as loading a model checkpoint, only to fail due to faulty resources. Investigating the issue and resubmitting the job wastes GPU resources.
Insufficient tools for performance diagnosis and testing: If model training performance degrades during a job, a slow node might be the cause, but it can be difficult to identify quickly. Additionally, users often lack convenient and reliable benchmarks for testing the GPU compute and communication performance of machines in a resource group.
To address these issues, DLC provides the SanityCheck feature to check the health and performance of compute resources for distributed training jobs. You can enable this feature when you create a DLC training job. SanityCheck performs a comprehensive check on the training resources, automatically isolates faulty nodes, and triggers automated backend maintenance workflows. This process reduces the likelihood of issues during the initial phase of training and improves the job success rate. After the checks are complete, SanityCheck generates a report on GPU compute and communication performance. The report helps you identify factors that may degrade training performance, improving overall diagnostic efficiency.
Limitations
Currently, this feature is available only for PyTorch training jobs that use Lingjun intelligent computing resources. These jobs must use all GPUs on each allocated machine. Lingjun intelligent computing resources are available only to allowlisted users. To request access, contact your account manager.
Enable health checks
Use the console
When you create a DLC training job in the PAI console, you can enable health checks by configuring the following key parameters. After you create the job, the system checks resource health and availability and then reports the results.
The key parameters are described as follows:
In the Resource Information section:
Parameter | Description
Resource Type | Select Lingjun Intelligence Resources.
Source | Select Resource Quota.
Resource Quota | Select an existing resource quota for Lingjun intelligent computing resources. For information about how to create a resource quota, see Create a resource quota.
Framework | Select PyTorch.
Job Resource | The job must be configured to use all GPUs on each machine.
In the Fault Tolerance and Diagnosis section, turn on the Health Check switch and configure the following parameters:
Parameter
Description
Check Time
Before Job Runs (default): Runs a pre-check on compute nodes after the system allocates resources to the job and before it executes your code.
After Job Restarts: Runs a health check after AIMaster automatic fault tolerance restarts a job due to an exception.
Note: To use this option, you must turn on the Automatic Fault Tolerance switch. For more information, see AIMaster: An elastic and automatic fault tolerance engine.
Before Job Runs + After Job Restarts: Runs a health check both before the job runs and after the job restarts.
Note: To use this option, you must turn on the Automatic Fault Tolerance switch. For more information, see AIMaster: An elastic and automatic fault tolerance engine.
Check Items
The check items are grouped into four categories: compute performance check, node communication check, compute and communication overlap check, and model simulation verification. For more information about the check items and recommended scenarios, see Appendix: Check items.
By default, GPU GEMM (for checking GPU GEMM performance) and All-Reduce (for checking node communication performance and identifying slow or faulty nodes) are enabled.
You can search for or select check items from a list. You can also use a quick configuration template to select a predefined set of check items.
Maximum Check Duration
The maximum time allowed for the health check. The default is 60 minutes. If the check times out, the system applies the configured Exception Handling Policy.
Exception Handling Policy
If a health check fails, the system handles the job according to the selected policy:
End job: If a faulty or suspicious node is identified, the job is terminated and marked as Check Failed.
Add to blocklist and rerun: If a faulty or suspicious node is identified, the system automatically adds the node to a blocklist, restarts the job, and reruns the checks. This repeats until all checks pass or the maximum restart count is exceeded.
Maximum Restart Count
When Exception Handling Policy is set to Add to blocklist and rerun, you can configure the maximum number of restarts. The default value is 10. If the maximum number of restarts is exceeded, the job automatically fails.
Other Configurations
This parameter is empty by default. You can use it to configure advanced parameters.
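The interaction between Exception Handling Policy and Maximum Restart Count can be sketched as a small simulation. This is an illustration only, not the DLC implementation: `run_health_checks`, `allocate`, `check_node`, the policy strings, and the node names are all hypothetical, and the real scheduler decides how replacement nodes are allocated.

```python
def run_health_checks(allocate, check_node, policy="blocklist_and_rerun", max_restarts=10):
    """Simulate the two exception handling policies described above.

    allocate(blocklist) -> list of node names outside the blocklist (hypothetical scheduler)
    check_node(node)    -> True if the node passes every selected check item (hypothetical)
    """
    blocklist = set()
    restarts = 0
    while True:
        nodes = allocate(blocklist)
        if not nodes:
            # No healthy capacity left to allocate in this simulation.
            return {"status": "Check Failed", "restarts": restarts, "blocklist": sorted(blocklist)}
        faulty = [n for n in nodes if not check_node(n)]
        if not faulty:
            return {"status": "Check Passed", "nodes": nodes,
                    "restarts": restarts, "blocklist": sorted(blocklist)}
        if policy == "end_job":
            # End job: terminate immediately and mark the job as Check Failed.
            return {"status": "Check Failed", "faulty": faulty,
                    "restarts": restarts, "blocklist": sorted(blocklist)}
        # Add to blocklist and rerun: isolate the faulty nodes, then restart.
        blocklist.update(faulty)
        if restarts >= max_restarts:
            # Another restart would exceed the maximum restart count.
            return {"status": "Check Failed", "faulty": faulty,
                    "restarts": restarts, "blocklist": sorted(blocklist)}
        restarts += 1


# Hypothetical four-node pool in which one node is faulty.
POOL = ["node-a", "node-b", "node-c", "node-d"]
BAD = {"node-b"}

def allocate(blocklist, size=2):
    return [n for n in POOL if n not in blocklist][:size]

def check_node(node):
    return node not in BAD

rerun_result = run_health_checks(allocate, check_node)
end_result = run_health_checks(allocate, check_node, policy="end_job")
print(rerun_result)
print(end_result)
```

In this toy run, the rerun policy blocklists `node-b`, restarts once, and passes on the second allocation, while the End job policy stops at the first failure.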
Use the API
When you call the CreateJob API operation, add the following two parameters to the Settings parameter to enable health checks.
Parameter | Description | Example
EnableSanityCheck | Specifies whether to enable SanityCheck for the job. |
SanityCheckArgs | The execution arguments for SanityCheck. |
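As a rough illustration, the two parameters could be placed in the Settings object of a CreateJob request as sketched below. This sketch makes several assumptions that this topic does not confirm: that EnableSanityCheck is a boolean, that SanityCheckArgs is passed as a single string, and that the helper `build_settings` (a hypothetical name) is a convenient way to assemble them. The argument string is a placeholder, not a real value; consult the CreateJob API reference for the supported options.

```python
import json


def build_settings(enable_sanity_check, sanity_check_args=None):
    """Assemble the Settings fragment for a CreateJob request (hypothetical helper)."""
    # Assumption: EnableSanityCheck takes a boolean value.
    settings = {"EnableSanityCheck": bool(enable_sanity_check)}
    # Only send arguments when the check is enabled and arguments were provided.
    if enable_sanity_check and sanity_check_args:
        settings["SanityCheckArgs"] = sanity_check_args
    return settings


# Placeholder only: the real argument syntax is defined by the API reference.
PLACEHOLDER_ARGS = "<SanityCheckArgs from the CreateJob API reference>"

request_fragment = {"Settings": build_settings(True, PLACEHOLDER_ARGS)}
print(json.dumps(request_fragment, indent=2))
```

The remaining CreateJob fields (job name, resources, and so on) are omitted here and must be supplied as for any other DLC job.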
View check results
Health check status
The health check process for a DLC job includes the following statuses:
Checking: The health check is in progress.
Check Failed: The check fails if a faulty node is detected or if the check times out.
Check Passed: After all health checks pass, the job's status changes to Running.
View health check results
Use the console
On the details page of a DLC job, go to the Event tab and click Health Check to view the check progress and results.
The health check includes items such as Preparing Check Environment, GPU GEMM, GPU Kernel Launch, All-Reduce-Single-Node, MatMul/All-Reduce Overlap, and Mini GPT-Single-Node. A green check mark indicates that an item has passed.
Click the Restart History tab to view the number of restarts, the restart reason, and the result.
Use the API
GetJobSanityCheckResult: Retrieves the results of a specific SanityCheck run for a DLC job.
ListJobSanityCheckResults: Retrieves the results of all SanityCheck runs for a DLC job.
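A client that waits for the health check to finish might poll the job until it leaves the Checking status. The following is a minimal sketch under stated assumptions: `wait_for_health_check` and `fetch_status` are hypothetical names, and `fetch_status` stands in for whatever wrapper you write around GetJobSanityCheckResult that returns one of the status names listed in this topic. The response format of the real API is not shown here.

```python
import time

# Status names taken from the "Health check status" section.
TERMINAL_STATUSES = {"Check Passed", "Check Failed"}


def wait_for_health_check(fetch_status, poll_seconds=30.0,
                          max_wait_seconds=3600.0, sleep=time.sleep):
    """Poll until the health check reaches a terminal status.

    fetch_status: zero-argument callable returning the current status string
                  (hypothetical wrapper around GetJobSanityCheckResult).
    max_wait_seconds: defaults to 3600, matching the default 60-minute
                  Maximum Check Duration.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= max_wait_seconds:
            raise TimeoutError(f"health check still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with a fake status source instead of a real API call.
fake_statuses = iter(["Checking", "Checking", "Check Passed"])
final_status = wait_for_health_check(lambda: next(fake_statuses),
                                     poll_seconds=0, sleep=lambda s: None)
print(final_status)
```

Injecting `sleep` and `fetch_status` keeps the loop testable without network access or real delays.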
Configure message notifications
You can create a notification rule in the event notification settings of your PAI workspace so that the system notifies you when a health check fails. For Event Type, select DLC Job > Automatic Fault Tolerance. For information about how to configure the other parameters, see Message Notification. For instructions on how to create notification rules in a workspace, see Event Notification Settings.
Appendix: Check items
The estimated check durations are based on a two-machine setup and are for reference only. Actual times may vary.
Category | Check item | Description (scenarios) | Estimated duration
Compute performance check | GPU GEMM | Tests GPU GEMM performance. | 1 minute
Compute performance check | GPU Kernel Launch | Tests the latency of GPU kernel launches. | 1 minute
Node communication check | All-Reduce | Tests node communication performance for the selected collective communication pattern and identifies slow or faulty communication nodes. | 5 minutes per collective communication check
Node communication check | All-to-All | Same as All-Reduce. | 5 minutes per collective communication check
Node communication check | All-Gather | Same as All-Reduce. | 5 minutes per collective communication check
Node communication check | Multi-All-Reduce | Same as All-Reduce. | 5 minutes per collective communication check
Node communication check | Network Connectivity | Tests network connectivity for the head or tail nodes to identify nodes with connectivity issues. | 2 minutes
Compute and communication overlap check | MatMul/All-Reduce Overlap | Tests single-node performance when communication and compute kernels overlap. | 1 minute
Model simulation verification | Mini GPT | Verifies the reliability of the AI system by using model simulation. | 1 minute
Model simulation verification | Megatron GPT | Same as Mini GPT. | 5 minutes
Model simulation verification | ResNet | Same as Mini GPT. | 2 minutes
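When you select check items, it can help to add up the estimated durations and compare the total with Maximum Check Duration. The sketch below uses the reference values from the table above. It assumes that the selected items run one after another and that each collective communication check takes about 5 minutes; neither assumption is confirmed by this topic, and `estimate_minutes` is a hypothetical helper.

```python
# Estimated durations in minutes, from the appendix table (two-machine setup).
ESTIMATED_MINUTES = {
    "GPU GEMM": 1,
    "GPU Kernel Launch": 1,
    "All-Reduce": 5,
    "All-to-All": 5,
    "All-Gather": 5,
    "Multi-All-Reduce": 5,
    "Network Connectivity": 2,
    "MatMul/All-Reduce Overlap": 1,
    "Mini GPT": 1,
    "Megatron GPT": 5,
    "ResNet": 2,
}

# Check items that are enabled by default according to this topic.
DEFAULT_ITEMS = ["GPU GEMM", "All-Reduce"]
DEFAULT_MAX_CHECK_MINUTES = 60


def estimate_minutes(items):
    """Sum the reference durations, assuming the items run sequentially."""
    return sum(ESTIMATED_MINUTES[item] for item in items)


default_total = estimate_minutes(DEFAULT_ITEMS)
all_total = estimate_minutes(ESTIMATED_MINUTES)
print(f"default items: ~{default_total} min, all items: ~{all_total} min")
print("fits default limit:", all_total <= DEFAULT_MAX_CHECK_MINUTES)
```

Because the table values are based on two machines, larger jobs may take longer, so leave headroom when you lower Maximum Check Duration.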