SanityCheck: Compute health checks


This topic describes how to use the SanityCheck feature in DLC.

Overview

In AI training scenarios, you may encounter the following issues:

  • Resource failures that interrupt jobs and waste GPU resources: A job might spend significant time on initialization, such as loading a model checkpoint, only to fail due to faulty resources. Investigating the issue and resubmitting the job wastes GPU resources.

  • Insufficient tools for performance diagnosis and testing: If model training performance degrades during a job, a slow node might be the cause, but it can be difficult to identify quickly. Additionally, users often lack convenient and reliable benchmarks for testing the GPU compute and communication performance of machines in a resource group.

To address these issues, DLC provides the SanityCheck feature to check the health and performance of compute resources for distributed training jobs. You can enable this feature when you create a DLC training job. SanityCheck performs a comprehensive check on the training resources, automatically isolates faulty nodes, and triggers automated backend maintenance workflows. This process reduces the likelihood of issues during the initial phase of training and improves the job success rate. After the checks are complete, SanityCheck generates a report on GPU compute and communication performance. The report helps you identify factors that may degrade training performance, improving overall diagnostic efficiency.

Limitations

Currently, this feature is available only for PyTorch training jobs that use Lingjun intelligent computing resources. These jobs must use all GPUs on each allocated machine. Lingjun intelligent computing resources are available only to allowlisted users. To request access, contact your account manager.

Enable health checks

Use the console

When you create a DLC training job in the PAI console, you can enable health checks by configuring the following key parameters. After you create the job, the system checks resource health and availability and then reports the results.

The key parameters are described as follows:

  • In the Resource Information section:

    Parameter

    Description

    Resource Type

    Select Lingjun Intelligence Resources.

    Source

    Select Resource Quota.

    Resource Quota

    Select an existing resource quota for Lingjun intelligent computing resources. For information about how to create a resource quota, see Create a resource quota.

    Framework

    Select PyTorch.

    Job Resource

    The job must be configured to use all GPUs on each machine.

  • In the Fault Tolerance and Diagnosis section, turn on the Health Check switch and configure the following parameters:

    Parameter

    Description

    Check Time

    • Before Job Runs (default): Runs a pre-check on compute nodes after the system allocates resources to the job and before it executes your code.

    • After Job Restarts: Runs a health check after AIMaster automatic fault tolerance restarts a job due to an exception.

      Note

      To use this option, you must turn on the Automatic Fault Tolerance switch. For more information, see AIMaster: An elastic and automatic fault tolerance engine.

    • Before Job Runs + After Job Restarts: Runs a health check both before the job runs and after the job restarts.

      Note

      To use this option, you must turn on the Automatic Fault Tolerance switch. For more information, see AIMaster: An elastic and automatic fault tolerance engine.

    Check Items

    The check items are grouped into four categories: compute performance check, node communication check, compute and communication overlap check, and model simulation verification. For more information about the check items and recommended scenarios, see Appendix: Check items.

    • By default, GPU GEMM (for checking GPU GEMM performance) and All-Reduce (for checking node communication performance and identifying slow or faulty nodes) are enabled.

    • You can search for or select check items from a list. You can also use a quick configuration template to select a predefined set of check items.

    Maximum Check Duration

    The maximum time allowed for the health check. The default is 60 minutes. If the check times out, the system triggers the configured policy for handling check exceptions.

    Exception Handling Policy

    If a health check fails, the system handles the job according to the selected policy:

    • End job: If a faulty or suspicious node is identified, the job is terminated and marked as Check Failed.

    • Add to blocklist and rerun: If a faulty or suspicious node is identified, the system automatically adds the node to a blocklist, restarts the job, and reruns the checks until all checks pass.

    Maximum Restart Count

    When Exception Handling Policy is set to Add to blocklist and rerun, you can configure the maximum number of restarts. The default value is 10. If this limit is exceeded, the job fails automatically.

    Other Configurations

    This parameter is empty by default. You can use it to configure advanced parameters.
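The Add to blocklist and rerun policy, together with Maximum Restart Count, can be pictured as the loop below. This is only an illustrative simulation, not the DLC implementation: the function name, the node names, the `is_healthy` callback, and the way replacement nodes are picked from a pool are all assumptions made for the sketch.

```python
def simulate_blocklist_rerun(pool, needed, is_healthy, max_restarts=10):
    """Simulate the 'Add to blocklist and rerun' policy (hypothetical model).

    pool:         all node names the scheduler could allocate (assumption)
    needed:       number of nodes the job requires
    is_healthy:   callable(node) -> bool, stands in for the real check items
    max_restarts: corresponds to Maximum Restart Count (default 10)
    """
    blocklist = set()
    restarts = 0
    while True:
        # Assumed scheduling rule: take the first nodes that are not blocklisted.
        candidates = [n for n in pool if n not in blocklist]
        if len(candidates) < needed:
            return {"status": "Check Failed", "restarts": restarts,
                    "blocklist": sorted(blocklist)}
        allocated = candidates[:needed]

        faulty = [n for n in allocated if not is_healthy(n)]
        if not faulty:
            return {"status": "Check Passed", "restarts": restarts,
                    "blocklist": sorted(blocklist)}

        # Faulty nodes are isolated before the job is restarted.
        blocklist.update(faulty)
        if restarts >= max_restarts:
            # Another restart would exceed the limit, so the job fails.
            return {"status": "Check Failed", "restarts": restarts,
                    "blocklist": sorted(blocklist)}
        restarts += 1


bad_nodes = {"node-b"}  # hypothetical faulty node
result = simulate_blocklist_rerun(
    pool=["node-a", "node-b", "node-c", "node-d"],
    needed=2,
    is_healthy=lambda node: node not in bad_nodes,
)
print(result)
```

In this model a single faulty node costs one restart. With End job selected instead, the first failed check would end the job immediately.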

Use the API

When you call the CreateJob API operation, add the following two parameters to the Settings parameter to enable health checks.

Parameter

Description

Example

EnableSanityCheck

Specifies whether to enable SanityCheck for the job. Valid values:

  • true: Enables SanityCheck.

  • false: Disables SanityCheck.

"EnableSanityCheck" : "true"

SanityCheckArgs

The execution arguments for SanityCheck. The following options are available:

  • --sanity-check-timing: The check timing. Valid values:

    • BeforeJobRunning (default): Runs the check before the job runs, that is, after resources are allocated and before your code is executed.

    • AfterJobFaultTolerant: Runs the check after automatic fault tolerance restarts the job.

    • BeforeAndAfterRestart: Runs the check both before the job runs and after the job restarts.

  • --sanity-check-max-time: The maximum check duration in minutes.

  • --sanity-check-timeout-ops: The action to take if a check times out. Valid values:

    • MarkJobFail (default): Ends the job with a Failed status.

    • KeepJobHang: Suspends the job.

  • --sanity-check-envs: Sets environment variables for the check. This parameter is empty by default. Use the format key=value. Separate multiple variables with a comma (,). For example: key1=value1,key2=value2.

  • --sanity-check-timeout-per-test: The timeout period for each individual test, in minutes. The default is 6 minutes.

  • --micro-benchmarks: Specifies the micro-benchmarks to run. Separate multiple items with a comma (,).

  • --model-benchmarks: Specifies the model benchmarks to run. Separate multiple items with a comma (,).

  • --max-num-of-sanity-check-restart: The maximum number of reruns after faulty nodes are added to the blocklist. Setting this option enables rerun mode: if SanityCheck identifies a faulty node, the system automatically adds the node to a blocklist and reruns the checks until all checks pass or the limit is reached. For example, --max-num-of-sanity-check-restart=10 allows at most 10 reruns.

"SanityCheckArgs" : "--sanity-check-timing=AfterJobFaultTolerant --sanity-check-timeout-ops=MarkJobFail"

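As a rough illustration of how the two Settings entries fit together, the sketch below assembles them from Python keyword arguments. The helper `build_sanity_check_settings` is hypothetical and is not part of any SDK. It only builds the dictionary fragment and does not call CreateJob. The flag names and valid values are taken from the list above.

```python
import json

VALID_TIMINGS = {"BeforeJobRunning", "AfterJobFaultTolerant", "BeforeAndAfterRestart"}
VALID_TIMEOUT_OPS = {"MarkJobFail", "KeepJobHang"}


def build_sanity_check_settings(timing="BeforeJobRunning",
                                timeout_op="MarkJobFail",
                                max_time_minutes=None,
                                envs=None,
                                max_restarts=None):
    """Return the two Settings entries that enable SanityCheck (hypothetical helper)."""
    if timing not in VALID_TIMINGS:
        raise ValueError(f"unknown timing: {timing}")
    if timeout_op not in VALID_TIMEOUT_OPS:
        raise ValueError(f"unknown timeout op: {timeout_op}")

    args = [
        f"--sanity-check-timing={timing}",
        f"--sanity-check-timeout-ops={timeout_op}",
    ]
    if max_time_minutes is not None:
        args.append(f"--sanity-check-max-time={max_time_minutes}")
    if envs:
        # key=value pairs separated by commas, as described for --sanity-check-envs
        joined = ",".join(f"{key}={value}" for key, value in envs.items())
        args.append(f"--sanity-check-envs={joined}")
    if max_restarts is not None:
        args.append(f"--max-num-of-sanity-check-restart={max_restarts}")

    return {
        "EnableSanityCheck": "true",
        "SanityCheckArgs": " ".join(args),
    }


settings = build_sanity_check_settings(timing="AfterJobFaultTolerant")
print(json.dumps(settings, indent=2))
```

With only the timing overridden, the resulting SanityCheckArgs string matches the example value shown in the table. Validating the timing and timeout values locally is a design choice of this sketch so that typos surface before the request is sent.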
View check results

Health check status

The health check process for a DLC job includes the following statuses:

  • Checking: The health check is in progress.

  • Check Failed: A faulty node was detected or the check timed out.

  • Check Passed: All health checks passed. The job status then changes to Running.

View health check results

Use the console

On the details page of a DLC job, go to the Event tab and click Health Check to view the check progress and results.

The health check includes items such as Preparing Check Environment, GPU GEMM, GPU Kernel Launch, All-Reduce-Single-Node, MatMul/All-Reduce Overlap, and Mini GPT-Single-Node. A green check mark indicates that an item has passed.

Click the Restart History tab to view the number of restarts, the restart reason, and the result.

Use the API

Configure message notifications

You can create a notification rule in the event notification settings of your PAI workspace. For Event Type, select DLC Job > Automatic Fault Tolerance. For information about how to configure other parameters, see Message Notification. The system sends a notification if a health check fails.

Note

For instructions on how to create message notification rules in a workspace, see Event Notification Settings.

Appendix: Check items

Note

The estimated check durations are based on a two-machine setup and are for reference only. Actual times may vary.

Check items are listed by category. Each entry gives the check item, its estimated duration, and a description of the scenarios it covers.

Compute performance check

  • GPU GEMM (estimated duration: 1 minute): Tests GPU GEMM performance. This check can identify:

    • Faulty GPUs: compute errors or hangs.

    • Slow nodes: low compute TFLOPS.

  • GPU Kernel Launch (estimated duration: 1 minute): Tests the latency of GPU kernel launches. This check can identify:

    • Faulty nodes: kernel launch errors or hangs.

    • Slow nodes: high kernel launch latency.

Node communication check

  • All-Reduce, All-to-All, All-Gather, and Multi-All-Reduce (estimated duration: 5 minutes for a single collective communication check): Test node communication performance and identify slow or faulty communication nodes. For the various collective communication patterns, these checks identify:

    • Faulty communication nodes: communication errors or hangs.

    • Slow communication nodes: low communication bandwidth.

  • Network Connectivity (estimated duration: 2 minutes): Tests network connectivity for the head or tail nodes to identify nodes with connectivity issues.

Compute and communication overlap check

  • MatMul/All-Reduce Overlap (estimated duration: 1 minute): Tests single-node performance when communication and compute kernels overlap. This check can identify:

    • Faulty nodes: overlap computation errors or hangs.

    • Slow nodes: high latency for overlapped computations.

Model simulation verification

  • Mini GPT (estimated duration: 1 minute), Megatron GPT (5 minutes), and ResNet (2 minutes): Verify the reliability of the AI system by using model simulation. These checks identify:

    • Faulty nodes: training loss anomalies, training hangs, or training errors.

    • Slow nodes: high latency per training step.
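When choosing check items, it can help to compare the estimated durations listed above with the Maximum Check Duration (60 minutes by default). The sketch below adds up the reference durations for a selection. It assumes that checks run one after another and that each collective communication item takes about 5 minutes. Both are simplifications for illustration, and the function names are hypothetical.

```python
# Reference durations in minutes, copied from the appendix (two-machine setup).
ESTIMATED_MINUTES = {
    "GPU GEMM": 1,
    "GPU Kernel Launch": 1,
    "All-Reduce": 5,          # assumed: 5 minutes per collective check
    "All-to-All": 5,
    "All-Gather": 5,
    "Multi-All-Reduce": 5,
    "Network Connectivity": 2,
    "MatMul/All-Reduce Overlap": 1,
    "Mini GPT": 1,
    "Megatron GPT": 5,
    "ResNet": 2,
}


def estimate_total_minutes(selected):
    """Sum the reference durations, assuming the checks run sequentially."""
    unknown = [item for item in selected if item not in ESTIMATED_MINUTES]
    if unknown:
        raise KeyError(f"unknown check items: {unknown}")
    return sum(ESTIMATED_MINUTES[item] for item in selected)


def fits_budget(selected, max_check_minutes=60):
    """Return True if the estimate is within the Maximum Check Duration."""
    return estimate_total_minutes(selected) <= max_check_minutes


default_items = ["GPU GEMM", "All-Reduce"]  # enabled by default
print(estimate_total_minutes(default_items), fits_budget(default_items))
print(estimate_total_minutes(list(ESTIMATED_MINUTES)))
```

Because the reference numbers come from a two-machine setup, treat the total as a lower bound for larger jobs and leave headroom in the maximum duration.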