Enable AIMaster-based fault tolerance monitoring for PAI distributed training

A PAI Deep Learning Containers (DLC) job is considered compliant if AIMaster-based fault tolerance monitoring is enabled. This rule does not apply if no training jobs exist.

Risk level

The default risk level is High.

You can change the risk level as needed.

Detection logic

  • A PAI Deep Learning Containers (DLC) job is considered compliant if AIMaster-based fault tolerance monitoring is enabled.

  • If no training jobs exist, this rule does not apply.
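
The detection logic above can be sketched as follows. This is a minimal illustration, not the rule's actual implementation: the job dictionary layout, the `JobId` and `Settings` keys, and the `EnableErrorMonitoringInAIMaster` flag name are assumptions made for the example, and the result strings are illustrative labels.

```python
from typing import Dict, List, Optional

# Assumed flag name, used here for illustration only.
AIMASTER_FLAG = "EnableErrorMonitoringInAIMaster"


def evaluate_job(job: dict) -> str:
    """Return COMPLIANT only when the AIMaster flag is explicitly True."""
    settings = job.get("Settings") or {}
    return "COMPLIANT" if settings.get(AIMASTER_FLAG) is True else "NON_COMPLIANT"


def evaluate_jobs(jobs: List[dict]) -> Optional[Dict[str, str]]:
    """Evaluate every job; return None when no jobs exist (rule not applicable)."""
    if not jobs:
        return None
    return {job["JobId"]: evaluate_job(job) for job in jobs}


if __name__ == "__main__":
    # Hypothetical sample data: one job with monitoring enabled, one without.
    sample = [
        {"JobId": "dlc-a", "Settings": {AIMASTER_FLAG: True}},
        {"JobId": "dlc-b", "Settings": {}},
    ]
    print(evaluate_jobs(sample))
```

Treating a missing or non-boolean flag as non-compliant is a conservative choice for this sketch; the real rule may handle such cases differently.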

Rule details

| Parameter | Description |
| --- | --- |
| Rule name | Enable AIMaster-based fault tolerance monitoring for PAI distributed training |
| Rule identifier | pai-dlc-error-monitoring-ai-master-enabled |
| Tag | [PAIWorkspace] |
| Automatic remediation | Not supported |
| Rule trigger | Periodic, every 24 hours |
| Supported resource types | [ACS::PAIWorkspace::Workspace] |
| Input parameters | None |

Remediation guide

For more information about remediation, see AIMaster: Elastic Automatic Fault Tolerance Engine.