These notifications alert you when an underlying machine node for your Lingjun resources fails. If you receive a notification, promptly terminate the jobs on the affected node to allow the node self-healing process to complete.
Background
When the system detects an abnormal node, it uses node self-healing to automatically fail over to a standby machine. This ensures high availability for your resources. You can enable notifications for the following two scenarios:
-
Node scheduling disabled: The system detected an unhealthy node. The node is temporarily disabled for scheduling.
-
Node self-healing blocked: Jobs running on an abnormal node are preventing the node self-healing process from completing. You must take the following actions:
-
DSW instance: Manually save your environment and shut down the instance, or configure an automatic restart policy in the workspace's scheduling configuration.
-
DLC job: Manually stop the job.
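The two scenarios and their follow-up actions can be summarized as a small lookup. The following is only an illustrative sketch: the names `Notification`, `Workload`, and `required_action` are hypothetical and are not part of any PAI SDK or API. The source lists required actions only for the self-healing-blocked scenario, so the scheduling-disabled case is treated here as informational, which is an assumption.

```python
from enum import Enum
from typing import Optional


class Notification(Enum):
    # The two notification scenarios described above.
    SCHEDULING_DISABLED = "Node scheduling disabled"
    SELF_HEALING_BLOCKED = "Node self-healing blocked"


class Workload(Enum):
    DSW = "DSW instance"
    DLC = "DLC job"


def required_action(notification: Notification,
                    workload: Optional[Workload] = None) -> str:
    """Return the follow-up action for a notification (hypothetical helper)."""
    if notification is Notification.SCHEDULING_DISABLED:
        # Assumption: no user action is listed for this scenario.
        return "Informational: the node is temporarily disabled for scheduling."
    if workload is Workload.DSW:
        return ("Save your environment and shut down the instance, "
                "or configure an automatic restart policy.")
    if workload is Workload.DLC:
        return "Manually stop the job."
    raise ValueError("A workload type is required when self-healing is blocked.")


print(required_action(Notification.SELF_HEALING_BLOCKED, Workload.DLC))
```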
Limitations
This feature is available only for Lingjun resources.
Subscribe to notifications
When the system detects that scheduling is disabled for a node or that your job is running on an abnormal node, you can receive notifications by internal message, email, text message, or bot. We recommend enabling these notifications so that you can respond promptly.
Text message, email, and internal message notifications
-
Log on to the PAI console.
-
In the upper-right corner, click the icon to open the Message Center.
-
In the left navigation pane, choose Message Settings > Common Settings.
-
In the Message Type column, find Product O&M Notification. Select the checkboxes for Internal Message, Email, or Text Message, and ensure a recipient is added. You can also click Modify in the Actions column to configure more contacts.
After you complete the configuration, if the system detects an abnormal node, it sends a notification that includes the name of the abnormal node, the resource quota, and information about the jobs running on the node.
Bot notifications
The bot receiving platform provided by Alibaba Cloud Message Center currently supports DingTalk, WeCom, Lark, and Slack. For more information, see Alibaba Cloud Message Center.
-
Log on to the PAI console.
-
In the upper-right corner, click the icon to open the Message Center.
-
In the left navigation pane, choose Message Settings > Bot Subscription Management.
-
Add a bot. If you have already added a bot, skip this step.
-
In the upper-right corner of the Bot Subscription Management page, click Bot Management.
-
On the Bot Management page, use the links for each platform at the top of the page to obtain the bot webhook information. Then, click Add Bot to complete the process.
-
Click Test in the Actions column to test the bot's connectivity. A "test is successful" message confirms that the connection is working.
-
On the Bot Subscription Management page, in the row where the Message Type is Product O&M Notification, click Modify in the Actions column.
-
Go to the Message Recipient Bot tab, select the bot that you added, and then click Save.
Customize message receiving rules
This feature is available only to allowlisted users. To use this feature, contact your account manager.
Configure different notification bots and receiving rules to meet your business needs. For example, you may have one group that tracks only PAI-related messages and another group that tracks all product O&M notifications from Alibaba Cloud. To do this, configure two different bots:
-
Bot test1: Receives all messages. No receiving rules are needed.
-
Bot test2: Receives only messages that contain the PAI keyword. Receiving rules are required. Configure the bot as follows:
-
In the navigation pane on the left of the Message Center page, choose Message Settings > Bot Management.
-
In the Actions column for the Product O&M Notification message type, click Modify.
-
Go to the Message Bots tab. In the Receiving Rules column for the target recipient, click Edit. On the Configure Custom Receiving Rules page, set the allowlist keywords, such as PAI. You will then receive only PAI-related notifications.
To configure multiple recipients at once, select them and click Batch Edit Rules.
-
When you are finished, click OK. Then, on the Modify Message Receiving Configuration page, click Save. After the configuration is saved, the Receiving Rules column shows Allowlist Enabled.
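The effect of the two-bot setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the Message Center implementation: the `Bot` class and `route` function are hypothetical, and the matching semantics (case-insensitive substring match, with an empty allowlist meaning "receive everything") are assumptions, because the source does not specify how keywords are matched.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bot:
    name: str
    # Allowlist keywords; an empty list means no receiving rule is configured.
    allowlist: List[str] = field(default_factory=list)

    def accepts(self, message: str) -> bool:
        """Return True if this bot should receive the message."""
        if not self.allowlist:
            return True  # No rule: receive all messages.
        text = message.lower()
        # Assumed semantics: case-insensitive substring match on any keyword.
        return any(keyword.lower() in text for keyword in self.allowlist)


def route(message: str, bots: List[Bot]) -> List[str]:
    """Return the names of the bots that would receive the message."""
    return [bot.name for bot in bots if bot.accepts(message)]


bots = [Bot("test1"), Bot("test2", allowlist=["PAI"])]

print(route("PAI: node self-healing blocked on a Lingjun node", bots))
print(route("ECS: scheduled event for an instance", bots))
```

Under these assumptions, the first message reaches both bots and the second reaches only test1. Note that plain substring matching would also match unrelated words that contain the keyword (for example, "repair" contains "pai"), so choose allowlist keywords carefully if the real matching behaves similarly.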
Procedure
After you receive a notification that node self-healing is blocked, follow these steps to clear DSW instances and DLC jobs from the abnormal node. This ensures that the node replacement proceeds normally.
Migrate DSW instances
Method 1: Manual migration
If a DSW instance on an abnormal node is open in your browser, a pop-up window appears. It prompts you to save the environment and shut down the instance as soon as possible to allow for node self-healing.
Method 2: Automatic migration
Automatic migration is currently supported in the China (Ulanqab) and Singapore regions.
-
Log on to the PAI console.
-
In the navigation pane on the left, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
-
On the Workspace Details page, in the Workspace Configuration section on the right, select Scheduling Settings.
-
In the DSW section, turn on the Enable Automatic Instance Migration from Abnormal Node switch.
After you enable this feature, the system automatically shuts down and restarts an instance when its underlying node becomes abnormal. This supports the node self-healing process and ensures the integrity and availability of your resources. During the restart, your environment image is saved, but running processes cannot be recovered.
If a DSW instance on an abnormal node is open in your browser, a pop-up window prompts you to save the environment and shut down the instance. The window also shows the time remaining before the instance automatically restarts to allow for node self-healing.
Stop DLC jobs
-
Click the details link in the internal message, email, or text message to go to the resource quota page.
-
On the resource quota page, on the Nodes tab, find the target node and click Details in the Job Count column.
-
Click the name of a DLC job to go to its details page. Then, in the upper-right corner, click More > Stop to stop the DLC job.
-
Click Clone. The job is recreated with the same configuration and scheduled on a healthy node. For more information, see Clone a training job.