Monitor and configure alerts for job timeouts

MaxCompute allows you to monitor job runtimes by configuring threshold-based alert rules. When a job's runtime exceeds the specified threshold, the system notifies the designated alert contacts, helping you quickly identify and address abnormal jobs and improve operational efficiency.

Monitoring metrics

The following metrics are used to monitor job runtimes.

Job runtime
- Monitors all jobs within a MaxCompute project. If a job's runtime, including wait time, exceeds the specified threshold, the system sends an alert to the designated alert contacts.
- Use case: Ideal for projects where analysts run short data retrieval jobs. This monitor helps you quickly detect issues such as resource contention or high computational load when a job takes too long.
Job runtime_SQL type
- Monitors all SQL jobs within a MaxCompute project. If an SQL job's runtime, including wait time, exceeds the specified threshold, the system sends an alert to the designated alert contacts.
- Use case: Essential for production projects. This monitor helps you quickly address SQL job timeouts and prevent delays in business workflows.
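The threshold comparison behind both metrics can be sketched as follows. This is only an illustration of the rule described above, not the CloudMonitor implementation: the `JobSample` class, its field names, and the `breaches_threshold` function are hypothetical, and the sketch assumes that a job's runtime is simply its wait time plus its execution time, in seconds.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    """Hypothetical record for one job; not a MaxCompute or CloudMonitor type."""
    instance_id: str
    job_type: str        # for example "SQL"
    wait_seconds: float  # time spent waiting for resources
    run_seconds: float   # time spent executing


def total_runtime(job: JobSample) -> float:
    # Assumption: the monitored runtime is wait time plus execution time.
    return job.wait_seconds + job.run_seconds


def breaches_threshold(job: JobSample, threshold_seconds: float,
                       sql_only: bool = False) -> bool:
    """Return True if the job's runtime exceeds the threshold.

    sql_only=False mimics the "Job runtime" metric (all jobs);
    sql_only=True mimics the "Job runtime_SQL type" metric (SQL jobs only).
    """
    if sql_only and job.job_type.upper() != "SQL":
        return False
    return total_runtime(job) > threshold_seconds
```

For example, with a 600-second threshold, an SQL job that waited 500 seconds and ran for 200 seconds would trigger both metrics, even though its execution time alone is well under the threshold.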

Applicability

Supported regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Ulanqab), China (Chengdu), China (Hong Kong), US (Silicon Valley), US (Virginia), Malaysia (Kuala Lumpur), Japan (Tokyo), Germany (Frankfurt), Indonesia (Jakarta), UK (London), and Singapore.
Permissions: To configure alerts as a RAM user, attach the AliyunCloudMonitorFullAccess policy to that RAM user in the RAM console, in addition to the standard CloudMonitor permissions. For more information, see Manage permissions for a RAM user.

Set up monitoring and alerting

Activate the Alibaba Cloud CloudMonitor service.
Log on to the CloudMonitor console.
Create an alert contact.
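1. In the left-side navigation pane, choose Alerts > Alert Contacts.
2. On the Alert Contacts page, click the Alert Contacts tab.
3. Click Create Alert Contact. In the Set Alert Contact panel that appears, enter the required information.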
For more information, see Create an alert contact or an alert contact group.
Create an alert rule.
1. In the left-side navigation pane, choose Alerts > Alert Rules.
2. On the Alert Rules page, click Create Alert Rule.
3. In the Create Alert Rule panel that appears, configure the alert rule. For Product, select MaxCompute_Common.
For information about other parameters for the alert rule, see Parameter description.

Handle an alert

When a job's runtime exceeds the threshold, an alert is sent to the designated alert contacts. Follow these steps to handle the alert:

Log in to the MaxCompute console and select a region in the upper-left corner.
In the left-side navigation pane, choose Observation O&M > Jobs.
Use the InstanceID from the alert to find the timed-out job.
(Optional) If the job is still in the Running state, decide whether it needs to continue. You can terminate the job if necessary. For more information, see Job O&M.
If the job was submitted from a DataWorks node (the ExtPlantFrom value for the instance is DataWorks):

Go to Operation Center in DataWorks, view the job details, and resolve the timeout based on your business needs. For more information, see Manage auto-triggered tasks.
If the job was not submitted from a DataWorks node:

On the Job O&M page, find the instance in the list, and then click Actions in the Actions column to view job details and troubleshoot the timeout. For more information, see Use Logview 2.0 to view job run information.
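The handling steps above amount to a small decision procedure, which the following sketch makes explicit. It is a hedged illustration only: the `next_steps` function and its string arguments are hypothetical, it calls no MaxCompute or DataWorks API, and it assumes you have already read the job status and the ExtPlantFrom value from the console.

```python
from typing import List


def next_steps(status: str, ext_plant_from: str) -> List[str]:
    """Return the follow-up actions for a timed-out job.

    status:         job status as shown in the console, e.g. "Running"
    ext_plant_from: the instance's ExtPlantFrom value, e.g. "DataWorks"
    Both parameters are hypothetical inputs copied by hand from the console.
    """
    steps = []
    # Optional step: a job that is still running may need to be terminated.
    if status == "Running":
        steps.append("Decide whether the job should continue; terminate it if necessary.")
    # Route the investigation based on where the job was submitted from.
    if ext_plant_from == "DataWorks":
        steps.append("View the job details in DataWorks Operation Center.")
    else:
        steps.append("View the job details from the Job O&M page.")
    return steps
```

Calling `next_steps("Running", "DataWorks")` returns both the optional termination check and the DataWorks step, while a finished job from any other source yields only the Job O&M step.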