MaxCompute allows you to monitor job runtimes by configuring threshold-based alert rules. If a job's runtime exceeds the specified threshold, the system sends an alert to the designated alert contacts. This helps you quickly identify and address abnormal jobs, improving operational efficiency. This topic describes the monitoring metrics for job timeouts, how to configure alerts, and how to handle them.
Monitoring metrics
The following metrics are available for monitoring job runtimes.
- Job runtime
  - This metric monitors all jobs within a MaxCompute project. If a job's runtime, including its wait time, exceeds the specified threshold, the system sends an alert to the designated alert contacts.
  - Use case: This metric is ideal for MaxCompute projects where analysts run data retrieval jobs that are typically short. By configuring this monitor, you can quickly check for issues such as resource contention or high computational load if a job takes too long to run.
- Job runtime_SQL type
  - This metric monitors all SQL jobs within a MaxCompute project. If an SQL job's runtime, including its wait time, exceeds the specified threshold, the system sends an alert to the designated alert contacts.
  - Use case: This metric is essential for production projects. Configuring this monitor helps you quickly address job timeouts and prevent delays in your business workflows.
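The threshold semantics of the two metrics can be illustrated with a small sketch. It is not CloudMonitor's implementation: the `JobRun` record, its field names, and the `"SQL"` type label are hypothetical, chosen only to show that wait time counts toward the runtime and that the SQL-type metric considers SQL jobs only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobRun:
    """Hypothetical record of one job; field names are illustrative only."""
    instance_id: str
    job_type: str                    # assumed label, for example "SQL"
    submitted_at: datetime           # submission time, so queue wait is included
    finished_at: Optional[datetime]  # None while the job is still running


def runtime_including_wait(job: JobRun, now: datetime) -> timedelta:
    """Time since submission: wait time plus execution time."""
    end = job.finished_at if job.finished_at is not None else now
    return end - job.submitted_at


def exceeds_threshold(job: JobRun, threshold: timedelta, now: datetime,
                      sql_only: bool = False) -> bool:
    """Return True if the job would trip the (assumed) alert condition.

    sql_only=False mirrors "Job runtime" (all jobs);
    sql_only=True mirrors "Job runtime_SQL type" (SQL jobs only).
    """
    if sql_only and job.job_type != "SQL":
        return False
    # "Exceeds" is read as strictly greater than the threshold.
    return runtime_including_wait(job, now) > threshold
```

For example, a job submitted at 11:00 that is still running at 12:00 exceeds a 30-minute threshold even if most of that hour was spent waiting for resources.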
Applicability
- Supported regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Ulanqab), China (Chengdu), China (Hong Kong), US (Silicon Valley), US (Virginia), Malaysia (Kuala Lumpur), Japan (Tokyo), Germany (Frankfurt), Indonesia (Jakarta), UK (London), and Singapore.
- Permissions: To configure alerts as a RAM user, grant the user the AliyunCloudMonitorFullAccess policy in the RAM console. This policy is required in addition to the standard CloudMonitor permissions. For more information about RAM permissions, see Manage permissions for a RAM user.
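If you prefer to grant the policy from a script rather than the RAM console, the following sketch builds the equivalent call for the Alibaba Cloud CLI. It assumes the `aliyun` CLI is installed and configured with credentials that are allowed to manage RAM, and it relies on the RAM `AttachPolicyToUser` operation; the helper names and the user name `alice` are hypothetical.

```python
import subprocess
from typing import List

# System policy named in this topic.
POLICY_NAME = "AliyunCloudMonitorFullAccess"


def build_attach_policy_command(user_name: str) -> List[str]:
    """Build the CLI arguments that attach the system policy to a RAM user."""
    if not user_name:
        raise ValueError("user_name must not be empty")
    return [
        "aliyun", "ram", "AttachPolicyToUser",
        "--PolicyType", "System",      # system policy, not a custom one
        "--PolicyName", POLICY_NAME,
        "--UserName", user_name,
    ]


def attach_policy(user_name: str) -> str:
    """Run the command; assumes the aliyun CLI is on PATH and configured."""
    result = subprocess.run(
        build_attach_policy_command(user_name),
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Usage (not executed here, because it changes account permissions):
# print(attach_policy("alice"))
```

Building the argument list separately from running it makes it easy to review the exact command before any permission is changed.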
Set up monitoring and alerting
- Activate the Alibaba Cloud CloudMonitor service.
- Log on to the CloudMonitor console.
- Create an alert contact.
  - In the navigation pane on the left, choose .
  - On the Alert Contacts page, click the Alert Contacts tab.
  - Click Create Alert Contact. In the Set Alert Contact window, enter the required information.
    For more information, see Create an alert contact or an alert contact group.
- Create an alert rule.
  - In the navigation pane on the left, choose .
  - On the Alert Rules page, click Create Alert Rule.
  - In the Create Alert Rule dialog box, for Product, select MaxCompute_Common.
    For information about other parameters for the alert rule, see Parameter description.
Handle an alert
When a job's runtime exceeds the threshold, an alert is sent to the designated alert contacts. Follow these steps to handle the alert:
- Log on to the MaxCompute console, and select a region in the upper-left corner.
- In the navigation pane on the left, choose .
- Use the InstanceID from the alert to find the timed-out job.
- (Optional) If the job is still in the Running state, decide whether it needs to continue. You can terminate the job if necessary. For more information, see Job O&M.
- If the job was submitted from a DataWorks node (the ExtPlantFrom value for the instance is DataWorks):
  Go to Operation Center in DataWorks, view the job details, and resolve the timeout based on your business needs. For more information, see Manage auto-triggered tasks.
- If the job was not submitted from a DataWorks node:
  On the Job O&M page, find the instance in the list, and then click Actions in the Actions column to view job details and troubleshoot the timeout. For more information, see Use Logview 2.0 to view job run information.
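The decision flow above can be sketched in a few lines. The `triage` function only restates the steps in this topic; its name and return keys are hypothetical. The `describe_instance` helper assumes the PyODPS package (`pip install pyodps`) and valid credentials, and it is one possible way to look up an instance by the InstanceID from the alert. The ExtPlantFrom value is taken as an input here, read from the console, because this sketch does not assume a way to fetch it programmatically.

```python
def triage(status: str, ext_plant_from: str) -> dict:
    """Map an instance's state to the handling steps described in this topic."""
    from_dataworks = ext_plant_from == "DataWorks"
    return {
        # Only a job that is still running can be terminated.
        "consider_terminating": status == "Running",
        "investigate_in": (
            "DataWorks Operation Center" if from_dataworks
            else "MaxCompute Job O&M (Logview)"
        ),
    }


def describe_instance(instance_id: str, access_id: str, secret_access_key: str,
                      project: str, endpoint: str) -> dict:
    """Look up an instance with PyODPS (assumed to be installed)."""
    from odps import ODPS  # imported lazily so triage() works without PyODPS

    o = ODPS(access_id, secret_access_key, project, endpoint=endpoint)
    instance = o.get_instance(instance_id)
    return {
        "status": instance.status.value,            # for example "Running"
        "logview": instance.get_logview_address(),  # URL for job run details
    }


# Usage (requires real credentials, so it is not executed here):
# info = describe_instance("<InstanceID from the alert>", "<access id>",
#                          "<access key secret>", "<project>", "<endpoint>")
# print(triage(info["status"], ext_plant_from="DataWorks"))
```

Keeping the routing logic separate from the lookup means the same decision can be applied whether the status comes from the console or from a script.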