Monitor your MaxCompute subscription resources, pay-as-you-go job consumption, and Tunnel uploads and downloads to understand the health of your resources. This helps you upgrade resources or plan jobs proactively. You can also set alert rules: when a resource metric meets the specified conditions, Cloud Monitor automatically sends an alert so that you can detect and handle issues promptly.
Monitoring and alert solutions
MaxCompute provides the following methods for monitoring and alerts:
- Use Cloud Monitor to configure metrics for subscription resources, real-time job consumption, Tunnel data transfer volumes, and job runtimes.
  - Use a dashboard to observe real-time charts and understand the changes in each metric.
  - Create custom alert rules and add alert contacts. When a metric reaches or exceeds a specified threshold, Cloud Monitor automatically sends an alert to the specified contacts. Notifications can be sent by phone, SMS, email, or DingTalk chatbot.
  - Log on to the MaxCompute console. The Overview shows the number of alerts for each metric in the Alert and Risk Warnings section.

  For more information about Cloud Monitor activation and billing, see Billing and Plans.
  For more information about monitoring job runtimes, see Job timeout monitoring and alerts.
- Use the MaxCompute client to monitor single SQL consumption and cumulative daily SQL consumption. For more information, see Single SQL consumption limit and Cumulative daily SQL consumption limit.
- Use Alibaba Cloud Expenses and Costs to monitor high consumption. For more information, see Consumption monitoring, alerts, and control.
Metrics
The following table describes the metric types and metrics supported by MaxCompute.
| Metric type | Metric category | Metric | Description |
| --- | --- | --- | --- |
| MaxCompute-Subscription Compute Quota | level1 | Level 1 quota CPU utilization | The CPU utilization of a level-1 quota as a percentage of the total CUs (reserved CUs + elastic reserved CUs). Unit: %. Data is collected every minute. |
| MaxCompute-Subscription Compute Quota | level1 | Level 1 quota CPU usage | The total CPU usage of a level-1 quota. Unit: core. Data is collected every minute. |
| MaxCompute-Subscription Compute Quota | level1 | Level 1 quota memory utilization | The memory utilization of a level-1 quota as a percentage of the total memory (reserved memory + elastic reserved memory). Unit: %. Data is collected every minute. |
| MaxCompute-Subscription Compute Quota | level1 | Level 1 quota memory usage | The memory usage of a level-1 quota. Unit: MB. Data is collected every minute. |
| MaxCompute-Subscription Compute Quota | level2 | Level 2 quota CPU utilization | The CPU utilization of a level-2 quota as a percentage of the total CUs (reserved Min CUs + elastic reserved CUs). Unit: %. Data is collected every minute. |
| MaxCompute-Subscription Compute Quota | level2 | Level 2 quota CPU usage | The total CPU usage of a level-2 quota. Unit: core. Data is collected every minute. |
| MaxCompute-Subscription Compute Quota | level2 | Level 2 quota memory utilization | The memory utilization of a level-2 quota as a percentage of the total memory (reserved Min memory + elastic reserved memory). Unit: %. Data is collected every minute. |
| MaxCompute-Subscription Compute Quota | level2 | Level 2 quota memory usage | The memory usage of a level-2 quota. Unit: MB. Data is collected every minute. |
| MaxCompute-Subscription Compute Quota | level2 | Level 2 quota waiting jobs | The number of waiting jobs in a level-2 quota. Unit: count. Data is collected every minute. |
| MaxCompute-General | Tunnel | Tunnel download traffic_Project level | A real-time metric that monitors download traffic at the project level. You can set a maximum download traffic (bytes/min). An alert is triggered when the traffic reaches or exceeds this threshold. |
| MaxCompute-General | Tunnel | Tunnel upload traffic_Project level | A real-time metric that monitors upload traffic at the project level. You can set a maximum upload traffic (bytes/min). An alert is triggered when the traffic reaches or exceeds this threshold. |
| MaxCompute-General | Tunnel | Tunnel cumulative daily download volume_Project level | A metric that monitors the cumulative download volume for a project in a single day. You can set a maximum data volume (MB). An alert is triggered when the volume reaches or exceeds this threshold. |
| MaxCompute-General | Tunnel | Tunnel cumulative daily upload volume_Project level | A metric that monitors the cumulative upload volume for a project in a single day. You can set a maximum data volume (MB). An alert is triggered when the volume reaches or exceeds this threshold. |
| MaxCompute-General | Tunnel | Current Tunnel concurrency (Slots)_Project level | The number of concurrent slots currently in use by the selected project. An alert is triggered when this number reaches or exceeds the threshold. |
| MaxCompute-General | Tunnel | Current Tunnel concurrency (Slots)_Tenant level | The number of concurrent slots currently in use by the selected tenant. An alert is triggered when this number reaches or exceeds the threshold. |
| MaxCompute-General | Job | Job runtime | Monitors all jobs within a MaxCompute project. If a job's runtime, including its waiting time, exceeds the configured threshold, the system sends an alert notification to the specified contacts based on the alert rule. Important: Jobs with a runtime of less than 1 minute cannot be monitored. |
| MaxCompute-General | Job | Job runtime_SQL type | Monitors all SQL jobs within a MaxCompute project. If a SQL job's runtime, including its waiting time, exceeds the configured threshold, the system sends an alert notification to the specified contacts based on the alert rule. Important: Jobs with a runtime of less than 1 minute cannot be monitored. |
| MaxCompute-General | Job | Job runtime_SQL type_Submitter | Monitors the runtime, including waiting time, of all SQL jobs in a MaxCompute project. When a SQL job's runtime exceeds the threshold, the system sends an alert notification to the specified contacts. The alert content includes the job submitter's information to help identify the job owner. Important: Jobs with a runtime of less than 1 minute cannot be monitored. |
| MaxCompute-General | Cost | StorageAPIRead daily consumption | A metric that monitors the cumulative daily consumption of Storage API reads at the project level. Unit: GiB. An alert is triggered when the consumption reaches or exceeds the threshold. Note: Each tenant receives a free monthly quota of 1 TB for data reads and writes. Monitoring begins after consumption exceeds 1 TB. |
| MaxCompute-General | Cost | StorageAPIRead monthly consumption | A metric that monitors the cumulative monthly consumption of Storage API reads at the project level. Unit: GiB. An alert is triggered when the consumption reaches or exceeds the threshold. Note: Each tenant receives a free monthly quota of 1 TB for data reads and writes. Monitoring begins after consumption exceeds 1 TB. |
| MaxCompute-General | Cost | StorageAPIWrite daily consumption | A metric that monitors the cumulative daily consumption of Storage API writes at the project level. Unit: GiB. An alert is triggered when the consumption reaches or exceeds the threshold. Note: Each tenant receives a free monthly quota of 1 TB for data reads and writes. Monitoring begins after consumption exceeds 1 TB. |
| MaxCompute-General | Cost | StorageAPIWrite monthly consumption | A metric that monitors the cumulative monthly consumption of Storage API writes at the project level. Unit: GiB. An alert is triggered when the consumption reaches or exceeds the threshold. Note: Each tenant receives a free monthly quota of 1 TB for data reads and writes. Monitoring begins after consumption exceeds 1 TB. |
| MaxCompute-General | Cost | Daily consumption of pay-as-you-go jobs (CNY) | A metric that monitors the cumulative daily cost of SQL and MapReduce jobs at the project level. You can set a maximum daily spending limit in CNY. An alert is triggered when the cost reaches or exceeds this threshold. |
| MaxCompute-General | Cost | Monthly consumption of pay-as-you-go jobs (CNY) | A metric that monitors the cumulative monthly cost of SQL and MapReduce jobs at the project level. You can set a maximum monthly spending limit in CNY. An alert is triggered when the cost reaches or exceeds this threshold. |
| MaxCompute-General | Storage | Standard storage size_Project level | The size of Standard storage for the project. Unit: GB. Data is collected every hour. |
| MaxCompute-General | Storage | Infrequent Access (IA) storage size_Project level | The size of Infrequent Access (IA) storage for the project. Unit: GB. Data is collected every hour. |
| MaxCompute-General | Storage | Infrequent Access (IA) storage access percentage in the last 30 days_Project level | The value is calculated using the following formula: |
| MaxCompute-General | Storage | Archive storage size_Project level | The size of Archive storage for the project. Unit: GB. Data is collected every hour. |
| MaxCompute-General | Storage | Archive storage access percentage in the last 180 days_Project level | The value is calculated using the following formula: |
| MaxCompute-Subscription Data Transmission Service | Not applicable | Level 1 quota concurrent slot utilization | Monitors the usage of a selected exclusive resource group. You can configure an alert rule with a concurrency percentage threshold. The system sends an alert to the specified contacts based on the alert rule. |
| MaxCompute-Subscription Data Transmission Service | Not applicable | Level 1 quota concurrent slot count | Monitors the usage of a selected exclusive resource group. You can configure an alert rule with a concurrency count threshold. The system sends an alert to the specified contacts based on the alert rule. |
You can configure a dashboard or an alert rule for any metric. For more information, see Configure a dashboard or Configure an alert rule.
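To make the relationship between the usage and utilization metrics more concrete, the following Python sketch computes utilization as a percentage of total capacity (reserved plus elastic reserved) and applies the "reaches or exceeds" condition that the table describes for several alert metrics. The function names, parameter names, and sample values are hypothetical, and the sketch assumes that one CU corresponds to one core. It illustrates the metric descriptions only and is not how Cloud Monitor itself computes or evaluates metrics.

```python
def utilization_percent(used: float, reserved: float, elastic_reserved: float = 0.0) -> float:
    """Return usage as a percentage of total capacity (reserved + elastic reserved)."""
    total = reserved + elastic_reserved
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return used / total * 100.0


def reaches_threshold(value: float, threshold: float) -> bool:
    """Alert condition from the table: the value reaches or exceeds the threshold."""
    return value >= threshold


# Hypothetical sample: 90 cores in use on a quota with
# 100 reserved CUs and 50 elastic reserved CUs (assuming 1 CU = 1 core).
cpu_util = utilization_percent(used=90, reserved=100, elastic_reserved=50)
print(f"CPU utilization: {cpu_util:.1f}%")
print("alert:", reaches_threshold(cpu_util, threshold=80.0))
```

The same helper applies to memory if you pass memory usage and the reserved and elastic reserved memory in the same unit.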
Dashboard configuration
- Log on to the Cloud Monitor console.
- In the left-side navigation pane, choose .
- On the Custom Dashboards page, click Create Dashboard. In the Create Dashboard dialog box, enter a Board Name, select a Folder, and then click OK.
- Click the name of the dashboard that you created. On the details page, click Add Chart.
- In the upper-right corner of the page, select a chart type. Options include Line, Bar, Stats, Gauge, Meter, Pie, Table, Facet, Stream, and Histogram.
- In the Metrics section, set Data Source to Cloud Service Monitoring, and then configure the metrics.
For more information about how to manage monitoring charts, see Manage monitoring charts in a custom dashboard.
Alert rule configuration
You can set an alert rule for any of the metrics.
This example shows how to configure an alert for a quota group. The goal is to receive a notification when the CU or memory utilization of a MaxCompute subscription quota group exceeds a certain threshold. Assume that the quota group to be monitored is configured with 150 CUs. One core at full utilization counts as 100%, so the maximum usage is 150 × 100% = 15,000%. You can set the alert threshold to >12,000%, which is 80% of the maximum. An alert then indicates that the quota group is approaching full capacity and that new jobs may be queued. In that case, you can upgrade the quota group or plan your jobs to reduce the load. To configure the alert rule for this scenario, follow these steps:
- Log on to the Cloud Monitor console.
- In the left-side navigation pane, click .
- On the Alert Rules page, click Create Alert Rule.
- On the Create Alert Rule page, configure the parameters based on the scenario. For more information, see Create an alert rule. For details on how to configure alert contacts, see Create an alert contact or an alert contact group.

  The key parameters for this scenario are described below:

  - Product: Select MaxCompute_Subscription from the drop-down list.
  - Resource Range: Select Instances.
  - Associated Resources: Click Add Instance. In the Add Instance panel, select the subscription quota group in your MaxCompute project's region, and then click OK. For more information about quota groups, see Manage computing resources (quotas).
  - Rule Description: Click , and configure the following parameters in the Configure Rule Description panel:
    - Alert Rule: Enter a name for the alert rule.
    - Metric Type: Select Metric.
    - Metric: Select the corresponding CPU usage metric from the drop-down list.

      Note:
      - If the associated resource is a level-1 quota group, select . If it is a level-2 quota group, select .
      - You can also monitor the number of waiting jobs. If CPU usage is high and a large number of jobs have been waiting for N consecutive collection periods, you may need to intervene manually.
- Click Confirm.
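The threshold arithmetic in this scenario, and the idea of acting only after N consecutive collection periods, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the 80% fraction, the sample readings, and the choice of N = 3 are all hypothetical, and the code does not call any Cloud Monitor API.

```python
from typing import Sequence

# From the scenario: one core at full utilization counts as 100%.
CORE_FULL_PERCENT = 100


def max_usage_percent(cu_count: int) -> int:
    """Maximum CPU usage of a quota group, in percent (150 CUs -> 15,000%)."""
    return cu_count * CORE_FULL_PERCENT


def threshold_percent(cu_count: int, fraction: float) -> float:
    """Alert threshold as a fraction of the maximum usage."""
    return max_usage_percent(cu_count) * fraction


def sustained_breach(samples: Sequence[float], threshold: float, periods: int) -> bool:
    """Return True if the most recent `periods` samples all exceed the threshold."""
    if periods <= 0 or len(samples) < periods:
        return False
    return all(sample > threshold for sample in samples[-periods:])


# Hypothetical per-minute CPU usage readings for a 150-CU quota group.
threshold = threshold_percent(150, 0.8)
readings = [11000, 12500, 13000, 12800]
print("maximum usage (%):", max_usage_percent(150))
print("threshold (%):", threshold)
print("breach for 3 consecutive periods:", sustained_breach(readings, threshold, periods=3))
```

Requiring several consecutive breaches before you intervene reduces noise from short spikes, at the cost of a slower reaction to a genuinely saturated quota group.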
Related documentation
To learn how to configure job timeout monitoring and handle timeout alerts, see Job timeout monitoring and alerts.