Monitoring and alerts


Monitor your MaxCompute subscription resources, pay-as-you-go job consumption, and Tunnel uploads and downloads to understand the health of your resources, so that you can upgrade resources or reschedule jobs before problems occur. You can also set alert rules: when a resource metric meets the specified conditions, Cloud Monitor automatically sends an alert so that you can detect and handle issues promptly.

Monitoring and alert solutions

MaxCompute provides the following methods for monitoring and alerts:

Metrics

The following table describes the metric types and metrics supported by MaxCompute.

Metric type

Metric category

Metric

Description

MaxCompute-Subscription Compute Quota

level1

Level 1 quota CPU utilization

The CPU utilization of a level-1 quota as a percentage of the total CUs (reserved CUs + elastic reserved CUs). Unit: %. Data is collected every minute.

Level 1 quota CPU usage

The total CPU usage of a level-1 quota. Unit: core. Data is collected every minute.

Level 1 quota memory utilization

The memory utilization of a level-1 quota as a percentage of the total memory (reserved memory + elastic reserved memory). Unit: %. Data is collected every minute.

Level 1 quota memory usage

The memory usage of a level-1 quota. Unit: MB. Data is collected every minute.

level2

Level 2 quota CPU utilization

The CPU utilization of a level-2 quota as a percentage of the total CUs (reserved Min CUs + elastic reserved CUs). Unit: %. Data is collected every minute.

Level 2 quota CPU usage

The total CPU usage of a level-2 quota. Unit: core. Data is collected every minute.

Level 2 quota memory utilization

The memory utilization of a level-2 quota as a percentage of the total memory (reserved Min memory + elastic reserved memory). Unit: %. Data is collected every minute.

Level 2 quota memory usage

The memory usage of a level-2 quota. Unit: MB. Data is collected every minute.

Level 2 quota waiting jobs

The number of waiting jobs in a level-2 quota. Unit: count. Data is collected every minute.

MaxCompute-General

Tunnel

Tunnel download traffic_Project level

A real-time metric that monitors download traffic at the project level. You can set a maximum download traffic threshold in bytes/min. An alert is triggered when the traffic reaches or exceeds this threshold.

Tunnel upload traffic_Project level

A real-time metric that monitors upload traffic at the project level. You can set a maximum upload traffic threshold in bytes/min. An alert is triggered when the traffic reaches or exceeds this threshold.

Tunnel cumulative daily download volume_Project level

A metric that monitors the cumulative download volume for a project in a single day. You can set a maximum data volume (MB). An alert is triggered when the volume reaches or exceeds this threshold.

Tunnel cumulative daily upload volume_Project level

A metric that monitors the cumulative upload volume for a project in a single day. You can set a maximum data volume (MB). An alert is triggered when the volume reaches or exceeds this threshold.

Current Tunnel concurrency (Slots)_Project level

The number of concurrent slots currently in use by the selected project. An alert is triggered when this number reaches or exceeds the threshold.

Current Tunnel concurrency (Slots)_Tenant level

The number of concurrent slots currently in use by the selected tenant. An alert is triggered when this number reaches or exceeds the threshold.

Job

Job runtime

Monitors all jobs within a MaxCompute project. If a job's runtime, including its waiting time, exceeds the configured threshold, the system sends an alert notification to the specified contacts based on the alert rule.

Important

Jobs with a runtime of less than 1 minute cannot be monitored.

Job runtime_SQL type

Monitors all SQL jobs within a MaxCompute project. If a SQL job's runtime, including its waiting time, exceeds the configured threshold, the system sends an alert notification to the specified contacts based on the alert rule.

Important

Jobs with a runtime of less than 1 minute cannot be monitored.

Job runtime_SQL type_Submitter

Monitors the runtime, including waiting time, of all SQL jobs in a MaxCompute project. When a SQL job's runtime exceeds the threshold, the system sends an alert notification to the specified contacts. The alert content includes the job submitter's information to help identify the job owner.

Important

Jobs with a runtime of less than 1 minute cannot be monitored.

Cost

StorageAPIRead daily consumption

A metric that monitors the cumulative daily consumption of Storage API reads at the project level. Unit: GiB. An alert is triggered when the consumption reaches or exceeds the threshold.

Note

Each tenant receives a free monthly quota of 1 TB for data reads and writes. Monitoring begins after consumption exceeds 1 TB.

StorageAPIRead monthly consumption

A metric that monitors the cumulative monthly consumption of Storage API reads at the project level. Unit: GiB. An alert is triggered when the consumption reaches or exceeds the threshold.

Note

Each tenant receives a free monthly quota of 1 TB for data reads and writes. Monitoring begins after consumption exceeds 1 TB.

StorageAPIWrite daily consumption

A metric that monitors the cumulative daily consumption of Storage API writes at the project level. Unit: GiB. An alert is triggered when the consumption reaches or exceeds the threshold.

Note

Each tenant receives a free monthly quota of 1 TB for data reads and writes. Monitoring begins after consumption exceeds 1 TB.

StorageAPIWrite monthly consumption

A metric that monitors the cumulative monthly consumption of Storage API writes at the project level. Unit: GiB. An alert is triggered when the consumption reaches or exceeds the threshold.

Note

Each tenant receives a free monthly quota of 1 TB for data reads and writes. Monitoring begins after consumption exceeds 1 TB.

Daily consumption of pay-as-you-go jobs (CNY)

A metric that monitors the cumulative daily cost of SQL and MapReduce jobs at the project level. You can set a maximum daily spending limit in CNY. An alert is triggered when the cost reaches or exceeds this threshold.

Monthly consumption of pay-as-you-go jobs (CNY)

A metric that monitors the cumulative monthly cost of SQL and MapReduce jobs at the project level. You can set a maximum monthly spending limit in CNY. An alert is triggered when the cost reaches or exceeds this threshold.

Storage

Standard storage size_Project level

The size of Standard storage for the project. Unit: GB. Data is collected every hour.

Infrequent Access (IA) storage size_Project level

The size of Infrequent Access (IA) storage for the project. Unit: GB. Data is collected every hour.

Infrequent Access (IA) storage access percentage in the last 30 days_Project level

The value is calculated using the following formula: (Cumulative data volume accessed from IA storage in the last 30 days + Cumulative data volume converted to IA storage in the last 30 days) / Current IA storage size of the project.

Archive storage size_Project level

The size of Archive storage for the project. Unit: GB. Data is collected every hour.

Archive storage access percentage in the last 180 days_Project level

The value is calculated using the following formula: (Cumulative data volume accessed from Archive storage in the last 180 days + Cumulative data volume converted to Archive storage in the last 180 days) / Current Archive storage size of the project.

MaxCompute-Subscription Data Transmission Service

Not applicable

Level 1 quota concurrent slot utilization

Monitors the usage of a selected exclusive resource group. You can configure an alert rule with a concurrency percentage threshold. The system sends an alert to the specified contacts based on the alert rule.

Level 1 quota concurrent slot count

Monitors the usage of a selected exclusive resource group. You can configure an alert rule with a concurrency count threshold. The system sends an alert to the specified contacts based on the alert rule.

You can configure a dashboard or an alert rule for any metric. For more information, see Dashboard configuration and Alert rule configuration.
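The derived metrics in the table above (quota utilization and the IA/Archive access percentage) can be reproduced by hand to sanity-check a dashboard value. The following is a minimal sketch only: the function and parameter names are hypothetical and are not part of any MaxCompute or Cloud Monitor API, and the inputs are assumed to be values you have already read from the console.

```python
def quota_utilization_percent(used: float, reserved: float, elastic_reserved: float) -> float:
    """Utilization as a percentage of total capacity (reserved + elastic reserved).

    Mirrors the table's definition for CPU (CUs) and memory; all three
    arguments must use the same unit.
    """
    total = reserved + elastic_reserved
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return used * 100.0 / total


def storage_access_ratio(accessed: float, converted: float, current_size: float) -> float:
    """(volume accessed + volume converted in the window) / current storage size.

    The window is 30 days for IA storage and 180 days for Archive storage.
    Returns a plain ratio; multiplying by 100 to display it as a percentage
    is an assumption about how the console presents the value.
    """
    if current_size <= 0:
        raise ValueError("current storage size must be positive")
    return (accessed + converted) / current_size


if __name__ == "__main__":
    # Hypothetical numbers: 120 CUs in use, 100 reserved + 50 elastic reserved.
    print(quota_utilization_percent(120, 100, 50))
    # Hypothetical numbers: 30 GB accessed, 20 GB converted, 200 GB stored.
    print(storage_access_ratio(30, 20, 200))
```

Keeping the ratio and the percentage scaling separate makes it easy to compare the result against whichever form the chart displays.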

Dashboard configuration

  1. Log on to the Cloud Monitor console.

  2. In the left-side navigation pane, choose Dashboard > Custom Dashboard.

  3. On the Custom Dashboards page, click Create Dashboard. In the Create Dashboard dialog box, enter a Board Name, select a Folder, and then click OK.

  4. Click the name of the dashboard that you created. On the details page, click Add Chart.

  5. In the upper-right corner of the page, select a chart type. Options include Line, Bar, Stats, Gauge, Meter, Pie, Table, Facet, Stream, and Histogram.

  6. In the Metrics section, set Data Source to Cloud Service Monitoring, and then configure the metrics.

For more information about how to manage monitoring charts, see Manage monitoring charts in a custom dashboard.

Alert rule configuration

You can set an alert rule for any of the metrics.

This example shows how to configure an alert for a quota group. The goal is to receive a notification when the CU or memory utilization of a MaxCompute subscription quota group exceeds a certain threshold; the steps below use the CPU usage metric. Assume that the quota group to be monitored is configured with 150 CUs. One core at full utilization is reported as 100%, so the maximum usage is 150 × 100% = 15,000%. You can set the alert threshold to >12,000%, which is 80% of that maximum. An alert then indicates that the quota group is approaching full capacity and that new jobs may be queued. You can upgrade the quota group or reschedule your jobs to reduce the load. To configure the alert rule for this scenario, follow these steps:

  1. Log on to the Cloud Monitor console.

  2. In the left-side navigation pane, choose Alerts > Alert Rules.

  3. On the Alert Rules page, click Create Alert Rule.

  4. On the Create Alert Rule page, configure the parameters based on the scenario. For more information, see Create an alert rule. For details on how to configure alert contacts, see Create an alert contact or an alert contact group.

    The key parameters for this scenario are described in the following table:

    Parameter

    Description

    Product

    Select MaxCompute_Subscription from the drop-down list.

    Resource Range

    Select Instances.

    Associated Resources

    Click Add Instance. In the Add Instance panel, select the subscription quota group in your MaxCompute project's region, and then click OK. For more information about quota groups, see Manage computing resources (quotas).

    Rule Description

    Click Add Rule > Simple Metric, and configure the following parameters in the Configure Rule Description panel:

    • Alert Rule: Enter a name for the alert rule.

    • Metric Type: Select Metric.

    • Metric: Select the corresponding CPU usage metric from the drop-down list.

      Note
      • If the associated resource is a level-1 quota group, select level1 > Level 1 quota CPU usage. If it is a level-2 quota group, select level2 > Level 2 quota CPU usage.

      • You can also monitor the number of waiting jobs. If CPU usage is high and a large number of jobs have been waiting for N consecutive collection periods, you may need to intervene manually.

  5. Click Confirm.
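The threshold arithmetic in this scenario, and the "N consecutive collection periods" check mentioned in the note, can be sketched as follows. This is an illustration under stated assumptions rather than how Cloud Monitor evaluates rules: the helper names are hypothetical, the 80% figure is simply the ratio of the 12,000% threshold to the 15,000% maximum in the example, and the sample values are made up.

```python
from typing import Sequence

FULL_CORE_PERCENT = 100  # per the scenario: one fully utilized core reports as 100%


def max_cpu_usage_percent(cu_count: int) -> int:
    """Maximum CPU usage value for a quota group with `cu_count` CUs."""
    return cu_count * FULL_CORE_PERCENT


def alert_threshold_percent(cu_count: int, percent_of_capacity: float) -> float:
    """Alert threshold placed at `percent_of_capacity` percent of the maximum usage."""
    return max_cpu_usage_percent(cu_count) * percent_of_capacity / 100


def breached_consecutively(samples: Sequence[float], threshold: float, periods: int) -> bool:
    """True if the most recent `periods` samples all exceed `threshold`."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    if len(samples) < periods:
        return False
    return all(value > threshold for value in samples[-periods:])


if __name__ == "__main__":
    threshold = alert_threshold_percent(150, 80)
    print(max_cpu_usage_percent(150), threshold)
    # Hypothetical per-minute CPU usage samples for the quota group.
    recent = [11800, 12100, 12500, 13000]
    print(breached_consecutively(recent, threshold, periods=3))
```

Requiring several consecutive breaches before acting is one way to avoid reacting to a single short spike; the right value of N depends on how long your jobs typically run.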

Related documentation

To learn how to configure job timeout monitoring and handle timeout alerts, see Job timeout monitoring and alerts.