ApsaraDB for OceanBase provides an alerting feature that supports alerts for various dimensions, such as OceanBase clusters, data assessment, data transmission, and data development. You can use the built-in alert metrics to meet basic alerting requirements. This topic describes the details of each alert.
Alert information
Each alert page contains the following information:
Name | Description |
|---|---|
Alert description | Describes the meaning of each alert and the scenarios that trigger it. |
Rule information | Describes the rules that trigger each alert, including Monitoring Metric, Metric Meaning, Recommended Threshold, Duration, and Detection Period. Trigger rule: The system checks the monitoring metric once every detection period. An alert is reported if the metric value exceeds the threshold and stays above it for the specified duration. |
Impact on the system | Describes the potential impact on the system when an alert occurs. |
Possible causes | Describes the causes of the alert to help you locate the problem and handle the alert. |
Solution | Describes how to handle each alert. Follow the specific method provided for that alert. |
For more information about adding alert rules, see Add an alert rule.
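The trigger rule described under Rule information can be sketched in a few lines of Python. This is a minimal illustration, not the service's implementation: the `AlertRule` class, the `should_alert` function, and the rounding of the duration up to a whole number of detection periods are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical container for the fields of one alert rule."""
    threshold: float          # value the metric must exceed
    duration_s: int           # how long the breach must persist, in seconds
    detection_period_s: int   # interval between two checks, in seconds


def should_alert(rule: AlertRule, samples: Iterable[float]) -> bool:
    """Return True if the metric stayed above the threshold for the full duration.

    `samples` holds one metric value per detection period, oldest first.
    """
    # Assumption: the duration is covered by a whole number of consecutive
    # checks, rounded up, and at least one check is always required.
    needed = max(1, -(-rule.duration_s // rule.detection_period_s))
    consecutive = 0
    for value in samples:
        # A single sample at or below the threshold resets the streak.
        consecutive = consecutive + 1 if value > rule.threshold else 0
        if consecutive >= needed:
            return True
    return False
```

For example, with a threshold of 90, a duration of 180 seconds, and an assumed detection period of 60 seconds, the samples `[85, 91, 92, 95]` trigger an alert, while `[91, 92, 89, 93]` do not because the breach is interrupted.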
Concepts
Alert object
An alert object is the entity monitored by an alert task. It uniquely identifies the object of an alert. An alert object can be an OceanBase cluster, a machine, or a service.
An alert object is displayed as the alert rule name followed by the faulty instance in parentheses, for example disk_log_usage_instance (Instance: integration_22-ob2).
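If you process alert objects in scripts, the format above can be split into its two parts. The following sketch assumes the string always has the shape `rule_name (Instance: instance_name)`, as in the example. The `AlertObject` type and both helper functions are hypothetical names and are not part of any ApsaraDB for OceanBase API.

```python
import re
from typing import NamedTuple


class AlertObject(NamedTuple):
    """Hypothetical representation of a parsed alert object."""
    rule_name: str
    instance: str


# Assumed shape: "<rule name> (Instance: <instance name>)"
_ALERT_OBJECT_RE = re.compile(r"^(?P<rule>\S+)\s*\(Instance:\s*(?P<instance>[^)]+)\)$")


def parse_alert_object(text: str) -> AlertObject:
    """Split an alert object string into the rule name and the faulty instance."""
    match = _ALERT_OBJECT_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized alert object: {text!r}")
    return AlertObject(match.group("rule"), match.group("instance").strip())


def format_alert_object(obj: AlertObject) -> str:
    """Build the display string from the rule name and the instance."""
    return f"{obj.rule_name} (Instance: {obj.instance})"
```

Parsing `disk_log_usage_instance (Instance: integration_22-ob2)` yields the rule name `disk_log_usage_instance` and the instance `integration_22-ob2`.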
Alert scope
The alert scope defines the range of an alert and corresponds to the metric scope.
The alert scope includes OceanBase Cluster (OBCluster), data assessment, data transmission, and data development.
Rule description
ApsaraDB for OceanBase lets you configure alert rules on tenant-level and node-level monitoring data. The resource scope and monitoring metrics for each rule are listed below. You can configure them in Monitoring and Alerts as required. We recommend that you follow our best practices.
The following tenant monitoring metrics can be used to configure alerts:
Metric | Metric Name | Corresponding Alert Metric |
|---|---|---|
Memory usage | memory_usage | Tenant / Tenant Memory Usage |
CPU usage | cpu_usage_percent | Tenant / CPU Usage |
Disk usage | disk_ob_data_size | Cluster / Maximum Disk Usage. Note: Because storage usage is not isolated between tenants, you can configure disk usage alerts only at the cluster level. |
Total connections | total_sessions | Configuring alert policies is not supported. |
Read/write connections | readwrite_sessions | Configuring alert policies is not supported. |
Read-only connections | readonly_sessions | Configuring alert policies is not supported. |
Write requests | tps | Tenant / Write Requests |
Read requests | qps | Tenant / Read Requests |
Write request response time | tps_rt | Tenant / Write Request Response Time |
Read request response time | qps_rt | Tenant / Read Request Response Time |
Wait queue | request_queue_rt | Tenant / Wait Queue |
Transaction commits | trans_user_trans_count | Tenant / Transaction Commits |
Transaction response time | trans_commit_rt | Tenant / Transaction Commit Response Time |
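When automating alert configuration, it can help to keep the tenant table above as a lookup. The sketch below copies the table into a dictionary and uses `None` for metrics that do not support alert policies. The names `TENANT_ALERT_METRICS`, `alert_metric_for`, and `supports_alert` are hypothetical, and the case-insensitive lookup is an assumption made for convenience.

```python
from typing import Dict, Optional

# Metric name -> corresponding alert metric, taken from the tenant table.
# None means that configuring alert policies is not supported.
TENANT_ALERT_METRICS: Dict[str, Optional[str]] = {
    "memory_usage": "Tenant / Tenant Memory Usage",
    "cpu_usage_percent": "Tenant / CPU Usage",
    # Disk usage can be alerted on only at the cluster level.
    "disk_ob_data_size": "Cluster / Maximum Disk Usage",
    "total_sessions": None,
    "readwrite_sessions": None,
    "readonly_sessions": None,
    "tps": "Tenant / Write Requests",
    "qps": "Tenant / Read Requests",
    "tps_rt": "Tenant / Write Request Response Time",
    "qps_rt": "Tenant / Read Request Response Time",
    "request_queue_rt": "Tenant / Wait Queue",
    "trans_user_trans_count": "Tenant / Transaction Commits",
    "trans_commit_rt": "Tenant / Transaction Commit Response Time",
}


def alert_metric_for(metric_name: str) -> Optional[str]:
    """Return the alert metric for a tenant metric, or None if alerts are unsupported.

    Raises KeyError for a metric name that is not in the table.
    """
    # Assumption: metric names are matched case-insensitively.
    return TENANT_ALERT_METRICS[metric_name.lower()]


def supports_alert(metric_name: str) -> bool:
    """Return True if an alert policy can be configured for the metric."""
    return alert_metric_for(metric_name) is not None
```

For instance, `alert_metric_for("tps")` returns `Tenant / Write Requests`, while `supports_alert("total_sessions")` returns `False`.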
The following node monitoring metrics can be used to configure alerts:
Metric | Metric Name | Corresponding Alert Metric |
|---|---|---|
CPU usage | cpu_util | Node / CPU Usage |
Load | load_load1 | Node / Load |
Machine memory usage | machine_mem_used_percent | Node / Machine Memory Usage |
Disk read | io_read_bytes | Node / Disk Read |
Disk write | io_write_bytes | Node / Disk Write |
Disk I/O wait | io_await | Node / Disk I/O Wait |
Inbound packet rate | traffic_bytin | Node / Inbound Packet Rate |
Outbound packet rate | traffic_bytout | Node / Outbound Packet Rate |
Retransmission rate | tcp_retran | Node / Retransmission Rate |
Total connections | total_sessions | Configuring alert policies is not supported. |
Read/write connections | readwrite_sessions | Configuring alert policies is not supported. |
Read-only connections | readonly_sessions | Configuring alert policies is not supported. |
Alert levels
Each alert metric has a corresponding alert level.
Level | English Meaning | Chinese Meaning | Notification Method | Description |
|---|---|---|---|---|
1 | Critical | Critical | Phone call + Text message + Email + DingTalk Robot | System availability has decreased and requires immediate repair to prevent a complete outage. Alternatively, the system is still available but is about to become unavailable. Take action to prevent further loss of availability. For example, the machine memory usage is greater than 90% for 3 minutes. |
2 | Warning | Warning | Text message + Email + DingTalk Robot | Key system performance metrics are declining but have not yet reached the critical threshold. Investigate to find potential problems and prevent a critical alert. (This is a reserved type. No alert metrics currently match this level.) |
3 | Info | Standard | Email + DingTalk Robot | This is an operational reminder, not a true alert. It is typically triggered when an administrator performs an important operation, such as taking a cluster offline. When an alert at this level is resolved, no alert recovery notification is sent. |
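The relationship between alert levels and notification methods in the table above can be expressed as a small lookup, for example when routing alerts in your own tooling. This is a sketch only. `AlertLevel`, `NOTIFICATION_CHANNELS`, `channels_for`, and `sends_recovery_notification` are hypothetical names. The table states only that level 3 sends no recovery notification, so treating levels 1 and 2 as sending one is an assumption.

```python
from enum import IntEnum
from typing import Dict, Tuple


class AlertLevel(IntEnum):
    """Alert levels from the table; a lower number means higher severity."""
    CRITICAL = 1
    WARNING = 2   # reserved: no alert metrics currently match this level
    INFO = 3


# Notification methods per level, copied from the table.
NOTIFICATION_CHANNELS: Dict[AlertLevel, Tuple[str, ...]] = {
    AlertLevel.CRITICAL: ("Phone call", "Text message", "Email", "DingTalk Robot"),
    AlertLevel.WARNING: ("Text message", "Email", "DingTalk Robot"),
    AlertLevel.INFO: ("Email", "DingTalk Robot"),
}


def channels_for(level: int) -> Tuple[str, ...]:
    """Return the notification methods for a numeric alert level (1, 2, or 3)."""
    return NOTIFICATION_CHANNELS[AlertLevel(level)]


def sends_recovery_notification(level: int) -> bool:
    """Return False for Info alerts, which send no recovery notification.

    Assumption: the other levels do send one; the table does not say so explicitly.
    """
    return AlertLevel(level) is not AlertLevel.INFO
```

Each lower-severity level uses a subset of the methods of the level above it, so `channels_for(3)` returns only `Email` and `DingTalk Robot`.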