How to configure CloudMonitor alert rules for a Hologres instance

Hologres integrates with CloudMonitor to provide visibility into the resource utilization, operational status, and health of your instances. You can monitor key metrics and configure alert rules to detect anomalies and respond quickly.

Prerequisites

You have purchased a Hologres instance.

Recommendations

CloudMonitor groups Hologres metrics by instance type: Hologres follower instance, Hologres acceleration instance, Hologres standard instance, and Hologres warehouse instance. Each type has its own set of metrics for monitoring and troubleshooting, so we recommend that you monitor each instance type separately. On the Hologres page in the CloudMonitor console, select a region at the top of the page and switch between instance types on the tab bar. To manage alerts, click Create Alert Rule or View Alert Rules.

CloudMonitor metrics

For more information about the Hologres instance metrics that are supported by CloudMonitor, see Metrics in the Hologres console.

View metrics

To view metrics in the CloudMonitor console:

Log on to the CloudMonitor console.
In the left-side navigation pane, click Cloud Service Monitoring.
In the Big Data section, click the target instance type, such as Hologres follower instance, Hologres acceleration instance, Hologres standard instance, or Hologres warehouse instance, to open the Hologres monitoring dashboard.
Click the icon next to the region and select your region.
Click an Instance ID or click Monitoring Chart in the Actions column to view the instance's metrics.

Note
You can specify a time range to view instance metrics. Monitoring data is retained for up to 30 days.
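If you build time ranges programmatically, the 30-day retention limit in the note above can be applied before you query. The following is a minimal sketch under that assumption; the function name `clamp_to_retention` is hypothetical and is not part of any CloudMonitor SDK.

```python
from datetime import datetime, timedelta, timezone

# Retention period stated in the note above.
RETENTION = timedelta(days=30)


def clamp_to_retention(start, end, now):
    """Trim (start, end) to the window in which monitoring data is still retained."""
    earliest = now - RETENTION
    end = min(end, now)            # data points in the future do not exist
    start = max(start, earliest)   # older data points are no longer retained
    if start >= end:
        raise ValueError("requested range lies entirely outside the retention window")
    return start, end


# Example: a 45-day request is trimmed to the most recent 30 days.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
start, end = clamp_to_retention(now - timedelta(days=45), now, now)
print(start.isoformat(), end.isoformat())
```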

Monitoring and alerting practices

Initiative alert

You can enable initiative alert in CloudMonitor to apply default alert rules to all your instances. After you enable this feature, CloudMonitor creates alert rules for metrics such as CPU usage, storage usage, memory usage, and connection usage. These rules apply to all Hologres instances under your Alibaba Cloud account (primary account). The default alert rules are:

If the average connection usage (Info) is greater than or equal to 95% for three consecutive cycles, CloudMonitor sends a notification to the alert contact group.
If the average storage usage (Warn) is greater than 90% for three consecutive cycles, CloudMonitor sends a notification to the alert contact group.
If the average memory usage (Warn) is greater than or equal to 90% for three consecutive cycles, CloudMonitor sends a notification to the alert contact group.
If the average CPU usage (Info) is greater than or equal to 99% for three consecutive cycles, CloudMonitor sends a notification to the alert contact group.

Note

By default, the alert cycle is 5 minutes. You can customize this setting.
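To make the "three consecutive cycles" condition concrete, the default rules above can be modeled as data and checked against per-cycle averages. This is only an illustrative sketch: the metric keys and the `should_notify` function are hypothetical, and CloudMonitor's actual evaluation logic may differ. Note that the storage rule uses a strict comparison (greater than), while the other three rules are inclusive (greater than or equal to).

```python
# Metric keys are hypothetical labels. Levels, thresholds, and comparison
# operators are taken from the default rules listed above.
DEFAULT_RULES = {
    "connection_usage": {"level": "Info", "threshold": 95.0, "inclusive": True},
    "storage_usage": {"level": "Warn", "threshold": 90.0, "inclusive": False},
    "memory_usage": {"level": "Warn", "threshold": 90.0, "inclusive": True},
    "cpu_usage": {"level": "Info", "threshold": 99.0, "inclusive": True},
}
CONSECUTIVE_CYCLES = 3


def should_notify(metric, cycle_averages):
    """Return True if the most recent three cycle averages all breach the rule."""
    rule = DEFAULT_RULES[metric]
    recent = cycle_averages[-CONSECUTIVE_CYCLES:]
    if len(recent) < CONSECUTIVE_CYCLES:
        return False  # not enough cycles observed yet
    if rule["inclusive"]:
        return all(value >= rule["threshold"] for value in recent)
    return all(value > rule["threshold"] for value in recent)


# The same samples trigger the memory rule (>= 90) but not the storage rule (> 90).
samples = [91.0, 90.0, 92.0]
print(should_notify("memory_usage", samples), should_notify("storage_usage", samples))
```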

Create an alert rule

In addition to the default initiative alert rules, you can create custom alert rules for other metrics. To create an alert rule:

Log on to the CloudMonitor console.
In the left-side navigation pane, choose Alerts > Alert Rule.
On the Alert Rule page, click Create Alert Rules and configure alert information as prompted. For more information, see Create an alert rule.

Alert setting best practices

Instance CPU usage (%)

Indicates whether your Hologres resources are bottlenecked or fully utilized. Recommended alert rules:

- Critical: Trigger an alert if the instance CPU usage is 99% or higher for 60 consecutive 1-minute periods. If usage stays consistently high, scale out.
- Warn: Trigger an alert if the instance CPU usage is 99% or higher for 10 consecutive 1-minute periods. Check whether high CPU usage is caused by business changes.
Do not trigger an alert on a single occurrence of 100% CPU usage. A brief spike does not necessarily indicate overload; it can simply mean that resources are being used efficiently.

Do not set the CPU alert threshold too low. System components may consume resources even when no tasks are running.
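The two-tier recommendation above (Warn after 10 consecutive 1-minute periods, Critical after 60) can be sketched as a streak check over per-minute samples. The function names are hypothetical and the code only illustrates the rule; it does not reproduce how CloudMonitor evaluates alerts.

```python
def trailing_streak(samples, threshold=99.0):
    """Count how many of the most recent samples are at or above the threshold."""
    streak = 0
    for value in reversed(samples):
        if value < threshold:
            break
        streak += 1
    return streak


def severity(samples, warn_periods=10, critical_periods=60, threshold=99.0):
    """Map the current streak of high-usage minutes to an alert level, or None."""
    streak = trailing_streak(samples, threshold)
    if streak >= critical_periods:
        return "Critical"
    if streak >= warn_periods:
        return "Warn"
    return None


# A single 100% sample does not alert; ten consecutive high minutes raise Warn.
print(severity([40.0, 100.0]))
print(severity([40.0] + [99.5] * 10))
```

Because the level depends on the length of the streak rather than on any single sample, a short spike to 100% never alerts on its own.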

Worker CPU usage (%)

Indicates whether a resource bottleneck exists on any worker node and how fully the resources are used. Recommended alert rules:

- Critical: Trigger an alert if the CPU usage of a worker node is 99% or higher for 60 consecutive 1-minute periods. If usage stays consistently high, scale out.
- Warn: Trigger an alert if the CPU usage of a worker node is 99% or higher for 10 consecutive 1-minute periods. Check whether high CPU usage is caused by business changes.
Do not trigger an alert on a single occurrence of 100% CPU usage on a worker node. A brief spike does not necessarily indicate overload; it can simply mean that resources are being used efficiently.
Do not set the CPU alert threshold too low. System components may consume resources even when no tasks are running.

Instance memory usage (%)

Reflects overall memory usage of the instance. Recommended alert rules:

- Critical: Trigger an alert if the instance memory usage is 99% or higher for 60 consecutive 1-minute periods. If usage stays consistently high, scale out.
- Warn: Trigger an alert if the instance memory usage is 99% or higher for 10 consecutive 1-minute periods. Check whether high memory usage is caused by business changes.
Do not set the memory alert threshold too low. Memory is used for queries, metadata, and caches, so some memory is consumed even when the instance is idle.

Worker memory usage (%)

Reflects memory usage of individual worker nodes. Recommended alert rules:

- Critical: Trigger an alert if the memory usage of a worker node is 99% or higher for 60 consecutive 1-minute periods. If usage stays consistently high, scale out.
- Warn: Trigger an alert if the memory usage of a worker node is 99% or higher for 10 consecutive 1-minute periods. Check whether high memory usage is caused by business changes.
Do not set the memory alert threshold too low. Memory is used for queries, metadata, and caches, so some memory is consumed even when the instance is idle.

Max FE connection usage (%)

Warn: Trigger an alert if the maximum FE connection usage is 95% or higher for 5 consecutive 1-minute periods. This helps you monitor connection usage and clear idle connections promptly.

Max FE binlog WAL sender usage (%)

Reflects the maximum WAL sender usage across all FEs. Recommended alert rule:

Warn: Trigger an alert if the maximum FE WAL sender usage is 95% or higher for 5 consecutive 1-minute periods.

Longest active query time (milliseconds)

Monitors for long-running queries on your instance. Recommended alert rule:

Warn: Trigger an alert if the longest active query time is 3,600,000 milliseconds (1 hour) or longer for 10 consecutive 1-minute periods.

Longest active serverless computing query time (milliseconds)

Monitors serverless computing tasks. If a task runs too long, you can terminate it. Recommended alert rule:

Warn: Trigger an alert if the longest active query time in serverless computing is 3,600,000 milliseconds (1 hour) or longer for 10 consecutive 1-minute periods.

Failed query QPS (count/s)

Shows the number of failed queries per second on the instance. Recommended alert rule:

Warn: Trigger an alert if the Failed Query QPS is 10 count/s or higher for 10 consecutive 1-minute periods. If a large number of queries fail, check the failure details in slow query logs and optimize accordingly.

FE replay time (milliseconds)

Indicates the replay time for each FE. Excessive replay time suggests a slow or stuck FE, which can cause queries to stall. Recommended alert rule:

Warn: Trigger an alert if the FE replay time is 300,000 milliseconds (5 minutes) or longer for 10 consecutive 1-minute periods. Check the Active Queries page in the HoloWeb console for long-running queries and try to terminate them.
Do not set the FE replay time threshold too low. FE replay occurs whenever instance metadata is modified, and a replay time of a few seconds is normal.

Instance sync lag (milliseconds)

Displayed only for follower instances. Shows the data synchronization latency between primary and follower instances. Recommended alert rule:

Warn: Trigger an alert if the instance sync lag is 600,000 milliseconds (10 minutes) or longer for 10 consecutive 1-minute periods.
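The time-based thresholds in the preceding sections are expressed in milliseconds, which are easy to mistype. As a small sketch, the values can be derived from readable durations instead; the dictionary keys are hypothetical labels, and the durations are the ones recommended above.

```python
from datetime import timedelta


def to_ms(duration):
    """Convert a timedelta to whole milliseconds."""
    return int(duration / timedelta(milliseconds=1))


# Keys are hypothetical labels; durations come from the recommendations above.
THRESHOLDS_MS = {
    "longest_active_query": to_ms(timedelta(hours=1)),
    "longest_active_serverless_query": to_ms(timedelta(hours=1)),
    "fe_replay_time": to_ms(timedelta(minutes=5)),
    "instance_sync_lag": to_ms(timedelta(minutes=10)),
}

for name, value in THRESHOLDS_MS.items():
    print(f"{name}: {value:,} ms")
```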

Tables with missing stats by DB (count)

Reflects how well the auto-analyze feature is keeping table statistics up to date. If a table has missing statistics for an extended period, manually run the ANALYZE command on the table. For more information, see ANALYZE and AUTO ANALYZE. Recommended alert rule:

Warn: Trigger an alert if the number of tables that are missing statistics in each database is 10 or higher for 60 consecutive 1-minute periods.
Do not set the alert threshold too low. If an instance has many tables, the auto-analyze process may be slow.
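When this alert fires for several tables, you may want to generate the manual ANALYZE statements in bulk. The following sketch assumes that you already have a list of (schema, table) pairs for the affected tables; how you obtain that list is not covered here, and the helper names are hypothetical. Identifiers are double-quoted, so names must match the stored case exactly.

```python
def quote_ident(name):
    """Double-quote a PostgreSQL-style identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def analyze_statements(tables):
    """Build one ANALYZE statement per (schema, table) pair."""
    return [
        f"ANALYZE {quote_ident(schema)}.{quote_ident(table)};"
        for schema, table in tables
    ]


# Hypothetical tables that have been missing statistics for a long time.
for statement in analyze_statements([("public", "orders"), ("public", "order_items")]):
    print(statement)
```

Review the generated statements before you run them, because analyzing large tables consumes instance resources.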

Troubleshoot common monitoring issues

If a metric fluctuates unexpectedly or an alert is triggered, see FAQ about monitoring metrics for troubleshooting guidance.

Access metrics by using APIs

CloudMonitor also provides custom dashboards and APIs for more flexible access to metrics.

To access CloudMonitor by using APIs, see Cloud Service Monitoring.
To use a custom monitoring dashboard, see Manage custom dashboards.
To access Hologres monitoring by using Application Real-Time Monitoring Service (ARMS), see Integration guide.
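As an illustration of working with metric data retrieved through the API, the sketch below averages data points per instance. It assumes that the response carries a JSON-encoded `Datapoints` string whose entries include `timestamp`, `instanceId`, and `Average` fields, as in CloudMonitor's metric query operations; verify the exact response shape in the API reference. The sample payload and instance IDs are made up, and no API call is made.

```python
import json
from collections import defaultdict

# Made-up sample that mimics the assumed response shape; not real data.
SAMPLE_RESPONSE = {
    "Datapoints": json.dumps([
        {"timestamp": 1700000000000, "instanceId": "hgpost-example-a", "Average": 40.0},
        {"timestamp": 1700000060000, "instanceId": "hgpost-example-a", "Average": 60.0},
        {"timestamp": 1700000000000, "instanceId": "hgpost-example-b", "Average": 99.0},
    ])
}


def average_by_instance(response):
    """Decode the Datapoints string and average the values for each instance."""
    points = json.loads(response["Datapoints"])
    grouped = defaultdict(list)
    for point in points:
        grouped[point["instanceId"]].append(point["Average"])
    return {instance: sum(values) / len(values) for instance, values in grouped.items()}


print(average_by_instance(SAMPLE_RESPONSE))
```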

Grant permissions to a RAM user

By default, RAM users cannot view metrics in CloudMonitor. You must grant the required permissions.

Log on to the Resource Access Management (RAM) console with your Alibaba Cloud account (primary account) and grant one of the following policies to the RAM user. For more information, see Grant permissions to a RAM user.

Note

You can also select policies based on your business requirements.

Policy name	Description
AliyunCloudMonitorFullAccess	Permissions to fully manage CloudMonitor.
AliyunCloudMonitorReadOnlyAccess	Read-only permissions on CloudMonitor.
AliyunCloudMonitorMetricDataReadOnlyAccess	Permissions to read time series metric data in CloudMonitor.