Hologres integrates with the Cloud Service Monitoring feature of CloudMonitor, offering complete visibility into the resource utilization, operational status, and health of your instances. This integration allows you to receive timely alerts for anomalies and respond quickly to ensure your applications run smoothly. This topic describes how to use CloudMonitor to monitor the metrics of Hologres instances and configure alert rules.
Prerequisites
You have purchased a Hologres instance.
Recommendations
CloudMonitor displays metrics by Hologres instance type: Hologres follower instance, Hologres acceleration instance, Hologres standard instance, and Hologres warehouse instance. Each instance type has dedicated metrics to help you monitor and handle business exceptions, so we recommend that you switch to monitoring by instance type. On the Hologres page in the CloudMonitor console, use the region selector at the top to choose a region, and use the tab bar to switch between the monitoring views for these instance types. To manage alerts, click Create Alert Rule or View Alert Rules.
CloudMonitor metrics
For more information about the Hologres instance metrics that are supported by CloudMonitor, see Metrics in the Hologres console.
View metrics
You can view metrics in the CloudMonitor console.
-
Log on to the CloudMonitor console.
-
In the left-side navigation pane, click Cloud Service Monitoring.
-
In the Big Data section, click the target instance type, such as Hologres follower instance, Hologres acceleration instance, Hologres standard instance, or Hologres warehouse instance, to open the Hologres monitoring dashboard.
-
Click the icon next to the region and select your region.
-
Click an Instance ID or click Monitoring Chart in the Actions column to view the instance's metrics.
Note: You can specify a time range to view instance metrics. Monitoring data is retained for up to 30 days.
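Because monitoring data is retained for up to 30 days, a requested time range that reaches further back cannot return older data. The following is a minimal sketch of how a script could clamp a time range to that retention window before querying. The function name `clamp_query_window` and the sample dates are hypothetical and are not part of CloudMonitor.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Retention period stated in this topic.
RETENTION = timedelta(days=30)


def clamp_query_window(start: datetime, end: datetime,
                       now: datetime) -> Tuple[datetime, datetime]:
    """Clamp [start, end] so that it lies within the last 30 days before `now`."""
    earliest = now - RETENTION
    clamped_start = max(start, earliest)
    clamped_end = min(end, now)
    if clamped_start >= clamped_end:
        raise ValueError("The requested time range is outside the retention window.")
    return clamped_start, clamped_end


# Hypothetical example: a 45-day range is shortened to the most recent 30 days.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
start, end = clamp_query_window(now - timedelta(days=45), now, now)
print(start.isoformat(), end.isoformat())
```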
Monitoring and alerting practices
Initiative alert
You can enable the initiative alert feature in CloudMonitor to set default alert rules for all your instances. After this feature is enabled, alert rules are created for metrics such as CPU usage, disk usage, memory usage, and the number of connections. These rules apply to all Hologres instances under your Alibaba Cloud account (primary account) and help you quickly identify issues based on important metrics. The default alert rules are described as follows:
-
If the average connection usage (Info) is greater than or equal to 95% for three consecutive cycles, CloudMonitor sends a notification to the alert contact group.
-
If the average storage usage (Warn) is greater than 90% for three consecutive cycles, CloudMonitor sends a notification to the alert contact group.
-
If the average memory usage (Warn) is greater than or equal to 90% for three consecutive cycles, CloudMonitor sends a notification to the alert contact group.
-
If the average CPU usage (Info) is greater than or equal to 99% for three consecutive cycles, CloudMonitor sends a notification to the alert contact group.
By default, the alert cycle is 5 minutes. You can customize this setting.
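To make the "three consecutive cycles" condition concrete, the following sketch shows one way such a rule could be evaluated against a series of per-cycle averages. It illustrates the logic only and is not how CloudMonitor is implemented. The function name `breaches_consecutively` and the sample values are hypothetical.

```python
from typing import Sequence


def breaches_consecutively(samples: Sequence[float], threshold: float,
                           cycles: int = 3, inclusive: bool = True) -> bool:
    """Return True if each of the most recent `cycles` samples breaches `threshold`.

    `inclusive=True` models "greater than or equal to";
    `inclusive=False` models "greater than".
    """
    if cycles <= 0 or len(samples) < cycles:
        return False
    recent = samples[-cycles:]
    if inclusive:
        return all(value >= threshold for value in recent)
    return all(value > threshold for value in recent)


# Hypothetical average connection usage (%) per 5-minute cycle, oldest first.
connection_usage = [80.0, 95.0, 96.5, 97.0]
print(breaches_consecutively(connection_usage, 95.0))  # True

# Hypothetical average storage usage (%); the rule uses "greater than 90%".
storage_usage = [91.0, 90.0, 92.0]
print(breaches_consecutively(storage_usage, 90.0, inclusive=False))  # False
```

In the second example, the middle cycle is exactly 90%, which does not satisfy "greater than 90%", so the streak is broken and no notification would be sent.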
Create an alert rule
In addition to the default initiative alert rules, you can create alert rules for more metrics based on your business requirements. Follow these steps to create an alert rule:
-
Log on to the CloudMonitor console.
-
In the left-side navigation pane, choose .
-
On the Alert Rule page, click Create Alert Rules and configure alert information as prompted. For more information, see Create an alert rule.
Alert setting best practices
Instance CPU usage (%)
This metric helps determine if your Hologres resources are bottlenecked or fully utilized. Recommended alert rules:
-
Alert rules:
-
Critical: Trigger an alert if the instance CPU usage is 99% or higher for 60 consecutive 1-minute periods. This helps you monitor the resource level of the cluster. If usage remains consistently high, you need to scale out.
-
Warn: Trigger an alert if the instance CPU usage is 99% or higher for 10 consecutive 1-minute periods. This practice helps you promptly check whether high CPU usage is caused by business changes.
-
Do not configure an alert that is triggered by a single occurrence of 100% CPU usage. A brief spike to 100% does not necessarily indicate a system overload or an anomaly but can represent efficient resource utilization.
-
Do not set the CPU alert threshold to a low value. Even when no tasks are running, system components may be active and consume a specific amount of resources.
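The two-level rule above can be read as a single check on how long CPU usage has stayed at or above 99%. The following sketch, which assumes one sample per 1-minute period, classifies a series by the length of its trailing streak. The names `trailing_streak` and `cpu_alert_level` are hypothetical, and the code only illustrates the recommendation. It also shows why a single 100% sample does not raise an alert.

```python
from typing import Optional, Sequence

CPU_THRESHOLD = 99.0    # percent, from the recommended rules
WARN_PERIODS = 10       # consecutive 1-minute periods for Warn
CRITICAL_PERIODS = 60   # consecutive 1-minute periods for Critical


def trailing_streak(samples: Sequence[float], threshold: float) -> int:
    """Count how many of the most recent samples are at or above `threshold`."""
    streak = 0
    for value in reversed(samples):
        if value < threshold:
            break
        streak += 1
    return streak


def cpu_alert_level(samples: Sequence[float]) -> Optional[str]:
    """Return "Critical", "Warn", or None for a series of 1-minute CPU samples."""
    streak = trailing_streak(samples, CPU_THRESHOLD)
    if streak >= CRITICAL_PERIODS:
        return "Critical"
    if streak >= WARN_PERIODS:
        return "Warn"
    return None


print(cpu_alert_level([40.0, 100.0]))          # None: a brief spike
print(cpu_alert_level([50.0] + [99.5] * 10))   # Warn
print(cpu_alert_level([99.0] * 60))            # Critical
```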
Worker CPU usage (%)
This metric indicates whether a resource bottleneck exists on any worker node in your Hologres instance and reflects how fully the resources are being used. Recommended alert rules:
-
Alert rules:
-
Critical: Trigger an alert if the CPU usage of a worker node is 99% or higher for 60 consecutive 1-minute periods. This helps you monitor the resource level of each worker node. If usage remains consistently high, you need to scale out.
-
Warn: Trigger an alert if the CPU usage of a worker node is 99% or higher for 10 consecutive 1-minute periods. This practice helps you promptly check whether high CPU usage is caused by business changes.
-
Do not configure an alert that is triggered by a single occurrence of 100% CPU usage of a worker node. A brief spike to 100% does not necessarily indicate a system overload or an anomaly but can represent efficient resource utilization.
-
Do not set the CPU alert threshold to a low value. Even when no tasks are running, system components may be active and consume a specific amount of resources.
Instance memory usage (%)
This metric reflects the memory usage of the instance. Recommended alert rules:
-
Alert rules:
-
Critical: Trigger an alert if the instance memory usage is 99% or higher for 60 consecutive 1-minute periods. This helps you monitor the memory level of the cluster. If usage remains consistently high, you need to scale out.
-
Warn: Trigger an alert if the instance memory usage is 99% or higher for 10 consecutive 1-minute periods. This practice helps you promptly check whether high memory usage is caused by business changes.
-
Do not set the memory alert threshold to a low value. Memory is used not only for running queries but also for metadata and caches. A specific amount of memory is consumed even when the instance is idle.
Worker memory usage (%)
This metric reflects the memory usage of a worker node. Recommended alert rules:
-
Alert rules:
-
Critical: Trigger an alert if the memory usage of a worker node is 99% or higher for 60 consecutive 1-minute periods. This helps you monitor the memory level of the cluster. If usage remains consistently high, you need to scale out.
-
Warn: Trigger an alert if the memory usage of a worker node is 99% or higher for 10 consecutive 1-minute periods. This practice helps you promptly check whether high memory usage is caused by business changes.
-
Do not set the memory alert threshold to a low value. Memory is used not only for running queries but also for metadata and caches. A specific amount of memory is consumed even when the instance is idle.
Max FE connection usage (%)
Warn: Trigger an alert if the maximum FE connection usage is 95% or higher for 5 consecutive 1-minute periods. This helps you monitor the cluster's connection usage and clear idle connections promptly.
Max FE binlog WAL sender usage (%)
This metric reflects the maximum WAL sender usage across all FEs. Recommended alert rule:
Warn: Trigger an alert if the maximum FE WAL sender usage is 95% or higher for 5 consecutive 1-minute periods. This helps you monitor the WAL sender usage of the cluster.
Longest active query time (milliseconds)
This metric helps you monitor for long-running queries on your instance. Recommended alert rule:
Warn: Trigger an alert if the longest active query time is 3,600,000 milliseconds or longer for 10 consecutive 1-minute periods.
Longest active serverless computing query time (milliseconds)
This metric helps you monitor tasks that use serverless computing. If a task runs for too long, you can terminate it promptly. Recommended alert rule:
Warn: Trigger an alert if the longest active query time in serverless computing is 3,600,000 milliseconds or longer for 10 consecutive 1-minute periods.
Failed query QPS (count/s)
This metric shows the number of failed queries per second on the instance. Configure an alert rule for failed queries to monitor the query execution status. Recommended alert rule:
Warn: Trigger an alert if the failed query QPS is 10 count/s or higher for 10 consecutive 1-minute periods. If a large number of queries fail on the instance, we recommend that you view the failure details in slow query logs and perform targeted optimization.
FE replay time (milliseconds)
This metric indicates the replay time for each FE. An excessive replay time indicates a slow replay. This can be caused by a stuck FE, which in turn causes queries to stall. This issue requires immediate attention. We recommend that you configure the following alert rules:
-
Alert rule:
Warn: Trigger an alert if the FE replay time is 300,000 milliseconds or longer for 10 consecutive 1-minute periods. If an alert is triggered, go to the Active Queries page in the HoloWeb console to check for long-running queries and try to terminate them.
-
Do not set the alert threshold for the FE replay time to a low value. FE replay occurs whenever metadata in the instance is modified. An FE replay time of a few seconds is normal.
Instance sync lag (milliseconds)
This metric is displayed only for follower instances and shows the data synchronization latency between the primary and follower instances. Recommended alert rule:
Warn: Trigger an alert if the instance sync lag is 600,000 milliseconds or longer for 10 consecutive 1-minute periods.
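The millisecond thresholds in the recommended rules are easier to reason about as durations. The following sketch converts them by using Python's standard `datetime.timedelta`. The dictionary keys are hypothetical labels chosen for this example and are not CloudMonitor metric identifiers.

```python
from datetime import timedelta

# Thresholds from the recommended alert rules, in milliseconds.
THRESHOLDS_MS = {
    "longest_active_query_time": 3_600_000,
    "longest_active_serverless_query_time": 3_600_000,
    "fe_replay_time": 300_000,
    "instance_sync_lag": 600_000,
}


def as_duration(milliseconds: int) -> timedelta:
    """Convert a millisecond threshold to a timedelta."""
    return timedelta(milliseconds=milliseconds)


for name, ms in THRESHOLDS_MS.items():
    print(f"{name}: {ms} ms = {as_duration(ms)}")
# 3,600,000 ms is 1:00:00 (1 hour), 300,000 ms is 0:05:00 (5 minutes),
# and 600,000 ms is 0:10:00 (10 minutes).
```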
Tables with missing stats by DB (count)
This metric reflects the quality of the auto-analyze feature. If a table has missing statistics for an extended period of time, you can manually run the ANALYZE command on the table. For more information, see ANALYZE and AUTO ANALYZE. We recommend that you configure the following alert rules:
-
Alert rule:
Warn: Trigger an alert if the number of tables that are missing statistics in each database is 10 or higher for 60 consecutive 1-minute periods.
-
Do not set the alert threshold to a low value. If an instance contains a large number of tables, the auto-analyze process may be slow.
Troubleshoot common monitoring issues
If a metric fluctuates unexpectedly or an alert is triggered, you can troubleshoot the issue. For more information, see FAQ about monitoring metrics.
Access metrics by using APIs
In addition to the CloudMonitor console, CloudMonitor provides custom dashboards and APIs to help you access metrics in a more flexible manner.
-
To access CloudMonitor by using APIs, see Cloud Service Monitoring.
-
To use a custom monitoring dashboard, see Manage custom dashboards.
-
To access Hologres monitoring by using Application Real-Time Monitoring Service (ARMS), see Integration guide.
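As an illustration of API access, the following sketch builds the request parameters for a CloudMonitor metric query such as the DescribeMetricList operation. It only assembles a parameter dictionary and does not call any SDK. The namespace `acs_hologres`, the metric name `cpu_usage`, the dimension key `instanceId`, and the instance ID are placeholders assumed for this example. Check the CloudMonitor API reference and the Hologres metric list for the actual values before you use them.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def build_metric_query(instance_id: str, metric_name: str, namespace: str,
                       minutes: int = 60, period_seconds: int = 60,
                       now: Optional[datetime] = None) -> Dict[str, str]:
    """Build request parameters for a metric query over the last `minutes` minutes.

    Timestamps are expressed as UNIX time in milliseconds.
    """
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Period": str(period_seconds),
        "StartTime": str(int(start.timestamp() * 1000)),
        "EndTime": str(int(end.timestamp() * 1000)),
        # Dimensions is passed as a JSON string; the key name is an assumption.
        "Dimensions": json.dumps([{"instanceId": instance_id}]),
    }


# All identifiers below are placeholders, not confirmed values.
params = build_metric_query(
    instance_id="hgprecn-cn-example",
    metric_name="cpu_usage",
    namespace="acs_hologres",
)
print(json.dumps(params, indent=2))
```

The resulting dictionary can then be passed to the CloudMonitor SDK or an HTTP client of your choice, using credentials for an account that has one of the CloudMonitor read policies.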
Grant permissions to a RAM user
By default, a RAM user cannot view metrics in CloudMonitor. You must grant the necessary permissions to the RAM user.
You can use your Alibaba Cloud account (primary account) to log on to the Resource Access Management (RAM) console and grant the following policies to a RAM user. For more information about how to grant permissions to a RAM user, see Grant permissions to a RAM user.
You can also select policies based on your business requirements.
| Policy name | Description |
| AliyunCloudMonitorFullAccess | Permissions to fully manage CloudMonitor. |
| AliyunCloudMonitorReadOnlyAccess | Read-only permissions on CloudMonitor. |
| AliyunCloudMonitorMetricDataReadOnlyAccess | Permissions to read time series metric data in CloudMonitor. |