Configure cluster alerts

While an Elasticsearch cluster is running, issues such as an abnormal cluster status or high node disk usage can affect service availability. Monitoring alerts let you detect and handle cluster anomalies in real time. Alibaba Cloud Elasticsearch supports two methods: the one-click alert feature and custom alert rules in Cloud Monitor.

Enable one-click alert

The one-click alert feature, provided by Cloud Monitor, is disabled by default. When you enable it, the system automatically creates the following alert rules for all Elasticsearch clusters under your Alibaba Cloud account:

  • Abnormal cluster status

  • High node disk usage (>75%)

  • High node JVM heap usage (>85%)

  1. Log on to the Alibaba Cloud Elasticsearch console.

  2. In the left-side navigation pane, click Elasticsearch Clusters.

  3. On the Elasticsearch Clusters page, click Initiative Alert.

  4. In the Initiative Alert dialog box, click Enable Now.

    Note

    If the button displays Disable Now, the one-click alert feature is already enabled, and you can skip the remaining steps.

  5. On the Cloud Monitor console, turn on the Initiative Alert switch for the Elasticsearch service.

  6. (Optional) Return to the Alibaba Cloud Elasticsearch console to verify that one-click alert is enabled.

    1. On the Elasticsearch Clusters page, click the ID of the target instance.

    2. In the left-side navigation pane, choose Monitoring and Logs > Cluster Monitoring.

    3. Click the Basic Monitoring tab and check the status of Initiative Alert in the upper-right corner.

      If the status of Initiative Alert is Enabled, the feature is active.
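
As a cross-check outside Cloud Monitor, the three default rules can be approximated against statistics read directly from the cluster. The sketch below is a rough illustration, not how Cloud Monitor computes its metrics. It assumes that you have already fetched the JSON bodies of the Elasticsearch `_cluster/health` and `_nodes/stats/jvm,fs` APIs, and that only a red status counts as abnormal. The function name `default_rule_hits` and the `abnormal_statuses` parameter are hypothetical.

```python
# Thresholds taken from the one-click rule list above (percent).
DISK_USAGE_LIMIT = 75.0
HEAP_USAGE_LIMIT = 85.0


def default_rule_hits(health, node_stats, abnormal_statuses=("red",)):
    """Return descriptions of the default rules that the given stats exceed.

    health:            parsed JSON body of GET _cluster/health
    node_stats:        parsed JSON body of GET _nodes/stats/jvm,fs
    abnormal_statuses: assumption about which statuses count as abnormal
    """
    hits = []
    if health["status"] in abnormal_statuses:
        hits.append(f"cluster status is {health['status']}")

    for node in node_stats["nodes"].values():
        name = node["name"]

        # Disk usage is derived from total and available bytes on the node.
        total = node["fs"]["total"]["total_in_bytes"]
        available = node["fs"]["total"]["available_in_bytes"]
        disk_pct = 100.0 * (total - available) / total
        if disk_pct > DISK_USAGE_LIMIT:
            hits.append(f"{name}: disk usage {disk_pct:.1f}%")

        # Elasticsearch reports heap usage directly as a percentage.
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        if heap_pct > HEAP_USAGE_LIMIT:
            hits.append(f"{name}: heap usage {heap_pct}%")
    return hits
```

An empty list means none of the three thresholds is exceeded in the sampled statistics. Cloud Monitor may aggregate or sample these values differently, so treat the result as a sanity check only.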

Configure Cloud Monitor alerts

The one-click alert feature uses a fixed template for its rules. To customize metrics, thresholds, and notification methods, you can create a custom alert rule in Cloud Monitor.

  1. Go to the Cloud Monitor console.

  2. In the left-side navigation pane, choose Alerts > Alarm Rules.

  3. Click Create Alert Rule.

  4. On the Create Alert Rule page, configure the alert rule.

    The following example shows how to configure an alert rule for three combined metrics: cluster status, node disk usage, and node heap memory usage. For parameters not mentioned, use the default values. For detailed parameter descriptions, see Create an alert rule.

    Parameter

    Description

    Product

    Select Elasticsearch.

    Resource Range

    Select Cluster.

    Associated Resources

    Add the instances that you want to monitor.

    Rule Description

    Click Add Rule > Combined Metrics, and then configure the following parameters in the Configure Rule Description panel:

    • Metric Type: Select Combined Metrics.

    • Alert Level: Select Warning.

    • Multi-metric Alert Condition:

      Note

      This example configures three monitoring metrics. Click Add Metric to add more metric conditions.

      • Metric 1: Select Cluster ID > Cluster Status and set the condition to >=2.

      • Metric 2: Select nodeName > Node Disk Usage and set the condition to average >=75%.

      • Metric 3: Select nodeName > Node Heap Memory Usage_ES Business and set the condition to average >=85%.

    • Relationship Between Metrics: Select Generate alerts if one of the conditions is met (||).

    • Alert Threshold Triggers: Select 3 Consecutive Cycles (1 Cycle = 1 Minute).

    You can also configure a disk usage alert by using a single-metric alert rule. For more information, see Example: Configure a disk alert.

    Alarm Contact Group

    Select an existing alert contact group. If you have not created one, see Create an alert contact or an alert contact group.

    Expand Advanced Settings and enter a publicly accessible URL in the Alert Callback field. Cloud Monitor pushes alert information to this URL using POST requests. Only the HTTP protocol is supported. For more information, see Use alert callbacks for threshold-triggered alerts.

    When configuring alert rules, you can refer to the following monitoring metrics. For more information, see Metric descriptions and troubleshooting suggestions.

    | Metric | Necessity | Recommended threshold | Description |
    | --- | --- | --- | --- |
    | cluster status | Required | Value >= 2 | The cluster statuses Green, Yellow, and Red correspond to the numerical values 0.00, 1.00, and 2.00, respectively. Configure the alert metric based on these numerical values. |
    | node disk usage (%) | Required | Average >= 75% | Should not exceed 80%. |
    | node heap memory usage (%) | Required | Average >= 85% | Should not exceed 90%. In the rule description, this metric is displayed as Node Heap Memory Usage_ES Business. |
    | node CPU utilization (%) | Optional | Average >= 95% | - |
    | node load_1m | Optional | Use 80% of the number of CPU cores as a reference value. | - |
    | cluster query QPS (count/second) | Optional | Use actual test results as a reference. | - |
    | cluster write QPS (count/second) | Optional | Use actual test results as a reference. | - |
    | full GC count (count) | Optional | Abnormal if the value is not 0. | - |
    | exception count (count) | Optional | Abnormal if the value is not 0. | - |
    | snapshot status | Optional | Abnormal if the value is 2. | Normal if the value is -1 or 0. |

  5. Click OK.

    After the alert rule is created, members of the specified alert contact group receive notifications when an alert is triggered. For information about how to configure notification methods, see Receive alert notifications in a DingTalk group.
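
If you set the Alert Callback field in step 4, the endpoint behind that URL must accept HTTP POST requests. The following is a minimal receiver sketch that uses only the Python standard library. The body encoding and the field names inside the payload are assumptions here, so the sketch accepts both JSON and form-encoded bodies and logs whatever it receives. The names `parse_callback_body`, `AlertCallbackHandler`, and `serve`, as well as port 8080, are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs


def parse_callback_body(body: bytes, content_type: str) -> dict:
    """Decode a callback body into a dict.

    Both JSON and form encoding are handled because the exact
    encoding of the callback is an assumption in this sketch.
    """
    text = body.decode("utf-8")
    if "json" in content_type.lower():
        return json.loads(text)
    # parse_qs returns a list per key; keep the first value of each field.
    return {key: values[0] for key, values in parse_qs(text).items()}


class AlertCallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        alert = parse_callback_body(body, self.headers.get("Content-Type", ""))
        # Payload field names are not assumed; log the whole payload.
        print("alert received:", alert)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Listen for callbacks over plain HTTP on the given port."""
    HTTPServer(("", port), AlertCallbackHandler).serve_forever()
```

Calling `serve()` starts the listener. Because the callback URL must be publicly accessible, a real deployment would also need to verify that requests come from Cloud Monitor, which this sketch does not do.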

Example: Configure a disk alert

A disk usage alert is one of the most common single-metric alert scenarios. When node disk usage exceeds the configured threshold, promptly expand the storage capacity or clean up data to prevent the service from becoming unavailable because of full disks.

Follow the steps in Configure Cloud Monitor alerts to create an alert rule. In the Rule Description section, choose Add Rule > Simple Metric. The following table provides an example configuration.

| Parameter | Example |
| --- | --- |
| Alert Rule Name | Disk Usage Alert |
| Metric Type | Select Simple Metric. |
| Metrics | Select nodeName > Node Disk Usage. |
| Threshold and Alert Level | Critical: Average over 3 consecutive cycles >= 80%. Warning: Average over 3 consecutive cycles >= 75%. Info: Average over 3 consecutive cycles >= 70%. |
| Chart Preview | A preview of the monitoring chart for the selected metric. |
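
To make the tiered thresholds concrete, the following sketch shows one way to read them: a level applies only when the average disk usage stays at or above its threshold for three consecutive cycles. Reporting only the most severe matching level is an assumption of this sketch, not documented Cloud Monitor behavior, and the function name `disk_alert_level` is hypothetical.

```python
# Thresholds from the example configuration, ordered from most to least severe.
LEVELS = [("Critical", 80.0), ("Warning", 75.0), ("Info", 70.0)]
CONSECUTIVE_CYCLES = 3


def disk_alert_level(cycle_averages):
    """Return the most severe level met by the latest cycles, or None.

    cycle_averages: per-cycle average disk usage in percent, oldest first.
    """
    if len(cycle_averages) < CONSECUTIVE_CYCLES:
        return None  # not enough data to satisfy the consecutive-cycle condition
    recent = cycle_averages[-CONSECUTIVE_CYCLES:]
    for level, threshold in LEVELS:
        # Every one of the last three cycles must reach the threshold.
        if all(value >= threshold for value in recent):
            return level
    return None
```

For example, the averages 76, 81, and 79 satisfy the Warning threshold in all three cycles but reach the Critical threshold only once, so the function returns "Warning".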