Alert rules

Create alert rules to monitor your applications and services. When conditions are met, CloudMonitor notifies you through contacts, chatbots, webhooks, or action integrations.

Prerequisites

  • You have enabled the required observability monitoring services, such as Prometheus, Application Monitoring, and Log Service.

  • You have created a notification contact.

Create an alert rule

  1. Log in to the CloudMonitor 2.0 console. In the left-side navigation pane, choose All Features > Alert Center.

  2. On the Alert Center page, choose Alert management > Alert rules.

  3. On the Alert rules page, click Create alert rule.

  4. In the Create alert rule panel, configure the following parameters.

    1. Rule name: A name to identify the alert rule.

    2. Monitoring type: The type of service or resource to monitor.

      • Managed Service for Prometheus/Cloud Synthetic Monitoring

        Parameter

        Description

        Data Source Type

        The source of the monitoring data.

        Region

        The region where the data source resides.

        Prometheus instance

        The instance to which the alert rule applies.

        Detection condition definition method

        Custom PromQL: Create a custom PromQL query. For more information, see PromQL Function Usage Examples.

        Configure based on predefined metrics:

        • Metric group: Select a metric group.

        • Metric: Select a metric.

        • Detection condition: Set the detection condition by specifying a comparison operator and a value. p50, p75, p90, and p99 represent percentiles.

        • PromQL preview: Preview the PromQL query for the predefined metric.

        Severity level

        Set the severity level for the alert rule.

        • P1: Critical: Issues affecting core service availability with widespread impact.

        • P2: Error: Partial service failures affecting availability.

        • P3: Warning: Potential issues that could cause service errors or affect your business.

        • P4: Information: Low-priority events. Default level.

        Duration

        How long a condition must persist before an alert triggers. Prevents false alarms from transient fluctuations.

        Alert detection period

        The execution interval of the alert rule. The default value is 60 seconds, which means the check is performed once per minute.

        Content

        You can use Go template syntax to customize the content of alert messages. For example: Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, Current value {{ printf "%.2f" $value }}%

        Labels

        Custom key-value pairs for categorizing and filtering alert rules. For example: env: production and team: sre.

        Annotations

        Additional information for the alert rule, such as long-form descriptions or runbook links. For example: description: CPU usage is high and runbook_url: https://wiki.xxx.com/runbook.
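For the Duration and Alert detection period parameters in the table above, the following is a minimal sketch of how the two settings might interact. It assumes semantics similar to the `for` clause of a Prometheus alerting rule: the condition is checked once per detection period, and the alert fires only after the condition has held for the full duration. The `RuleState`, `evaluate`, and `simulate` names and the intermediate "Pending" label are hypothetical and are not part of CloudMonitor.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RuleState:
    """Remembers when the condition first became true (None while it is false)."""
    breach_started_at: Optional[int] = None


def evaluate(state: RuleState, now_s: int, condition_met: bool, duration_s: int) -> str:
    """Run one check and return "Ok", "Pending", or "Alarm"."""
    if not condition_met:
        state.breach_started_at = None  # any recovery resets the timer
        return "Ok"
    if state.breach_started_at is None:
        state.breach_started_at = now_s
    held_for = now_s - state.breach_started_at
    # Assumption: the alert fires once the condition has held for the full duration.
    return "Alarm" if held_for >= duration_s else "Pending"


def simulate(samples: List[bool], period_s: int = 60, duration_s: int = 120) -> List[str]:
    """Evaluate one sample per detection period and collect the resulting states."""
    state = RuleState()
    return [evaluate(state, i * period_s, met, duration_s) for i, met in enumerate(samples)]
```

With a 60-second detection period and a 120-second duration, a brief spike that recovers before the third check never reaches "Alarm", which is the false-alarm protection that the Duration parameter describes.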

      • Application Monitoring

        Parameter

        Description

        Data Source Type

        The type of the data source.

        Region

        The region where the data source resides.

        Application

        The application to monitor.

        Metric group

        The metric group for the application.

        Interface name

        Interface matching method: Traverse, Equals, Not Equals, Regex Match, Regex Not Match, or No Dimension.

        Interface call type

        Detection condition method

        Single condition:

        • Set the time range (Last N Minutes), the call type, the calculation method, and a comparison operator.

        • Set the thresholds for different severity levels: Critical, Error, Warning, and Information.

        Multiple conditions:

        • Multi-alert trigger rule: Select Any condition is met or All conditions are met.

        • Detection condition 1: Same parameters as a single condition.

        • Add detection condition: Add more conditions as needed.

        • Severity level: Valid values: P1: Critical, P2: Error, P3: Warning, and P4: Information.

        Alert detection period

        The execution interval of the alert rule. The default value is 60 seconds, which means the check is performed once per minute.

        Content

        The customizable content of alert notifications.

        Tags

        Custom key-value pairs for categorizing and filtering alert rules. For example: env: production and team: sre.

        Annotations

        Additional information for the alert rule, such as long-form descriptions or runbook links. For example: description: CPU usage is high and runbook_url: https://wiki.xxx.com/runbook.
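The Single condition and Multiple conditions options in the table above can be illustrated with a short sketch. It assumes that, for a single condition, the most severe level whose threshold is crossed wins, and that multiple conditions are combined with a plain any/all rule. The function names, the operator table, and the threshold values are hypothetical and are not part of CloudMonitor.

```python
import operator
from typing import Dict, List, Optional

# Comparison operators a detection condition might use (illustrative set).
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

# Most severe level first, so the first crossed threshold wins.
SEVERITY_ORDER = ["Critical", "Error", "Warning", "Information"]


def classify(value: float, op: str, thresholds: Dict[str, float]) -> Optional[str]:
    """Single condition: return the most severe level whose threshold is crossed."""
    compare = OPERATORS[op]
    for level in SEVERITY_ORDER:
        if level in thresholds and compare(value, thresholds[level]):
            return level
    return None  # no threshold crossed, so no alert


def multi_condition_met(results: List[bool], trigger_rule: str) -> bool:
    """Multiple conditions: combine per-condition results with "any" or "all"."""
    if trigger_rule == "any":
        return any(results)
    if trigger_rule == "all":
        return all(results)
    raise ValueError(f"unknown trigger rule: {trigger_rule}")
```

For example, with hypothetical response-time thresholds of 2000, 1000, and 500 for Critical, Error, and Warning, a value of 950 compared with `>` would be classified as Warning.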

      • Large model observability

        Parameter

        Description

        Data Source Type

        The data source type, which is automatically set to UModel.

        Entity Type

        The type of entity to monitor.

        Metric set

        The set of metrics to evaluate, such as AI application operational metrics, GenAI model metrics, or AI application traffic metrics.

        Detection condition

        Set the threshold that triggers the alert.

        Severity level

        The severity level of the alert. Valid values are P1: Critical, P2: Error, P3: Warning, and P4: Information.

        Duration

        The duration the condition must persist before an alert triggers.

        Alert detection period

        The execution interval of the alert rule. The default value is 60 seconds, which means the check is performed once per minute.

        Content

        The customizable content of alert notifications.

        Tags

        Custom key-value pairs for categorizing and filtering alert rules. For example: env: production and team: sre.

        Annotations

        Additional information for the alert rule, such as long-form descriptions or runbook links. For example: description: CPU usage is high and runbook_url: https://wiki.xxx.com/runbook.

      • Container Insights/ECS Insights/Hologres Insights/AI Training Service Insights/Database Insights

        Parameter

        Description

        Data Source Type

        The source of the monitoring data.

        Region

        The region where the data source resides.

        Prometheus instance

        The instance to which the alert rule applies.

        Detection condition definition method

        Custom PromQL: Create a custom PromQL query. For more information, see PromQL Function Usage Examples.

        Configure based on predefined metrics:

        • Metric group: Select a metric group.

        • Metric: Select a metric.

        • Detection condition: Set the detection condition by specifying a comparison operator and a value.

        • PromQL preview: Preview the PromQL query for the predefined metric.

        Severity level

        Set the severity level for the alert rule.

        • P1: Critical

        • P2: Error

        • P3: Warning

        • P4: Information

        Duration

        The duration the condition must persist before an alert triggers.

        Alert detection period

        The execution interval of the alert rule. The default value is 60 seconds, which means the check is performed once per minute.

        Run detection after data is complete

        Select a detection method.

        Content

        You can use Go template syntax to customize the alert message content. For example: Namespace: {{$labels.namespace}} / Pod: {{$labels.pod_name}} / Container: {{$labels.container}} CPU usage {{$labels.metrics_params_opt_label_value}} {{$labels.metrics_params_value}}%, current value {{ printf "%.2f" $value }}%

        Tags

        Custom key-value pairs for categorizing and filtering alert rules. For example: env: production and team: sre.

        Annotations

        Additional information for the alert rule, such as long-form descriptions or runbook links. For example: description: CPU usage is high and runbook_url: https://wiki.xxx.com/runbook.

      • Log Audit

        Parameter

        Description

        Select template

        ActionTrail: Select an ActionTrail template.

        Host audit: Select a host audit template.

        Container audit: Select a container audit template.

        Query and statistics

        Single query: Query by configuring log-related parameters.

        Set operation: Configure set operations across multiple resource sets.

        Detection Logic

        Add conditions and set the data matching method and severity level.

        Severity level

        Valid values: Critical, Error, Warning, and Information.

        Consecutive hits

        Specify the number of consecutive times the condition must be met to trigger an alert.

        Alert detection period

        The execution interval of the alert rule. The default value is 60 seconds, which means the check is performed once per minute.

        Tags

        Custom key-value pairs for categorizing and filtering alert rules. For example: env: production and team: sre.

        Annotations

        Additional information for the alert rule, such as long-form descriptions or runbook links. For example: description: CPU usage is high and runbook_url: https://wiki.xxx.com/runbook.

      • Log Service: Same parameters as the Log Audit monitoring type.
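The Consecutive hits parameter used by Log Audit and Log Service rules can be pictured as a streak counter. The sketch below is only an illustration: it assumes that a single miss resets the streak and that the alert triggers once, at the check where the streak first reaches the configured count. The `alert_points` function is hypothetical, and the actual re-trigger behavior of CloudMonitor is not described here.

```python
from typing import List


def alert_points(hits: List[bool], required: int) -> List[int]:
    """Return the indexes of the checks at which the alert triggers.

    hits     -- one boolean per check, True when the detection logic matched
    required -- the Consecutive hits setting (must be at least 1)
    """
    if required < 1:
        raise ValueError("required must be at least 1")
    streak = 0
    triggered = []
    for index, hit in enumerate(hits):
        streak = streak + 1 if hit else 0  # a miss resets the streak
        if streak == required:  # trigger once, when the streak first reaches the count
            triggered.append(index)
    return triggered
```

With the default 60-second detection period, a setting of 3 consecutive hits means the condition would need to match on three checks in a row, or roughly three minutes, before an alert is raised under these assumptions.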

    3. Alert notification.

      • Notification recipient: The recipients to notify when an alert is triggered.

        • Contact: Individual contacts to notify.

        • Contact group: A group of contacts to notify.

        • DingTalk: Sends alerts to a DingTalk group chatbot.

        • WeCom: Sends alerts to a WeCom chatbot.

        • Lark: Sends alerts to a Lark chatbot.

        • Slack: Sends alerts by using Slack.

        • Custom webhook: Sends alerts by using a custom webhook.

      • Integrate with ARMS alert management: Integrates with Application Real-Time Monitoring Service (ARMS) to manage alert lifecycles.

        Note

        Alert events are sent to the ARMS Alert O&M center by default. Configure notifications there.

      • Action integration: The service to trigger for automated incident response, such as Log Service, lightweight message queues, Function Compute, and third-party services like PagerDuty and webhooks.

      • Notification silence period: The time to wait before resending a notification for an unresolved alert. Valid values: 1, 5, 10, 15, 30, and 50 minutes, and 1, 3, 6, 12, and 24 hours.

        Note

        Example: With Notification silence period set to 12 hours, CloudMonitor resends the notification after 12 hours if the alert persists.

      • Effective period: The time window when the alert rule is active. Notifications are sent only during this period.

        Note

        • Alerts triggered outside the effective period are recorded in history but do not generate notifications.

        • The effective period can be set within a 24-hour range and can span midnight, for example, from 23:00 to 01:00 on the next day.
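The combined effect of Notification silence period and Effective period can be sketched as follows. This is a minimal illustration under stated assumptions: both window endpoints are treated as inclusive, a window whose start is later than its end is taken to wrap past midnight, and a notification is resent once the silence period has fully elapsed. The `in_effective_period` and `should_notify` functions are hypothetical and do not reflect the CloudMonitor implementation.

```python
from datetime import datetime, time, timedelta
from typing import Optional


def in_effective_period(now: time, start: time, end: time) -> bool:
    """Return True if `now` is inside the window, including windows that cross midnight."""
    if start <= end:
        return start <= now <= end
    # A window such as 23:00 to 01:00 wraps past midnight.
    return now >= start or now <= end


def should_notify(now: datetime, last_sent: Optional[datetime],
                  silence: timedelta, start: time, end: time) -> bool:
    """Decide whether to send a notification for an alert that is still active."""
    if not in_effective_period(now.time(), start, end):
        return False  # outside the window: recorded in history, no notification
    if last_sent is None:
        return True  # first notification for this alert
    return now - last_sent >= silence  # resend only after the silence period
```

For a 23:00 to 01:00 window, 00:30 counts as inside the window and 12:00 as outside. With a 12-hour silence period, an alert first notified at 00:00 would be eligible for a resend at 12:00 if it is still active and the effective period covers that time.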

Manage alert rules

  1. The Alert rules page lists all alert rules with the following information.

    Parameter

    Description

    Alert status

    The current status of the rule. Valid values:

    - Ok: The alert condition is not met.

    - Alarm: The alert condition is met and the alert is active.

    - NoData: No monitoring data is available.

    Rule name/ID

    The display name and unique identifier (UUID) of the alert rule.

    Enabled status

    Whether the rule is enabled. Enabled rules are evaluated at the configured interval; disabled rules are not.

    Service source

    The service that the rule applies to.

  2. You can search for alert rules using the following parameters:

    • Monitoring type

    • Rule name/ID

    • Alert status

    • Enabled status

    • More filters: Search by tag or notification contact.

  • Edit: Find the alert rule, click Edit in the Actions column, modify the rule, and then click OK.

  • Enable/Disable: Toggle the switch in the Enabled status column.

  • Delete: Click the delete icon in the Actions column.

    Warning

    This action cannot be undone. Proceed with caution.