Notification policy best practices

A notification policy is the core dispatching unit of the Cloud Monitor 2.0 Alert Center. It subscribes to observability events from alert rules and external integrations, then applies noise reduction, routing, lifecycle management, and automated actions. Follow these best practices to meet complex notification requirements with a minimal number of policies.

Background information

When alert rules or external integrations continuously generate alert events, you need timely notifications. However, requirements vary by business, alert level, and time of day, and you also need to manage the alert lifecycle by claiming, muting, and escalating alerts. A Cloud Monitor 2.0 notification policy provides a unified solution:

  • Precise, on-demand subscription: Use event subscription conditions (Any, All, or Custom expressions) to subscribe to specific events.

  • Noise reduction and aggregation: Group events with the same dimensions into a single issue by specifying aggregation fields. This reduces alert storms and notification fatigue.

  • Routing and dispatch: Use routing rules to dispatch events to different notification objects and channels based on their labels.

  • Lifecycle management: Manage the entire alert lifecycle with recovery notifications, auto-recovery, repeated notifications, and escalation policies to prevent missed alerts and excessive notifications.

  • Action integration: Automatically execute actions, such as webhooks, Function Compute, and ITSM ticket creation, when an alert is triggered or recovered.

  • AI assistance: Use the built-in digital assistant to automatically provide root cause analysis (RCA).

Note

In this document, an event refers to an observability event, which is structured data generated by an alert rule or integration. An issue is the entity for notification and management, created after a notification policy applies noise reduction and aggregation. A single notification policy can be shared by multiple alert rules, and any changes to the policy take effect immediately for all rules that reference it.

Notification policy processing flow

A notification policy requires an alert event source. Cloud Monitor 2.0 supports three types of event sources:

| Source type | Description |
| --- | --- |
| Cloud Monitor alert rules | Cloud Monitor 2.0 is natively integrated with its own alert rules, including rules for cloud service monitoring, Prometheus, application monitoring, AI Agent observability, and user experience monitoring. When a rule is triggered, the generated event can be subscribed to by a notification policy. |
| Event integrations | You can ingest events from third-party sources like self-managed Prometheus, Grafana, Zabbix, SkyWalking, and custom webhooks into the Alert Center. Once an event is ingested, it can be subscribed to by a notification policy. |
| Cloud Monitor 1.0 and SLS alert events | Alert events from Cloud Monitor 1.0 and SLS are automatically written to the Cloud Monitor 2.0 Event Center. |

Note

Cloud Monitor 2.0 supports integrations with third-party products, allowing you to send events from services like self-managed Prometheus, Grafana, Zabbix, and SkyWalking to the Event Center for unified management. After you configure an integration, notification policies can match events from that source.

Event subscription

Event subscription is the entry point for a notification policy. Each event carries structured labels such as alert level, alert source, service name, resource type, cluster name, namespace, or custom tags. Subscriptions filter events by these labels, and only matching events proceed to subsequent processing steps.

| Matching mode | Description | Use case |
| --- | --- | --- |
| Any | An event matches if it meets any of the subscription conditions (OR). | Send multiple types of alerts to the same recipient, or handle events from multiple sources for a single team. |
| All | An event must meet all of the subscription conditions (AND). | Create fine-grained subscriptions for a specific environment, team, and alert level. |
| Custom | Define custom expressions with condition numbers, such as (1 AND 2) OR (3 AND 4). | Handle complex filtering that requires two or more independent branches of conditions. |
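
The three matching modes can be sketched as a small evaluator. This is an illustrative model only, not the Cloud Monitor implementation: the function name `matches`, the representation of conditions as (label, value) pairs, and the mode names are assumptions made for this example.

```python
import re

def matches(event, conditions, mode, expression=None):
    """Return True if the event's labels satisfy the subscription.

    conditions: list of (label, expected_value) pairs, numbered from 1.
    mode: "any", "all", or "custom" (hypothetical names for this sketch).
    expression: for "custom", a string such as "(1 AND 2) OR (3 AND 4)".
    """
    results = [event.get(label) == value for label, value in conditions]
    if mode == "any":
        return any(results)
    if mode == "all":
        return all(results)
    if mode != "custom":
        raise ValueError(f"unknown mode: {mode!r}")

    # Accept only condition numbers, AND, OR, and parentheses.
    tokens = re.findall(r"\d+|AND|OR|[()]", expression)
    if "".join(tokens) != re.sub(r"\s+", "", expression):
        raise ValueError(f"unsupported expression: {expression!r}")

    def to_python(token):
        if not token.isdigit():
            return token.lower()  # "AND" -> "and", "OR" -> "or", parentheses unchanged
        number = int(token)
        if not 1 <= number <= len(results):
            raise ValueError(f"no condition numbered {number}")
        return str(results[number - 1])

    # The string now contains only True/False, and/or, and parentheses.
    return eval(" ".join(to_python(t) for t in tokens), {"__builtins__": {}})
```

For example, with conditions 1 to 4 defined as cluster_name = alert-dev, level = P0, team = a, and level = P1, an event labeled alert-dev / P1 / a fails "All" but satisfies both "Any" and the custom expression (1 AND 2) OR (3 AND 4).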

Notification configuration

Notification configuration is the core of a notification policy and consists of three sub-modules that execute in order:

  1. Noise reduction and aggregation: Select one or more labels as aggregation fields. Events with the same values for these fields are aggregated into a single issue. These aggregation fields are a prerequisite for subsequent routing.

  2. Routing rules: After aggregation, dispatch the issue to specific notification objects (contacts, contact groups, or on-call schedules) and notification channels based on labels. You can define multiple rules, which are matched in order. Because the rules are not mutually exclusive, an issue can match multiple rules.

  3. Notification templates: Select content templates for different channels (Phone, SMS, Email, DingTalk, Lark, or WeCom). If you do not configure a template for a channel, the system default is used.
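
The aggregation step can be pictured as grouping events by the values of the chosen fields. The sketch below is a simplified model under that assumption; the function name `aggregate` and the dictionary-based event shape are hypothetical and do not reflect the product's internal data structures.

```python
from collections import defaultdict

def aggregate(events, fields):
    """Group events into issues keyed by the values of the aggregation fields."""
    issues = defaultdict(list)
    for event in events:
        # Events with identical values for every aggregation field share one key.
        key = tuple(event.get(field) for field in fields)
        issues[key].append(event)
    return dict(issues)

events = [
    {"cluster_name": "alert-dev", "alert_name": "cpu", "pod": "web-1"},
    {"cluster_name": "alert-dev", "alert_name": "cpu", "pod": "web-2"},
    {"cluster_name": "alert-dev", "alert_name": "memory", "pod": "web-1"},
]
issues = aggregate(events, ["cluster_name", "alert_name"])
```

Here the two cpu events collapse into one issue and the memory event forms a second one, so recipients would see two issues instead of three separate events.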

Lifecycle management

Lifecycle management determines what happens if an alert is not recovered or acknowledged:

| Configuration | Recommendation |
| --- | --- |
| Recovery notification | Unless the event is not important, enable this setting to keep relevant personnel aware of status changes. |
| Auto-recovery | If the alert rule can generate a recovery event, select Alerts do not auto-recover. If the event source does not send recovery events (such as a custom webhook), select Scheduled auto-recovery. |
| Repeated notification | We recommend enabling repeated notifications for high-priority alerts (P0/P1) to prevent them from being missed. We do not recommend enabling it for P3/P4 alerts. |
| Escalation policy | For critical services, configure an escalation policy to automatically escalate the issue to a shift lead or manager if it is not claimed within a specified time. |

Action integration

Action integration is disabled by default. When enabled, you can execute actions when an alert is triggered or recovered. Common actions include webhooks (for DingTalk, Lark, or WeCom group bots), HTTP callbacks, Function Compute, and ITSM tickets. This enables automated, closed-loop alert handling such as auto-scaling, automatic restarts, and automatic ticket creation or closure.

Notification policy design principles

Before configuring policies, plan your strategy based on the following principles to significantly reduce future maintenance costs.

  • Align policies with recipient groups: Instead of creating a notification policy for each alert rule, identify "who receives which types of alerts" and create one policy per recipient group (such as an SRE team, a business team's DingTalk group, or a production on-call schedule). Then, use event subscription conditions to select the relevant events.

  • Plan policies based on three dimensions: level, domain, and time. Common dimensions are alert level (P0/P1 vs. P2/P3), business domain (namespace, service name, or label), and time (business hours vs. non-business hours). By combining routing conditions and effective times in routing rules, a single policy can cover multiple combinations without requiring a separate policy for each scenario.

  • Always configure a catch-all routing rule: Add a catch-all rule with the condition "Unlimited" at the end of your routing rule list. Direct it to an SRE team or a central recipient. This ensures that no event is silently dropped, even if it does not match any preceding rules.

Note

Best practice: When adding a new alert rule, you only need to adjust the rule itself, not the notification policy. When personnel or teams change, you only need to update the notification objects in the policy, not the alert rules. This is the core value of the many-to-many decoupling design between notification policies and alert rules in Cloud Monitor 2.0.

Scenario 1: Alerting individuals

A basic scenario for individual developers or small teams.

Send events from several Prometheus alert rules in a cluster to a DingTalk group. Group members can directly claim, mute, or resolve alerts from the interactive message card.

Requirements

  • You have already integrated Prometheus (or a custom integration) and configured several alert rules, such as container_cpu_alert-test and container_memory_alert-test.

  • You want all events from these two rules in the alert-dev cluster to be sent to the team's DingTalk group. Alerts from other clusters should not be sent.

  • Each rule should trigger a separate notification. For each active alert, a reminder should be sent every 5 minutes.

Procedure

  1. In Notification management > Notification objects, create a DingTalk group bot or a contact. For more information, see Notification objects.

  2. Configure event subscription to filter for the target cluster.

    Set Matching Mode to All and add conditions to exclude events from other clusters and rules:

    • cluster_name equals alert-dev.

    • alert_name IN (container_cpu_alert-test, container_memory_alert-test).

  3. Configure noise reduction and aggregation to group by rule.

    Add _cms_rule_id (which groups by alert rule ID) as an aggregation field. Events from the two rules will create two separate issues, each with its own trigger, notification, and recovery lifecycle.

  4. Configure a routing rule to send notifications to the DingTalk group.

    Add a routing rule: set Routing Condition to Unlimited, Notification Object Type to Bot / DingTalk Bot, and Target Object to the object you created in step 1. For Notification Method, select the "DingTalk" checkbox.

  5. Configure repetition and recovery settings.

    Enable Recovery Notification. Set Auto-recovery to Alerts do not auto-recover, as you will rely on the recovery events from the alert rule. For Repeated notification, set it to repeat at an interval of 5 minutes.

  6. Save the policy and wait for an alert event to be generated, or trigger one with a test event. You will receive an alert card in your DingTalk group.
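
To see what the 5-minute repeat setting means for recipients, the following sketch lists when reminders would go out for an issue that stays active for a while. It assumes that the first reminder is sent one interval after the initial notification and that reminders stop once the issue recovers; the actual timing in Cloud Monitor 2.0 may differ, and `reminder_times` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def reminder_times(triggered_at, recovered_at, interval_minutes=5):
    """List the times of repeated notifications while the issue is active."""
    step = timedelta(minutes=interval_minutes)
    times = []
    current = triggered_at + step  # assumption: first reminder after one interval
    while current < recovered_at:
        times.append(current)
        current += step
    return times

reminders = reminder_times(
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 17),
)
```

Under these assumptions, an issue active from 10:00 to 10:17 produces reminders at 10:05, 10:10, and 10:15, followed by the recovery notification.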

Handle alerts in DingTalk

The DingTalk card provides action buttons such as Claim, Mute, Resolve, Follow, and Create Ticket. The first time you perform an action, you must bind your phone number and enter a verification code from the card. The phone number must already be registered as a notification object in Cloud Monitor 2.0.

Scenario 2: Route alerts to business teams

Distribute alerts based on domain for SRE and multiple business teams.

An SRE team configures alert rules centrally but needs to dispatch alerts to different DingTalk groups and contact groups based on the business team. They also need to use different notification frequencies for test and production environments.

Requirements

  • The organization includes an SRE team and three business teams (A, B, and C). Their respective resources are configured with the tags team-a, team-b, and team-c.

  • The clusters are divided into a test cluster, cluster-dev, and two production clusters, cluster-prod-1 and cluster-prod-2.

  • Alerts from the test environment are sent only to the corresponding business team's DingTalk group with a low frequency (repeated every 30 minutes).

  • Alerts from the production environment are sent to both the SRE team (via phone, SMS, and DingTalk) and the corresponding business team (via DingTalk) with a high frequency (repeated every 5 minutes).

Solution

With one policy per team and environment, this setup would need at least six policies (three business teams across two environments). With Cloud Monitor 2.0, you can use multiple routing rules with routing conditions within one policy per environment to handle all dispatching. This reduces the set to just two policies, resulting in a cleaner structure.

Policy 1: Test environment (cluster-dev)

| Stage | Key configuration |
| --- | --- |
| Event subscription | All: cluster_name equals cluster-dev. |
| Noise reduction and aggregation | Aggregation fields: cluster_name, alert_name, tag. |
| Routing rules | ① Routing condition tag = team-a, object = Business A DingTalk group, channel: DingTalk; ② Routing condition tag = team-b, object = Business B DingTalk group; ③ Routing condition tag = team-c, object = Business C DingTalk group; Catch-all: Routing condition = Unlimited, object = SRE DingTalk group (to receive alerts that carry none of the team tags) |
| Repetition / escalation | Repeated notification every 30 minutes; do not configure an escalation policy. |

Policy 2: Production environment (cluster-prod-1 / cluster-prod-2)

| Stage | Key configuration |
| --- | --- |
| Event subscription | Any: cluster_name = cluster-prod-1, cluster_name = cluster-prod-2. |
| Noise reduction and aggregation | Aggregation fields: cluster_name, alert_name, tag. |
| Routing rules | ① Routing condition = Unlimited, object = SRE contact group (Phone + SMS) + SRE DingTalk group (DingTalk); ② Routing condition tag = team-a, object = Business A DingTalk group; ③ Routing condition tag = team-b, object = Business B DingTalk group; ④ Routing condition tag = team-c, object = Business C DingTalk group |
| Repetition / escalation | Repeated notification every 5 minutes; an escalation policy can also be added (see Scenario 3 for details). |

In Cloud Monitor 2.0, a single event can match multiple routing rules simultaneously. All matching rules take effect because they are not mutually exclusive. This allows you to dispatch alerts to both SRE and business teams in parallel within a single policy, eliminating the need for multiple policies.
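
The non-exclusive matching described above can be modeled in a few lines. This is a sketch, not the product's routing engine: the `route` function and the rule representation (a condition dictionary, or `None` for "Unlimited") are assumptions, while the rule contents mirror the production policy in this scenario.

```python
def route(issue, rules):
    """Return every notification target whose routing rule matches the issue.

    Each rule is (condition, targets). A condition of None models "Unlimited".
    Rules are not mutually exclusive, so all matching rules contribute targets.
    """
    targets = []
    for condition, rule_targets in rules:
        if condition is None or all(issue.get(k) == v for k, v in condition.items()):
            targets.extend(rule_targets)
    return targets

PROD_RULES = [
    (None, ["SRE contact group (Phone + SMS)", "SRE DingTalk group"]),
    ({"tag": "team-a"}, ["Business A DingTalk group"]),
    ({"tag": "team-b"}, ["Business B DingTalk group"]),
    ({"tag": "team-c"}, ["Business C DingTalk group"]),
]

team_b_targets = route({"cluster_name": "cluster-prod-1", "tag": "team-b"}, PROD_RULES)
untagged_targets = route({"cluster_name": "cluster-prod-2"}, PROD_RULES)
```

An issue tagged team-b reaches both SRE targets and the Business B DingTalk group, while an issue without a team tag still reaches the SRE targets through the "Unlimited" rule.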

Scenario 3: On-call and emergency response

On-call rotation with emergency escalation.

During business hours, three engineers (A, B, and C) are on a two-day rotation. After hours, engineer D is on call. By default, an SMS notification is sent. If an alert is not claimed within 10 minutes, it is escalated to a phone call. If it is still not claimed after 20 minutes, the team manager is notified by phone.

Procedure

  1. Create contacts

    Create contacts for engineers A, B, C, D, and the team manager. Also, configure a DingTalk group as a notification object in advance.

  2. Create an on-call schedule

    In Notification management > On-call schedules, create the schedule:

    • Shift 1: Members A, B, and C receive alerts from 09:00 to 18:00 daily, with a rotation every 48 hours.

    • Shift 2: Member D receives alerts from 18:00 to 09:00 daily, with no rotation.

  3. Create an escalation policy

    In Notification management > Escalation policies, create an escalation chain:

    • Level 1: If an alert is not claimed within 10 minutes, notify the on-call person via phone.

    • Level 2: If the alert is still not claimed after 20 minutes, notify the team manager via phone.

  4. Create a notification policy

    Configure event subscription and noise reduction and aggregation as described in Scenario 1 or 2. In the routing rule, set Notification Object Type to "On-call schedule" and select the schedule you just created. For the notification method, select "SMS" only. From the Escalation Policy dropdown, select the policy you created in step 3.
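
The on-call schedule from step 2 can be modeled as a lookup from a point in time to the person on duty. The sketch below assumes a rotation anchor date (`ROTATION_START`, chosen arbitrarily here) and that the 48-hour rotation is counted from midnight of that date; neither detail comes from the product, and `on_call` is a hypothetical helper.

```python
from datetime import datetime, time

ROTATION_START = datetime(2024, 1, 1)  # hypothetical anchor for the rotation
DAY_MEMBERS = ["A", "B", "C"]          # Shift 1: 09:00 to 18:00
ROTATION_HOURS = 48                    # hand over every two days
NIGHT_MEMBER = "D"                     # Shift 2: 18:00 to 09:00, no rotation

def on_call(now):
    """Return the engineer who receives alerts at the given time."""
    if time(9, 0) <= now.time() < time(18, 0):
        elapsed_hours = (now - ROTATION_START).total_seconds() // 3600
        index = int(elapsed_hours // ROTATION_HOURS) % len(DAY_MEMBERS)
        return DAY_MEMBERS[index]
    return NIGHT_MEMBER
```

With this anchor, engineer A covers the daytime shift on January 1 and 2, B on January 3 and 4, C on January 5 and 6, and the cycle then returns to A. Engineer D covers every night.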

Note

You can enable an escalation policy and repeated notifications at the same time. The escalation policy expands the audience by level, while repeated notifications continuously remind the original recipient over time. The two features work independently and do not conflict.
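
The escalation chain in this scenario can be written out as a timeline. The sketch assumes that both thresholds (10 and 20 minutes) are measured from the moment the alert is triggered; if the product measures the second level from the first escalation instead, adjust the values accordingly. The names `ESCALATION_CHAIN` and `notifications_sent` are hypothetical.

```python
# (minutes unclaimed, recipient, channel)
ESCALATION_CHAIN = [
    (0, "on-call engineer", "SMS"),     # default notification from the routing rule
    (10, "on-call engineer", "Phone"),  # escalation level 1
    (20, "team manager", "Phone"),      # escalation level 2
]

def notifications_sent(minutes_unclaimed):
    """Return the (recipient, channel) pairs notified so far for an unclaimed alert."""
    return [
        (recipient, channel)
        for threshold, recipient, channel in ESCALATION_CHAIN
        if minutes_unclaimed >= threshold
    ]
```

An alert claimed after 5 minutes only ever produces the SMS, whereas one left unclaimed for 25 minutes has reached the on-call engineer by SMS and phone and the team manager by phone.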

Scenario 4: Action integrations

Automatically execute external actions when an alert is triggered or recovered.

When an alert is triggered, automatically call Function Compute to scale out resources. When the alert is recovered, the system automatically closes the corresponding ITSM ticket.

Typical action patterns

Trigger

Action

Actions on trigger

  1. Call a webhook to send a notification to a Lark or WeCom group bot.

  2. Call Function Compute to automatically scale out, restart a container, or clear a cache.

  3. Call an HTTP API to create an ITSM ticket.

  4. Call a custom business API to trigger service degradation or rate limiting.

Actions on recovery

  1. Send a recovery message to an IM group.

  2. Trigger a resource scale-in.

  3. Close an ITSM ticket.

  4. Notify a monitoring dashboard to mark the event as recovered.

Procedure

  1. In Integration management, create an action target (such as a webhook, Function Compute, or HTTP callback) in advance. This is a prerequisite for configuring an action integration.

  2. Edit a notification policy, go to the Action Integration section, and enable the "Action Integration" switch.

  3. In the Actions on Trigger area, select one or more of the action targets you created. Do the same in the Actions on Recovery area.

  4. After you save the policy, you can go to Integration management > Call log to view the execution status and failure reasons for each action.

Note

Ensure that your action target uses idempotent logic to prevent side effects like repeated scaling operations or duplicate tickets from network retries or repeated notifications. For actions that involve changes, perform a canary release using a test notification policy first.
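
One common way to make an action target idempotent is to deduplicate on a key derived from the incoming request. The sketch below assumes that each call can be identified by an issue ID and a status ("triggered" or "recovered"); these field names are assumptions rather than a documented payload format, and the in-memory set stands in for a durable store such as a database.

```python
class IdempotentAction:
    """Wrap an action so it runs at most once per (issue_id, status) pair."""

    def __init__(self, action):
        self._action = action
        self._seen = set()  # replace with durable storage in a real service

    def __call__(self, issue_id, status):
        key = (issue_id, status)
        if key in self._seen:
            return False  # duplicate delivery: skip the side effect
        self._action(issue_id, status)
        self._seen.add(key)  # recorded only after the action succeeds
        return True

calls = []
scale_out = IdempotentAction(lambda issue_id, status: calls.append((issue_id, status)))
first = scale_out("issue-1", "triggered")
retry = scale_out("issue-1", "triggered")
recovered = scale_out("issue-1", "recovered")
```

The retried "triggered" call is ignored, so the scaling action runs once for the trigger and once for the recovery. Because the key is recorded after the action completes, a failed action can still be retried.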

Anti-patterns and troubleshooting

The following are common configuration anti-patterns. Review your notification policies against these patterns to avoid common pitfalls.

Anti-pattern 1: One-to-one binding

Not recommended
  • Maintaining a separate notification policy for each alert rule.

  • The number of notification policies grows in step with the number of alert rules, leading to high maintenance costs.

  • Changing a recipient requires modifying each policy individually.

Recommended practice
  • Create notification policies based on recipient groups. A single policy should serve a single group of recipients.

  • To cover a new alert rule, check that its events match the policy's event subscription conditions, and update the conditions only if they do not.

  • To change recipients or notification frequency, modify only the notification policy, not the alert rules.

Anti-pattern 2: Overusing repeated notifications

Not recommended
  • Setting a 1-minute repeat interval for all alerts, regardless of their level, creates excessive noise.

  • Low-priority alerts drown out genuinely urgent P0/P1 alerts.

  • Responders gradually start ignoring all notifications.

Recommended practice
  • Configure different settings based on alert level: a 5-minute repeat interval plus an escalation policy for P0/P1, a 30-minute interval for P2, and no repetition for P3/P4.

  • Use a longer repeat interval at night to avoid unnecessary wake-ups.

  • Integrate the digital assistant's root cause analysis (RCA) so that the first notification carries more information for decision-making.

Anti-pattern 3: No catch-all rule

Not recommended
  • Relying solely on silence rules to suppress noise without a catch-all routing rule.

  • If a silence rule is misconfigured, all alerts can be silently dropped.

  • When new resources or namespaces are added without a corresponding route, no one receives the alerts.

Recommended practice
  • Include a catch-all routing rule with the condition "Unlimited" in every notification policy, directed to an SRE inbox.

  • Use silence rules only for known, specific noise. Always set an end time to prevent permanent silencing.

  • Periodically review the hit rate of your notification policies and remove any routing rules that are not being used.
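
A periodic review for missing catch-all rules can be automated if you keep a description of your policies outside the console. The check below is a sketch over a hypothetical in-house representation, where each policy maps to a list of (condition, target) routing rules and a condition of `None` stands for "Unlimited". It does not call any Cloud Monitor API.

```python
def missing_catch_all(policies):
    """Return the names of policies that have no "Unlimited" routing rule."""
    return [
        name
        for name, rules in policies.items()
        if not any(condition is None for condition, _target in rules)
    ]

policies = {
    "test-env": [
        ({"tag": "team-a"}, "Business A DingTalk group"),
        (None, "SRE DingTalk group"),
    ],
    "legacy": [
        ({"tag": "team-b"}, "Business B DingTalk group"),
    ],
}
unprotected = missing_catch_all(policies)
```

In this example only the "legacy" policy is flagged, because an event without the team-b tag would match none of its routing rules.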

FAQ

Routing rule limits per policy

A single notification policy supports a large number of routing rules, sufficient for most scenarios. Split rules for different business domains into separate policies to keep each policy maintainable. For specific limits, refer to the quotas displayed in the console.

Routing rules vs. alert rules

An alert rule gains notification capabilities by associating with a notification policy. A single alert rule can be associated with multiple notification policies. The routing rules inside the policy dispatch matched events to specific notification objects, creating an upstream-downstream relationship: an alert rule specifies which policy to use, and the policy's routing rules determine who receives the alert.

Time-based routing

Create two routing rules within the same notification policy. For routing rule 1, set Effective Days to "Monday to Friday" and the notification object to Team A. For routing rule 2, set Effective Days to "Saturday, Sunday" and the notification object to On-call Schedule B. You do not need to create multiple notification policies.
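
The effect of the two rules can be sketched as a weekday lookup. The rule representation and the `targets_for` helper are assumptions for illustration; weekday numbers follow Python's `date.weekday()`, where Monday is 0 and Sunday is 6.

```python
from datetime import date

ROUTING_RULES = [
    ({0, 1, 2, 3, 4}, "Team A"),     # routing rule 1: Monday to Friday
    ({5, 6}, "On-call Schedule B"),  # routing rule 2: Saturday, Sunday
]

def targets_for(day):
    """Return the notification objects whose effective days include the given date."""
    return [target for days, target in ROUTING_RULES if day.weekday() in days]

weekday_targets = targets_for(date(2024, 1, 1))  # a Monday
weekend_targets = targets_for(date(2024, 1, 6))  # a Saturday
```

Because the two sets of effective days do not overlap, exactly one rule applies on any given day.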

Escalation policies vs. repeated notifications

Escalation policies and repeated notifications do not conflict. A repeated notification continuously reminds the original notification object at a fixed interval, while an escalation policy notifies higher-level personnel if the alert is not claimed within a specified time. For high-priority alerts, enable both to ensure nothing is missed.

Policies for different environments

Create one notification policy per environment. Use event subscriptions to filter events by cluster_name or a custom environment label. Within each policy, handle variations for different business teams or time periods by using multiple routing rules and effective time settings, eliminating the need for a separate policy per team.

Effective time for policy changes

Changes take effect immediately for all alert rules that use the policy. You do not need to re-associate the rules. Before making changes, evaluate the impact on all alert rules that share the policy.