Alerting


Simple Log Service (SLS) Alerting is an AIOps platform for alert monitoring, noise reduction, incident management, and notification dispatching.

Architecture

Alerting consists of three subsystems: alert monitoring, alert management, and action management.

Alerting features

Key features

Category

Subcategory

Feature

Description

Alert monitoring

Basic capabilities

Query and analyze logs

Run queries and analyses using the Query syntax and SQL-92.

Query and analyze time-series data

Run analyses using SQL-92 and PromQL. For more information, see Syntax for querying and analyzing time-series data.

Machine learning

Use AIOps algorithms for prediction, anomaly detection, and root cause analysis. For more information, see Machine learning syntax.

Correlated monitoring

Correlated monitoring across multiple Logstores or MetricStores

Use SQL JOIN statements or set operations for correlated monitoring across multiple Logstores or MetricStores.

Correlated monitoring between Logstores and MetricStores

Use SQL JOIN statements or set operations to implement correlated monitoring between Logstores and MetricStores.

Cross-Project correlated monitoring

Use set operations to implement cross-Project correlated monitoring.

Cross-region correlated monitoring

Use set operations to implement cross-region correlated monitoring.

Cross-account correlated monitoring

Use set operations to implement cross-account correlated monitoring.

Allowlist/denylist monitoring

Use Resource Data for allowlist/denylist monitoring.

Monitoring rule orchestration

Configure no-data alerts

Trigger an alert when a query and analysis operation returns no data.

Set alert severity

Set static and dynamic alert severity levels.

Set labels and annotations

Add custom labels and annotations to alerts. Annotation values support variables.

Grouped evaluation

Group query and analysis results.

Alert recovery

Enable resolved alert notifications.

Set consecutive trigger threshold

Set a consecutive trigger threshold to suppress alerts.

Disable monitoring tasks

Temporarily or permanently disable monitoring tasks.

Paused tasks can resume automatically at a scheduled time.

Alert management

Noise reduction

Deduplicate identical alerts

Within a time window, you can deduplicate identical alerts or delay their notifications. For more information, see Deduplicate alerts based on fingerprints.

Alert grouping

Grouping policies combine alerts with the same attributes into a single notification. For more information, see Multiple alert grouping methods.

Alert silence

Create silence policies to prevent matching alerts from triggering notifications during a specified period.

Action management

Action policy

Dynamic dispatching of notification channels

Dynamically dispatch alert notifications to users, user groups, or on-duty groups through specific channels. For more information, see Action policy.

Recipients

Users

Individual users. For more information, see Create users and user groups.

User groups

A group that contains multiple users. For more information, see Create users and user groups.

On-duty groups

Create on-duty groups that include users and user groups. Arrange rotating on-call shifts based on periods and working hours. For more information, see Create an on-duty group.

Channel calendars

Holiday awareness

Automatically adjusts notification methods during holidays.

On-call schedule

Rotation

Automate on-call rotations for users and user groups on a specified cycle.

Override

Configure temporary shift overrides for a specific period.

Holiday awareness

Automatically adjust rotation or override schedules based on holidays.

Independent calendars

Configure separate, resettable calendars for on-duty groups.

Notification channels

SMS notifications

Sends alert content through SMS messages.

Voice notifications

Sends alert content through voice calls.

Email notifications

Sends alert content through email.

DingTalk notifications

Sends alert content through a DingTalk chatbot.

Webhook notifications

Sends alert notifications to a custom webhook address by using HTTP or HTTPS calls.

Use webhooks to extend notification channels to platforms such as WeCom, Lark, and Slack.

Message Center

Sends alert content through the Alibaba Cloud Message Center.

Alert analysis

Global Alert Center

Execution history report for alert monitoring rules

Provides execution history reports for alert monitoring rules to help with troubleshooting.

Alert Rule Center

Provides a dashboard to view the overall execution status and triggered alert status of alert rules.

Alert Trace Center

Provides a dashboard that shows the entire alert lifecycle, from generation and management to final notification.

Alert Troubleshooting Center

Provides a troubleshooting center that displays errors from various stages, including alert monitoring, management, and notification, to simplify debugging.

Centralized storage

Centralized alert storage allows you to easily view received, processed, and sent alerts and their related logs.

After you initialize the alerting feature, SLS automatically creates a Project named sls-alert-<ACCOUNT_ID>-<REGION> and a Logstore named internal-alert-center-log in the selected region to store alerts.

Benefits

  • Easy to start and scale

    SLS provides end-to-end log and time-series data processing: ingest, store, query, analyze, visualize, and alert. After importing data, create monitoring tasks, notification channels, and alert policies within minutes.

    Scale your alerting configuration from small teams to enterprise-wide scenarios.

  • High availability and reliability

    Built on SLS infrastructure, Alerting provides 99.9% service availability and over 99.99999999% data reliability for alert-related data.

  • Low cost and maintenance-free

    Alert monitoring and incident management are currently free. Only SMS and voice call notifications incur a small fee.

    As a fully managed SaaS service, Alerting eliminates the operational overhead of running your own alerting system.

  • Fast response to issues

    Intelligent monitoring and incident management accelerate alert response and issue resolution, reducing business disruption losses.

Use cases

DevOps

Monitor all stages of the development lifecycle. Track Kubernetes logs, application logs, and metrics across development, staging, and production environments. When errors or anomalies such as latency spikes are detected, the responsible developers are notified immediately.

Built-in rule templates in SLS applications such as Log Audit Service and SLB Log Center simplify monitoring setup.

ITOps

Monitor stability metrics such as response time, load, and error rates in real time. Alerting supports noise reduction, grouping, and dynamic dispatching based on custom dimensions. Alerts are automatically assigned to the on-call engineer based on schedules and calendars, with automated workflows for resolution notifications, status updates, and escalation.

AIOps

Combine SLS machine learning with Alerting to monitor log and time-series data. SLS provides over a dozen ML algorithms, including smoothing, prediction, decomposition, clustering, and pattern mining, that you can use directly in alert monitoring rules. For more information, see Machine learning functions. The ML service uses streaming statistics or graph algorithms to detect anomalies and route them to the alerting system.

SecOps

Continuously monitor audit and security data to identify compliance anomalies and threat events. Alerting automatically dispatches notifications based on event severity and source, and supports workflow automation such as security posture dashboards.

SLS Log Audit Service automates cross-account collection of compliance and security logs from major Alibaba Cloud products, with built-in threat intelligence integration and nearly 100 monitoring rule templates.

BizOps

Track business metrics such as user activity, ad click-through rates, and cloud product bills to detect anomalies like unusual charges. Log on to the Billing Management console to view details.

Key concepts

Term

Description

Logstore

Logstores store log data with query and analysis capabilities (SQL-92). Alert monitoring depends on this feature.

MetricStore

MetricStores store time-series data with query and analysis capabilities (SQL-92 and PromQL). Alert monitoring depends on this feature.

alert

When used alone, "alert" refers to an alert event. For example, after an alert monitoring rule triggers one or more alerts, the alert management system passes them to the action management system.

When combined with other words, "alert" refers to a subsystem, feature, entity, or module of the alerting feature, such as the alert monitoring system or an alert monitoring rule.

Alert monitoring

A subsystem responsible for generating alerts. The alert monitoring system consists of alert monitoring rules and Resource Data.

It periodically evaluates data based on alert monitoring rules, assesses query and analysis results according to rule orchestration logic, and triggers alerts or resolved alerts, which are then sent to the alert management system.

Alert management

A subsystem responsible for noise reduction and managing alert statuses. The alert management system consists of alert policies, incident management, and alert overview dashboards.

The alert management system processes received alerts by routing, deduplicating, silencing, and grouping them based on alert policies before sending them to the action management system. It also supports setting incident stages and assignees.

Action management

A subsystem responsible for managing alert notification channels and recipients. The action management system consists of action policies, content templates, calendars, users, user groups, on-duty groups, and channel quotas.

The action management system dynamically dispatches alerts to specific notification channels based on action policies, which then notify the target users, user groups, or on-duty groups. It also supports customizing alert notification content.

Alert monitoring

The alert monitoring system generates alerts and consists of alert monitoring rules and Resource Data.

Alert monitoring

Term

Description

Alert monitoring rule

An alert monitoring rule contains query and analysis statements, target objects (Logstores, MetricStores, and Resource Data), and monitoring orchestration settings. For more information, see Create an alert monitoring rule.

Resource Data

An independent, editable, table-like storage structure for resource configurations and custom data used by the alerting system. Primarily used for correlated queries, such as allowlist/denylist scenarios.

For more information, see Create Resource Data.

Alert severity

A non-identifying attribute that indicates how serious an alert is. Levels: Critical, High, Medium, Low, and Report. For more information, see Set alert severity.

Grouped evaluation

A parameter in an alert monitoring rule. The system groups query and analysis results by the specified fields and evaluates each group independently against the trigger condition. This lets a single rule monitor multiple targets, with each group managed as a separate alert and incident. For more information, see Configure grouped evaluation.

Evaluation expression

A computational expression that uses a specific evaluation syntax to configure alert trigger conditions or dynamically assess alert severity.

Evaluation expressions support logical comparisons and calculations using fields from query and analysis results. A true result indicates a match. For more information, see Configure an evaluation expression.

Alert label

An identifying attribute of an alert in key-value format. Define custom labels in an alert monitoring rule. When an alert is triggered, the corresponding label is attached. Labels can be referenced in content templates and used as alert attributes for management and notification dispatching in alert management and action management.

  • When you monitor a MetricStore with grouped evaluation by label, SLS automatically uses the labels from the query and analysis results as the labels for the triggered alert.

  • When you monitor a Logstore with grouped evaluation by one or more fields, SLS automatically uses the key-value pairs of those grouping fields as the alert's labels.

For more information, see Labels.

Alert annotation

A non-identifying attribute of an alert in key-value format. Define custom annotations in an alert monitoring rule. When an alert is triggered, the corresponding annotation is attached. Annotations can be referenced in content templates and used as alert attributes for management and notification dispatching. For more information, see Annotations.

Resolved alert

A special type of alert notification indicating that the alert condition is resolved. A normal alert has a "triggered" status, while a resolved alert has a "resolved" status. When you enable this feature, if the previous check by the alert monitoring system triggered an alert, but the current check's results do not meet the trigger condition, a resolved alert is sent. In high-frequency monitoring scenarios, enabling resolved alerts ensures you are promptly notified when an issue is resolved. For more information, see Configure resolved alerts.
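The interaction between grouped evaluation, the consecutive trigger threshold, and resolved alerts can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the SLS implementation: the names `evaluate_groups` and `RuleState` are hypothetical, a group counts as matched if any of its rows meets the condition, and a group that is absent from the current results is treated as not matching.

```python
from collections import defaultdict


def evaluate_groups(rows, group_fields, condition):
    """Group rows by the given fields and evaluate each group independently.

    Returns a dict that maps each group key to True when at least one row
    in that group satisfies the condition (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple((field, row[field]) for field in group_fields)
        groups[key].append(row)
    return {key: any(condition(r) for r in members)
            for key, members in groups.items()}


class RuleState:
    """Tracks per-group status across consecutive checks."""

    def __init__(self, threshold=1):
        self.threshold = threshold       # consecutive matches needed to fire
        self.streak = defaultdict(int)   # current run of matches per group
        self.firing = set()              # groups that fired on the last check

    def check(self, results):
        """Return (group_key, status) events for one evaluation round."""
        events = []
        for key in sorted(set(results) | self.firing):
            if results.get(key, False):
                self.streak[key] += 1
                if self.streak[key] >= self.threshold:
                    self.firing.add(key)
                    events.append((key, "triggered"))
            else:
                self.streak[key] = 0
                if key in self.firing:
                    # Previous check fired, current one does not: resolved.
                    self.firing.discard(key)
                    events.append((key, "resolved"))
        return events


state = RuleState(threshold=2)
is_slow = lambda row: row["latency_ms"] > 500
checks = [
    [{"host": "a", "latency_ms": 900}, {"host": "b", "latency_ms": 100}],
    [{"host": "a", "latency_ms": 950}, {"host": "b", "latency_ms": 120}],
    [{"host": "a", "latency_ms": 80}, {"host": "b", "latency_ms": 110}],
]
for rows in checks:
    print(state.check(evaluate_groups(rows, ["host"], is_slow)))
```

With a threshold of 2, host `a` produces no event on the first check, a triggered event on the second, and a resolved event on the third, while host `b` never fires. Each group keeps its own state, which is why one rule can cover many targets.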

Alert management

The alert management system handles noise reduction and status management. It consists of alert policies, incident management, and alert overview dashboards.

Alert management

Term

Description

Alert policy

A configuration entity in the alert management system and a parameter in an alert monitoring rule. When the alert management system receives an alert (including a resolved alert), it automatically performs noise reduction and grouping based on the alert policy. The resulting grouped alerts are then sent to the action management system for notification.

Alert fingerprint

The alert management system calculates a fingerprint for each alert it processes. Alerts with the same fingerprint are considered identical. The fingerprint is calculated based on the alert's identifying attributes, including the Alibaba Cloud account ID, the Project where the alert resides, the alert rule ID, and the alert labels. For more information, see Deduplicate alerts based on fingerprints.

Alert silence

A configuration item in an alert policy. Based on the silence policy, the system ignores matching alerts during the specified period, suppressing notifications. For more information, see Alert silence mechanism.

Alert grouping

A configuration item in an alert policy. After receiving alerts, the system groups matching alerts into a set according to the grouping policy. After delay and deduplication, the set is sent to the action management system. For more information, see Multiple alert grouping methods.

Grouped set

A collection that stores grouped alerts, containing one or more alerts with different fingerprints. After processes like delay and deduplication, the grouped set is sent to the action management system for notification.
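To make the relationship between fingerprints and grouped sets concrete, here is a minimal Python sketch. It hashes the identifying attributes named above (account ID, Project, alert rule ID, and labels) and keeps one alert per fingerprint inside each group. The hashing scheme (SHA-256 over canonical JSON), the dictionary layout of an alert, and the function names are assumptions made for illustration; this document does not describe the algorithm that SLS actually uses.

```python
import hashlib
import json
from collections import defaultdict


def fingerprint(alert: dict) -> str:
    """Hash only the identifying attributes of an alert.

    Annotations and severity are non-identifying, so they are left out.
    SHA-256 over sorted JSON is an assumption of this sketch.
    """
    identity = {
        "account_id": alert["account_id"],
        "project": alert["project"],
        "rule_id": alert["rule_id"],
        "labels": alert.get("labels", {}),
    }
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_grouped_sets(alerts, group_by):
    """Group alerts by label names, keeping one alert per fingerprint."""
    grouped = defaultdict(dict)
    for alert in alerts:
        labels = alert.get("labels", {})
        key = tuple(labels.get(name) for name in group_by)
        # A later alert with the same fingerprint replaces the earlier one.
        grouped[key][fingerprint(alert)] = alert
    return {key: list(by_fp.values()) for key, by_fp in grouped.items()}


base = {"account_id": "1", "project": "demo", "rule_id": "rule-1"}
alerts = [
    {**base, "labels": {"env": "prod", "host": "a"}, "annotations": {"n": "1"}},
    {**base, "labels": {"host": "a", "env": "prod"}, "annotations": {"n": "2"}},
    {**base, "labels": {"env": "prod", "host": "b"}},
]
for key, members in build_grouped_sets(alerts, ["env"]).items():
    print(key, len(members))
```

The first two sample alerts differ only in an annotation and in label order, so they share a fingerprint and collapse into one entry. The third has a different `host` label, so the `prod` grouped set holds two alerts with different fingerprints.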

Action management

The action management system manages notification channels and recipients. It consists of action policies, content templates, calendars, users, user groups, on-duty groups, and channel quotas.

Action management

Term

Description

Action policy

An action policy is a configuration entity in the action management system. When the action management system receives a grouped set of alerts (including resolved alerts) from the alert management system, it uses the action policy to dynamically dispatch notifications to specific channels. These channels then notify the target users, user groups, or on-duty groups.

For more information, see Action policy.

Webhook integration

Manage webhook notification channels. Use webhooks directly in action policies. SLS supports DingTalk, WeCom, Lark, Slack, and custom generic webhooks. For more information, see Create a webhook.

Content template

SLS sends alert content based on the defined content template. Each primary channel has a corresponding text template that supports referencing alert attributes with variables. For webhook channels, you can also configure the message entity format to adapt to specific protocols, such as the format required by WeCom. For more information, see Create a content template.

Calendar

An independent configuration in the action management system, including a global default calendar and custom calendars.

  • The global default calendar defines global calendar information, including time zone, holiday synchronization country, weekly workdays, and specific working hours.

    The sending period in an action policy uses the global default calendar to distinguish between workdays and non-workdays, and between working and non-working hours.

  • Custom calendars are used to define custom workdays and holiday periods and are exclusive to on-duty groups.

User

A configuration entity representing a specific recipient. It includes information such as user ID, username, phone number, and email. Use action policies to send alerts to target users. Set target users as assignees in incident management.

For more information, see Create a user.

User group

A configuration entity representing a virtual collection of users. It includes a user group identifier, group name, and a list of users. A user group can contain one or more users. Use action policies to send alerts to target user groups.

For more information, see Create a user group.

On-duty group

A configuration entity representing a collection of on-duty users and user groups. It includes an on-duty group identifier, group name, rotation and override configurations, and an associated calendar. An on-duty group can contain one or more users or user groups. Use action policies to send alerts to target on-duty groups.

For information about how to create an on-duty group, see Create an on-duty group.

Rotation

A configuration item in an on-duty group used to set up a rotation schedule for users or user groups. Add multiple rotation schedules to an on-duty group. Rotations support non-continuous time periods and dynamic handovers based on the calendar.

For more information, see Rotation and override scenarios.

Override

A configuration item in an on-duty group used to set up an override schedule for users or user groups. Add multiple override schedules to an on-duty group.

For more information, see Rotation and override scenarios.
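The way a rotation and an override combine can be illustrated with a short Python sketch. It assumes a fixed-length shift and gives an override precedence over the rotation inside its time window. The function name `on_call` and the sample users are hypothetical, and calendar handling (holidays, working hours, non-continuous periods) is deliberately left out.

```python
from datetime import datetime, timedelta


def on_call(at, members, start, shift=timedelta(days=1), overrides=()):
    """Return who is on call at time `at`.

    members:   ordered users or user groups in the rotation
    start:     time at which members[0] takes the first shift
    shift:     length of each shift (a fixed cycle is assumed)
    overrides: (begin, end, substitute) tuples; an override wins over
               the rotation while begin <= at < end
    """
    for begin, end, substitute in overrides:
        if begin <= at < end:
            return substitute
    elapsed_shifts = (at - start) // shift   # whole shifts since start
    return members[elapsed_shifts % len(members)]


members = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1)
overrides = [(datetime(2024, 1, 2, 8), datetime(2024, 1, 2, 18), "dave")]

print(on_call(datetime(2024, 1, 2, 7), members, start, overrides=overrides))
print(on_call(datetime(2024, 1, 2, 9), members, start, overrides=overrides))
```

In this example the rotation puts `bob` on call for January 2, but the override hands the 08:00 to 18:00 window to `dave`. Outside that window the rotation applies again.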

Limitations

Category

Item

Description

Alert monitoring

Maximum number of alert monitoring rules

Create up to 100 alert monitoring rules per Project.

General limits on query and analysis

For more information, see Query and analyze data.

Concurrency limits on query and analysis

If a large number of query and analysis operations are performed simultaneously in a Project (for example, through an SDK) and many alert monitoring rules are created, the number of concurrent queries may exceed the Project's limit, causing monitoring to fail. We recommend setting Dedicated SQL to Auto when creating alert monitoring rules to support higher concurrency. If you use Dedicated SQL, ensure the target Project has sufficient Dedicated SQL CUs. To create an alert monitoring rule, see Create an alert monitoring rule. To enable Dedicated SQL, see High-performance and fully accurate queries and analysis (Dedicated SQL).

Query and analysis syntax limits

Only query statements, SQL analysis statements, and SQL+PromQL statements are supported. Phrase query statements and Scan mode query and analysis statements are not supported.

Note

Alert monitoring rules that use phrase queries or Scan mode query and analysis statements (SPL syntax) can be created, but they might fail or produce unexpected results during runtime.

Limits on a single query and analysis result

  • When using only a query statement, the system returns up to 100 rows by default. Trigger conditions based on the number of data rows may be inaccurate. We recommend using the COUNT function for statistics in such cases.

  • An analysis statement returns up to 100 rows by default. If you need more data, use the LIMIT clause.

    If the result of an analysis statement exceeds 1,000 rows, the system only uses the first 1,000 rows for set operations.

  • When there are three query and analysis operations and Set Operation is not set to No Union, only the first 100 rows from each result are used.

  • When there are two or more query and analysis operations, Set Operation is set to No Union, and No-data Alert is enabled, the system only uses the result of the first query and analysis statement to check for no data.
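The reason a row-count condition can be inaccurate with a query-only statement is easiest to see with a small simulation. The Python sketch below does not call SLS. It models the default 100-row cap described above with plain lists, and the function names are hypothetical.

```python
DEFAULT_ROW_LIMIT = 100  # default cap stated above for query-only results


def run_query_only(matching_logs):
    """Simulate a query-only statement: at most 100 rows come back."""
    return matching_logs[:DEFAULT_ROW_LIMIT]


def run_count(matching_logs):
    """Simulate an analysis statement that aggregates with COUNT."""
    return [{"cnt": len(matching_logs)}]


logs = [{"status": 500}] * 250   # 250 log entries match the query

rows = run_query_only(logs)
print(len(rows) > 200)                   # row-count condition

count_rows = run_count(logs)
print(count_rows[0]["cnt"] > 200)        # condition on the COUNT value
```

The row-count condition can never exceed 100 and so never fires for a threshold of 200, while the COUNT-based condition sees the full total in a single result row.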

Number of combined queries

1 to 3.

Field value length

If a field value exceeds 1,024 characters, only the first 1,024 characters are used for analysis.

Query and analysis time range

The time span for each query and analysis statement cannot exceed 24 hours.

Resource Data update latency

Updates to Resource Data are not immediate. The changes take effect within 15 minutes.

Alert management

Alert policy evaluation interval

The minimum evaluation interval is 15 seconds. Even if you set a smaller value, checks are performed at 15-second intervals.

Policy matching conditions

In configurations like alert policies and action policies, we recommend using conditions based on Project name, alert rule ID, alert name, severity, or short labels or annotations.

  • For string matching, we recommend using short, plain strings like foobar.

    Matching newlines or the double quote character (") is not supported. For example, foo "bar" cannot be parsed correctly.

  • Regular expression matching does not support glob expressions. For example, *Error is a glob expression, while .*Error is a regular expression.
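The difference between the two pattern styles can be checked locally before you save a policy. The sketch below uses Python's `re` and `fnmatch` modules. Python's regular expression dialect is not guaranteed to match the one SLS uses, so treat it as an illustration of the glob versus regular expression distinction rather than a validator for SLS conditions.

```python
import fnmatch
import re


def is_valid_regex(pattern: str) -> bool:
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(is_valid_regex("*Error"))    # glob: a leading * has nothing to repeat
print(is_valid_regex(".*Error"))   # regular expression

# fnmatch.translate converts a glob into an equivalent Python regex.
converted = fnmatch.translate("*Error")
print(re.match(converted, "TimeoutError") is not None)
```

If you have existing glob-style patterns, rewriting `*` as `.*` (and `?` as `.`) by hand is usually enough for short conditions such as the ones recommended above.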

Number of incidents

Up to 1,000 incidents are retained for 30 days. Older incident data is automatically overwritten.

Incident comments

Each incident can have up to 10 comments. Older comments are automatically overwritten.

Policy configuration update latency

Changes to alert-related policies, such as alert policies, action policies, content templates, users, user groups, and on-duty groups, typically take effect within one minute.

Action management

Notification channels

The following limits apply to notification channels. Exceeding these limits may prevent you from receiving alert notifications. If you do not receive a notification, you can check for related errors in the Global Alert Troubleshooting Center. For more information, see Global Alert Troubleshooting Center.

  • Voice call

    Only mobile phone numbers in the Chinese mainland (+86) are supported.

    Note
    • If an alert voice call is not answered, it will not be redialed. An SMS notification will be sent once instead.

    • You are charged once for each voice call, regardless of whether it is answered. The fallback SMS message does not incur additional charges.

  • DingTalk

    DingTalk chatbots are limited to 20 messages per minute.

  • WeCom

    WeCom chatbots are limited to 20 messages per minute.

  • Lark

    • Lark chatbots are limited to 20 messages per minute.

    • Mention can only be set to No mention or All. It cannot be set to Specific members.

  • Custom webhook

    • The address must be publicly accessible.

    • A webhook call is considered successful only if it returns a 200 status code. All other status codes indicate failure.

  • Function Compute

    Only functions starting with sls-ops- are supported.

For more information, see Notification channels.
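For the custom webhook channel, the receiving endpoint has to answer with status code 200 for the call to count as successful. The following Python sketch shows a minimal receiver built on the standard library. The class name, the `received` list, and the local bind address are assumptions for illustration. A real endpoint must be publicly accessible, as stated above, and the payload format depends on your content template.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlertWebhookHandler(BaseHTTPRequestHandler):
    """Hypothetical receiver for a custom webhook notification."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)        # raw notification body
        self.server.received.append(payload)     # hand off for processing
        # Reply with exactly 200: any other status counts as a failure.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(host="127.0.0.1", port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    server = HTTPServer((host, port), AlertWebhookHandler)
    server.received = []
    return server


# To run it: make_server("0.0.0.0", 8080).serve_forever()
```

Returning 200 only after the payload has been stored keeps the acknowledgement honest. Frameworks that default to 201 or 204 for POST handlers would need an explicit 200 here.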

Notification content

Each notification channel has a content length limit. To ensure successful delivery, the system may truncate oversized content. Truncation does not guarantee content integrity or 100% delivery success, especially if the truncated content results in invalid Markdown or HTML. For plain text formats like SMS and voice calls, truncation generally does not cause delivery failure.

Configure content templates according to channel limits to avoid delivery failures caused by oversized content. The limits for each channel are as follows (Chinese characters, English letters, numbers, and punctuation all count as one character):

Note

If a field value exceeds 1,024 characters, only the first 1,024 characters are used.

  • SMS

    The content is limited to 256 characters.

  • Voice call

    The content is limited to 256 characters.

  • Email

    The content is limited to 8 KB.

  • DingTalk

    The content is limited to 8 KB.

  • WeCom

    • The content is limited to 4 KB.

    • When Mention is set to All or Specific members, the content must be in plain text format and does not support Markdown.

  • Lark

    The content is limited to 8 KB.

  • Slack

    The content is limited to 8 KB.

  • Custom webhook

    The content is limited to 16 KB.

  • Message Center

    The content is limited to 8 KB.

  • Function Compute

    The content is limited to 16 KB.

  • EventBridge

    The content is limited to 16 KB.
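The per-channel limits above can be captured in a lookup table so that rendered template output is checked before it is sent. The Python sketch below is a hypothetical helper, not part of SLS. It assumes that 1 KB corresponds to 1,024 characters, following the statement above that every character counts as one, and the channel keys are names chosen for this example.

```python
KB = 1024  # assumption: 1 KB is counted as 1,024 characters

# Limits copied from the list above; keys are hypothetical channel names.
CHANNEL_LIMITS = {
    "sms": 256,
    "voice": 256,
    "email": 8 * KB,
    "dingtalk": 8 * KB,
    "wecom": 4 * KB,
    "lark": 8 * KB,
    "slack": 8 * KB,
    "webhook": 16 * KB,
    "message_center": 8 * KB,
    "function_compute": 16 * KB,
    "eventbridge": 16 * KB,
}


def overflow(content: str, channel: str) -> int:
    """Return how many characters exceed the channel limit (0 if it fits)."""
    return max(0, len(content) - CHANNEL_LIMITS[channel])


def fit_to_channel(content: str, channel: str) -> str:
    """Cut content to the channel limit.

    Plain slicing can leave Markdown or HTML unbalanced, which is the
    integrity risk described above, so prefer shortening the template.
    """
    return content[:CHANNEL_LIMITS[channel]]


message = "High latency on host a. " * 20
print(overflow(message, "sms"))
print(len(fit_to_channel(message, "sms")))
```

Checking `overflow` while you design a content template is more reliable than depending on truncation, because a shortened template keeps its Markdown or HTML well formed.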

Content template configuration

An incorrect content template configuration may cause rendering to fail and return an error. If you receive an alert notification containing an error message like Template render error, check your template configuration against the Content template syntax (New) and the error message.

Content template variables

The content length is limited to 2 KB. Content exceeding this limit will be truncated.

Billing

Fees are incurred for SMS and voice call notifications. For detailed pricing, see Pricing.

Operation

Description

SMS notifications

Fees are charged per alert SMS notification sent.

Note

Some carriers may split long SMS messages (for example, over 70 characters) into two messages. If your message is long, you may receive two separate messages, but SLS will only charge for one.

Voice notifications

Fees are charged per alert voice call.

Note
  • If an alert voice call is not answered, it will not be redialed. A single SMS notification will be sent instead.

  • You are charged once for the voice call whether it is answered or not. The fallback SMS message does not incur additional charges.