Intelligent network O&M solutions


This document introduces the design goals and common scenarios of the Alibaba Cloud Intelligent Network O&M Solution. It explains how to achieve efficient, proactive, and intelligent network operations and maintenance (O&M) with four key methods: use a central dashboard to obtain a global overview, use alerts to quickly detect and locate issues, use inspections to proactively find and eliminate risks, and use tools to analyze and resolve root causes. It also provides guidance on selecting the monitoring and event platforms that support these capabilities, and it includes detailed instructions for configuring dashboards and alerts for various Alibaba Cloud networking services to help you apply these practices to your business.

1. Background and requirements

As digital transformation deepens, the cloud network has become the core infrastructure that supports business operations. However, the growing complexity, dynamic nature, and scale of cloud environments present unprecedented challenges to traditional network O&M:

  • Lack of a global view: Network resources are scattered across different products, which prevents a unified view for monitoring, analysis, and alerting. This fragmentation makes global optimization difficult.

  • Increased complexity: The widespread adoption of technologies such as hybrid cloud, multicloud architectures, microservices, and containerization (such as ACK) makes network topologies increasingly complex and difficult for traditional manual O&M to manage.

  • Difficulty in finding performance bottlenecks: Issues such as traffic bursts, bandwidth bottlenecks, and latency jitter are hard to detect and predict in real time, which degrades the user experience.

  • Difficulty in fault localization: Troubleshooting network link issues across regions, VPCs, accounts, and products is time-consuming. It often relies on experience, resulting in low fault localization efficiency.

  • Increased security risks: The attack surface is expanding. Incorrect configurations or improper change management of security policies, such as security groups and network ACLs, can lead to security vulnerabilities.

  • High O&M costs: Relying on a large amount of manual effort for daily inspections, fault response, and configuration management is inefficient and results in high labor costs.

The Intelligent Network O&M Solution aims to help customers effectively adopt, use, and manage the cloud. The solution focuses on guiding customers through their daily O&M tasks, such as monitoring network metrics, identifying network risks, and analyzing and resolving network anomalies. It also helps them upgrade and optimize network performance to meet the new requirements introduced by business iterations.

2. Target customers

This solution is suitable for the following types of customers:

  • Large enterprises and group customers: Customers with complex hybrid cloud or multicloud architectures, multi-region deployments, and many VPCs and network resources. These customers have extremely high requirements for network stability, security, and O&M efficiency.

  • Internet and technology companies: Businesses with fast iterations and large traffic fluctuations. These businesses have strong demands for network performance, elasticity, and automatic fault recovery, and they strive for high DevOps/NetOps efficiency.

  • Customers in key industries such as finance, government, and healthcare: Customers with extremely strict requirements for network high availability and security compliance. They need to meet rigorous audit and regulatory requirements.

  • Traditional enterprises undergoing digital transformation: Enterprises that are migrating from traditional data centers to the cloud and need to quickly establish modern network O&M capabilities.

  • IT/Network O&M teams: Teams that want to use intelligent tools to improve O&M efficiency, reduce failure rates, and free up human resources to focus on higher-value work.

3. Solution overview

The Intelligent Network O&M Solution recommends four main methods for customers to perform cloud network O&M:

  • Use a central "dashboard" to gather data and understand the global situation.

  • Rely on "alerts" to detect and locate issues.

  • Perform daily "inspections" to find and eliminate potential risks.

  • Use "tools" to analyze and resolve root causes.

3.1 Use a central "dashboard" to gather data and understand the global situation

A network dashboard is more than just a data visualization board. It is a cloud network operational hub that integrates monitoring, analysis, decision-making, and collaboration. A dashboard should be designed with the following guidelines:

  1. A network dashboard provides data support for specific roles to solve specific problems:

  • Business owners: Focus on overall availability, Service-Level Agreement (SLA) achievement, and cost trends.

  • O&M engineers: Focus on fault alerts, link status, and performance anomalies.

  • Architecture teams: Focus on topology, capacity usage, and scaling bottlenecks.

Suggestion: You can design different views for different roles, such as an "O&M View", "Business View", and "Architecture View".

  2. Display information in layers based on the network architecture to avoid information overload:

  • Internet access layer: EIPs, Internet Shared Bandwidth, NAT Gateway, and others.

  • Application delivery layer: CLB, ALB, NLB, GA, and others.

  • Global networking layer: VPN Gateway, Express Connect, TR, CEN, and others.

Suggestions: 1) Support a three-level top-down drill-down from "Network Product Overview" to "Specific Product Overview" to "Single Instance Details". This lets you move from the overall picture to specific details. 2) Place related metrics on the same chart to increase information density. 3) Use good naming conventions, resource group divisions, and tags to help locate issues quickly.

  3. Focus on key services and metrics. Place important metrics on the dashboard. Other metrics do not require daily attention and are used for analysis only when related issues occur:

  • Traffic: Peak inbound and outbound bandwidth, traffic trends, and others.

  • Availability: SLB health check success rate, status of leased line, VPN, or cross-region links, and others.

  • Performance: Latency, response time, bandwidth utilization, packet loss rate, and others.

  • Cost: Number of EIPs, CDT bandwidth costs, network element CU costs, and others.

Suggestion: You can use "red/yellow/green" colors on the dashboard to indicate health status.
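
As a minimal sketch of the red/yellow/green suggestion, the following Python function maps a utilization percentage to a status color. The function name and the 50 and 80 thresholds are illustrative assumptions (they mirror the utilization markers used in the dashboard references in section 6.1), not product defaults. Adjust them to your own service targets.

```python
def health_color(utilization_pct, yellow_at=50.0, red_at=80.0):
    """Map a utilization percentage to a dashboard status color.

    The thresholds are assumptions for illustration, not product defaults.
    """
    if utilization_pct > red_at:
        return "red"
    if utilization_pct > yellow_at:
        return "yellow"
    return "green"


# Example: color three hypothetical utilization readings.
for value in (12.5, 63.0, 91.4):
    print(value, health_color(value))
```

Keeping the thresholds as parameters lets each role-specific view apply its own limits without changing the logic.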

3.2 Rely on "alerts" to detect and locate problems

  • Event subscription mechanism: Subscribe to events that affect your business and set up an alert mechanism. This step helps you discover system anomalies, performance issues, or security threats as soon as they occur.

  • Immediate response process for critical alerts: Create a strict emergency response plan. For alerts marked as "Critical", have a clear plan and assign a dedicated person to coordinate and handle the issue until it is fully resolved.

  • Regularly check the Event Center: Set up a regular check-in schedule to review the history in the Event Center. Analyzing this data can help you identify trending issues or underlying risks in advance and take preventive measures to avoid service interruptions.

3.3 Rely on "inspections" to find and eliminate potential risks

Regular inspections are a crucial part of building and maintaining a network architecture. First, identify and understand the potential risks, which fall into four categories:

  • Stability risks: These are mainly caused by incorrect primary/secondary configurations, which can prevent a smooth failover during a failure and affect normal system operations. Unreasonable resource deployment can also lead to a large blast radius, which increases the possibility of a system crash.

  • Security risks: These involve vulnerabilities in network ACL configurations and overly permissive security group permissions. These issues can create security holes and leave the network environment vulnerable.

  • Performance risks: These often manifest as network path detours, which increase data transmission latency. Frequent traffic overruns indicate that the system may need to be scaled out to meet growing demand.

  • Cost waste: Low resource utilization and an unsuitable choice among the available billing methods can result in unnecessary expenses.

To effectively manage these risks, you need to perform regular inspections. In the NIS console, you can conduct network inspections, view historical reports, and initiate new inspections as needed. This process generates a detailed inspection report. We recommend running it once a week to promptly discover and address potential issues. Once an issue is found, you should immediately enter the risk handling stage. In this stage, you can use the NIS console and network inspection tools to view detailed reports, obtain network optimization suggestions, and take corresponding measures to mitigate the risks based on these suggestions. For example, for stability risks, you can optimize primary/secondary configurations and resource deployment. For security risks, you can patch network ACL configuration vulnerabilities and adjust security group permissions. For performance risks, you can optimize network paths and perform necessary scale-outs. For cost waste, you can improve resource utilization and choose more suitable billing methods.

3.4 Use "tools" to analyze and resolve root causes

NIS: Network Intelligence Service (NIS) is an intelligent network service product launched by Alibaba Cloud. It is based on years of large-scale network O&M practices and technological expertise and is designed for complex network scenarios. It integrates network measurement, diagnosis, and optimization to provide end-to-end network observability and intelligent analysis capabilities. NIS helps users quickly locate connectivity, performance, and fault issues across regions and network domains to achieve a cloud network O&M experience that is "visible, fast to check, and manageable". NIS provides a rich set of tools to analyze and resolve root causes. When you observe anomalies on the dashboard, when an alert occurs, or when an inspection report provides optimization suggestions, you can use the tools provided by NIS to perform the following functions:

  • Instance diagnosis: You can detect the configuration and running status of an instance and receive quick fixes based on the diagnosed anomalies.

  • Path analysis: You can analyze end-to-end network connectivity and diagnose connection problems caused by network configuration errors. When a destination is unreachable, you can identify the location and cause of the block.

  • Network traffic analysis: You can monitor real-time and historical traffic data and metrics in the network to understand the performance and load of your network applications. You can keep this feature enabled to continuously monitor and analyze network traffic based on data such as throughput, packet loss, latency, and user distribution. This helps O&M engineers optimize the business architecture based on traffic conditions.

  • Network Insight: You can analyze the real-time operational status of business unit traffic to help you promptly perceive business network anomalies. It also provides network quality assessment data and event impact analysis.

  • Network topology: You can quickly understand your Alibaba Cloud network architecture, perform network configuration validation, and conduct unified O&M for your cloud network resources.

  • Performance monitoring: This feature provides average network latency data within Alibaba Cloud and across the Internet. This helps you choose a region or zone when you set up services.

4. Product portfolio

4.1 Monitoring platform comparison

There are many types of monitoring platforms. We divide them into three main categories based on their ecosystem: Alibaba Cloud CloudMonitor, Prometheus monitoring, and other monitoring platforms (such as Zabbix, the ELK Stack (Elasticsearch, Logstash, and Kibana), and OpenTelemetry). These are further divided into five subcategories: Alibaba Cloud CloudMonitor Basic, Alibaba Cloud Hybrid Cloud Monitoring (part of the Alibaba Cloud CloudMonitor category), ARMS Prometheus+Grafana, self-managed Prometheus+Grafana (part of the Prometheus category), and other monitoring platforms. The comparison is as follows:

CloudMonitor Basic

  • Advantages: Out-of-the-box, no configuration needed. Basic metrics (ECS, RDS, SLB) are free. Simple interface.

  • Disadvantages: Does not support cross-region aggregation. Does not support configuration files and has weak automation capabilities. Limited support for custom metrics. Weak visualization capabilities (few chart types). Cannot integrate non-Alibaba Cloud resources.

  • Description: Single-region deployment. Mainly for Alibaba Cloud services. Simple scenarios.

Hybrid Cloud Monitoring

  • Advantages: Supports cross-region resource aggregation. Can create advanced custom dashboards. Supports more cloud service metrics. Supports batch monitoring of hundreds of instances.

  • Disadvantages: Does not support configuration files and has weak automation capabilities. Does not support advanced query languages such as PromQL. Limited Kubernetes monitoring capabilities. Cannot integrate non-Alibaba Cloud resources.

  • Description: Mainly for Alibaba Cloud services. Multi-region deployment.

ARMS Prometheus+Grafana

  • Advantages: Natively integrated with Grafana for powerful visualization. Supports PromQL for flexible queries. Deeply integrated with Kubernetes. Supports Remote Write to aggregate multiple clusters. Supports custom business metrics (SDK/Exporter). Unified monitoring that supports multicloud and hybrid cloud scenarios.

  • Disadvantages: Steeper learning curve. Cost is based on data write volume and storage duration.

  • Description: Recommended as the default choice.

Self-managed Prometheus+Grafana

  • Advantages: Natively integrated with Grafana for powerful visualization. Supports PromQL for flexible queries. Deeply integrated with Kubernetes. Supports custom business metrics (SDK/Exporter). Unified monitoring that supports multicloud and hybrid cloud scenarios.

  • Disadvantages: Steeper learning curve.

  • Description: For existing self-managed Prometheus deployments.

Other monitoring platforms

  • Advantages: Extremely flexible and customizable. Unified monitoring that supports multicloud and hybrid cloud scenarios.

  • Disadvantages: Complex architecture that is difficult to integrate. Extremely high learning and maintenance costs. Requires a professional team for support. Difficult to integrate with Alibaba Cloud native services.

  • Description: For existing deployments of other monitoring platforms. Requires unified monitoring, logging, and tracing. Requires self-developed data collection plugins for Alibaba Cloud.

In practice, the set of monitoring platforms that a team uses at the same time is highly fragmented. Teams that invest little in O&M tend to adopt monitoring platforms passively, which leads to fragmentation. Teams with professional O&M staff try to unify their monitoring platforms but still cannot avoid it entirely. According to the Observability Survey 2024, 70% of teams use four or more different monitoring platforms. Prometheus+Grafana is the de facto standard in the current cloud-native ecosystem. It has an active community, rich documentation, convenient configuration, powerful features, and a high degree of integration, which makes it suitable for most modern application monitoring scenarios.

Alibaba Cloud's ARMS Prometheus is a managed Prometheus + Grafana service. It eliminates the complexity of deploying, maintaining, and scaling a monitoring system yourself. It is out-of-the-box and supports large-scale metric collection, storage, and visual analytics. It is deeply integrated with the Alibaba Cloud ecosystem and can seamlessly connect with various cloud resources and applications such as ECS, Container Service for Kubernetes (ACK), Serverless, and Application Real-Time Monitoring Service (ARMS), achieving full-stack observability. At the same time, ARMS Prometheus provides powerful alerting capabilities. It supports flexible alert rule configuration based on PromQL and can accurately detect metric anomalies such as increased latency, rising error rates, and resource usage exceeding limits. Alert rules support multi-level thresholds, duration judgments, grouping, and deduplication, effectively avoiding false positives and alert storms. When an alert is triggered, it can notify on-call personnel in real-time through various channels such as DingTalk, text messages, email, and Webhooks. It also integrates with the Alibaba Cloud alert center and event center to achieve unified management and a closed-loop response for the alert lifecycle. This helps teams quickly discover and handle potential risks, ensuring business stability.

4.2 Event platform comparison

Both NIS and CloudMonitor can generate "alert" events. The differences between them are as follows:

  • Scope

    • CloudMonitor: All cloud products.
    • NIS: Network products.

  • Type

    • CloudMonitor: Predefined system events + user-configured thresholds.
    • NIS: Predefined system events.

  • Requires configuration

    • CloudMonitor: Yes.
    • NIS: No.

  • Generation logic

    • CloudMonitor: Predefined system events are key events such as exceeding specifications, interruptions, and anomalies. User-configured threshold alerts are generated based on the customer's judgment of how a specific metric affects the system. The customer configures these alerts.
    • NIS: Only predefined system events.

  • Custom threshold

    • CloudMonitor: Yes.
    • NIS: No.

  • Comprehensiveness

    • CloudMonitor: Comprehensive.
    • NIS: Limited.

  • Notification methods

    • CloudMonitor: Supported. Phone, email, DingTalk, Lark, and WeCom. SLS, Message Queue for Light-weight IoT, and Function Compute.
    • NIS: Not supported. You can push NIS events to CloudMonitor events and then use CloudMonitor's notification methods.

  • A typical CloudMonitor alert is: When the real-time bandwidth of an EIP reaches 5 Gbps, the application's O&M engineer is notified.

  • A typical NIS alert is: When the bandwidth of an EIP bandwidth plan reaches 95% of its specification, an NIS event is generated. The user can notify the application's O&M engineer through CloudMonitor or poll for this event through an API for automated processing.

CloudMonitor supports system events and user-defined threshold alerts for all cloud products. It has complete alert configuration capabilities, such as metric thresholds, duration judgments, and multi-level alerts. In contrast, NIS only provides predefined system events for network products and does not support user-configured thresholds. Therefore, CloudMonitor is more suitable to serve as an enterprise-level unified alert center.
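
To make "metric thresholds, duration judgments, and multi-level alerts" concrete, here is a minimal Python sketch of how such a rule can be evaluated. It is not CloudMonitor's implementation. The function name, the level names, and the Gbps thresholds are hypothetical and chosen only to echo the EIP bandwidth example above.

```python
def evaluate_alert(samples, levels, min_consecutive=3):
    """Return the most severe level whose threshold was met or exceeded by
    each of the last `min_consecutive` samples, or None if no level fires.

    samples: metric values ordered oldest to newest (for example, in Gbps).
    levels: mapping of level name to threshold value.
    """
    if len(samples) < min_consecutive:
        return None
    recent = samples[-min_consecutive:]
    # Check the level with the highest threshold first.
    ordered = sorted(levels.items(), key=lambda item: item[1], reverse=True)
    for name, threshold in ordered:
        if all(value >= threshold for value in recent):
            return name
    return None


levels = {"Warn": 4.0, "Critical": 5.0}  # hypothetical thresholds in Gbps
print(evaluate_alert([3.9, 4.2, 4.5, 4.8], levels))
print(evaluate_alert([4.9, 5.1, 5.3, 5.2], levels))
print(evaluate_alert([5.5, 3.0, 5.5, 5.6], levels))
```

The duration check (several consecutive samples over the threshold) is what suppresses alerts on a single short spike, as in the third call.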

5. Scenarios

Core application scenarios include the following:

  • Network performance monitoring and optimization.

  • Anomaly detection and intelligent alerting.

  • Quick fault localization and root cause analysis.

  • Resource utilization analysis and cost optimization.

  • Automated network O&M.

  • Unified management of multicloud/hybrid cloud networks.

6. Configuration reference manual

6.1 Dashboard configuration reference

The following section introduces typical dashboard designs for cloud networks, layered by network architecture.

6.1.1 Public network business dashboard

Elastic IP Address (EIP) is the standard way Alibaba Cloud products provide public network access. Most network services support binding a user-provided EIP. However, for historical reasons, some products also support public network services without a user-provided EIP. See the following table and suggestions.

ECS

  • Public network type: EIP or public IP.

  • Suggestion: Use EIPs uniformly. Use EIPs for ECS, and use private network SLB instances with EIPs.

SLB

  • Public network type: EIP (private network SLB instances support binding EIPs to provide services over the public network) or SLB public IP (public network SLB instances).

ALB/NLB/NAT

  • Public network type: EIP.

VPN

  • Public network type: Public VPN gateways use non-user EIPs.

  • Suggestion: Set up a separate dashboard.

GA

  • Public network type: GA uses non-user EIPs.

For statistical metrics, the inbound and outbound bandwidth/traffic on an EIP always represents the actual bandwidth/traffic transmitted on that EIP. However, the object to monitor for utilization, packet loss, and billing differs based on the combination of EIP + "added to Internet Shared Bandwidth" + "CDT enabled". Refer to the following table for details.

  • EIP that is not added to an Internet Shared Bandwidth (cbwp) instance, with CDT not enabled

    • Bandwidth throttling granularity: Single EIP.
    • Billing: Single EIP.
    • Monitored object for utilization and throttling packet loss: The EIP.
    • Monitored object for the billable item: The EIP.

  • EIP that is not added to an Internet Shared Bandwidth (cbwp) instance, with CDT enabled

    • Bandwidth throttling granularity: Single EIP.
    • Billing: CDT.
    • Monitored object for utilization and throttling packet loss: The EIP.
    • Monitored object for the billable item: CDT.

  • EIP that is added to an Internet Shared Bandwidth (cbwp) instance, with CDT not enabled

    • Bandwidth throttling granularity: cbwp. The original bandwidth peak of the EIP is invalid and is the same as the bandwidth peak of the Internet Shared Bandwidth instance.
    • Billing: cbwp.
    • Monitored object for utilization and throttling packet loss: The cbwp.
    • Monitored object for the billable item: The cbwp.

  • EIP that is added to an Internet Shared Bandwidth (cbwp) instance, with CDT enabled

    • Bandwidth throttling granularity: cbwp. The original bandwidth peak of the EIP is invalid and is the same as the bandwidth peak of the Internet Shared Bandwidth instance.
    • Billing: CDT.
    • Monitored object for utilization and throttling packet loss: The cbwp.
    • Monitored object for the billable item: CDT.
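
The four combinations above reduce to two independent rules, which the following Python sketch encodes. The function name and the returned keys are hypothetical and exist only for illustration; the mapping itself follows the combinations listed above.

```python
def eip_monitoring_targets(in_shared_bandwidth, cdt_enabled):
    """Return which object to monitor for an EIP.

    in_shared_bandwidth: True if the EIP is added to an Internet Shared
        Bandwidth (cbwp) instance.
    cdt_enabled: True if CDT is enabled.
    """
    # Throttling (and therefore utilization and throttling packet loss)
    # follows the cbwp instance once the EIP joins one.
    throttling_object = "cbwp" if in_shared_bandwidth else "EIP"
    # Billing moves to CDT whenever CDT is enabled; otherwise it follows
    # the object that enforces the bandwidth limit.
    billing_object = "CDT" if cdt_enabled else throttling_object
    return {
        "utilization_and_throttling": throttling_object,
        "billable_item": billing_object,
    }


for joined in (False, True):
    for cdt in (False, True):
        print(joined, cdt, eip_monitoring_targets(joined, cdt))
```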

Public network dashboard design reference
  1. EIP dashboard. It supports filtering instances by region, resource group, instance ID, or IP address. The instance display table supports hyperlinks to the instance monitoring page. The panels are as follows:

  • Total EIP rate (time series)

    • Inbound rate: Sum. Left axis.
    • Outbound rate: Sum. Left axis.
    • Throttling packet loss rate: Sum. Right axis. Mark as red if > 100.

  • EIP bandwidth utilization (time series)

    • Inbound bandwidth utilization: Max, Min, Avg. Left axis. Mark as yellow if > 50. Mark as red if > 80.
    • Outbound bandwidth utilization: Max, Min, Avg. Left axis.

  • Top N instances by EIP throttling packet loss (table): Throttling packet loss rate. Mark as red if > 100.

  • Top N instances by EIP inbound bandwidth utilization (table): Inbound bandwidth utilization. Mark as yellow if > 50. Mark as red if > 80.

  • Top N instances by EIP outbound bandwidth utilization (table): Outbound bandwidth utilization.

  • Top N instances by EIP inbound rate (table): Inbound rate.

  • Top N instances by EIP outbound rate (table): Outbound rate.

  2. Internet Shared Bandwidth dashboard. It supports filtering instances by region, resource group, instance ID, or IP address. The instance display table supports hyperlinks to the instance monitoring page. The panels are as follows:

  • Total EBWP rate (time series)

    • Inbound rate: Sum. Left axis.
    • Outbound rate: Sum. Left axis.
    • Throttling packet loss rate: Sum. Right axis. Mark as red if > 100.

  • EBWP bandwidth utilization (time series)

    • Inbound bandwidth utilization: Max, Min, Avg. Left axis. Mark as yellow if > 50. Mark as red if > 80.
    • Outbound bandwidth utilization: Max, Min, Avg. Left axis.

  • Top N instances by EBWP throttling packet loss (table): Throttling packet loss rate > 0. Mark as red if > 100.

  • Top N instances by EBWP inbound bandwidth utilization (table): Inbound bandwidth utilization > 30. Mark as yellow if > 50. Mark as red if > 80.

  • Top N instances by EBWP outbound bandwidth utilization (table): Outbound bandwidth utilization > 30.

  • Top N instances by EBWP inbound rate (table): Inbound rate.

  • Top N instances by EBWP outbound rate (table): Outbound rate.

6.1.2 Network element business dashboard

SLB dashboard design reference

SLB has over 54 monitoring metrics. We classify them by horizontal and vertical dimensions.

  • The horizontal dimension can be divided into SLB listener granularity and SLB instance granularity. Most statistical metrics have both listener-granularity and instance-granularity versions. However, 1) resource utilization-related metrics are only available at the instance granularity. 2) Health check-related metrics are only available at the listener granularity.

    • Listener-granularity metric names do not contain "Instance". For example, AliyunSlb_ActiveConnection is the statistic for the number of active connections of a specific SLB listener.

    • Instance-granularity metric names contain "Instance". For example, AliyunSlb_InstanceActiveConnection is the sum of active connection statistics for the entire SLB instance.

  • The vertical dimension can be divided into two main categories, Layer 4 and Layer 7, each with four subcategories.

    • Layer 4 statistical metrics are divided into four categories: health check, resource utilization, traffic, and connections.

    • Layer 7 statistical metrics are divided into four categories: resource utilization, response time, status code, and others.
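
The naming rule for the horizontal dimension can be applied mechanically, for example to split a metric list into listener-level and instance-level panels. The sketch below is a hypothetical helper that assumes every SLB metric name starts with the `AliyunSlb_` prefix shown in this section. It is not part of any Alibaba Cloud SDK.

```python
SLB_PREFIX = "AliyunSlb_"


def slb_metric_granularity(metric_name):
    """Classify an SLB metric as 'instance' or 'listener' by its name."""
    if not metric_name.startswith(SLB_PREFIX):
        raise ValueError(f"not an SLB metric name: {metric_name!r}")
    suffix = metric_name[len(SLB_PREFIX):]
    # Instance-granularity names contain "Instance"; listener names do not.
    return "instance" if "Instance" in suffix else "listener"


for name in ("AliyunSlb_ActiveConnection", "AliyunSlb_InstanceActiveConnection"):
    print(name, slb_metric_granularity(name))
```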

Layer 4

  • Health check

    • Listener granularity:
      • AliyunSlb_HealthyServerCount
      • AliyunSlb_UnhealthyServerCount

  • Resource utilization

    • Instance granularity:
      • AliyunSlb_InstanceNewConnectionUtilization - New connection utilization of the instance
      • AliyunSlb_InstanceMaxConnectionUtilization - Maximum connection utilization of the instance
      • AliyunSlb_InstanceTrafficRXUtilization - Received traffic utilization of the instance
      • AliyunSlb_InstanceTrafficTXUtilization - Sent traffic utilization of the instance

  • Traffic

    • Listener granularity:
      • AliyunSlb_TrafficRXNew - New received traffic
      • AliyunSlb_TrafficTXNew - New sent traffic
      • AliyunSlb_DropTrafficRX - Dropped traffic at the receiver
      • AliyunSlb_DropTrafficTX - Dropped traffic at the sender
      • AliyunSlb_PacketRX - Received packets
      • AliyunSlb_PacketTX - Sent packets
      • AliyunSlb_DropPacketRX - Dropped packets at the receiver
      • AliyunSlb_DropPacketTX - Dropped packets at the sender

    • Instance granularity:
      • AliyunSlb_InstanceTrafficRX - Received traffic of the instance
      • AliyunSlb_InstanceTrafficTX - Sent traffic of the instance
      • AliyunSlb_InstanceDropTrafficRX - Dropped traffic at the receiver of the instance
      • AliyunSlb_InstanceDropTrafficTX - Dropped traffic at the sender of the instance
      • AliyunSlb_InstancePacketRX - Received packets of the instance
      • AliyunSlb_InstancePacketTX - Sent packets of the instance
      • AliyunSlb_InstanceDropPacketRX - Dropped packets at the receiver of the instance
      • AliyunSlb_InstanceDropPacketTX - Dropped packets at the sender of the instance

  • Connections

    • Listener granularity:
      • AliyunSlb_ActiveConnection - Current active connections
      • AliyunSlb_InactiveConnection - Inactive connections
      • AliyunSlb_NewConnection - New connections
      • AliyunSlb_MaxConnection - Maximum connections
      • AliyunSlb_DropConnection - Dropped connections

    • Instance granularity:
      • AliyunSlb_InstanceActiveConnection - Current active connections of the instance
      • AliyunSlb_InstanceInactiveConnection - Inactive connections of the instance
      • AliyunSlb_InstanceNewConnection - New connections of the instance
      • AliyunSlb_InstanceMaxConnection - Maximum connections of the instance
      • AliyunSlb_InstanceDropConnection - Dropped connections of the instance

Layer 7

  • Resource utilization

    • Instance granularity:
      • AliyunSlb_InstanceQpsUtilization - QPS utilization of the instance

  • Response time

    • Listener granularity:
      • AliyunSlb_Rt - Response time

    • Instance granularity:
      • AliyunSlb_InstanceRt - Response time of the instance
      • AliyunSlb_InstanceUpstreamRt - Backend response time of the instance

  • Status code

    • Listener granularity:
      • AliyunSlb_StatusCode2xx - Number of requests with HTTP status code 2xx
      • AliyunSlb_StatusCode3xx - Number of requests with HTTP status code 3xx
      • AliyunSlb_StatusCode4xx - Number of requests with HTTP status code 4xx
      • AliyunSlb_StatusCode5xx - Number of requests with HTTP status code 5xx
      • AliyunSlb_StatusCodeOther - Number of requests with other HTTP status codes
      • AliyunSlb_UpstreamCode4xx - Number of times the backend returned a 4xx status code
      • AliyunSlb_UpstreamCode5xx - Number of times the backend returned a 5xx status code

    • Instance granularity:
      • AliyunSlb_InstanceStatusCode2xx - Number of requests with HTTP status code 2xx for the instance
      • AliyunSlb_InstanceStatusCode3xx - Number of requests with HTTP status code 3xx for the instance
      • AliyunSlb_InstanceStatusCode4xx - Number of requests with HTTP status code 4xx for the instance
      • AliyunSlb_InstanceStatusCode5xx - Number of requests with HTTP status code 5xx for the instance
      • AliyunSlb_InstanceStatusCodeOther - Number of requests with other HTTP status codes for the instance
      • AliyunSlb_InstanceUpstreamCode4xx - Number of times the backend of the instance returned a 4xx status code
      • AliyunSlb_InstanceUpstreamCode5xx - Number of times the backend of the instance returned a 5xx status code

  • Other

    • Listener granularity:
      • AliyunSlb_Qps - Queries per second (QPS)

    • Instance granularity:
      • AliyunSlb_InstanceQps - Queries per second (QPS) of the instance

CLB dashboard design reference

The SLB dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Because CLB metrics have multiple dimensions, we recommend that you select the dimension for the Top N panels as follows:

  • Utilization metrics are available only at the instance level, so display them per instance.

  • If an SLB instance serves a single business, we recommend displaying the SLB instance in the Top N list.

  • If a CLB instance serves multiple businesses, you can display its listeners in the Top N list.

The panels are as follows:

  • Total SLB rate (time series)

    • Inbound rate: Sum. Left axis.
    • Outbound rate: Sum. Left axis.
    • Inbound drop rate: Sum. Right axis. Mark as red if > 100.
    • Outbound drop rate: Sum. Right axis. Mark as red if > 100.

  • Total SLB connections (time series)

    • Active connections: Sum. Left axis.
    • Inactive connections: Sum. Left axis.
    • New connections: Sum. Left axis.
    • Maximum connections: Sum. Left axis.
    • Dropped connections: Sum. Right axis. Set yellow and red markers.

  • Health check (time series)

    • Healthy servers: Sum. Left axis. Mark as green.
    • Unhealthy servers: Sum. Left axis. Set yellow and red markers.

  • SLB instance utilization (time series)

    • New connection utilization: Max, Min, Avg. Left axis. Mark as yellow if > 50. Mark as red if > 80.
    • Maximum connection utilization: Max, Min, Avg. Left axis.
    • Inbound bandwidth utilization: Max, Min, Avg. Left axis.
    • Outbound bandwidth utilization: Max, Min, Avg. Left axis.
    • Layer 7 QPS utilization: Max, Min, Avg. Left axis.

  • Layer 7 QPS (time series)

    • QPS: Sum. Left axis.

  • Layer 7 response time (time series)

    • Response time: Max, Min, Avg. Left axis. Set yellow and red markers.
    • Backend response time: Max, Min, Avg. Left axis.

  • Layer 7 status code statistics (time series)

    • 2xx: Sum. Left axis.
    • 3xx: Sum. Left axis.
    • 4xx: Sum. Left axis. Set yellow and red markers.
    • 5xx: Sum. Left axis. Set yellow and red markers.
    • Other: Sum. Left axis.
    • Upstream4xx: Sum. Left axis. Set yellow and red markers.
    • Upstream5xx: Sum. Left axis. Set yellow and red markers.

  • Top N by inbound rate (table): Inbound rate.

  • Top N by outbound rate (table): Outbound rate.

  • Top N by inbound drop rate (table): Inbound drop rate > 0. Mark as red.

  • Top N by outbound drop rate (table): Outbound drop rate > 0. Mark as red.

  • Top N by maximum connections (table).

  • Top N by new connections (table).

  • Top N by dropped connections (table): Dropped connections > 0. Mark as red.

  • Top N by unhealthy servers (table): Unhealthy servers > 0. Mark as red.

  • High utilization instances (table): New connection utilization > 30 OR Maximum connection utilization > 30 OR Inbound bandwidth utilization > 30 OR Outbound bandwidth utilization > 30 OR QPS utilization > 30. Mark as yellow if > 50. Mark as red if > 80.

  • Top N by response time (table): Response time > Average value × 2. Dynamic threshold marking.

  • Top N by QPS (table): QPS.

  • Top N by 4xx (table): 4xx.

  • Top N by 5xx (table): 5xx.
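
Several of the Top N panels above combine a filter (for example, "> 0" or "> Average value × 2") with a descending sort. The following Python sketch shows one way to compute such a panel from already collected rows. The function names, field names, and sample values are hypothetical, and the code does not call any monitoring API.

```python
from statistics import mean


def top_n(rows, metric, n=5, min_value=None):
    """Return up to n rows with the largest `metric`, optionally keeping
    only rows whose value is strictly greater than `min_value`."""
    candidates = [
        row for row in rows
        if min_value is None or row[metric] > min_value
    ]
    return sorted(candidates, key=lambda row: row[metric], reverse=True)[:n]


def slow_instances(rows, n=5):
    """Dynamic threshold: keep rows slower than twice the average."""
    average = mean(row["response_time_ms"] for row in rows)
    return top_n(rows, "response_time_ms", n=n, min_value=average * 2)


# Hypothetical sample data for four load balancer instances.
rows = [
    {"instance": "lb-a", "response_time_ms": 20, "dropped_connections": 0},
    {"instance": "lb-b", "response_time_ms": 25, "dropped_connections": 3},
    {"instance": "lb-c", "response_time_ms": 150, "dropped_connections": 0},
    {"instance": "lb-d", "response_time_ms": 30, "dropped_connections": 1},
]

print([r["instance"] for r in top_n(rows, "dropped_connections", min_value=0)])
print([r["instance"] for r in slow_instances(rows)])
```

Filtering before sorting keeps healthy instances out of the table, so an empty panel means there is nothing to investigate.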

NLB dashboard design reference

NLB has over 45 monitoring metrics. We classify them by horizontal and vertical dimensions.

  • The horizontal dimension can be divided into NLB listener granularity, NLB VIP granularity, and NLB instance granularity. Most statistical metrics have listener-granularity, VIP-granularity, and instance-granularity versions.

    • Listener-granularity metric names do not contain "Instance" or "Vip". For example, AliyunNlb_ActiveConnection is the statistic for the number of active connections of a specific NLB listener.

    • VIP-granularity metric names contain "Vip". For example, AliyunNlb_VipActiveConnection is the statistic for the number of active connections of a specific NLB VIP.

    • Instance-granularity metric names contain "Instance". For example, AliyunNlb_InstanceActiveConnection is the sum of active connection statistics for the entire NLB instance.

  • The vertical dimension can be divided into four main categories: health check, traffic, connections, and others.
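The naming convention for the horizontal dimension can be expressed as a small classifier. The sketch below is an assumption-based illustration: the function name is hypothetical, and it applies only the substring rule described above. A metric such as AliyunNlb_RealServerResetPacket contains neither marker, so this rule reports it as listener granularity even though the classification table groups it with the VIP metrics.

```python
def nlb_metric_granularity(metric_name: str) -> str:
    """Classify an NLB metric name by the substring rule described in the text.

    "Vip" in the name      -> VIP granularity
    "Instance" in the name -> instance granularity
    neither                -> listener granularity
    """
    if "Vip" in metric_name:
        return "vip"
    if "Instance" in metric_name:
        return "instance"
    return "listener"


def group_by_granularity(metric_names):
    """Group metric names into a dict keyed by granularity."""
    groups = {"listener": [], "vip": [], "instance": []}
    for name in metric_names:
        groups[nlb_metric_granularity(name)].append(name)
    return groups
```

For example, grouping the three active-connection metrics named above places one metric in each of the three groups.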

Classification

Listener granularity

Instance granularity

VIP granularity

Health check

AliyunNlb_ListenerHeathyServerCount - Number of healthy servers

AliyunNlb_ListenerUnhealthyServerCount - Number of unhealthy servers

AliyunNlb_NlbInstanceHeathyServerCount - Number of healthy servers for the NLB instance

AliyunNlb_InstanceUnhealthyServerCount - Number of unhealthy servers for the instance

N/A

Traffic

AliyunNlb_TrafficRXNew - Received traffic

AliyunNlb_TrafficTXNew - Sent traffic

AliyunNlb_ListenerPacketRX - Received packets

AliyunNlb_ListenerPacketTX - Sent packets

AliyunNlb_DropTrafficRX - Received dropped traffic

AliyunNlb_DropTrafficTX - Sent dropped traffic

AliyunNlb_DropPacketRX - Received dropped packets

AliyunNlb_DropPacketTX - Sent dropped packets

AliyunNlb_InstanceTrafficRX - Instance received traffic
AliyunNlb_InstanceTrafficTX - Instance sent traffic
AliyunNlb_InstancePacketRX - Instance received packets
AliyunNlb_InstancePacketTX - Instance sent packets
AliyunNlb_InstanceDropTrafficRX - Instance received dropped traffic
AliyunNlb_InstanceDropTrafficTX - Instance sent dropped traffic
AliyunNlb_InstanceDropPacketRX - Instance received dropped packets
AliyunNlb_InstanceDropPacketTX - Instance sent dropped packets

AliyunNlb_VipTrafficRX - VIP received traffic
AliyunNlb_VipTrafficTX - VIP sent traffic
AliyunNlb_VipPacketRX - VIP received packets
AliyunNlb_VipPacketTX - VIP sent packets
AliyunNlb_VipDropTrafficRX - VIP received dropped traffic
AliyunNlb_VipDropTrafficTX - VIP sent dropped traffic
AliyunNlb_VipDropPacketRX - VIP received dropped packets
AliyunNlb_VipDropPacketTX - VIP sent dropped packets

Connections

AliyunNlb_NewConnection - New connections

AliyunNlb_MaxConnection - Maximum connections

AliyunNlb_DropConnection - Dropped connections

AliyunNlb_ActiveConnection - Active connections

AliyunNlb_InactiveConnection - Inactive connections

AliyunNlb_InstanceNewConnection - Instance new connections

AliyunNlb_InstanceMaxConnection - Instance maximum connections

AliyunNlb_InstanceDropConnection - Instance dropped connections

AliyunNlb_InstanceActiveConnection - Instance active connections

AliyunNlb_InstanceInactiveConnection - Instance inactive connections

AliyunNlb_VipNewConnection - VIP new connections

AliyunNlb_VipMaxConnection - VIP maximum connections

AliyunNlb_VipDropConnection - VIP dropped connections

AliyunNlb_VipActiveConnection - VIP active connections

AliyunNlb_VipInactiveConnection - VIP inactive connections

Other

N/A

N/A

AliyunNlb_VipClientResetPacket - Number of VIP client reset packets

AliyunNlb_RealServerResetPacket - Number of VIP server reset packets

NLB dashboard design reference

The NLB dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Because NLB has many metric dimensions, we recommend the following for the Top N section dimensions:

  • If an NLB instance serves a single business, we recommend displaying the NLB instance in the Top N.

  • If an NLB instance serves mixed businesses, we recommend displaying the NLB listener in the Top N.

  • Other dimension information is used for problem analysis and is not displayed on the dashboard.

Panel name

Type

Metric

Axis

Description

Total NLB rate

Time series

Inbound rate: Sum

Left

Outbound rate: Sum

Left

Inbound drop rate: Sum

Right

Mark as red if > 100

Outbound drop rate: Sum

Right

Mark as red if > 100

Total NLB connections

Time series

Active connections: Sum

Left

Inactive connections: Sum

Left

New connections: Sum

Left

Maximum connections: Sum

Left

Dropped connections: Sum

Right

Set yellow and red markers

Health check

Time series

Healthy servers: Sum

Left

Mark as green

Unhealthy servers: Sum

Left

Set yellow and red markers

Reset

Time series

Number of client reset packets

Left

Mark as red if > 100

Number of server reset packets

Left

Mark as red if > 100

Top N by inbound rate

Table

Inbound rate

Top N by outbound rate

Table

Outbound rate

Top N by inbound drop rate

Table

Inbound drop rate > 0

Mark as red

Top N by outbound drop rate

Table

Outbound drop rate > 0

Mark as red

Top N by maximum connections

Table

Top N by new connections

Table

Top N by dropped connections

Table

Dropped connections > 0

Mark as red

Top N by unhealthy servers

Table

Unhealthy servers > 0

Mark as red

Top N by ClientReset

Table

Number of client reset packets > 0

Top N by ServerReset

Table

Number of server reset packets > 0
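Most of the Top N tables above follow the same pattern: optionally keep only rows whose value exceeds a floor (for example, dropped connections > 0), sort in descending order, and keep the first N rows. A minimal sketch of that pattern follows. The function name, its parameters, and the sample instance IDs are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def top_n(
    values: Dict[str, float],
    n: int = 10,
    min_exclusive: Optional[float] = None,
) -> List[Tuple[str, float]]:
    """Return the n largest (name, value) pairs.

    When min_exclusive is set, only values strictly greater than it are kept,
    which models filters such as "Dropped connections > 0".
    """
    items = list(values.items())
    if min_exclusive is not None:
        items = [(name, value) for name, value in items if value > min_exclusive]
    return sorted(items, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical per-instance dropped-connection counts.
dropped = {"nlb-1": 0, "nlb-2": 12, "nlb-3": 3}
print(top_n(dropped, n=2, min_exclusive=0))
```

The same helper works at listener granularity when one NLB instance serves mixed businesses: pass listener identifiers as the keys instead of instance IDs.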

ALB monitoring metric reference

ALB has over 112 monitoring metrics. We classify them by horizontal and vertical dimensions.

  • The horizontal dimension can be divided into ALB listener granularity, ALB VIP granularity, ALB rule granularity, ALB server group granularity, and ALB instance granularity. Most statistical metrics have listener-granularity, VIP-granularity, and instance-granularity versions. A few metrics also provide rule-granularity and server group-granularity.

    • Listener-granularity metric names contain "Listener". For example, AliyunAlb_ListenerQPS is the QPS statistic for a specific ALB listener.

    • VIP-granularity metric names contain "Vip". For example, AliyunAlb_VipQPS is the QPS statistic for a specific ALB VIP.

    • Rule-granularity metric names contain "Rule". For example, AliyunAlb_RuleQPS is the QPS statistic for a specific ALB rule.

    • Server group-granularity metric names contain "ServerGroup". For example, AliyunAlb_ServerGroupQPS is the QPS statistic for a specific ALB server group.

    • Instance-granularity metric names contain "LoadBalancer". For example, AliyunAlb_LoadBalancerQPS is the sum of QPS statistics for the entire ALB instance.

  • The vertical dimension can be divided into six main categories: health check, traffic, connections, response time, status code, and others.
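The ALB naming rule can be sketched the same way. In every metric name listed in this section, the granularity marker directly follows the AliyunAlb_ prefix, so the sketch checks the start of the remaining name rather than searching the whole string. The function name and the "unknown" fallback are assumptions made for illustration.

```python
# Marker that follows the "AliyunAlb_" prefix -> granularity label.
ALB_MARKERS = {
    "Listener": "listener",
    "Vip": "vip",
    "Rule": "rule",
    "ServerGroup": "server_group",
    "LoadBalancer": "instance",
}


def alb_metric_granularity(metric_name: str) -> str:
    """Classify an ALB metric name by the marker that follows the prefix."""
    # partition() splits on the first underscore; suffix is empty if there is none.
    _, _, suffix = metric_name.partition("_")
    for marker, granularity in ALB_MARKERS.items():
        if suffix.startswith(marker):
            return granularity
    return "unknown"
```

Returning "unknown" instead of guessing keeps unexpected metric names visible, which is useful when the metric list changes.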

Classification

Listener granularity

Instance granularity

VIP granularity

Rule granularity

Server group granularity

Health check

AliyunAlb_ListenerHealthyHostCount

AliyunAlb_ListenerUnHealthyHostCount

AliyunAlb_LoadBalancerHealthyHostCount

AliyunAlb_LoadBalancerUnHealthyHostCount

AliyunAlb_RuleHealthyHostCount

AliyunAlb_RuleUnHealthyHostCount

AliyunAlb_ServerGroupHealthyHostCount

AliyunAlb_ServerGroupUnHealthyHostCount

Traffic

AliyunAlb_ListenerInBits

AliyunAlb_ListenerOutBits

AliyunAlb_LoadBalancerInBits

AliyunAlb_LoadBalancerOutBits

AliyunAlb_VipInBits

AliyunAlb_VipOutBits

Connections

AliyunAlb_ListenerActiveConnection

AliyunAlb_ListenerInactiveConnection

AliyunAlb_ListenerNewConnection

AliyunAlb_ListenerMaxConnection

AliyunAlb_ListenerRejectedConnection

AliyunAlb_ListenerUpstreamConnectionError

AliyunAlb_LoadBalancerActiveConnection

AliyunAlb_LoadBalancerInactiveConnection

AliyunAlb_LoadBalancerNewConnection

AliyunAlb_LoadBalancerMaxConnection

AliyunAlb_LoadBalancerRejectedConnection

AliyunAlb_LoadBalancerUpstreamConnectionError

AliyunAlb_VipActiveConnection

AliyunAlb_VipInactiveConnection

AliyunAlb_VipNewConnection

AliyunAlb_VipMaxConnection

AliyunAlb_VipRejectedConnection

AliyunAlb_VipUpstreamConnectionError

AliyunAlb_RuleUpstreamConnectionError

AliyunAlb_ServerGroupUpstreamConnectionError

Response time

AliyunAlb_ListenerRequestTime

AliyunAlb_ListenerUpstreamResponseTime

AliyunAlb_LoadBalancerRequestTime

AliyunAlb_LoadBalancerUpstreamResponseTime

AliyunAlb_VipRequestTime

AliyunAlb_VipUpstreamResponseTime

AliyunAlb_RuleRequestTime

AliyunAlb_RuleUpstreamResponseTime

AliyunAlb_ServerGroupRequestTime

AliyunAlb_ServerGroupUpstreamResponseTime

Status code

AliyunAlb_ListenerHTTPCode2XX

AliyunAlb_ListenerHTTPCode3XX

AliyunAlb_ListenerHTTPCode4XX

AliyunAlb_ListenerHTTPCode5XX

AliyunAlb_ListenerHTTPCode500

AliyunAlb_ListenerHTTPCode502

AliyunAlb_ListenerHTTPCode503

AliyunAlb_ListenerHTTPCode504

AliyunAlb_ListenerHTTPCodeUpstream2XX

AliyunAlb_ListenerHTTPCodeUpstream3XX

AliyunAlb_ListenerHTTPCodeUpstream4XX

AliyunAlb_ListenerHTTPCodeUpstream5XX

AliyunAlb_LoadBalancerHTTPCode2XX

AliyunAlb_LoadBalancerHTTPCode3XX

AliyunAlb_LoadBalancerHTTPCode4XX

AliyunAlb_LoadBalancerHTTPCode5XX

AliyunAlb_LoadBalancerHTTPCode500

AliyunAlb_LoadBalancerHTTPCode502

AliyunAlb_LoadBalancerHTTPCode503

AliyunAlb_LoadBalancerHTTPCode504

AliyunAlb_LoadBalancerHTTPCodeUpstream2XX

AliyunAlb_LoadBalancerHTTPCodeUpstream3XX

AliyunAlb_LoadBalancerHTTPCodeUpstream4XX

AliyunAlb_LoadBalancerHTTPCodeUpstream5XX

AliyunAlb_VipHTTPCode2XX

AliyunAlb_VipHTTPCode3XX

AliyunAlb_VipHTTPCode4XX

AliyunAlb_VipHTTPCode5XX

AliyunAlb_VipHTTPCode500

AliyunAlb_VipHTTPCode502

AliyunAlb_VipHTTPCode503

AliyunAlb_VipHTTPCode504

AliyunAlb_RuleHTTPCodeUpstream2XX

AliyunAlb_RuleHTTPCodeUpstream3XX

AliyunAlb_RuleHTTPCodeUpstream4XX

AliyunAlb_RuleHTTPCodeUpstream5XX

AliyunAlb_RuleHTTPCodeUpstream2XXRatio

AliyunAlb_RuleHTTPCodeUpstream3XXRatio

AliyunAlb_RuleHTTPCodeUpstream4XXRatio

AliyunAlb_RuleHTTPCodeUpstream5XXRatio

AliyunAlb_ServerGroupHTTPCodeUpstream2XX

AliyunAlb_ServerGroupHTTPCodeUpstream3XX

AliyunAlb_ServerGroupHTTPCodeUpstream4XX

AliyunAlb_ServerGroupHTTPCodeUpstream5XX

Other

AliyunAlb_ListenerQPS

AliyunAlb_ListenerNonStickyRequest

AliyunAlb_ListenerUpstreamTLSNegotiationError

AliyunAlb_ListenerClientTLSNegotiationError

AliyunAlb_ListenerHTTPFixedResponse

AliyunAlb_ListenerHTTPRedirect

AliyunAlb_LoadBalancerQPS

AliyunAlb_LoadBalancerNonStickyRequest

AliyunAlb_LoadBalancerUpstreamTLSNegotiationError

AliyunAlb_LoadBalancerClientTLSNegotiationError

AliyunAlb_LoadBalancerHTTPFixedResponse

AliyunAlb_LoadBalancerHTTPRedirect

AliyunAlb_VipQPS

AliyunAlb_VipNonStickyRequest

AliyunAlb_VipUpstreamTLSNegotiationError

AliyunAlb_VipClientTLSNegotiationError

AliyunAlb_VipHTTPFixedResponse

AliyunAlb_VipHTTPRedirect

AliyunAlb_RuleQPS

AliyunAlb_RuleNonStickyRequest

AliyunAlb_RuleUpstreamTLSNegotiationError

AliyunAlb_ServerGroupQPS

AliyunAlb_ServerGroupNonStickyRequest

AliyunAlb_ServerGroupUpstreamTLSNegotiationError

ALB dashboard design reference

The ALB dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Because ALB has many metric dimensions, we recommend the following for the Top N section dimensions:

  • If an ALB instance serves a single business, we recommend displaying the ALB instance in the Top N.

  • If an ALB instance serves mixed businesses, we recommend displaying the ALB listener in the Top N.

  • Other dimension information is used for problem analysis and is not displayed on the dashboard.

Panel name

Type

Metric

Axis

Description

Total ALB rate

Time series

Inbound rate: Sum

Left

Outbound rate: Sum

Left

Total ALB connections

Time series

Active connections: Sum

Left

Inactive connections: Sum

Left

New connections: Sum

Left

Maximum connections: Sum

Left

Rejected connections: Sum

Right

Set yellow and red markers

Upstream rejected connections: Sum

Right

Set yellow and red markers

Health check

Time series

Healthy servers: Sum

Left

Mark as green

Unhealthy servers: Sum

Left

Set yellow and red markers

TLS errors

Time series

TLS negotiation errors: Sum

Left

Set yellow and red markers

Upstream TLS negotiation errors: Sum

Left

Set yellow and red markers

Layer 7 QPS

Time series

QPS: Sum

Left

Layer 7 response time

Time series

Response time: Max, Min, Avg

Left

Set yellow and red markers

Backend response time: Max, Min, Avg

Left

Layer 7 status code statistics

Time series

2xx: Sum

Left

3xx: Sum

Left

4xx: Sum

Left

Set yellow and red markers

5xx: Sum

Left

Set yellow and red markers

Upstream4xx: Sum

Left

Set yellow and red markers

Upstream5xx: Sum

Left

Set yellow and red markers

Top N by inbound rate

Table

Inbound rate

Top N by outbound rate

Table

Outbound rate

Top N by maximum connections

Table

Top N by new connections

Table

Top N by dropped connections

Table

Dropped connections > 0

Mark as red

Top N by unhealthy servers

Table

Unhealthy servers > 0

Mark as red

Top N by TLS negotiation errors

Table

TLS negotiation errors > 0

Mark as red

Top N by upstream TLS negotiation errors

Table

Upstream TLS negotiation errors > 0

Mark as red

Top N by response time

Table

Response time > Average value × 2

Mark as yellow or red

Top N by QPS

Table

QPS

Top N by 4xx

Table

4xx

Top N by 5xx

Table

5xx
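The "Response time > Average value × 2" filter in the Top N by response time panel is a simple dynamic threshold. The sketch below assumes that "Average value" means the mean response time across all instances in the current view, which the table does not state explicitly. The function name and sample values are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def slow_instances(
    response_times: Dict[str, float],
    factor: float = 2.0,
    n: int = 10,
) -> List[Tuple[str, float]]:
    """Return up to n instances whose response time exceeds factor x the fleet mean."""
    if not response_times:
        return []
    threshold = mean(response_times.values()) * factor
    slow = [(name, rt) for name, rt in response_times.items() if rt > threshold]
    return sorted(slow, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical response times in milliseconds.
rt_ms = {"alb-1": 10.0, "alb-2": 12.0, "alb-3": 11.0, "alb-4": 100.0}
print(slow_instances(rt_ms))
```

Because the threshold moves with the fleet average, a single outlier also raises the bar for every other instance. A median-based threshold would be less sensitive to that, but it is a different rule from the one described here.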

GA dashboard design reference

The GA dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name

Type

Metric

Axis

Description

Total frontend IP rate

Time series

Inbound rate: Sum

Left

Outbound rate: Sum

Left

Inbound drop rate: Sum

Right

Mark as red if > 100

Outbound drop rate: Sum

Right

Mark as red if > 100

Frontend IP bandwidth utilization

Time series

Inbound bandwidth utilization: Max, Min, Avg

Left

Mark as yellow if > 50

Mark as red if > 80

Outbound bandwidth utilization: Max, Min, Avg

Left

Total frontend IP active connections

Time series

Active connections: Sum

Left

Total backend group rate

Time series

Inbound rate: Sum

Left

Outbound rate: Sum

Left

Inbound drop rate: Sum

Right

Mark as red if > 100

Outbound drop rate: Sum

Right

Mark as red if > 100

Backend group bandwidth utilization

Time series

Inbound bandwidth utilization: Max, Min, Avg

Left

Mark as yellow if > 50

Mark as red if > 80

Outbound bandwidth utilization: Max, Min, Avg

Left

Tunnel latency

Time series

Tunnel latency: Max, Min, Avg

Dynamic threshold marking

Top N by frontend inbound rate

Table

Inbound rate

Top N by frontend outbound rate

Table

Outbound rate

Top N by frontend inbound bandwidth utilization

Table

Inbound bandwidth utilization > 30

Mark as red or yellow

Top N by frontend outbound bandwidth utilization

Table

Outbound bandwidth utilization > 30

Mark as red or yellow

Top N by active connections

Table

Active connections

Top N by backend group inbound bandwidth utilization

Table

Backend group inbound bandwidth utilization

Top N by backend group outbound bandwidth utilization

Table

Backend group outbound bandwidth utilization

Top N by tunnel latency

Table

Tunnel latency

Dynamic threshold

NAT dashboard design reference

The NAT dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name

Type

Metric

Axis

Description

Total NAT connections

Time series

Active connections: Sum

Left

New connections: Sum

Left

Dropped active connections: Sum

Right

Mark as yellow if > 0

Mark as red if > 100

Dropped new connections: Sum

Right

Mark as yellow if > 0

Mark as red if > 100

NAT connection utilization

Time series

Active connection utilization: Max, Min, Avg

Left

Mark as yellow if > 50

Mark as red if > 80

New connection utilization: Max, Min, Avg

Left

Total rate

Time series

Public network inbound rate: Sum

Left

Mark as red if inbound-outbound rate difference > threshold

Public network outbound rate: Sum

Left

Private network inbound rate: Sum

Left

Private network outbound rate: Sum

Left

Top N instances by active connections

Table

Active connections

Top N by new connections

Table

New connections

Top N by dropped active connections

Table

Dropped active connections > 0

Mark as yellow if > 0

Mark as red if > 100

Top N by dropped new connections

Table

Dropped new connections > 0

Mark as yellow if > 0

Mark as red if > 100

Top N by active connection utilization

Table

Active connection utilization > 30

Mark as yellow if > 50

Mark as red if > 80

Top N by new connection utilization

Table

New connection utilization > 30

Mark as yellow if > 50

Mark as red if > 80

Top N by inbound rate

Table

Public network inbound rate

Top N by outbound rate

Table

Public network outbound rate
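The "inbound-outbound rate difference > threshold" marker in the Total rate panel can be read as a check on paired samples. The sketch below assumes the rule compares the absolute difference between the inbound and outbound series at each sample point. The table does not define the difference or the threshold, so both the interpretation and the function name are assumptions.

```python
from typing import List, Sequence


def imbalance_points(
    inbound: Sequence[float],
    outbound: Sequence[float],
    threshold: float,
) -> List[int]:
    """Return the indexes of samples where |inbound - outbound| > threshold.

    The two series are assumed to be aligned on the same timestamps;
    zip() stops at the shorter series.
    """
    return [
        index
        for index, (rx, tx) in enumerate(zip(inbound, outbound))
        if abs(rx - tx) > threshold
    ]


# Hypothetical rate samples in Mbit/s.
print(imbalance_points([100, 200, 900], [90, 210, 100], threshold=50))
```

The returned indexes identify the points to mark as red on the time series.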

6.1.3 Global networking business dashboard

Express Connect - Physical port dashboard design reference

The physical port dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name

Type

Metric

Axis

Description

Total rate

Time series

Inbound rate to cloud: Sum

Left

Outbound rate from cloud: Sum

Left

Port error packets

Time series

Port inbound error packets: Sum

Left

Mark as yellow or red

Port outbound error packets: Sum

Left

Mark as yellow or red

Number of disconnected leased lines

Time series

Port down: Count

Left

Mark as red

Top N by inbound rate to cloud

Table

Inbound rate to cloud

Top N by outbound rate from cloud

Table

Outbound rate from cloud

Top N by port inbound error packets

Table

Port inbound error packets > 0

Mark as red

Top N by port outbound error packets

Table

Port outbound error packets > 0

Mark as red

Disconnected leased line instances

Table

Port down == 1

Mark as red

Express Connect - VBR dashboard design reference

The VBR dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name

Type

Metric

Axis

Description

Total rate

Time series

Inbound rate to cloud: Sum

Left

Outbound rate from cloud: Sum

Left

Throttling packet loss: Sum

Right

Mark as red if > 100

Packet loss

Time series

Port inbound packet loss: Sum

Left

Mark as yellow or red

Port outbound packet loss: Sum

Left

Mark as yellow or red

Probe packet loss

Time series

Probe packet loss: Max, Min, Avg

Left

Mark as yellow if > 0

Mark as red if > 10

Probe latency

Time series

Probe latency: Max, Min, Avg

Left

Dynamic threshold

Top N by inbound rate to cloud

Table

Inbound rate to cloud

Top N by outbound rate from cloud

Table

Outbound rate from cloud

Top N by throttling packet loss

Table

Throttling packet loss > 0

Mark as red

Top N by port inbound packet loss

Table

Port inbound packet loss > 0

Mark as red

Top N by port outbound packet loss

Table

Port outbound packet loss > 0

Mark as red

Top N by probe packet loss

Table

Probe packet loss > 0

Mark as yellow if > 0

Mark as red if > 10

Top N by probe latency

Table

Probe latency

ECR dashboard design reference

The ECR dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name

Type

Metric

Axis

Description

Total rate

Time series

Inbound rate: Sum

Left

Outbound rate: Sum

Left

Total cross-region throttling packet loss rate

Time series

Throttling packet loss bit rate: Sum

Left

Mark as yellow or red

Throttling packet loss packet rate: Sum

Right

Mark as yellow or red

Top N by inbound rate

Table

Inbound rate

Top N by outbound rate

Table

Outbound rate

Top N by cross-region rate

Table

Cross-region rate

Top N by cross-region throttling

Table

Throttling packet loss > 0

Mark as red

VPN dashboard design reference

The VPN dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name

Type

Metric

Axis

Description

Total rate

Time series

VPN Gateway inbound rate to cloud: Sum

Left

IPsec-VPN connection inbound rate to cloud: Sum

Left

VPN Gateway outbound rate from cloud: Sum

Right

IPsec-VPN connection outbound rate from cloud: Sum

Right

VPN Gateway utilization

Time series

Inbound bandwidth utilization to cloud: Max, Min, Avg

Left

Mark as yellow or red

Outbound bandwidth utilization from cloud: Max, Min, Avg

Left

Mark as yellow or red

Number of online SSL clients

Time series

Number of SSL clients: Sum

Left

Top N by inbound bandwidth utilization to cloud

Table

Inbound bandwidth utilization to cloud > 30

Mark as yellow if > 50

Mark as red if > 80

Top N by outbound bandwidth utilization from cloud

Table

Outbound bandwidth utilization from cloud > 30

Mark as yellow if > 50

Mark as red if > 80

Top N by VPN Gateway inbound rate to cloud

Table

VPN Gateway inbound rate to cloud

Top N by VPN Gateway outbound rate from cloud

Table

VPN Gateway outbound rate from cloud

Top N by IPsec-VPN connection inbound rate to cloud

Table

IPsec-VPN connection inbound rate to cloud

Top N by IPsec-VPN connection outbound rate from cloud

Table

IPsec-VPN connection outbound rate from cloud

TR dashboard design reference

The TR cross-region dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name

Type

Metric

Axis

Description

TR traffic

Time series

Inbound rate: Sum

Left

Mark as red if inbound-outbound rate difference > threshold

Outbound rate: Sum

Left

Blackhole drop rate: Sum

Right

No-route drop rate: Sum

Right

Attachment connection traffic

Time series

Inbound rate: Sum

Left

Mark as red if inbound-outbound rate difference > threshold

Outbound rate: Sum

Left

Blackhole drop rate: Sum

Left

Top N by TR inbound traffic

Table

TR inbound rate

Top N by TR outbound traffic

Table

TR outbound rate

Top N by TR blackhole drops

Table

TR blackhole drop rate

Top N by TR no-route drops

Table

TR no-route drop rate

Top N by attachment connection inbound traffic

Table

Attachment connection inbound traffic

Top N by attachment connection outbound traffic

Table

Attachment connection outbound traffic

Top N by attachment connection drops

Table

Attachment connection blackhole drop rate

CEN cross-region dashboard design reference

The CEN cross-region dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name

Type

Metric

Axis

Description

CEN traffic

Time series

Region outbound rate: Sum

Left

Mark as red if outbound rate difference > threshold

Area outbound rate: Sum

Left

Bandwidth plan average outbound rate: Sum

Left

Microburst tip:

  • Mark as yellow if Peak/Avg > 3

  • Mark as red if Peak/Avg > 10

Bandwidth plan peak outbound rate: Sum

Left

Region throttling packet loss rate: Sum

Right

Mark as red if > 100 kbps

CEN utilization

Time series

Region utilization: Max, Min, Avg

Left

Mark as yellow if > 50

Mark as red if > 80

Area utilization: Max, Min, Avg

Left

Mark as yellow if > 50

Mark as red if > 80

Bandwidth plan average utilization: Max, Min, Avg

Left

Mark as yellow if > 50

Mark as red if > 80

Bandwidth plan peak utilization: Max, Min, Avg

Left

Mark as yellow if > 50

Mark as red if > 80

CEN QoS traffic

Time series

QoS outbound rate: Sum

Left

QoS throttling packet loss rate: Sum

Right

Mark as red if > 100 kbps

CEN QoS utilization

Time series

QoS average utilization: Max, Min, Avg

Left

Mark as yellow if > 50

Mark as red if > 80

QoS peak utilization: Max, Min, Avg

Left

Mark as yellow if > 50

Mark as red if > 80

Top N by region outbound rate

Table

Region outbound rate

Top N by region utilization

Table

Region utilization

Top N by region throttling packet loss rate

Table

Region throttling packet loss rate

Top N by QoS outbound rate

Table

QoS outbound rate

Top N by QoS peak utilization

Table

QoS peak utilization

Top N by QoS throttling packet loss rate

Table

QoS throttling packet loss rate
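The microburst tip in the CEN traffic panel compares the bandwidth plan peak outbound rate with the average outbound rate: yellow when Peak/Avg > 3, red when Peak/Avg > 10. A minimal sketch of that ratio check follows. The function name and the handling of a zero average are assumptions.

```python
def microburst_level(peak_rate: float, avg_rate: float) -> str:
    """Classify a peak/average ratio using the thresholds from the microburst tip."""
    if avg_rate <= 0:
        # No meaningful ratio without traffic; treated as "none" in this sketch.
        return "none"
    ratio = peak_rate / avg_rate
    if ratio > 10:
        return "red"
    if ratio > 3:
        return "yellow"
    return "none"
```

For example, a peak of 500 Mbit/s against an average of 100 Mbit/s gives a ratio of 5, which falls in the yellow band. Both rates must use the same unit and cover the same time window for the ratio to be meaningful.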

6.2 Monitoring configuration reference

6.2.1 Public network service monitoring configuration reference

If you use self-managed gateways with EIPs as public network service endpoints, use the following recommendations to configure CloudMonitor alert rules for those EIPs:

Monitored object

Alert level

Monitoring metrics and conditions

EIP

Info

When one of the following conditions occurs:

  • Inbound bandwidth utilization > 30%

  • Outbound bandwidth utilization > 30%

Warn

When one of the following conditions occurs:

  • Inbound bandwidth utilization > 50%

  • Outbound bandwidth utilization > 50%

Critical

When one of the following conditions occurs:

  • Inbound bandwidth utilization > 85%

  • Outbound bandwidth utilization > 85%

Internet Shared Bandwidth

Info

When one of the following conditions occurs:

  • Inbound bandwidth utilization > 30%

  • Outbound bandwidth utilization > 30%

Warn

When one of the following conditions occurs:

  • Inbound bandwidth utilization > 50%

  • Outbound bandwidth utilization > 50%

Critical

When one of the following conditions occurs:

  • Inbound bandwidth utilization > 85%

  • Outbound bandwidth utilization > 85%

CDT

Info

When one of the following conditions occurs:

  • Inbound bandwidth utilization > 30%

  • Outbound bandwidth utilization > 30%

Warn

When one of the following conditions occurs:

  • Inbound bandwidth utilization > 50%

  • Outbound bandwidth utilization > 50%

Critical

When one of the following conditions occurs:

  • Inbound bandwidth utilization > 85%

  • Outbound bandwidth utilization > 85%

  • Inbound throttling packet loss rate > 10

  • Outbound throttling packet loss rate > 10

When the bandwidth load exceeds 30%, the system enters a high-load state. The business may experience slow access, occasional timeouts, and other SLA degradation. We recommend performing a capacity assessment and considering a scale-out.

When the bandwidth load exceeds 50%, in addition to the issues at the previous level, the multi-AZ disaster recovery architecture may fail. If a service interruption occurs in one AZ, the remaining AZs cannot handle the entire business load. We recommend an immediate scale-out.

When the bandwidth load exceeds 85%, in addition to the issues at the previous level, the system load seriously exceeds the designed capacity. Besides an immediate scale-out, you should also consider whether there are unexpected events such as business growth exceeding expectations or security attacks, and optimize the system design.
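The three bandwidth utilization tiers above (Info at > 30%, Warn at > 50%, Critical at > 85%) can be sketched as a lookup that returns the most severe matching level. This illustrates the rule only and does not call the CloudMonitor API. The function name is hypothetical, and the CDT throttling packet loss conditions are left out.

```python
from typing import Optional

# (level, threshold in percent), ordered from most to least severe.
ALERT_LEVELS = [("Critical", 85.0), ("Warn", 50.0), ("Info", 30.0)]


def bandwidth_alert_level(in_util: float, out_util: float) -> Optional[str]:
    """Return the most severe level triggered by either direction, or None.

    A level fires when inbound OR outbound utilization exceeds its threshold,
    so only the larger of the two values needs to be checked.
    """
    worst = max(in_util, out_util)
    for level, threshold in ALERT_LEVELS:
        if worst > threshold:
            return level
    return None
```

Checking from Critical down to Info makes each sample map to a single level, which avoids raising three alerts for the same condition.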

6.2.2 Network element service monitoring configuration reference

CLB/NLB/ALB

For CLB, NLB, and ALB instances that provide public network service endpoints, configure monitoring for the public network endpoint as described in the previous section. In addition, use the following recommendations to configure CloudMonitor alert rules for the CLB, NLB, and ALB instances themselves:

Monitored object

Alert level

Monitoring metrics and conditions

CLB

Info

When one of the following conditions occurs at the instance dimension:

  • Layer 7 instance QPS utilization > 30%

  • Instance new connection utilization > 30%

  • Instance maximum connection utilization > 30%

  • Instance network inbound bandwidth utilization > 30%

  • Instance network outbound bandwidth utilization > 30%

  • Dropped connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • Number of UpstreamCode5xx per second for the Layer 7 instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 0.1% of the peak QPS.

  • Layer 7 listener RT > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

When one of the following conditions occurs at the port dimension:

  • Number of healthy backend ECS instances for health check < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.3 × the minimum number of backend ECS instances required to handle peak traffic.

  • Number of healthy backend ECS instances for Layer 7 forwarding rule health check < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.3 × the minimum number of backend ECS instances required to handle peak traffic.

Warn

When one of the following conditions occurs at the instance dimension:

  • Layer 7 instance QPS utilization > 50%

  • Instance new connection utilization > 50%

  • Instance maximum connection utilization > 50%

  • Instance network inbound bandwidth utilization > 50%

  • Instance network outbound bandwidth utilization > 50%

  • Dropped connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value, or use CloudMonitor intelligent threshold.

  • Number of UpstreamCode5xx per second for the Layer 7 instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 0.5% of the peak QPS, or use CloudMonitor intelligent threshold.

  • Layer 7 listener RT > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value, or use CloudMonitor intelligent threshold.

When one of the following conditions occurs at the port dimension:

  • Number of unhealthy backend ECS instances for health check > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the maximum number of concurrent deployment instances for the backend application.

  • Number of unhealthy backend ECS instances for Layer 7 forwarding rule health check > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the maximum single deployment batch for the backend application.

  • Number of healthy backend ECS instances for health check < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.5 × the minimum number of backend ECS instances required to handle peak traffic.

  • Number of healthy backend ECS instances for Layer 7 forwarding rule health check < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.5 × the minimum number of backend ECS instances required to handle peak traffic.

Critical

When one of the following conditions occurs at the instance dimension:

  • Layer 7 instance QPS utilization > 85%

  • Instance new connection utilization > 85%

  • Instance maximum connection utilization > 85%

  • Instance network inbound bandwidth utilization > 85%

  • Instance network outbound bandwidth utilization > 85%

  • Dropped connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • Number of UpstreamCode5xx per second for the Layer 7 instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 1% of the peak QPS.

  • Layer 7 listener RT > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

When one of the following conditions occurs at the port dimension:

  • Number of unhealthy backend ECS instances for health check > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 2 × the maximum number of concurrent deployment instances for the backend application.

  • Number of unhealthy backend ECS instances for Layer 7 forwarding rule health check > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 2 × the maximum single deployment batch for the backend application.

  • Number of healthy backend ECS instances for health check < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.85 × the minimum number of backend ECS instances required to handle peak traffic.

  • Number of healthy backend ECS instances for Layer 7 forwarding rule health check < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.85 × the minimum number of backend ECS instances required to handle peak traffic.

NLB

Info

When one of the following conditions occurs at the instance dimension:

  • New connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • Maximum concurrent connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • Inbound bits per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • Outbound bits per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • Dropped connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

When one of the following conditions occurs at the port dimension:

  • Number of healthy backend ECS instances for listener health check < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.3 × the minimum number of backend ECS instances required to handle peak traffic.

Warn

When one of the following conditions occurs at the instance dimension:

  • New connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

  • Maximum concurrent connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

  • Inbound bits per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

  • Outbound bits per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

  • Dropped connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

When one of the following conditions occurs at the port dimension:

  • Number of unhealthy backend ECS instances for listener health check > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the maximum number of concurrent deployment instances for the backend application.

  • Number of healthy backend ECS instances for listener health check < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.5 × the minimum number of backend ECS instances required to handle peak traffic.

Critical

When one of the following conditions occurs at the instance dimension:

  • New connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • Maximum concurrent connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • Inbound bits per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • Outbound bits per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • Dropped connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

When one of the following conditions occurs at the port dimension:

  • Number of unhealthy backend ECS instances for listener health check > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 2 × the maximum number of concurrent deployment instances for the backend application.

  • Number of healthy backend ECS instances for listener health check < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.85 × the minimum number of backend ECS instances required to handle peak traffic.

ALB

Info

When one of the following conditions occurs at the loadBalancer dimension:

  • New connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • Maximum concurrent connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • Inbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • Outbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • Dropped connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • TLS handshake failures per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

  • Number of 5XX errors per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 0.1% of the peak QPS.

  • Request latency for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

When one of the following conditions occurs at the listener dimension:

  • Number of healthy servers for the listener < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.3 × the minimum number of backend ECS instances required to handle peak traffic.

Warn

When one of the following conditions occurs at the loadBalancer dimension:

  • New connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

  • Maximum concurrent connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

  • Inbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

  • Outbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

  • Dropped connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

  • TLS handshake failures per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

  • Number of 5XX errors per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 0.5% of the peak QPS.

  • Request latency for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

When one of the following conditions occurs at the listener dimension:

  • Number of unhealthy servers for the listener > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the maximum number of concurrent deployment instances for the backend application.

  • Number of healthy servers for the listener < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.5 × the minimum number of backend ECS instances required to handle peak traffic.

Critical

When one of the following conditions occurs at the loadBalancer dimension:

  • New connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • Maximum concurrent connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • Inbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • Outbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • Dropped connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • TLS handshake failures per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

  • Number of 5XX errors per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 1% of the peak QPS.

  • Request latency for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

When one of the following conditions occurs at the listener dimension:

  • Number of unhealthy servers for the listener > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 2 × the maximum number of concurrent deployment instances for the backend application.

  • Number of healthy servers for the listener < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.85 × the minimum number of backend ECS instances required to handle peak traffic.

Application-layer metrics are numerous and closely tied to the business. Continuously tune the monitored items and the thresholds for each alert level based on actual business feedback.
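
The reference settings in the NLB and ALB tables above follow a fixed ladder: 2, 5, and 10 times the observed peak for Info, Warn, and Critical, and 0.1%, 0.5%, and 1% of the peak QPS for ALB 5XX errors. As a minimal sketch, the following Python helper turns measured peaks into starting values for X. The function names and the sample peaks are hypothetical and are not part of any Alibaba Cloud SDK; the results still need to be adjusted based on actual conditions.

```python
from typing import Dict

# Multipliers and fractions copied from the reference settings in the tables above.
PEAK_MULTIPLIER = {"Info": 2, "Warn": 5, "Critical": 10}
ERROR_5XX_FRACTION = {"Info": 0.001, "Warn": 0.005, "Critical": 0.01}


def peak_based_thresholds(observed_peak: float) -> Dict[str, float]:
    """Return X per alert level for a metric whose reference is 'N times the peak value'."""
    return {level: observed_peak * m for level, m in PEAK_MULTIPLIER.items()}


def error_5xx_thresholds(peak_qps: float) -> Dict[str, float]:
    """Return the 5XX-errors-per-second threshold per alert level for an ALB instance."""
    return {level: peak_qps * f for level, f in ERROR_5XX_FRACTION.items()}


if __name__ == "__main__":
    # Hypothetical sample peaks; replace them with values measured for your workload.
    print(peak_based_thresholds(1200))  # for example, new connections per second
    print(error_5xx_thresholds(8000))   # peak QPS
```

Keeping the multipliers in one place makes it easier to regenerate all three levels together when the measured peak changes.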

6.2.3 Hybrid Disaster Recovery monitoring configuration reference

Leased line connection

If you use a leased line to connect to Alibaba Cloud, refer to the following suggestions to configure alert rules in CloudMonitor for the leased line:

Monitored object

Alert level

Monitoring metrics and conditions

Express Connect - Physical port

Info

When one of the following conditions occurs:

  • Port outbound bandwidth utilization > 30%

  • Port inbound bandwidth utilization > 30%

Warn

When one of the following conditions occurs:

  • Port outbound bandwidth utilization > 50%

  • Port inbound bandwidth utilization > 50%

Critical

When one of the following conditions occurs:

  • Physical status = DOWN

  • Port outbound bandwidth utilization > 85%

  • Port inbound bandwidth utilization > 85%

  • Port inbound error packets > X. We recommend setting it to (average inbound rate from IDC to VPC / 512 / 8) × 0.005 × 60 (that is, 0.5% of the packets in one minute).

  • Port outbound error packets > X. We recommend setting it to (average outbound rate from VPC to IDC / 512 / 8) × 0.005 × 60 (that is, 0.5% of the packets in one minute).

  • Port outbound dropped packets > X. We recommend setting it to (port specification / 512 / 8) × 0.02 × 60 (reaching 2% per minute).

  • Port inbound dropped packets > X. We recommend setting it to (port specification / 512 / 8) × 0.02 × 60 (reaching 2% per minute).

Express Connect - Virtual Border Router

Info

When one of the following conditions occurs:

  • Inbound rate from IDC to VPC > X. We recommend setting it to VBR specification bps × 0.30.

  • Outbound rate from VPC to IDC > X. We recommend setting it to VBR specification bps × 0.30.

Warn

When one of the following conditions occurs:

  • Inbound rate from IDC to VPC > X. We recommend setting it to VBR specification bps × 0.50.

  • Outbound rate from VPC to IDC > X. We recommend setting it to VBR specification bps × 0.50.

Critical

When one of the following conditions occurs:

  • Inbound rate from IDC to VPC > X. We recommend setting it to port specification bps × 0.85.

  • Outbound rate from VPC to IDC > X. We recommend setting it to port specification bps × 0.85.

  • Dropped inbound packets from IDC to VPC > X. We recommend setting it to (VBR specification / 512 / 8) × 0.02 × 60 (reaching 2% per minute).

  • Dropped outbound packets from VPC to IDC > X. We recommend setting it to (VBR specification / 512 / 8) × 0.02 × 60 (reaching 2% per minute).

  • Throttled dropped packets from VPC to VBR > X. We recommend setting it to (VBR specification / 512 / 8) × 0.02 × 60 (reaching 2% per minute).

  • VBR health check latency > X or VBR health check latency == 0. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value. (When the leased line is disconnected, the VBR health check latency metric reports 0.)

  • VBR health check packet loss rate > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 1%. (If this metric is high, first check the CoPP throttling configuration on the monitored vSwitch.)

Express Connect - Express Connect Router

Info

When one of the following conditions occurs at the Transit Router (TR) instance monitoring dimension:

  • Outbound rate from ECR to TR > ECR Attachment specification × 0.3

When one of the following conditions occurs at the cross-region connection dimension:

  • Rate of cross-region access for the ECR instance > Cross-region connection bandwidth specification × 0.3

Warn

When one of the following conditions occurs at the Transit Router (TR) instance monitoring dimension:

  • Outbound rate from ECR to TR > ECR Attachment specification × 0.5

When one of the following conditions occurs at the cross-region connection dimension:

  • Rate of cross-region access for the ECR instance > Cross-region connection bandwidth specification × 0.5

Critical

When one of the following conditions occurs at the Transit Router (TR) instance monitoring dimension:

  • Outbound rate from ECR to TR > ECR Attachment specification × 0.85

When one of the following conditions occurs at the cross-region connection dimension:

  • Rate of cross-region access for the ECR instance > Cross-region connection bandwidth specification × 0.85

  • Packet loss rate of cross-region access for the ECR instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 1%.

Express Connect - Peering connection

Info

When one of the following conditions occurs at the instance dimension:

  • Inbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.3.

  • Outbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.3.

Warn

When one of the following conditions occurs at the instance dimension:

  • Inbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.5.

  • Outbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.5.

Critical

When one of the following conditions occurs at the instance dimension:

  • Inbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.85.

  • Outbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.85.

  • Network throttling packet loss rate > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 100.

Subscribe to the following CloudMonitor system events and push alerts:

  1. Product: Express Connect - Leased line connection. Event type: Down. Event name: BGP Peer status changed from Established to Down.
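
The packet-count thresholds in the leased line table above share one pattern: (rate in bit/s / 512 / 8) × ratio × 60. One way to read it, which is an interpretation rather than something the table states, is that dividing by 8 converts bits to bytes, dividing by 512 assumes an average packet size of 512 bytes, and multiplying by 60 turns a per-second count into a per-minute count. The following Python sketch reproduces that arithmetic. The function name and the 10 Gbit/s and 2 Gbit/s sample values are hypothetical.

```python
def packets_per_minute_threshold(rate_bps: float, ratio: float,
                                 avg_packet_bytes: int = 512) -> float:
    """Compute (rate_bps / avg_packet_bytes / 8) * ratio * 60.

    rate_bps: port or VBR specification, or an average rate, in bit/s.
    ratio: tolerated fraction, for example 0.02 for 2% or 0.005 for 0.5%.
    avg_packet_bytes: assumed average packet size (512 in the table formulas).
    """
    packets_per_second = rate_bps / avg_packet_bytes / 8
    return packets_per_second * ratio * 60


if __name__ == "__main__":
    # Hypothetical 10 Gbit/s port: dropped-packet threshold at 2% per minute.
    print(packets_per_minute_threshold(10_000_000_000, 0.02))
    # Hypothetical 2 Gbit/s average rate: error-packet threshold at 0.5% per minute.
    print(packets_per_minute_threshold(2_000_000_000, 0.005))
```

If your traffic has a very different average packet size, passing a different `avg_packet_bytes` shows how sensitive the threshold is to that assumption.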

VPN Gateway

If you use a VPN Gateway to access Alibaba Cloud, refer to the following suggestions to configure alert rules in CloudMonitor for the VPN:

Monitored object

Alert level

Monitoring metrics and conditions

VPN Gateway

Info

When one of the following conditions occurs at the instance dimension:

  • VPN gateway inbound bandwidth utilization > 30%

  • VPN gateway outbound bandwidth utilization > 30%

Warn

When one of the following conditions occurs at the instance dimension:

  • VPN gateway inbound bandwidth utilization > 50%

  • VPN gateway outbound bandwidth utilization > 50%

Critical

When one of the following conditions occurs at the instance dimension:

  • VPN gateway inbound bandwidth utilization > 85%

  • VPN gateway outbound bandwidth utilization > 85%

  • Negotiation status of a single tunnel in the IPsec connection of the VPN gateway = 0 (0 is Down, 1 is Up)

Note: If you use the "IPsec connection attached to CEN/TR" method for networking, refer to the "CEN/TR global networking" section for monitoring methods.

Subscribe to the following CloudMonitor system events and push alerts:

  1. Product: VPN Gateway. Event type: Abnormal, Status Notification. Event name: Certificate expired, All IPsec connection tunnels failed to negotiate, IPsec tunnel negotiation failed, health check failed, VPN connection health check failed.
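
The VPN Gateway conditions above can be read as a simple ladder: bandwidth utilization above 30%, 50%, or 85% maps to Info, Warn, or Critical, and any tunnel whose negotiation status is 0 is Critical. The Python sketch below only illustrates that logic, for example to sanity-check sampled values offline. It is not a CloudMonitor API, and the function name and input shape are assumptions.

```python
from typing import Iterable, Optional

# Utilization thresholds in percent, taken from the VPN Gateway table above.
# Ordered from most to least severe so the first match wins.
LEVELS = [("Critical", 85.0), ("Warn", 50.0), ("Info", 30.0)]


def classify_vpn_gateway(inbound_pct: float, outbound_pct: float,
                         tunnel_states: Iterable[int]) -> Optional[str]:
    """Return the most severe matching alert level, or None if nothing matches.

    tunnel_states holds the negotiation status of each tunnel (0 is Down, 1 is Up).
    """
    # A single Down tunnel is listed as a Critical condition.
    if any(state == 0 for state in tunnel_states):
        return "Critical"
    worst = max(inbound_pct, outbound_pct)
    for level, threshold in LEVELS:
        if worst > threshold:  # the table uses strict "greater than"
            return level
    return None


if __name__ == "__main__":
    # Hypothetical samples: 35% inbound, 60% outbound, both tunnels Up.
    print(classify_vpn_gateway(35.0, 60.0, [1, 1]))
```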

CEN/TR global networking

If you use CEN/TR for global networking, refer to the following suggestions to configure alert rules in CloudMonitor for CEN/TR:

Monitored object

Alert level

Monitoring metrics and conditions

Cloud Enterprise Network - Region monitoring

Info

When one of the following conditions occurs:

  • Peak outbound bandwidth utilization between regions > 30%

  • Peak outbound bandwidth utilization of QoS queue between regions > 30%

Warn

When one of the following conditions occurs:

  • Peak outbound bandwidth utilization between regions > 50%

  • Peak outbound bandwidth utilization of QoS queue between regions > 50%

Critical

When one of the following conditions occurs:

  • Peak outbound bandwidth utilization between regions > 85%

  • Peak outbound bandwidth utilization of QoS queue between regions > 85%

  • Outbound throttling packet loss rate between regions > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 100.

  • Outbound throttling packet loss rate of QoS queue between regions > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 100.

Cloud Enterprise Network - Area monitoring

Info

When one of the following conditions occurs:

  • Peak outbound bandwidth utilization of CEN bandwidth plan > 30%

Warn

When one of the following conditions occurs:

  • Peak outbound bandwidth utilization of CEN bandwidth plan > 50%

Critical

When one of the following conditions occurs:

  • Peak outbound bandwidth utilization of CEN bandwidth plan > 85%

Cloud Enterprise Network - Transit Router (configure when using Enterprise Edition)

Info

When one of the following conditions occurs at the Transit Router (TR) instance AZ-level monitoring dimension:

  • TR inbound traffic rate > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value.

When one of the following conditions occurs at the Transit Router (TR) connection AZ-level monitoring dimension:

  • Inbound rate > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the connection bandwidth specification × 0.3.

Warn

When one of the following conditions occurs at the Transit Router (TR) instance AZ-level monitoring dimension:

  • TR inbound traffic rate > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value.

When one of the following conditions occurs at the Transit Router (TR) connection AZ-level monitoring dimension:

  • Inbound rate > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the connection bandwidth specification × 0.5.

Critical

When one of the following conditions occurs at the Transit Router (TR) instance AZ-level monitoring dimension:

  • TR inbound traffic rate > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value.

When one of the following conditions occurs at the Transit Router (TR) connection AZ-level monitoring dimension:

  • Inbound rate > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the connection bandwidth specification × 0.85.

Note:

  1. For more information about TR connection bandwidth specifications, see Limits.

Subscribe to the following CloudMonitor system events and push alerts:

  1. Product: Cloud Enterprise Network. Event: 90%QuotaExceeded. Event name: Event for exceeding 90% of quota.

If you create a VPN Attachment on a TR, refer to the following suggestions to configure alert rules in CloudMonitor for the VPN connection:

Monitored object

Alert level

Monitoring metrics and conditions

VPN connection

Info

When one of the following conditions occurs:

  • VPN connection single tunnel inbound bandwidth > 300 Mbit/s (VPN Attachment bandwidth specification × 0.3)

  • VPN connection single tunnel outbound bandwidth > 300 Mbit/s (VPN Attachment bandwidth specification × 0.3)

  • VPN connection single tunnel inbound packet rate + VPN connection single tunnel outbound packet rate > 36,000 (VPN Attachment packet rate specification × 0.3)

Warn

When one of the following conditions occurs:

  • VPN connection single tunnel inbound bandwidth > 500 Mbit/s (VPN Attachment bandwidth specification × 0.5)

  • VPN connection single tunnel outbound bandwidth > 500 Mbit/s (VPN Attachment bandwidth specification × 0.5)

  • VPN connection single tunnel inbound packet rate + VPN connection single tunnel outbound packet rate > 60,000 (VPN Attachment packet rate specification × 0.5)

Critical

When one of the following conditions occurs:

  • VPN connection single tunnel inbound bandwidth > 850 Mbit/s (VPN Attachment bandwidth specification × 0.85)

  • VPN connection single tunnel outbound bandwidth > 850 Mbit/s (VPN Attachment bandwidth specification × 0.85)

  • VPN connection single tunnel inbound packet rate + VPN connection single tunnel outbound packet rate > 102,000 (VPN Attachment packet rate specification × 0.85)

VPN gateway

Critical

When one of the following conditions occurs at the vpnconnection dimension:

  • Negotiation status of a single tunnel in the VpnAttachment = 0 (0 is Down, 1 is Up)

Note:

  1. For more information about VPN connection specifications, see Quotas and limits.
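
The fixed numbers in the VPN connection table above are the VPN Attachment specification multiplied by 0.3, 0.5, and 0.85. The sketch below recomputes them so that the thresholds can be regenerated if the specification of your attachment differs. The specification values used in the example (1,000 for bandwidth, read here as Mbit/s, and 120,000 for packet rate) are back-calculated from the table's own figures and are assumptions; confirm the actual values in Quotas and limits. The function and key names are hypothetical.

```python
from typing import Dict

# Factors per alert level, as used in the VPN connection table above.
FACTORS = {"Info": 0.3, "Warn": 0.5, "Critical": 0.85}


def vpn_attachment_thresholds(bandwidth_spec_mbps: float,
                              packet_rate_spec: float) -> Dict[str, Dict[str, int]]:
    """Return per-level thresholds for single-tunnel bandwidth and packet rate.

    The packet rate threshold applies to inbound plus outbound packet rate,
    matching the condition in the table.
    """
    return {
        level: {
            "tunnel_bandwidth_mbps": round(bandwidth_spec_mbps * factor),
            "tunnel_packet_rate_in_plus_out": round(packet_rate_spec * factor),
        }
        for level, factor in FACTORS.items()
    }


if __name__ == "__main__":
    # Assumed specification values, inferred from the table (300 / 0.3 and 36,000 / 0.3).
    for level, values in vpn_attachment_thresholds(1000, 120_000).items():
        print(level, values)
```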

7. Hands-on guide

To integrate cloud service monitoring metrics into ARMS Prometheus, configure custom dashboards, and configure alerts, see Cloud Service Observability.

8. Appendix

Dashboard configuration method

1. Alibaba Cloud CloudMonitor Prometheus

  • Data ingestion: Go to Application Real-Time Monitoring Service (ARMS) > Integration Center, select the corresponding product (such as EIP or ALB), and then follow the prompts to complete the integration.

  • Custom dashboard: Go to Application Real-Time Monitoring Service (ARMS) > Provisioning > Cloud Service Integration Environment, select the corresponding product, and then refer to the reference cases in Section 6 to customize the dashboard.

2. Data ingestion for non-Alibaba Cloud CloudMonitor Prometheus (self-managed or third-party)

  • Deploy a lightweight Prometheus instance on an Alibaba Cloud ECS instance or in an ACK cluster, or run Prometheus in Agent mode.

  • Configure a plugin to collect Alibaba Cloud resource metrics (through an exporter, the API, or logs). For more information, see the open source aliyun_exporter plugin.

  • In the prometheus.yml file, configure `remote_write` to point to the `/api/v1/write` endpoint of your self-managed Prometheus.

  • Restart Prometheus. The collected data is then sent to your self-managed instance.
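
As an illustration of the `remote_write` step above, the following Python sketch renders the corresponding prometheus.yml fragment for a given receiver address. The helper name and the host name in the example are hypothetical; only the `remote_write` key, its `url` field, and the `/api/v1/write` path come from Prometheus itself. Depending on the Prometheus version, the receiving instance may also need remote-write receiving enabled (for example with the `--web.enable-remote-write-receiver` flag), so check the documentation for your version.

```python
def render_remote_write(base_url: str) -> str:
    """Return a prometheus.yml fragment that forwards samples to base_url.

    base_url is the address of the self-managed Prometheus that receives
    the data, without the /api/v1/write path.
    """
    endpoint = base_url.rstrip("/") + "/api/v1/write"
    return (
        "remote_write:\n"
        f"  - url: \"{endpoint}\"\n"
    )


if __name__ == "__main__":
    # Hypothetical receiver address; replace it with your own instance.
    print(render_remote_write("http://prometheus.example.internal:9090"))
```

Paste the printed fragment at the top level of prometheus.yml on the collecting instance, then restart it as described in the last step.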

3. Data ingestion for other monitoring platforms

  • You need to develop your own data ingestion solution. You can use the collection plugin for non-Alibaba Cloud CloudMonitor Prometheus as a reference.

Alert configuration method

1. Alibaba Cloud CloudMonitor Prometheus

  • Go to Application Real-Time Monitoring Service (ARMS) > Alert Rules > Create Alert Rule. Make sure to select the region where the Prometheus instance is located.

  • Configure the rule name, Prometheus instance, custom PromQL, severity level, and alert threshold, in that order.

2. Alibaba Cloud CloudMonitor

  • Go to Alibaba Cloud CloudMonitor > Alert Service > Alert Rules > Create Alert Rule.

  • Select a product.

  • Create the rule and define its Critical, Warn, and Info conditions.

Suggestions

You can organize your team to conduct regular network inspections (monthly or even weekly) and implement a clear optimization plan until all risks are eliminated.

Suggestions

The value of a tool lies in its practical application, but you must prepare in advance. Only through continual learning and practice can you ensure that the tool is effective when issues arise.