The Intelligent Network O&M Solution helps customers optimize cloud network performance, improve fault localization efficiency, and reduce O&M costs through global monitoring, intelligent alerting, regular inspections, and analysis tools. (Cloud Network Well-architected Design Guidelines, Alibaba Cloud Help Center)

This document introduces the design goals and common scenarios of the Alibaba Cloud Intelligent Network O&M Solution. It explains how to achieve efficient, proactive, and intelligent network operations and management (O&M) using four key methods: using a central dashboard to obtain a global overview, using alerts to quickly detect and locate issues, using inspections to proactively find and eliminate risks, and using tools to analyze and resolve root causes. This document also provides guidance on selecting monitoring and event platforms that support these capabilities. It includes detailed instructions for configuring dashboards and alerts for various Alibaba Cloud networking services to help you apply these practices to your business.

1. Background and requirements

As digital transformation deepens, the cloud network has become the core infrastructure that supports business operations. However, the growing complexity, dynamic nature, and scale of cloud environments present unprecedented challenges to traditional network O&M:

Lack of a global view: Network resources are scattered across different products, which prevents a unified view for monitoring, analysis, and alerting. This fragmentation makes global optimization difficult.
Increased complexity: The widespread adoption of technologies such as hybrid cloud, multicloud architectures, microservices, and containerization (such as ACK) makes network topologies increasingly complex and difficult for traditional manual O&M to manage.
Difficulty in finding performance bottlenecks: Issues such as traffic bursts, bandwidth bottlenecks, and latency jitter are hard to detect and predict in real-time. This affects the user experience.
Difficulty in fault localization: Troubleshooting network link issues across regions, VPCs, accounts, and products is time-consuming. It often relies on experience, resulting in low fault localization efficiency.
Increased security risks: The attack surface is expanding. Incorrect configurations or improper change management of security policies, such as security groups and network ACLs, can lead to security vulnerabilities.
High O&M costs: Relying on a large amount of manual effort for daily inspections, fault response, and configuration management is inefficient and results in high labor costs.

The Intelligent Network O&M Solution aims to help customers effectively adopt, use, and manage the cloud. The solution focuses on guiding customers through their daily O&M tasks, such as monitoring network metrics, identifying network risks, and analyzing and resolving network anomalies. It also helps them upgrade and optimize network performance to meet the new requirements introduced by business iterations.

2. Target customers

This solution is suitable for the following types of customers:

Large enterprises and group customers: Customers with complex hybrid cloud or multicloud architectures, multi-region deployments, and many VPCs and network resources. These customers have extremely high requirements for network stability, security, and O&M efficiency.
Internet and technology companies: Businesses with fast iterations and large traffic fluctuations. These businesses have strong demands for network performance, elasticity, and automatic fault recovery, and they strive for high DevOps/NetOps efficiency.
Customers in key industries such as finance, government, and healthcare: Customers with extremely strict requirements for network high availability and security compliance. They need to meet rigorous audit and regulatory requirements.
Traditional enterprises undergoing digital transformation: Enterprises that are migrating from traditional data centers to the cloud and need to quickly establish modern network O&M capabilities.
IT/Network O&M teams: Teams that want to use intelligent tools to improve O&M efficiency, reduce failure rates, and free up human resources to focus on higher-value work.

3. Solution overview

The Intelligent Network O&M Solution recommends four main methods for customers to perform cloud network O&M:

Use a central "dashboard" to gather data and understand the global situation.
Rely on "alerts" to detect and locate issues.
Perform daily "inspections" to find and eliminate potential risks.
Use "tools" to analyze and resolve root causes.

3.1 Use a central "dashboard" to gather and understand the global situation

A network dashboard is more than just a data visualization board. It is a cloud network operational hub that integrates monitoring, analysis, decision-making, and collaboration. A dashboard should be designed with the following guidelines:

A network dashboard provides data support for specific roles to solve specific problems:

Business owners: Focus on overall availability, Service-Level Agreement (SLA) achievement, and cost trends.
O&M engineers: Focus on fault alerts, link status, and performance anomalies.
Architecture teams: Focus on topology, capacity usage, and scaling bottlenecks.

Suggestion: You can design different views for different roles, such as an "O&M View", "Business View", and "Architecture View".

Display information in layers based on the network architecture to avoid information overload:

Internet access layer: EIPs, Internet Shared Bandwidth, NAT Gateway, and others.
Application delivery layer: CLB, ALB, NLB, GA, and others.
Global networking layer: VPN Gateway, Express Connect, TR, CEN, and others.

Suggestions: 1) Support a three-level top-down drill-down from "Network Product Overview" to "Specific Product Overview" to "Single Instance Details". This lets you move from the overall picture to specific details. 2) Place related metrics on the same chart to increase information density. 3) Use good naming conventions, resource group divisions, and tags to help locate issues quickly.

Focus on key services and metrics. Place important metrics on the dashboard. Other metrics do not require daily attention and are used for analysis only when related issues occur:

Traffic: Peak inbound and outbound bandwidth, traffic trends, and others.
Availability: SLB health check success rate, status of leased line, VPN, or cross-region links, and others.
Performance: Latency, response time, bandwidth utilization, packet loss rate, and others.
Cost: Number of EIPs, CDT bandwidth costs, network element CU costs, and others.

Suggestion: You can use "red/yellow/green" colors on the dashboard to indicate health status.

3.2 Rely on "alerts" to detect and locate problems

Event subscription mechanism: Subscribe to events that affect your business and set up an alert mechanism. This step helps you discover system anomalies, performance issues, or security threats as soon as they occur.
Immediate response process for critical alerts: Create a strict emergency response plan. For alerts marked as "Critical", have a clear plan and assign a dedicated person to coordinate and handle the issue until it is fully resolved.
Regularly check the Event Center: Set up a regular check-in schedule to review the history in the Event Center. Analyzing this data can help you identify trending issues or underlying risks in advance and take preventive measures to avoid service interruptions.

3.3 Rely on "inspections" to find and eliminate potential risks

Inspections are a crucial part of building and maintaining a network architecture. First, identify and understand the potential risks, which fall into four categories:

Stability risks: Mainly caused by incorrect primary/secondary configurations, which can prevent a smooth failover during a failure and affect normal system operations. Unreasonable resource deployment can also lead to a large blast radius, increasing the possibility of a system crash.
Security risks: Vulnerabilities in network ACL configurations and overly permissive security group rules can create security holes and leave the network environment exposed.
Performance risks: Network path detours increase data transmission latency, and frequent traffic overruns indicate that the system may need to be scaled out to meet growing demand.
Cost waste: Low resource utilization and an unsuitable choice among multiple billing methods can result in unnecessary expenses.

To effectively manage these risks, you need to perform regular inspections. In the NIS console, you can conduct network inspections, view historical reports, and initiate new inspections as needed. This process generates a detailed inspection report. We recommend running it once a week to promptly discover and address potential issues. Once an issue is found, you should immediately enter the risk handling stage. In this stage, you can use the NIS console and network inspection tools to view detailed reports, obtain network optimization suggestions, and take corresponding measures to mitigate the risks based on these suggestions. For example, for stability risks, you can optimize primary/secondary configurations and resource deployment. For security risks, you can patch network ACL configuration vulnerabilities and adjust security group permissions. For performance risks, you can optimize network paths and perform necessary scale-outs. For cost waste, you can improve resource utilization and choose more suitable billing methods.

3.4 Use "tools" to analyze and resolve root causes

NIS: Network Intelligence Service (NIS) is an intelligent network service product launched by Alibaba Cloud. It is based on years of large-scale network O&M practices and technological expertise and is designed for complex network scenarios. It integrates network measurement, diagnosis, and optimization to provide end-to-end network observability and intelligent analysis capabilities. NIS helps users quickly locate connectivity, performance, and fault issues across regions and network domains to achieve a cloud network O&M experience that is "visible, fast to check, and manageable". NIS provides a rich set of tools to analyze and resolve root causes. When you observe anomalies on the dashboard, when an alert occurs, or when an inspection report provides optimization suggestions, you can use the tools provided by NIS to perform the following functions:

Instance diagnosis: You can detect the configuration and running status of an instance and receive quick fixes based on the diagnosed anomalies.
Path analysis: You can analyze end-to-end network connectivity and diagnose connection problems caused by network configuration errors. When a destination is unreachable, you can identify the location and cause of the block.
Network traffic analysis: You can monitor real-time and historical traffic data and metrics in the network to understand the performance and load of your network applications. You can keep this feature enabled to continuously monitor and analyze network traffic based on data such as throughput, packet loss, latency, and user distribution. This helps O&M engineers optimize the business architecture based on traffic conditions.
Network Insight: You can analyze the real-time operational status of business unit traffic to help you promptly perceive business network anomalies. It also provides network quality assessment data and event impact analysis.
Network topology: You can quickly understand your Alibaba Cloud network architecture, perform network configuration validation, and conduct unified O&M for your cloud network resources.
Performance monitoring: This feature provides average network latency data within Alibaba Cloud and across the Internet. This helps you choose a region or zone when you set up services.

4. Product portfolio

4.1 Monitoring platform comparison

There are many types of monitoring platforms. We divide them into three main categories based on their ecosystem: Alibaba Cloud CloudMonitor, Prometheus monitoring, and other monitoring platforms (such as Zabbix, the ELK Stack (Elasticsearch, Logstash, and Kibana), and OpenTelemetry). These are further divided into five subcategories: CloudMonitor Basic and Hybrid Cloud Monitoring (both in the CloudMonitor category), ARMS Prometheus+Grafana and self-managed Prometheus+Grafana (both in the Prometheus category), and other monitoring platforms. The comparison is as follows:

Platform	Advantages	Disadvantages	Description
CloudMonitor Basic	Out-of-the-box, no configuration needed; basic metrics (ECS, RDS, SLB) are free; simple interface	Does not support cross-region aggregation; does not support configuration files, weak automation capabilities; limited support for custom metrics; weak visualization capabilities (few chart types); cannot integrate non-Alibaba Cloud resources	Single-region deployment; mainly for Alibaba Cloud services; simple scenarios
Hybrid Cloud Monitoring	Supports cross-region resource aggregation; can create advanced custom dashboards; supports more cloud service metrics; supports batch monitoring of hundreds of instances	Does not support configuration files, weak automation capabilities; does not support advanced query languages such as PromQL; limited Kubernetes monitoring capabilities; cannot integrate non-Alibaba Cloud resources	Mainly for Alibaba Cloud services; multi-region deployment
ARMS Prometheus+Grafana	Natively integrated with Grafana for powerful visualization; supports PromQL for flexible queries; deeply integrated with Kubernetes; supports Remote Write to aggregate multiple clusters; supports custom business metrics (SDK/Exporter); unified monitoring, supports multicloud/hybrid cloud scenarios	Steeper learning curve; cost is based on data write volume and storage duration	Recommended as the default choice
Self-managed Prometheus+Grafana	Natively integrated with Grafana for powerful visualization; supports PromQL for flexible queries; deeply integrated with Kubernetes; supports custom business metrics (SDK/Exporter); unified monitoring, supports multicloud/hybrid cloud scenarios	Steeper learning curve	For existing self-managed Prometheus deployments
Other monitoring platforms	Extremely flexible and customizable; unified monitoring, supports multicloud/hybrid cloud scenarios	Complex architecture, difficult to integrate; extremely high learning and maintenance costs; requires a professional team for support; difficult to integrate with Alibaba Cloud native services	For existing deployments of other monitoring platforms; requires unified monitoring, logging, and tracing; requires self-developed data collection plugins for Alibaba Cloud

In practice, the monitoring platforms that teams run side by side are highly fragmented. Teams that invest little in O&M tend to adopt monitoring platforms passively, which produces this fragmentation. Teams with professional O&M staff try to unify their monitoring platforms but still cannot avoid it entirely. According to the Observability Survey 2024, 70% of teams use four or more different monitoring platforms. Prometheus+Grafana is the de facto standard in the current cloud-native ecosystem. It has an active community, rich documentation, convenient configuration, powerful features, and a high degree of integration, making it suitable for most modern application monitoring scenarios.

Alibaba Cloud's ARMS Prometheus is a managed Prometheus + Grafana service. It eliminates the complexity of deploying, maintaining, and scaling a monitoring system yourself. It is out-of-the-box and supports large-scale metric collection, storage, and visual analytics. It is deeply integrated with the Alibaba Cloud ecosystem and can seamlessly connect with various cloud resources and applications such as ECS, Container Service for Kubernetes (ACK), Serverless, and Application Real-Time Monitoring Service (ARMS), achieving full-stack observability. At the same time, ARMS Prometheus provides powerful alerting capabilities. It supports flexible alert rule configuration based on PromQL and can accurately detect metric anomalies such as increased latency, rising error rates, and resource usage exceeding limits. Alert rules support multi-level thresholds, duration judgments, grouping, and deduplication, effectively avoiding false positives and alert storms. When an alert is triggered, it can notify on-call personnel in real-time through various channels such as DingTalk, text messages, email, and Webhooks. It also integrates with the Alibaba Cloud alert center and event center to achieve unified management and a closed-loop response for the alert lifecycle. This helps teams quickly discover and handle potential risks, ensuring business stability.

4.2 Event platform comparison

Both NIS and CloudMonitor can generate "alert" events. The differences between them are as follows:

	CloudMonitor	NIS
Scope	All cloud products	Network products
Type	Predefined system events + user-configured thresholds	Predefined system events
Requires configuration	Yes	No
Generation logic	Predefined system events: Key events such as exceeding specifications, interruptions, and anomalies. User-configured thresholds: Alerts are generated based on the customer's judgment of how a specific metric affects the system. The customer configures these alerts.	Only predefined system events
Custom threshold	Yes	No
Comprehensiveness	Comprehensive	Limited
Notification methods	Supported: phone, email, DingTalk, Lark, WeCom, SLS, Message Queue for Light-weight IoT, and Function Compute	Not supported directly. You can push NIS events to CloudMonitor events and then use CloudMonitor's notification methods.

A typical CloudMonitor alert is: When the real-time bandwidth of an EIP reaches 5 Gbps, the application's O&M engineer is notified.
A typical NIS alert is: When the bandwidth of an EIP bandwidth plan reaches 95% of its specification, an NIS event is generated. The user can notify the application's O&M engineer through CloudMonitor or poll for this event through an API for automated processing.

CloudMonitor supports system events and user-defined threshold alerts for all cloud products. It has complete alert configuration capabilities, such as metric thresholds, duration judgments, and multi-level alerts. In contrast, NIS only provides predefined system events for network products and does not support user-configured thresholds. Therefore, CloudMonitor is more suitable to serve as an enterprise-level unified alert center.
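
The two example alerts above differ mainly in who defines the threshold. The following Python sketch contrasts them under simple assumptions: the `BandwidthSample` structure, the function names, and the sample values are hypothetical and are not CloudMonitor or NIS APIs. Only the 5 Gbps threshold and the 95% ratio come from the examples above.

```python
from dataclasses import dataclass

GBPS = 1_000_000_000  # bits per second in 1 Gbps


@dataclass
class BandwidthSample:
    """One bandwidth observation for an instance (hypothetical structure)."""
    instance_id: str
    rate_bps: float  # observed real-time bandwidth
    spec_bps: float  # purchased bandwidth specification


def user_threshold_alert(sample: BandwidthSample, threshold_bps: float) -> bool:
    """CloudMonitor-style rule: the user chooses an absolute threshold."""
    return sample.rate_bps >= threshold_bps


def predefined_spec_event(sample: BandwidthSample, ratio: float = 0.95) -> bool:
    """NIS-style event: fires at a fixed share of the specification."""
    return sample.rate_bps >= ratio * sample.spec_bps


# Assumed sample: 4.8 Gbps observed on a 5 Gbps specification.
sample = BandwidthSample("eip-example", rate_bps=4.8 * GBPS, spec_bps=5 * GBPS)
print("user threshold alert:", user_threshold_alert(sample, threshold_bps=5 * GBPS))
print("predefined spec event:", predefined_spec_event(sample))
```

In this sample the predefined event fires before the user-configured threshold does, which illustrates why the two sources are usually combined rather than treated as substitutes.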

5. Scenarios

Core application scenarios include the following:

Network performance monitoring and optimization.
Anomaly detection and intelligent alerting.
Quick fault localization and root cause analysis.
Resource utilization analysis and cost optimization.
Automated network O&M.
Unified management of multicloud/hybrid cloud networks.

6. Configuration reference manual

6.1 Dashboard configuration reference

The following section introduces typical dashboard designs for cloud networks, layered by network architecture.

6.1.1 Public network business dashboard

Elastic IP Address (EIP) is the standard way Alibaba Cloud products provide public network access. Most network services support binding a user-provided EIP. However, for historical reasons, some products also support public network services without a user-provided EIP. See the following table and suggestions.

Product	Public network type	Suggestion
ECS	EIP; public IP	Use EIPs uniformly: use EIPs for ECS, and use private network SLB instances with EIPs.
SLB	EIP: Private network SLB instances support binding EIPs to provide services over the public network. SLB public IP: Public network SLB instances.
ALB/NLB/NAT	EIP
VPN	Public VPN gateways use non-user EIPs.	Set up a separate dashboard.
GA	GA uses non-user EIPs.	Set up a separate dashboard.

For statistical metrics, the inbound and outbound bandwidth/traffic on an EIP always represents the actual bandwidth/traffic transmitted on that EIP. However, the object to monitor for utilization, packet loss, and billing differs based on the combination of EIP + "added to Internet Shared Bandwidth" + "CDT enabled". Refer to the following table for details.

Product	Added to Internet Shared Bandwidth (cbwp)	CDT enabled	Description	Monitored object
EIP	No	No	Bandwidth throttling granularity: single EIP. Billing: single EIP.	Utilization/throttling packet loss: on the EIP. Billable item: on the EIP.
EIP	No	Yes	Bandwidth throttling granularity: single EIP. Billing: CDT.	Utilization/throttling packet loss: on the EIP. Billable item: on CDT.
EIP	Yes	No	Bandwidth throttling granularity: cbwp. The EIP's original peak bandwidth no longer applies; the peak bandwidth of the Internet Shared Bandwidth instance applies instead. Billing: cbwp.	Utilization/throttling packet loss: on the cbwp. Billable item: on the cbwp.
EIP	Yes	Yes	Bandwidth throttling granularity: cbwp. The EIP's original peak bandwidth no longer applies; the peak bandwidth of the Internet Shared Bandwidth instance applies instead. Billing: CDT.	Utilization/throttling packet loss: on the cbwp. Billable item: on CDT.
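
The four combinations in the table reduce to two independent rules, which the following Python sketch encodes. The function and field names are illustrative assumptions; only the mapping itself is taken from the table.

```python
from typing import NamedTuple


class MonitoredObjects(NamedTuple):
    """Where to look for each kind of metric (names are illustrative)."""
    throttling_granularity: str
    utilization_and_loss_on: str
    billable_item_on: str


def monitored_objects(in_shared_bandwidth: bool, cdt_enabled: bool) -> MonitoredObjects:
    # Throttling, utilization, and throttling packet loss follow the shared
    # bandwidth instance (cbwp) once the EIP has been added to one.
    limiter = "cbwp" if in_shared_bandwidth else "EIP"
    # Billing moves to CDT whenever CDT is enabled; otherwise it follows the limiter.
    billing = "CDT" if cdt_enabled else limiter
    return MonitoredObjects(limiter, limiter, billing)


for in_cbwp in (False, True):
    for cdt in (False, True):
        print(in_cbwp, cdt, monitored_objects(in_cbwp, cdt))
```

A dashboard generator could use such a lookup to decide whether an EIP's utilization panel should query the EIP itself or the shared bandwidth instance it belongs to.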

Public network dashboard design reference

The EIP dashboard supports filtering instances by region, resource group, instance ID, or IP address. Each row in the instance table links to that instance's monitoring page.

Panel name	Type	Metric	Axis	Description
Total EIP rate	Time series	Inbound rate: Sum	Left
		Outbound rate: Sum	Left
		Throttling packet loss rate: Sum	Right	Mark as red if > 100
EIP bandwidth utilization	Time series	Inbound bandwidth utilization: Max, Min, Avg	Left	Mark as yellow if > 50 Mark as red if > 80
EIP bandwidth utilization	Time series	Outbound bandwidth utilization: Max, Min, Avg	Left	Mark as yellow if > 50 Mark as red if > 80
Top N instances by EIP throttling packet loss	Table	Throttling packet loss rate		Mark as red if > 100
Top N instances by EIP inbound bandwidth utilization	Table	Inbound bandwidth utilization		Mark as yellow if > 50 Mark as red if > 80
Top N instances by EIP outbound bandwidth utilization	Table	Outbound bandwidth utilization		Mark as yellow if > 50 Mark as red if > 80
Top N instances by EIP inbound rate	Table	Inbound rate
Top N instances by EIP outbound rate	Table	Outbound rate
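
The color markers and Top N tables above can be sketched in a few lines of Python. This is a minimal illustration, not a dashboard implementation: the instance IDs and utilization values are made up, and the thresholds of 50 and 80 are the ones listed in the table.

```python
from typing import Dict, List, Tuple


def utilization_status(percent: float, yellow: float = 50.0, red: float = 80.0) -> str:
    """Map a bandwidth utilization percentage to a dashboard color."""
    if percent > red:
        return "red"
    if percent > yellow:
        return "yellow"
    return "green"


def top_n(values: Dict[str, float], n: int = 3) -> List[Tuple[str, float]]:
    """Return the n instances with the highest metric value, largest first."""
    return sorted(values.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical outbound bandwidth utilization per EIP, in percent.
outbound_utilization = {"eip-a": 91.0, "eip-b": 42.5, "eip-c": 63.0, "eip-d": 12.0}

for instance_id, percent in top_n(outbound_utilization):
    print(instance_id, percent, utilization_status(percent))
```

The comparisons are strict, so a value of exactly 80 is still marked yellow, matching the "> 80" wording in the table.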

The Internet Shared Bandwidth dashboard supports filtering instances by region, resource group, instance ID, or IP address. Each row in the instance table links to that instance's monitoring page.

Panel name	Type	Metric	Axis	Description
Total EBWP rate	Time series	Inbound rate: Sum	Left
		Outbound rate: Sum	Left
		Throttling packet loss rate: Sum	Right	Mark as red if > 100
EBWP bandwidth utilization	Time series	Inbound bandwidth utilization: Max, Min, Avg	Left	Mark as yellow if > 50 Mark as red if > 80
EBWP bandwidth utilization	Time series	Outbound bandwidth utilization: Max, Min, Avg	Left	Mark as yellow if > 50 Mark as red if > 80
Top N instances by EBWP throttling packet loss	Table	Throttling packet loss rate > 0		Mark as red if > 100
Top N instances by EBWP inbound bandwidth utilization	Table	Inbound bandwidth utilization > 30		Mark as yellow if > 50 Mark as red if > 80
Top N instances by EBWP outbound bandwidth utilization	Table	Outbound bandwidth utilization > 30		Mark as yellow if > 50 Mark as red if > 80
Top N instances by EBWP inbound rate	Table	Inbound rate
Top N instances by EBWP outbound rate	Table	Outbound rate

6.1.2 Network element business dashboard

SLB dashboard design reference

SLB has over 54 monitoring metrics. We classify them by horizontal and vertical dimensions.

The horizontal dimension can be divided into SLB listener granularity and SLB instance granularity. Most statistical metrics have both listener-granularity and instance-granularity versions. However, resource utilization metrics are available only at the instance granularity, and health check metrics are available only at the listener granularity.
- Listener-granularity metric names do not contain "Instance". For example, AliyunSlb_ActiveConnection is the statistic for the number of active connections of a specific SLB listener.
- Instance-granularity metric names contain "Instance". For example, AliyunSlb_InstanceActiveConnection is the sum of active connection statistics for the entire SLB instance.
The vertical dimension can be divided into two main categories, Layer 4 and Layer 7, each with four subcategories.
- Layer 4 statistical metrics are divided into four categories: health check, resource utilization, traffic, and connections.
- Layer 7 statistical metrics are divided into four categories: resource utilization, response time, status code, and others.
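
The naming convention described above can be checked programmatically, for example when generating dashboard panels from a list of metric names. The helper below is a hypothetical sketch that relies only on the stated rule (instance-granularity names contain "Instance"); it is not part of any Alibaba Cloud SDK.

```python
def slb_metric_granularity(metric_name: str) -> str:
    """Classify an SLB metric name as listener or instance granularity."""
    prefix = "AliyunSlb_"
    if not metric_name.startswith(prefix):
        raise ValueError(f"not an SLB metric name: {metric_name}")
    # Only the part after the product prefix is inspected.
    suffix = metric_name[len(prefix):]
    return "instance" if "Instance" in suffix else "listener"


print(slb_metric_granularity("AliyunSlb_ActiveConnection"))
print(slb_metric_granularity("AliyunSlb_InstanceActiveConnection"))
```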

	Classification	Listener granularity	Instance granularity
Layer 4	Health check	AliyunSlb_HealthyServerCount AliyunSlb_UnhealthyServerCount
	Resource utilization		AliyunSlb_InstanceNewConnectionUtilization - New connection utilization of the instance AliyunSlb_InstanceMaxConnectionUtilization - Maximum connection utilization of the instance AliyunSlb_InstanceTrafficRXUtilization - Received traffic utilization of the instance AliyunSlb_InstanceTrafficTXUtilization - Sent traffic utilization of the instance
	Traffic	AliyunSlb_TrafficRXNew - New received traffic AliyunSlb_TrafficTXNew - New sent traffic AliyunSlb_DropTrafficRX - Dropped traffic at the receiver AliyunSlb_DropTrafficTX - Dropped traffic at the sender AliyunSlb_PacketRX - Received packets AliyunSlb_PacketTX - Sent packets AliyunSlb_DropPacketRX - Dropped packets at the receiver AliyunSlb_DropPacketTX - Dropped packets at the sender	AliyunSlb_InstanceTrafficRX - Received traffic of the instance AliyunSlb_InstanceTrafficTX - Sent traffic of the instance AliyunSlb_InstanceDropTrafficRX - Dropped traffic at the receiver of the instance AliyunSlb_InstanceDropTrafficTX - Dropped traffic at the sender of the instance AliyunSlb_InstancePacketRX - Received packets of the instance AliyunSlb_InstancePacketTX - Sent packets of the instance AliyunSlb_InstanceDropPacketRX - Dropped packets at the receiver of the instance AliyunSlb_InstanceDropPacketTX - Dropped packets at the sender of the instance
	Connections	AliyunSlb_ActiveConnection - Current active connections AliyunSlb_InactiveConnection - Inactive connections AliyunSlb_NewConnection - New connections AliyunSlb_MaxConnection - Maximum connections AliyunSlb_DropConnection - Dropped connections	AliyunSlb_InstanceActiveConnection - Current active connections of the instance AliyunSlb_InstanceInactiveConnection - Inactive connections of the instance AliyunSlb_InstanceNewConnection - New connections of the instance AliyunSlb_InstanceMaxConnection - Maximum connections of the instance AliyunSlb_InstanceDropConnection - Dropped connections of the instance
Layer 7	Utilization		AliyunSlb_InstanceQpsUtilization - QPS utilization of the instance
	Response time	AliyunSlb_Rt - Response time	AliyunSlb_InstanceRt - Response time of the instance AliyunSlb_InstanceUpstreamRt - Backend response time of the instance
	Status code	AliyunSlb_StatusCode2xx - Number of requests with HTTP status code 2xx AliyunSlb_StatusCode3xx - Number of requests with HTTP status code 3xx AliyunSlb_StatusCode4xx - Number of requests with HTTP status code 4xx AliyunSlb_StatusCode5xx - Number of requests with HTTP status code 5xx AliyunSlb_StatusCodeOther - Number of requests with other HTTP status codes AliyunSlb_UpstreamCode4xx - Number of times the backend returned a 4xx status code AliyunSlb_UpstreamCode5xx - Number of times the backend returned a 5xx status code	AliyunSlb_InstanceStatusCode2xx - Number of requests with HTTP status code 2xx for the instance AliyunSlb_InstanceStatusCode3xx - Number of requests with HTTP status code 3xx for the instance AliyunSlb_InstanceStatusCode4xx - Number of requests with HTTP status code 4xx for the instance AliyunSlb_InstanceStatusCode5xx - Number of requests with HTTP status code 5xx for the instance AliyunSlb_InstanceStatusCodeOther - Number of requests with other HTTP status codes for the instance AliyunSlb_InstanceUpstreamCode4xx - Number of times the backend of the instance returned a 4xx status code AliyunSlb_InstanceUpstreamCode5xx - Number of times the backend of the instance returned a 5xx status code
	Other	AliyunSlb_Qps - Queries per second (QPS)	AliyunSlb_InstanceQps - Queries per second (QPS) of the instance

CLB dashboard design reference

The CLB dashboard supports filtering instances by region, resource group, instance ID, or instance name. Each row in the instance table links to that instance's monitoring page.

Because CLB metrics exist at multiple granularities, we recommend choosing the Top N dimension as follows:

Utilization metrics are available only at the instance level, so display instances for them.
If a CLB instance serves a single business, display the instance in the Top N.
If a CLB instance serves multiple businesses, display its listeners in the Top N.

Panel name	Type	Metric	Axis	Description
Total SLB rate	Time series	Inbound rate: Sum	Left
		Outbound rate: Sum	Left
		Inbound drop rate: Sum	Right	Mark as red if > 100
		Outbound drop rate: Sum	Right	Mark as red if > 100
Total SLB connections	Time series	Active connections: Sum	Left
		Inactive connections: Sum	Left
		New connections: Sum	Left
		Maximum connections: Sum	Left
		Dropped connections: Sum	Right	Set yellow and red markers
Health check	Time series	Healthy servers: Sum	Left	Mark as green
Health check	Time series	Unhealthy servers: Sum	Left	Set yellow and red markers
SLB instance utilization	Time series	New connection utilization: Max, Min, Avg	Left	Mark as yellow if > 50 Mark as red if > 80
		Maximum connection utilization: Max, Min, Avg	Left
		Inbound bandwidth utilization: Max, Min, Avg	Left
		Outbound bandwidth utilization: Max, Min, Avg	Left
		Layer 7 QPS utilization: Max, Min, Avg	Left
Layer 7 QPS	Time series	QPS: Sum	Left
Layer 7 response time	Time series	Response time: Max, Min, Avg	Left	Set yellow and red markers
Layer 7 response time	Time series	Backend response time: Max, Min, Avg	Left	Set yellow and red markers
Layer 7 status code statistics	Time series	2xx: Sum	Left
		3xx: Sum	Left
		4xx: Sum	Left	Set yellow and red markers
		5xx: Sum	Left	Set yellow and red markers
		Other: Sum	Left
		Upstream4xx: Sum	Left	Set yellow and red markers
		Upstream5xx: Sum	Left	Set yellow and red markers
Top N by inbound rate	Table	Inbound rate
Top N by outbound rate	Table	Outbound rate
Top N by inbound drop rate	Table	Inbound drop rate > 0		Mark as red
Top N by outbound drop rate	Table	Outbound drop rate > 0		Mark as red
Top N by maximum connections	Table
Top N by new connections	Table
Top N by dropped connections	Table	Dropped connections > 0		Mark as red
Top N by unhealthy servers	Table	Unhealthy servers > 0		Mark as red
High utilization instances	Table	New connection utilization > 30 OR Maximum connection utilization > 30 OR Inbound bandwidth utilization > 30 OR Outbound bandwidth utilization > 30 OR QPS utilization > 30		Mark as yellow if > 50 Mark as red if > 80
Top N by response time	Table	Response time > Average value × 2		Dynamic threshold marking
Top N by QPS	Table	QPS
Top N by 4xx	Table	4xx
Top N by 5xx	Table	5xx
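
Two of the table panels above use filters rather than plain sorting: "High utilization instances" keeps an instance if any utilization metric exceeds 30, and "Top N by response time" uses a dynamic threshold of twice the average. The Python sketch below illustrates both filters on made-up data; the instance IDs, metric keys, and values are assumptions for illustration only.

```python
from statistics import mean
from typing import Dict, List, Tuple


def high_utilization(instances: Dict[str, Dict[str, float]],
                     floor: float = 30.0) -> Dict[str, Dict[str, float]]:
    """Keep instances where at least one utilization metric exceeds the floor."""
    return {
        instance_id: metrics
        for instance_id, metrics in instances.items()
        if any(value > floor for value in metrics.values())
    }


def slow_instances(response_times: Dict[str, float],
                   factor: float = 2.0) -> List[Tuple[str, float]]:
    """Return instances whose response time exceeds factor x the fleet average."""
    threshold = factor * mean(response_times.values())
    slow = [(i, rt) for i, rt in response_times.items() if rt > threshold]
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Hypothetical utilization percentages and response times (ms).
utilization = {
    "lb-a": {"new_connection": 10.0, "qps": 35.0},
    "lb-b": {"new_connection": 5.0, "qps": 8.0},
}
response_times = {"lb-a": 10.0, "lb-b": 12.0, "lb-c": 80.0}

print(high_utilization(utilization))
print(slow_instances(response_times))
```

A dynamic threshold adapts to the fleet's normal latency, but a single extreme outlier also raises the average, so a median-based variant may be worth considering for small fleets.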

NLB dashboard design reference

NLB has over 45 monitoring metrics. We classify them by horizontal and vertical dimensions.

The horizontal dimension can be divided into NLB listener granularity, NLB VIP granularity, and NLB instance granularity. Most statistical metrics have listener-granularity, VIP-granularity, and instance-granularity versions.
- Listener-granularity metric names do not contain "Instance" or "Vip". For example, AliyunNlb_ActiveConnection is the statistic for the number of active connections of a specific NLB listener.
- VIP-granularity metric names contain "Vip". For example, AliyunNlb_VipActiveConnection is the statistic for the number of active connections of a specific NLB VIP.
- Instance-granularity metric names contain "Instance". For example, AliyunNlb_InstanceActiveConnection is the sum of active connection statistics for the entire NLB instance.
The vertical dimension can be divided into four main categories: health check, traffic, connections, and others.

Classification	Listener granularity	Instance granularity	VIP granularity
Health check	AliyunNlb_ListenerHeathyServerCount - Number of healthy servers; AliyunNlb_ListenerUnhealthyServerCount - Number of unhealthy servers	AliyunNlb_NlbInstanceHeathyServerCount - Number of healthy servers for the instance; AliyunNlb_InstanceUnhealthyServerCount - Number of unhealthy servers for the instance	N/A
Traffic	AliyunNlb_TrafficRXNew - Received traffic; AliyunNlb_TrafficTXNew - Sent traffic; AliyunNlb_ListenerPacketRX - Received packets; AliyunNlb_ListenerPacketTX - Sent packets; AliyunNlb_DropTrafficRX - Received dropped traffic; AliyunNlb_DropTrafficTX - Sent dropped traffic; AliyunNlb_DropPacketRX - Received dropped packets; AliyunNlb_DropPacketTX - Sent dropped packets	AliyunNlb_InstanceTrafficRX - Instance received traffic; AliyunNlb_InstanceTrafficTX - Instance sent traffic; AliyunNlb_InstancePacketRX - Instance received packets; AliyunNlb_InstancePacketTX - Instance sent packets; AliyunNlb_InstanceDropTrafficRX - Instance received dropped traffic; AliyunNlb_InstanceDropTrafficTX - Instance sent dropped traffic; AliyunNlb_InstanceDropPacketRX - Instance received dropped packets; AliyunNlb_InstanceDropPacketTX - Instance sent dropped packets	AliyunNlb_VipTrafficRX - VIP received traffic; AliyunNlb_VipTrafficTX - VIP sent traffic; AliyunNlb_VipPacketRX - VIP received packets; AliyunNlb_VipPacketTX - VIP sent packets; AliyunNlb_VipDropTrafficRX - VIP received dropped traffic; AliyunNlb_VipDropTrafficTX - VIP sent dropped traffic; AliyunNlb_VipDropPacketRX - VIP received dropped packets; AliyunNlb_VipDropPacketTX - VIP sent dropped packets
Connections	AliyunNlb_NewConnection - New connections; AliyunNlb_MaxConnection - Maximum connections; AliyunNlb_DropConnection - Dropped connections; AliyunNlb_ActiveConnection - Active connections; AliyunNlb_InactiveConnection - Inactive connections	AliyunNlb_InstanceNewConnection - Instance new connections; AliyunNlb_InstanceMaxConnection - Instance maximum connections; AliyunNlb_InstanceDropConnection - Instance dropped connections; AliyunNlb_InstanceActiveConnection - Instance active connections; AliyunNlb_InstanceInactiveConnection - Instance inactive connections	AliyunNlb_VipNewConnection - VIP new connections; AliyunNlb_VipMaxConnection - VIP maximum connections; AliyunNlb_VipDropConnection - VIP dropped connections; AliyunNlb_VipActiveConnection - VIP active connections; AliyunNlb_VipInactiveConnection - VIP inactive connections
Other	N/A	N/A	AliyunNlb_VipClientResetPacket - Number of VIP client reset packets; AliyunNlb_RealServerResetPacket - Number of VIP server reset packets
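
As a rough illustration of the naming rule above, the following Python sketch derives the granularity of an NLB metric from its name. The function name and the override table are hypothetical helpers, not part of any Alibaba Cloud SDK. The override exists because AliyunNlb_RealServerResetPacket is listed in the VIP column of the table although its name does not contain "Vip".

```python
# Names that the table places in a column the naming rule would not predict.
GRANULARITY_OVERRIDES = {
    "AliyunNlb_RealServerResetPacket": "vip",
}

def nlb_granularity(metric_name):
    """Classify an NLB metric as 'vip', 'instance', or 'listener' by its name."""
    if metric_name in GRANULARITY_OVERRIDES:
        return GRANULARITY_OVERRIDES[metric_name]
    if "Vip" in metric_name:
        return "vip"
    if "Instance" in metric_name:
        return "instance"
    # Listener-granularity names contain neither "Instance" nor "Vip".
    return "listener"
```

A helper like this can be useful when building a dashboard programmatically, for example to pick only instance-granularity metrics for the summary panels.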

NLB dashboard design reference

The NLB dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Because NLB has many metric dimensions, we recommend the following for the Top N section dimensions:

If an NLB instance serves a single business, we recommend displaying the NLB instance in the Top N.
If an NLB instance serves mixed businesses, we recommend displaying the NLB listener in the Top N.
Other dimension information is used for problem analysis and is not displayed on the dashboard.

Panel name	Type	Metric	Axis	Description
Total NLB rate	Time series	Inbound rate: Sum	Left
		Outbound rate: Sum	Left
		Inbound drop rate: Sum	Right	Mark as red if > 100
		Outbound drop rate: Sum	Right	Mark as red if > 100
Total NLB connections	Time series	Active connections: Sum	Left
		Inactive connections: Sum	Left
		New connections: Sum	Left
		Maximum connections: Sum	Left
		Dropped connections: Sum	Right	Set yellow and red markers
Health check	Time series	Healthy servers: Sum	Left	Mark as green
Health check	Time series	Unhealthy servers: Sum	Left	Set yellow and red markers
Reset	Time series	Number of client reset packets	Left	Mark as red if > 100
Reset	Time series	Number of server reset packets	Left	Mark as red if > 100
Top N by inbound rate	Table	Inbound rate
Top N by outbound rate	Table	Outbound rate
Top N by inbound drop rate	Table	Inbound drop rate > 0		Mark as red
Top N by outbound drop rate	Table	Outbound drop rate > 0		Mark as red
Top N by maximum connections	Table
Top N by new connections	Table
Top N by dropped connections	Table	Dropped connections > 0		Mark as red
Top N by unhealthy servers	Table	Unhealthy servers > 0		Mark as red
Top N by client reset packets	Table	Number of client reset packets > 0
Top N by server reset packets	Table	Number of server reset packets > 0
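
The Top N tables above share one pattern: optionally keep only rows whose metric exceeds a floor (for example, dropped connections > 0), then sort in descending order and keep the first N. A minimal sketch, assuming each row is a dictionary keyed by hypothetical field names, where a row represents an NLB instance for single-business instances or an NLB listener for mixed-business instances:

```python
def top_n(rows, metric, n=5, only_above=None):
    """Return the n rows with the largest `metric` value.

    If `only_above` is given, rows whose value is not strictly greater
    than it are removed first (for example, only_above=0 for drop counters).
    """
    if only_above is not None:
        rows = [row for row in rows if row[metric] > only_above]
    return sorted(rows, key=lambda row: row[metric], reverse=True)[:n]

# Hypothetical sample data: one row per NLB instance.
sample = [
    {"id": "nlb-a", "dropped_connections": 0},
    {"id": "nlb-b", "dropped_connections": 7},
    {"id": "nlb-c", "dropped_connections": 3},
    {"id": "nlb-d", "dropped_connections": 12},
]
worst = top_n(sample, "dropped_connections", n=2, only_above=0)
```

Here `worst` contains the rows for nlb-d and nlb-b, in that order, and nlb-a never appears because it has no dropped connections.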

ALB dashboard design reference

ALB has over 112 monitoring metrics. We classify them by horizontal and vertical dimensions.

The horizontal dimension can be divided into ALB listener granularity, ALB VIP granularity, ALB rule granularity, ALB server group granularity, and ALB instance granularity. Most statistical metrics have listener-granularity, VIP-granularity, and instance-granularity versions. A few metrics also have rule-granularity and server group-granularity versions.
- Listener-granularity metric names contain "Listener". For example, AliyunAlb_ListenerQPS is the QPS statistic for a specific ALB listener.
- VIP-granularity metric names contain "Vip". For example, AliyunAlb_VipQPS is the QPS statistic for a specific ALB VIP.
- Rule-granularity metric names contain "Rule". For example, AliyunAlb_RuleQPS is the QPS statistic for a specific ALB rule.
- Server group-granularity metric names contain "ServerGroup". For example, AliyunAlb_ServerGroupQPS is the QPS statistic for a specific ALB server group.
- Instance-granularity metric names contain "LoadBalancer". For example, AliyunAlb_LoadBalancerQPS is the sum of QPS statistics for the entire ALB instance.
The vertical dimension can be divided into six main categories: health check, traffic, connections, response time, status code, and others.

Classification	Listener granularity	Instance granularity	VIP granularity	Rule granularity	Server group granularity
Health check	AliyunAlb_ListenerHealthyHostCount; AliyunAlb_ListenerUnHealthyHostCount	AliyunAlb_LoadBalancerHealthyHostCount; AliyunAlb_LoadBalancerUnHealthyHostCount	N/A	AliyunAlb_RuleHealthyHostCount; AliyunAlb_RuleUnHealthyHostCount	AliyunAlb_ServerGroupHealthyHostCount; AliyunAlb_ServerGroupUnHealthyHostCount
Traffic	AliyunAlb_ListenerInBits; AliyunAlb_ListenerOutBits	AliyunAlb_LoadBalancerInBits; AliyunAlb_LoadBalancerOutBits	AliyunAlb_VipInBits; AliyunAlb_VipOutBits	N/A	N/A
Connections	AliyunAlb_ListenerActiveConnection; AliyunAlb_ListenerInactiveConnection; AliyunAlb_ListenerNewConnection; AliyunAlb_ListenerMaxConnection; AliyunAlb_ListenerRejectedConnection; AliyunAlb_ListenerUpstreamConnectionError	AliyunAlb_LoadBalancerActiveConnection; AliyunAlb_LoadBalancerInactiveConnection; AliyunAlb_LoadBalancerNewConnection; AliyunAlb_LoadBalancerMaxConnection; AliyunAlb_LoadBalancerRejectedConnection; AliyunAlb_LoadBalancerUpstreamConnectionError	AliyunAlb_VipActiveConnection; AliyunAlb_VipInactiveConnection; AliyunAlb_VipNewConnection; AliyunAlb_VipMaxConnection; AliyunAlb_VipRejectedConnection; AliyunAlb_VipUpstreamConnectionError	AliyunAlb_RuleUpstreamConnectionError	AliyunAlb_ServerGroupUpstreamConnectionError
Response time	AliyunAlb_ListenerRequestTime; AliyunAlb_ListenerUpstreamResponseTime	AliyunAlb_LoadBalancerRequestTime; AliyunAlb_LoadBalancerUpstreamResponseTime	AliyunAlb_VipRequestTime; AliyunAlb_VipUpstreamResponseTime	AliyunAlb_RuleRequestTime; AliyunAlb_RuleUpstreamResponseTime	AliyunAlb_ServerGroupRequestTime; AliyunAlb_ServerGroupUpstreamResponseTime
Status code	AliyunAlb_ListenerHTTPCode2XX; AliyunAlb_ListenerHTTPCode3XX; AliyunAlb_ListenerHTTPCode4XX; AliyunAlb_ListenerHTTPCode5XX; AliyunAlb_ListenerHTTPCode500; AliyunAlb_ListenerHTTPCode502; AliyunAlb_ListenerHTTPCode503; AliyunAlb_ListenerHTTPCode504; AliyunAlb_ListenerHTTPCodeUpstream2XX; AliyunAlb_ListenerHTTPCodeUpstream3XX; AliyunAlb_ListenerHTTPCodeUpstream4XX; AliyunAlb_ListenerHTTPCodeUpstream5XX	AliyunAlb_LoadBalancerHTTPCode2XX; AliyunAlb_LoadBalancerHTTPCode3XX; AliyunAlb_LoadBalancerHTTPCode4XX; AliyunAlb_LoadBalancerHTTPCode5XX; AliyunAlb_LoadBalancerHTTPCode500; AliyunAlb_LoadBalancerHTTPCode502; AliyunAlb_LoadBalancerHTTPCode503; AliyunAlb_LoadBalancerHTTPCode504; AliyunAlb_LoadBalancerHTTPCodeUpstream2XX; AliyunAlb_LoadBalancerHTTPCodeUpstream3XX; AliyunAlb_LoadBalancerHTTPCodeUpstream4XX; AliyunAlb_LoadBalancerHTTPCodeUpstream5XX	AliyunAlb_VipHTTPCode2XX; AliyunAlb_VipHTTPCode3XX; AliyunAlb_VipHTTPCode4XX; AliyunAlb_VipHTTPCode5XX; AliyunAlb_VipHTTPCode500; AliyunAlb_VipHTTPCode502; AliyunAlb_VipHTTPCode503; AliyunAlb_VipHTTPCode504	AliyunAlb_RuleHTTPCodeUpstream2XX; AliyunAlb_RuleHTTPCodeUpstream3XX; AliyunAlb_RuleHTTPCodeUpstream4XX; AliyunAlb_RuleHTTPCodeUpstream5XX; AliyunAlb_RuleHTTPCodeUpstream2XXRatio; AliyunAlb_RuleHTTPCodeUpstream3XXRatio; AliyunAlb_RuleHTTPCodeUpstream4XXRatio; AliyunAlb_RuleHTTPCodeUpstream5XXRatio	AliyunAlb_ServerGroupHTTPCodeUpstream2XX; AliyunAlb_ServerGroupHTTPCodeUpstream3XX; AliyunAlb_ServerGroupHTTPCodeUpstream4XX; AliyunAlb_ServerGroupHTTPCodeUpstream5XX
Other	AliyunAlb_ListenerQPS; AliyunAlb_ListenerNonStickyRequest; AliyunAlb_ListenerUpstreamTLSNegotiationError; AliyunAlb_ListenerClientTLSNegotiationError; AliyunAlb_ListenerHTTPFixedResponse; AliyunAlb_ListenerHTTPRedirect	AliyunAlb_LoadBalancerQPS; AliyunAlb_LoadBalancerNonStickyRequest; AliyunAlb_LoadBalancerUpstreamTLSNegotiationError; AliyunAlb_LoadBalancerClientTLSNegotiationError; AliyunAlb_LoadBalancerHTTPFixedResponse; AliyunAlb_LoadBalancerHTTPRedirect	AliyunAlb_VipQPS; AliyunAlb_VipNonStickyRequest; AliyunAlb_VipUpstreamTLSNegotiationError; AliyunAlb_VipClientTLSNegotiationError; AliyunAlb_VipHTTPFixedResponse; AliyunAlb_VipHTTPRedirect	AliyunAlb_RuleQPS; AliyunAlb_RuleNonStickyRequest; AliyunAlb_RuleUpstreamTLSNegotiationError	AliyunAlb_ServerGroupQPS; AliyunAlb_ServerGroupNonStickyRequest; AliyunAlb_ServerGroupUpstreamTLSNegotiationError
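
Because every ALB metric name listed above starts with one of five granularity tokens, a name can be split into its granularity and its base metric. The sketch below is an illustration only: the helper names are hypothetical, and it assumes that all names follow the AliyunAlb_<Token><Base> pattern seen in this list.

```python
import re

# Granularity tokens from the naming rule, mapped to readable labels.
_LABELS = {
    "Listener": "listener",
    "Vip": "vip",
    "Rule": "rule",
    "ServerGroup": "server_group",
    "LoadBalancer": "instance",
}
_ALB_METRIC = re.compile(r"^AliyunAlb_(Listener|Vip|Rule|ServerGroup|LoadBalancer)(\w+)$")

def split_alb_metric(name):
    """Return (granularity, base_metric) for an ALB metric name."""
    match = _ALB_METRIC.match(name)
    if match is None:
        raise ValueError(f"not a recognized ALB metric name: {name}")
    return _LABELS[match.group(1)], match.group(2)

def granularities_by_base(names):
    """Map each base metric to the granularities at which it is available."""
    grouped = {}
    for name in names:
        granularity, base = split_alb_metric(name)
        grouped.setdefault(base, []).append(granularity)
    return grouped
```

Grouping by base metric makes it easy to see, for example, that QPS is available at all five granularities while InBits exists only for listeners, instances, and VIPs.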

ALB dashboard design reference

The ALB dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Because ALB has many metric dimensions, we recommend the following for the Top N section dimensions:

If an ALB instance serves a single business, we recommend displaying the ALB instance in the Top N.
If an ALB instance serves mixed businesses, we recommend displaying the ALB listener in the Top N.
Other dimension information is used for problem analysis and is not displayed on the dashboard.

Panel name	Type	Metric	Axis	Description
Total ALB rate	Time series	Inbound rate: Sum	Left
Total ALB rate	Time series	Outbound rate: Sum	Left
Total ALB connections	Time series	Active connections: Sum	Left
		Inactive connections: Sum	Left
		New connections: Sum	Left
		Maximum connections: Sum	Left
		Rejected connections: Sum	Right	Set yellow and red markers
		Upstream connection errors: Sum	Right	Set yellow and red markers
Health check	Time series	Healthy servers: Sum	Left	Mark as green
Health check	Time series	Unhealthy servers: Sum	Left	Set yellow and red markers
TLS errors	Time series	TLS negotiation errors: Sum	Left	Set yellow and red markers
TLS errors	Time series	Upstream TLS negotiation errors: Sum	Left	Set yellow and red markers
Layer 7 QPS	Time series	QPS: Sum	Left
Layer 7 response time	Time series	Response time: Max, Min, Avg	Left	Set yellow and red markers
Layer 7 response time	Time series	Backend response time: Max, Min, Avg	Left	Set yellow and red markers
Layer 7 status code statistics	Time series	2xx: Sum	Left
		3xx: Sum	Left
		4xx: Sum	Left	Set yellow and red markers
		5xx: Sum	Left	Set yellow and red markers
		Upstream4xx: Sum	Left	Set yellow and red markers
		Upstream5xx: Sum	Left	Set yellow and red markers
Top N by inbound rate	Table	Inbound rate
Top N by outbound rate	Table	Outbound rate
Top N by maximum connections	Table
Top N by new connections	Table
Top N by rejected connections	Table	Rejected connections > 0		Mark as red
Top N by unhealthy servers	Table	Unhealthy servers > 0		Mark as red
Top N by TLS negotiation errors	Table	TLS negotiation errors > 0		Mark as red
Top N by upstream TLS negotiation errors	Table	Upstream TLS negotiation errors > 0		Mark as red
Top N by response time	Table	Response time > Average value × 2		Mark as yellow or red
Top N by QPS	Table	QPS
Top N by 4xx	Table	4xx
Top N by 5xx	Table	5xx

GA dashboard design reference

The GA dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name	Type	Metric	Axis	Description
Total frontend IP rate	Time series	Inbound rate: Sum	Left
		Outbound rate: Sum	Left
		Inbound drop rate: Sum	Right	Mark as red if > 100
		Outbound drop rate: Sum	Right	Mark as red if > 100
Frontend IP bandwidth utilization	Time series	Inbound bandwidth utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
Frontend IP bandwidth utilization	Time series	Outbound bandwidth utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
Total frontend IP active connections	Time series	Active connections: Sum	Left
Total backend group rate	Time series	Inbound rate: Sum	Left
		Outbound rate: Sum	Left
		Inbound drop rate: Sum	Right	Mark as red if > 100
		Outbound drop rate: Sum	Right	Mark as red if > 100
Backend group bandwidth utilization	Time series	Inbound bandwidth utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
Backend group bandwidth utilization	Time series	Outbound bandwidth utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
Tunnel latency	Time series	Tunnel latency: Max, Min, Avg		Dynamic threshold marking
Top N by frontend inbound rate	Table	Inbound rate
Top N by frontend outbound rate	Table	Outbound rate
Top N by frontend inbound bandwidth utilization	Table	Inbound bandwidth utilization > 30		Mark as red or yellow
Top N by frontend outbound bandwidth utilization	Table	Outbound bandwidth utilization > 30		Mark as red or yellow
Top N by active connections	Table	Active connections
Top N by backend group inbound bandwidth utilization	Table	Backend group inbound bandwidth utilization
Top N by backend group outbound bandwidth utilization	Table	Backend group outbound bandwidth utilization
Top N by tunnel latency	Table	Tunnel latency		Dynamic threshold

NAT dashboard design reference

The NAT dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name	Type	Metric	Axis	Description
Total NAT connections	Time series	Active connections: Sum	Left
		New connections: Sum	Left
		Dropped active connections: Sum	Right	Mark as yellow if > 0; mark as red if > 100
		Dropped new connections: Sum	Right	Mark as yellow if > 0; mark as red if > 100
NAT connection utilization	Time series	Active connection utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
NAT connection utilization	Time series	New connection utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
Total rate	Time series	Public network inbound rate: Sum	Left	Mark as red if inbound-outbound rate difference > threshold
		Public network outbound rate: Sum	Left
		Private network inbound rate: Sum	Left
		Private network outbound rate: Sum	Left
Top N by active connections	Table	Active connections
Top N by new connections	Table	New connections
Top N by dropped active connections	Table	Dropped active connections > 0		Mark as yellow if > 0; mark as red if > 100
Top N by dropped new connections	Table	Dropped new connections > 0		Mark as yellow if > 0; mark as red if > 100
Top N by active connection utilization	Table	Active connection utilization > 30		Mark as yellow if > 50; mark as red if > 80
Top N by new connection utilization	Table	New connection utilization > 30		Mark as yellow if > 50; mark as red if > 80
Top N by inbound rate	Table	Public network inbound rate
Top N by outbound rate	Table	Public network outbound rate

6.1.3 Global networking business dashboard

Express Connect - Physical port design reference

The physical port dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name	Type	Metric	Axis	Description
Total rate	Time series	Inbound rate to cloud: Sum	Left
Total rate	Time series	Outbound rate from cloud: Sum	Left
Port error packets	Time series	Port inbound error packets: Sum	Left	Mark as yellow or red
Port error packets	Time series	Port outbound error packets: Sum	Left	Mark as yellow or red
Number of disconnected leased lines	Time series	Port down: Count	Left	Mark as red
Top N by inbound rate to cloud	Table	Inbound rate to cloud
Top N by outbound rate from cloud	Table	Outbound rate from cloud
Top N by port inbound error packets	Table	Port inbound error packets > 0		Mark as red
Top N by port outbound error packets	Table	Port outbound error packets > 0		Mark as red
Disconnected leased line instances	Table	Port down == 1		Mark as red

Express Connect - VBR dashboard design reference

The VBR dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name	Type	Metric	Axis	Description
Total rate	Time series	Inbound rate to cloud: Sum	Left
		Outbound rate from cloud: Sum	Left
		Throttling packet loss: Sum	Right	Mark as red if > 100
Packet loss	Time series	Port inbound packet loss: Sum	Left	Mark as yellow or red
Packet loss	Time series	Port outbound packet loss: Sum	Left	Mark as yellow or red
Probe packet loss	Time series	Probe packet loss: Max, Min, Avg	Left	Mark as yellow if > 0; mark as red if > 10
Probe latency	Time series	Probe latency: Max, Min, Avg	Left	Dynamic threshold
Top N by inbound rate to cloud	Table	Inbound rate to cloud
Top N by outbound rate from cloud	Table	Outbound rate from cloud
Top N by throttling packet loss	Table	Throttling packet loss > 0		Mark as red
Top N by port inbound packet loss	Table	Port inbound packet loss > 0		Mark as red
Top N by port outbound packet loss	Table	Port outbound packet loss > 0		Mark as red
Top N by probe packet loss	Table	Probe packet loss > 0		Mark as yellow if > 0; mark as red if > 10
Top N by probe latency	Table	Probe latency

ECR dashboard design reference

The ECR dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name	Type	Metric	Axis	Description
Total rate	Time series	Inbound rate: Sum	Left
Total rate	Time series	Outbound rate: Sum	Left
Total cross-region throttling packet loss rate	Time series	Throttling packet loss bit rate: Sum	Left	Mark as yellow or red
Total cross-region throttling packet loss rate	Time series	Throttling packet loss packet rate: Sum	Right	Mark as yellow or red
Top N by inbound rate	Table	Inbound rate
Top N by outbound rate	Table	Outbound rate
Top N by cross-region rate	Table	Cross-region rate
Top N by cross-region throttling	Table	Throttling packet loss > 0		Mark as red

VPN dashboard design reference

The VPN dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name	Type	Metric	Axis	Description
Total rate	Time series	VPN Gateway inbound rate to cloud: Sum	Left
		IPsec-VPN connection inbound rate to cloud: Sum	Left
		VPN Gateway outbound rate from cloud: Sum	Right
		IPsec-VPN connection outbound rate from cloud: Sum	Right
VPN Gateway utilization	Time series	Inbound bandwidth utilization to cloud: Max, Min, Avg	Left	Mark as yellow or red
VPN Gateway utilization	Time series	Outbound bandwidth utilization from cloud: Max, Min, Avg	Left	Mark as yellow or red
Number of online SSL clients	Time series	Number of SSL clients: Sum	Left
Top N by inbound bandwidth utilization to cloud	Table	Inbound bandwidth utilization to cloud > 30		Mark as yellow if > 50; mark as red if > 80
Top N by outbound bandwidth utilization from cloud	Table	Outbound bandwidth utilization from cloud > 30		Mark as yellow if > 50; mark as red if > 80
Top N by VPN Gateway inbound rate to cloud	Table	VPN Gateway inbound rate to cloud
Top N by VPN Gateway outbound rate from cloud	Table	VPN Gateway outbound rate from cloud
Top N by IPsec-VPN connection inbound rate to cloud	Table	IPsec-VPN connection inbound rate to cloud
Top N by IPsec-VPN connection outbound rate from cloud	Table	IPsec-VPN connection outbound rate from cloud

TR dashboard design reference

The TR cross-region dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name	Type	Metric	Axis	Description
TR traffic	Time series	Inbound rate: Sum	Left	Mark as red if inbound-outbound rate difference > threshold
		Outbound rate: Sum	Left	Mark as red if inbound-outbound rate difference > threshold
		Blackhole drop rate: Sum	Right
		No-route drop rate: Sum	Right
Attachment connection traffic	Time series	Inbound rate: Sum	Left	Mark as red if inbound-outbound rate difference > threshold
		Outbound rate: Sum	Left	Mark as red if inbound-outbound rate difference > threshold
		Blackhole drop rate: Sum	Left
Top N by TR inbound traffic	Table	TR inbound rate
Top N by TR outbound traffic	Table	TR outbound rate
Top N by TR blackhole drops	Table	TR blackhole drop rate
Top N by TR no-route drops	Table	TR no-route drop rate
Top N by attachment connection inbound traffic	Table	Attachment connection inbound traffic
Top N by attachment connection outbound traffic	Table	Attachment connection outbound traffic
Top N by attachment connection drops	Table	Attachment connection blackhole drop rate

CEN cross-region design reference

The CEN cross-region dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.

Panel name	Type	Metric	Axis	Description
CEN traffic	Time series	Region outbound rate: Sum	Left	Mark as red if outbound rate difference > threshold
		Area outbound rate: Sum	Left	Mark as red if outbound rate difference > threshold
		Bandwidth plan average outbound rate: Sum	Left	Microburst tip: Mark as yellow if Peak/Avg > 3; mark as red if Peak/Avg > 10
		Bandwidth plan peak outbound rate: Sum	Left
		Region throttling packet loss rate: Sum	Right	Mark as red if > 100 kbps
CEN utilization	Time series	Region utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
		Area utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
		Bandwidth plan average utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
		Bandwidth plan peak utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
CEN QoS traffic	Time series	QoS outbound rate: Sum	Left
CEN QoS traffic	Time series	QoS throttling packet loss rate: Sum	Right	Mark as red if > 100 kbps
CEN QoS utilization	Time series	QoS average utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
CEN QoS utilization	Time series	QoS peak utilization: Max, Min, Avg	Left	Mark as yellow if > 50; mark as red if > 80
Top N by region outbound rate	Table	Region outbound rate
Top N by region utilization	Table	Region utilization
Top N by region throttling packet loss rate	Table	Region throttling packet loss rate
Top N by QoS outbound rate	Table	QoS outbound rate
Top N by QoS peak utilization	Table	QoS peak utilization
Top N by QoS throttling packet loss rate	Table	QoS throttling packet loss rate
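
The microburst tip in the CEN traffic panel compares the bandwidth plan peak outbound rate with the average outbound rate. A minimal sketch of that rule follows. The function name and the handling of a zero average are assumptions. Both rates must use the same unit.

```python
def microburst_level(peak_rate, avg_rate):
    """Return 'red', 'yellow', or 'none' based on the Peak/Avg ratio."""
    if avg_rate <= 0:
        return "none"   # assumption: no baseline traffic, so the ratio is undefined
    ratio = peak_rate / avg_rate
    if ratio > 10:
        return "red"
    if ratio > 3:
        return "yellow"
    return "none"
```

For example, a peak of 900 Mbit/s against an average of 100 Mbit/s gives a ratio of 9 and is marked yellow, which suggests short bursts that an average-based utilization chart would hide.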

6.2 Monitoring configuration reference

6.2.1 Public network service monitoring configuration reference

If you use EIPs with self-managed gateways as public network service endpoints, refer to the following suggestions to configure alert rules in CloudMonitor for these EIPs:

Monitored object	Alert level	Monitoring metrics and conditions
EIP	Info	When one of the following conditions occurs: Inbound bandwidth utilization > 30%; Outbound bandwidth utilization > 30%
	Warn	When one of the following conditions occurs: Inbound bandwidth utilization > 50%; Outbound bandwidth utilization > 50%
	Critical	When one of the following conditions occurs: Inbound bandwidth utilization > 85%; Outbound bandwidth utilization > 85%
Internet Shared Bandwidth	Info	When one of the following conditions occurs: Inbound bandwidth utilization > 30%; Outbound bandwidth utilization > 30%
	Warn	When one of the following conditions occurs: Inbound bandwidth utilization > 50%; Outbound bandwidth utilization > 50%
	Critical	When one of the following conditions occurs: Inbound bandwidth utilization > 85%; Outbound bandwidth utilization > 85%
CDT	Info	When one of the following conditions occurs: Inbound bandwidth utilization > 30%; Outbound bandwidth utilization > 30%
	Warn	When one of the following conditions occurs: Inbound bandwidth utilization > 50%; Outbound bandwidth utilization > 50%
	Critical	When one of the following conditions occurs: Inbound bandwidth utilization > 85%; Outbound bandwidth utilization > 85%; Inbound throttling packet loss rate > 10; Outbound throttling packet loss rate > 10

When the bandwidth load exceeds 30%, the system enters a high-load state. The business may experience slow access, occasional timeouts, and other SLA degradation. We recommend performing a capacity assessment and considering a scale-out.

When the bandwidth load exceeds 50%, in addition to the issues at the previous level, the multi-AZ disaster recovery architecture may fail. If a service interruption occurs in one AZ, the remaining AZs cannot handle the entire business load. We recommend an immediate scale-out.

When the bandwidth load exceeds 85%, in addition to the issues at the previous level, the system load seriously exceeds the designed capacity. Besides an immediate scale-out, you should also consider whether there are unexpected events such as business growth exceeding expectations or security attacks, and optimize the system design.
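
The three bandwidth thresholds above map directly to alert levels, and the 50% threshold follows from simple arithmetic when the load is spread across two AZs. The sketch below illustrates both points. The function names are hypothetical, and the failover check assumes that traffic is spread evenly across AZs of equal capacity, which the text above does not state explicitly.

```python
def bandwidth_alert_level(inbound_util, outbound_util):
    """Map inbound/outbound bandwidth utilization (percent) to an alert level."""
    worst = max(inbound_util, outbound_util)   # either direction can trigger the alert
    if worst > 85:
        return "Critical"
    if worst > 50:
        return "Warn"
    if worst > 30:
        return "Info"
    return None

def survives_one_az_loss(utilization_pct, az_count):
    """True if the remaining AZs could carry the full load after one AZ fails."""
    if az_count < 2:
        return False
    # The surviving AZs provide (az_count - 1) / az_count of the total capacity.
    return utilization_pct <= 100 * (az_count - 1) / az_count
```

With two AZs, the surviving AZ provides half of the total capacity, so any load above 50% can no longer be absorbed after a failure. With three AZs under the same assumptions, the limit would be about 66.7%.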

6.2.2 Network element service monitoring configuration reference

CLB/NLB/ALB

For CLB/NLB/ALB instances that provide public network service endpoints, in addition to configuring monitoring for the public network endpoint as described in the previous section, refer to the following suggestions to configure alert rules in CloudMonitor for CLB/NLB/ALB:

Monitored object	Alert level	Monitoring metrics and conditions
CLB	Info	When one of the following conditions occurs at the instance dimension: Layer 7 instance QPS utilization > 30%; Instance new connection utilization > 30%; Instance maximum connection utilization > 30%; Instance network inbound bandwidth utilization > 30%; Instance network outbound bandwidth utilization > 30%; Dropped connections per second for the instance > X (adjust X based on actual conditions; reference setting: 2 times the peak value); Number of UpstreamCode5xx per second for the Layer 7 instance > X (adjust X based on actual conditions; reference setting: 0.1% of the peak QPS); Layer 7 listener RT > X (adjust X based on actual conditions; reference setting: 2 times the peak value). When one of the following conditions occurs at the port dimension: Number of healthy backend ECS instances for health check < X (adjust X based on actual conditions; recommended setting: 0.3 × the minimum number of backend ECS instances required to handle peak traffic); Number of healthy backend ECS instances for Layer 7 forwarding rule health check < X (adjust X based on actual conditions; recommended setting: 0.3 × the minimum number of backend ECS instances required to handle peak traffic).
	Warn	When one of the following conditions occurs at the instance dimension: Layer 7 instance QPS utilization > 50%; Instance new connection utilization > 50%; Instance maximum connection utilization > 50%; Instance network inbound bandwidth utilization > 50%; Instance network outbound bandwidth utilization > 50%; Dropped connections per second for the instance > X (adjust X based on actual conditions; reference setting: 5 times the peak value, or use the CloudMonitor intelligent threshold); Number of UpstreamCode5xx per second for the Layer 7 instance > X (adjust X based on actual conditions; reference setting: 0.5% of the peak QPS, or use the CloudMonitor intelligent threshold); Layer 7 listener RT > X (adjust X based on actual conditions; reference setting: 5 times the peak value, or use the CloudMonitor intelligent threshold). When one of the following conditions occurs at the port dimension: Number of unhealthy backend ECS instances for health check > X (adjust X based on actual conditions; recommended setting: the maximum number of concurrent deployment instances for the backend application); Number of unhealthy backend ECS instances for Layer 7 forwarding rule health check > X (adjust X based on actual conditions; recommended setting: the maximum single deployment batch for the backend application); Number of healthy backend ECS instances for health check < X (adjust X based on actual conditions; recommended setting: 0.5 × the minimum number of backend ECS instances required to handle peak traffic); Number of healthy backend ECS instances for Layer 7 forwarding rule health check < X (adjust X based on actual conditions; recommended setting: 0.5 × the minimum number of backend ECS instances required to handle peak traffic).
	Critical	When one of the following conditions occurs at the instance dimension: Layer 7 instance QPS utilization > 85%; Instance new connection utilization > 85%; Instance maximum connection utilization > 85%; Instance network inbound bandwidth utilization > 85%; Instance network outbound bandwidth utilization > 85%; Dropped connections per second for the instance > X (adjust X based on actual conditions; reference setting: 10 times the peak value); Number of UpstreamCode5xx per second for the Layer 7 instance > X (adjust X based on actual conditions; reference setting: 1% of the peak QPS); Layer 7 listener RT > X (adjust X based on actual conditions; reference setting: 10 times the peak value). When one of the following conditions occurs at the port dimension: Number of unhealthy backend ECS instances for health check > X (adjust X based on actual conditions; recommended setting: 2 × the maximum number of concurrent deployment instances for the backend application); Number of unhealthy backend ECS instances for Layer 7 forwarding rule health check > X (adjust X based on actual conditions; recommended setting: 2 × the maximum single deployment batch for the backend application); Number of healthy backend ECS instances for health check < X (adjust X based on actual conditions; recommended setting: 0.85 × the minimum number of backend ECS instances required to handle peak traffic); Number of healthy backend ECS instances for Layer 7 forwarding rule health check < X (adjust X based on actual conditions; recommended setting: 0.85 × the minimum number of backend ECS instances required to handle peak traffic).
NLB	Info	When one of the following conditions occurs at the instance dimension: New connections per second for the instance > X (adjust X based on actual conditions; reference setting: 2 times the peak value); Maximum concurrent connections per second for the instance > X (adjust X based on actual conditions; reference setting: 2 times the peak value); Inbound bits per second for the instance > X (adjust X based on actual conditions; reference setting: 2 times the peak value); Outbound bits per second for the instance > X (adjust X based on actual conditions; reference setting: 2 times the peak value); Dropped connections per second for the instance > X (adjust X based on actual conditions; reference setting: 2 times the peak value). When one of the following conditions occurs at the port dimension: Number of healthy backend ECS instances for listener health check < X (adjust X based on actual conditions; recommended setting: 0.3 × the minimum number of backend ECS instances required to handle peak traffic).
	Warn	When one of the following conditions occurs at the instance dimension: New connections per second for the instance > X (adjust X based on actual conditions; reference setting: 5 times the peak value); Maximum concurrent connections per second for the instance > X (adjust X based on actual conditions; reference setting: 5 times the peak value); Inbound bits per second for the instance > X (adjust X based on actual conditions; reference setting: 5 times the peak value); Outbound bits per second for the instance > X (adjust X based on actual conditions; reference setting: 5 times the peak value); Dropped connections per second for the instance > X (adjust X based on actual conditions; reference setting: 5 times the peak value). When one of the following conditions occurs at the port dimension: Number of unhealthy backend ECS instances for listener health check > X (adjust X based on actual conditions; recommended setting: the maximum number of concurrent deployment instances for the backend application); Number of healthy backend ECS instances for listener health check < X (adjust X based on actual conditions; recommended setting: 0.5 × the minimum number of backend ECS instances required to handle peak traffic).
	Critical	When one of the following conditions occurs at the instance dimension: New connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. Maximum concurrent connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. Inbound bits per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. Outbound bits per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. Dropped connections per second for the instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. When one of the following conditions occurs at the port dimension: Number of unhealthy backend ECS instances for listener health check > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 2 × the maximum number of concurrent deployment instances for the backend application. Number of healthy backend ECS instances for listener health check < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.85 × the minimum number of backend ECS instances required to handle peak traffic.
ALB	Info	When one of the following conditions occurs at the loadBalancer dimension: New connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value. Maximum concurrent connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value. Inbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value. Outbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value. Dropped connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value. TLS handshake failures per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value. Number of 5XX errors per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 0.1% of the peak QPS. Request latency for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value. When one of the following conditions occurs at the listener dimension: Number of healthy servers for the listener < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.3 × the minimum number of backend ECS instances required to handle peak traffic.
	Warn	When one of the following conditions occurs at the loadBalancer dimension: New connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value. Maximum concurrent connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value. Inbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value. Outbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value. Dropped connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value. TLS handshake failures per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value. Number of 5XX errors per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 0.5% of the peak QPS. Request latency for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value. When one of the following conditions occurs at the listener dimension: Number of unhealthy servers for the listener > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the maximum number of concurrent deployment instances for the backend application. Number of healthy servers for the listener < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.5 × the minimum number of backend ECS instances required to handle peak traffic.
	Critical	When one of the following conditions occurs at the loadBalancer dimension: New connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. Maximum concurrent connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. Inbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. Outbound bandwidth for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. Dropped connections per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. TLS handshake failures per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. Number of 5XX errors per second for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 1% of the peak QPS. Request latency for the SLB instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. When one of the following conditions occurs at the listener dimension: Number of unhealthy servers for the listener > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 2 × the maximum number of concurrent deployment instances for the backend application. Number of healthy servers for the listener < X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to 0.85 × the minimum number of backend ECS instances required to handle peak traffic.

Application-layer metrics are numerous and closely tied to the business. Continuously tune the monitored metrics and the thresholds for each alert level based on actual business feedback.
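
The reference settings in the load balancer tables follow a regular pattern: peak-based metrics use 2, 5, and 10 times the peak value for Info, Warn, and Critical; 5XX errors use 0.1%, 0.5%, and 1% of the peak QPS (as in the ALB rows); and healthy backend counts use 0.3, 0.5, and 0.85 times the minimum number of backends needed at peak. The following Python sketch derives starting thresholds from those ratios. The function name, parameter names, and sample inputs are hypothetical, and the results are only starting points to adjust against real traffic.

```python
# Reference ratios copied from the load balancer alert tables above.
PEAK_MULTIPLIER = {"Info": 2, "Warn": 5, "Critical": 10}
ERROR_5XX_PER_MILLE_OF_PEAK_QPS = {"Info": 1, "Warn": 5, "Critical": 10}
HEALTHY_BACKEND_PERCENT = {"Info": 30, "Warn": 50, "Critical": 85}


def reference_thresholds(peak_value, peak_qps, min_backends_at_peak):
    """Return starting thresholds per alert level (hypothetical helper)."""
    thresholds = {}
    for level in ("Info", "Warn", "Critical"):
        thresholds[level] = {
            # Alert when a peak-based metric (connections, bandwidth, RT) exceeds this.
            "metric_above": peak_value * PEAK_MULTIPLIER[level],
            # Alert when 5XX errors per second exceed this share of the peak QPS.
            "errors_5xx_above": peak_qps * ERROR_5XX_PER_MILLE_OF_PEAK_QPS[level] / 1000,
            # Alert when the healthy backend count drops below this.
            "healthy_backends_below": min_backends_at_peak * HEALTHY_BACKEND_PERCENT[level] / 100,
        }
    return thresholds


if __name__ == "__main__":
    # Sample inputs: 1,000 new connections/s at peak, 20,000 peak QPS, 20 backends.
    for level, values in reference_thresholds(1000, 20000, 20).items():
        print(level, values)
```

Keeping the ratios in one place makes it easier to re-derive every threshold after a traffic peak changes.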

6.2.3 Hybrid Disaster Recovery monitoring configuration reference

Leased line connection

If you use a leased line to connect to Alibaba Cloud, refer to the following suggestions to configure alert rules in CloudMonitor for the leased line:

Monitored object	Alert level	Monitoring metrics and conditions
Express Connect - Physical port	Info	When one of the following conditions occurs: Port outbound bandwidth utilization > 30% Port inbound bandwidth utilization > 30%
	Warn	When one of the following conditions occurs: Port outbound bandwidth utilization > 50% Port inbound bandwidth utilization > 50%
	Critical	When one of the following conditions occurs: Physical status = DOWN Port outbound bandwidth utilization > 85% Port inbound bandwidth utilization > 85% Port inbound error packets > X. We recommend setting it to (average inbound rate from IDC to VPC / 512 / 8) × 0.005 × 60 (reaching 0.5% per minute). Port outbound error packets > X. We recommend setting it to (average outbound rate from VPC to IDC / 512 / 8) × 0.005 × 60 (reaching 0.5% per minute). Port outbound dropped packets > X. We recommend setting it to (port specification / 512 / 8) × 0.02 × 60 (reaching 2% per minute). Port inbound dropped packets > X. We recommend setting it to (port specification / 512 / 8) × 0.02 × 60 (reaching 2% per minute).
Express Connect - Virtual Border Router	Info	When one of the following conditions occurs: Inbound rate from IDC to VPC > X. We recommend setting it to VBR specification bps × 0.30. Outbound rate from VPC to IDC > X. We recommend setting it to VBR specification bps × 0.30.
	Warn	When one of the following conditions occurs: Inbound rate from IDC to VPC > X. We recommend setting it to VBR specification bps × 0.50. Outbound rate from VPC to IDC > X. We recommend setting it to VBR specification bps × 0.50.
	Critical	When one of the following conditions occurs: Inbound rate from IDC to VPC > X. We recommend setting it to VBR specification bps × 0.85. Outbound rate from VPC to IDC > X. We recommend setting it to VBR specification bps × 0.85. Dropped inbound packets from IDC to VPC > X. We recommend setting it to (VBR specification / 512 / 8) × 0.02 × 60 (reaching 2% per minute). Dropped outbound packets from VPC to IDC > X. We recommend setting it to (VBR specification / 512 / 8) × 0.02 × 60 (reaching 2% per minute). Throttled dropped packets from VPC to VBR > X. We recommend setting it to (VBR specification / 512 / 8) × 0.02 × 60 (reaching 2% per minute). VBR health check latency > X or VBR health check latency == 0. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value (when the leased line is disconnected, VBR health check latency outputs 0). VBR health check packet loss rate > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 1% (if this metric is high, focus on checking the CoPP throttling configuration of the monitored object vSwitch).
Express Connect - Express Connect Router	Info	When one of the following conditions occurs at the Transit Router (TR) instance monitoring dimension: Outbound rate from ECR to TR > ECR Attachment specification × 0.3 When one of the following conditions occurs at the cross-region connection dimension: Rate of cross-region access for the ECR instance > Cross-region connection bandwidth specification × 0.3
	Warn	When one of the following conditions occurs at the Transit Router (TR) instance monitoring dimension: Outbound rate from ECR to TR > ECR Attachment specification × 0.5 When one of the following conditions occurs at the cross-region connection dimension: Rate of cross-region access for the ECR instance > Cross-region connection bandwidth specification × 0.5
	Critical	When one of the following conditions occurs at the Transit Router (TR) instance monitoring dimension: Outbound rate from ECR to TR > ECR Attachment specification × 0.85 When one of the following conditions occurs at the cross-region connection dimension: Rate of cross-region access for the ECR instance > Cross-region connection bandwidth specification × 0.85 Packet loss rate of cross-region access for the ECR instance > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 1%.
Express Connect - Peering connection	Info	When one of the following conditions occurs at the instance dimension: Inbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.3. Outbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.3.
	Warn	When one of the following conditions occurs at the instance dimension: Inbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.5. Outbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.5.
	Critical	When one of the following conditions occurs at the instance dimension: Inbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.85. Outbound bandwidth > X. We recommend setting it to the peering connection bandwidth specification × 0.85. Network throttling packet loss rate > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 100.
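
The packet thresholds in the Express Connect tables all use the same arithmetic: convert a rate in bit/s to packets per second, then take a percentage of one minute of traffic. The Python sketch below works through it, assuming that the division by 8 converts bits to bytes and that the division by 512 reflects an average packet size of 512 bytes. This reading of the formula is an interpretation and is not stated in the tables. The function name and the sample rates are hypothetical.

```python
def packets_per_minute_threshold(rate_bps, ratio, avg_packet_bytes=512):
    """Compute (rate / 512 / 8) x ratio x 60, as in the Express Connect tables."""
    # Assumed meaning: bit/s -> bytes/s (divide by 8) -> packets/s (divide by packet size).
    packets_per_second = rate_bps / avg_packet_bytes / 8
    # Share of the packets sent in one minute that triggers the alert.
    return packets_per_second * ratio * 60


if __name__ == "__main__":
    port_spec_bps = 10_000_000_000  # sample 10 Gbit/s port specification
    avg_rate_bps = 2_000_000_000    # sample average rate in one direction

    # Dropped packets: 2% of one minute of traffic at the port specification.
    print(round(packets_per_minute_threshold(port_spec_bps, 0.02)))
    # Error packets: 0.5% of one minute of traffic at the average rate.
    print(round(packets_per_minute_threshold(avg_rate_bps, 0.005)))
```

If your traffic is dominated by much smaller or larger packets, adjusting `avg_packet_bytes` changes the threshold proportionally.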

Subscribe to the following CloudMonitor system events and push alerts:

Product: Express Connect - Leased line connection. Event type: Down. Event name: BGP Peer status changed from Established to Down.

VPN Gateway

If you use a VPN Gateway to access Alibaba Cloud, refer to the following suggestions to configure alert rules in CloudMonitor for the VPN:

Monitored object	Alert level	Monitoring metrics and conditions
VPN Gateway	Info	When one of the following conditions occurs at the instance dimension: VPN gateway inbound bandwidth utilization > 30% VPN gateway outbound bandwidth utilization > 30%
	Warn	When one of the following conditions occurs at the instance dimension: VPN gateway inbound bandwidth utilization > 50% VPN gateway outbound bandwidth utilization > 50%
	Critical	When one of the following conditions occurs at the instance dimension: VPN gateway inbound bandwidth utilization > 85% VPN gateway outbound bandwidth utilization > 85% Negotiation status of a single tunnel in the IPsec connection of the VPN gateway = 0 (0 is Down, 1 is Up)
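
As a rough illustration of how the three tiers in this table combine, the following Python sketch maps one sample of VPN gateway values to the highest matching alert level. The function and parameter names are hypothetical. In practice CloudMonitor evaluates the rules; this sketch only mirrors the conditions listed above.

```python
def vpn_gateway_alert_level(inbound_util_pct, outbound_util_pct, tunnel_states):
    """Return "Critical", "Warn", "Info", or None for one VPN gateway sample.

    tunnel_states holds the negotiation status of each tunnel (0 = Down, 1 = Up).
    """
    peak_util = max(inbound_util_pct, outbound_util_pct)
    # A single tunnel that fails negotiation is Critical regardless of bandwidth.
    if any(state == 0 for state in tunnel_states) or peak_util > 85:
        return "Critical"
    if peak_util > 50:
        return "Warn"
    if peak_util > 30:
        return "Info"
    return None


if __name__ == "__main__":
    print(vpn_gateway_alert_level(42.0, 18.5, [1, 1]))
    print(vpn_gateway_alert_level(12.0, 9.0, [1, 0]))
```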

Note: If you use the "IPsec connection attached to CEN/TR" method for networking, refer to the "CEN/TR global networking" section for monitoring methods.

Subscribe to the following CloudMonitor system events and push alerts:

Product: VPN Gateway. Event type: Abnormal, Status Notification. Event name: Certificate expired, All IPsec connection tunnels failed to negotiate, IPsec tunnel negotiation failed, health check failed, VPN connection health check failed.

CEN/TR global networking

If you use CEN/TR for global networking, refer to the following suggestions to configure alert rules in CloudMonitor for CEN/TR:

Monitored object	Alert level	Monitoring metrics and conditions
Cloud Enterprise Network - Region monitoring	Info	When one of the following conditions occurs: Peak outbound bandwidth utilization between regions > 30% Peak outbound bandwidth utilization of QoS queue between regions > 30%
	Warn	When one of the following conditions occurs: Peak outbound bandwidth utilization between regions > 50% Peak outbound bandwidth utilization of QoS queue between regions > 50%
	Critical	When one of the following conditions occurs: Peak outbound bandwidth utilization between regions > 85% Peak outbound bandwidth utilization of QoS queue between regions > 85% Outbound throttling packet loss rate between regions > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 100. Outbound throttling packet loss rate of QoS queue between regions > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 100.
Cloud Enterprise Network - Area monitoring	Info	When one of the following conditions occurs: Peak outbound bandwidth utilization of CEN bandwidth plan > 30%
	Warn	When one of the following conditions occurs: Peak outbound bandwidth utilization of CEN bandwidth plan > 50%
	Critical	When one of the following conditions occurs: Peak outbound bandwidth utilization of CEN bandwidth plan > 85%
Cloud Enterprise Network - Transit Router (configure when using Enterprise Edition)	Info	When one of the following conditions occurs at the Transit Router (TR) instance AZ-level monitoring dimension: TR inbound traffic rate > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 2 times the peak value. When one of the following conditions occurs at the Transit Router (TR) connection AZ-level monitoring dimension: Inbound rate > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the connection bandwidth specification × 0.3.
	Warn	When one of the following conditions occurs at the Transit Router (TR) instance AZ-level monitoring dimension: TR inbound traffic rate > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 5 times the peak value. When one of the following conditions occurs at the Transit Router (TR) connection AZ-level monitoring dimension: Inbound rate > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the connection bandwidth specification × 0.5.
	Critical	When one of the following conditions occurs at the Transit Router (TR) instance AZ-level monitoring dimension: TR inbound traffic rate > X. The specific value needs to be adjusted based on actual conditions. A reference setting is 10 times the peak value. When one of the following conditions occurs at the Transit Router (TR) connection AZ-level monitoring dimension: Inbound rate > X. The specific value needs to be adjusted based on actual conditions. We recommend setting it to the connection bandwidth specification × 0.85.

Note: For more information about TR connection bandwidth specifications, see Limits.

Subscribe to the following CloudMonitor system events and push alerts:

Product: Cloud Enterprise Network. Event: 90%QuotaExceeded. Event name: Event for exceeding 90% of quota.

When creating a VPN Attachment in TR, refer to the following suggestions to configure alert rules in CloudMonitor for the VPN connection:

Monitored object	Alert level	Monitoring metrics and conditions
VPN connection	Info	When one of the following conditions occurs: VPN connection single tunnel inbound bandwidth > 300 Mbps (VPN Attachment bandwidth specification × 0.3) VPN connection single tunnel outbound bandwidth > 300 Mbps (VPN Attachment bandwidth specification × 0.3) VPN connection single tunnel inbound packet rate + VPN connection single tunnel outbound packet rate > 36,000 pps (VPN Attachment packet rate specification × 0.3)
	Warn	When one of the following conditions occurs: VPN connection single tunnel inbound bandwidth > 500 Mbps (VPN Attachment bandwidth specification × 0.5) VPN connection single tunnel outbound bandwidth > 500 Mbps (VPN Attachment bandwidth specification × 0.5) VPN connection single tunnel inbound packet rate + VPN connection single tunnel outbound packet rate > 60,000 pps (VPN Attachment packet rate specification × 0.5)
	Critical	When one of the following conditions occurs: VPN connection single tunnel inbound bandwidth > 850 Mbps (VPN Attachment bandwidth specification × 0.85) VPN connection single tunnel outbound bandwidth > 850 Mbps (VPN Attachment bandwidth specification × 0.85) VPN connection single tunnel inbound packet rate + VPN connection single tunnel outbound packet rate > 102,000 pps (VPN Attachment packet rate specification × 0.85)
VPN gateway	Critical	When one of the following conditions occurs at the vpnconnection dimension: Negotiation status of a single tunnel in the VpnAttachment = 0 (0 is Down, 1 is Up)
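
The fixed numbers in the VPN connection rows can be reproduced from the specification they are derived from. Working backward from the Info row (300 for bandwidth and 36,000 for packet rate, each at specification × 0.3), the sketch below assumes a bandwidth specification of 1,000 Mbit/s and a packet rate specification of 120,000 packets per second. Confirm the actual values for your instance in Quotas and limits. All names are hypothetical.

```python
# Ratios from the VPN connection table, expressed as whole percentages.
LEVEL_PERCENT = {"Info": 30, "Warn": 50, "Critical": 85}


def vpn_attachment_thresholds(bandwidth_spec_mbps=1000, packet_rate_spec_pps=120_000):
    """Derive per-level thresholds from the (assumed) attachment specification."""
    return {
        level: {
            "bandwidth_mbps": bandwidth_spec_mbps * pct / 100,
            "packet_rate_pps": packet_rate_spec_pps * pct / 100,
        }
        for level, pct in LEVEL_PERCENT.items()
    }


def packet_rate_exceeded(inbound_pps, outbound_pps, level, thresholds):
    """The table compares the sum of inbound and outbound packet rates."""
    return inbound_pps + outbound_pps > thresholds[level]["packet_rate_pps"]


if __name__ == "__main__":
    thresholds = vpn_attachment_thresholds()
    for level, values in thresholds.items():
        print(level, values)
    print(packet_rate_exceeded(20_000, 17_000, "Info", thresholds))
```

Deriving the numbers from the specification, instead of hardcoding them, keeps the rules valid if the attachment specification differs from the assumed one.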

Note: For more information about VPN connection specifications, see Quotas and limits.

7. Hands-on guide

To integrate cloud service monitoring metrics into ARMS Prometheus, configure custom dashboards, and configure alerts, see Cloud Service Observability.

8. Appendix

Dashboard configuration method

1. Alibaba Cloud CloudMonitor Prometheus

Data ingestion: Go to Application Real-Time Monitoring Service (ARMS) > Integration Center, select the corresponding product (such as EIP or ALB), and then follow the prompts to complete the integration.
Custom dashboard: Go to Application Real-Time Monitoring Service (ARMS) > Provisioning > Cloud Service Integration Environment, select the corresponding product, and then refer to the reference cases in Section 6 to customize the dashboard.

2. Data ingestion for non-Alibaba Cloud CloudMonitor Prometheus (self-managed or third-party)

Deploy a lightweight Prometheus instance on an Alibaba Cloud ECS instance or in an ACK cluster, or run Prometheus in Agent mode.
Configure a collection plugin for Alibaba Cloud resource metrics (based on an Exporter, an API, or logs). For more information, see the open source aliyun_exporter plugin.
In the prometheus.yml file, configure `remote_write` to point to the `/api/v1/write` endpoint of your self-managed Prometheus.
Restart Prometheus to apply the configuration. The collected data is then sent to your self-managed instance.
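
The steps above might translate into a configuration fragment like the following. This is a minimal sketch, not a verified setup: the host name, ports, scrape interval, and job name are placeholders, and the exporter address is an assumption that depends on how you run the collection plugin. If the receiving side is itself a Prometheus server, it typically has to be started with the `--web.enable-remote-write-receiver` flag to accept writes on `/api/v1/write`.

```yaml
# prometheus.yml on the lightweight collector (ECS or ACK).
# Host names, ports, and the job name are placeholders, not values from this guide.
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: aliyun_exporter            # hypothetical job name
    static_configs:
      - targets: ["127.0.0.1:9525"]      # assumed exporter address; use your exporter's port

remote_write:
  - url: "http://prometheus.example.internal:9090/api/v1/write"  # your self-managed Prometheus
```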

3. Data ingestion for other monitoring platforms

You need to develop your own data ingestion solution. You can use the collection plugin for non-Alibaba Cloud CloudMonitor Prometheus as a reference.

Alert configuration method

1. Alibaba Cloud CloudMonitor Prometheus

Go to Application Real-Time Monitoring Service (ARMS) > Alert Rules > Create Alert Rule. Make sure to select the region where the Prometheus instance is located.
Configure the following items in order: rule name, Prometheus instance, custom PromQL statement, severity level, and alert threshold.

2. Alibaba Cloud CloudMonitor

Go to Alibaba Cloud CloudMonitor > Alert Service > Alert Rules > Create Alert Rule.
Select a product.
Create the rule and define the conditions for the Critical, Warn, and Info levels.

Suggestions

Organize your team to conduct regular network inspections (monthly or even weekly), and follow a clear optimization plan until all identified risks are eliminated.

The value of a tool lies in its practical application, but you must prepare in advance. Only through continual learning and practice can you ensure that the tool is effective when issues arise.