Intelligent network O&M solutions
This document introduces the design goals and common scenarios of the Alibaba Cloud Intelligent Network O&M Solution. It explains how to achieve efficient, proactive, and intelligent network operations and management (O&M) using four key methods: using a central dashboard to obtain a global overview, using alerts to quickly detect and locate issues, using inspections to proactively find and eliminate risks, and using tools to analyze and resolve root causes. This document also provides guidance on selecting monitoring and event platforms that support these capabilities. It includes detailed instructions for configuring dashboards and alerts for various Alibaba Cloud networking services to help you apply these practices to your business.
1. Background and requirements
As digital transformation deepens, the cloud network has become the core infrastructure that supports business operations. However, the growing complexity, dynamic nature, and scale of cloud environments present unprecedented challenges to traditional network O&M:
Lack of a global view: Network resources are scattered across different products, which prevents a unified view for monitoring, analysis, and alerting. This fragmentation makes global optimization difficult.
Increased complexity: The widespread adoption of technologies such as hybrid cloud, multicloud architectures, microservices, and containerization (such as ACK) makes network topologies increasingly complex and difficult for traditional manual O&M to manage.
Difficulty in finding performance bottlenecks: Issues such as traffic bursts, bandwidth bottlenecks, and latency jitter are hard to detect and predict in real time, which degrades the user experience.
Difficulty in fault localization: Troubleshooting network link issues across regions, VPCs, accounts, and products is time-consuming. It often relies on experience, resulting in low fault localization efficiency.
Increased security risks: The attack surface is expanding. Incorrect configurations or improper change management of security policies, such as security groups and network ACLs, can lead to security vulnerabilities.
High O&M costs: Relying on a large amount of manual effort for daily inspections, fault response, and configuration management is inefficient and results in high labor costs.
The Intelligent Network O&M Solution aims to help customers effectively adopt, use, and manage the cloud. The solution focuses on guiding customers through their daily O&M tasks, such as monitoring network metrics, identifying network risks, and analyzing and resolving network anomalies. It also helps them upgrade and optimize network performance to meet the new requirements introduced by business iterations.
2. Target customers
This solution is suitable for the following types of customers:
Large enterprises and group customers: Customers with complex hybrid cloud or multicloud architectures, multi-region deployments, and many VPCs and network resources. These customers have extremely high requirements for network stability, security, and O&M efficiency.
Internet and technology companies: Businesses with fast iterations and large traffic fluctuations. These businesses have strong demands for network performance, elasticity, and automatic fault recovery, and they strive for high DevOps/NetOps efficiency.
Customers in key industries such as finance, government, and healthcare: Customers with extremely strict requirements for network high availability and security compliance. They need to meet rigorous audit and regulatory requirements.
Traditional enterprises undergoing digital transformation: Enterprises that are migrating from traditional data centers to the cloud and need to quickly establish modern network O&M capabilities.
IT/Network O&M teams: Teams that want to use intelligent tools to improve O&M efficiency, reduce failure rates, and free up human resources to focus on higher-value work.
3. Solution overview
The Intelligent Network O&M Solution recommends four main methods for customers to perform cloud network O&M:
Use a central "dashboard" to gather data and understand the global situation.
Rely on "alerts" to detect and locate issues.
Perform daily "inspections" to find and eliminate potential risks.
Use "tools" to analyze and resolve root causes.
3.1 Use a central "dashboard" to gather and understand the global situation
A network dashboard is more than just a data visualization board. It is a cloud network operational hub that integrates monitoring, analysis, decision-making, and collaboration. A dashboard should be designed with the following guidelines:
A network dashboard provides data support for specific roles to solve specific problems:
Business owners: Focus on overall availability, Service-Level Agreement (SLA) achievement, and cost trends.
O&M engineers: Focus on fault alerts, link status, and performance anomalies.
Architecture teams: Focus on topology, capacity usage, and scaling bottlenecks.
Suggestion: You can design different views for different roles, such as an "O&M View", "Business View", and "Architecture View".
Display information in layers based on the network architecture to avoid information overload:
Internet access layer: EIPs, Internet Shared Bandwidth, NAT Gateway, and others.
Application delivery layer: CLB, ALB, NLB, GA, and others.
Global networking layer: VPN Gateway, Express Connect, TR, CEN, and others.
Suggestions: 1) Support a three-level top-down drill-down from "Network Product Overview" to "Specific Product Overview" to "Single Instance Details". This lets you move from the overall picture to specific details. 2) Place related metrics on the same chart to increase information density. 3) Use good naming conventions, resource group divisions, and tags to help locate issues quickly.
Focus on key services and metrics. Place important metrics on the dashboard. Other metrics do not require daily attention and are used for analysis only when related issues occur:
Traffic: Peak inbound and outbound bandwidth, traffic trends, and others.
Availability: SLB health check success rate, status of leased line, VPN, or cross-region links, and others.
Performance: Latency, response time, bandwidth utilization, packet loss rate, and others.
Cost: Number of EIPs, CDT bandwidth costs, network element CU costs, and others.
Suggestion: You can use "red/yellow/green" colors on the dashboard to indicate health status.
3.2 Rely on "alerts" to detect and locate problems
Event subscription mechanism: Subscribe to events that affect your business and set up an alert mechanism. This step helps you discover system anomalies, performance issues, or security threats as soon as they occur.
Immediate response process for critical alerts: Create a strict emergency response plan. For alerts marked as "Critical", have a clear plan and assign a dedicated person to coordinate and handle the issue until it is fully resolved.
Regularly check the Event Center: Set up a regular check-in schedule to review the history in the Event Center. Analyzing this data can help you identify trending issues or underlying risks in advance and take preventive measures to avoid service interruptions.
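The periodic review of event history can be partly automated. The following is a minimal sketch in Python, assuming the event history has been exported as a list of records; the `resource` and `name` fields and the event names are hypothetical and do not represent an Event Center API. It counts how often the same event recurs on the same resource so that repeat offenders stand out.

```python
from collections import Counter


def recurring_events(events, min_count=3):
    """Return (resource, event name) pairs seen at least min_count times.

    `events` is an iterable of dicts with hypothetical keys "resource" and
    "name"; adapt the keys to whatever your exported event history contains.
    """
    counts = Counter((e["resource"], e["name"]) for e in events)
    return {key: n for key, n in counts.items() if n >= min_count}


# Hypothetical exported history, for illustration only.
history = [
    {"resource": "eip-a", "name": "BandwidthLimitReached"},
    {"resource": "eip-a", "name": "BandwidthLimitReached"},
    {"resource": "eip-a", "name": "BandwidthLimitReached"},
    {"resource": "nat-b", "name": "ConnectionLimitReached"},
]
print(recurring_events(history))
```

An event that keeps recurring on one resource is usually a better candidate for preventive work than a one-off event, which is why the sketch groups by resource and event name rather than by event name alone.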
3.3 Rely on "inspections" to find and eliminate potential risks
Regular inspections are a crucial part of building and maintaining a network architecture. First, identify and understand the potential risks, which fall into four categories:
Stability risks: These are mainly caused by incorrect primary/secondary configurations, which can prevent a smooth failover during a failure and affect normal system operations. Unreasonable resource deployment can also lead to a large blast radius, which increases the possibility of a system crash.
Security risks: These involve vulnerabilities in network ACL configurations and overly permissive security group rules. These issues can create security holes and leave the network environment vulnerable.
Performance risks: These often manifest as network path detours, which increase data transmission latency. Frequent traffic overruns indicate that the system may need to be scaled out to meet growing demand.
Cost waste: Low resource utilization and an unsuitable choice among the available billing methods can result in unnecessary expenses.
To effectively manage these risks, you need to perform regular inspections. In the NIS console, you can conduct network inspections, view historical reports, and initiate new inspections as needed. This process generates a detailed inspection report. We recommend running it once a week to promptly discover and address potential issues. Once an issue is found, you should immediately enter the risk handling stage. In this stage, you can use the NIS console and network inspection tools to view detailed reports, obtain network optimization suggestions, and take corresponding measures to mitigate the risks based on these suggestions. For example, for stability risks, you can optimize primary/secondary configurations and resource deployment. For security risks, you can patch network ACL configuration vulnerabilities and adjust security group permissions. For performance risks, you can optimize network paths and perform necessary scale-outs. For cost waste, you can improve resource utilization and choose more suitable billing methods.
3.4 Use "tools" to analyze and resolve root causes
NIS: Network Intelligence Service (NIS) is an intelligent network service product launched by Alibaba Cloud. It is based on years of large-scale network O&M practices and technological expertise and is designed for complex network scenarios. It integrates network measurement, diagnosis, and optimization to provide end-to-end network observability and intelligent analysis capabilities. NIS helps users quickly locate connectivity, performance, and fault issues across regions and network domains to achieve a cloud network O&M experience that is "visible, fast to check, and manageable". NIS provides a rich set of tools to analyze and resolve root causes. When you observe anomalies on the dashboard, when an alert occurs, or when an inspection report provides optimization suggestions, you can use the tools provided by NIS to perform the following functions:
Instance diagnosis: You can detect the configuration and running status of an instance and receive quick fixes based on the diagnosed anomalies.
Path analysis: You can analyze end-to-end network connectivity and diagnose connection problems caused by network configuration errors. When a destination is unreachable, you can identify the location and cause of the block.
Network traffic analysis: You can monitor real-time and historical traffic data and metrics in the network to help you understand the performance and load of your network applications. You can keep this feature enabled to continuously monitor and analyze network traffic based on data such as throughput, packet loss, latency, and user distribution. This helps O&M engineers optimize the business architecture based on traffic conditions.
Network Insight: You can analyze the real-time operational status of business unit traffic to help you promptly perceive business network anomalies. It also provides network quality assessment data and event impact analysis.
Network topology: You can quickly understand your Alibaba Cloud network architecture, perform network configuration validation, and conduct unified O&M for your cloud network resources.
Performance monitoring: This feature provides average network latency data within Alibaba Cloud and across the Internet. This helps you choose a region or zone when you set up services.
4. Product portfolio
4.1 Monitoring platform comparison
There are many types of monitoring platforms. We divide them into three main categories based on their ecosystem: Alibaba Cloud CloudMonitor, Prometheus monitoring, and other monitoring platforms (such as Zabbix, the ELK Stack (Elasticsearch, Logstash, and Kibana), and OpenTelemetry). These are further divided into five subcategories: Alibaba Cloud CloudMonitor Basic, Alibaba Cloud Hybrid Cloud Monitoring (part of the Alibaba Cloud CloudMonitor category), ARMS Prometheus+Grafana, self-managed Prometheus+Grafana (part of the Prometheus category), and other monitoring platforms. The comparison is as follows:
Platform | Advantages | Disadvantages | Description
CloudMonitor Basic | | |
Hybrid Cloud Monitoring | | |
ARMS Prometheus+Grafana | | |
Self-managed Prometheus+Grafana | | |
Other monitoring platforms | | |
In practice, the monitoring platforms that a single team uses at the same time are highly fragmented. Teams that invest little in O&M tend to adopt monitoring platforms passively, which leads to this fragmentation. Teams with professional O&M staff tend to unify their monitoring platforms, but they still cannot avoid fragmentation entirely. According to the Observability Survey 2024, 70% of teams use four or more different monitoring platforms. Prometheus+Grafana is the de facto standard in the current cloud-native ecosystem. It has an active community, rich documentation, convenient configuration, powerful features, and a high degree of integration, which makes it suitable for most modern application monitoring scenarios.
Alibaba Cloud's ARMS Prometheus is a managed Prometheus + Grafana service. It eliminates the complexity of deploying, maintaining, and scaling a monitoring system yourself. It is out-of-the-box and supports large-scale metric collection, storage, and visual analytics. It is deeply integrated with the Alibaba Cloud ecosystem and can seamlessly connect with various cloud resources and applications such as ECS, Container Service for Kubernetes (ACK), Serverless, and Application Real-Time Monitoring Service (ARMS), achieving full-stack observability. At the same time, ARMS Prometheus provides powerful alerting capabilities. It supports flexible alert rule configuration based on PromQL and can accurately detect metric anomalies such as increased latency, rising error rates, and resource usage exceeding limits. Alert rules support multi-level thresholds, duration judgments, grouping, and deduplication, effectively avoiding false positives and alert storms. When an alert is triggered, it can notify on-call personnel in real-time through various channels such as DingTalk, text messages, email, and Webhooks. It also integrates with the Alibaba Cloud alert center and event center to achieve unified management and a closed-loop response for the alert lifecycle. This helps teams quickly discover and handle potential risks, ensuring business stability.
4.2 Event platform comparison
Both NIS and CloudMonitor can generate "alert" events. The differences between them are as follows:
Item | CloudMonitor | NIS |
Scope | All cloud products | Network products |
Type | Predefined system events + user-configured thresholds | Predefined system events |
Requires configuration | Yes | No |
Generation logic | Predefined system events: Key events such as exceeding specifications, interruptions, and anomalies. User-configured thresholds: Alerts are generated based on the customer's judgment of how a specific metric affects the system. The customer configures these alerts. | Only predefined system events |
Custom threshold | Yes | No |
Comprehensiveness | Comprehensive | Limited |
Notification methods | Supported: phone, email, DingTalk, Lark, WeCom, SLS, Message Queue for Light-weight IoT, and Function Compute | Not supported. You can push NIS events to CloudMonitor as events and then use CloudMonitor's notification methods. |
A typical CloudMonitor alert is: When the real-time bandwidth of an EIP reaches 5 Gbps, the application's O&M engineer is notified.
A typical NIS alert is: When the bandwidth of an EIP bandwidth plan reaches 95% of its specification, an NIS event is generated. The user can notify the application's O&M engineer through CloudMonitor or poll for this event through an API for automated processing.
CloudMonitor supports system events and user-defined threshold alerts for all cloud products. It has complete alert configuration capabilities, such as metric thresholds, duration judgments, and multi-level alerts. In contrast, NIS only provides predefined system events for network products and does not support user-configured thresholds. Therefore, CloudMonitor is more suitable to serve as an enterprise-level unified alert center.
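The difference between the two typical alerts can be sketched in a few lines of Python. This is an illustration only, not the CloudMonitor or NIS API: the `EipSample` structure, the function names, and the sample values are hypothetical, and the only figures taken from the text are the 5 Gbps user threshold and the 95% share of the specification.

```python
from dataclasses import dataclass


@dataclass
class EipSample:
    """One bandwidth sample for an EIP (hypothetical structure)."""
    instance_id: str
    bandwidth_mbps: float  # measured real-time bandwidth
    spec_mbps: float       # purchased bandwidth specification


def user_threshold_alert(sample, threshold_mbps=5000.0):
    """CloudMonitor-style rule: fires at an absolute value chosen by the user."""
    return sample.bandwidth_mbps >= threshold_mbps


def predefined_spec_event(sample, ratio=0.95):
    """NIS-style event: fires at a fixed share of the instance specification."""
    return sample.bandwidth_mbps >= ratio * sample.spec_mbps


sample = EipSample("eip-example", bandwidth_mbps=960.0, spec_mbps=1000.0)
print(user_threshold_alert(sample))   # False: 960 Mbps is below 5 Gbps
print(predefined_spec_event(sample))  # True: 960 Mbps is above 95% of 1000 Mbps
```

The same traffic level can therefore trigger one kind of alert and not the other. A user-configured threshold encodes what the business considers abnormal, while a predefined event only reflects how close the resource is to its own limit.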
5. Scenarios
Core application scenarios include the following:
Network performance monitoring and optimization.
Anomaly detection and intelligent alerting.
Quick fault localization and root cause analysis.
Resource utilization analysis and cost optimization.
Automated network O&M.
Unified management of multicloud/hybrid cloud networks.
6. Configuration reference manual
6.1 Dashboard configuration reference
The following section introduces typical dashboard designs for cloud networks, layered by network architecture.
6.1.1 Public network business dashboard
Elastic IP Address (EIP) is the standard way Alibaba Cloud products provide public network access. Most network services support binding a user-provided EIP. However, for historical reasons, some products also support public network services without a user-provided EIP. See the following table and suggestions.
Product | Public network type | Suggestion
ECS | | Use EIPs uniformly.
SLB | |
ALB/NLB/NAT | |
VPN | Public VPN gateways use non-user EIPs. | Set up a separate dashboard.
GA | GA uses non-user EIPs. |
For statistical metrics, the inbound and outbound bandwidth/traffic on an EIP always represents the actual bandwidth/traffic transmitted on that EIP. However, the object to monitor for utilization, packet loss, and billing differs based on the combination of EIP + "added to Internet Shared Bandwidth" + "CDT enabled". Refer to the following table for details.
Product | Added to Internet Shared Bandwidth (CBWP) | CDT enabled | Description | Monitored object
EIP | No | No | |
EIP | No | Yes | |
EIP | Yes | No | |
EIP | Yes | Yes | |
Public network dashboard design reference
The EIP dashboard supports filtering instances by region, resource group, instance ID, or IP address. The instance table links each instance to its monitoring page.
Panel name | Type | Metric | Axis | Description |
Total EIP rate | Time series | Inbound rate: Sum | Left | |
Outbound rate: Sum | Left | |||
Throttling packet loss rate: Sum | Right | Mark as red if > 100 | ||
EIP bandwidth utilization | Time series | Inbound bandwidth utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 |
Outbound bandwidth utilization: Max, Min, Avg | Left | |||
Top N instances by EIP throttling packet loss | Table | Throttling packet loss rate | Mark as red if > 100 | |
Top N instances by EIP inbound bandwidth utilization | Table | Inbound bandwidth utilization | Mark as yellow if > 50 Mark as red if > 80 | |
Top N instances by EIP outbound bandwidth utilization | Table | Outbound bandwidth utilization | ||
Top N instances by EIP inbound rate | Table | Inbound rate | ||
Top N instances by EIP outbound rate | Table | Outbound rate |
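The color rules in the Description column can be expressed as a small function. The following Python sketch is only an illustration, not a feature of any dashboard product. The function names are hypothetical, the thresholds come from the table (yellow above 50 and red above 80 for utilization, red above 100 for throttling packet loss), and treating every other value as green is an assumption.

```python
def utilization_color(utilization_pct, yellow_above=50.0, red_above=80.0):
    """Map a bandwidth utilization percentage to a dashboard color."""
    if utilization_pct > red_above:
        return "red"
    if utilization_pct > yellow_above:
        return "yellow"
    return "green"  # assumption: anything not marked yellow or red is healthy


def packet_loss_color(drop_rate, red_above=100.0):
    """Throttling packet loss has a single marker in the table: red above 100."""
    return "red" if drop_rate > red_above else "green"


for value in (35.0, 50.0, 65.0, 92.5):
    print(value, utilization_color(value))
```

The comparisons are strict, so a value of exactly 50 or 80 stays in the lower band, which matches the "> 50" and "> 80" wording in the table.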
The Internet Shared Bandwidth dashboard supports filtering instances by region, resource group, instance ID, or IP address. The instance table links each instance to its monitoring page.
Panel name | Type | Metric | Axis | Description |
Total EBWP rate | Time series | Inbound rate: Sum | Left | |
Outbound rate: Sum | Left | |||
Throttling packet loss rate: Sum | Right | Mark as red if > 100 | ||
EBWP bandwidth utilization | Time series | Inbound bandwidth utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 |
Outbound bandwidth utilization: Max, Min, Avg | Left | |||
Top N instances by EBWP throttling packet loss | Table | Throttling packet loss rate > 0 | Mark as red if > 100 | |
Top N instances by EBWP inbound bandwidth utilization | Table | Inbound bandwidth utilization > 30 | Mark as yellow if > 50 Mark as red if > 80 | |
Top N instances by EBWP outbound bandwidth utilization | Table | Outbound bandwidth utilization > 30 | ||
Top N instances by EBWP inbound rate | Table | Inbound rate | ||
Top N instances by EBWP outbound rate | Table | Outbound rate |
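Several Top N panels combine a filter such as "> 30" with a ranking. A minimal Python sketch of that selection follows. The `top_n` helper, the field names, and the instance IDs are hypothetical, and the data is assumed to be already fetched from the monitoring platform.

```python
def top_n(rows, metric, n=10, above=None):
    """Return the n rows with the largest value of `metric`.

    If `above` is set, rows whose value is not strictly greater than it are
    dropped first, which mirrors filters such as "utilization > 30".
    """
    if above is not None:
        rows = [r for r in rows if r[metric] > above]
    return sorted(rows, key=lambda r: r[metric], reverse=True)[:n]


# Hypothetical instances and values, for illustration only.
rows = [
    {"instance": "cbwp-1", "out_utilization": 12.0},
    {"instance": "cbwp-2", "out_utilization": 86.0},
    {"instance": "cbwp-3", "out_utilization": 47.5},
    {"instance": "cbwp-4", "out_utilization": 30.0},
]
for row in top_n(rows, "out_utilization", n=3, above=30):
    print(row["instance"], row["out_utilization"])
```

Filtering before ranking keeps the table empty when nothing is above the floor, so an empty panel itself signals that no instance needs attention.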
6.1.2 Network element business dashboard
SLB dashboard design reference
SLB has over 54 monitoring metrics. We classify them by horizontal and vertical dimensions.
The horizontal dimension can be divided into SLB listener granularity and SLB instance granularity. Most statistical metrics have both listener-granularity and instance-granularity versions, with two exceptions: resource utilization metrics are available only at the instance granularity, and health check metrics are available only at the listener granularity.
Listener-granularity metric names do not contain "Instance". For example, AliyunSlb_ActiveConnection is the statistic for the number of active connections of a specific SLB listener.
Instance-granularity metric names contain "Instance". For example, AliyunSlb_InstanceActiveConnection is the sum of active connection statistics for the entire SLB instance.
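The naming rule above can be applied mechanically when you build panels from a list of metric names. The following Python sketch assumes only what the two bullets state, namely that the presence of "Instance" in the name distinguishes the two granularities; the helper name `slb_granularity` is hypothetical.

```python
SLB_PREFIX = "AliyunSlb_"


def slb_granularity(metric_name):
    """Classify an SLB metric as "instance" or "listener" by its name."""
    if not metric_name.startswith(SLB_PREFIX):
        raise ValueError(f"not an SLB metric name: {metric_name}")
    suffix = metric_name[len(SLB_PREFIX):]
    return "instance" if "Instance" in suffix else "listener"


print(slb_granularity("AliyunSlb_ActiveConnection"))          # listener
print(slb_granularity("AliyunSlb_InstanceActiveConnection"))  # instance
```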
The vertical dimension can be divided into two main categories, Layer 4 and Layer 7, each with four subcategories.
Layer 4 statistical metrics are divided into four categories: health check, resource utilization, traffic, and connections.
Layer 7 statistical metrics are divided into four categories: resource utilization, response time, status code, and others.
Layer | Category | Listener granularity | Instance granularity
Layer 4 | Health check | AliyunSlb_HealthyServerCount; AliyunSlb_UnhealthyServerCount | N/A
Layer 4 | Resource utilization | N/A | AliyunSlb_InstanceNewConnectionUtilization - New connection utilization of the instance; AliyunSlb_InstanceMaxConnectionUtilization - Maximum connection utilization of the instance; AliyunSlb_InstanceTrafficRXUtilization - Received traffic utilization of the instance; AliyunSlb_InstanceTrafficTXUtilization - Sent traffic utilization of the instance
Layer 4 | Traffic | AliyunSlb_TrafficRXNew - New received traffic; AliyunSlb_TrafficTXNew - New sent traffic; AliyunSlb_DropTrafficRX - Dropped traffic at the receiver; AliyunSlb_DropTrafficTX - Dropped traffic at the sender; AliyunSlb_PacketRX - Received packets; AliyunSlb_PacketTX - Sent packets; AliyunSlb_DropPacketRX - Dropped packets at the receiver; AliyunSlb_DropPacketTX - Dropped packets at the sender | AliyunSlb_InstanceTrafficRX - Received traffic of the instance; AliyunSlb_InstanceTrafficTX - Sent traffic of the instance; AliyunSlb_InstanceDropTrafficRX - Dropped traffic at the receiver of the instance; AliyunSlb_InstanceDropTrafficTX - Dropped traffic at the sender of the instance; AliyunSlb_InstancePacketRX - Received packets of the instance; AliyunSlb_InstancePacketTX - Sent packets of the instance; AliyunSlb_InstanceDropPacketRX - Dropped packets at the receiver of the instance; AliyunSlb_InstanceDropPacketTX - Dropped packets at the sender of the instance
Layer 4 | Connections | AliyunSlb_ActiveConnection - Current active connections; AliyunSlb_InactiveConnection - Inactive connections; AliyunSlb_NewConnection - New connections; AliyunSlb_MaxConnection - Maximum connections; AliyunSlb_DropConnection - Dropped connections | AliyunSlb_InstanceActiveConnection - Current active connections of the instance; AliyunSlb_InstanceInactiveConnection - Inactive connections of the instance; AliyunSlb_InstanceNewConnection - New connections of the instance; AliyunSlb_InstanceMaxConnection - Maximum connections of the instance; AliyunSlb_InstanceDropConnection - Dropped connections of the instance
Layer 7 | Utilization | N/A | AliyunSlb_InstanceQpsUtilization - QPS utilization of the instance
Layer 7 | Response time | AliyunSlb_Rt - Response time | AliyunSlb_InstanceRt - Response time of the instance; AliyunSlb_InstanceUpstreamRt - Backend response time of the instance
Layer 7 | Status code | AliyunSlb_StatusCode2xx - Number of requests with HTTP status code 2xx; AliyunSlb_StatusCode3xx - Number of requests with HTTP status code 3xx; AliyunSlb_StatusCode4xx - Number of requests with HTTP status code 4xx; AliyunSlb_StatusCode5xx - Number of requests with HTTP status code 5xx; AliyunSlb_StatusCodeOther - Number of requests with other HTTP status codes; AliyunSlb_UpstreamCode4xx - Number of times the backend returned a 4xx status code; AliyunSlb_UpstreamCode5xx - Number of times the backend returned a 5xx status code | AliyunSlb_InstanceStatusCode2xx - Number of requests with HTTP status code 2xx for the instance; AliyunSlb_InstanceStatusCode3xx - Number of requests with HTTP status code 3xx for the instance; AliyunSlb_InstanceStatusCode4xx - Number of requests with HTTP status code 4xx for the instance; AliyunSlb_InstanceStatusCode5xx - Number of requests with HTTP status code 5xx for the instance; AliyunSlb_InstanceStatusCodeOther - Number of requests with other HTTP status codes for the instance; AliyunSlb_InstanceUpstreamCode4xx - Number of times the backend of the instance returned a 4xx status code; AliyunSlb_InstanceUpstreamCode5xx - Number of times the backend of the instance returned a 5xx status code
Layer 7 | Other | AliyunSlb_Qps - Queries per second (QPS) | AliyunSlb_InstanceQps - Queries per second (QPS) of the instance
CLB dashboard design reference
The CLB (formerly SLB) dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance table links each instance to its monitoring page.
Because CLB metrics have multiple dimensions, we recommend the following dimensions for the Top N section:
Utilization metrics are available only at the instance granularity, so display them per instance.
If a CLB instance serves a single business, we recommend displaying the CLB instance in the Top N.
If a CLB instance serves mixed businesses, we recommend displaying the CLB listener in the Top N.
Panel name | Type | Metric | Axis | Description |
Total SLB rate | Time series | Inbound rate: Sum | Left | |
Outbound rate: Sum | Left | |||
Inbound drop rate: Sum | Right | Mark as red if > 100 | ||
Outbound drop rate: Sum | Right | Mark as red if > 100 | ||
Total SLB connections | Time series | Active connections: Sum | Left | |
Inactive connections: Sum | Left | |||
New connections: Sum | Left | |||
Maximum connections: Sum | Left | |||
Dropped connections: Sum | Right | Set yellow and red markers | ||
Health check | Time series | Healthy servers: Sum | Left | Mark as green |
Unhealthy servers: Sum | Left | Set yellow and red markers | ||
SLB instance utilization | Time series | New connection utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 |
Maximum connection utilization: Max, Min, Avg | Left | |||
Inbound bandwidth utilization: Max, Min, Avg | Left | |||
Outbound bandwidth utilization: Max, Min, Avg | Left | |||
Layer 7 QPS utilization: Max, Min, Avg | Left | |||
Layer 7 QPS | Time series | QPS: Sum | Left | |
Layer 7 response time | Time series | Response time: Max, Min, Avg | Left | Set yellow and red markers |
Backend response time: Max, Min, Avg | Left | |||
Layer 7 status code statistics | Time series | 2xx: Sum | Left | |
3xx: Sum | Left | |||
4xx: Sum | Left | Set yellow and red markers | ||
5xx: Sum | Left | Set yellow and red markers | ||
Other: Sum | Left | |||
Upstream4xx: Sum | Left | Set yellow and red markers | ||
Upstream5xx: Sum | Left | Set yellow and red markers | ||
Top N by inbound rate | Table | Inbound rate | ||
Top N by outbound rate | Table | Outbound rate | ||
Top N by inbound drop rate | Table | Inbound drop rate > 0 | Mark as red | |
Top N by outbound drop rate | Table | Outbound drop rate > 0 | Mark as red | |
Top N by maximum connections | Table | |||
Top N by new connections | Table | |||
Top N by dropped connections | Table | Dropped connections > 0 | Mark as red | |
Top N by unhealthy servers | Table | Unhealthy servers > 0 | Mark as red | |
High utilization instances | Table | New connection utilization > 30 OR Maximum connection utilization > 30 OR Inbound bandwidth utilization > 30 OR Outbound bandwidth utilization > 30 OR QPS utilization > 30 | Mark as yellow if > 50 Mark as red if > 80 | |
Top N by response time | Table | Response time > Average value × 2 | Dynamic threshold marking | |
Top N by QPS | Table | QPS | ||
Top N by 4xx | Table | 4xx | ||
Top N by 5xx | Table | 5xx |
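Two panels in the table use compound conditions: "High utilization instances" uses an OR across five utilization metrics, and "Top N by response time" uses a dynamic threshold of twice the average. The Python sketch below shows one way to evaluate both. The field names, helper names, and sample values are hypothetical, and the data is assumed to be already collected per instance.

```python
from statistics import mean

# Hypothetical field names for the five utilization metrics in the table.
UTILIZATION_FIELDS = (
    "new_conn_util",
    "max_conn_util",
    "in_bw_util",
    "out_bw_util",
    "qps_util",
)


def high_utilization(instances, floor=30.0):
    """Keep instances where at least one utilization metric exceeds `floor`."""
    return [i for i in instances if any(i[f] > floor for f in UTILIZATION_FIELDS)]


def slow_instances(instances, factor=2.0):
    """Keep instances whose response time exceeds `factor` times the average."""
    if not instances:
        return []
    threshold = factor * mean(i["rt_ms"] for i in instances)
    return [i for i in instances if i["rt_ms"] > threshold]


# Hypothetical sample data, for illustration only.
instances = [
    {"instance": "lb-1", "rt_ms": 10.0, **dict.fromkeys(UTILIZATION_FIELDS, 10.0)},
    {"instance": "lb-2", "rt_ms": 12.0, **dict.fromkeys(UTILIZATION_FIELDS, 5.0), "qps_util": 45.0},
    {"instance": "lb-3", "rt_ms": 100.0, **dict.fromkeys(UTILIZATION_FIELDS, 20.0)},
]
print([i["instance"] for i in high_utilization(instances)])
print([i["instance"] for i in slow_instances(instances)])
```

Because the dynamic threshold is derived from the average, a single very slow instance in a small fleet raises the threshold for all the others. With only a few instances, a fixed threshold may be easier to reason about.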
NLB dashboard design reference
NLB has over 45 monitoring metrics. We classify them by horizontal and vertical dimensions.
The horizontal dimension can be divided into NLB listener granularity, NLB VIP granularity, and NLB instance granularity. Most statistical metrics have listener-granularity, VIP-granularity, and instance-granularity versions.
Listener-granularity metric names do not contain "Instance" or "Vip". For example, AliyunNlb_ActiveConnection is the statistic for the number of active connections of a specific NLB listener.
VIP-granularity metric names contain "Vip". For example, AliyunNlb_VipActiveConnection is the statistic for the number of active connections of a specific NLB VIP.
Instance-granularity metric names contain "Instance". For example, AliyunNlb_InstanceActiveConnection is the sum of active connection statistics for the entire NLB instance.
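With three granularities, it can help to group a metric list by name before you decide which panels to build. The following Python sketch applies the three naming rules above. The helper names are hypothetical, and the rule is only a heuristic based on the stated convention.

```python
from collections import defaultdict

NLB_PREFIX = "AliyunNlb_"


def nlb_granularity(metric_name):
    """Classify an NLB metric as "vip", "instance", or "listener" by its name."""
    if not metric_name.startswith(NLB_PREFIX):
        raise ValueError(f"not an NLB metric name: {metric_name}")
    suffix = metric_name[len(NLB_PREFIX):]
    if "Vip" in suffix:
        return "vip"
    if "Instance" in suffix:
        return "instance"
    return "listener"


def group_by_granularity(metric_names):
    """Group metric names into a dict keyed by granularity."""
    groups = defaultdict(list)
    for name in metric_names:
        groups[nlb_granularity(name)].append(name)
    return dict(groups)


names = [
    "AliyunNlb_ActiveConnection",
    "AliyunNlb_VipActiveConnection",
    "AliyunNlb_InstanceActiveConnection",
]
print(group_by_granularity(names))
```

A metric whose name lacks the marker for its granularity falls through to "listener", so check the grouped result against the metric list before you rely on it.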
The vertical dimension can be divided into four main categories: health check, traffic, connections, and others.
Classification | Listener granularity | Instance granularity | VIP granularity |
Health check | AliyunNlb_ListenerHeathyServerCount - Number of healthy servers AliyunNlb_ListenerUnhealthyServerCount - Number of unhealthy servers | AliyunNlb_NlbInstanceHeathyServerCount - Number of healthy servers for the instance AliyunNlb_InstanceUnhealthyServerCount - Number of unhealthy servers for the instance | N/A |
Traffic | AliyunNlb_TrafficRXNew - Received traffic AliyunNlb_TrafficTXNew - Sent traffic AliyunNlb_ListenerPacketRX - Received packets AliyunNlb_ListenerPacketTX - Sent packets AliyunNlb_DropTrafficRX - Received dropped traffic AliyunNlb_DropTrafficTX - Sent dropped traffic AliyunNlb_DropPacketRX - Received dropped packets AliyunNlb_DropPacketTX - Sent dropped packets | AliyunNlb_InstanceTrafficRX - Instance received traffic | AliyunNlb_VipTrafficRX - VIP received traffic |
Connections | AliyunNlb_NewConnection - New connections AliyunNlb_MaxConnection - Maximum connections AliyunNlb_DropConnection - Dropped connections AliyunNlb_ActiveConnection - Active connections AliyunNlb_InactiveConnection - Inactive connections | AliyunNlb_InstanceNewConnection - Instance new connections AliyunNlb_InstanceMaxConnection - Instance maximum connections AliyunNlb_InstanceDropConnection - Instance dropped connections AliyunNlb_InstanceActiveConnection - Instance active connections AliyunNlb_InstanceInactiveConnection - Instance inactive connections | AliyunNlb_VipNewConnection - VIP new connections AliyunNlb_VipMaxConnection - VIP maximum connections AliyunNlb_VipDropConnection - VIP dropped connections AliyunNlb_VipActiveConnection - VIP active connections AliyunNlb_VipInactiveConnection - VIP inactive connections |
Other | N/A | N/A | AliyunNlb_VipClientResetPacket - Number of VIP client reset packets AliyunNlb_RealServerResetPacket - Number of VIP server reset packets |
NLB dashboard design reference
The NLB dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance table links each instance to its monitoring page.
Because NLB has many metric dimensions, we recommend the following for the Top N section dimensions:
If an NLB instance serves a single business, we recommend displaying the NLB instance in the Top N.
If an NLB instance serves mixed businesses, we recommend displaying the NLB listener in the Top N.
Other dimension information is used for problem analysis and is not displayed on the dashboard.
Panel name | Type | Metric | Axis | Description |
Total NLB rate | Time series | Inbound rate: Sum | Left | |
Outbound rate: Sum | Left | |||
Inbound drop rate: Sum | Right | Mark as red if > 100 | ||
Outbound drop rate: Sum | Right | Mark as red if > 100 | ||
Total NLB connections | Time series | Active connections: Sum | Left | |
Inactive connections: Sum | Left | |||
New connections: Sum | Left | |||
Maximum connections: Sum | Left | |||
Dropped connections: Sum | Right | Set yellow and red markers | ||
Health check | Time series | Healthy servers: Sum | Left | Mark as green |
Unhealthy servers: Sum | Left | Set yellow and red markers | ||
Reset | Time series | Number of client reset packets | Left | Mark as red if > 100 |
Number of server reset packets | Left | Mark as red if > 100 | ||
Top N by inbound rate | Table | Inbound rate | ||
Top N by outbound rate | Table | Outbound rate | ||
Top N by inbound drop rate | Table | Inbound drop rate > 0 | Mark as red | |
Top N by outbound drop rate | Table | Outbound drop rate > 0 | Mark as red | |
Top N by maximum connections | Table | |||
Top N by new connections | Table | |||
Top N by dropped connections | Table | Dropped connections > 0 | Mark as red | |
Top N by unhealthy servers | Table | Unhealthy servers > 0 | Mark as red | |
Top N by ClientReset | Table | Number of client reset packets > 0 | ||
Top N by ServerReset | Table | Number of server reset packets > 0 |
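The Top N panels above amount to an optional filter (for example, "dropped connections > 0") followed by a descending sort. The following is a minimal Python sketch of that logic, assuming the per-instance or per-listener values have already been fetched from your monitoring platform. The `top_n` helper and the instance IDs are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Optional, Tuple


def top_n(values: Dict[str, float], n: int = 10,
          greater_than: Optional[float] = None) -> List[Tuple[str, float]]:
    """Return the n largest (name, value) pairs.

    If greater_than is set, only values strictly above it are kept,
    which mirrors panel conditions such as "Dropped connections > 0".
    """
    rows = [(name, value) for name, value in values.items()
            if greater_than is None or value > greater_than]
    # Sort by value descending; break ties by name so the table order is stable.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows[:n]


# Hypothetical dropped-connection counts keyed by NLB instance ID.
dropped = {"nlb-a": 0, "nlb-b": 12, "nlb-c": 3, "nlb-d": 0}
print(top_n(dropped, n=3, greater_than=0))
```

For an NLB instance that serves mixed businesses, the same helper could be fed values keyed by listener instead of by instance.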
ALB dashboard design reference
ALB has over 112 monitoring metrics. We classify them by horizontal and vertical dimensions.
The horizontal dimension can be divided into ALB listener granularity, ALB VIP granularity, ALB rule granularity, ALB server group granularity, and ALB instance granularity. Most statistical metrics have listener-granularity, VIP-granularity, and instance-granularity versions. A few metrics also provide rule-granularity and server group-granularity.
Listener-granularity metric names contain "Listener". For example, AliyunAlb_ListenerQPS is the QPS statistic for a specific ALB listener.
VIP-granularity metric names contain "Vip". For example, AliyunAlb_VipQPS is the QPS statistic for a specific ALB VIP.
Rule-granularity metric names contain "Rule". For example, AliyunAlb_RuleQPS is the QPS statistic for a specific ALB rule.
Server group-granularity metric names contain "ServerGroup". For example, AliyunAlb_ServerGroupQPS is the QPS statistic for a specific ALB server group.
Instance-granularity metric names contain "LoadBalancer". For example, AliyunAlb_LoadBalancerQPS is the sum of QPS statistics for the entire ALB instance.
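The naming convention above can be applied mechanically when grouping metrics for a dashboard. The following Python sketch infers the granularity from an ALB metric name. It assumes the granularity marker directly follows the `AliyunAlb_` prefix, as it does in every example listed in this section. The `alb_granularity` helper is hypothetical.

```python
# Marker that follows the "AliyunAlb_" prefix -> granularity it denotes.
GRANULARITY_MARKERS = [
    ("ServerGroup", "server group"),
    ("LoadBalancer", "instance"),
    ("Listener", "listener"),
    ("Rule", "rule"),
    ("Vip", "VIP"),
]

PREFIX = "AliyunAlb_"


def alb_granularity(metric_name: str) -> str:
    """Infer the granularity of an ALB metric from its name."""
    if not metric_name.startswith(PREFIX):
        raise ValueError(f"not an ALB metric name: {metric_name!r}")
    suffix = metric_name[len(PREFIX):]
    for marker, granularity in GRANULARITY_MARKERS:
        # Match at the start of the suffix to avoid accidental matches
        # elsewhere in the name.
        if suffix.startswith(marker):
            return granularity
    return "unknown"


for name in ("AliyunAlb_ListenerQPS", "AliyunAlb_LoadBalancerQPS"):
    print(name, "->", alb_granularity(name))
```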
The vertical dimension can be divided into six main categories: health check, traffic, connections, response time, status code, and others.
Classification | Listener granularity | Instance granularity | VIP granularity | Rule granularity | Server group granularity |
Health check | AliyunAlb_ListenerHealthyHostCount AliyunAlb_ListenerUnHealthyHostCount | AliyunAlb_LoadBalancerHealthyHostCount AliyunAlb_LoadBalancerUnHealthyHostCount | | AliyunAlb_RuleHealthyHostCount AliyunAlb_RuleUnHealthyHostCount | AliyunAlb_ServerGroupHealthyHostCount AliyunAlb_ServerGroupUnHealthyHostCount |
Traffic | AliyunAlb_ListenerInBits AliyunAlb_ListenerOutBits | AliyunAlb_LoadBalancerInBits AliyunAlb_LoadBalancerOutBits | AliyunAlb_VipInBits AliyunAlb_VipOutBits | ||
Connections | AliyunAlb_ListenerActiveConnection AliyunAlb_ListenerInactiveConnection AliyunAlb_ListenerNewConnection AliyunAlb_ListenerMaxConnection AliyunAlb_ListenerRejectedConnection AliyunAlb_ListenerUpstreamConnectionError | AliyunAlb_LoadBalancerActiveConnection AliyunAlb_LoadBalancerInactiveConnection AliyunAlb_LoadBalancerNewConnection AliyunAlb_LoadBalancerMaxConnection AliyunAlb_LoadBalancerRejectedConnection AliyunAlb_LoadBalancerUpstreamConnectionError | AliyunAlb_VipActiveConnection AliyunAlb_VipInactiveConnection AliyunAlb_VipNewConnection AliyunAlb_VipMaxConnection AliyunAlb_VipRejectedConnection AliyunAlb_VipUpstreamConnectionError | AliyunAlb_RuleUpstreamConnectionError | AliyunAlb_ServerGroupUpstreamConnectionError |
Response time | AliyunAlb_ListenerRequestTime AliyunAlb_ListenerUpstreamResponseTime | AliyunAlb_LoadBalancerRequestTime AliyunAlb_LoadBalancerUpstreamResponseTime | AliyunAlb_VipRequestTime AliyunAlb_VipUpstreamResponseTime | AliyunAlb_RuleRequestTime AliyunAlb_RuleUpstreamResponseTime | AliyunAlb_ServerGroupRequestTime AliyunAlb_ServerGroupUpstreamResponseTime |
Status code | AliyunAlb_ListenerHTTPCode2XX AliyunAlb_ListenerHTTPCode3XX AliyunAlb_ListenerHTTPCode4XX AliyunAlb_ListenerHTTPCode5XX AliyunAlb_ListenerHTTPCode500 AliyunAlb_ListenerHTTPCode502 AliyunAlb_ListenerHTTPCode503 AliyunAlb_ListenerHTTPCode504 AliyunAlb_ListenerHTTPCodeUpstream2XX AliyunAlb_ListenerHTTPCodeUpstream3XX AliyunAlb_ListenerHTTPCodeUpstream4XX AliyunAlb_ListenerHTTPCodeUpstream5XX | AliyunAlb_LoadBalancerHTTPCode2XX AliyunAlb_LoadBalancerHTTPCode3XX AliyunAlb_LoadBalancerHTTPCode4XX AliyunAlb_LoadBalancerHTTPCode5XX AliyunAlb_LoadBalancerHTTPCode500 AliyunAlb_LoadBalancerHTTPCode502 AliyunAlb_LoadBalancerHTTPCode503 AliyunAlb_LoadBalancerHTTPCode504 AliyunAlb_LoadBalancerHTTPCodeUpstream2XX AliyunAlb_LoadBalancerHTTPCodeUpstream3XX AliyunAlb_LoadBalancerHTTPCodeUpstream4XX AliyunAlb_LoadBalancerHTTPCodeUpstream5XX | AliyunAlb_VipHTTPCode2XX AliyunAlb_VipHTTPCode3XX AliyunAlb_VipHTTPCode4XX AliyunAlb_VipHTTPCode5XX AliyunAlb_VipHTTPCode500 AliyunAlb_VipHTTPCode502 AliyunAlb_VipHTTPCode503 AliyunAlb_VipHTTPCode504 | AliyunAlb_RuleHTTPCodeUpstream2XX AliyunAlb_RuleHTTPCodeUpstream3XX AliyunAlb_RuleHTTPCodeUpstream4XX AliyunAlb_RuleHTTPCodeUpstream5XX AliyunAlb_RuleHTTPCodeUpstream2XXRatio AliyunAlb_RuleHTTPCodeUpstream3XXRatio AliyunAlb_RuleHTTPCodeUpstream4XXRatio AliyunAlb_RuleHTTPCodeUpstream5XXRatio | AliyunAlb_ServerGroupHTTPCodeUpstream2XX AliyunAlb_ServerGroupHTTPCodeUpstream3XX AliyunAlb_ServerGroupHTTPCodeUpstream4XX AliyunAlb_ServerGroupHTTPCodeUpstream5XX |
Other | AliyunAlb_ListenerQPS AliyunAlb_ListenerNonStickyRequest AliyunAlb_ListenerUpstreamTLSNegotiationError AliyunAlb_ListenerClientTLSNegotiationError AliyunAlb_ListenerHTTPFixedResponse AliyunAlb_ListenerHTTPRedirect | AliyunAlb_LoadBalancerQPS AliyunAlb_LoadBalancerNonStickyRequest AliyunAlb_LoadBalancerUpstreamTLSNegotiationError AliyunAlb_LoadBalancerClientTLSNegotiationError AliyunAlb_LoadBalancerHTTPFixedResponse AliyunAlb_LoadBalancerHTTPRedirect | AliyunAlb_VipQPS AliyunAlb_VipNonStickyRequest AliyunAlb_VipUpstreamTLSNegotiationError AliyunAlb_VipClientTLSNegotiationError AliyunAlb_VipHTTPFixedResponse AliyunAlb_VipHTTPRedirect | AliyunAlb_RuleQPS AliyunAlb_RuleNonStickyRequest AliyunAlb_RuleUpstreamTLSNegotiationError | AliyunAlb_ServerGroupQPS AliyunAlb_ServerGroupNonStickyRequest AliyunAlb_ServerGroupUpstreamTLSNegotiationError |
ALB dashboard design reference
The ALB dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.
Because ALB has many metric dimensions, we recommend the following for the Top N section dimensions:
If an ALB instance serves a single business, we recommend displaying the ALB instance in the Top N.
If an ALB instance serves mixed businesses, we recommend displaying the ALB listener in the Top N.
Other dimension information is used for problem analysis and is not displayed on the dashboard.
Panel name | Type | Metric | Axis | Description |
Total ALB rate | Time series | Inbound rate: Sum | Left | |
Outbound rate: Sum | Left | |||
Total ALB connections | Time series | Active connections: Sum | Left | |
Inactive connections: Sum | Left | |||
New connections: Sum | Left | |||
Maximum connections: Sum | Left | |||
Rejected connections: Sum | Right | Set yellow and red markers | ||
Upstream connection errors: Sum | Right | Set yellow and red markers | ||
Health check | Time series | Healthy servers: Sum | Left | Mark as green |
Unhealthy servers: Sum | Left | Set yellow and red markers | ||
TLS errors | Time series | TLS negotiation errors: Sum | Left | Set yellow and red markers |
Upstream TLS negotiation errors: Sum | Left | Set yellow and red markers | ||
Layer 7 QPS | Time series | QPS: Sum | Left | |
Layer 7 response time | Time series | Response time: Max, Min, Avg | Left | Set yellow and red markers |
Backend response time: Max, Min, Avg | Left | |||
Layer 7 status code statistics | Time series | 2xx: Sum | Left | |
3xx: Sum | Left | |||
4xx: Sum | Left | Set yellow and red markers | ||
5xx: Sum | Left | Set yellow and red markers | ||
Upstream4xx: Sum | Left | Set yellow and red markers | ||
Upstream5xx: Sum | Left | Set yellow and red markers | ||
Top N by inbound rate | Table | Inbound rate | ||
Top N by outbound rate | Table | Outbound rate | ||
Top N by maximum connections | Table | |||
Top N by new connections | Table | |||
Top N by rejected connections | Table | Rejected connections > 0 | Mark as red | |
Top N by unhealthy servers | Table | Unhealthy servers > 0 | Mark as red | |
Top N by TLS negotiation errors | Table | TLS negotiation errors > 0 | Mark as red | |
Top N by upstream TLS negotiation errors | Table | Upstream TLS negotiation errors > 0 | Mark as red | |
Top N by response time | Table | Response time > Average value × 2 | Mark as yellow or red | |
Top N by QPS | Table | QPS | ||
Top N by 4xx | Table | 4xx | ||
Top N by 5xx | Table | 5xx |
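The "Response time > Average value × 2" condition in the Top N by response time panel can be sketched as follows. This is a minimal Python illustration, assuming the average is taken across all listeners (or instances) shown on the dashboard. The `slow_targets` helper, the listener IDs, and the sample values are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def slow_targets(response_ms: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Return names whose response time exceeds factor x the overall average.

    The result is ordered from slowest to fastest.
    """
    if not response_ms:
        return []
    threshold = mean(response_ms.values()) * factor
    flagged = [name for name, value in response_ms.items() if value > threshold]
    return sorted(flagged, key=lambda name: -response_ms[name])


# Hypothetical response times in milliseconds keyed by ALB listener ID.
samples = {"lsn-a": 20, "lsn-b": 25, "lsn-c": 30, "lsn-d": 245}
print(slow_targets(samples))
```

Because the average includes the outliers themselves, a single very slow listener raises the threshold. A median-based baseline is one alternative if that becomes a problem.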
GA dashboard design reference
The GA dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.
Panel name | Type | Metric | Axis | Description |
Total frontend IP rate | Time series | Inbound rate: Sum | Left | |
Outbound rate: Sum | Left | |||
Inbound drop rate: Sum | Right | Mark as red if > 100 | ||
Outbound drop rate: Sum | Right | Mark as red if > 100 | ||
Frontend IP bandwidth utilization | Time series | Inbound bandwidth utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 |
Outbound bandwidth utilization: Max, Min, Avg | Left | |||
Total frontend IP active connections | Time series | Active connections: Sum | Left | |
Total backend group rate | Time series | Inbound rate: Sum | Left | |
Outbound rate: Sum | Left | |||
Inbound drop rate: Sum | Right | Mark as red if > 100 | ||
Outbound drop rate: Sum | Right | Mark as red if > 100 | ||
Backend group bandwidth utilization | Time series | Inbound bandwidth utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 |
Outbound bandwidth utilization: Max, Min, Avg | Left | |||
Tunnel latency | Time series | Tunnel latency: Max, Min, Avg | Dynamic threshold marking | |
Top N by frontend inbound rate | Table | Inbound rate | ||
Top N by frontend outbound rate | Table | Outbound rate | ||
Top N by frontend inbound bandwidth utilization | Table | Inbound bandwidth utilization > 30 | Mark as red or yellow | |
Top N by frontend outbound bandwidth utilization | Table | Outbound bandwidth utilization > 30 | Mark as red or yellow | |
Top N by active connections | Table | Active connections | ||
Top N by backend group inbound bandwidth utilization | Table | Backend group inbound bandwidth utilization | ||
Top N by backend group outbound bandwidth utilization | Table | Backend group outbound bandwidth utilization | ||
Top N by tunnel latency | Table | Tunnel latency | Dynamic threshold |
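The "Mark as yellow if > 50 Mark as red if > 80" rule used for the bandwidth utilization panels reduces to two strict comparisons. The following Python sketch shows one way to express it. The `utilization_color` helper is hypothetical, and the default thresholds are simply the values from the table above.

```python
def utilization_color(percent: float,
                      yellow_above: float = 50.0,
                      red_above: float = 80.0) -> str:
    """Map a utilization percentage to a marker color.

    Comparisons are strict, matching the "> 50" and "> 80" wording,
    so exactly 50 or 80 does not trigger the next color.
    """
    if percent > red_above:
        return "red"
    if percent > yellow_above:
        return "yellow"
    return "none"


for value in (45, 65, 92):
    print(value, utilization_color(value))
```

Passing different thresholds adapts the same helper to other panels that use a yellow/red pair.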
NAT dashboard design reference
The NAT dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.
Panel name | Type | Metric | Axis | Description |
Total NAT connections | Time series | Active connections: Sum | Left | |
New connections: Sum | Left | |||
Dropped active connections: Sum | Right | Mark as yellow if > 0 Mark as red if > 100 | ||
Dropped new connections: Sum | Right | Mark as yellow if > 0 Mark as red if > 100 | ||
NAT connection utilization | Time series | Active connection utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 |
New connection utilization: Max, Min, Avg | Left | |||
Total rate | Time series | Public network inbound rate: Sum | Left | Mark as red if inbound-outbound rate difference > threshold |
Public network outbound rate: Sum | Left | |||
Private network inbound rate: Sum | Left | |||
Private network outbound rate: Sum | Left | |||
Top N instances by active connections | Table | Active connections | ||
Top N by new connections | Table | New connections | ||
Top N by dropped active connections | Table | Dropped active connections > 0 | Mark as yellow if > 0 Mark as red if > 100 | |
Top N by dropped new connections | Table | Dropped new connections > 0 | Mark as yellow if > 0 Mark as red if > 100 | |
Top N by active connection utilization | Table | Active connection utilization > 30 | Mark as yellow if > 50 Mark as red if > 80 | |
Top N by new connection utilization | Table | New connection utilization > 30 | Mark as yellow if > 50 Mark as red if > 80 | |
Top N by inbound rate | Table | Public network inbound rate | ||
Top N by outbound rate | Table | Public network outbound rate |
6.1.3 Global networking business dashboard
Express Connect - Physical port design reference
The physical port dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.
Panel name | Type | Metric | Axis | Description |
Total rate | Time series | Inbound rate to cloud: Sum | Left | |
Outbound rate from cloud: Sum | Left | |||
Port error packets | Time series | Port inbound error packets: Sum | Left | Mark as yellow or red |
Port outbound error packets: Sum | Left | Mark as yellow or red | ||
Number of disconnected leased lines | Time series | Port down: Count | Left | Mark as red |
Top N by inbound rate to cloud | Table | Inbound rate to cloud | ||
Top N by outbound rate from cloud | Table | Outbound rate from cloud | ||
Top N by port inbound error packets | Table | Port inbound error packets > 0 | Mark as red | |
Top N by port outbound error packets | Table | Port outbound error packets > 0 | Mark as red | |
Disconnected leased line instances | Table | Port down == 1 | Mark as red |
Express Connect - VBR dashboard design reference
The VBR dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.
Panel name | Type | Metric | Axis | Description |
Total rate | Time series | Inbound rate to cloud: Sum | Left | |
Outbound rate from cloud: Sum | Left | |||
Throttling packet loss: Sum | Right | Mark as red if > 100 | ||
Packet loss | Time series | Port inbound packet loss: Sum | Left | Mark as yellow or red |
Port outbound packet loss: Sum | Left | Mark as yellow or red | ||
Probe packet loss | Time series | Probe packet loss: Max, Min, Avg | Left | Mark as yellow if > 0 Mark as red if > 10 |
Probe latency | Time series | Probe latency: Max, Min, Avg | Left | Dynamic threshold |
Top N by inbound rate to cloud | Table | Inbound rate to cloud | ||
Top N by outbound rate from cloud | Table | Outbound rate from cloud | ||
Top N by throttling packet loss | Table | Throttling packet loss > 0 | Mark as red | |
Top N by port inbound packet loss | Table | Port inbound packet loss > 0 | Mark as red | |
Top N by port outbound packet loss | Table | Port outbound packet loss > 0 | Mark as red | |
Top N by probe packet loss | Table | Probe packet loss > 0 | Mark as yellow if > 0 Mark as red if > 10 | |
Top N by probe latency | Table | Probe latency |
ECR dashboard design reference
The ECR dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.
Panel name | Type | Metric | Axis | Description |
Total rate | Time series | Inbound rate: Sum | Left | |
Outbound rate: Sum | Left | |||
Total cross-region throttling packet loss rate | Time series | Throttling packet loss bit rate: Sum | Left | Mark as yellow or red |
Throttling packet loss message rate: Sum | Right | Mark as yellow or red | ||
Top N by inbound rate | Table | Inbound rate | ||
Top N by outbound rate | Table | Outbound rate | ||
Top N by cross-region rate | Table | Cross-region rate | ||
Top N by cross-region throttling | Table | Throttling packet loss > 0 | Mark as red |
VPN dashboard design reference
The VPN dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.
Panel name | Type | Metric | Axis | Description |
Total rate | Time series | VPN Gateway inbound rate to cloud: Sum | Left | |
IPsec-VPN connection inbound rate to cloud: Sum | Left | |||
VPN Gateway outbound rate from cloud: Sum | Right | |||
IPsec-VPN connection outbound rate from cloud: Sum | Right | |||
VPN Gateway utilization | Time series | Inbound bandwidth utilization to cloud: Max, Min, Avg | Left | Mark as yellow or red |
Outbound bandwidth utilization from cloud: Max, Min, Avg | Left | Mark as yellow or red | ||
Number of online SSL clients | Time series | Number of SSL clients: Sum | Left | |
Top N by inbound bandwidth utilization to cloud | Table | Inbound bandwidth utilization to cloud > 30 | Mark as yellow if > 50 Mark as red if > 80 | |
Top N by outbound bandwidth utilization from cloud | Table | Outbound bandwidth utilization from cloud > 30 | Mark as yellow if > 50 Mark as red if > 80 | |
Top N by VPN Gateway inbound rate to cloud | Table | VPN Gateway inbound rate to cloud | ||
Top N by VPN Gateway outbound rate from cloud | Table | VPN Gateway outbound rate from cloud | ||
Top N by IPsec-VPN connection inbound rate to cloud | Table | IPsec-VPN connection inbound rate to cloud | ||
Top N by IPsec-VPN connection outbound rate from cloud | Table | IPsec-VPN connection outbound rate from cloud |
TR dashboard design reference
The TR cross-region dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.
Panel name | Type | Metric | Axis | Description |
TR traffic | Time series | Inbound rate: Sum | Left | Mark as red if inbound-outbound rate difference > threshold |
Outbound rate: Sum | Left | |||
Blackhole drop rate: Sum | Right | |||
No-route drop rate: Sum | Right | |||
Attachment connection traffic | Time series | Inbound rate: Sum | Left | Mark as red if inbound-outbound rate difference > threshold |
Outbound rate: Sum | Left | |||
Blackhole drop rate: Sum | Left | |||
Top N by TR inbound traffic | Table | TR inbound rate | ||
Top N by TR outbound traffic | Table | TR outbound rate | ||
Top N by TR blackhole drops | Table | TR blackhole drop rate | ||
Top N by TR no-route drops | Table | TR no-route drop rate | ||
Top N by attachment connection inbound traffic | Table | Attachment connection inbound traffic | ||
Top N by attachment connection outbound traffic | Table | Attachment connection outbound traffic | ||
Top N by attachment connection drops | Table | Attachment connection blackhole drop rate |
CEN cross-region design reference
The CEN cross-region dashboard supports filtering instances by region, resource group, instance ID, or instance name. The instance display table supports hyperlinks to the instance monitoring page.
Panel name | Type | Metric | Axis | Description |
CEN traffic | Time series | Region outbound rate: Sum | Left | Mark as red if outbound rate difference > threshold |
Area outbound rate: Sum | Left | |||
Bandwidth plan average outbound rate: Sum | Left | Microburst tip:
| ||
Bandwidth plan peak outbound rate: Sum | Left | |||
Region throttling packet loss rate: Sum | Right | Mark as red if > 100 kbps | ||
CEN utilization | Time series | Region utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 |
Area utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 | ||
Bandwidth plan average utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 | ||
Bandwidth plan peak utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 | ||
CEN QoS traffic | Time series | QoS outbound rate: Sum | Left | |
QoS throttling packet loss rate: Sum | Right | Mark as red if > 100 kbps | ||
CEN QoS utilization | Time series | QoS average utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 |
QoS peak utilization: Max, Min, Avg | Left | Mark as yellow if > 50 Mark as red if > 80 | ||
Top N by region outbound rate | Table | Region outbound rate | ||
Top N by region utilization | Table | Region utilization | ||
Top N by region throttling packet loss rate | Table | Region throttling packet loss rate | ||
Top N by QoS outbound rate | Table | QoS outbound rate | ||
Top N by QoS peak utilization | Table | QoS peak utilization | ||
Top N by QoS throttling packet loss rate | Table | QoS throttling packet loss rate |
6.2 Monitoring configuration reference
6.2.1 Public network service monitoring configuration reference
For EIPs with self-built gateways providing public network service endpoints, refer to the following suggestions to configure alert rules in CloudMonitor for the public network endpoint EIPs:
Monitored object | Alert level | Monitoring metrics and conditions |
EIP | Info | When one of the following conditions occurs:
|
Warn | When one of the following conditions occurs:
| |
Critical | When one of the following conditions occurs:
| |
Internet Shared Bandwidth | Info | When one of the following conditions occurs:
|
Warn | When one of the following conditions occurs:
| |
Critical | When one of the following conditions occurs:
| |
CDT | Info | When one of the following conditions occurs:
|
Warn | When one of the following conditions occurs:
| |
Critical | When one of the following conditions occurs:
|
When the bandwidth load exceeds 30%, the system enters a high-load state. The business may experience slow access, occasional timeouts, and other SLA degradation. We recommend performing a capacity assessment and considering a scale-out.
When the bandwidth load exceeds 50%, in addition to the issues at the previous level, the multi-AZ disaster recovery architecture may fail. If a service interruption occurs in one AZ, the remaining AZs cannot handle the entire business load. We recommend an immediate scale-out.
When the bandwidth load exceeds 85%, in addition to the issues at the previous level, the system load seriously exceeds the designed capacity. Besides an immediate scale-out, you should also consider whether there are unexpected events such as business growth exceeding expectations or security attacks, and optimize the system design.
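The three bandwidth load thresholds above map naturally onto the Info, Warn, and Critical alert levels. The following Python sketch illustrates that mapping, assuming the load is computed as the current rate divided by the purchased bandwidth. The helper names and sample figures are hypothetical, and the mapping of 30%, 50%, and 85% to the three levels is an assumption based on the order of the descriptions.

```python
from typing import Optional


def load_percent(rate_bps: float, bandwidth_mbps: float) -> float:
    """Bandwidth load as a percentage of the purchased bandwidth."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    return rate_bps * 100 / (bandwidth_mbps * 1_000_000)


def bandwidth_alert_level(percent: float) -> Optional[str]:
    """Return the alert level for a bandwidth load, or None below 30%."""
    if percent > 85:
        return "Critical"  # load seriously exceeds the designed capacity
    if percent > 50:
        return "Warn"      # multi-AZ disaster recovery may fail
    if percent > 30:
        return "Info"      # high-load state, assess capacity
    return None


# Hypothetical sample: 45 Mbit/s on a 100 Mbit/s EIP.
level = bandwidth_alert_level(load_percent(45_000_000, 100))
print(level)
```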
6.2.2 Network element service monitoring configuration reference
CLB/NLB/ALB
For CLB, NLB, or ALB instances that provide public network service endpoints, configure monitoring for the public network endpoint as described in the previous section. In addition, refer to the following suggestions to configure alert rules in CloudMonitor for the CLB, NLB, or ALB instances:
Monitored object | Alert level | Monitoring metrics and conditions |
CLB | Info | When one of the following conditions occurs at the instance dimension:
When one of the following conditions occurs at the port dimension:
|
Warn | When one of the following conditions occurs at the instance dimension:
When one of the following conditions occurs at the port dimension:
| |
Critical | When one of the following conditions occurs at the instance dimension:
When one of the following conditions occurs at the port dimension:
| |
NLB | Info | When one of the following conditions occurs at the instance dimension:
When one of the following conditions occurs at the port dimension:
|
Warn | When one of the following conditions occurs at the instance dimension:
When one of the following conditions occurs at the port dimension:
| |
Critical | When one of the following conditions occurs at the instance dimension:
When one of the following conditions occurs at the port dimension:
| |
ALB | Info | When one of the following conditions occurs at the loadBalancer dimension:
When one of the following conditions occurs at the listener dimension:
|
Warn | When one of the following conditions occurs at the loadBalancer dimension:
When one of the following conditions occurs at the listener dimension:
| |
Critical | When one of the following conditions occurs at the loadBalancer dimension:
When one of the following conditions occurs at the listener dimension:
|
There are many application layer-related metrics, and they are closely related to the business. You should continuously optimize the relevant monitoring and the threshold configurations for each level based on actual business feedback.
6.2.3 Hybrid Disaster Recovery monitoring configuration reference
Leased line connection
If you use a leased line to connect to Alibaba Cloud, refer to the following suggestions to configure alert rules in CloudMonitor for the leased line:
Monitored object | Alert level | Monitoring metrics and conditions |
Express Connect - Physical port | Info | When one of the following conditions occurs:
|
Warn | When one of the following conditions occurs:
| |
Critical | When one of the following conditions occurs:
| |
Express Connect - Virtual Border Router | Info | When one of the following conditions occurs:
|
Warn | When one of the following conditions occurs:
| |
Critical | When one of the following conditions occurs:
| |
Express Connect - Express Connect Router | Info | When one of the following conditions occurs at the Transit Router (TR) instance monitoring dimension:
When one of the following conditions occurs at the cross-region connection dimension:
|
Warn | When one of the following conditions occurs at the Transit Router (TR) instance monitoring dimension:
When one of the following conditions occurs at the cross-region connection dimension:
| |
Critical | When one of the following conditions occurs at the Transit Router (TR) instance monitoring dimension:
When one of the following conditions occurs at the cross-region connection dimension:
| |
Express Connect - Peering connection | Info | When one of the following conditions occurs at the instance dimension:
|
Warn | When one of the following conditions occurs at the instance dimension:
| |
Critical | When one of the following conditions occurs at the instance dimension:
|
Subscribe to the following CloudMonitor system events and push alerts:
Product: Express Connect - Leased line connection. Event type: Down. Event name: BGP Peer status changed from Established to Down.
VPN Gateway
If you use a VPN Gateway to access Alibaba Cloud, refer to the following suggestions to configure alert rules in CloudMonitor for the VPN:
Monitored object | Alert level | Monitoring metrics and conditions |
VPN Gateway | Info | When one of the following conditions occurs at the instance dimension:
|
Warn | When one of the following conditions occurs at the instance dimension:
| |
Critical | When one of the following conditions occurs at the instance dimension:
|
Note: If you use the "IPsec connection attached to CEN/TR" method for networking, refer to the "CEN/TR global networking" section for monitoring methods.
Subscribe to the following CloudMonitor system events and push alerts:
Product: VPN Gateway. Event type: Abnormal, Status Notification. Event name: Certificate expired, All IPsec connection tunnels failed to negotiate, IPsec tunnel negotiation failed, health check failed, VPN connection health check failed.
CEN/TR global networking
If you use CEN/TR for global networking, refer to the following suggestions to configure alert rules in CloudMonitor for CEN/TR:
Monitored object | Alert level | Monitoring metrics and conditions |
Cloud Enterprise Network - Region monitoring | Info | When one of the following conditions occurs:
|
Warn | When one of the following conditions occurs:
| |
Critical | When one of the following conditions occurs:
| |
Cloud Enterprise Network - Area monitoring | Info | When one of the following conditions occurs:
|
Warn | When one of the following conditions occurs:
| |
Critical | When one of the following conditions occurs:
| |
Cloud Enterprise Network - Transit Router (configure when using Enterprise Edition) | Info | When one of the following conditions occurs at the Transit Router (TR) instance AZ-level monitoring dimension:
When one of the following conditions occurs at the Transit Router (TR) connection AZ-level monitoring dimension:
|
Warn | When one of the following conditions occurs at the Transit Router (TR) instance AZ-level monitoring dimension:
When one of the following conditions occurs at the Transit Router (TR) connection AZ-level monitoring dimension:
| |
Critical | When one of the following conditions occurs at the Transit Router (TR) instance AZ-level monitoring dimension:
When one of the following conditions occurs at the Transit Router (TR) connection AZ-level monitoring dimension:
|
Note:
For more information about TR connection bandwidth specifications, see Limits.
Subscribe to the following CloudMonitor system events and push alerts:
Product: Cloud Enterprise Network. Event: 90%QuotaExceeded. Event name: Event for exceeding 90% of quota.
When creating a VPN Attachment in TR, refer to the following suggestions to configure alert rules in CloudMonitor for the VPN connection:
Monitored object | Alert level | Monitoring metrics and conditions |
VPN connection | Info | When one of the following conditions occurs:
|
Warn | When one of the following conditions occurs:
| |
Critical | When one of the following conditions occurs:
| |
VPN gateway | Critical | When one of the following conditions occurs at the vpnconnection dimension:
|
Note:
For more information about VPN connection specifications, see Quotas and limits.
7. Hands-on guide
To integrate cloud service monitoring metrics into ARMS Prometheus, configure custom dashboards, and configure alerts, see Cloud Service Observability.
8. Appendix
Dashboard configuration method
1. Alibaba Cloud CloudMonitor Prometheus
Data ingestion: Go to Application Real-Time Monitoring Service (ARMS) > Integration Center, select the corresponding product (such as EIP or ALB), and then follow the prompts to complete the integration.
Custom dashboard: Go to Application Real-Time Monitoring Service (ARMS) > Provisioning > Cloud Service Integration Environment, select the corresponding product, and then refer to the reference cases in Section 6 to customize the dashboard.
2. Data ingestion for non-Alibaba Cloud CloudMonitor Prometheus (self-managed or third-party)
Deploy a lightweight Prometheus on an Alibaba Cloud ECS instance or in an ACK cluster, or run Prometheus in Agent mode.
Configure a plugin to collect Alibaba Cloud resource metrics (using an Exporter, API, or logs). For more information, see the open source aliyun_exporter plugin.
In the prometheus.yml file, configure `remote_write` to point to the `/api/v1/write` interface of your self-managed Prometheus.
Restart Prometheus. The collected data is then sent to your self-managed instance.
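As a rough illustration of the `remote_write` step, the following Python sketch renders the fragment that would be added to prometheus.yml. The `render_remote_write` helper and the host name are hypothetical. Only the `remote_write` key, its `url` field, and the `/api/v1/write` path come from Prometheus itself.

```python
def render_remote_write(base_url: str) -> str:
    """Render a minimal remote_write fragment for prometheus.yml.

    base_url is the address of the self-managed Prometheus that
    receives the data. The /api/v1/write path is appended to it.
    """
    url = base_url.rstrip("/") + "/api/v1/write"
    return f'remote_write:\n  - url: "{url}"\n'


# Hypothetical address of the self-managed Prometheus.
print(render_remote_write("http://prometheus.example.internal:9090/"))
```

Depending on the version, the receiving Prometheus may need its remote write receiver enabled before it accepts data on this endpoint, so check its startup options.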
3. Data ingestion for other monitoring platforms
You need to develop your own data ingestion solution. You can use the collection plugin for non-Alibaba Cloud CloudMonitor Prometheus as a reference.
Alert configuration method
1. Alibaba Cloud CloudMonitor Prometheus
Go to Application Real-Time Monitoring Service (ARMS) > Alert Rules > Create Alert Rule. Make sure to select the region where the Prometheus instance is located.
Configure, in order, the rule name, Prometheus instance, custom PromQL, severity level, and alert threshold.
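The custom PromQL for a dashboard threshold such as "Mark as red if > 100" can be as simple as an aggregation followed by a comparison. The following Python sketch builds such an expression. It assumes the metric is exposed in Prometheus under the name listed in Section 6, which you should verify in your own instance. The `threshold_expr` helper is hypothetical.

```python
ALLOWED_AGGREGATIONS = {"sum", "avg", "max", "min"}


def threshold_expr(metric: str, threshold: float, aggregation: str = "sum") -> str:
    """Build a PromQL expression that fires when the aggregate exceeds threshold."""
    if aggregation not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {aggregation!r}")
    # The ":g" format drops a trailing ".0", so 100.0 is rendered as 100.
    return f"{aggregation}({metric}) > {threshold:g}"


print(threshold_expr("AliyunNlb_DropTrafficRX", 100))
```

A separate expression per severity level, each with its own threshold, keeps the Info, Warn, and Critical rules independent of one another.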
2. Alibaba Cloud CloudMonitor
Go to Alibaba Cloud CloudMonitor > Alert Service > Alert Rules > Create Alert Rule.
Select a product.
Create the rule and define the conditions for the Critical, Warn, and Info levels.
Suggestions
Organize your team to conduct regular network inspections (monthly or even weekly), and follow a clear optimization plan until all identified risks are eliminated.
Suggestions
The value of a tool lies in its practical application, but you must prepare in advance. Only through continual learning and practice can you ensure that the tool is effective when issues arise.