The Network Intelligence Service (NIS) event center provides proactive alerts to help you identify risks, view potentially affected resources, and prevent business interruptions.
Use cases
Alibaba Cloud defines events in NIS to record and send notifications about your cloud network resources, such as O&M task execution status, resource issues, and resource status changes.
Notify you of risks and issues
If an event impairs the availability or performance of an instance, Alibaba Cloud pushes the event to the event center in the NIS console. Examples include performance degradation from usage beyond specifications, service unavailability from packet loss on an ISP link, and an alert for an expiring instance. Respond to these events promptly to prevent business interruptions.
Enable automated O&M
Each event displayed in the NIS console has a defined status, which helps you track the execution of related system O&M tasks. When an event is generated or its status changes, the system reports it to CloudMonitor. This allows you to build an event-driven, automated O&M system based on your business needs.
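As a minimal sketch of such an event-driven setup, the following Python code routes an incoming event to a handler based on its event code and status. The payload field names (code, status, instance_id), the handler functions, and the returned action strings are hypothetical and stand in for whatever your CloudMonitor subscription actually delivers. Only the event codes and the In Progress status come from this topic.

```python
from typing import Callable, Dict

# Hypothetical event payload: a plain dict with "code", "status", and "instance_id".
Event = Dict[str, str]
Handler = Callable[[Event], str]


def scale_up_bandwidth(event: Event) -> str:
    # Placeholder action: a real handler would call your own tooling here.
    return f"request bandwidth upgrade for {event['instance_id']}"


def fail_over_nat(event: Event) -> str:
    # Placeholder action: switch to a standby NAT gateway if you run more than one.
    return f"switch traffic away from {event['instance_id']}"


# Map event codes (taken from the Event summary) to hypothetical handlers.
HANDLERS: Dict[str, Handler] = {
    "problem-internetBandwidthOverlimit": scale_up_bandwidth,
    "problem-nat-datapathUnavailable": fail_over_nat,
}


def dispatch(event: Event) -> str:
    """Route an event to its handler; skip events that are not in progress."""
    if event.get("status") != "In Progress":
        return "no action"
    handler = HANDLERS.get(event.get("code", ""))
    if handler is None:
        # Unknown event codes fall back to a human.
        return "notify on-call"
    return handler(event)
```

Keeping the code-to-handler mapping in one dictionary makes it easy to add a handler when you decide to automate the response to another event code.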
Limitations
NIS does not support the event feature for instance families that are no longer available for purchase. For more information, see the end-of-sale announcements for each cloud service.
Basics
Event types
Events are defined by Alibaba Cloud to record information and send notifications about cloud network resources. Based on their cause, events are categorized as follows:
Category | Description | Example |
issue event | An exceptional event that has already caused business impact and has remained in the In Progress state for seven days. |
|
risk event | An exceptional event that may cause business impact and has remained in the In Progress state for seven days. |
|
Event levels
Events are classified into the following levels based on their impact on the normal operation of an instance:
Critical: A significant impact that requires immediate action to prevent the instance from becoming unavailable.
Warn: A moderate impact. Monitor the event while it persists, or handle it at an appropriate time.
Info: An informational event that does not require immediate action.
For more information about event codes, names, descriptions, and recommended actions, see Event summary.
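If you process events programmatically, the three levels can be modeled as an ordered type so that the most urgent events are handled first. The following Python sketch is illustrative only: the numeric ranks and the level field name are assumptions, not part of NIS.

```python
from enum import IntEnum
from typing import Dict, List


class EventLevel(IntEnum):
    # Numeric ranks are an assumption used only for ordering.
    INFO = 0
    WARN = 1
    CRITICAL = 2


def parse_level(name: str) -> EventLevel:
    """Convert a level name such as "Warn" into an EventLevel member."""
    return EventLevel[name.strip().upper()]


def sort_by_urgency(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return events ordered from Critical to Info."""
    return sorted(events, key=lambda e: parse_level(e["level"]), reverse=True)
```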
Event summary
This section summarizes the events supported by NIS and provides recommended actions for each event.
Issue event monitoring is not supported for shared-resource CLB instances.
Issue events
Event code | Event name | Event level | CloudMonitor event name | CloudMonitor metric name | Description and impact | Alert rules | Recommended action |
Internet-facing instance | |||||||
problem-internetBandwidthOverlimit | Packet loss due to excess bandwidth usage | Critical | Packet loss due to excess instance bandwidth usage |
| The actual bandwidth usage of an Internet-facing instance has exceeded its specification, causing packet loss. Internet-facing instances include elastic IP address (EIP) instances, bandwidth plans, and Classic Load Balancer (CLB) instances. | Critical: Bandwidth usage frequently exceeds the limit over the last 10 minutes, causing packet loss. | Upgrade the instance to increase the peak bandwidth. |
NAT gateway | |||||||
problem-nat-sessionOverLimit | Connection drop caused by excess NAT sessions | Critical | Connection drop caused by excess NAT sessions |
| The number of sessions on the NAT gateway exceeds its specification, which causes new sessions to fail and results in a packet loss rate of more than 100 packets/s. | Critical: The number of concurrent sessions frequently exceeds the limit over the last 10 minutes, and the packet loss rate is greater than 100 packets/s. | Upgrade the specification or use multiple NAT gateway instances. For more information, see Manage NAT Gateway quotas, Internet NAT gateway, and Create and manage a VPC NAT Gateway instance. |
problem-nat-sessionNewOverLimit | Connection drop caused by excess new NAT sessions | Critical | Connection drop caused by excess new NAT sessions |
| The rate of new sessions on the NAT gateway exceeds its specification, which causes new sessions to fail and results in a packet loss rate of more than 100 packets/s. | Critical: The number of new sessions frequently exceeds the limit over the last 10 minutes, and the packet loss rate is greater than 100 packets/s. | |
problem-nat-portAllocationError | Allocation failure of SNAT source ports | Critical | Allocation failure of SNAT source ports |
| Too few EIPs or IP addresses are configured for the NAT gateway instance, which causes source port allocation to fail and results in a packet loss rate of more than 10 packets/s. Note: You cannot create a subscription for this event. | Critical: Source port allocation frequently fails over the last 10 minutes, and the packet loss rate is greater than 10 packets/s. | Add more EIPs or IP addresses that are associated with the NAT gateway instance. For more information, see Create and manage a VPC NAT Gateway instance. |
problem-nat-datapathUnavailable | NAT gateway data path unavailable | Critical | NAT gateway data path unavailable | Not applicable (system availability event) | The data path of a NAT gateway is unavailable. The availability of your NAT gateway was 0% in the past 10 minutes, which means all traffic is affected and your NAT gateway is not working. This may be due to a platform-level event. Alibaba Cloud engineers are working to resolve the issue. | Critical: The availability of the NAT gateway was 0% in the last 10 minutes. | If you have deployed multiple NAT gateways for high availability, we recommend that you switch to another NAT gateway. For more information, see Deploy multiple NAT gateways to implement high availability. Otherwise, contact Alibaba Cloud engineers to get the latest recovery progress. |
problem-nat-datapathDegraded | NAT gateway data path degraded | Critical | NAT gateway data path degraded | Not applicable (system availability event) | The data path of a NAT gateway is degraded. The availability of your NAT gateway was below 80% in the past 10 minutes, which means more than 20% of traffic is affected and your NAT gateway is not working properly. This may be due to a platform-level event causing packet drops. Alibaba Cloud engineers are working to resolve the issue. | Critical: The availability of the NAT gateway was less than 80% in the last 10 minutes, causing packet loss. | |
Classic Load Balancer (CLB) | |||||||
problem-clb-connectionOverLimit | Dropped new connections due to excess CLB sessions | Critical | Dropped new connections due to excess CLB sessions |
| The number of new or concurrent connections on a CLB instance exceeds its specification, causing new sessions to fail and a high rate of dropped connections. | Critical: The number of concurrent sessions frequently exceeds the limit over the last 10 minutes, causing packet loss. | Upgrade the instance or switch to a Network Load Balancer (NLB) or Application Load Balancer (ALB) instance. For more information, see Manage CLB quotas. For product details about NLB and ALB, see What is Network Load Balancer (NLB)? and What is Application Load Balancer (ALB)? |
problem-clb-bandwidthOverLimit | Packet loss due to excess CLB bandwidth usage | Critical | Packet loss due to excess CLB bandwidth usage |
| The actual traffic of a CLB instance exceeds its bandwidth specification, causing packet loss. | Critical: Bandwidth usage frequently exceeds the specification over the last 10 minutes, and the drop rate is greater than 100 bps. | Upgrade the instance specification. For more information, see Adjust the specifications of performance-guaranteed instances. |
problem-clb-connectionFail | Sharp increase in failed CLB connections | Critical | Sharp increase in failed CLB connections | Not supported by CloudMonitor | The number of failed connections on the CLB instance has sharply increased. This may be because the backend server specification is exceeded, the load is too high, or there is a service exception. | Critical: The number of failed new connections for the CLB instance has sharply increased over the last 10 minutes. An alert is triggered if all of the following conditions are met: Condition 1: The number of failed connections is greater than 100/s. Condition 2: The number of failed connections increases by 30% compared with the previous 10-minute window. Condition 3: Based on an intelligent baseline learned from historical data, the number of failed connections continuously exceeds the upper baseline limit by more than 30% within a 10-minute period. | Depending on the cause, upgrade the backend server specification, upgrade the CLB specification, or check the backend service status. For more information, see Manage CLB quotas and Diagnose a CLB instance. |
NLB | |||||||
problem-nlb-connectionFail | Sharp increase in failed NLB connections | Critical | Sharp increase in failed NLB connections | Not supported by CloudMonitor | The number of failed connections on a virtual IP address (VIP) of the NLB instance has sharply increased for 10 consecutive minutes. Possible causes include:
| Critical: An alert is triggered if the number of failed connections on the NLB instance meets all of the following conditions: Condition 1: Within a 610-second monitoring window, the number of failed connections exceeds the intelligent forecast baseline by more than 100% for 3 consecutive minutes. Condition 2: Within a 610-second monitoring window, the number of failed connections increases by 50% or more compared with the previous hour for 7 consecutive minutes. Condition 3: Within a 610-second monitoring window, the number of failed connections is 1,000 or more for 8 consecutive minutes. | Check if the backend server resources or service status are normal. For more information, see Diagnose an NLB instance. |
problem-nlb-newConnectionSurge | Dropped new NLB connections | Critical | Dropped new NLB connections |
| Due to a surge in new connections, the VIP of the NLB instance continuously drops new connection requests at millisecond or second intervals. | Critical: An alert is triggered if the number of connections on the NLB instance meets all of the following conditions: Condition 1: Within 10 minutes, there are more than 8 data points where the number of connections dropped by the VIP per second is greater than 0. Condition 2: Within 10 minutes, there are more than 8 data points where the number of new connections established by the VIP per second is less than 200,000. | Distribute traffic across multiple NLB instances or contact your account manager to apply for a quota increase. |
problem-nlb-newConnectionOverLimit | Excess new NLB connections | Critical | Excess new NLB connections |
| The number of new connections on the VIP of the NLB instance has exceeded the automatic scaling limit for a single VIP, causing new connection requests to be continuously dropped. | Critical: An alert is triggered if the number of connections on the NLB instance meets all of the following conditions: Condition 1: Within 10 minutes, there are more than 8 data points where the number of connections dropped by the VIP per second is greater than 0. Condition 2: Within 10 minutes, there are more than 8 data points where the number of new connections established by the VIP per second is 200,000 or more. | |
problem-nlb-concurrentConnectionOverLimit | Excess concurrent NLB connections | Critical | Excess concurrent NLB connections |
| The number of concurrent connections on the VIP of the NLB instance has exceeded the automatic scaling limit for a single VIP, causing new connection requests to be continuously dropped. | Critical: An alert is triggered if the number of connections on the NLB instance meets all of the following conditions: Condition 1: Within 10 minutes, there are more than 8 data points where the number of connections dropped by the VIP per second is greater than 0. Condition 2: Within 10 minutes, there are more than 8 data points where the maximum number of concurrent connections on the VIP is greater than 5,000,000. | |
ALB | |||||||
problem-alb-intranetBandwidthOverLimit | Packet loss due to excess private bandwidth usage of ALB instances | Critical | Packet loss due to excess private bandwidth usage of ALB instances | Not supported by CloudMonitor | The outbound or inbound bandwidth on the VIP address of the ALB instance has reached its limit. A single VIP resolved from an ALB domain name has a bandwidth limit. | Critical: Within 10 minutes, there are more than 8 data points where the traffic dropped by the ALB instance is greater than 100 bps. | Add a CNAME record for the ALB instance. For more information, see Add a CNAME record for an ALB instance. |
problem-alb-sessionOverLimit | Dropped new connections due to excess ALB sessions | Critical | Dropped new connections due to excess ALB sessions |
| The number of new or concurrent connections on the VIP address of the ALB instance exceeds the limit, causing new sessions to fail. A single VIP resolved from an ALB domain name has a new connection limit. | Critical: Within 10 minutes, there are more than 8 data points where the number of connections dropped by the ALB instance per second is greater than 0. | |
problem-alb-qpsOverLimit | 503 error code returned because QPS exceeds the limit | Critical | 503 error code returned because QPS exceeds the limit | Not supported by CloudMonitor | The number of queries per second (QPS) on the VIP address of the ALB instance has reached the VIP limit. A single VIP resolved from an ALB domain name has a QPS limit. | Critical: Within 10 minutes, there are more than 8 data points where the number of requests dropped per second is greater than 200 qps, and for 10 consecutive minutes, the number of dropped requests per second increases by 30% or more compared with 7 minutes earlier. | |
Cloud Enterprise Network (CEN) | |||||||
problem-cen-routeOverLimit | Excess CEN routes | Critical | Excess CEN routes | Not applicable (event-based metric) | The CEN route quota is exceeded, which may cause network issues. | Critical: The CEN route quota is exceeded, causing network issues. | Upgrade the Transit Router (TR). For more information, see Upgrade a basic transit router to an enterprise transit router. |
TR | |||||||
problem-cen-vpcAttachBandwidthOverLimit | Packet loss due to excess VPC connection bandwidth | Critical | Packet loss due to excess VPC connection bandwidth | Not supported by CloudMonitor | The actual traffic of a CEN transit router exceeds the bandwidth specification, causing packet loss. | Critical: Within 10 minutes, there are more than 5 data points where the inbound packet loss rate is greater than 0. | Increase the bandwidth limit. For more information, see Manage CEN quotas. |
problem-cen-peerAttachBandwidthOverLimit | Packet loss due to excess inter-region connection bandwidth | Critical | Packet loss due to excess inter-region connection bandwidth |
| The actual traffic of a CEN transit router exceeds the bandwidth specification, causing packet loss. | Critical: An alert is triggered if the actual traffic of the TR instance meets all of the following conditions: Condition 1: Within 10 minutes, there are more than 8 data points where the peak outbound bandwidth utilization is 90% or higher. Condition 2: Within 10 minutes, there are more than 8 data points where the outbound rate-limiting packet drop rate is greater than 100 pps. | Increase the bandwidth limit. For more information, see Manage CEN quotas. |
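Many of the alert rules in the preceding table share one pattern: within a 10-minute window, more than a given number of data points must satisfy a threshold. The following Python sketch illustrates that pattern with the two NLB new-connection events, which share Condition 1 and differ only in whether the new connection rate is below 200,000 per second or at least 200,000 per second. NIS evaluates these rules itself. The function names and the way samples are passed in are hypothetical, and the number of data points in a 10-minute window is not stated in this topic, so the caller supplies whatever points the window contains.

```python
from typing import Callable, Optional, Sequence


def points_matching(values: Sequence[float], predicate: Callable[[float], bool]) -> int:
    """Count the data points in the window that satisfy the predicate."""
    return sum(1 for v in values if predicate(v))


def classify_nlb_new_connection_event(
    dropped_per_s: Sequence[float],
    new_conn_per_s: Sequence[float],
    min_points: int = 8,
    limit: int = 200_000,
) -> Optional[str]:
    """Return the matching event code for one 10-minute window, or None."""
    # Condition 1 (both events): more than 8 points with dropped connections > 0.
    if points_matching(dropped_per_s, lambda v: v > 0) <= min_points:
        return None
    # Condition 2 for the over-limit event: more than 8 points at or above the limit.
    if points_matching(new_conn_per_s, lambda v: v >= limit) > min_points:
        return "problem-nlb-newConnectionOverLimit"
    # Condition 2 for the surge event: more than 8 points below the limit.
    if points_matching(new_conn_per_s, lambda v: v < limit) > min_points:
        return "problem-nlb-newConnectionSurge"
    return None
```

Because the rules say "more than 8 data points", exactly 8 matching points do not trigger an alert in this sketch.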
Risk events
Event code | Event name | Event level | CloudMonitor event name | CloudMonitor metric name | Description and impact | Alert rules | Recommended action |
Internet-facing instance | |||||||
risk-internetPacketLoss | Risk of Internet link packet loss | Warn | Risk of Internet link packet loss | Not applicable (Internet link probe event) | Probing has detected packet loss on the physical link from the Alibaba Cloud {Region} to {Country} - {Area} - {ISP}. Traffic on this link in your account may experience jitter. | Critical: An alert is triggered if either of the following conditions is met: Condition 1: The detected packet loss rate of a regional-level ISP link is greater than 50%. Condition 2: Packet loss is detected on a national-level ISP link, and the average bandwidth of the traffic on this link within your account is 0.05 Mbps or higher in the last 10 minutes. Note
Warn: The Internet link packet loss rate is less than 50%, and the average bandwidth is greater than 0.5 Mbps in the last 10 minutes. | Check whether the instance bandwidth on this link meets your business requirements (you can refer to the 5-tuple data in traffic analysis). If there is an issue, consider migrating critical services to another region. If not, you can ignore this alert. |
risk-internetBandwidthOverlimit | Packet loss risk due to excess bandwidth usage | Warn | Packet loss risk due to excess bandwidth usage |
| Historical data indicates a greater than 90% probability that the instance's bandwidth usage will exceed its specification. | Warn: There is a greater than 90% probability that traffic will exceed the specification at a certain time, causing packet loss. | Monitor the usage. If the specification is exceeded, consider upgrading the instance specification. |
VPN Gateway | |||||||
risk-vpn-bpsOverLimit | Risk of excess VPN bandwidth usage | Warn | Risk of excess VPN bandwidth usage |
| The bandwidth utilization of the VPN instance has exceeded 90% three times in the last 10 minutes. | Warn: Within 10 minutes, there are more than 3 data points where the bandwidth utilization is greater than 90%. |
|
risk-vpn-bgpRouteLimit | Risk of excess BGP routes | Warn | Risk of excess BGP routes | Not supported by CloudMonitor | The number of BGP dynamic routes learned by the VPN instance has exceeded 90% of the instance's BGP route quota in the last 10 minutes. | Warn: Within 10 minutes, there is more than 1 data point where the route utilization is greater than 90%. | Monitor BGP route usage. If the quota is nearly full, consider route aggregation on the peer VPN Gateway based on your network plan. |
Express Connect | |||||||
risk-ec-physicalConnectionFail | Express Connect circuit or port failure | Warn | Express Connect circuit or port failure | Not applicable (link status event) | A failure in the ISP's physical Express Connect circuit or a device port failure causes a service interruption. | Warn: The inbound traffic rate (from the data center to the VPC) of the VBR instance is monitored at a minute-level granularity. An alert is triggered if all of the following conditions are met: Condition 1: 3 ≤ number of Express Connect port down events < 20. Condition 2: The Express Connect port is down for more than 2 consecutive data points. Condition 3: Not all Express Connect ports are in a down state. | Contact your business manager for assistance. |
risk-ec-bgpRouterFail | BGP connection failure | Warn | BGP connection failure | Not applicable (BGP connection status event) | A network connectivity failure on the physical Express Connect circuit or an abnormal BGP configuration causes a BGP connection failure and route loss. | Warn: An alert is triggered if the BGP connection status changes from Connected to any other state. | Contact your business manager for assistance. |
risk-ec-inTrafficDroppedToZero | Sharp drop in inbound VBR traffic | Warn | Sharp drop in inbound VBR traffic |
| A failure in the ISP's physical Express Connect circuit or a device port failure causes the inbound traffic of the virtual border router (VBR) to drop sharply. | Warn: The inbound traffic rate (from the data center to the VPC) of the VBR instance is monitored at a minute-level granularity. An alert is triggered if all of the following conditions are met: Condition 1: For 3 consecutive minutes, the rate per minute drops by ≥ 99% compared to the average rate of the previous 7 minutes. Condition 2: For 3 consecutive minutes, the absolute value of the rate drop per minute is ≥ 1 Mbps compared to the average rate of the previous 7 minutes. Condition 3: For 3 consecutive minutes, the absolute value of the rate drop per minute is ≥ 0.5 Mbps compared to the average rates of the previous 15, 30, and 60 minutes. Condition 4 (Intelligent baseline alert): An intelligent baseline learns the historical patterns of the VBR instance's inbound traffic rate to predict a stable range for the next cycle. If the rate drops below the predicted lower bound by ≥ 99% for 2 consecutive minutes within a 3-minute period when the cycle begins, it is considered an abnormal drop. | Check whether this is normal business traffic behavior or if a health check switchover has occurred. If your business is impacted, contact your business manager for assistance. |
risk-ec-outTrafficDroppedToZero | Sharp drop in outbound VBR traffic | Warn | Sharp drop in outbound VBR traffic |
| A failure in the ISP's physical Express Connect circuit or a device port failure causes the outbound traffic of the VBR to drop sharply. | Warn: The outbound traffic rate (from the VPC to the data center) of the VBR instance is monitored at a minute-level granularity. An alert is triggered if all of the following conditions are met: Condition 1: For 3 consecutive minutes, the rate per minute drops by ≥ 99% compared to the average rate of the previous 7 minutes. Condition 2: For 3 consecutive minutes, the absolute value of the rate drop per minute is ≥ 1 Mbps compared to the average rate of the previous 7 minutes. Condition 3: For 3 consecutive minutes, the absolute value of the rate drop per minute is ≥ 0.5 Mbps compared to the average rates of the previous 15, 30, and 60 minutes. Condition 4 (Intelligent baseline alert): An intelligent baseline learns the historical patterns of the VBR instance's outbound traffic rate to predict a stable range for the next cycle. If the rate drops below the predicted lower bound by ≥ 99% for 2 consecutive minutes within a 3-minute period when the cycle begins, it is considered an abnormal drop. | Check whether this is normal business traffic behavior or if a health check switchover has occurred. If your business is impacted, contact your business manager for assistance. |
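The first two conditions of the VBR traffic drop rules in the preceding table combine a relative drop (at least 99%) with an absolute drop (at least 1 Mbps), both measured against the average of the previous 7 minutes. The following Python sketch shows one way to read those two conditions. It assumes that the 7-minute baseline is the period immediately before the 3-minute window, which this topic does not state explicitly, and it omits Condition 3 and the intelligent baseline of Condition 4. The function name and parameters are hypothetical.

```python
from statistics import mean
from typing import Sequence


def sharp_drop(
    rates_mbps: Sequence[float],
    recent: int = 3,
    baseline: int = 7,
    min_ratio: float = 0.99,
    min_abs_mbps: float = 1.0,
) -> bool:
    """Check Conditions 1 and 2 on per-minute rates ordered oldest first."""
    if len(rates_mbps) < recent + baseline:
        return False  # Not enough history to evaluate the rule.
    # Assumed baseline: average of the 7 minutes right before the 3-minute window.
    base = mean(rates_mbps[-(recent + baseline):-recent])
    if base <= 0:
        return False  # No traffic to drop from.
    for rate in rates_mbps[-recent:]:
        drop = base - rate
        # Condition 1: relative drop of at least 99%.
        # Condition 2: absolute drop of at least 1 Mbps.
        if drop / base < min_ratio or drop < min_abs_mbps:
            return False
    return True
```

The absolute threshold keeps a low-traffic VBR from raising an alert: a fall from 0.5 Mbps to 0 Mbps is a 100% drop, but it does not meet the 1 Mbps requirement.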
Related operations
Actions | Description and references |
View events | You can view events in the following ways:
|
Subscribe to events | Subscribe to events in CloudMonitor to receive notifications about new events and status updates by phone, text message, or email. For more information, see Configure NIS event subscriptions. |
Handle events | After you view an event, you can resolve the issue based on the provided recommendations. For more information, see Event summary. |