Event center


The Network Intelligence Service (NIS) event center provides proactive alerts to help you identify risks, view potentially affected resources, and prevent business interruptions.

Use cases

Alibaba Cloud defines events in NIS to record and send notifications about your cloud network resources, such as O&M task execution status, resource issues, and resource status changes.

  • Notify you of risks and issues

    Alibaba Cloud pushes an event to the event center in the NIS console when something impairs the availability or performance of an instance. Examples include performance degradation from usage beyond specifications, service unavailability from packet loss on an ISP link, and an instance that is about to expire. Respond to these events promptly to prevent business interruptions caused by impaired resource availability or performance.

  • Enable automated O&M

    Each event displayed in the NIS console has a defined status, which helps you track the execution of related system O&M tasks. When an event is generated or its status changes, the system reports it to CloudMonitor. This allows you to build an event-driven, automated O&M system based on your business needs.
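The event-driven pattern described above can be sketched as a small dispatcher. This is a minimal illustration, not the CloudMonitor API: the event dictionary shape (`code`, `status`, `resource_id`), the handler names, and the returned action strings are assumptions made for this example. Only the event codes come from the Event summary in this topic.

```python
from typing import Callable, Dict

# Hypothetical event shape: {"code": str, "status": str, "resource_id": str}.
Event = Dict[str, str]
Handler = Callable[[Event], str]

# Registry that maps an NIS event code to the function that handles it.
HANDLERS: Dict[str, Handler] = {}


def on_event(code: str) -> Callable[[Handler], Handler]:
    """Register a handler for one NIS event code."""
    def register(func: Handler) -> Handler:
        HANDLERS[code] = func
        return func
    return register


@on_event("problem-nat-sessionOverLimit")
def handle_nat_sessions(event: Event) -> str:
    # Placeholder action: a real handler might open a ticket or trigger scaling.
    return f"review NAT gateway capacity for {event['resource_id']}"


@on_event("problem-internetBandwidthOverlimit")
def handle_bandwidth(event: Event) -> str:
    # Placeholder action for the bandwidth over-limit event.
    return f"review peak bandwidth for {event['resource_id']}"


def dispatch(event: Event) -> str:
    """Route an event to its handler; unknown codes get a fallback message."""
    handler = HANDLERS.get(event["code"])
    if handler is None:
        return f"no handler for {event['code']}"
    return handler(event)
```

In this sketch, a consumer of the reported events would call `dispatch` for each incoming event, and adding support for a new event code only requires registering another handler.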

Limitations

NIS does not support the event feature for instance families that are no longer available for purchase. For more information, see the end-of-sale announcements for each cloud service.

Basics

Event types

Events are defined by Alibaba Cloud to record information and send notifications about cloud network resources. Based on their cause, events are categorized as follows:

Category

Description

Example

issue event

An exceptional event that has already caused business impact and has remained in the In Progress state for seven days.

  • Packet loss due to excess bandwidth usage

  • Instance shutdown due to overdue payments

risk event

An exceptional event that may cause business impact and has remained in the In Progress state for seven days.

  • Risk of business impact due to packet loss on a physical link

  • Risk of failure due to sudden spikes or drops in bandwidth usage

  • Risk of instance shutdown due to overdue payments

Event levels

Events are classified into the following levels based on their impact on the normal operation of an instance:

  • Critical: A significant impact that requires immediate action to prevent the instance from becoming unavailable.

  • Warn: Moderate impact. You should monitor the event while it persists or handle it at an appropriate time.

  • Info: These events do not require immediate action.

Note

For more information about event codes, names, descriptions, and recommended actions, see Event summary.
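As an illustration of how the three levels might drive triage, the following sketch orders a list of events so that the most urgent ones come first. The numeric values assigned to each level and the `(event code, level)` tuple shape are assumptions for this example; the level names are the ones defined above.

```python
from enum import IntEnum
from typing import List, Tuple


class Level(IntEnum):
    """Event levels from this topic, ordered by urgency (higher is more urgent)."""
    INFO = 1
    WARN = 2
    CRITICAL = 3


def triage(events: List[Tuple[str, Level]]) -> List[Tuple[str, Level]]:
    """Return events sorted so that the most urgent level comes first."""
    return sorted(events, key=lambda item: item[1], reverse=True)
```

For example, passing a Warn-level `risk-vpn-bpsOverLimit` event and a Critical-level `problem-clb-bandwidthOverLimit` event to `triage` places the CLB event first.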

Event summary

This section summarizes the events supported by NIS and provides recommended actions for each event.

Note

Issue events do not support monitoring for shared-resource CLB instances.

Issue events

Event code

Event name

Event level

CloudMonitor event name

CloudMonitor metric name

Description and impact

Alert rules

Recommended action

Internet-facing instance

problem-internetBandwidthOverlimit

Packet loss due to excess bandwidth usage

Critical

Packet loss due to excess instance bandwidth usage

net_out.rate_percentage (outbound bandwidth utilization), out_ratelimit_drop_speed (outbound rate-limiting packet drop rate), net_tx.rate (outbound bandwidth)

The actual bandwidth usage of an Internet-facing instance has exceeded its specification, causing packet loss.

Internet-facing instances include elastic IP address (EIP) instances, bandwidth plans, and Classic Load Balancer (CLB) instances.

Critical: Bandwidth usage frequently exceeds the limit over the last 10 minutes, causing packet loss.

Upgrade the instance to increase the peak bandwidth.

NAT gateway

problem-nat-sessionOverLimit

Connection drop caused by excess NAT sessions

Critical

Connection drop caused by excess NAT sessions

EniSessionLimitDropConnection (interface concurrent connection drop rate), EniSessionActiveConnection (interface concurrent connections)

The number of sessions on the NAT gateway exceeds its specification, causing new sessions to fail and packets to be dropped at a rate of more than 100 packets/s.

Critical: The number of concurrent sessions has frequently exceeded the limit in the last 10 minutes, and the packet loss rate is greater than 100 packets/s.

Upgrade the specification or use multiple NAT gateway instances. For more information, see Manage NAT Gateway quotas, Internet NAT gateway, and Create and manage a VPC NAT Gateway instance.

problem-nat-sessionNewOverLimit

Connection drop caused by excess new NAT sessions

Critical

Connection drop caused by excess new NAT sessions

EniSessionNewLimitDropConnection (interface new connection drop rate), EniSessionNewConnection (interface new connection rate)

The rate of new sessions on the NAT gateway exceeds its specification, causing new sessions to fail and packets to be dropped at a rate of more than 100 packets/s.

Critical: The number of new sessions has frequently exceeded the limit in the last 10 minutes, and the packet loss rate is greater than 100 packets/s.

problem-nat-portAllocationError

Allocation failure of SNAT source ports

Critical

Allocation failure of SNAT source ports

ErrorPortAllocationRate (rate of port allocation failures)

Too few EIPs or IP addresses are configured for the NAT gateway instance, causing source port allocation to fail and packets to be dropped at a rate of more than 10 packets/s.

Note

You cannot create a subscription for this event.

Critical: Source port allocation frequently fails over the last 10 minutes, and the packet loss rate is greater than 10 packets/s.

Add more EIPs or IP addresses that are associated with the NAT gateway instance. For more information, see Create and manage a VPC NAT Gateway instance.

problem-nat-datapathUnavailable

NAT gateway data path unavailable

Critical

NAT gateway data path unavailable

Not applicable (system availability event)

The data path of a NAT gateway is unavailable. The availability of your NAT gateway was 0% in the past 10 minutes, which means all traffic is affected and your NAT gateway is not working. This may be due to a platform-level event. Alibaba Cloud engineers are working to resolve the issue.

Critical: The availability of the NAT gateway was 0% in the last 10 minutes.

If you have deployed multiple NAT gateways for high availability, we recommend that you switch to another NAT gateway. For more information, see Deploy multiple NAT gateways to implement high availability. Otherwise, contact Alibaba Cloud engineers to get the latest recovery progress.

problem-nat-datapathDegraded

NAT gateway data path degraded

Critical

NAT gateway data path degraded

Not applicable (system availability event)

The data path of a NAT gateway is degraded. The availability of your NAT gateway was below 80% in the past 10 minutes, which means more than 20% of traffic is affected and your NAT gateway is not working properly. This may be due to a platform-level event causing packet drops. Alibaba Cloud engineers are working to resolve the issue.

Critical: The availability of the NAT gateway was less than 80% in the last 10 minutes, causing packet loss.

Classic Load Balancer (CLB)

problem-clb-connectionOverLimit

Dropped new connections due to excess CLB sessions

Critical

Dropped new connections due to excess CLB sessions

InstanceDropConnection (dropped connections per second for an instance)

The number of new or concurrent connections on a CLB instance exceeds its specification, causing new sessions to fail and a high rate of dropped connections.

Critical: The number of concurrent sessions frequently exceeds the limit over the last 10 minutes, causing packet loss.

Upgrade the instance or switch to a Network Load Balancer (NLB) or Application Load Balancer (ALB) instance.

For more information, see Manage CLB quotas. For product details about NLB and ALB, see What is Network Load Balancer (NLB)? and What is Application Load Balancer (ALB)?.

problem-clb-bandwidthOverLimit

Packet loss due to excess CLB bandwidth usage

Critical

Packet loss due to excess CLB bandwidth usage

InstanceDropTrafficRX (inbound bits dropped per second for an instance)

The actual traffic of a CLB instance exceeds its bandwidth specification, causing packet loss.

Critical: Bandwidth usage frequently exceeds the specification over the last 10 minutes, and the drop rate is greater than 100 bps.

Upgrade the instance specification. For more information, see Adjust the specifications of performance-guaranteed instances.

problem-clb-connectionFail

Sharp increase in failed CLB connections

Critical

Sharp increase in failed CLB connections

Not supported by CloudMonitor

The number of failed connections on the CLB instance has sharply increased. This may be because the backend server specification is exceeded, the load is too high, or there is a service exception.

Critical: The number of failed new connections for the CLB instance has sharply increased over the last 10 minutes. An alert is triggered if all of the following conditions are met:

Condition 1: The number of failed connections is greater than 100/s.

Condition 2: The number of failed connections increases by 30% compared with the previous 10-minute window.

Condition 3: Based on an intelligent baseline learned from historical data, the number of failed connections continuously exceeds the upper baseline limit by more than 30% within a 10-minute period.

Depending on the cause, upgrade the backend server specification, upgrade the CLB specification, or check the backend service status.

For more information, see Manage CLB quotas and Diagnose a CLB instance.

NLB

problem-nlb-connectionFail

Sharp increase in failed NLB connections

Critical

Sharp increase in failed NLB connections

Not supported by CloudMonitor

The number of failed connections on a virtual IP address (VIP) of the NLB instance has sharply increased for 10 consecutive minutes. Possible causes include:

  • Network link jitter.

  • Insufficient backend server performance.

Critical: An alert is triggered if the number of failed connections on the NLB instance meets all of the following conditions:

Condition 1: Within a 610-second monitoring window, the number of failed connections exceeds the intelligent forecast baseline by more than 100% for 3 consecutive minutes.

Condition 2: Within a 610-second monitoring window, the number of failed connections increases by 50% or more compared with the previous hour for 7 consecutive minutes.

Condition 3: Within a 610-second monitoring window, the number of failed connections is 1,000 or more for 8 consecutive minutes.

Check if the backend server resources or service status are normal.

For more information, see Diagnose an NLB instance.

problem-nlb-newConnectionSurge

Dropped new NLB connections

Critical

Dropped new NLB connections

VipDropConnection (dropped connections per second for a VIP), VipNewConnection (new connections per second for a VIP)

Due to a surge in new connections, the VIP of the NLB instance continuously drops new connection requests at millisecond or second intervals.

Critical: An alert is triggered if the number of connections on the NLB instance meets all of the following conditions:

Condition 1: Within 10 minutes, there are more than 8 data points where the number of connections dropped by the VIP per second is greater than 0.

Condition 2: Within 10 minutes, there are more than 8 data points where the number of new connections established by the VIP per second is less than 200,000.

Distribute traffic across multiple NLB instances or contact your account manager to apply for a quota increase.

problem-nlb-newConnectionOverLimit

Excess new NLB connections

Critical

Excess new NLB connections

VipDropConnection (dropped connections per second for a VIP), VipNewConnection (new connections per second for a VIP)

The number of new connections on the VIP of the NLB instance has exceeded the automatic scaling limit for a single VIP, causing new connection requests to be continuously dropped.

Critical: An alert is triggered if the number of connections on the NLB instance meets all of the following conditions:

Condition 1: Within 10 minutes, there are more than 8 data points where the number of connections dropped by the VIP per second is greater than 0.

Condition 2: Within 10 minutes, there are more than 8 data points where the number of new connections established by the VIP per second is 200,000 or more.

problem-nlb-concurrentConnectionOverLimit

Excess concurrent NLB connections

Critical

Excess concurrent NLB connections

VipDropConnection (dropped connections per second for a VIP), VipMaxConnection (maximum concurrent connections for a VIP)

The number of concurrent connections on the VIP of the NLB instance has exceeded the automatic scaling limit for a single VIP, causing new connection requests to be continuously dropped.

Critical: An alert is triggered if the number of connections on the NLB instance meets all of the following conditions:

Condition 1: Within 10 minutes, there are more than 8 data points where the number of connections dropped by the VIP per second is greater than 0.

Condition 2: Within 10 minutes, there are more than 8 data points where the maximum number of concurrent connections on the VIP is greater than 5,000,000.

ALB

problem-alb-intranetBandwidthOverLimit

Packet loss due to excess private bandwidth usage of ALB instances

Critical

Packet loss due to excess private bandwidth usage of ALB instances

Not supported by CloudMonitor

The outbound or inbound bandwidth on the VIP address of the ALB instance has reached its limit. A single VIP resolved from an ALB domain name has a bandwidth limit.

Critical: Within 10 minutes, there are more than 8 data points where the traffic dropped by the ALB instance is greater than 100 bps.

Add a CNAME record for the ALB instance. For more information, see Add a CNAME record for an ALB instance.

problem-alb-sessionOverLimit

Dropped new connections due to excess ALB sessions

Critical

Dropped new connections due to excess ALB sessions

LoadBalancerRejectedConnection (dropped connections per second for a load balancer instance)

The number of new or concurrent connections on the VIP address of the ALB instance exceeds the limit, causing new sessions to fail. A single VIP resolved from an ALB domain name has a new connection limit.

Critical: Within 10 minutes, there are more than 8 data points where the number of connections dropped by the ALB instance per second is greater than 0.

problem-alb-qpsOverLimit

503 error code returned because QPS exceeds the limit

Critical

503 error code returned because QPS exceeds the limit

Not supported by CloudMonitor

The number of queries per second (QPS) on the VIP address of the ALB instance has reached the VIP limit. A single VIP resolved from an ALB domain name has a QPS limit.

Critical: Within 10 minutes, there are more than 8 data points where the number of requests dropped per second is greater than 200 qps, and for 10 consecutive minutes, the number of dropped requests per second increases by 30% or more compared with 7 minutes earlier.

Cloud Enterprise Network (CEN)

problem-cen-routeOverLimit

Excess CEN routes

Critical

Excess CEN routes

Not applicable (event-based metric)

The CEN route quota is exceeded, which may cause network issues.

Critical: The CEN route quota is exceeded, causing network issues.

Upgrade the Transit Router (TR). For more information, see Upgrade a basic transit router to an enterprise transit router.

TR

problem-cen-vpcAttachBandwidthOverLimit

Packet loss due to excess VPC connection bandwidth

Critical

Packet loss due to excess VPC connection bandwidth

Not supported by CloudMonitor

The actual traffic of a CEN transit router exceeds the bandwidth specification, causing packet loss.

Critical: Within 10 minutes, there are more than 5 data points where the inbound packet loss rate is greater than 0.

Increase the bandwidth limit. For more information, see Manage CEN quotas.

problem-cen-peerAttachBandwidthOverLimit

Packet loss due to excess inter-region connection bandwidth

Critical

Packet loss due to excess inter-region connection bandwidth

InterRegionRateLimitDropPackets (outbound rate-limiting packet drop rate for inter-region connections), InterRegionPeakBandwidthUtilization (peak outbound bandwidth utilization for inter-region connections)

The actual traffic of a CEN transit router exceeds the bandwidth specification, causing packet loss.

Critical: An alert is triggered if the actual traffic of the TR instance meets all of the following conditions:

Condition 1: Within 10 minutes, there are more than 8 data points where the peak outbound bandwidth utilization is 90% or higher.

Condition 2: Within 10 minutes, there are more than 8 data points where the outbound rate-limiting packet drop rate is greater than 100 pps.

Increase the bandwidth limit. For more information, see Manage CEN quotas.
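Several alert rules in the table above share one form: within a 10-minute window, more than a given number of data points must satisfy a threshold. The following sketch shows how such a rule could be evaluated, using the thresholds listed for problem-cen-peerAttachBandwidthOverLimit and for the two NLB new-connection events. It is an illustration under assumptions, not the NIS implementation: the function names are hypothetical, and each input sequence is assumed to hold the samples collected in one 10-minute window.

```python
from typing import Callable, Optional, Sequence


def count_matching(points: Sequence[float], predicate: Callable[[float], bool]) -> int:
    """Count the data points in the window that satisfy the predicate."""
    return sum(1 for value in points if predicate(value))


def peer_attach_bandwidth_alert(
    utilization_pct: Sequence[float],
    drop_pps: Sequence[float],
    min_points: int = 8,
) -> bool:
    """Evaluate both conditions listed for problem-cen-peerAttachBandwidthOverLimit."""
    # Condition 1: more than 8 points with peak outbound utilization of 90% or higher.
    high_utilization = count_matching(utilization_pct, lambda v: v >= 90.0)
    # Condition 2: more than 8 points with a rate-limiting drop rate above 100 pps.
    heavy_drops = count_matching(drop_pps, lambda v: v > 100.0)
    return high_utilization > min_points and heavy_drops > min_points


def classify_nlb_new_connection(
    dropped_per_s: Sequence[float],
    new_conn_per_s: Sequence[float],
    min_points: int = 8,
    limit: float = 200_000,
) -> Optional[str]:
    """Tell the NLB surge event apart from the over-limit event."""
    # Both events require more than 8 points with dropped connections above 0.
    if count_matching(dropped_per_s, lambda v: v > 0) <= min_points:
        return None
    # At or above 200,000 new connections per second: over-limit event.
    if count_matching(new_conn_per_s, lambda v: v >= limit) > min_points:
        return "problem-nlb-newConnectionOverLimit"
    # Below 200,000 new connections per second: surge event.
    if count_matching(new_conn_per_s, lambda v: v < limit) > min_points:
        return "problem-nlb-newConnectionSurge"
    return None
```

Note that "more than 8 data points" is a strict comparison, so a window with exactly 8 matching points does not trigger the alert in this sketch.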

Risk events

Event code

Event name

Event level

CloudMonitor event name

CloudMonitor metric name

Description and impact

Alert rules

Recommended action

Internet-facing instance

risk-internetPacketLoss

Risk of Internet link packet loss

Warn

Risk of Internet link packet loss

Not applicable (Internet link probe event)

Probing has detected packet loss on the physical link from Alibaba Cloud {Region} to {Country} - {Area} - {ISP}. Traffic on this link in your account may experience jitter.

Critical: An alert is triggered if either of the following conditions is met:

Condition 1: The detected packet loss rate of a regional-level ISP link is greater than 50%.

Condition 2: Packet loss is detected on a national-level ISP link, and the average bandwidth of the traffic on this link within your account is 0.05 Mbps or higher in the last 10 minutes.

Note
  • Regional-level: A physical link to {Country}-{Area}-{ISP}.

  • National-level: A physical link to {Country}-{ISP}.

Warn: The Internet link packet loss rate is less than 50%, and the average bandwidth is greater than 0.5 Mbps in the last 10 minutes.

Check whether the instance bandwidth on this link meets your business requirements (you can refer to the 5-tuple data in traffic analysis). If there is an issue, consider migrating critical services to another region. If not, you can ignore this alert.

risk-internetBandwidthOverlimit

Packet loss risk due to excess bandwidth usage

Warn

Packet loss risk due to excess bandwidth usage

net_out.rate_percentage (outbound bandwidth utilization)

Historical data indicates a probability greater than 90% that the instance's bandwidth usage will exceed its specification.

Warn: There is a greater than 90% probability that traffic will exceed the specification at a certain time, causing packet loss.

Monitor the usage. If the specification is exceeded, consider upgrading the instance specification.

VPN Gateway

risk-vpn-bpsOverLimit

Risk of excess VPN bandwidth usage

Warn

Risk of excess VPN bandwidth usage

in_bandwidth_utilization (inbound bandwidth utilization of the VPN gateway), out_bandwidth_utilization (outbound bandwidth utilization of the VPN gateway)

The bandwidth utilization of the VPN instance has exceeded 90% more than three times in the last 10 minutes.

Warn: Within 10 minutes, there are more than 3 data points where the bandwidth utilization is greater than 90%.

 

risk-vpn-bgpRouteLimit

Risk of excess BGP routes

Warn

Risk of excess BGP routes

Not supported by CloudMonitor

The number of BGP dynamic routes learned by the VPN instance has exceeded 90% of the instance's BGP route quota in the last 10 minutes.

Warn: Within 10 minutes, there is more than 1 data point where the route utilization is greater than 90%.

Monitor BGP route usage. If the quota is nearly full, consider route aggregation on the peer VPN Gateway based on your network plan.

Express Connect

risk-ec-physicalConnectionFail

Express Connect circuit or port failure

Warn

Express Connect circuit or port failure

Not applicable (link status event)

A failure in the ISP's physical Express Connect circuit or a device port failure causes a service interruption.

Warn: The inbound traffic rate (from the data center to the VPC) of the VBR instance is monitored at a minute-level granularity. An alert is triggered if all of the following conditions are met:

Condition 1: 3 ≤ number of Express Connect port down events < 20.

Condition 2: The Express Connect port is down for more than 2 consecutive data points.

Condition 3: Not all Express Connect ports are in a down state.

Contact your business manager for assistance.

risk-ec-bgpRouterFail

BGP connection failure

Warn

BGP connection failure

Not applicable (BGP connection status event)

A network connectivity failure on the physical Express Connect circuit or an abnormal BGP configuration causes a BGP connection failure and route loss.

Warn: An alert is triggered if the BGP connection status changes from Connected to any other state.

Contact your business manager for assistance.

risk-ec-inTrafficDroppedToZero

Sharp drop in inbound VBR traffic

Warn

Sharp drop in inbound VBR traffic

RateInFromIDCToVpc (inbound traffic rate from data center to VPC)

A failure in the ISP's physical Express Connect circuit or a device port failure causes the inbound traffic of the virtual border router (VBR) to drop sharply.

Warn: The inbound traffic rate (from the data center to the VPC) of the VBR instance is monitored at a minute-level granularity. An alert is triggered if all of the following conditions are met:

Condition 1: For 3 consecutive minutes, the rate per minute drops by ≥ 99% compared to the average rate of the previous 7 minutes.

Condition 2: For 3 consecutive minutes, the absolute value of the rate drop per minute is ≥ 1 Mbps compared to the average rate of the previous 7 minutes.

Condition 3: For 3 consecutive minutes, the absolute value of the rate drop per minute is ≥ 0.5 Mbps compared to the average rates of the previous 15, 30, and 60 minutes.

Condition 4 (Intelligent baseline alert): An intelligent baseline learns the historical patterns of the VBR instance's inbound traffic rate to predict a stable range for the next cycle. If the rate drops below the predicted lower bound by ≥ 99% for 2 consecutive minutes within a 3-minute period when the cycle begins, it is considered an abnormal drop.

Check whether this is normal business traffic behavior or if a health check switchover has occurred. If your business is impacted, contact your business manager for assistance.

risk-ec-outTrafficDroppedToZero

Sharp drop in outbound VBR traffic

Warn

Sharp drop in outbound VBR traffic

RateOutFromVpcToIDC (outbound traffic rate from VPC to data center)

A failure in the ISP's physical Express Connect circuit or a device port failure causes the outbound traffic of the VBR to drop sharply.

Warn: The outbound traffic rate (from the VPC to the data center) of the VBR instance is monitored at a minute-level granularity. An alert is triggered if all of the following conditions are met:

Condition 1: For 3 consecutive minutes, the rate per minute drops by ≥ 99% compared to the average rate of the previous 7 minutes.

Condition 2: For 3 consecutive minutes, the absolute value of the rate drop per minute is ≥ 1 Mbps compared to the average rate of the previous 7 minutes.

Condition 3: For 3 consecutive minutes, the absolute value of the rate drop per minute is ≥ 0.5 Mbps compared to the average rates of the previous 15, 30, and 60 minutes.

Condition 4 (Intelligent baseline alert): An intelligent baseline learns the historical patterns of the VBR instance's outbound traffic rate to predict a stable range for the next cycle. If the rate drops below the predicted lower bound by ≥ 99% for 2 consecutive minutes within a 3-minute period when the cycle begins, it is considered an abnormal drop.

Check whether this is normal business traffic behavior or if a health check switchover has occurred. If your business is impacted, contact your business manager for assistance.
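The first two conditions of the VBR sharp-drop rules above can be sketched as follows. This is a simplified illustration, not the NIS detection logic: the function name is hypothetical, the input is assumed to be one rate sample per minute in Mbps, and the baseline is assumed to be the mean of the 7 minutes that precede the 3 most recent minutes. Conditions 3 and 4 (the longer comparison windows and the intelligent baseline) are omitted.

```python
from statistics import mean
from typing import Sequence


def sharp_drop(rates_mbps: Sequence[float]) -> bool:
    """Check Conditions 1 and 2 on the last 10 one-minute samples."""
    if len(rates_mbps) < 10:
        return False  # Not enough history to form a 7-minute baseline.
    window = rates_mbps[-10:]
    baseline = mean(window[:7])  # Average rate of the previous 7 minutes.
    recent = window[7:]          # The 3 most recent consecutive minutes.
    if baseline <= 0:
        return False  # No traffic to drop from.
    # Condition 1: each recent minute is down by >= 99% relative to the baseline.
    relative_ok = all((baseline - rate) / baseline >= 0.99 for rate in recent)
    # Condition 2: each recent minute is down by >= 1 Mbps in absolute terms.
    absolute_ok = all(baseline - rate >= 1.0 for rate in recent)
    return relative_ok and absolute_ok
```

The absolute-value condition keeps low-traffic links from alerting: a link that averages 0.5 Mbps and falls to zero meets the relative condition but not the 1 Mbps threshold, so this sketch returns `False` for it.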

Related operations

Actions

Description and references

View events

You can view events in the following ways:

Subscribe to events

You can subscribe to events in CloudMonitor. After you subscribe, you will be notified about new events and status updates by phone, text message, or email. For more information, see Configure NIS event subscriptions.

Handle events

After you view an event, you can resolve the issue based on the provided recommendations. For more information, see Event summary.