Fault management
Fault management helps you restore services quickly after production incidents, minimize business impact, and prevent recurrence through systematic detection, response, and improvement.
Overview of fault management
Fault management is a concept from the Information Technology Infrastructure Library (ITIL). The goal is to quickly restore normal service operations after a major breakdown in the production environment, minimizing the negative business impact of component failures and ensuring that agreed-upon service level objectives and quality standards are met.
In practice, faults at IT and internet companies can be caused by the following:
- Scheduled maintenance of hardware or operating systems, such as replacing hard drives or applying operating system patches.
- Application faults, including software performance issues, application bugs, and system application changes.
- Human error, including incorrect or non-standard operations that do not follow established procedures.
- System software faults, including operating system crashes and various database failures.
- Hardware failures, including damaged hard drives or network interface controllers (NICs).
- Failures of related devices, such as power outages caused by Uninterruptible Power Supply (UPS) failures.
- Natural disasters, including floods, fires, and earthquakes.
To reduce the impact of faults, Alibaba Group uses a systematic governance approach that covers defining, detecting, and responding to scenarios that affect business operations, as well as follow-up handling after a fault is resolved. The system includes Alibaba Group's "risk warning" feature for managing potential issues before they escalate, and it covers both common faults that cause performance degradation and major faults that severely impact the business.
This system is also optimized for internet companies, which often require extremely fast responses and use rapid development environments such as DevOps and Agile. Major fault emergencies also require a coordinated response from multiple departments, including legal, government affairs, public relations, customer service, and technical support.

Importance of fault management
Both theory and practice show that if a fault can happen, it eventually will, which is the idea behind Murphy's Law. If the probability of an unexpected event occurring in one experiment is p (p > 0), then the probability of it occurring at least once in n independent experiments is P = 1 - (1 - p)^n. As the number of experiments (n) approaches infinity, P approaches 1. This means the event becomes practically inevitable.
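To make the formula P = 1 - (1 - p)^n concrete, the following minimal sketch evaluates it for a growing number of trials. The per-trial probability of 0.001 is an illustrative assumption, not a figure from this document, and the trials are assumed to be independent.

```python
def probability_at_least_once(p: float, n: int) -> float:
    """Return P = 1 - (1 - p)^n: the chance that an event with per-trial
    probability p occurs at least once in n independent trials."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.001  # assumed per-trial fault probability, for illustration only
    for n in (1, 100, 1_000, 10_000):
        print(f"n={n:>6}: P={probability_at_least_once(p, n):.4f}")
```

Even with a small p, the result climbs toward 1 as n grows, which is why a rare fault becomes a near certainty over a long enough period.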
Fault management helps you achieve the following:
- Prevent problems by detecting and resolving risks in advance.
- Reduce the impact of faults by detecting, locating, and recovering from them quickly. This is known as the 1-5-10 solution: detect a fault within 1 minute, locate it within 5 minutes, and recover within 10 minutes.
- Ensure that improvement measures are effectively implemented to prevent similar faults from recurring.
A standardized, end-to-end, closed-loop fault management system combined with technological improvements reduces fault occurrence, shortens the Mean Time To Repair (MTTR), and ultimately minimizes business impact.
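As a rough illustration of how MTTR can be tracked, the sketch below averages the time from fault start to recovery over a set of fault records. The `FaultRecord` type and its field names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FaultRecord:
    """Hypothetical record: when a fault started and when service was restored."""
    started_at: datetime
    recovered_at: datetime


def mean_time_to_repair(records: list[FaultRecord]) -> timedelta:
    """Average the time from fault start to recovery across all records."""
    if not records:
        raise ValueError("at least one fault record is required")
    total = sum((r.recovered_at - r.started_at for r in records), timedelta())
    return total / len(records)
```

Tracking this value per line of business over time is one way to see whether process and tooling changes are actually shortening recovery.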
In daily operations, a fault is any event that causes a service interruption, a decline in service quality, or a poor user experience. This excludes issues caused by the user's environment or their own actions.
- "Poor user experience" refers to the impact on the user, which is the core of a fault. This impact can be identified through customer complaints or inferred by monitoring user-side activity.
- "Service interruption or decline in service quality" means that a problem with a company-provided service is considered a fault, even if there are no user complaints or active users.
- "Any event" means that any issue affecting a user is a fault, regardless of the cause. The cause could be the company, a third-party supplier, or an ISP.
Fault management
Fault management is an end-to-end emergency response process that includes emergency response, fault convergence, fault tracking, post-mortems, and improvements. Establishing this mechanism helps ensure stable service operation and a reliable service experience.
A fault management system should include the following features or characteristics:
- Fault level definitions: For different lines of business, bring together relevant personnel to create unified definitions that are agreed upon by all parties. When defining fault levels, consider the following:
  - Feature importance
  - Impact on products, services, and applications
  - Scope of impact (number of users, amount of loss, public sentiment, etc.)
- Fault emergency response: Supports global emergency announcements through multiple notification channels, such as phone calls, text messages, emails, and instant messaging (IM), ensuring that key personnel are promptly informed of critical progress.
- Fault convergence: Supports alert convergence based on time or frequency, grouping related alerts into a single fault for unified handling.
- Fault tracking: Provides online management and collaboration on the latest progress, scope of impact (affected services), public feedback, and timelines, giving teams a unified view for collaborative fault handling.
- Fault post-mortem: Provides structured requirements for in-depth post-mortems with online checkpoints. This includes root cause analysis (such as the cause, recent activities, how the fault was introduced, and the recovery method), checks for related changes, and reviews of monitoring data. An owner and a team must be assigned to each fault.
- Fault improvement: Supports defining clear improvement measures, acceptance criteria, owners, and completion dates for each fault, ensuring that every post-mortem leads to actionable improvements and helps prevent recurrence.
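The time-based fault convergence described in the list above can be sketched as window grouping: alerts for the same service that fire within a fixed window of a group's first alert are merged into one fault. The `Alert` type, the choice of service name as the grouping key, and the 5-minute default window are assumptions for illustration and do not describe how any particular product implements convergence.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert: which service fired it and when."""
    service: str
    fired_at: datetime


def converge_alerts(
    alerts: list[Alert], window: timedelta = timedelta(minutes=5)
) -> list[list[Alert]]:
    """Group alerts per service. An alert joins its service's latest group if it
    fired within `window` of that group's first alert; otherwise it starts a
    new group."""
    groups: list[list[Alert]] = []
    latest_group: dict[str, list[Alert]] = {}  # most recent group per service
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        current = latest_group.get(alert.service)
        if current is not None and alert.fired_at - current[0].fired_at <= window:
            current.append(alert)
        else:
            current = [alert]
            groups.append(current)
            latest_group[alert.service] = current
    return groups
```

Each returned group would then be handled as a single fault, so responders see one item instead of a stream of repeated alerts.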
Best practices
O&M Event Center is a cloud-based fault management service from Alibaba Cloud that helps you establish a fault emergency response process, standardize procedures, and build a complete system to support stable business growth.
Based on years of fault management experience, teams at Alibaba Group have developed a feature-rich fault management platform that facilitates the digital implementation of fault management tasks. All aspects of fault management can be configured and managed in O&M Event Center.
Defining and entering fault levels
A standard approach to defining fault levels is as follows:
- First, divide the business into major subcategories based on business attributes (at the overall technical architecture layer).
- Second, distinguish between core, sub-core, and non-core modules within each business subcategory (at the feature layer).
- Third, adapt the scope of impact and fault level definition templates based on the business volume of each functional module.
The key is adapting the scope of impact and the corresponding fault level definition template to the business volume. The following examples are for reference only. Adapt the recommended values to your specific business needs.
For core features:
- For high-volume services (for example, a peak rate of over 1,000 Transactions Per Second (TPS) and a daily average of over 1 million transactions), a drop of 30% or more in the successful transaction rate per minute is defined as P1.
- For medium-volume services (for example, a peak rate of 100 to 1,000 TPS and a daily average of 100,000 to 1 million transactions), a drop of 45% or more in total successful transactions over a 10-minute period is defined as P1.
- For low-volume services (for example, a peak rate of 10 to 100 TPS and a daily average of 10,000 to 100,000 transactions), a drop of 45% or more in total successful transactions over a 15- or 30-minute period is defined as P1.
- For very low-volume services (with a daily average of fewer than 10,000 transactions), a drop of 45% or more in total successful transactions over a 60-minute period can be defined as P2.
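The reference values above can be encoded as a small rule table. The following is a minimal sketch, not a definitive implementation: it selects the tier from the daily average only (ignoring peak TPS), assigns boundary values to the higher tier, and picks 15 minutes for the low-volume window. All three choices, along with the `TierRule` type and the function names, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierRule:
    """One row of the example table for core features."""
    name: str
    min_daily_transactions: int  # lower bound of the tier (assumed inclusive)
    window_minutes: int          # observation window for the drop
    drop_threshold: float        # fractional drop that triggers the level
    level: str                   # highest level defined for the tier


# Reference values from the examples above; adapt them to your business.
CORE_RULES = [
    TierRule("high", 1_000_000, 1, 0.30, "P1"),
    TierRule("medium", 100_000, 10, 0.45, "P1"),
    TierRule("low", 10_000, 15, 0.45, "P1"),  # text allows 15 or 30 minutes
    TierRule("very_low", 0, 60, 0.45, "P2"),
]


def rule_for(daily_transactions: int) -> TierRule:
    """Pick the first tier whose lower bound the daily volume meets."""
    for rule in CORE_RULES:
        if daily_transactions >= rule.min_daily_transactions:
            return rule
    raise ValueError("daily_transactions must be non-negative")


def top_level_if_breached(
    daily_transactions: int, baseline: float, observed: float
) -> Optional[str]:
    """Return the tier's highest level if successful transactions over the
    tier's window dropped by at least the threshold, otherwise None."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    rule = rule_for(daily_transactions)
    drop = (baseline - observed) / baseline
    return rule.level if drop >= rule.drop_threshold else None
```

Here `baseline` and `observed` stand for the expected and actual counts of successful transactions measured over `window_minutes` for the selected tier.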
We recommend defining business functional modules from the user's perspective or from a perspective that is perceptible to external calls. For example, you can measure a drop in business volume from users or a drop in successful external calls.
After the highest fault level, P1, is determined, you can sequentially lower the scope of impact to define the P2 to P4 standards. For example, impact scopes of 20% to 30% or 30% to 45% can be mapped to the remaining levels. For failures in the main path of high-volume services, you might consider starting at P3 and not setting a P4 level.
For sub-core features (such as marketing or registration services), you can lower the level by one from the core feature standard.
For non-core features (such as query services or background services), you can lower the level by two from the core feature standard.
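The level adjustment for sub-core and non-core features can be sketched as an offset applied to the core-feature level. Capping the result at P4 is an assumption of this example; a team could equally decide that such events fall below the fault threshold. The names `LEVELS`, `DEMOTION`, and `adjust_level` are hypothetical.

```python
LEVELS = ["P1", "P2", "P3", "P4"]  # ordered from most to least severe

# Number of levels to lower relative to the core-feature standard.
DEMOTION = {"core": 0, "sub-core": 1, "non-core": 2}


def adjust_level(core_level: str, feature_class: str) -> str:
    """Lower a core-feature level by the class's offset, capped at P4."""
    index = LEVELS.index(core_level) + DEMOTION[feature_class]
    return LEVELS[min(index, len(LEVELS) - 1)]
```

For example, an impact that would be P1 for a core feature maps to P2 for a sub-core feature and to P3 for a non-core feature.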
This generates a fault level definition template as shown below. In actual use, you can simplify it to avoid redundancy.

After the fault level definitions are created, the technical owner must approve them. The definitions must also be communicated to the technical teams and downstream teams, which may require a presentation.
In O&M Event Center, you can enter the corresponding fault levels. When an associated monitor is triggered, the system automatically matches the event to the corresponding level definition. This helps you quickly determine the severity of the fault.
Service groups and emergency response groups
A service group is a set of personnel associated with one or more fault scenarios. When a fault is triggered, on-duty members of the corresponding service group are automatically notified by a call and added to the fault emergency response group. Service groups also support on-duty scheduling.
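To illustrate the on-duty scheduling idea, the sketch below looks up which members of a service group are on shift at the moment a fault is triggered. The `Shift` type and the `on_duty_members` function are hypothetical and do not reflect the data model of any specific product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Shift:
    """Hypothetical on-duty shift for one member of a service group."""
    member: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def on_duty_members(shifts: list[Shift], at: datetime) -> list[str]:
    """Return the members whose shift covers the given time."""
    return [s.member for s in shifts if s.start <= at < s.end]
```

The returned members are the ones who would be notified and added to the emergency response group for that fault.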
An emergency response group is a group chat automatically created after a fault is announced. In addition to automatically added members, other relevant personnel can join to help investigate the fault. The group also provides fault handling features such as check-in responses, investigation assistance, and playbooks.
Fault records
During a fault, record relevant information, such as key time points and key operations.
Post-mortems and improvement measures
After the fault is resolved, share the post-mortem information with all involved teams, identify the root cause, and assign an owner.
After the post-mortem, implement specific improvements to prevent similar faults from recurring.