Event management

Learn how to detect, classify, assign, and resolve events to minimize business impact and restore service quickly.

An event is any unplanned incident that interrupts or threatens service quality, such as a business risk, slow server performance, or high interface latency. Regardless of severity, events reduce work efficiency and degrade the customer experience.

Events come from two main sources:

  • Manual reporting

  • System detection

Event management covers detecting, recording, classifying, assigning, analyzing, resolving, and closing events. The goal is to restore service promptly and minimize business impact. It helps locate problems quickly, improve resolution efficiency, reduce recurring issues, enhance business continuity, improve the user experience, and standardize workflows.

The event management process:

  • Detection and recording: Detect and record events through monitoring tools, log analysis, or manual reporting.

  • Prioritization and classification: Prioritize and classify events based on their impact and cause.

      • Prioritization: Assign P1–P4 priority levels based on impact.

      • Classification: Classify by cause: monitoring false positives, business fluctuations, or code logic issues.

  • Assignment: Assign events to the appropriate person or group based on affected scope, service, or application. This enables quick response and improves internal communication efficiency.

  • Resolution and analysis: The event owner assesses the alert details, resolves the event, and records the solution and reasoning for future reference.

  • Closure: Close the event after resolution. The saved record serves as a reference for similar future issues.
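The process above can be read as a small state machine in which an event moves through each step in order. The following Python sketch is illustrative only: the `Event` class, the `Status` names, and the method names are hypothetical and are not part of any product API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    RECORDED = "recorded"
    CLASSIFIED = "classified"
    ASSIGNED = "assigned"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Each status may only advance to the next step of the process.
NEXT_STATUS = {
    Status.RECORDED: Status.CLASSIFIED,
    Status.CLASSIFIED: Status.ASSIGNED,
    Status.ASSIGNED: Status.RESOLVED,
    Status.RESOLVED: Status.CLOSED,
}

PRIORITIES = ("P1", "P2", "P3", "P4")


@dataclass
class Event:
    title: str
    source: str  # "manual" or "system"
    status: Status = Status.RECORDED
    priority: Optional[str] = None
    cause: Optional[str] = None
    owner: Optional[str] = None
    solution: Optional[str] = None

    def _advance(self, expected: Status) -> None:
        # Reject steps taken out of order, e.g. closing an unresolved event.
        if self.status is not expected:
            raise ValueError(
                f"expected status {expected.value!r}, got {self.status.value!r}"
            )
        self.status = NEXT_STATUS[expected]

    def classify(self, priority: str, cause: str) -> None:
        if priority not in PRIORITIES:
            raise ValueError(f"unknown priority {priority!r}")
        self._advance(Status.RECORDED)
        self.priority, self.cause = priority, cause

    def assign(self, owner: str) -> None:
        self._advance(Status.CLASSIFIED)
        self.owner = owner

    def resolve(self, solution: str) -> None:
        # The recorded solution is kept for reference on similar future events.
        self._advance(Status.ASSIGNED)
        self.solution = solution

    def close(self) -> None:
        self._advance(Status.RESOLVED)


# Walk one hypothetical event through the full process.
event = Event(title="High interface latency", source="system")
event.classify("P2", "code logic issue")
event.assign("payments-oncall")
event.resolve("Rolled back the faulty release")
event.close()
```

Enforcing a strict order is a design choice made for this sketch; a real workflow may also allow reassignment or reopening.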

Benefits of a standardized event workflow:

  • Faster event resolution.

  • Reduced business losses and costs.

  • Continuous improvement and learning.

O&M Event Center is a cloud-based event management service from Alibaba Cloud. When alert data from integrated monitoring sources is assigned and triggers a notification based on rule conditions, it becomes an event. Events have higher priority than alerts and are assigned to a specific owner for follow-up, resolution, and archiving.

Events represent tasks that are triggered by rules or created manually. O&M Event Center supports flexible task flows, priority response for critical events, and closure tracking to reduce Mean Time to Acknowledge (MTTA) and Mean Time to Repair (MTTR). It also supports one-click escalation of events to incidents, so the entire event lifecycle can be managed online.
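As a rough illustration of the two metrics, the sketch below computes MTTA and MTTR from a few made-up event timestamps. It assumes both metrics are measured from the moment the event is triggered; the record layout and the `mean_minutes` helper are hypothetical and do not reflect how O&M Event Center stores or reports these values.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(intervals):
    """Average length, in minutes, of (start, end) datetime pairs."""
    return mean((end - start).total_seconds() / 60 for start, end in intervals)


# Hypothetical records: (triggered, acknowledged, resolved).
records = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5), datetime(2024, 1, 1, 10, 45)),
    (datetime(2024, 1, 1, 11, 0), datetime(2024, 1, 1, 11, 15), datetime(2024, 1, 1, 12, 15)),
]

# MTTA: trigger to acknowledgement. MTTR: trigger to resolution (assumed).
mtta = mean_minutes((t, a) for t, a, _ in records)
mttr = mean_minutes((t, r) for t, _, r in records)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 10.0 min, MTTR: 60.0 min
```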

  • Alert data integration: Integrate with dozens of monitoring systems such as Application Real-Time Monitoring Service (ARMS), Simple Log Service (SLS), Cloud Monitor, Prometheus, and Dynatrace. Custom integrations automatically parse alert information.

  • Classification and assignment: Maintain relationships between services, personnel, and service groups. Use routing rules to classify alerts by affected service or application. Set trigger rules based on alert fields to automatically assign events to the corresponding owner or group.

  • Handling and resolution: The event owner acknowledges the event, analyzes the cause, and references similar past resolutions. The system records handling details for future analysis.

  • Closure and continuous improvement: Close resolved events with tags, trigger causes, and solutions recorded. This data supports analysis and provides a reference for future events and architecture optimization.
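The classification and assignment step can be pictured as matching alert fields against an ordered list of rules. The sketch below is a simplified model under stated assumptions: `TriggerRule`, `assign`, the field names, and the first-match-wins ordering are all hypothetical and do not describe the actual rule engine or configuration format of O&M Event Center.

```python
from dataclasses import dataclass


@dataclass
class TriggerRule:
    conditions: dict  # alert field name -> required value
    assignee: str     # owner or group that receives the event

    def matches(self, alert: dict) -> bool:
        # A rule matches only if every condition equals the alert's field value.
        return all(alert.get(key) == value for key, value in self.conditions.items())


def assign(alert: dict, rules: list, default: str = "unassigned-queue") -> str:
    """Return the assignee of the first matching rule, or the default."""
    for rule in rules:
        if rule.matches(alert):
            return rule.assignee
    return default


# More specific rules are listed first so that they take precedence.
rules = [
    TriggerRule({"service": "checkout", "severity": "critical"}, "checkout-oncall"),
    TriggerRule({"service": "checkout"}, "checkout-team"),
]

print(assign({"service": "checkout", "severity": "critical"}, rules))  # checkout-oncall
print(assign({"service": "checkout", "severity": "warning"}, rules))   # checkout-team
print(assign({"service": "search", "severity": "critical"}, rules))    # unassigned-queue
```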