Failure Emergency Collaboration

Coordinate fault response across teams by using automated notifications, emergency collaboration groups, and clearly defined roles to minimize recovery time.

Failure Notification and Update

With 24/7 monitoring in place, a business exception that reaches the fault level triggers notifications: fault impact information and handling progress are sent to designated recipients or groups within the agreed time frame by voice call, SMS, or instant messaging (IM). Notifications continue until the fault is resolved.
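
The notification behavior described above can be sketched as a loop that sends one round of messages per status update and stops once the fault is resolved. This is a minimal illustration only: the `Fault` class, the `CHANNELS` names, and the `send` callback are hypothetical and do not correspond to any real product API.

```python
from dataclasses import dataclass

# Hypothetical channel names; the text only lists voice, SMS, and IM.
CHANNELS = ("voice", "sms", "im")


@dataclass
class Fault:
    fault_id: str
    impact: str
    progress: str = "investigating"
    resolved: bool = False


def build_notification(fault: Fault, channel: str) -> dict:
    """Assemble one notification carrying fault impact and handling progress."""
    return {
        "channel": channel,
        "fault_id": fault.fault_id,
        "impact": fault.impact,
        "progress": fault.progress,
    }


def notify_until_resolved(fault, updates, send):
    """Send one round of notifications per update until the fault is resolved.

    `updates` is an iterable of (progress, resolved) tuples standing in for
    real status changes; `send` is a caller-supplied delivery function.
    Returns the number of notification rounds sent.
    """
    rounds = 0
    for progress, resolved in updates:
        fault.progress, fault.resolved = progress, resolved
        for channel in CHANNELS:
            send(build_notification(fault, channel))
        rounds += 1
        if fault.resolved:
            break  # the last round announces recovery, then notifications stop
    return rounds
```

In a real system the updates would arrive on a timer tied to the agreed time frame rather than from a list, and `send` would call the voice, SMS, or IM gateway in use.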

Failure Emergency Collaboration Group

After a fault occurs, DingTalk automatically creates an emergency collaboration group and pulls in the relevant members, including the emergency contact of the affected business and the emergency contact of the business suspected of causing the fault, and then calls them automatically. Members can check in directly in the group. We recommend creating a separate group for each fault so that membership stays relevant and collaboration stays focused.

The emergency collaboration group spans the entire handling lifecycle: 24/7 fault detection -> automatic group creation -> member notification -> troubleshooting and mitigation plan sharing -> one-click conference call -> fault war room -> recovery metrics summary.

The key roles and responsibilities during a fault emergency are:

  • Fault Handler (technical support engineer, monitoring duty engineer): Initiates the fault emergency, coordinates resources across teams to drive rapid recovery, keeps the fault war room updated so all stakeholders have timely access to fault information, and escalates as needed.

  • Emergency Handler (R&D engineer, test engineer, stability contact person, etc.): Troubleshoots issues and drives rapid recovery as assigned by the Emergency Commander, responds within SLA requirements, and reports progress.

  • Emergency Commander: Assigned based on fault level. For example, the stability department leader or on-duty manager handles P1 and P2 faults, while the technical team leader or designated stability contact person handles P3 and P4 faults. The commander assigns emergency handlers their tasks (investigate the cause, recover quickly, report progress) and coordinates the overall response to ensure rapid recovery. Note: The impact of any mitigation action must not be more severe than the impact of the fault itself.
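
The level-based commander assignment and the mitigation note can be expressed as a small lookup plus a guard. This is a sketch under stated assumptions: the mapping mirrors only the example given above, and the numeric impact scores are a hypothetical way to compare severity, not something the source defines.

```python
# Mapping taken from the example in the text; real assignments vary by organization.
COMMANDER_BY_LEVEL = {
    "P1": "stability department leader or on-duty manager",
    "P2": "stability department leader or on-duty manager",
    "P3": "technical team leader or designated stability contact person",
    "P4": "technical team leader or designated stability contact person",
}


def assign_commander(level: str) -> str:
    """Return the commander role for a fault level such as 'P1'."""
    try:
        return COMMANDER_BY_LEVEL[level.upper()]
    except KeyError:
        raise ValueError(f"unknown fault level: {level!r}") from None


def mitigation_allowed(fault_impact: int, mitigation_impact: int) -> bool:
    """Apply the note above: a mitigation must not be more severe than the fault.

    Both arguments are hypothetical impact scores where a higher number
    means a more severe impact.
    """
    return mitigation_impact <= fault_impact
```

Keeping the mapping in data rather than in branching logic makes it easy to review and adjust when the escalation policy changes.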