Overview

More than half of all major production failures originate from changes. Change risk control is Alibaba Cloud's framework for managing that risk across the full change lifecycle — combining organization-wide standards with technical enforcement to reduce failure incidence, standardize execution, and protect business continuity.

What is a change

A change is any operation on a live system that may affect production services: publishing a release, adding or modifying components, or removing components. The scope is broader than most teams expect.

A change system is any tool or platform that can perform such operations — including GUI consoles, command-line scripts, stress testing platforms, and open APIs. If a platform can modify the production environment, it is a change system, regardless of its primary purpose.

Goals

Change risk control has three objectives:

  • Reduce the incidence of major failures caused by changes.

  • Standardize change operations across business teams with shared execution standards.

  • Help change systems enforce risk controls that safeguard the execution of business changes.

Two layers of change risk control

Change risk control operates at two levels:

  • Business philosophy: Organization-wide standards that govern how change operations are executed and how change systems are built. Every team that touches production must operate within these standards.

  • Technical framework: Tooling and automation that enforces the standards across the change lifecycle — admission checks before a change begins, gradual execution with monitoring gates during the change, and impact analysis afterward.

The change lifecycle

Every change moves through three phases.

  1. Planning phase: Submit and approve a change request. The request must specify the change plan, maintenance window, potential impact, and rollback plan.

  2. Execution phase: Validate the change in a staging environment first. Before touching production, re-validate the environment and confirm that service traffic has stopped as expected. In production, deploy in grayscale batches with a pause between batches. During each pause, verify at least one core business health indicator, such as Business Monitoring metrics or error logs, before proceeding. A rollback capability must be available throughout execution.

  3. Completion phase: Confirm that the business is operating normally using monitoring data and logs. Record and report the outcome.
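
The planning-phase requirements can be sketched as a simple admission check. This is a minimal illustration, not an Alibaba Cloud API: the ChangeRequest class, its field names, and the admission_errors function are all hypothetical, and the rule that execution may only start inside the maintenance window is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ChangeRequest:
    """Hypothetical change request; field names are illustrative only."""
    change_plan: str
    window_start: datetime
    window_end: datetime
    potential_impact: str
    rollback_plan: str


def admission_errors(req: ChangeRequest, now: datetime) -> list[str]:
    """Return the reasons a request should not proceed; empty means admit."""
    errors = []
    # The plan, impact, and rollback plan must all be filled in.
    for name in ("change_plan", "potential_impact", "rollback_plan"):
        if not getattr(req, name).strip():
            errors.append(f"missing {name}")
    # The maintenance window must be a valid interval...
    if req.window_end <= req.window_start:
        errors.append("invalid maintenance window")
    # ...and (assumed rule) execution may only start inside it.
    elif not (req.window_start <= now < req.window_end):
        errors.append("outside maintenance window")
    return errors


req = ChangeRequest(
    change_plan="Upgrade service to v2 in three batches",
    window_start=datetime(2024, 1, 10, 2, 0),
    window_end=datetime(2024, 1, 10, 4, 0),
    potential_impact="Brief latency increase on checkout",
    rollback_plan="",
)
print(admission_errors(req, datetime(2024, 1, 10, 5, 0)))
# ['missing rollback_plan', 'outside maintenance window']
```

Returning a list of reasons rather than a single boolean lets the change system report every problem with a request at once instead of rejecting it one issue at a time.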

Three principles every change must follow

Any operation on a production system — publishing, adding, modifying, or removing components — must satisfy three principles:

  • Observability: Before executing a change, identify the metrics and logs that will tell you whether the change succeeded. During execution, actively monitor those signals at each batch boundary. After completion, use impact topology data to trace any anomalies back to their source.

  • Grayscale release: Never apply a change to the full production environment in a single step. Deploy in batches, starting with a small exposure. Pause between batches to verify health before expanding. This limits the blast radius of any problem and gives you time to detect issues before they affect all users.

  • Rollback capability: Every change must have a rollback plan before it is approved, and the ability to roll back must remain available throughout execution so that a failed health check can be acted on immediately.
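
The three principles can be combined in one short control loop. The sketch below is an assumption-laden illustration rather than a description of any real change system: grayscale_rollout and its apply, rollback, and healthy hooks are hypothetical names, and a real system would also wait for a fixed observation period at each pause before reading the health signal.

```python
from typing import Callable, Sequence


def grayscale_rollout(
    batches: Sequence[Sequence[str]],
    apply: Callable[[str], None],
    rollback: Callable[[str], None],
    healthy: Callable[[], bool],
) -> bool:
    """Apply a change batch by batch; roll everything back on a failed check.

    All four parameters are hypothetical hooks supplied by the caller.
    Returns True if every batch passed its health gate.
    """
    changed: list[str] = []
    for batch in batches:
        for target in batch:
            apply(target)
            changed.append(target)
        # Pause point: read the health signal chosen before the change began.
        if not healthy():
            # Undo in reverse order so the most recent change is reverted first.
            for target in reversed(changed):
                rollback(target)
            return False
    return True


# Simulated run: batches grow in size, and the health check fails
# once three hosts have been changed (that is, after the second batch).
state: dict[str, str] = {}
ok = grayscale_rollout(
    [["host-1"], ["host-2", "host-3"], ["host-4", "host-5", "host-6"]],
    apply=lambda h: state.__setitem__(h, "v2"),
    rollback=lambda h: state.__setitem__(h, "v1"),
    healthy=lambda: len(state) < 3,
)
print(ok, state)
# False {'host-1': 'v1', 'host-2': 'v1', 'host-3': 'v1'}
```

In the simulated run the third batch is never touched, which is the blast-radius limit that grayscale release is meant to provide.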