Observability
Change observability is the ability to detect, in real time, unexpected anomalies in online business that are triggered during change execution, such as monitoring alerts, alarms, and log errors. It enables change executors to identify problems proactively and reduce the blast radius of major failures. Change observability is a baseline requirement for any change management system.
Three Principles of Change Observability
- Effective observation during change execution: The change system enforces progressively stronger controls, with observation starting from the first batch of execution.
- Observation required during each gray release batch: Observation must continue throughout the change execution process. Verify that the current batch shows no anomalies before proceeding to the next batch.
- Sufficient observation interval for each batch: Each team can define observation intervals based on its own business characteristics and experience to ensure adequate observation coverage.
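The three principles can be read together as a simple control loop: apply one batch, observe it for the full interval, and proceed only if no anomaly appears. The following is a minimal sketch under stated assumptions, not a reference implementation. The names `run_batched_change`, `apply_change`, `has_anomaly`, `observe_seconds`, and `poll_seconds` are hypothetical and do not correspond to any Alibaba Cloud API; in practice `has_anomaly` would query whatever alerting or log source a team already uses.

```python
import time
from typing import Callable, Sequence


def run_batched_change(
    batches: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    has_anomaly: Callable[[Sequence[str]], bool],
    observe_seconds: float,
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply a change batch by batch and return the number of batches
    that passed observation. Stops at the first batch showing an anomaly.

    All parameters are illustrative (hypothetical) hooks:
    - apply_change(target): executes the change on one target.
    - has_anomaly(batch): True if alerts or log errors are seen for the batch.
    - observe_seconds: team-defined observation interval per batch.
    """
    completed = 0
    for batch in batches:
        for target in batch:
            apply_change(target)

        # Observation starts with the very first batch and is repeated
        # for every batch before the next one is allowed to start.
        waited = 0.0
        while True:
            if has_anomaly(batch):
                return completed  # halt the rollout, limiting blast radius
            if waited >= observe_seconds:
                break
            step = min(poll_seconds, observe_seconds - waited)
            sleep(step)
            waited += step
        completed += 1
    return completed
```

The `sleep` parameter is injectable only so that the loop can be exercised without real waiting. A return value smaller than `len(batches)` means the rollout was halted and the remaining batches were never touched.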
Levels of Observability
Observability coverage spans four layers, categorized by monitoring target and method:
- Infrastructure monitoring: Covers data centers, networks, and other infrastructure. In cloud-based Kubernetes environments, this includes performance monitoring of host nodes and core network components. Alibaba Cloud CloudMonitor provides this coverage, tracking metrics such as node load, CPU, memory, and network utilization.
- System application monitoring: Covers instances, middleware, and other foundational services. CloudMonitor provides this observability. For cloud-native metrics, Alibaba Cloud Managed Service for Prometheus (ARMS Prometheus) also meets these requirements.
- Business monitoring: Collects application-level data such as request counts, success rates, and response times to produce business health metrics. Alibaba Cloud ARMS provides code-level visualization for defining business requests, along with rich performance metrics and diagnostic capabilities. Alibaba Cloud Log Service (SLS) supports custom metrics: you can define custom log formats, collect them through SLS, and configure dashboards to monitor business health or perform system audits.
- User feedback monitoring: Captures issues with feature availability that users report through channels such as public feedback and customer complaints.
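To illustrate the kind of business health metrics described under business monitoring, the sketch below derives request count, success rate, and average response time from custom log lines. It is an assumption-laden example: the `key=value` log format, the field names `status` and `rt_ms`, and the rule that a 2xx status counts as success are all hypothetical, and the code does not call ARMS or SLS. In a real setup the equivalent aggregation would typically be configured in the log or monitoring service itself.

```python
from typing import Iterable, NamedTuple


class BusinessHealth(NamedTuple):
    request_count: int
    success_rate: float      # fraction of requests in the range 0.0 to 1.0
    avg_response_ms: float


def summarize(log_lines: Iterable[str]) -> BusinessHealth:
    """Aggregate business health metrics from custom log lines.

    Assumed (hypothetical) line format: space-separated key=value pairs,
    for example "path=/order status=200 rt_ms=100".
    Lines without usable status and rt_ms fields are skipped.
    """
    count = 0
    successes = 0
    total_ms = 0.0
    for line in log_lines:
        fields = dict(
            part.split("=", 1) for part in line.split() if "=" in part
        )
        if "status" not in fields or "rt_ms" not in fields:
            continue
        try:
            rt_ms = float(fields["rt_ms"])
        except ValueError:
            continue  # malformed response time, ignore the line
        count += 1
        # Assumption: any 2xx status counts as a successful request.
        if fields["status"].startswith("2"):
            successes += 1
        total_ms += rt_ms
    if count == 0:
        return BusinessHealth(0, 0.0, 0.0)
    return BusinessHealth(count, successes / count, total_ms / count)
```

Comparing these values before and during a change batch gives a simple business-level signal to observe alongside infrastructure and system metrics.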