Design Principles for Observability - Well-Architected Framework (WAF) - Alibaba Cloud Help Center

Observability design gives you the visibility to understand, monitor, and continuously improve your system's health — across metrics, traces, logs, dashboards, and alerts.

In cloud-native and microservices architectures, a single user request can touch dozens of services across distributed infrastructure. Without deliberate observability design, failures are difficult to detect and diagnose. A well-designed observability system covers five interconnected areas: monitoring metrics, distributed tracing, logging, monitoring dashboards, and alerts. Together, they provide a complete picture of system health and a feedback loop for continuously improving reliability and performance.

Monitoring metrics

Metrics are quantitative measurements of system behavior over time — CPU usage, memory consumption, network traffic, request latency, error rates, and more. Collecting the right metrics lets you understand system health at a glance and detect anomalies before they become outages.

To implement effective metrics monitoring:

Instrument your services to emit metrics at regular intervals.
Use a time-series database to store and query metric data efficiently.
Define alert thresholds so anomalies trigger notifications automatically.
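
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `MetricStore` is a hypothetical in-memory stand-in for a time-series database, and the metric name, sample values, and 90 percent threshold are invented for the example.

```python
import time
from collections import defaultdict, deque


class MetricStore:
    """Hypothetical in-memory stand-in for a time-series database."""

    def __init__(self, max_points=1000):
        # One bounded series of (timestamp, value) samples per metric name.
        self._series = defaultdict(lambda: deque(maxlen=max_points))

    def record(self, name, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self._series[name].append((ts, value))

    def query(self, name, since):
        """Return the values recorded at or after the `since` timestamp."""
        return [v for ts, v in self._series[name] if ts >= since]


def check_threshold(store, name, threshold, window_seconds, now):
    """Return True when the average over the window exceeds the threshold."""
    values = store.query(name, since=now - window_seconds)
    if not values:
        return False  # No samples in the window, so nothing to evaluate.
    return sum(values) / len(values) > threshold


store = MetricStore()

# Invented samples: CPU usage in percent, emitted every 10 seconds.
for i, cpu in enumerate([40, 45, 92, 95, 97]):
    store.record("cpu_usage_percent", cpu, timestamp=1000 + 10 * i)

# Evaluate the last 20 seconds against an assumed 90 percent threshold.
alert = check_threshold(store, "cpu_usage_percent", threshold=90,
                        window_seconds=20, now=1040)
print("alert firing:", alert)
```

Averaging over a window rather than checking a single sample is one simple way to keep a momentary spike from triggering a notification; the window length is a tuning choice.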

Common tools include Prometheus, Grafana, Zabbix, and Alibaba Cloud CloudMonitor.

Distributed tracing

In distributed systems, a single request flows through multiple services. When something fails, you need to trace that request end to end — across every service, queue, and database it touched — to pinpoint the root cause.

Distributed tracing works by attaching a unique trace ID to each incoming request. As the request propagates through the system, every component records its span — the segment of work it performed — tagged with that ID. When an issue occurs, you can reconstruct the full call chain and identify exactly where latency or failures originated.
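
The mechanism can be sketched with the Python standard library alone. The example below is an illustration, not how any particular tracing tool is implemented: `Span`, `start_span`, `COLLECTED`, and the service names are hypothetical, and the three "services" run in one process. In a real system the trace ID would travel between processes, typically in request headers.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # Shared by every span of one request.
    span_id: str              # Unique to this segment of work.
    parent_id: Optional[str]  # None for the first span of the request.
    name: str
    start: float
    end: Optional[float] = None

    @property
    def duration(self):
        return None if self.end is None else self.end - self.start


COLLECTED = []  # Stand-in for a tracing backend.


@contextmanager
def start_span(name, parent=None):
    span = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        name=name,
        start=time.perf_counter(),
    )
    try:
        yield span
    finally:
        span.end = time.perf_counter()
        COLLECTED.append(span)


def handle_request():
    # Each nested block plays the role of a separate component.
    with start_span("gateway") as root:
        with start_span("order-service", parent=root) as order:
            with start_span("db-query", parent=order):
                time.sleep(0.01)  # Simulated slow query.
    return root.trace_id


def call_chain(spans, trace_id):
    """Rebuild the call chain of one trace as (depth, span) pairs."""
    children = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    ordered = []

    def visit(parent_id, depth):
        for span in children.get(parent_id, []):
            ordered.append((depth, span))
            visit(span.span_id, depth + 1)

    visit(None, 0)
    return ordered


trace_id = handle_request()
chain = call_chain(COLLECTED, trace_id)
for depth, span in chain:
    print("  " * depth + f"{span.name}: {span.duration * 1000:.1f} ms")
```

Because each span stores its parent's ID, the chain can be rebuilt from the spans alone, whatever order they arrive in at the backend.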

Common open-source tracing tools include Jaeger, Zipkin, SkyWalking, and CAT. Alibaba Cloud provides Application Real-Time Monitoring Service (ARMS) for distributed tracing.

Logging

Logs capture discrete events as they happen — successful operations, errors, warnings, configuration changes, and security-relevant actions. They complement metrics and traces by providing the granular context needed for root-cause analysis.

Effective logging requires more than writing events to disk. As log volume grows, unmanaged logs consume storage and slow down retrieval. Apply these practices to keep logs useful:

Record the right events: Focus on errors, warnings, and critical operations. Avoid excessive verbosity in production logs.
Structure your logs: Use a consistent format (such as JSON) so logs can be parsed, queried, and correlated programmatically.
Filter and archive: Route high-volume debug logs to cheaper storage tiers, and set retention policies to control costs.
Correlate with traces: Include trace IDs in log entries so you can jump from a trace span directly to the associated log lines.
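
Several of these practices can be combined in a short sketch using Python's standard `logging` module. The `JsonFormatter` class, the `checkout` logger name, the field names, and the trace ID value are assumptions made for this example, and an in-memory stream stands in for a real log destination.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Present only when the caller passes it through `extra`.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


stream = io.StringIO()  # Stand-in for a file or log shipper.
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.WARNING)  # Record warnings and errors, drop debug noise.
logger.addHandler(handler)
logger.propagate = False

logger.debug("cart contents loaded")  # Below the level, so not recorded.
logger.error("payment declined", extra={"trace_id": "4bf92f3577b34da6"})

# Structured entries can be parsed and filtered by trace ID.
parsed = [json.loads(line) for line in stream.getvalue().splitlines()]
matches = [e for e in parsed if e["trace_id"] == "4bf92f3577b34da6"]
print(matches[0]["message"])
```

Keeping the trace ID as a dedicated field, rather than embedding it in the message text, lets a log query filter on it directly.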

Monitoring dashboards

Raw metrics and traces are most useful when surfaced in dashboards that make system state immediately readable. A well-designed dashboard lets you assess health at a glance, compare current behavior against historical baselines, and detect emerging trends before they escalate.

Design dashboards around your system's health model — the set of conditions that indicate it is operating within acceptable bounds. Organize panels by tier (infrastructure, platform, application) and correlate related signals on the same view so problems in one layer can be traced to their effect in another. Common tools include Grafana and Kibana.
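
One way to make a health model concrete before building panels is to write it down as data. The sketch below is only an illustration: the signal names, limits, and sample readings are invented, and `evaluate` is a hypothetical helper rather than part of any dashboard tool.

```python
# Hypothetical health model: for each tier, the conditions that must hold
# for the system to count as operating within acceptable bounds.
HEALTH_MODEL = {
    "infrastructure": {"cpu_usage_percent": lambda v: v < 85},
    "platform": {"queue_depth": lambda v: v < 1000},
    "application": {
        "p99_latency_ms": lambda v: v < 500,
        "error_rate_percent": lambda v: v < 1,
    },
}


def evaluate(model, signals):
    """Return, per tier, whether it is healthy and which signals fail."""
    report = {}
    for tier, conditions in model.items():
        failing = [name for name, ok in conditions.items()
                   if not ok(signals[name])]
        report[tier] = {"healthy": not failing, "failing": failing}
    return report


# Invented readings: a deep queue at the platform tier alongside
# high latency at the application tier.
signals = {
    "cpu_usage_percent": 62,
    "queue_depth": 4200,
    "p99_latency_ms": 870,
    "error_rate_percent": 0.4,
}

report = evaluate(HEALTH_MODEL, signals)
for tier, status in report.items():
    print(tier, "OK" if status["healthy"] else f"FAILING {status['failing']}")
```

Viewing the tiers side by side shows that the platform and application tiers are degraded at the same moment, which is the kind of cross-layer correlation a shared dashboard view is meant to surface.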

Alerts

Alerts notify your team when system behavior crosses a defined threshold or matches a known risk pattern, such as a spike in error rates, an unusual access pattern, or a potential security event like unauthorized access or a malicious attack.

Effective alerting requires clear signal-to-noise discipline:

Alert on symptoms, not causes: Trigger on user-visible impact (elevated error rate, high latency) rather than individual low-level metrics.
Set meaningful thresholds: Base thresholds on historical baselines, not arbitrary values.
Route alerts to the right audience: Send infrastructure alerts to platform engineers and application alerts to development teams.
Suppress duplicates: Implement alert grouping to prevent notification storms during widespread incidents.
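
The following sketch shows one possible way to apply three of these practices: deriving a threshold from history, grouping duplicates, and routing by tier. Everything specific in it is an assumption for illustration: the mean-plus-three-standard-deviations rule, the `ROUTES` table, the team names, and the sample alerts.

```python
import statistics
from collections import defaultdict


def baseline_threshold(history, sigmas=3.0):
    """Assumed rule: mean plus `sigmas` standard deviations of past values."""
    return statistics.mean(history) + sigmas * statistics.pstdev(history)


def group_alerts(alerts, keys=("service", "symptom")):
    """Collapse alerts that share the same grouping keys into one group."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[tuple(alert[k] for k in keys)].append(alert)
    return dict(grouped)


# Hypothetical routing table: tier -> receiving team.
ROUTES = {"infrastructure": "platform-engineers", "application": "dev-team"}

# Invented history of error rates (percent) for one service.
history = [0.8, 1.0, 1.2, 0.9, 1.1]
threshold = baseline_threshold(history)
print(f"threshold: {threshold:.2f}%  breached: {4.5 > threshold}")

# During an incident, several hosts report the same symptom.
raw_alerts = [
    {"service": "checkout", "symptom": "high_error_rate",
     "tier": "application", "host": "app-1"},
    {"service": "checkout", "symptom": "high_error_rate",
     "tier": "application", "host": "app-2"},
    {"service": "checkout", "symptom": "high_error_rate",
     "tier": "application", "host": "app-3"},
    {"service": "storage", "symptom": "high_latency",
     "tier": "infrastructure", "host": "node-7"},
]

# One notification per group, sent to the team that owns the tier.
notifications = []
for (service, symptom), members in group_alerts(raw_alerts).items():
    team = ROUTES[members[0]["tier"]]
    notifications.append((team, service, symptom, len(members)))

for note in notifications:
    print(note)
```

Four raw alerts become two notifications, and each notification carries the number of affected hosts instead of repeating the message once per host.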

Security monitoring is a specialized case of alerting. Collect security logs to detect vulnerabilities and attack patterns, and configure real-time alerts to notify the appropriate personnel so they can respond quickly.
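
As a small illustration of detecting an attack pattern in security logs, the sketch below flags repeated failed logins from one source within a time window. The `FailedLoginDetector` class, the event fields, the limit of three failures per 60 seconds, and the sample events are all assumptions made for the example; the IP addresses come from ranges reserved for documentation.

```python
from collections import defaultdict, deque


class FailedLoginDetector:
    """Flag a source that exceeds a failure limit within a sliding window."""

    def __init__(self, max_failures=3, window_seconds=60):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # source_ip -> failure timestamps

    def observe(self, event):
        """Return an alert dict when the limit is exceeded, else None."""
        if event["action"] != "login" or event["success"]:
            return None
        recent = self._failures[event["source_ip"]]
        recent.append(event["timestamp"])
        # Forget failures that have fallen out of the window.
        while recent and recent[0] <= event["timestamp"] - self.window_seconds:
            recent.popleft()
        if len(recent) > self.max_failures:
            return {"type": "possible_brute_force",
                    "source_ip": event["source_ip"],
                    "failures": len(recent)}
        return None


# Invented security log events.
events = [
    {"timestamp": 0, "action": "login", "success": False, "source_ip": "203.0.113.5"},
    {"timestamp": 5, "action": "login", "success": True, "source_ip": "198.51.100.7"},
    {"timestamp": 10, "action": "login", "success": False, "source_ip": "203.0.113.5"},
    {"timestamp": 20, "action": "login", "success": False, "source_ip": "203.0.113.5"},
    {"timestamp": 30, "action": "login", "success": False, "source_ip": "203.0.113.5"},
]

detector = FailedLoginDetector()
alerts = [a for a in (detector.observe(e) for e in events) if a is not None]
print(alerts)
```

Evaluating each event as it arrives, rather than scanning logs in batches, is what allows the alert to reach the responsible personnel in near real time.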

---

Observability is not a one-time design decision. As your system evolves — new services, changing traffic patterns, updated infrastructure — revisit your instrumentation, refine alert thresholds, and add dashboards for newly critical components. Treat production observability data as a feedback loop: use it to inform architectural improvements, capacity planning, and reliability investments.