Performance Monitoring - Well-Architected Framework (WAF) - Alibaba Cloud Help Center

Performance monitoring tracks system and application metrics in real time, helping you identify bottlenecks, optimize resources, and maintain reliability.

Performance degradation can strike at any time due to traffic surges, system changes, or code decay. For example, a spike in visits during a major sales event may cause request timeouts and failed orders, an app update may introduce page lag that drives user complaints, or a long-running system may run into out-of-memory (OOM) errors or connection exhaustion.

Performance degradation directly affects user experience. If the time to open a product details page increases from 0.5 s to 3 s, users are far less likely to continue browsing. Further degradation past timeout thresholds (for example, 5 s) can disrupt service availability, cause significant business losses, and damage your reputation. In short, performance degradation can determine the success or failure of a business.

The best practice for addressing performance degradation follows a prevention-first approach. Because degradation inevitably affects user experience or business metrics once it occurs, performance optimization should be built into architectural design, code development, and testing to avoid common issues before they reach production. Equally important is the ability to detect performance risks early, pinpoint bottlenecks quickly, and resolve them promptly when they do arise.

Whether you are preventing or responding to degradation, an accurate, real-time performance monitoring system helps your team identify bottlenecks and their impact so you can take targeted action. The larger and more complex your IT system, the more critical it is to have a comprehensive, easy-to-use monitoring system that enables early intervention and fast root-cause analysis.

Performance monitoring tracks and records software, hardware, or system performance metrics at runtime for analysis and optimization. By collecting and analyzing performance data, you can identify bottlenecks, optimize resource allocation, and improve system reliability and stability. Typical metrics include system resources such as CPU, memory, disk, and network, as well as application-level indicators such as response time, throughput, and concurrency.
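As a rough illustration of the application-level indicators mentioned above, the following Python sketch derives throughput, response time, and peak concurrency from a list of request records. The `Request` type, the `summarize` function, and the field names are hypothetical and chosen only for this example; they are not part of any Alibaba Cloud API. System resource metrics such as CPU and memory would normally come from a monitoring agent and are not covered here.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record used only for this sketch."""
    start: float     # seconds since the start of the observation window
    duration: float  # response time in seconds


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def peak_concurrency(requests):
    """Largest number of requests in flight at the same moment."""
    events = []
    for r in requests:
        events.append((r.start, 1))                 # request begins
        events.append((r.start + r.duration, -1))   # request ends
    # Tuples sort by time, then by delta, so an end (-1) is processed
    # before a start (+1) that falls on the same timestamp.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def summarize(requests, window_seconds):
    """Summarize a non-empty list of requests observed over one window."""
    durations = sorted(r.duration for r in requests)
    return {
        "throughput_rps": len(requests) / window_seconds,
        "avg_response_s": sum(durations) / len(durations),
        "p95_response_s": percentile(durations, 95),
        "peak_concurrency": peak_concurrency(requests),
    }
```

Reporting a high percentile alongside the average is a deliberate choice in this sketch: a single slow request (for example, 3 s among several 0.5 s requests) barely moves the average but shows up clearly in the 95th percentile.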