Performance testing

更新时间:
复制 MD 格式

Performance testing evaluates system behavior under various load conditions to identify bottlenecks and validate capacity.

Performance testing uses automated tools to evaluate system performance metrics by simulating normal, peak, and abnormal load conditions. You must have a basic understanding of the system's performance and conduct tests within a defined environment. Based on objectives, performance testing is categorized into load testing, stress testing, concurrency testing, configuration testing, and reliability testing.

  • Load testing evaluates how a system's performance metrics change as the load gradually increases.

  • Stress testing identifies a system's bottleneck or the point of unacceptable performance to determine the maximum service level the system can provide.

  • Concurrency testing simulates concurrent user access to identify performance issues, such as deadlocks, that occur when multiple users access the same software, module, or data record simultaneously.

  • Configuration testing involves adjusting a system's software or hardware environment to understand the performance impact of different configurations. This process helps determine the optimal resource allocation principles for the system.

  • Reliability testing applies a specific business load to a system and allows it to run for an extended period to verify its stability.

Scenarios

Performance testing can be used in the following scenarios:

  1. Support for new system launches: Before a new system goes online, you can run performance tests to obtain a clear understanding of its load capacity. This helps ensure a positive user experience after launch, based on the estimated number of users.

  1. Validation for technology upgrades: During system refactoring, you can use comparative performance tests to validate the efficiency of new technologies and guide the refactoring process.

  1. Ensuring stability during peak business hours: Before peak business hours, you can conduct thorough performance tests to ensure the stability of services during peak times, such as sales promotions, and prevent business disruption.

  1. Site capacity planning: Performance testing enables fine-grained capacity planning for your site and guides resource allocation in distributed systems.

  1. Detection of performance bottlenecks: Performance testing helps detect system bottlenecks, enabling targeted optimizations to improve overall performance.

Performance testing is integral to every phase of the system lifecycle, from development and refactoring to launch and optimization, and is crucial for ensuring system stability.

Best practices

Define performance test objectives and baselines

Performance test objectives typically originate from project plans or business requirements. Define the business, system resource, and application metrics before testing.

For business metrics, focus on the following three "golden metrics":

  • Business response time: The key metric is response time (RT), which business departments typically prioritize. The expected value varies by system, but less than 1 second is recommended. For large e-commerce systems, business RT is typically in the tens of milliseconds.

  • Service throughput: The key metric is Transactions Per Second (TPS), a critical measure of system processing capacity. Determine the target TPS by referencing similar systems in your industry and considering specific business needs. For small and medium-sized enterprises, TPS typically ranges from 50 to 1,000. For banks, it ranges from 1,000 to 50,000. For large e-commerce systems, it ranges from 30,000 to 300,000.

  • Request success rate: Measures the proportion of successful business operations under load. The industry standard is generally greater than 99.6%.

After you define these business metrics, establish a performance baseline. You can then compare test results, whether from manual or automated runs, against this baseline to determine whether the test passed.

Determine the performance testing environment

New production environment

If you have just migrated to the cloud or a new data center, you can run a full-trace business stress test before putting the system into production. This method is effective because the test environment is the actual production environment. You can import desensitized data from your actual production environment. The data volume, including transaction data, records, and core business logs, should cover at least six months. You must also ensure data integrity, including data that spans multiple systems. You can then build traffic for core business operations, such as logons, shopping cart activities, and transactions, based on this data. Before going live, you must clean up the test data and re-initialize the foundational data.

Proportional performance environment

A proportional performance environment is a separately partitioned area within the production environment that has proportional capacity and shares the access layer. This approach has a higher cost but offers simplicity, manageable risks, and more accurate capacity planning.

In a proportional performance environment, you must have a shared access layer. This includes services such as CDN, Border Gateway Protocol (BGP), WAF, SLB, and Layer 4 and Layer 7 load balancing. This ensures the critical network access layer is identical to production, which helps in detecting issues. The capacity ratio of backend services should be at least 1/4 of the production environment. A ratio smaller than this can significantly reduce accuracy. It is recommended that you use the same configuration for the database. You can prepare the foundational data using the same method as for a new production environment. You must ensure that the business data volume covers at least six months and that data integrity is maintained.

Production environment

There are two main ways to handle foundational data in a production environment. The first method requires no changes to the database. You can run tests using test accounts in the base tables, which must also have complete and relevant data. After testing, you must purge the generated transaction data. You can do this with a fixed SQL script or through the system interface. The second method is to tag the stress testing traffic, for example, with a custom header. The system then identifies and propagates this tag during business processing, which includes asynchronous messages and middleware. The data is ultimately written to a shadow table or a shadow database. Additionally, you should conduct stress tests in the production environment during off-peak hours to avoid affecting live business operations. For either method, you can deploy a dedicated cluster for stress testing to further minimize the impact on production services.

In cloud environments, the proportional performance approach is recommended. You can quickly scale out during stress testing for sufficient compute resources, then scale in afterward to reduce costs.

Build performance test scenarios

Before running a performance test, create test scripts and prepare the corresponding input parameter files to simulate actual request traces and business load as closely as possible. To ensure scripts reflect real user behavior and cover all interfaces, record user actions from a browser or client. These recordings are automatically converted into stress testing scripts. Both the open-source JMeter tool and Alibaba Cloud PTS provide script recording capabilities to help you build test scripts efficiently.

Run the performance test and analyze the results

After you create the test scripts and input parameter files, run the test with a performance testing tool according to your design. Monitor the request success rate, response time, and service throughput during the test. A significant inflection point in the metrics, such as a sharp drop in success rate or throughput or a sharp increase in response time, indicates a performance bottleneck. Use resource monitoring and application monitoring to locate the bottleneck and perform elastic scaling as needed. After optimization, rerun the test to validate the scaling effect.

Conduct continuous testing to prevent performance degradation

After the performance test passes, run regression tests regularly and compare results against the performance baseline to prevent degradation during continuous iteration. Align the testing frequency with your agile development cycle, typically every one to two weeks. If you detect significant performance degradation, promptly roll back the system version and use performance monitoring to investigate the bottleneck.