Resource usage optimization - Well-Architected Framework (WAF) - Alibaba Cloud Help Center

Optimizing resource usage is one of the most effective ways to reduce cloud costs. Poor utilization often stems from limited cloud experience or technical debt — for example, enterprises new to cloud-native architectures may lack cost visibility and control, while misuse of cloud-native technologies can drive up spending. As businesses grow, resource scale and costs tend to increase annually. Efficient cluster and resource management improves overall utilization and lowers costs.

Typical resource optimization scenarios

Infrastructure cloud-native transformation

Before a cloud-native transformation, traditional IT architectures are migrated to the cloud or refactored for a cloud-native architecture. Because this stage involves major infrastructure changes, it is a critical time to plan and manage costs. Focus on the following items.

Use stress testing tools and the ACK cost insight feature to evaluate the capacity of your business systems.
Select appropriate instance types for infrastructure resources. For more information, see Instance Type Selection Guide.
Use discounts to save on infrastructure costs. For more information, see Savings Plans.

Application cloud-native transformation

During a cloud-native transformation, applications can use Kubernetes elastic scaling and co-location mechanisms to achieve high availability and stability. When traffic fluctuates, containers scale based on standard deployment units, enabling peak-load shifting through intelligent, demand-driven resource allocation.

Stable cloud-native business

After the cloud-native transformation, create cost governance policies that adapt to dynamic business changes. Common scenarios include:

Your business has clear cyclical fluctuations, such as peak traffic from 9:00 AM to 5:00 PM. Use the cost insights feature to monitor these patterns and adopt scaling capabilities to optimize costs. For more information, see Cost insights feature description.
Frequent turnover between new and legacy services is common in emerging business sectors, making it difficult to accurately size resources early on. Use the cost insights feature to monitor resource costs and the resource profiling feature to choose appropriate specifications. For more information, see Cost Insights and Resource Profiling.
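For the first scenario above (peak traffic from 9:00 AM to 5:00 PM), the effect of scheduled scaling can be estimated with a few lines of Python. This is a minimal sketch, not an Alibaba Cloud API: the function name and the instance counts (10 at peak, 3 off-peak) are hypothetical values chosen for illustration.

```python
def desired_instances(hour: int, peak: int, off_peak: int,
                      peak_start: int = 9, peak_end: int = 17) -> int:
    """Return the instance count for a given hour of the day (0-23).

    The peak window defaults to 9:00-17:00, matching the example in the text.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return peak if peak_start <= hour < peak_end else off_peak


# Hypothetical capacity figures, for illustration only.
PEAK, OFF_PEAK = 10, 3

fixed_hours = 24 * PEAK
scheduled_hours = sum(desired_instances(h, PEAK, OFF_PEAK) for h in range(24))

print(f"fixed capacity:     {fixed_hours} instance-hours/day")
print(f"scheduled capacity: {scheduled_hours} instance-hours/day")
```

With these assumed numbers, scheduled scaling uses 128 instance-hours per day instead of 240. The real saving depends on the actual traffic pattern and on how instances are billed.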

Assessing the current state of your resources helps you understand usage patterns. You can then design a resource usage architecture and optimize utilization to further reduce costs.

Assess the current state of resources

As cloud adoption grows, enterprises often find it challenging to manage resources scattered across different services. Periodically review, adjust, and update your cloud resources using visualization tools. This helps identify and eliminate waste from orphaned resources, idle or underutilized resources, external service resources not bound to public IP addresses or gateways, resources without attached disks, and databases without multi-zone deployments.

Alibaba Cloud Cloud Architect Design Tools (CADT) is a free tool for visualizing and managing cloud resources. It takes a resource-centric approach that reduces the complexity and time required for architecture management. CADT provides pre-built architecture templates, automatically discovers existing resources to generate visual diagrams, and supports drag-and-drop architecture design. It also automates cloud service configuration and provides full lifecycle management for resources, from creation to deletion.

Select suitable cloud products and resource specifications

Select compute instances and storage classes that match your application and resource requirements. Use the latest generation of instances and technologies for new computing power scenarios.

Compute instance selection

When you use ECS with self-built or open source deployment modes, select instances tailored to your workload type. For example, choose high-memory instances for general-purpose computing and caching, and heterogeneous computing instances for big data and AI deep learning. Alibaba Cloud recommends specific instance families for each scenario in the ECS Selection Best Practices. This lets you leverage fine-grained features such as CPU, memory, and IOPS for an optimal price-performance ratio.

Focus on the latest generation of instances

The latest generation of instances can deliver the same performance as older generations with fewer instances or lower specifications, which reduces overall costs. Alibaba Cloud continuously upgrades its underlying infrastructure. For the same specifications, the latest generation of elastic computing instances typically offers higher clock speeds, greater network throughput, and an improved user experience. These performance advantages translate into cost savings and better value.

Storage selection

Evaluate your business and technical requirements to select the right storage solution.

Elastic Block Storage (EBS) is a high-performance, low-latency block device for ECS that supports random read and write operations and meets most general-purpose storage needs. You can format an EBS device and create a file system on it, just like a physical hard disk.

File Storage NAS is a distributed file system that provides shared access, elastic scalability, high reliability, and high performance across thousands of compute nodes, such as ECS instances and ACK clusters. You can migrate business systems to the cloud without modifying your applications.

When selecting a storage solution, also consider business metrics such as user count, total data volume, compression ratio, daily data growth, read/write patterns (read-intensive or write-intensive), data requirements (transactional or analytical), supported data engines (relational, non-relational, key-value, row, column, graph, or document), and concurrency requirements during peak and off-peak hours.

Database selection

Different business scenarios require different database products and specifications. Relational databases are ideal for Online Transaction Processing (OLTP) services, where row data is updated through add, delete, and modify operations that require high real-time performance and stability. Non-relational databases address performance and cost challenges when data volumes grow into the tens of billions of records. ApsaraDB for MongoDB is schema-free, which suits startups whose data models are still changing. You can store structured data in a relational database, flexible-schema data in MongoDB, and hot data in ApsaraDB for Redis or ApsaraDB for Memcache. This hybrid approach enables efficient data access and reduces storage costs. When selecting a database, balance consistency, availability, and partition tolerance: a distributed system cannot fully guarantee all three at the same time (the CAP theorem). Also evaluate performance, elastic scaling, ease of maintenance, data security, and disaster recovery to achieve the best cost-performance ratio.

Design and use resource architecture properly

Shut down unused resources

Automatically shut down virtual machines that are not needed during non-working hours, and delete servers after temporary testing tasks are complete. To identify idle resources, review resource performance data from Alibaba Cloud Monitor for the past 30 days. A server is considered idle if its peak CPU utilization is less than 1%, disk I/O is less than 10, and network utilization is below 1%. Memory is not included in these metrics because it is an occupied resource.
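The idle criteria above can be written as a simple check. The following is a minimal sketch that assumes the 30-day peak values have already been exported from the monitoring service. The `ServerPeaks` type and its field names are hypothetical, and the disk I/O threshold is kept as a bare number because the text does not state its unit.

```python
from dataclasses import dataclass

# Thresholds taken from the paragraph above. The unit of the disk I/O
# threshold is not stated in the text, so it is treated as a bare number.
CPU_PEAK_MAX_PCT = 1.0
DISK_IO_PEAK_MAX = 10.0
NETWORK_PEAK_MAX_PCT = 1.0


@dataclass
class ServerPeaks:
    """Hypothetical container for the 30-day peak metrics of one server."""
    name: str
    cpu_peak_pct: float
    disk_io_peak: float
    network_peak_pct: float


def is_idle(server: ServerPeaks) -> bool:
    """A server counts as idle only if all three peaks are below threshold."""
    return (
        server.cpu_peak_pct < CPU_PEAK_MAX_PCT
        and server.disk_io_peak < DISK_IO_PEAK_MAX
        and server.network_peak_pct < NETWORK_PEAK_MAX_PCT
    )


def find_idle(servers: list[ServerPeaks]) -> list[str]:
    """Return the names of servers that meet the idle criteria."""
    return [s.name for s in servers if is_idle(s)]
```

Memory is deliberately absent from `ServerPeaks`, in line with the note above that memory is not part of the idle criteria.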

Optimize snapshot costs

Snapshots are a cost-effective solution for data backup and disaster recovery, but costs increase with the number and size of snapshots. Set snapshot policies based on actual business needs — for example, create snapshots daily for core applications and weekly for non-core applications. Regularly delete outdated snapshots and avoid storing application data on system disks.
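Finding outdated snapshots comes down to comparing each snapshot's creation date with a retention window. The sketch below assumes that snapshot IDs and creation dates have already been listed. The function name, the IDs, and the retention periods are hypothetical, and the actual deletion call is left out.

```python
from datetime import date, timedelta


def outdated_snapshots(created: dict[str, date], retention_days: int,
                       today: date) -> list[str]:
    """Return IDs of snapshots created before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(sid for sid, day in created.items() if day < cutoff)


# Hypothetical inventory: snapshot ID -> creation date.
inventory = {
    "snap-core-001": date(2024, 1, 1),
    "snap-core-002": date(2024, 1, 25),
}

# Keep 7 days of snapshots; anything older is a deletion candidate.
candidates = outdated_snapshots(inventory, retention_days=7,
                                today=date(2024, 1, 31))
print(candidates)
```

A review step before deletion is advisable, because a snapshot that looks outdated by age may still be the only copy needed for recovery.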

Optimize storage resource usage

Different storage products and creation methods support different billing methods. Regularly evaluate your storage resources, configure appropriate lifecycles, and clean up unused content. For big data scenarios, use storage capacity plans to reduce costs. You can also optimize your storage architecture — for example, when you use Simple Log Service (SLS), optimize both the storage structure and content:

Optimize the storage structure

If your application writes 100 GB of logs daily, storing them all for 30 days with a full-text index can be costly. Suppose operation logs and error logs account for 20% of the total and need 30-day retention, while the remaining logs only need 7-day retention. The following data transformation plan can save nearly 25% of the cost:

Create a source Logstore to store data for 3 days without creating an index.
Create a destination Logstore 1 to store operation logs and error logs for 30 days with an index.
Create a destination Logstore 2 to store general logs for 7 days with an index.
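The effect of this plan on stored volume can be checked with simple arithmetic. The sketch below compares the steady-state volume of raw logs before and after the split. It models raw log size only: index storage and any other billed items are not included, so it illustrates the mechanism and does not reproduce the cost figure quoted above.

```python
def steady_state_gb(daily_gb: float, retention_days: int) -> float:
    """Volume held once the retention window is full (raw log size only)."""
    return daily_gb * retention_days


DAILY_GB = 100            # daily log volume from the example above
IMPORTANT_SHARE_PCT = 20  # share of operation logs and error logs

important_daily = DAILY_GB * IMPORTANT_SHARE_PCT / 100
general_daily = DAILY_GB - important_daily

# Baseline: all logs kept for 30 days.
baseline_gb = steady_state_gb(DAILY_GB, 30)

# Plan: 3-day source Logstore plus two destination Logstores.
plan_gb = (
    steady_state_gb(DAILY_GB, 3)             # source Logstore, no index
    + steady_state_gb(important_daily, 30)   # destination Logstore 1
    + steady_state_gb(general_daily, 7)      # destination Logstore 2
)
volume_reduction = 1 - plan_gb / baseline_gb

print(f"baseline: {baseline_gb:.0f} GB, plan: {plan_gb:.0f} GB")
print(f"stored volume reduced by {volume_reduction:.0%}")
```

Under these assumptions the stored volume drops from 3,000 GB to 1,460 GB. The overall cost saving is smaller than the volume reduction because other charges are not reduced in the same proportion.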

Optimize the storage content

If you only need certain fields from the raw logs, use data transformation to store relevant fields for 30 days with an index and redundant fields for only 3 days:

Create a source Logstore to store data for 3 days without creating an index.
Create a destination Logstore to store the transformed logs, which contain only the relevant fields, for 30 days with an index.

Assuming that each log is about 60% of its original size after transformation, you can save about 30% of the cost compared to the cost before transformation.
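One way to arrive at a figure of about 30% is to compare stored volume before and after transformation. This is an assumption about how the estimate could be derived, not a statement of the official calculation. The sketch reuses the 100 GB daily volume from the earlier example, although the resulting ratio does not depend on the daily volume.

```python
DAILY_GB = 100        # assumed, carried over from the earlier example
SIZE_AFTER_PCT = 60   # transformed log size as a share of the original

# Before: raw logs kept for 30 days.
baseline_gb = DAILY_GB * 30

# After: raw logs kept for 3 days, transformed logs kept for 30 days.
source_gb = DAILY_GB * 3
destination_gb = DAILY_GB * SIZE_AFTER_PCT / 100 * 30
plan_gb = source_gb + destination_gb

saving = 1 - plan_gb / baseline_gb
print(f"baseline: {baseline_gb} GB, plan: {plan_gb:.0f} GB, saving: {saving:.0%}")
```

The plan holds 2,100 GB at steady state instead of 3,000 GB, a 30% reduction in stored volume.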

Plan your network properly

Use internal networks for communication between applications whenever possible. For traffic between different accounts or VPCs, plan your network products for cross-region and cross-country connections. Re-evaluate your public network egress design and use services such as NAT Gateway for unified ingress and egress management. Monitor network traffic usage and costs in real time to prevent cost spikes from unexpected large-scale data transfers.

Analyze and optimize database services

Analyze your database instances across multiple dimensions and monitor metrics such as peak CPU utilization, disk space, memory usage, connections, QPS, and IOPS. Evaluate instance status based on actual performance and make dynamic adjustments to achieve optimal load utilization.

Database Autonomy Service (DAS) is a cloud service based on machine learning and expert experience that provides self-perception, self-healing, self-optimization, self-O&M, and self-protection for databases. It eliminates the complexity of manual database management and helps ensure stability, security, and efficiency. DAS also performs automatic SQL diagnosis, optimization, and index creation to keep your database system running in an optimal state.

Introduce elasticity for application workloads

Elastic services for compute resources reduce waste during off-peak hours and lower O&M costs.

ECS elasticity

Elastic compute scaling automatically adjusts compute resources based on business demands and policies — adding ECS instances when demand increases and removing them when demand decreases. Auto Scaling (ESS) is a free service, though you are charged for automatically created ECS instances at standard pay-as-you-go rates. With Auto Scaling, you do not need to manually adjust compute resources, provision capacity in advance, or worry about releasing redundant resources.

Use containerization to improve resource utilization

Containers isolate processes from each other and from the host OS, with each container having its own file system, resources, and child processes. Unlike virtual machines, containers share the underlying OS without the overhead of a hypervisor, resulting in better performance and lower system load. You can precisely allocate CPU and memory to each application, ensuring they do not interfere with each other. Containers significantly reduce the number of virtual machines required by eliminating duplicated operating systems, which directly lowers compute overhead and costs. Using Alibaba Cloud Elastic Container Instance, you can significantly increase compute resource utilization. For more information about container elastic policies, see the following sections.

Optimize resource utilization

Improving resource utilization means maximizing computing power with the fewest resources, while considering business layout, disaster recovery, machine failure rates, and reserved buffer space. Key focus areas include: clarifying utilization statistics standards, optimizing cluster architecture deployment, driving resource operations based on allocation and utilization rates, unifying resource pools and node management, building comprehensive resource monitoring, and implementing fine-grained resource scheduling with isolation controls. Improving utilization in production environments — which account for the largest share of costs — maximizes overall cost-benefit while maintaining service quality.

Clarify resource utilization statistics with Cloud Monitor

Use Cloud Monitor to monitor resource metrics, check service availability, and configure alarms. This gives you a comprehensive overview of resource usage and business operations. You can then replace faulty resources, upgrade high-load resources to ensure continuity, and downgrade underutilized resources to reduce waste.

Use cloud-native elastic scaling for unified resource and node management

Auto Scaling for Container Service automatically and cost-effectively adjusts elastic computing resources based on business needs and policies. It is widely applicable for online service elasticity, large-scale computing training, GPU-based deep learning training and inference, and scheduled periodic workload changes.

Auto Scaling operates at two layers. Scheduling layer elasticity modifies the workload's scheduling capacity — for example, Horizontal Pod Autoscaler (HPA) adjusts replica counts to scale workloads at the scheduling layer. Resource layer elasticity supplements scheduling capacity by adding container resources when the cluster's capacity planning is insufficient.

The elastic components and capabilities of the two layers can be used separately or together. They are decoupled from each other through the capacity status at the scheduling layer.
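To make scheduling layer elasticity concrete, the Kubernetes documentation describes the core HPA calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance (0.1 by default). The sketch below implements only that formula. The replica bounds are hypothetical, and real HPA behavior also accounts for pod readiness, missing metrics, and stabilization windows, which are omitted here.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Simplified HPA-style replica calculation.

    min_replicas and max_replicas are illustrative defaults, not values
    taken from the text.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Average CPU at 90% against a 60% target: scale out from 4 replicas.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

If the resulting replica count exceeds what the cluster can schedule, resource layer elasticity is what adds the missing capacity.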

Use cloud-native resource scheduling to properly schedule resources based on application load

For precise, real-time instance scaling and placement, base resource scheduling on application load characteristics and use elastic scheduling policies. The platform scales out resources when load increases and reclaims them when load decreases, which speeds up resource turnover among tenants and improves utilization. The fine-grained scheduling of ACK provides real-time, proactive, and intelligent scheduling that creates a closed loop of metric collection, online decision-making, batch analytics, and decision optimization.

Manage storage lifecycle

As applications and business systems run over time, enterprises accumulate large volumes of data from increasingly diverse sources.

Recently written data is typically accessed most frequently and is considered "hot." Over time, access frequency decreases — data accessed only a few times a week becomes "warm." Within 3 to 6 months, data accessed only a few times a month or less becomes "cold." Data accessed only once or twice a year is considered "frozen."
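The temperature tiers above can be turned into a classification rule once the qualitative descriptions are given numeric thresholds. The sketch below does this with hypothetical cutoffs: "a few times a week" is read as at least 8 accesses per 30 days and "once or twice a year" as at most 2 accesses per 365 days. These numbers are assumptions and should be tuned to real access logs.

```python
# Hypothetical thresholds; the text only gives qualitative descriptions.
HOT_MIN_PER_30_DAYS = 30     # roughly daily access or more
WARM_MIN_PER_30_DAYS = 8     # roughly a few times a week
FROZEN_MAX_PER_YEAR = 2      # once or twice a year


def classify_temperature(accesses_last_30_days: int,
                         accesses_last_365_days: int) -> str:
    """Map access counts to one of: hot, warm, cold, frozen."""
    if accesses_last_365_days <= FROZEN_MAX_PER_YEAR:
        return "frozen"
    if accesses_last_30_days >= HOT_MIN_PER_30_DAYS:
        return "hot"
    if accesses_last_30_days >= WARM_MIN_PER_30_DAYS:
        return "warm"
    return "cold"


for counts in [(100, 1200), (10, 120), (2, 24), (0, 1)]:
    print(counts, "->", classify_temperature(*counts))
```

A classification like this can then drive tiering decisions, such as which data to move to a lower-cost storage class.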

Growing volumes of cold and frozen data increase storage pressure and costs. When designing a storage lifecycle management architecture, also leave room for performance optimization of frequently accessed hot data.

Manage the full data lifecycle by storing data on different storage media at different cost tiers. In the cloud, storage classes are distinguished by data access frequency, covering all scenarios from hot to cold. This achieves optimal storage costs throughout the data lifecycle while meeting daily business needs.

Alibaba Cloud's various cloud-native storage products provide hot and cold storage technologies:

Object Storage Service (OSS) offers multiple storage classes, including Standard, Infrequent Access, Archive, Cold Archive, and Deep Cold Archive, to cover the full spectrum of data storage scenarios from hot to cold.

PolarDB provides a cold data archiving feature. If some databases and tables in your cluster hold data that is rarely inserted or updated and is read very infrequently, you can use the cold data archiving feature of PolarDB for MySQL to archive this data to low-cost OSS and reduce your data storage costs.

AnalyticsDB for MySQL analyzes statistics and provides three types of optimization suggestions to help you reduce cluster usage costs and improve cluster efficiency: hot and cold data optimization, index optimization, and distribution key optimization.

Resource governance in cloud-native scenarios

Cloud-native technology provides resource sharing, isolation, and elasticity to improve usage efficiency and reduce costs. However, many enterprises actually experience cost increases when adopting containerized elastic computing resources.

Two factors typically drive this. First, technical debt from improper use of cloud-native technologies increases costs. Second, enterprises new to cloud-native architectures often lack effective cost insight and control measures, making it difficult to identify the causes of cost increases. Properly managing resources at the container layer is a real challenge for every enterprise.

By following the Cost administration practices for cloud-native scenarios, you can integrate cost optimization capabilities into the container management platform. This involves aggregating and analyzing costs along two dimensions: physical and logical. The physical dimension includes cluster nodes, node pools, and resource groups, while the logical dimension includes pods, application workloads, and namespaces. By correlating costs across these dimensions and building a complete resource cost profile, you can govern resources more accurately and effectively.
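The two-dimensional aggregation described above can be illustrated with a small example. The sketch below assumes that a cost has already been attributed to each pod. The record layout, names, and amounts are hypothetical and do not come from any Alibaba Cloud API.

```python
from collections import defaultdict

# Hypothetical per-pod cost records. Each pod carries one logical label
# (namespace) and one physical label (node pool).
pods = [
    {"pod": "web-1", "namespace": "shop", "node_pool": "general", "cost": 4.0},
    {"pod": "web-2", "namespace": "shop", "node_pool": "spot", "cost": 2.5},
    {"pod": "etl-1", "namespace": "data", "node_pool": "general", "cost": 6.0},
]


def aggregate(records: list[dict], key: str) -> dict[str, float]:
    """Sum pod costs grouped by the given dimension key."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost"]
    return dict(totals)


by_namespace = aggregate(pods, "namespace")   # logical dimension
by_node_pool = aggregate(pods, "node_pool")   # physical dimension

print("by namespace:", by_namespace)
print("by node pool:", by_node_pool)
```

Both views sum to the same total, so a cost increase seen in one dimension can be traced to its counterpart in the other.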

Use automation to achieve proper resource usage

Use automatic elasticity and automation tools to manage online resources, reducing manual O&M costs and errors from manual operations.

Auto Scaling (ESS): Continuously maintains instance clusters across billing methods, zones, and instance types. This service is suitable for scenarios in which business workloads fluctuate.
Auto provisioning: Deploys instance clusters across multiple billing methods, zones, and instance types with a single click. This feature is suitable for scenarios where stable computing power must be provisioned quickly and spot instances are used to reduce costs.
O&M Orchestration: Defines a set of O&M operations as a template to efficiently execute O&M tasks. This feature is suitable for scenarios that require event-driven, scheduled, batch, or cross-region O&M.
Resource Orchestration Service: Allows you to deploy and maintain stacks that contain multiple cloud resources and dependencies with one click. It is suitable for scenarios in which delivery of an integrated system or environment clone is required.