Get container configuration recommendations with resource profiling


Container Compute Service (ACS) provides a resource profiling feature for Kubernetes-native workloads. This feature analyzes historical resource usage data to recommend container resource configurations, which simplifies setting resource requests and limits. This topic explains how to use the resource profiling feature in the console and with the CLI.

Prerequisites and notes

  • The ack-koordinator component must be installed. For more information, see ack-koordinator (formerly ack-slo-manager).

  • To ensure accurate resource profiling results, wait for at least one day after you enable resource profiling for a workload to allow sufficient data to accumulate.

Billing

ack-koordinator is free to install and use. However, additional charges may apply in the following scenarios:

  • After installation, ack-koordinator requests two ACS general-purpose pods. You can configure the resource requests for each module when you install the component.

  • By default, ack-koordinator exposes monitoring metrics for features such as resource profiling in Prometheus format. If you select the "Enables Prometheus metrics for ACK-Koordinator" option during configuration and use Managed Service for Prometheus, these metrics are treated as custom metrics and incur charges. The fees depend on factors such as your cluster size and the number of applications. Before you enable this option, review the Billing documentation for Managed Service for Prometheus to understand the free quota and billing policy for custom metrics. To monitor and manage your resource usage, see Query the amount of observable data and bills.

Requirements

| Component | Required version |
| --- | --- |
| metrics-server | ≥ v0.3.9.7 |
| ack-koordinator | ≥ v1.5.0-ack1.14 |

Resource profiling

Kubernetes provides resource requests (Requests) to describe the resources required by containers. When a container specifies resources.requests, the scheduler matches the request with a node's allocatable resources (Allocatable) to decide which node to schedule the Pod on. Container resource requests are typically configured based on human experience. Administrators set these values based on the container's historical utilization and application stress test performance, and continuously adjust them based on feedback from the production environment. However, this manual approach to configuring resource specifications has the following limitations:

  • To ensure application stability, administrators often over-provision resources to handle fluctuating loads. Consequently, the container's resource request is set much higher than its actual utilization, leading to low cluster utilization and significant resource waste.

  • When cluster allocation rates are high, administrators may reduce resource requests to improve cluster utilization. This action increases the deployment density but can compromise cluster stability during traffic spikes.

To address these issues, ack-koordinator provides a resource profiling feature that recommends container resource specifications and simplifies their configuration. ACK lets you access resource profiling through the CLI and manage application resource profiles directly with CRDs. For more information about resource profiling, see Yunqi video course - ACK resource profiling.

Use resource profiling in the console

Step 1: Enable resource profiling

  1. Log on to the ACS console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Cost Suite > Cost Optimization.

  3. On the Cost Optimization page, click the Resource Profiling tab. Then, in the Resource Profiling section, follow the on-screen instructions to enable the feature.

    • Install or update the component: Follow the on-screen instructions to install or update the ack-koordinator component. You must install the ack-koordinator component the first time you use resource profiling.

    • Profiling configuration: If this is your first time using resource profiling, select Default Settings to define the scope after the component is installed or updated. You can modify the settings later by clicking Profiling Configuration in the console.

  4. Click Enable Resource Profiling to open the Resource Profiling page.

Step 2: Configure resource profiling

  1. On the Cost Optimization page, click the Resource Profiling tab, and then click Profiling Configuration.

    You can choose between Global Configuration and Custom Configuration modes. The Global Configuration mode is the default recommendation during component installation. You can modify the mode and parameters here. After you complete the configuration, click OK to apply the changes.

    Global configuration (recommended)

    The Global Configuration mode enables resource profiling for all workloads but excludes the arms-prom and kube-system namespaces by default. In the Profiling Configuration dialog box, click the Global Configuration tab, configure the following parameters, and then click OK.

    Parameter

    Description

    Values

    Excluded Namespace

    Specifies the namespaces where resource profiling is disabled. These are typically namespaces for system components. The final scope is the intersection of the enabled namespaces and workload types.

    You can select one or more existing namespaces in the cluster. By default, kube-system and arms-prom are selected.

    Workload Type

    Specifies the workload types to profile. The final scope is the intersection of the enabled namespaces and workload types.

    Kubernetes-native workloads are supported, including Deployment, StatefulSet, and DaemonSet. You can select multiple types.

    CPU/memory resource buffer

    The safety buffer used to generate resource profile recommendations. For more information, see the description below.

    The value must be a non-negative number. Three common buffer percentages are provided: 70%, 50%, and 30%.

    Custom configuration

    The Custom Configuration mode enables resource profiling only for workloads in specific namespaces. If your cluster is large (for example, with more than 1,000 nodes) or you want to enable resource profiling for only a subset of workloads, you can use this mode to specify the scope as needed. In the Profiling Configuration dialog box, click the Custom Configuration tab, configure the following parameters, and then click OK.

    Parameter

    Description

    Values

    Enabled namespaces

    The namespaces for which to enable resource profiling. The final scope is the intersection of the enabled namespaces and workload types.

    You can select one or more existing namespaces in the cluster.

    Workload type

    The types of workloads for which to enable resource profiling. The final scope is the intersection of the enabled namespaces and workload types.

    The supported Kubernetes-native workloads are Deployment, StatefulSet, and DaemonSet. You can select multiple types.

    CPU/memory resource buffer

    The safety buffer used to generate resource profile recommendations. For more information, see the description below.

    The value must be a non-negative number. Three common buffer percentages are provided: 70%, 50%, and 30%.

    Note

    Resource consumption buffer: When estimating application capacity (for example, QPS), administrators typically do not plan for 100% physical resource utilization. This accounts for both physical resource limitations, such as hyper-threading, and the application's need to reserve resources for peak loads. The console suggests a downgrade when the gap between the profile value and the original resource request exceeds the specified buffer. For information about the algorithm, see the description of profile recommendations in the application profile overview.

Step 3: View the profile overview

  1. After you configure the resource profiling policy, you can view the resource profile of each workload on the Resource Profiling page.

    To ensure accurate results, the system must collect at least 24 hours of data. When you use resource profiling for the first time, a message on the page indicates Time remaining for data collection: 24 hours. Because newly created profiles have limited data, we recommend waiting for at least one day to ensure the workload runs stably and that data covering both peak and off-peak traffic is collected.

  2. The following table provides an overview of the profile and describes each column.

    Note

    A hyphen (-) in the following table indicates that the item is not applicable.

    Column

    Description

    Values

    Filterable

    Workload name

    The name of the workload (Name).

    -

    Yes. You can search for a workload by its exact name in the search bar at the top.

    Namespace

    The namespace of the workload (Namespace).

    -

    Yes. By default, the kube-system namespace is excluded from the filter conditions.

    Workload type

    The type of the workload.

    Valid values: Deployment, DaemonSet, and StatefulSet.

    Yes. By default, all types are selected.

    CPU request

    The CPU request for the workload's Pods.

    -

    No

    Memory request

    The memory request for the workload's Pods.

    -

    No

    Profile data status

    The status of the workload's resource profile.

    • Collecting: The resource profile has just been created and has limited data. When you first use this feature, we recommend waiting for at least one day to ensure the workload runs stably and data covering both peak and off-peak traffic is collected.

    • Normal: The resource profile has been generated.

    • Workload Deleted: The corresponding workload has been deleted. The system automatically deletes the profile result after a retention period.

    No

    CPU profile, memory profile

    Right-sizing recommendations for the workload's original resource request. The recommendations are based on the profile value, original resource request, and the configured resource buffer.

    The values include Upgrade, Downgrade, and Maintain. The percentage indicates the deviation magnitude, calculated by the formula: Abs(Profile Value - Original Request) / Original Request.

    Yes. By default, the filter includes Upgrade and Downgrade.

    Creation time

    The time when the profile result was created.

    -

    No

    Change resource configuration

    After evaluating the profile results and recommendations, click Change Resource Configuration to upgrade or downgrade resources. For more information, see Step 5: Change resource specifications.

    -

    No

    Note

    Alibaba Cloud resource profiling generates a profile value for each container's resource specification. The console suggests actions, such as an upgrade or downgrade, by comparing the profile value (Recommend), the original resource request (Request), and the configured resource buffer (Buffer). If a workload has multiple containers, the suggestion is based on the container with the largest deviation. The calculation logic is as follows.

    • If the profile value (Recommend) is greater than the original resource request (Request), the container is consistently using more resources than requested. This poses a stability risk, and you should increase the resource specification promptly. The console displays an "Upgrade" suggestion.

    • If the profile value (Recommend) is less than the original resource request (Request), the container may have some degree of resource waste, and you can reduce the resource specification. This decision must also consider the configured resource buffer.

      1. Calculate the target resource specification (Target) based on the profile value and the configured buffer: Target = Recommend * (1 + Buffer)

      2. Calculate the deviation (Degree) of the target specification from the original request: Degree = 1 - Request / Target

      3. The system generates a CPU suggestion and a memory suggestion based on the profile value and the deviation (Degree). If the absolute value of Degree is greater than 0.1, the console displays a "Downgrade" suggestion.

    • In other cases, the suggested action for the application resource specification is Maintain, which means no immediate adjustment is needed.
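The calculation logic above can be sketched in a few lines of Python. This is a minimal illustration that follows the steps exactly as written in this note, not the console's actual implementation. The function name `suggest_action` and the `threshold` parameter are hypothetical; the formulas for Target and Degree and the 0.1 cutoff come from the text.

```python
def suggest_action(recommend: float, request: float, buffer: float,
                   threshold: float = 0.1) -> str:
    """Return "Upgrade", "Downgrade", or "Maintain" for one resource of one container.

    recommend: profile value (Recommend), for example in cores
    request:   original resource request (Request), same unit
    buffer:    configured resource buffer as a fraction, for example 0.3 for 30%
    """
    if recommend <= 0 or request <= 0:
        raise ValueError("recommend and request must be positive")

    # Profile value above the request: the container needs more than it asked for.
    if recommend > request:
        return "Upgrade"

    # Profile value below the request: check the deviation after adding the buffer.
    if recommend < request:
        target = recommend * (1 + buffer)   # Target = Recommend * (1 + Buffer)
        degree = 1 - request / target       # Degree = 1 - Request / Target
        if abs(degree) > threshold:
            return "Downgrade"

    # All other cases.
    return "Maintain"


if __name__ == "__main__":
    # Values from the cpu-load-gen example in this topic: 8 cores requested,
    # 4.742 cores recommended, with an assumed 30% buffer.
    print(suggest_action(recommend=4.742, request=8, buffer=0.3))
```

For a workload with several containers, the note states that the suggestion is based on the container with the largest deviation, so a caller would evaluate each container and keep the most deviant result.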

Step 4: View profile details

  1. On the Resource Profiling page, click a workload name to go to its Resource Profile Details page.

    The details page includes three sections: basic workload information, resource profile curves for each container, and a panel for changing the application's resource specifications. The following table describes the metrics for these curves, using CPU as an example.

    | Curve name | Description |
    | --- | --- |
    | cpu limit | The CPU resource limit of the container. |
    | cpu request | The CPU resource request of the container. |
    | cpu recommend | The recommended CPU resource value for the container based on its profile. |
    | cpu usage (average) | The average CPU usage across all container replicas in the workload. |
    | cpu usage (max) | The maximum CPU usage among all container replicas in the workload. |

Step 5: Change resource specifications

  1. In the Change Resource Configuration section at the bottom of the Resource Profile Details page, modify the resource specifications for each container based on the generated profile values.

    The Total Pod Resources section below the table displays the sum of the target resource requests and limits for all containers. Click Auto-fill to automatically populate the target resource fields based on the profile values.

    The following table describes the columns.

    Parameter

    Description

    Current resource request

    The container's current resource request (Request).

    Current resource limit

    The container's current resource limit (Limit).

    Profile value

    The profile value generated for the container, which can be used as a reference for the resource request.

    Safety buffer

    The safety buffer configured in the resource profiling policy. This value helps calculate the target resource request. For example, with a 30% buffer, the target request is Profile Value * (1 + 0.3). Example: 4.28 * 1.3 ≈ 5.56.

    Target resource request

    The target value to which you plan to adjust the container's resource request.

    Target resource limit

    The new resource limit for the container. Note: If topology-aware CPU scheduling is enabled for the workload, the CPU resource limit must be an integer.

    Important

    The profile values generated by resource profiling are the actual recommended values calculated by the algorithm. If you click the Apply button to change the resource configuration, Alibaba Cloud normalizes the application's resource specifications based on compute types. For more information, see Resource specifications.
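As a rough aid for filling in the target fields by hand, the following Python sketch applies the safety buffer formula from the table (Profile Value * (1 + buffer)) to each container and sums the results per pod. The container names and values are made up for illustration, the function names are hypothetical, and the sketch does not model the normalization that is applied when you click Apply.

```python
from typing import Dict, Tuple


def target_request(profile_value: float, buffer: float) -> float:
    """Target request = Profile Value * (1 + buffer), with buffer as a fraction."""
    if profile_value < 0 or buffer < 0:
        raise ValueError("profile_value and buffer must be non-negative")
    return profile_value * (1.0 + buffer)


def pod_targets(profile_values: Dict[str, float],
                buffer: float) -> Tuple[Dict[str, float], float]:
    """Return per-container target requests and their sum for the whole pod."""
    targets = {name: target_request(value, buffer)
               for name, value in profile_values.items()}
    return targets, sum(targets.values())


if __name__ == "__main__":
    # Hypothetical two-container pod; 4.28 is the example value from the table.
    per_container, total = pod_targets({"app": 4.28, "sidecar": 0.5}, buffer=0.3)
    for name, cores in per_container.items():
        print(f"{name}: {cores:.2f} cores")
    print(f"pod total: {total:.2f} cores")
```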

  2. After you complete the configuration, click Apply and then OK. This starts a rolling update of the workload, applies the new resource specifications, and redirects you to the workload details page.

    Important

    When you update the resource specifications, the controller performs a rolling update on the workload and recreates its Pods. Proceed with caution.

Use resource profiling with the CLI

Step 1: Enable resource profiling

  1. Create a file named recommendation-profile.yaml with the following content to enable resource profiling for your workloads.

    You can create a RecommendationProfile custom resource (CR) to enable resource profiling for a workload and get resource configuration recommendations for its containers. A RecommendationProfile CR lets you define the scope by namespace and workload type. The feature applies only to workloads that match both criteria.

    apiVersion: autoscaling.alibabacloud.com/v1alpha1
    kind: RecommendationProfile
    metadata:
      # The name of the object. For non-namespaced objects, do not specify a namespace.
      name: profile-demo
    spec:
      # The workload types for which to enable resource profiling.
      controllerKind:
      - Deployment
      # The namespaces for which to enable resource profiling.
      enabledNamespaces:
      - default

    The following table describes the parameters.

    | Parameter | Type | Description |
    | --- | --- | --- |
    | metadata.name | String | The name of the object. Because RecommendationProfile objects are non-namespaced, you do not need to specify a namespace. |
    | spec.controllerKind | String array | The workload types for which to enable resource profiling. Valid values: Deployment, StatefulSet, and DaemonSet. |
    | spec.enabledNamespaces | String array | The namespaces for which to enable resource profiling. |

  2. Run the following command to apply the profile:

    kubectl apply -f recommendation-profile.yaml
  3. Create a file named cpu-load-gen.yaml with the following content.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cpu-load-gen
      labels:
        app: cpu-load-gen
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: cpu-load-gen-selector
      template:
        metadata:
          labels:
            app: cpu-load-gen-selector
        spec:
          containers:
          - name: cpu-load-gen
            image: registry.cn-zhangjiakou.aliyuncs.com/acs/slo-test-cpu-load-gen:v0.1
            command: ["cpu_load_gen.sh"]
            imagePullPolicy: Always
            resources:
              requests:
                cpu: 8 # The CPU request for this application is 8 cores.
                memory: "1Gi"
              limits:
                cpu: 12
                memory: "2Gi"
  4. Run the following command to deploy the cpu-load-gen application.

    kubectl apply -f cpu-load-gen.yaml
  5. Run the following command to get the resource configuration recommendations.

    kubectl get recommendations -l \
      "alpha.alibabacloud.com/recommendation-workload-apiVersion=apps-v1, \
      alpha.alibabacloud.com/recommendation-workload-kind=Deployment, \
      alpha.alibabacloud.com/recommendation-workload-name=cpu-load-gen" -o yaml
    Note

    For accurate results, wait at least 24 hours for sufficient data collection.

    ack-koordinator generates a resource recommendation for each enabled workload and stores it in a Recommendation CR. The following example shows the Recommendation CR for the cpu-load-gen workload.

    apiVersion: autoscaling.alibabacloud.com/v1alpha1
    kind: Recommendation
    metadata:
      labels:
        alpha.alibabacloud.com/recommendation-workload-apiVersion: apps-v1
        alpha.alibabacloud.com/recommendation-workload-kind: Deployment
        alpha.alibabacloud.com/recommendation-workload-name: cpu-load-gen
      name: f20ac0b3-dc7f-4f47-b3d9-bd91f906****
      namespace: default
    spec:
      workloadRef:
        apiVersion: apps/v1
        kind: Deployment
        name: cpu-load-gen
    status:
      recommendResources:
        containerRecommendations:
        - containerName: cpu-load-gen
          target:
            cpu: 4742m
            memory: 262144k
          originalTarget: # This is an intermediate result from the resource profiling algorithm. We recommend that you do not use it directly.
           # ...

    For easy retrieval, the Recommendation CR is created in the same namespace as the workload. It includes labels for the workload's API version, type, and name. The following table describes these labels.

    | Label key | Description | Example |
    | --- | --- | --- |
    | alpha.alibabacloud.com/recommendation-workload-apiVersion | The API version of the workload. Forward slashes (/) are replaced with hyphens (-) to comply with Kubernetes label syntax. | apps-v1 (originally apps/v1) |
    | alpha.alibabacloud.com/recommendation-workload-kind | The type of the workload, such as Deployment or StatefulSet. | Deployment |
    | alpha.alibabacloud.com/recommendation-workload-name | The name of the workload. In compliance with Kubernetes label specifications, the name cannot exceed 63 characters. | cpu-load-gen |
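The label scheme described above lends itself to building the `kubectl get recommendations -l` selector programmatically. The following Python sketch shows one way to do that. The function names are hypothetical, and rejecting names longer than 63 characters is a choice made for this sketch; how ack-koordinator itself treats such names is not covered in this topic.

```python
LABEL_PREFIX = "alpha.alibabacloud.com/recommendation-workload-"
MAX_LABEL_VALUE_LENGTH = 63  # Kubernetes limit for label values


def workload_labels(api_version: str, kind: str, name: str) -> dict:
    """Build the three lookup labels for a workload's Recommendation CR."""
    if len(name) > MAX_LABEL_VALUE_LENGTH:
        # Sketch-only behavior: refuse names that cannot be used as a label value.
        raise ValueError("workload name exceeds 63 characters")
    return {
        # "/" is not allowed in label values, so "apps/v1" becomes "apps-v1".
        LABEL_PREFIX + "apiVersion": api_version.replace("/", "-"),
        LABEL_PREFIX + "kind": kind,
        LABEL_PREFIX + "name": name,
    }


def label_selector(labels: dict) -> str:
    """Join labels into a comma-separated equality selector for kubectl -l."""
    return ",".join(f"{key}={value}" for key, value in labels.items())


if __name__ == "__main__":
    selector = label_selector(workload_labels("apps/v1", "Deployment", "cpu-load-gen"))
    print(f'kubectl get recommendations -l "{selector}" -o yaml')
```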

    The resource profiling results for each container are stored in status.recommendResources.containerRecommendations. The following table describes the fields.

    | Parameter | Description | Format | Example |
    | --- | --- | --- | --- |
    | containerName | The name of the container. | string | cpu-load-gen |
    | target | The resource profiling results, including recommendations for CPU and memory. | map[ResourceName]resource.Quantity | cpu: 4742m, memory: 262144k |
    | originalTarget | Contains an intermediate result from the resource profiling algorithm. Do not use this value directly. | - | - |

    Note

    The minimum recommended CPU is 0.025 core and the minimum recommended memory is 250 MB per container.

    Comparing the declared resources in the cpu-load-gen application with the profiling results shows that the CPU request is overprovisioned. You can reduce the request to save cluster resources.

    | Category | Original configuration | Recommended configuration |
    | --- | --- | --- |
    | CPU | 8 cores | 4.742 cores |
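To make the comparison concrete, the following Python sketch converts the quantity strings from the Recommendation CR (`4742m`, `262144k`) into plain numbers and computes the deviation with the formula used in the console, Abs(Profile Value - Original Request) / Original Request. The parser is a simplified stand-in that handles only a few common Kubernetes quantity suffixes; the function names are hypothetical.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes: decimal SI (m, k, M, G) and binary (Ki, Mi, Gi).
_SUFFIXES = {
    "Ki": Decimal(1024),
    "Mi": Decimal(1024) ** 2,
    "Gi": Decimal(1024) ** 3,
    "m": Decimal("0.001"),
    "k": Decimal(1000),
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity such as "4742m" or "1Gi" to cores or bytes."""
    # Try two-character suffixes first so that "Mi" is not mistaken for a bare number.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def deviation(profile_value: Decimal, original_request: Decimal) -> Decimal:
    """Abs(Profile Value - Original Request) / Original Request."""
    return abs(profile_value - original_request) / original_request


if __name__ == "__main__":
    cpu_request = parse_quantity("8")
    cpu_recommend = parse_quantity("4742m")
    print(f"CPU recommendation: {cpu_recommend} cores")
    print(f"CPU deviation: {deviation(cpu_recommend, cpu_request):.1%}")
    print(f"Memory recommendation: {parse_quantity('262144k') / parse_quantity('1Mi')} MiB")
```

With the values from this example, the recommended CPU is roughly 41% below the original 8-core request, and `262144k` corresponds to 250 MiB.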

(Optional) Step 2: View results in Prometheus

The ack-koordinator component exposes resource profiling results as Prometheus metrics. If you have a self-managed Prometheus instance, you can use the following metrics to configure a dashboard.

# The CPU resource profile for a container in the specified workload.
koord_manager_recommender_recommendation_workload_target{exported_namespace="$namespace", workload_name="$workload", container_name="$container", resource="cpu"}
# The memory resource profile for a container in the specified workload.
koord_manager_recommender_recommendation_workload_target{exported_namespace="$namespace", workload_name="$workload", container_name="$container", resource="memory"}

FAQ

Resource profiling algorithm

The resource profiling algorithm uses a multi-dimensional data model. The key principles are as follows:

  • The algorithm continuously collects container resource usage data and calculates aggregate statistics, such as sample peaks, weighted averages, and percentile values for CPU and memory usage.

  • In the final recommendation, the algorithm sets the CPU value to the 95th percentile (P95) and the memory value to the 99th percentile (P99). It also adds a safety buffer to both to ensure workload reliability.

  • The algorithm is optimized for time. It considers only data from the last 14 days and uses a half-life sliding window model for aggregation, which gradually reduces the weight of older data samples.

  • The algorithm also considers container runtime events, such as out-of-memory (OOM) errors, to improve the accuracy of the recommended values.
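The half-life sliding window idea can be illustrated with a small Python sketch: each sample's weight halves every `half_life_days`, samples older than 14 days are dropped, and the recommendation is read off a weighted percentile. This is a simplified illustration only, not the actual ack-koordinator algorithm. The 14-day window and the P95/P99 choices come from the text above; the half-life value, the function name, and the percentile method are assumptions, and OOM handling and the safety buffer are left out.

```python
from typing import Iterable, Tuple

WINDOW_DAYS = 14.0      # from the text: only the last 14 days are considered
HALF_LIFE_DAYS = 1.0    # assumption: the text does not state the half-life


def decayed_percentile(samples: Iterable[Tuple[float, float]],
                       percentile: float,
                       half_life_days: float = HALF_LIFE_DAYS,
                       window_days: float = WINDOW_DAYS) -> float:
    """Weighted percentile of usage samples given as (age_in_days, usage) pairs.

    A sample that is `half_life_days` old counts half as much as a fresh one.
    """
    weighted = [
        (usage, 0.5 ** (age / half_life_days))
        for age, usage in samples
        if 0 <= age <= window_days
    ]
    if not weighted:
        raise ValueError("no samples inside the window")

    weighted.sort(key=lambda pair: pair[0])           # ascending by usage
    threshold = percentile / 100.0 * sum(w for _, w in weighted)

    cumulative = 0.0
    for usage, weight in weighted:
        cumulative += weight
        if cumulative >= threshold:
            return usage
    return weighted[-1][0]                            # guard against rounding


if __name__ == "__main__":
    # Hypothetical CPU samples in cores: (age in days, usage).
    cpu_samples = [(0.0, 2.1), (0.5, 2.4), (1.0, 4.0), (3.0, 6.5), (20.0, 9.9)]
    print("CPU P95:", decayed_percentile(cpu_samples, 95))
```

Because older samples carry less weight, a spike from several days ago influences the result less than a recent one, and anything outside the window (the 20-day-old sample in the usage example) is ignored entirely.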

For more information, see Technologies behind resource profiling and How resource profiling works and suggestions.

Application requirements

Resource profiling works best for online services.

The recommendations from resource profiling prioritize meeting the container's resource demands, ensuring that most usage patterns are covered. However, for offline applications, such as batch processing tasks, the primary goal is often overall throughput, and a certain level of resource contention is acceptable to improve cluster-wide utilization. For these applications, the recommendations may seem conservative. Additionally, critical system components are often deployed in an active/standby configuration with multiple replicas. The standby replica remains idle for long periods, and its low resource usage can skew the algorithm's accuracy. For these scenarios, you should adjust the recommendations based on your specific needs. Keep an eye on future product updates for improvements in these areas.

Using recommended values

Evaluate the recommendations against your business needs. The values from resource profiling summarize your application's current resource demand. You should use these values as a baseline and adjust them as needed. For example, you might need to reserve extra capacity to handle traffic spikes or to support seamless zone-disaster recovery. In these cases, you need to add a resource buffer. Similarly, if your application is resource-sensitive and cannot run smoothly on a host with high load, you may need to increase the resource specifications above the recommended values.

Unexpected recommendations

The values generated by resource profiling are the raw recommendations from the algorithm. When you apply these changes through the console, Container Service for Kubernetes (ACK) normalizes the resource specifications based on different compute classes. For more information, see Resource specifications. This normalization can result in a final specification that differs slightly from the one you entered.

Self-managed Prometheus integration

The Koordinator Manager module of the ack-koordinator component exposes the resource profiling metrics as an HTTP endpoint in Prometheus format. You can run the following command to get the Pod IP address and then access the endpoint to view the metrics.

# Run the following command to get the Pod IP address.
$ kubectl get pod -A -o wide | grep koord-manager
# Sample output. The actual output may vary.
kube-system   ack-koord-manager-5479f85d5f-7xd5k                         1/1     Running            0                  19d   192.168.12.242   cn-beijing.192.168.xx.xxx   <none>           <none>
kube-system   ack-koord-manager-5479f85d5f-ftblj                         1/1     Running            0                  19d   192.168.12.244   cn-beijing.192.168.xx.xxx   <none>           <none>
# Run the following command to view the metrics. Note that Koordinator Manager runs in a dual-replica active/standby mode, and data is available only from the active Pod.
# For the IP address and port, refer to the Deployment configuration of the Koordinator Manager module.
# Before you run the command, make sure the host where you run it can communicate with the cluster's container network.
$ curl -s http://192.168.12.244:9326/all-metrics | grep koord_manager_recommender_recommendation_workload_target
# Sample output. The actual output may vary.
# HELP koord_manager_recommender_recommendation_workload_target Recommendation of workload resource request.
# TYPE koord_manager_recommender_recommendation_workload_target gauge
koord_manager_recommender_recommendation_workload_target{container_name="xxx",namespace="xxx",recommendation_name="xxx",resource="cpu",workload_api_version="apps/v1",workload_kind="Deployment",workload_name="xxx"} 2.406
koord_manager_recommender_recommendation_workload_target{container_name="xxx",namespace="xxx",recommendation_name="xxx",resource="memory",workload_api_version="apps/v1",workload_kind="Deployment",workload_name="xxx"} 3.861631195e+09

After the ack-koordinator component is installed, it automatically creates Service and ServiceMonitor objects for the corresponding Pods.

Prometheus supports multiple data collection methods. If you use a self-managed Prometheus instance, refer to its official documentation for configuration. Use the preceding process for debugging. After debugging is complete, you can configure a Grafana dashboard as described in (Optional) Step 2: View results in Prometheus.
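While debugging, it can help to turn the raw endpoint output into structured data. The following Python sketch parses lines in the Prometheus text format, such as the sample output above, and extracts the labels and value of each `koord_manager_recommender_recommendation_workload_target` series. It is a minimal parser written for this illustration (the function name is hypothetical) and does not handle escaped quotes inside label values; for production use, a Prometheus client library is a better fit.

```python
import re
from typing import Dict, List, Tuple

TARGET_METRIC = "koord_manager_recommender_recommendation_workload_target"

# metric_name{label="value",...} value
_LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_targets(text: str, metric: str = TARGET_METRIC) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs for every sample of `metric` in `text`."""
    results = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels")))
        results.append((labels, float(match.group("value"))))
    return results


if __name__ == "__main__":
    sample = (
        "# TYPE koord_manager_recommender_recommendation_workload_target gauge\n"
        'koord_manager_recommender_recommendation_workload_target{container_name="xxx",'
        'namespace="xxx",resource="cpu",workload_name="xxx"} 2.406\n'
    )
    for labels, value in parse_targets(sample):
        print(labels["resource"], value)
```

CPU values are reported in cores and memory values in bytes, as the sample output above suggests (2.406 and 3.861631195e+09).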

Delete results and rules

The profiling results and rules are stored as Recommendation and RecommendationProfile custom resources (CRs), respectively. Run the following commands to delete all results and rules.

# Delete all profiling results.
kubectl delete recommendation -A --all
# Delete all profiling rules.
kubectl delete recommendationprofile -A --all

Authorize RAM users

The authorization system for Container Compute Service (ACS) uses a two-layer model: RAM authorization for infrastructure resources and Role-Based Access Control (RBAC) authorization for the ACK cluster itself. For more information about the authorization system, see Authorization best practices. To grant a RAM user permissions to use the resource profiling feature in a cluster, follow these best practices:

  1. RAM authorization

    Use your primary account to log on to the RAM console and grant the built-in system policy AliyunAccReadOnlyAccess (read-only) to the RAM user. For more information, see Attach system policies.

  2. RBAC authorization

    After you complete RAM authorization, you also need to grant the RAM user the developer role or a higher-level RBAC role in the target cluster. For more information, see Configure RBAC permissions for RAM users or roles.

Note

The built-in developer role and higher-level RBAC roles provide read and write permissions on all Kubernetes resources in the cluster. If you want to grant more fine-grained permissions to a RAM user, you can create or edit a custom ClusterRole as described in Customize RBAC authorization policies. To use the resource profiling feature, add the following rules to the ClusterRole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: recommendation-clusterrole
rules:
- apiGroups:
  - autoscaling.alibabacloud.com
  resources:
  - '*'
  verbs:
  - '*'