Overview of node pools

A node pool is a logical group of nodes that share the same properties. Node pools let you manage nodes and perform operations and maintenance (O&M) tasks, such as node upgrades and elastic scaling, in a unified way. ACK also provides automated O&M capabilities for node pools, such as automatic remediation of OS CVE vulnerabilities and automatic recovery of failed nodes, to help reduce O&M costs.

Introduction to node pools

A node pool is a configuration template. Nodes that are scaled out in a node pool adopt its configuration. You can create multiple node pools with different configurations and types in a cluster. The configuration of a node pool includes node properties, such as instance types, billing methods, zones (vSwitches), operating system images, CPU architectures, labels, and taints. You can specify these properties when you create a node pool or edit them after the node pool is created.
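
The template idea can be sketched in a few lines of Python. This is only an illustrative model, not the ACK API: the class names, field names, the `scale_out` helper, the round-robin placement, and the sample values (`ecs.g7.xlarge`, `vsw-a`, `vsw-b`) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class NodePoolTemplate:
    """Hypothetical model of the properties a node pool carries."""
    name: str
    instance_types: List[str]
    billing_method: str
    vswitch_ids: List[str]          # one vSwitch per zone
    os_image: str
    cpu_architecture: str
    labels: Dict[str, str] = field(default_factory=dict)
    taints: List[str] = field(default_factory=list)


@dataclass
class Node:
    name: str
    pool: str
    instance_type: str
    vswitch_id: str
    os_image: str
    labels: Dict[str, str]
    taints: List[str]


def scale_out(template: NodePoolTemplate, count: int, start: int = 0) -> List[Node]:
    """Create `count` nodes that copy their configuration from the template.

    Round-robin over instance types and vSwitches is an assumption of this
    sketch, not a statement about how ACK places nodes.
    """
    nodes = []
    for i in range(start, start + count):
        nodes.append(Node(
            name=f"{template.name}-{i}",
            pool=template.name,
            instance_type=template.instance_types[i % len(template.instance_types)],
            vswitch_id=template.vswitch_ids[i % len(template.vswitch_ids)],
            os_image=template.os_image,
            labels=dict(template.labels),   # copy so nodes do not share state
            taints=list(template.taints),
        ))
    return nodes


pool = NodePoolTemplate(
    name="general",
    instance_types=["ecs.g7.xlarge"],       # placeholder instance type
    billing_method="pay-as-you-go",
    vswitch_ids=["vsw-a", "vsw-b"],         # placeholder vSwitch IDs
    os_image="ContainerOS",
    cpu_architecture="x86",
    labels={"team": "platform"},
)
nodes = scale_out(pool, 3)
```

Every node returned by `scale_out` carries the labels, taints, and OS image of the pool it came from, which is the property the paragraph above describes.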

Older clusters that were created before the node pool feature was introduced may contain unmanaged worker nodes. We recommend that you add these nodes to a node pool for easier management. For more information, see Migrate unmanaged nodes to a node pool.

You can use a single node pool to reduce management and configuration complexity. You can also use multiple node pools to implement fine-grained resource isolation and manage a hybrid deployment of different node types.

Single node pool

Manage compute resources for multiple teams or workloads through a single node pool to simplify operations and maintenance. A single node pool supports the following features.

  • Manage compute resources for multiple teams.

  • Configure multiple instance types, such as regular ECS instances, GPU-accelerated instances, ECS Bare Metal instances, and high-performance computing (HPC)-optimized instances, to meet the needs of different workloads.

  • Distribute nodes across multiple zones to improve availability.

Instances with different operating systems or CPU architectures (Arm and x86) cannot be mixed in the same node pool.

Multiple node pools

Create multiple node pools to provide independent compute resources for different workloads or teams. This helps avoid resource contention and potential security risks. This is suitable for the following scenarios.

  • Isolate tenants and provide independent compute resources for different teams. This also simplifies billing management.

  • Isolate machines with different device specifications, such as CPU architecture, GPU, and FPGA, to ensure rational allocation of hardware resources.

  • Enhance security isolation for sensitive applications.

  • Deploy different operating systems.

When you use multiple node pools, you can use scheduling policies to define the priority of different node pools to optimize resource and cost management. For example:

  • Control the provisioning priority of compute resources with different costs, such as spot instances and subscription instances, to reduce overall costs.

  • Allocate different instance types based on workload requirements, such as the required ratio of x86 to Arm architectures.
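
As a rough illustration of priority-based provisioning, the sketch below fills a request from node pools in priority order. It is not ACK's scheduling policy API: `PoolOption`, `plan_provisioning`, the "lower number is tried first" convention, and the capacity figures are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class PoolOption(NamedTuple):
    name: str
    priority: int    # assumption: lower value is tried first
    available: int   # nodes this pool can still provide


def plan_provisioning(pools: List[PoolOption], needed: int) -> Dict[str, int]:
    """Take nodes from pools in priority order until the request is met."""
    plan: Dict[str, int] = {}
    remaining = needed
    for pool in sorted(pools, key=lambda p: p.priority):
        if remaining <= 0:
            break
        take = min(pool.available, remaining)
        if take > 0:
            plan[pool.name] = take
            remaining -= take
    return plan


# Example: prefer already-paid subscription capacity, then fall back to spot.
plan = plan_provisioning(
    [PoolOption("spot", priority=2, available=10),
     PoolOption("subscription", priority=1, available=3)],
    needed=5,
)
```

Which pool gets the higher priority is a cost decision for each cluster; the example ordering here is only one possibility.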

Node pool features

ACK provides various node management capabilities at the node pool level. If you want to reduce the O&M workload for worker nodes and focus more on application development, you can enable the managed node pool feature to use various automated O&M capabilities.

Basic features

Feature

Description

References

Create, edit, delete, and view

  • Create node pools in the console and configure basic information, network settings, instance specifications, storage configurations, and the expected number of nodes.

  • Adjust the configurations of existing node pools. For more information about the editable configuration items and important considerations, see the referenced document.

  • Delete node pools when they are no longer needed. The node release behavior depends on whether the expected number of nodes is enabled for the node pool and the billing method of the nodes.

  • View the details of a node pool, including basic configuration information, resource monitoring dashboards, node lists, and scaling activities.

Create and manage a node pool

Manual or automatic scaling

  • Manually adjust the expected number of nodes in a node pool to scale it in or out. This keeps the number of nodes at the desired level and helps save resource costs.

    Certain non-standard removal, modification, and release operations may prevent the node pool from scaling out as expected. For more information, see the referenced document.

  • Configure a node autoscaling solution to automatically scale node resources when the cluster's capacity cannot meet the scheduling requirements of application pods.

Add existing nodes

Use the Add Existing Nodes feature to add a purchased ECS instance to an ACK cluster as a worker node or to add a worker node back to a node pool after it was removed. This feature has some limitations and important considerations. For more information, see the referenced document.

Adding existing nodes

Remove nodes

If you no longer need certain nodes, you can remove them from the cluster or node pool. Follow the standard procedures to avoid unexpected behavior.

Remove nodes

Upgrade the kubelet version

You can automatically upgrade the kubelet and runtime using the automatic cluster upgrade feature.

Upgrade the kubelet and containerd versions of nodes in a node pool.

Upgrade a node pool

Change the operating system

Upgrade the operating system version or change the operating system type. For example, you can switch from an end-of-life (EOL) operating system to ContainerOS or Alibaba Cloud Linux.

Change the operating system of nodes in a node pool

CVE vulnerability remediation

You can enable automated O&M capabilities to have ACK remediate OS CVE vulnerabilities automatically.

Manually scan for CVE vulnerabilities and fix security vulnerabilities in the node's operating system. Some CVE vulnerability fixes require a node restart. For more information about this feature and its considerations, see the referenced document.

Manually fix OS CVE vulnerabilities

Customize kubelet parameters for a node pool

Customize kubelet parameters for nodes at the node pool level to adjust node behavior. For example, you can adjust cluster resource reservations to manage resource allocation.

Customize kubelet configurations for a node pool

Customize OS parameters for a node pool

Customize OS parameters for nodes at the node pool level to tune system performance.

Manage OS parameters for a node pool

Cost insights

Analyze resource usage and cost distribution at the node pool level to optimize costs and improve cluster resource utilization.

Cost insights

Automated O&M capabilities

Enabling automated O&M capabilities for a node pool can reduce the O&M workload for worker nodes. This allows ACK to automatically perform certain O&M operations, such as automatic remediation for operating system (OS) CVE vulnerabilities, automatic kubelet upgrades, and automatic fault recovery for nodes. However, this approach is not recommended if your services are sensitive to changes in underlying nodes and cannot tolerate node restarts or application pod migrations.

Preparations

  • Ensure that the operating system is Alibaba Cloud Linux 3 Container-Optimized Edition, ContainerOS, Alibaba Cloud Linux, Red Hat, or Ubuntu.

    For more information about the operating system images supported by ACK clusters and their limitations, see Operating systems.

  • Before you use the automated O&M capabilities of a node pool, you must complete the following operations on the Node Pool page in the Container Service Management Console. You can modify these configurations at any time.

    For more information about how to create and edit a node pool, see Create and manage a node pool.

    • Enable the managed feature for the node pool.

      • New node pool: Select Managed Node Pool.

      • Existing node pool: In the Actions column for the desired node pool, choose More > Enable Managed Feature.

    • Enable the required automated O&M capabilities for the node pool.

      This includes enabling node auto-healing, automatic upgrades of kubelet, runtime, and OS versions, and automatic remediation of OS CVE vulnerabilities.

    • Configure a maintenance window for the cluster.

      The automated O&M capabilities of a node pool require a cluster maintenance window. The node pool performs automated O&M tasks within this window.

      • New node pool: Configure the Cluster Maintenance Window for the Managed Node Pool during creation.

      • Existing node pool: In the node pool list, find the managed node pool, and in the Actions column, choose More > Managed Configurations to configure the maintenance window.
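
The preparations above amount to a short checklist. A minimal sketch of that checklist follows, assuming a hypothetical `missing_prerequisites` helper and plain string comparison of OS names; neither corresponds to a real ACK API.

```python
from typing import List, Optional

# Operating systems listed in the preparations above.
SUPPORTED_OS = {
    "Alibaba Cloud Linux 3 Container-Optimized Edition",
    "ContainerOS",
    "Alibaba Cloud Linux",
    "Red Hat",
    "Ubuntu",
}


def missing_prerequisites(
    os_name: str,
    managed: bool,
    enabled_capabilities: List[str],
    maintenance_window: Optional[str],
) -> List[str]:
    """Return the preparation steps that are still outstanding."""
    problems = []
    if os_name not in SUPPORTED_OS:
        problems.append(f"unsupported operating system: {os_name}")
    if not managed:
        problems.append("managed feature is not enabled for the node pool")
    if not enabled_capabilities:
        problems.append("no automated O&M capability is enabled")
    if maintenance_window is None:
        problems.append("no cluster maintenance window is configured")
    return problems


ready = missing_prerequisites("Ubuntu", True, ["node auto-healing"], "Sat 02:00-05:00")
not_ready = missing_prerequisites("CentOS", False, [], None)
```

An empty result means all four conditions hold; otherwise each entry names one step to complete in the console.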

Feature introduction

Feature

Description

Node auto-healing

ACK automatically monitors node status and performs auto-healing tasks when a node becomes abnormal. This fixes issues with the system, Kubernetes components, and node instances. For more information, see Enable node auto-healing.

Automatic OS CVE vulnerability remediation

ACK scans nodes for security vulnerabilities, then schedules and executes a CVE vulnerability remediation plan based on the cluster maintenance window. This improves cluster stability, security, and compliance. For usage notes, see Fix OS CVE vulnerabilities for a node pool.

Automatic response to ECS system events

ACK can respond automatically to ECS system events. The following system event type is currently supported.

  • Instance restart due to system maintenance (SystemMaintenance.Reboot)

    Automatic response process

    1. After receiving the event, ACK sends a notification by text message or internal message. Watch for these notifications and act on them promptly.

    2. ACK schedules the response based on the execution time that ECS has planned for the event:

      • If the node pool has an available maintenance window before the scheduled execution time, ACK will execute the automatic response process within the maintenance window.

      • Otherwise, ACK will execute the process one hour before the scheduled execution time of ECS.

    3. After the process starts, ACK performs a node drain on the affected ECS instance. It attempts to migrate the pods on the node to other available nodes, and then restarts the ECS instance.
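
The timing rule in step 2 can be sketched as follows. The function name and the `(start, end)` tuple representation of maintenance windows are assumptions, and "available before the scheduled execution time" is interpreted here as a window that has not yet ended and that begins before the ECS scheduled time.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of a maintenance window


def plan_response_start(
    now: datetime,
    ecs_scheduled_time: datetime,
    maintenance_windows: List[Window],
) -> datetime:
    """Pick when the automatic response (drain, then restart) should begin."""
    usable = [
        (start, end)
        for start, end in maintenance_windows
        if end > now and start < ecs_scheduled_time
    ]
    if usable:
        # Run inside the earliest usable window, but never in the past.
        earliest_start, _ = min(usable)
        return max(earliest_start, now)
    # No window in time: fall back to one hour before the ECS scheduled time.
    return ecs_scheduled_time - timedelta(hours=1)


now = datetime(2026, 1, 1, 0, 0)
scheduled = datetime(2026, 1, 3, 12, 0)
window = (datetime(2026, 1, 2, 2, 0), datetime(2026, 1, 2, 5, 0))

in_window = plan_response_start(now, scheduled, [window])
fallback = plan_response_start(now, scheduled, [])
```

With the sample values, the first call selects the start of the maintenance window and the second falls back to 11:00 on the day of the ECS event.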

    Node draining rules

    When draining a node, ACK first evicts all pods on the node that can be safely drained. If a pod cannot be safely drained, ACK returns an error message and terminates the automatic response process.

    Pods in the following situations are considered unsafe to drain:

    • The pod has the label goatscaler.io/safe-to-evict=false.

    • The pod is not managed by a controller (that is, a standalone pod without an OwnerReference).

    • The pod uses a volume of the emptyDir type.
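
The three conditions above can be checked against a pod manifest before maintenance. The sketch below works on a pod represented as a plain dictionary in the standard Kubernetes layout (`metadata.labels`, `metadata.ownerReferences`, `spec.volumes[].emptyDir`); the helper name and reason strings are hypothetical, and this is not the code ACK runs.

```python
from typing import Any, Dict, List

SAFE_TO_EVICT_LABEL = "goatscaler.io/safe-to-evict"


def unsafe_to_drain_reasons(pod: Dict[str, Any]) -> List[str]:
    """Return the reasons a pod would be treated as unsafe to drain."""
    metadata = pod.get("metadata") or {}
    spec = pod.get("spec") or {}
    reasons = []

    labels = metadata.get("labels") or {}
    if labels.get(SAFE_TO_EVICT_LABEL) == "false":
        reasons.append("labeled safe-to-evict=false")

    if not metadata.get("ownerReferences"):
        reasons.append("standalone pod without an OwnerReference")

    volumes = spec.get("volumes") or []
    if any("emptyDir" in volume for volume in volumes):
        reasons.append("uses an emptyDir volume")

    return reasons


managed_pod = {
    "metadata": {"name": "web-1", "ownerReferences": [{"kind": "ReplicaSet", "name": "web"}]},
    "spec": {"volumes": [{"name": "config", "configMap": {"name": "web"}}]},
}
scratch_pod = {
    "metadata": {"name": "debug"},
    "spec": {"volumes": [{"name": "tmp", "emptyDir": {}}]},
}
```

Running the helper over all pods on a node before the maintenance window shows which workloads would cause the automatic response to stop.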

    The drain operation evicts pods on the node. To prevent node maintenance from affecting the overall availability of your services, we strongly recommend that you:

    • Deploy the application backend with multiple replicas spread across different nodes.

    • Configure a PodDisruptionBudget (PDB) for important applications so that enough replicas stay available while pods on the node are evicted.
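
For reference, a PodDisruptionBudget is a small Kubernetes object. The sketch below builds one as a Python dictionary using the standard `policy/v1` fields (`minAvailable`, `selector.matchLabels`) and prints it as JSON. The name `web-pdb`, the `app: web` label, and the value 2 are placeholders, not recommendations.

```python
import json
from typing import Any, Dict


def pod_disruption_budget(name: str, app_label: str, min_available: int) -> Dict[str, Any]:
    """Build a policy/v1 PodDisruptionBudget manifest for pods labeled app=<app_label>."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Evictions are refused if they would leave fewer than this many pods.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


manifest = pod_disruption_budget("web-pdb", "web", 2)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest can be saved to a file and applied with `kubectl apply -f`.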

Starting from January 31, 2026, the configuration entries for automatic upgrades of kubelet and container runtimes in managed node pools will be removed. You can configure automatic cluster upgrades to automatically upgrade node pools. For more information, see Product Change | Announcement on changes to security vulnerability remediation and automatic upgrades for managed node pools.

Node pool lifecycle

The lifecycle of an ACK cluster node pool involves multiple stages and states, from creation and deployment, to running and maintenance (including scaling, updating, and node removal), and finally to deletion. The following section describes the different states and their transitions.

Node pool states and their descriptions:

  • Initializing (initial): The node pool is being created.

  • Active (active): The node pool is successfully created and is running.

  • Failed (failed): The node pool failed to be created.

  • Scaling (scaling): The node pool is scaling out or adding nodes.

  • Updating (updating): The node pool configuration is being updated.

  • Removing nodes (removing_nodes): Nodes are being removed from the node pool.

  • Upgrading (upgrading): The node pool is being upgraded.

  • Repairing (repairing): The node pool is being repaired, for example, repairing nodes or fixing CVE vulnerabilities in the node pool.

  • Deleting (deleting): The node pool is being deleted.

  • Deleted (deleted): The node pool is successfully deleted. This state is not visible to you.

  • Deletion failed (deleted_failed): The node pool failed to be deleted. Try to delete it again. If the deletion still fails, submit a ticket.
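
When scripting against node pools, it can help to model these states explicitly. The sketch below lists the state strings from this section in an enum. The grouping into "in progress" states is an assumption based on the descriptions ("is being ..."), not a documented API contract, and `is_in_progress` is a hypothetical helper.

```python
from enum import Enum


class NodePoolState(Enum):
    INITIAL = "initial"
    ACTIVE = "active"
    FAILED = "failed"
    SCALING = "scaling"
    UPDATING = "updating"
    REMOVING_NODES = "removing_nodes"
    UPGRADING = "upgrading"
    REPAIRING = "repairing"
    DELETING = "deleting"
    DELETED = "deleted"
    DELETED_FAILED = "deleted_failed"


# Assumption: states described as "is being ..." mean an operation is still running.
IN_PROGRESS = frozenset({
    NodePoolState.INITIAL,
    NodePoolState.SCALING,
    NodePoolState.UPDATING,
    NodePoolState.REMOVING_NODES,
    NodePoolState.UPGRADING,
    NodePoolState.REPAIRING,
    NodePoolState.DELETING,
})


def is_in_progress(state: str) -> bool:
    """Return True if the state string denotes a running operation.

    Raises ValueError for a string that is not one of the listed states.
    """
    return NodePoolState(state) in IN_PROGRESS
```

A script could, for example, poll until `is_in_progress` returns False before it starts another change on the same node pool.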

Node pool billing

The use of node pools and their automated O&M capabilities is free of charge. However, you are charged for the cloud resources within the node pool, such as ECS instances, by the corresponding cloud services.

Glossary

Before you use node pools, we recommend that you familiarize yourself with the following concepts and terms.

  • Scaling group: When you scale a node pool in or out, ACK performs scale-out and node removal operations using the Auto Scaling (ESS) service. Each node pool has a one-to-one relationship with an Auto Scaling scaling group. A scaling group is a collection of one or more ECS instances (worker nodes).

  • Scaling configuration: A node pool uses a scaling configuration to manage node configurations. A scaling configuration is a template that is used to create ECS instances during elastic scaling. When a scaling activity is triggered, Auto Scaling uses the specified scaling configuration to automatically create ECS instances.

  • Scaling activity: Every scale-in, scale-out, node addition, or node removal in a node pool triggers a scaling activity. After a scaling activity is triggered, all scaling actions are automatically performed by the system and the relevant records are saved. You can view the scaling activity history of a node pool.

  • Replace the system disk: Some operations in a node pool, such as automatically adding existing nodes and changing the container runtime, initialize the node by replacing its system disk. The instance properties of the node, such as node name, instance ID, and IP address, do not change, but the data on the node's system disk is deleted. Data disks that are attached to the node are not affected.

    When ACK replaces a system disk, it drains the node. This process evicts the pods on the node to other available nodes and adheres to the Pod Disruption Budget (PDB). To ensure high service availability, we recommend that you use a multi-replica deployment to distribute workloads across multiple nodes and configure PDBs for critical services. This helps control the number of pods that can be disrupted at the same time.

  • In-place upgrade: An upgrade method that serves as an alternative to replacing the system disk. This method directly updates and replaces the required components on the node. An in-place upgrade does not replace the system disk or reinitialize the node, and the data on the node is not affected.

References