Common issues with nodes and node pools - Container Service for Kubernetes (ACK) - Alibaba Cloud Help Center

This topic answers frequently asked questions about nodes and node pools. Learn how to change the maximum number of pods on a node, update the OS image for a node pool, and troubleshoot node timeout issues.

Index

To diagnose and troubleshoot node issues, see Troubleshoot abnormal nodes.

Node pool creation
- How do I create a custom image from an existing ECS instance and use it to create nodes?
- How do I use a spot instance in a node pool?
- Can I configure multiple ECS instance types in a single node pool?
- How do I calculate the maximum number of pods per node?
- How do I adjust the pod capacity when a node reaches its pod limit?
- How do I modify node configurations?
- Can I disable the Expected Nodes feature?
- What is the difference between a node pool with and without Expected Nodes enabled?
- How do I add free nodes (nodes that are not managed by any node pool) to a node pool?

Node pool management
- Node pool OS image: How do I change the OS image of a node pool?
- Node resource reservation: How do I view total CPU and memory on a node?
- Adding an existing node:
  - After an ECS instance is added to a cluster, will upgrading or downgrading the instance affect cluster services?
  - What should I do if adding an existing node times out?
  - Can I add existing nodes of multiple ECS instance types to an ACK cluster?
  - How do I move a node across ACK clusters?
  - Does the expected number of nodes for a node pool automatically change after I add an existing node?
- Removing a node: What should I do if removing a node fails?
- Customizing kubelet configurations for a node pool:
  - Will my custom configurations be deprecated?
  - How do I use a configuration file to manage kubelet?
  - How do I modify a kubelet parameter that is not on the supported list?
- Customizing OS parameters for a node pool: Configuration using a configuration file
- Node auto-repair: What do I do if node auto-repair fails?
- How do I release a specific ECS instance?
- How do I change the hostname of a worker node in an ACK cluster?
- How do I manually upgrade the kernel on GPU nodes in an existing cluster?
- How do I fix container startup issues on GPU nodes?
- If a cluster with nodes across multiple zones fails, how does the cluster determine the node eviction policy?
- What is the kubelet directory path in an ACK cluster? Can I customize it?
- Can I mount a data disk to a custom directory in an ACK node pool?
- How do I modify the maximum number of file descriptors?
- What do I do if I receive the "UNPROTECTED PRIVATE KEY FILE!" error when I log on to a ContainerOS administrative container?
- Why does the console display the source of a node's node pool as Other Nodes?
- How do I configure a network ACL for the vSwitches to which cluster nodes are connected?

Node pool upgrades
- Can a node pool upgrade be rolled back?
- Are my services affected during an upgrade?
- How long does each batch upgrade take?
- Is node data lost during a node upgrade?
- Does the IP address of a node change after its system disk is replaced?
- How do I upgrade cluster nodes that do not belong to any node pool?
- How do I restore data from a snapshot?
- How do I upgrade the container runtime for worker nodes that do not belong to any node pool?

Adjusting node pod capacity
- In Terway mode, how do I view the maximum number of pods that use the container network on a node?
- How do I view the maximum number of pods supported by an existing node?
- Why is the pod count on a node near its limit immediately after cluster creation?
- In Terway mode, can I manually modify the number of ENIs or the total pod quota to increase the pod limit per node?
- Why do nodes with the same CPU and memory specifications support different maximum numbers of pods?

Migrating the node runtime from Docker to containerd
- How long does each batch upgrade take?
- Are my services affected during an upgrade?
- Can I roll back the migration from Docker to containerd?
- Is node data lost during the migration from Docker to containerd?
- Does the IP address of a node change after its system disk is replaced?
- How compatible is containerd with Docker?
- What do I do if I previously built images on cluster nodes using Docker and now the runtime is upgraded to containerd?
- What do I do if the Docker directory is not cleaned up and occupies disk space after the node runtime switches from Docker to containerd?

Virtual nodes
- How do I use virtual nodes to implement high availability for a service deployed across zones?
- Do virtual nodes support GPU resources?
- How do I prioritize ECS instances over elastic container instances for pod scheduling and prioritize elastic container instances over ECS instances for pod scale-in?
- What do I do if certificate verification fails when a virtual node pulls images from a self-managed image repository over HTTPS?
- After I create an Elastic Container Instance-based pod by specifying the number of vCPUs and memory size, is the pod billed based on the resource specification or the actual resource usage?

Using spot instances in a node pool

To use spot instances, create a new node pool or use the spot-instance-advisor command. For more information, see Best practices for spot instance node pools.

To maintain consistency within a node pool, you cannot convert a spot instance node pool to a pay-as-you-go or subscription node pool, or vice versa.

Multiple ECS instance types per node pool

Yes. We recommend configuring multiple ECS instance types in a node pool so that scale-out does not fail when an instance type is unavailable or out of stock. To do this, configure multiple vSwitches across multiple availability zones, select multiple ECS instance types, or specify instance types based on vCPU and memory. After the node pool is created, you can view its scalability level in the console and add instance types based on the recommendations shown there.

For a list of supported instance types and node configuration recommendations, see ECS instance type configuration recommendations.

Maximum number of pods per node

The calculation for the maximum number of Pods per node depends on the cluster's network plugin. For more information, see Maximum number of Pods per node.

Terway: The maximum number of Pods per node is the sum of Pods on the container network and on the host network.
Flannel: The limit is the Number of Pods per Node value specified during cluster creation.

You can view the maximum number of Pods per node, also known as the Pod Quota, on the Nodes > Nodes page of the console.

You cannot change the maximum number of Pods per node after creating a cluster. If a node reaches this limit, scale out your nodes to increase Pod capacity. For more information, see Adjust available Pods on a node.
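As a command-line complement to the console view, the Pod limit that the kubelet reports can usually be read from the node object. The following is a minimal sketch, assuming kubectl is already connected to the cluster. The helper name `node_pod_quota` is hypothetical, and the reported value is expected, but not guaranteed here, to match the Pod Quota shown in the console.

```shell
# Print the maximum number of Pods that the kubelet reports for one node.
# Assumes kubectl is configured for the cluster; the helper name is illustrative.
node_pod_quota() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.pods}'
}

# Example (not run here): node_pod_quota <node-name>
```

Replace `<node-name>` with a name listed by `kubectl get nodes`.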

Adjust node pod capacity

The network plug-in determines the maximum number of pods that a worker node can support, a limit that typically cannot be changed. In Terway mode, the limit depends on the number of elastic network interfaces (ENIs) that the ECS instance provides. In Flannel mode, the limit is specified when you create a cluster and cannot be modified later. If you reach the pod limit, we recommend adding nodes to your node pool to increase the total number of available pods.

For more information, see Increase the maximum number of pods in a cluster.

Modify node configuration

To ensure service stability, certain parameters—specifically those related to availability and networking—are immutable after a node pool is created. For example, you cannot change the container runtime or the VPC to which a node belongs.
For mutable parameters, changes to the node pool configuration apply only to new nodes. Existing nodes are not affected unless you use specific options such as Update ECS Tags of Existing Nodes or Update Labels and Taints of Existing Nodes.

See Create and manage node pools for details on modifiable parameters and when changes take effect.

Alternatively, to apply a new configuration, you can create a new node pool with the desired configuration. Then, cordon and drain the nodes in the old node pool to migrate your workloads. After the migration is complete, you can release the instances in the old node pool. For instructions, see Cordon and drain nodes.
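The cordon-and-drain step of this migration can be sketched from the command line as follows. This is an illustration under stated assumptions, not the documented procedure: `drain_nodes_by_label` is a hypothetical helper, the label selector that identifies the old node pool is supplied by you, and the drain flags simply mirror the ones used in the GPU section of this topic.

```shell
# Cordon and then drain every node matched by a label selector.
# The selector is supplied by the caller. Stops at the first failure so that
# a node that cannot be drained is not silently skipped.
drain_nodes_by_label() {
  selector=$1
  for node in $(kubectl get nodes -l "$selector" -o name); do
    name=${node#node/}   # "kubectl get -o name" prints node/<name>
    kubectl cordon "$name" || return 1
    kubectl drain "$name" --grace-period=120 --ignore-daemonsets=true || return 1
  done
}
```

A call might look like `drain_nodes_by_label 'alibabacloud.com/nodepool-id=<old-node-pool-id>'`. That label key is an assumption; confirm the labels on your nodes with `kubectl get nodes --show-labels` before relying on it.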

Can I disable Expected Nodes?

If the Scaling Mode of a node pool is set to Manual, you must configure the Expected Nodes. This feature cannot be disabled.

To remove a specific node, see Remove a node. To add a specific node, see Add an existing node. After you remove a node or add an existing node, Expected Nodes automatically updates to the new node count. You do not need to change it manually.

Node pools with and without Expected Nodes

The Expected Nodes parameter defines the intended capacity of a node pool. You can scale a node pool in or out by adjusting this value. However, some legacy node pools may not have this feature enabled.

The following table describes how the system responds to operations for node pools with and without the Expected Nodes feature enabled.

Actions	Expected nodes enabled	Expected nodes disabled (legacy)	Recommendation
Scale in by reducing Expected Nodes in the ACK console or by using OpenAPI.	The system terminates nodes until the actual node count matches the new Expected Nodes value.	If the current node count is greater than the specified value, the system terminates the excess nodes. This action also enables the Expected Nodes feature for the node pool.	None.
Remove a specific node from the ACK console or by using OpenAPI.	The Expected Nodes value decreases by the number of nodes removed. For example, if the Expected Nodes value is 10 and you remove 3 nodes, the value becomes 7.	The specified node is removed from the cluster.	None.
Remove a node by running `kubectl delete node`.	The Expected Nodes value remains unchanged.	No change.	Not recommended.
Manually release an ECS instance from the ECS console or by using OpenAPI.	The system automatically creates a new ECS instance to maintain the Expected Nodes count.	The node pool is unaware of the change. No new ECS instance is created. The deleted node temporarily displays an Unknown status.	Not recommended. This causes data inconsistency between ACK and Auto Scaling (ESS). Use the recommended method to remove nodes. For more information, see Remove a node.
A subscription ECS instance expires.	The system automatically creates a new ECS instance to maintain the Expected Nodes count.	The node pool is unaware of the change. No new ECS instance is created. The deleted node temporarily displays an Unknown status.	Not recommended. This causes data inconsistency between ACK and ESS. Use the recommended method to remove nodes. For more information, see Remove a node.
An ECS instance in an ESS scaling group with health checks enabled fails a health check (for example, because the instance is stopped).	The system automatically creates a new ECS instance to maintain the Expected Nodes count.	The system creates a new ECS instance to replace the failed one.	Not recommended. Do not directly manage scaling groups that are associated with a node pool.
You remove an ECS instance from an ESS scaling group without modifying the expected instance count.	The system automatically creates a new ECS instance to maintain the Expected Nodes count.	No new ECS instance is created.	Not recommended. Do not directly manage scaling groups that are associated with a node pool.

Migrate unmanaged nodes to a node pool

In older ACK clusters created before the node pool feature was released, some worker nodes may not belong to any node pool. Migrate these unmanaged nodes into a node pool to bring them under grouped management and automated maintenance.

To do this, create a new node pool or scale out an existing one, remove the unmanaged nodes from the cluster, and then add their ECS instances to the target node pool. For more information, see Migrate unmanaged nodes to a node pool.

Replace the OS image of a node pool

You can replace the operating system of a node pool, for example, to migrate from a version that has reached its end-of-life (EOL) to a supported one. Before you begin, consult the OS image release notes for supported operating systems, the latest image versions, and usage limitations.

See Replace the OS of a node pool for detailed instructions and considerations.

Release a specific ECS instance

To release a specific ECS instance, remove the node. This action automatically updates the expected node count. Do not attempt to release a specific instance by changing the expected node count, as this triggers a random scale-in and is not guaranteed to remove the intended instance.

How to fix node addition timeouts?

Check network connectivity between the node and the API Server CLB. First verify that the security group meets the requirements. For security group limitations when adding an existing node, see Limitations. For other network connectivity issues, see Network management FAQ.

Change worker node hostnames

You cannot customize a worker node's hostname after you create the cluster. As a workaround, you can use the node pool's naming rule to change the hostname.

Note

When you create a cluster, you can define the hostname of a worker node in the Custom Node Name parameter. For more information, see Create an ACK managed cluster.

1. Remove the node. For more information, see Remove a node.
2. Add the node that you removed back to the node pool. For more information, see Manually add nodes.
The node is then automatically renamed based on the naming rule of the node pool.

Manually upgrade a GPU node kernel

This topic describes how to manually upgrade the kernel and the corresponding NVIDIA driver on a GPU node in an existing cluster.

Note

This procedure applies when the current kernel version is earlier than 3.10.0-957.21.3.

Upgrading the kernel is a sensitive operation. Confirm your target kernel version and proceed with caution.

This guide focuses on the NVIDIA driver upgrade required after a kernel upgrade. The kernel upgrade process itself is not covered.

Obtain a cluster's kubeconfig and connect with kubectl.

Cordon the GPU node (for example, node cn-beijing.i-2ze19qyi8votgjz*****).

kubectl cordon cn-beijing.i-2ze19qyi8votgjz*****

node/cn-beijing.i-2ze19qyi8votgjz***** cordoned

Drain the GPU node where you want to upgrade the driver.

kubectl drain cn-beijing.i-2ze19qyi8votgjz***** --grace-period=120 --ignore-daemonsets=true

node/cn-beijing.i-2ze19qyi8votgjz***** cordoned
WARNING: Ignoring DaemonSet-managed pods: flexvolume-9scb4, kube-flannel-ds-r2qmh, kube-proxy-worker-l62sf, logtail-ds-f9vbg
pod/nginx-ingress-controller-78d847fb96-***** evicted

Uninstall the current NVIDIA driver.
Note
The driver package uninstalled in this step is version 384.111. If your driver version is not 384.111, you need to download the corresponding driver installer from the official NVIDIA website and replace 384.111 in this step with your actual version.
1. Log in to the GPU node and run nvidia-smi to check the driver version.
```
sudo nvidia-smi -a | grep 'Driver Version'
Driver Version                      : 384.111
```
2. Download the NVIDIA driver installer.
```
cd /tmp/
sudo curl -O https://cn.download.nvidia.cn/tesla/384.111/NVIDIA-Linux-x86_64-384.111.run
```
  Note
  You must use the installer to uninstall the NVIDIA driver.
3. Uninstall the current NVIDIA driver.
```
sudo chmod u+x NVIDIA-Linux-x86_64-384.111.run
sudo sh ./NVIDIA-Linux-x86_64-384.111.run --uninstall -a -s -q
```
Upgrade the kernel.
Follow your operating system's procedures to upgrade the kernel.
Restart the GPU instance.
```
sudo reboot
```
Log in to the GPU node again and install the kernel-devel package that matches the running kernel.
```
sudo yum install -y kernel-devel-$(uname -r)
```

Go to the official NVIDIA website to download and install the required NVIDIA driver. This topic uses version 410.79 as an example.

# Change to the /tmp directory.
cd /tmp/

# Download the NVIDIA driver installer.
sudo curl -O https://cn.download.nvidia.cn/tesla/410.79/NVIDIA-Linux-x86_64-410.79.run

# Add executable permissions to the installer.
sudo chmod u+x NVIDIA-Linux-x86_64-410.79.run

# Run the installer in silent mode.
sudo sh ./NVIDIA-Linux-x86_64-410.79.run -a -s -q

# Warm up the GPU.
sudo nvidia-smi -pm 1 || true
sudo nvidia-smi -acp 0 || true
sudo nvidia-smi --auto-boost-default=0 || true
sudo nvidia-smi --auto-boost-permission=0 || true
sudo nvidia-modprobe -u -c=0 -m || true

Check /etc/rc.d/rc.local to confirm whether it contains the following configuration. If not, add it manually.

sudo nvidia-smi -pm 1 || true
sudo nvidia-smi -acp 0 || true
sudo nvidia-smi --auto-boost-default=0 || true
sudo nvidia-smi --auto-boost-permission=0 || true
sudo nvidia-modprobe -u -c=0 -m || true

Restart kubelet and Docker.

sudo service kubelet stop
sudo service docker restart
sudo service kubelet start

Uncordon the GPU node to allow pods to be scheduled on it again.

kubectl uncordon cn-beijing.i-2ze19qyi8votgjz*****

node/cn-beijing.i-2ze19qyi8votgjz***** uncordoned

Verify the driver version from inside the device plugin pod on the GPU node.

kubectl exec -n kube-system -t nvidia-device-plugin-cn-beijing.i-2ze19qyi8votgjz***** -- nvidia-smi
Thu Jan 17 00:33:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   27C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Note

If you run the docker ps command and find that no containers are started on the GPU node, see Fix GPU node container startup issues.

Fix container startup on GPU nodes

On a GPU node running certain versions of Kubernetes, containers may fail to start after you restart the kubelet and Docker services. The sudo docker ps command returns an empty list.

sudo service kubelet stop
# Redirecting to /bin/systemctl stop kubelet.service
sudo service docker stop
# Redirecting to /bin/systemctl stop docker.service
sudo service docker start
# Redirecting to /bin/systemctl start docker.service
sudo service kubelet start
# Redirecting to /bin/systemctl start kubelet.service

sudo docker ps
# CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

This issue occurs when the Cgroup Driver used by Docker does not match the one expected by the kubelet. To diagnose the issue, check Docker's Cgroup Driver.

sudo docker info | grep -i cgroup
Cgroup Driver: cgroupfs

If the output is cgroupfs, it confirms a mismatch, as the kubelet is configured to use the systemd driver.

To fix this issue, change the Docker Cgroup Driver to systemd.

Back up /etc/docker/daemon.json, and then run the following command to update /etc/docker/daemon.json.

sudo tee /etc/docker/daemon.json >/dev/null <<-EOF
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "10"
    },
    "oom-score-adjust": -1000,
    "storage-driver": "overlay2",
    "storage-opts":["overlay2.override_kernel_check=true"],
    "live-restore": true
}
EOF

Restart Docker and kubelet to apply the changes.

sudo service kubelet stop
# Redirecting to /bin/systemctl stop kubelet.service
sudo service docker restart
# Redirecting to /bin/systemctl restart docker.service
sudo service kubelet start
# Redirecting to /bin/systemctl start kubelet.service

Verify that the Docker Cgroup Driver is set to systemd.

sudo docker info | grep -i cgroup
Cgroup Driver: systemd
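The check used for diagnosis and verification in this section can be wrapped in a small helper that reads `docker info` output and compares the driver with the expected one. This is a sketch: the function names are hypothetical, and the default of systemd reflects the kubelet setting described in this section.

```shell
# Read `docker info` output on stdin and print the Cgroup Driver value.
cgroup_driver() {
  awk -F': *' 'tolower($1) ~ /cgroup driver/ { print $2; exit }'
}

# Return 0 when the driver read from stdin equals the expected one.
# The expected driver defaults to systemd.
cgroup_driver_matches() {
  expected=${1:-systemd}
  [ "$(cgroup_driver)" = "$expected" ]
}
```

For example, `sudo docker info 2>/dev/null | cgroup_driver_matches || echo "Cgroup Driver mismatch"` prints the message only while Docker still reports a driver other than systemd.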

Migrate Pods from a failed node

To migrate application Pods from a failed node, mark the node as unschedulable and then drain it. This process safely evicts the Pods and reschedules them onto healthy nodes.

Log on to the ACK console. On the Nodes page, find the failed node. In the Actions column, choose More > Drain.
Troubleshoot the failed node. For more information, see Troubleshoot node issues.

Node eviction policy during availability zone failures

When a node becomes unhealthy, the node controller initiates an eviction. The default eviction rate is 0.1 nodes per second, controlled by the --node-eviction-rate parameter. This means Pods are evicted from at most one node every 10 seconds.

However, for an ACK cluster with nodes in multiple availability zones, the node controller adjusts this policy based on the health status of each availability zone and the cluster size.

An availability zone can be in one of three health states.

FullDisruption: The availability zone has no healthy nodes and at least one unhealthy node.
PartialDisruption: The availability zone contains at least two unhealthy nodes, and the ratio of unhealthy nodes to total nodes, calculated as unhealthy nodes / (unhealthy nodes + healthy nodes), exceeds 0.55.
Normal: The availability zone does not meet the criteria for FullDisruption or PartialDisruption.

Clusters are also classified by size:

Large cluster: A cluster with more than 50 nodes.
Small cluster: A cluster with 50 or fewer nodes.

The node controller determines the eviction rate based on these states:

If all availability zones are in a FullDisruption state, eviction is disabled for the entire cluster.
If at least one availability zone is not in a FullDisruption state, the eviction rate is determined as follows:
- For an availability zone in a FullDisruption state, the eviction rate is set to the default value of 0.1 nodes per second, regardless of cluster size.
- For an availability zone in a PartialDisruption state, the eviction rate depends on the cluster size. In a large cluster, the rate is reduced to 0.01 nodes per second. In a small cluster, the rate is set to 0, which disables eviction.
- For an availability zone in a Normal state, the eviction rate is set to the default value of 0.1 nodes per second, regardless of cluster size.

For more information, see Rate limits on eviction.
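The decision rules above can be restated as a short sketch. It follows the thresholds exactly as worded in this topic (the upstream node controller may draw the boundaries slightly differently), and the function names and argument order are hypothetical. Rates are in nodes per second, so 0.1 corresponds to at most one node every 10 seconds and 0.01 to one node every 100 seconds.

```shell
# Classify one availability zone from its unhealthy and healthy node counts.
zone_state() {
  unhealthy=$1
  healthy=$2
  total=$((unhealthy + healthy))
  if [ "$healthy" -eq 0 ] && [ "$unhealthy" -ge 1 ]; then
    echo FullDisruption
  # ratio > 0.55, written with integers: unhealthy * 100 > total * 55
  elif [ "$unhealthy" -ge 2 ] && [ $((unhealthy * 100)) -gt $((total * 55)) ]; then
    echo PartialDisruption
  else
    echo Normal
  fi
}

# Print the eviction rate (nodes per second) for one zone.
# Args: zone state, cluster node count, "yes" if every zone is FullDisruption.
eviction_rate() {
  state=$1
  nodes=$2
  all_full=$3
  if [ "$all_full" = yes ]; then
    echo 0          # eviction disabled for the entire cluster
    return
  fi
  case $state in
    PartialDisruption)
      # Large cluster (more than 50 nodes): reduced rate; small cluster: disabled.
      if [ "$nodes" -gt 50 ]; then echo 0.01; else echo 0; fi ;;
    *) echo 0.1 ;;  # FullDisruption or Normal: default rate
  esac
}
```

For instance, a zone with 3 unhealthy and 2 healthy nodes has a ratio of 0.6, so `zone_state 3 2` prints PartialDisruption, and `eviction_rate PartialDisruption 60 no` prints 0.01.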

Kubelet path customization

No. ACK does not support customizing the kubelet path. The path is fixed at /var/lib/kubelet. Do not change it.

Mount a data disk to a custom directory

This feature is currently in canary release. To enable this feature, submit a ticket. Once enabled, the system automatically formats and mounts any data disk that you add to the node pool to a specified directory. The mount directory has the following restrictions.

Do not mount a data disk to the following critical operating system directories:
- /
- /etc
- /var/run
- /run
- /boot
Do not mount a data disk to the following directories used by the system and container runtimes, or their subdirectories:
- /usr
- /bin
- /sbin
- /lib
- /lib64
- /ostree
- /sysroot
- /proc
- /sys
- /dev
- /var/lib/kubelet
- /var/lib/docker
- /var/lib/containerd
- /var/lib/container
Each data disk must have a unique mount directory.
The mount directory must be an absolute path that starts with /.
The mount directory must not contain carriage return or line feed characters (\r and \n) or end with a backslash (\).
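To check a candidate path against these restrictions before you submit it, a local pre-check along the following lines may help. It is a sketch of the rules as listed above, not the validation that ACK itself runs. `valid_mount_dir` is a hypothetical name, and ignoring a trailing slash is an added assumption. Uniqueness across data disks is not covered and must be checked separately.

```shell
# Return 0 if a mount directory passes the restrictions listed above, 1 otherwise.
valid_mount_dir() {
  dir=$1
  cr=$(printf '\r')
  nl='
'
  # Must be an absolute path that starts with /.
  case $dir in
    /*) ;;
    *) return 1 ;;
  esac
  # Must not contain CR or LF characters or end with a backslash.
  case $dir in
    *"$cr"* | *"$nl"* | *\\) return 1 ;;
  esac
  # Assumption: a trailing slash is ignored when comparing against the lists.
  [ "$dir" = / ] || dir=${dir%/}
  # Critical operating system directories (exact match only).
  case $dir in
    / | /etc | /var/run | /run | /boot) return 1 ;;
  esac
  # System and container runtime directories, including their subdirectories.
  for p in /usr /bin /sbin /lib /lib64 /ostree /sysroot /proc /sys /dev \
           /var/lib/kubelet /var/lib/docker /var/lib/containerd /var/lib/container; do
    case $dir in
      "$p" | "$p"/*) return 1 ;;
    esac
  done
  return 0
}
```

With this helper, `valid_mount_dir /mnt/disk1 && echo ok` prints ok, while paths such as /etc, /usr/local, or a relative path are rejected.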

Modify file descriptor limits

The maximum number of file descriptors limits the number of files that can be open simultaneously. Alibaba Cloud Linux and CentOS systems have two levels of file descriptor limits:

System-level: The maximum number of files that all processes on the system can open simultaneously.
User-level: The maximum number of files that a single user's processes can open.

Container environments have an additional file descriptor limit: the maximum number of file descriptors per process within a container.

Note

A node pool upgrade may overwrite changes made manually from the command line. To ensure your settings persist, edit the node pool.

Modify system-level file descriptor limit

For instructions, see Customize OS parameters for a node pool.

Modify per-process file descriptor limit

Log on to the node and check the /etc/security/limits.conf file.
```
cat /etc/security/limits.conf
```
Use the following parameters to configure the maximum number of file descriptors for a single process on the node:
```
...
root soft nofile 65535
root hard nofile 65535
* soft nofile 65535
* hard nofile 65535
```
Run the sed command to modify the maximum number of file descriptors. The recommended value is 65535.
```
sed -i "s/nofile.[0-9]*$/nofile 65535/g" /etc/security/limits.conf
```
Log on to the node again and run the following command to verify your change.
If the output matches your configured value, the change was successful.
```
ulimit -n
65535
```
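Because the sed command above edits /etc/security/limits.conf in place, it can be useful to preview the substitution first. The sketch below applies the same expression to text on standard input without touching any file. `preview_nofile_change` is a hypothetical helper, and the optional argument is the target value.

```shell
# Apply the same substitution as the sed command above to text on stdin,
# without modifying any file, so that the result can be reviewed first.
preview_nofile_change() {
  sed "s/nofile.[0-9]*$/nofile ${1:-65535}/"
}
```

For example, `preview_nofile_change < /etc/security/limits.conf` prints the file as it would look after the change. Note that the expression matches exactly one character between nofile and the number, so an entry that separates the two with several spaces or tabs is left unchanged and needs to be edited by hand.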

Modify container file descriptor limit

Important

Modifying the file descriptor limit for a container requires restarting the Docker or containerd service, which will interrupt running containers. To avoid service interruptions, perform this operation during off-peak hours.

Log on to the node and run the following command to view the configuration file.
- containerd node: cat /etc/systemd/system/containerd.service
- Docker node: cat /etc/systemd/system/docker.service
The following parameters set the file descriptor limit for a single process inside a container:
```
...
LimitNOFILE=1048576
LimitNPROC=1048576
...
```

Run the following commands to modify the parameter values. The recommended value for the file descriptor limit is 1048576.

containerd node:

sed -i "s/LimitNOFILE=[0-9a-zA-Z]*$/LimitNOFILE=1048576/g" /etc/systemd/system/containerd.service && sed -i "s/LimitNPROC=[0-9a-zA-Z]*$/LimitNPROC=1048576/g" /etc/systemd/system/containerd.service && systemctl daemon-reload && systemctl restart containerd

Docker node:

sed -i "s/LimitNOFILE=[0-9a-zA-Z]*$/LimitNOFILE=1048576/g" /etc/systemd/system/docker.service && sed -i "s/LimitNPROC=[0-9a-zA-Z]*$/LimitNPROC=1048576/g" /etc/systemd/system/docker.service && systemctl daemon-reload && systemctl restart docker

Run the following command to check the file descriptor limit for a single process inside the container.

If the output matches your configured value, the change was successful.

containerd node:

cat /proc/$(pidof containerd)/limits | grep files
Max open files            1048576              1048576              files

Docker node:

cat /proc/$(pidof dockerd)/limits | grep files
Max open files            1048576              1048576              files
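To compare the effective limit with the configured value in a script, the soft limit can be extracted from the limits text shown above. This is a minimal sketch with hypothetical helper names. It reads the content of /proc/<pid>/limits from standard input.

```shell
# Print the soft "Max open files" limit from /proc/<pid>/limits text on stdin.
soft_nofile_limit() {
  awk '/^Max open files/ { print $4; exit }'
}

# Return 0 when the soft limit read from stdin equals the expected value.
nofile_limit_is() {
  [ "$(soft_nofile_limit)" = "$1" ]
}
```

On a containerd node, for example, `nofile_limit_is 1048576 < "/proc/$(pidof containerd)/limits" && echo ok` prints ok once the new limit is in effect.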

Upgrade container runtime for unmanaged worker nodes

Legacy clusters created before node pools were introduced may contain unmanaged worker nodes. To upgrade the container runtime for these nodes, you must migrate them to a node pool.

Follow these steps:

Create a node pool: If no suitable node pool exists in the cluster, create one with a configuration that matches the unmanaged nodes.
Remove the node: When you remove a node, the system cordons it (marks it as unschedulable) and then drains its pods to evict them. If the drain fails, the removal process halts. The node is removed from the cluster only if the drain succeeds.
Add an existing node: Add the node to an existing node pool. Alternatively, you can create a node pool with zero nodes and then add the node to it. After the node is added, its container runtime automatically updates to match the one specified in the node pool's configuration.
Note
While the node pool feature itself is free of charge, you are billed for the underlying cloud resources, such as ECS instances, used by the node pool. For more information, see Cloud resource fees.

Node pool displayed as "Other Nodes"

ACK provides standard methods to add compute resources to a cluster through the console, OpenAPI, or CLI. For more information, see Add an existing node. If you add nodes using methods outside of standard ACK workflows, ACK cannot identify their source and assigns them to the Other Nodes group on the Nodes page. ACK cannot manage these nodes through a node pool, so features like lifecycle management, automated O&M, and guaranteed technical support are unavailable.

If you continue to use these nodes, you must ensure their compatibility with cluster add-ons and assume all potential risks. These risks include, but are not limited to, the following:

Version compatibility: During control plane or system component upgrades, the operating system and components on these unmanaged nodes may become incompatible with the new versions, which can cause service disruptions.
Workload scheduling compatibility: The cluster may fail to accurately report the status of these nodes, such as their availability zone and remaining resource capacity. This can lead to incorrect workload scheduling decisions, causing availability issues or performance degradation.
Data plane compatibility: The compatibility of node-side components and the operating system with the cluster's control plane and system components is not validated, posing potential stability risks.
O&M compatibility: Maintenance operations on these nodes through the console or OpenAPI may fail or produce unexpected results because the management channel and execution environment for these nodes are not verified.

Configure network ACLs for node vSwitches

If a node pool's vSwitch has a network ACL that denies traffic from required CIDR blocks, new nodes will fail to join the cluster and remain in a Failed or Offline state.

Follow these steps to allow the required CIDR blocks and re-add nodes:

Configure network ACL rules. In the inbound and outbound rules, allow traffic from the following CIDR blocks:
1. 100.104.0.0/16: The management CIDR block for the ACK control plane.
2. 100.64.0.0/10: The Alibaba Cloud internal service CIDR block.
3. 100.100.100.200/32: The ECS instance metadata service endpoint.
4. The primary and any secondary CIDR blocks of the cluster's VPC, or the CIDR block of the vSwitch containing the nodes.
Remove faulty nodes. Remove any nodes that were in a Failed or Offline state before the new network ACL rules took effect.
Add new nodes. Create a node pool or scale out an existing node pool to add new nodes. For more information, see Create and manage node pools. A Ready status on the new nodes confirms that the network ACL rules are configured correctly.
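When you review ACL rules, it can help to check whether a given address is covered by one of the required CIDR blocks. The following POSIX shell sketch does that for IPv4. The helper names are hypothetical. The literal blocks come from the list above, and the VPC and vSwitch CIDR blocks are specific to your cluster.

```shell
# Convert a dotted IPv4 address to an integer (octets without leading zeros).
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Return 0 if the IPv4 address ($2) falls inside the CIDR block ($1).
cidr_contains() {
  net=${1%/*}
  bits=${1#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$net") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}
```

For example, `cidr_contains 100.64.0.0/10 100.100.100.200 && echo covered` prints covered, while an address such as 10.0.0.1 is not matched by 100.104.0.0/16.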