Common issues and solutions for cluster creation and management

This topic describes common issues and solutions for creating, using, and managing clusters.

Index

To troubleshoot issues with console access, components, nodes, pods, storage, or networks, see Troubleshooting.

Category	Subcategory	FAQ
Cluster creation and management	Cluster creation	Can I create a zero-node cluster? How do I add an existing ECS instance to a cluster? Can I add an existing pay-as-you-go ECS instance to a subscription node pool? Why am I prompted with "insufficient pods" for a newly created cluster? After I purchase a node, why are its available CPU and memory resources less than the specifications of its instance type?
Cluster creation and management	Cluster deletion	Delete a cluster How do I determine whether the desired number of nodes is enabled for a node pool? Billing FAQ In which lifecycle states does an ACK cluster not incur management fees? Why is the associated CLB instance not automatically released after an ACK cluster is deleted?
Cluster versions and upgrades	Cluster version	Can I stay on a specific cluster version and never upgrade? My cluster version is outdated. How can I quickly upgrade it? Does ACK support upgrades across multiple versions? How do I switch from Docker to containerd when upgrading a cluster from version 1.22 to 1.24? How does ACK ensure cluster upgrade stability? What do I need to know before upgrading a cluster?
Cluster versions and upgrades	Cluster upgrade	How do I manually upgrade my cluster? What are the recommended practices for upgrading a cluster? How long does a cluster upgrade take? Can I stay on a specific cluster version and never upgrade? Does ACK support upgrades across multiple versions? My cluster version is outdated. How can I quickly upgrade it? How do I switch from Docker to containerd when upgrading a version 1.22 cluster to 1.24? How does ACK ensure cluster upgrade stability? What do I need to know before upgrading a cluster? Can a cluster with an outdated version still function normally? Is version rollback supported after a cluster upgrade? If I need to both upgrade a cluster and migrate it to an ACK Managed Cluster Pro instance, which operation should I perform first? What should I do if the precheck reports deprecated APIs? What should I do if the precheck reports that a component version is too low? How do I handle a cluster upgrade failure with the error message "the aliyun service is not running on the instance"? How do I resolve the "PLEG not healthy" error on a node? What should I do if the "invalid object doesn't have additional properties" error occurs during a cluster upgrade?
Cluster connection and KubeConfig management	Obtain KubeConfig	How do I find the identity information associated with the certificate used in a KubeConfig file? How do I find the expiration date of the certificate used by a KubeConfig file? How do I resolve the "certificate is valid for" error when I use kubectl to connect to a cluster? How do I obtain the client certificate, client private key, and API Server information? Can an ACK managed cluster provide the cluster's root certificate key for generating KubeConfig certificates?
	Revoke KubeConfig credentials	What is the seven-day log check when revoking KubeConfig credentials? How do I interpret the results of the seven-day log check? In which scenarios can KubeConfig credentials not be revoked? Can accidentally revoked KubeConfig credentials be recovered? Can a specific version of KubeConfig credentials be recovered? What are the security best practices for KubeConfig management?
	KubeConfig recycle bin	Why are there multiple KubeConfig records for the same RAM user in the recycle bin for the same cluster? If multiple KubeConfig records are in the recycle bin, how do I identify the one to recover? Why is the Recover button for some entries in the recycle bin grayed out? What might cause a KubeConfig recovery to fail?
Cluster migration	Basic to Pro	Are services on a basic ACK managed cluster affected during migration? How long does the migration process take? Will the access path change after the cluster is migrated?
	Dedicated to managed cluster	Are services on a dedicated ACK cluster affected during migration? How long does the migration process take? Will the access path change after the migration? What do I do if the precheck for ACK Virtual Node environment variable configuration fails?
	Self-managed cluster	How do I migrate a self-managed Kubernetes cluster to ACK?
Other		Are clusters that run Alibaba Cloud Linux compatible with CentOS container images? If I select the containerd runtime when creating a cluster, can I change it to Docker later? What are the differences between the containerd, Docker, and Sandboxed-Container runtimes? Is ACK certified for MLPS 2.0 Level 3? Do ACK clusters support Istio? How do I collect diagnostic information for a Kubernetes cluster? How do I troubleshoot issues in an ACK cluster? How do I configure fine-grained authorization for a RAM user to manage an ACK cluster? Which IP address ranges must be allowed in the access control policy of the SLB instance for the API Server? What do I do if a namespace is stuck in the Terminating state for a long time? Related to dedicated clusters: How do I access a master node? Can I still upgrade a dedicated cluster after I accidentally delete one of its master nodes? Can master nodes be added to or removed from a dedicated cluster? What are the high-risk operations? What should I do if the API Server of a dedicated ACK cluster returns the error "api/v1/namespaces/xxx/resourcequotes": x509: certificate has expired or is not yet valid: current time XXX is after xxx?

Migrating self-managed clusters to ACK

ACK provides a migration solution to smoothly migrate a self-managed Kubernetes cluster to an ACK cluster with minimal impact on your business. For more information, see Overview of Kubernetes migration solutions.

Alibaba Cloud Linux and CentOS image compatibility

Yes. Clusters that run Alibaba Cloud Linux are compatible with CentOS container images. For more information, see Alibaba Cloud Linux 3.

Changing container runtime after cluster creation

No, you cannot change the container runtime after the cluster is created. However, you can create node pools that use different runtimes. For more information, see Create and manage a node pool.

To migrate the container runtime of a node from Docker to containerd, see Migrate the container runtime of a node from Docker to containerd.

Note

Docker is not supported as a built-in container runtime in clusters that run Kubernetes 1.24 or later. You must use containerd as the runtime for node pools in these clusters.

Comparison of container runtimes

Container Service for Kubernetes supports three runtimes: containerd, Docker, and Sandboxed-Container. We recommend that you use the containerd runtime. The Docker runtime is supported only in clusters that run Kubernetes 1.22 or earlier. The Sandboxed-Container runtime is supported only in clusters that run Kubernetes 1.24 or earlier. For a comparison of these runtimes, see Comparison of containerd, Sandboxed-Container, and Docker runtimes. When you upgrade an ACK cluster to Kubernetes 1.24 or later, you must migrate the container runtime of nodes from Docker to containerd. For more information, see Migrate the container runtime of a node from Docker to containerd.
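Before you upgrade to Kubernetes 1.24 or later, it can help to list the nodes that still run Docker. The following is a minimal sketch, not an official ACK tool. It assumes that you have kubectl access to the cluster. The helper name docker_nodes is hypothetical. It relies on the standard node field .status.nodeInfo.containerRuntimeVersion, which typically reports values such as docker://19.3.15 or containerd://1.6.20.

```shell
# Read "NAME RUNTIME" rows on stdin and print the names of nodes
# whose runtime version string starts with docker://.
docker_nodes() {
  awk '$2 ~ /^docker:\/\// { print $1 }'
}

# Usage against a live cluster (requires kubectl access):
# kubectl get nodes --no-headers \
#   -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion \
#   | docker_nodes
```

An empty result suggests that no node reports Docker as its runtime.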

ACK and MLPS 2.0 Level 3 certification

You can enable MLPS-based hardening for your cluster and configure baseline check policies. Based on Alibaba Cloud Linux, you can implement MLPS 2.0 Level 3 and configure baseline checks for MLPS compliance to meet the following requirements:

Identity authentication
Access control
Security audit
Intrusion prevention
Malicious code prevention

For more information, see Enable MLPS-based hardening for ACK clusters.

ACK and Istio support

Yes. ACK clusters support Istio through Alibaba Cloud Service Mesh (ASM). ASM is a service mesh product that is fully compatible with community Istio. Its fully managed control plane lets you focus on developing and deploying your business applications. ASM is compatible with various node operating systems and network plug-ins in ACK clusters. You can add an existing ACK cluster to an ASM instance and use features such as traffic management, fault handling, unified monitoring, and log management. For more information, see Add a cluster to an ASM instance. For information about ASM billing, see ASM billing.

Collect diagnostic information

If a Kubernetes cluster encounters an issue or a node is abnormal, you can use the one-click diagnostic feature provided by ACK to identify the problem. For more information, see Use cluster diagnostics.

If the cluster diagnostics feature does not meet your needs and you need to collect diagnostic information from master and abnormal worker nodes, follow the steps in the following sections to collect information from Linux or Windows nodes.

Linux node

Worker nodes can run Linux or Windows, but master nodes can run only Linux. The following method applies to both master and worker nodes that run Linux. This example uses a master node.

Log on to a master node of the Kubernetes cluster and run the following command to download the diagnostic script.
```
curl -o /usr/local/bin/diagnose_k8s.sh http://aliacs-k8s-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/public/diagnose/diagnose_k8s.sh
```
Note
The diagnostic script for Linux nodes can be downloaded only from the China (Hangzhou) region.
Run the following command to grant the diagnostic script execution permissions:
```
chmod u+x /usr/local/bin/diagnose_k8s.sh
```
Run the following command to change to the specified directory:
```
cd /usr/local/bin
```

Run the following command to run the diagnostic script:

```
diagnose_k8s.sh
```

The output is similar to the following. The name of the generated log file varies each time you run the script. This example uses diagnose_1514939155.tar.gz.

```
......
+ echo 'please get diagnose_1514939155.tar.gz for diagnostics'
please get diagnose_1514939155.tar.gz for diagnostics
+ echo 'Please upload diagnose_1514939155.tar.gz'
Please upload diagnose_1514939155.tar.gz
```

Run the following command to view the file that contains the cluster diagnostic information:
```
ls -ltr | grep diagnose_1514939155.tar.gz
```
Note
Replace diagnose_1514939155.tar.gz with the actual name of the log file in your environment.
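Because the archive name changes on every run, you may prefer to read it from the script output instead of typing it by hand. The following is a minimal sketch under stated assumptions: the helper names archive_name and collect_diagnostics are hypothetical, the script is assumed to be installed in /usr/local/bin as in the steps above, and the archive is assumed to be written to the directory where the script runs.

```shell
# Extract the first archive name (diagnose_<timestamp>.tar.gz)
# from the diagnostic script output on stdin.
archive_name() {
  grep -o 'diagnose_[0-9][0-9]*\.tar\.gz' | head -n 1
}

# Hypothetical wrapper: run the script in a subshell and list the archive it reports.
collect_diagnostics() (
  cd /usr/local/bin || exit 1
  name="$(./diagnose_k8s.sh 2>&1 | archive_name)"
  if [ -z "$name" ]; then
    echo "no archive name found in the script output" >&2
    exit 1
  fi
  ls -l "$name"
)

# Usage on the node:
# collect_diagnostics
```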

Windows node

To collect cluster diagnostic information from a Windows worker node, download and run the diagnostic script.

Note

Windows is supported only for worker nodes.

Log on to the abnormal worker node and open a command-line tool.
Run the following command to enter PowerShell mode:
```
powershell
```
Run the following command to download and run the diagnostic script.
You can download the diagnostic script for Windows nodes from your cluster's region. Replace [$Region_ID] in the command with the ID of your cluster's region.
```
Invoke-WebRequest -UseBasicParsing -Uri http://aliacs-k8s-[$Region_ID].oss-[$Region_ID].aliyuncs.com/public/pkg/windows/diagnose/diagnose.ps1 | Invoke-Expression
```
The following output indicates that the diagnostic information is successfully collected.
```
INFO: Compressing diagnosis clues ...
INFO: ...done
INFO: Please get diagnoses_1514939155.zip for diagnostics
```
Note
The diagnoses_1514939155.zip file is saved in the directory where the script is run.
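The region substitution in the download URL can be illustrated with a short shell helper, for example on a workstation before you paste the URL into PowerShell. This is only a sketch: the function name windows_diag_url is hypothetical, and cn-beijing is an example region ID that you must replace with the ID of your cluster's region.

```shell
# Build the download URL of the Windows diagnostic script for a region ID.
# The URL pattern is the one used in the Invoke-WebRequest command above.
windows_diag_url() {
  region="$1"
  if [ -z "$region" ]; then
    echo "usage: windows_diag_url <region-id>" >&2
    return 1
  fi
  printf 'http://aliacs-k8s-%s.oss-%s.aliyuncs.com/public/pkg/windows/diagnose/diagnose.ps1\n' \
    "$region" "$region"
}

# Example (cn-beijing is an assumed region ID):
# windows_diag_url cn-beijing
```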

Troubleshoot ACK cluster issues

Step 1: Check cluster nodes

Run the following command to view the status of the nodes in the cluster. Verify that all nodes exist and are in the Ready state.

```
kubectl get nodes
```

The expected output is similar to the following.

NAME                      STATUS   ROLES    AGE   VERSION
cn-hxxx.20   Ready    master   86m   v1.18.8-aliyun.1
cn-hxxx.21   Ready    master   84m   v1.18.8-aliyun.1
cn-hxxx.22   Ready    master   81m   v1.18.8-aliyun.1
cn-hxxx.23   Ready    <none>   78m   v1.18.8-aliyun.1
cn-hxxx.24   Ready    <none>   78m   v1.18.8-aliyun.1
cn-hxxx.25   Ready    <none>   78m   v1.18.8-aliyun.1

If all nodes exist and are in the Ready state, the cluster nodes are healthy.
If a node is abnormal, proceed to the next step to inspect it.

Run the following command to view detailed information and events for a node.
Replace [$NODE_NAME] with the name of your node.
```
kubectl describe node [$NODE_NAME]
```
Note
For more information about the fields in the kubectl describe node output, see Node Status.
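In clusters with many nodes, you can filter the kubectl get nodes output for nodes that are not Ready and describe only those. The following is a minimal sketch that assumes the default column layout shown earlier (NAME in column 1, STATUS in column 2). The helper name not_ready_nodes is hypothetical. A status such as Ready,SchedulingDisabled is treated as Ready.

```shell
# Read `kubectl get nodes` output (with header) on stdin and print
# the names of nodes whose STATUS does not start with "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage against a live cluster:
# kubectl get nodes | not_ready_nodes | while read -r node; do
#   kubectl describe node "$node"
# done
```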

Step 2: Check cluster components

If you cannot identify the issue after checking the cluster nodes, check the cluster component logs on the control plane.

Run the following command to view all components in the kube-system namespace.

```
kubectl get pods -n kube-system
```

The expected output is as follows.

NAME                                             READY   STATUS      RESTARTS   AGE
alicloud-monitor-controller-6fbd5454f9-tvsls     1/1     Running     0          91m
aliyun-acr-credential-helper-587bf4b6f8-bq2bg    1/1     Running     0          91m
cloud-controller-manager-74q86                   1/1     Running     0          91m
cloud-controller-manager-sktzk                   1/1     Running     0          91m
cloud-controller-manager-tkvtz                   1/1     Running     0          91m
coredns-64d57b9c4b-222pj                         1/1     Running     0          91m
coredns-64d57b9c4b-fcr8t                         1/1     Running     0          91m
csi-plugin-5hnn8                                 4/4     Running     0          91m
csi-plugin-6wxtm                                 4/4     Running     0          91m
csi-plugin-jdvg4                                 4/4     Running     0          91m
csi-plugin-njd28                                 4/4     Running     0          91m
csi-plugin-tvf2h                                 4/4     Running     0          91m
csi-plugin-zt76m                                 4/4     Running     0          91m
csi-provisioner-84c4866d86-874wm                 7/7     Running     0          91m
csi-provisioner-84c4866d86-wvj86                 7/7     Running     0          91m
ingress-nginx-admission-create-9gnv8             0/1     Completed   0          91m
ingress-nginx-admission-patch-wjskw              0/1     Completed   2          91m
kube-apiserver-cn-huhehaote.1xxx 0               1/1     Running     0          95m
kube-apiserver-cn-huhehaote.1xxx 1               1/1     Running     0          95m
kube-apiserver-cn-huhehaote.1xxx 2               1/1     Running     0          95m
kube-controller-manager-cn-huhehaote.1xxx 20     1/1     Running     1          100m
kube-controller-manager-cn-huhehaote.1xxx 21     1/1     Running     1          97m
kube-controller-manager-cn-huhehaote.1xxx 22     1/1     Running     1          91m
kube-flannel-ds-b5zt4                            1/1     Running     0          91m
kube-flannel-ds-blj25                            1/1     Running     0          91m
kube-flannel-ds-d8v7j                            1/1     Running     0          91m
kube-flannel-ds-dq6nz                            1/1     Running     0          91m
kube-flannel-ds-vx97g                            1/1     Running     0          91m
kube-flannel-ds-wp8cj                            1/1     Running     0          91m
kube-proxy-master-8kl67                          1/1     Running     0          91m
kube-proxy-master-mnqmt                          1/1     Running     0          91m
kube-proxy-master-zfns9                          1/1     Running     0          91m
kube-proxy-worker-j2gr2                          1/1     Running     0          91m
kube-proxy-worker-n69x8                          1/1     Running     0          91m
kube-proxy-worker-qrft5                          1/1     Running     0          100m
kube-scheduler-cn-huhehaote.1xxx l20             1/1     Running     0          97m
kube-scheduler-cn-huhehaote.1xxx l21             1/1     Running     0          97m
kube-scheduler-cn-huhehaote.1xxx l22             1/1     Running     0          95m
metrics-server-84f55db549-h9n4k                  1/1     Running     0          91m
nginx-ingress-controller-7474b6cc84-7pk7v        1/1     Running     0          91m
nginx-ingress-controller-7474b6cc84-hg8l9        1/1     Running     0          91m

Pods whose names start with kube- are system components of the Kubernetes cluster. Pods whose names start with coredns- are DNS plug-ins. This output indicates that the components are in a normal state. If a component is in an abnormal state, proceed to the next step.

Run the following command to view the logs of the abnormal component to identify and resolve the issue.
Replace [$Component_Name] with the name of the abnormal component.
```
kubectl logs -f [$Component_Name] -n kube-system
```
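To spot abnormal components quickly, you can filter the pod list instead of reading it line by line. The following is a minimal sketch that assumes the default kubectl get pods columns (NAME, READY, STATUS). The helper name abnormal_pods is hypothetical. It treats Completed pods as normal and flags pods that are not Running or whose ready count is lower than the container count.

```shell
# Read `kubectl get pods` output (with header) on stdin and print the names
# of pods that are neither Completed nor fully ready and Running.
abnormal_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")            # READY column, for example 3/4
    if ($3 == "Completed") next      # finished Job pods are expected
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Usage against a live cluster:
# kubectl get pods -n kube-system | abnormal_pods | while read -r pod; do
#   kubectl logs "$pod" -n kube-system --all-containers --tail=50
# done
```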

Step 3: Check the kubelet component

Run the following command to check the status of the kubelet.
```
systemctl status kubelet
```
If the status of the kubelet is not active (running), run the following command to view the kubelet logs to identify and resolve the issue.
```
journalctl -u kubelet
```
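The two commands above can be combined so that logs are printed only when the kubelet is not active. This is a minimal sketch that assumes a systemd-based node. The function names kubelet_is_active and check_kubelet are hypothetical, and the one-hour window and 50-line limit are arbitrary choices.

```shell
# Succeed only when the state string reported by `systemctl is-active` is "active".
kubelet_is_active() {
  [ "$1" = "active" ]
}

# Check the kubelet state and show recent log lines if it is not active.
check_kubelet() {
  state="$(systemctl is-active kubelet 2>/dev/null)"
  if kubelet_is_active "$state"; then
    echo "kubelet is active"
  else
    echo "kubelet state: ${state:-unknown}; showing recent log lines"
    journalctl -u kubelet --since "1 hour ago" --no-pager | tail -n 50
  fi
}

# Usage on the node:
# check_kubelet
```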

Common cluster issues

The following table lists some common causes of ACK cluster failures and their solutions.

Issue	Solution
The API Server component or a master component stops: You cannot create, stop, or update resources such as pods, Services, and Deployments. Existing pods and Services continue to work unless they need to call the Kubernetes API, which applications like the Kubernetes Dashboard rely on.	ACK components have built-in high availability. We recommend that you check whether the component itself is abnormal. For example, the API Server of an ACK cluster uses a CLB instance by default. You can troubleshoot the abnormal status of the CLB instance.
The backend data for the API Server is lost: The API Server cannot start. Existing pods and Services continue to work unless they need to call the Kubernetes API, which applications like the Kubernetes Dashboard rely on. You must restore or rebuild the data of the API Server to start the API Server.	If you created a snapshot, you can restore data from the snapshot when an issue occurs. If you did not create a snapshot, submit a ticket. After the issue is resolved, take the following measures to prevent this issue: Use the storage plug-ins provided by ACK for persistent storage. For more information, see Use dynamically provisioned disk volumes. Periodically create a snapshot for the data volume used by the kubelet. For more information, see Create a snapshot for a single disk volume.
An individual node shuts down, and all pods on that node stop running.	Use a workload such as a Deployment, StatefulSet, or DaemonSet to create pods instead of directly creating pods. This ensures that replacement pods are scheduled on healthy nodes if a node fails.
The kubelet component fails: Pods cannot be created on the node where the kubelet has failed. The kubelet may have incorrectly deleted some pods. The node is marked as `NotReady`. The Deployment or Replication Controller creates new pods on other nodes.	If you have created a snapshot, you can use it to restore data when an issue occurs. If you have not created a snapshot, submit a ticket to report the issue. After the issue is resolved, periodically create snapshots for the data volume that is used by the kubelet. For more information, see Create a snapshot for a single disk volume. Use a workload such as a Deployment, StatefulSet, or DaemonSet to create pods instead of directly creating pods. This ensures that pods are rescheduled to other healthy nodes.
Issues are caused by manual configurations or other reasons.	If you have created a snapshot, you can use it to restore data when an issue occurs. If you have not created a snapshot, submit a ticket to report the issue. After the issue is resolved, periodically create snapshots for the data volume that is used by the kubelet. For more information, see Create a snapshot for a single disk volume.

Configure fine-grained authorization

By default, a RAM user or RAM role does not have permissions to call the OpenAPI of any cloud service. To use and manage ACK clusters, you must grant the AliyunCSFullAccess system policy or a custom policy for Container Service for Kubernetes to the RAM user or RAM role. For more information, see Grant access permissions to clusters and cloud resources by using RAM.
To allow a RAM user to manage resources inside the cluster, such as creating Deployments and Services, you must also grant permissions through the Kubernetes RBAC mechanism.
If you need fine-grained control over read and write permissions on resources, configure custom ClusterRoles and Roles. For more information, see Use custom RBAC policies to restrict operations on resources in a cluster.
When a RAM user accesses the console, you must also configure the corresponding cloud service permissions to use features such as viewing node pool scaling activities and cluster monitoring dashboards. For more information, see Permissions required for the Container Service for Kubernetes console.
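As an illustration of a custom Role, the following sketch renders a namespace-scoped, read-only Role for pods and binds it to one user. It is an example under stated assumptions, not the procedure from the referenced topics: the function name render_pod_reader and the object names pod-reader and pod-reader-binding are hypothetical, and the subject name is assumed to be the ID of the RAM user. Verify how your cluster identifies RAM users before you apply it.

```shell
# Print a Role that allows read-only access to pods in namespace $1
# and a RoleBinding that grants it to user $2 (assumed: the RAM user ID).
render_pod_reader() {
  ns="$1"
  user="$2"
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: ${ns}
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: ${ns}
subjects:
- kind: User
  name: "${user}"
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
EOF
}

# Usage (namespace and user ID are placeholders):
# render_pod_reader dev 2000000000000001 | kubectl apply -f -
```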

What IP address ranges must be allowed for the SLB access control policy of a cluster's API Server?

The access control list (ACL) rules for the API Server's SLB instance must allow the following CIDR blocks.

The 100.104.0.0/16 CIDR block, which is reserved for the Container Service for Kubernetes control plane.
The primary and any additional CIDR blocks of the cluster's Virtual Private Cloud (VPC), or the CIDR blocks of the vSwitches where the cluster nodes reside.
The egress CIDR blocks of clients that need to access the API server.
For ACK Edge clusters, you must also add the egress CIDR blocks of edge nodes.
For ACK Lingjun clusters, you must also add the CIDR blocks of the Lingjun Virtual Private Datacenter (VPD).

For more information, see Configure the access control policy for the API Server.
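If you keep the ACL entries in a text file, a short check can confirm that the required CIDR blocks are present. The following is a minimal sketch: the function name missing_acl_entries is hypothetical, the input is assumed to contain one CIDR block per line, and the comparison is an exact string match, so it does not detect a required block that is covered by a wider entry.

```shell
# Read ACL entries (one CIDR per line) on stdin. Print every required CIDR
# that is missing: 100.104.0.0/16 plus any CIDR blocks passed as arguments.
missing_acl_entries() {
  acl="$(cat)"
  for cidr in 100.104.0.0/16 "$@"; do
    printf '%s\n' "$acl" | grep -Fxq -- "$cidr" || printf '%s\n' "$cidr"
  done
}

# Usage (acl.txt, the VPC CIDR, and the client CIDR are placeholders):
# missing_acl_entries 10.0.0.0/8 203.0.113.0/24 < acl.txt
```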

Access a master node

Dedicated cluster: For more information, see Connect to the master node of a dedicated ACK cluster by using SSH.
Managed cluster: The control plane nodes of an ACK managed cluster are fully managed. You cannot log on to the terminals of the control plane nodes. If you need to log on to control plane nodes, consider using a dedicated cluster.

If a master node of an ACK dedicated cluster is accidentally deleted, can the cluster be upgraded?

No. After a master node is deleted from a dedicated cluster, you cannot add master nodes or upgrade the cluster. You can create a new cluster instead. For more information, see Create a dedicated ACK cluster (discontinued).

ACK dedicated cluster: Can master nodes be removed or added, and what are the high-risk operations?

No. Adding or removing master nodes from a dedicated cluster may cause the cluster to become unusable and unrecoverable.

For the master nodes of a dedicated cluster, improper operations can render the master nodes or even the entire cluster unusable. High-risk operations include replacing master or etcd certificates, modifying core components, deleting or formatting data in core directories such as /etc/kubernetes on a node, and reinstalling the operating system. For more information, see High-risk operations related to clusters.

What should I do if the API Server of a dedicated ACK cluster returns the error `api/v1/namespaces/xxx/resourcequotes": x509: certificate has expired or is not yet valid: current time XXX is after xxx`?

Symptoms

When you create a pod in a dedicated cluster, the API Server returns a certificate expiration error, or the logs or events of the kube-controller-manager show a certificate expiration error. The error message is as follows.

"https://localhost:6443/api/v1/namespaces/xxx/resourcequotes": x509: certificate has expired or is not yet valid: current time XXX is after XXX

"https://[::1]:6443/api/v1/namespaces/xxx/resourcequotes": x509: certificate has expired or is not yet valid: current time XXX is after XXX

Cause

In Kubernetes, the API Server has a built-in certificate for its internal LoopbackClient. In the community version, this certificate has a validity period of 1 year and cannot be automatically rotated. It is rotated and updated only when the API Server pod restarts. If the cluster has not been upgraded for more than a year, the internal certificate expires, causing API requests to fail. For more information, see #86552.

To reduce the stability risks caused by the short validity period of certificates in the community version, ACK extends the default validity period of this built-in certificate to 10 years for clusters that run Kubernetes 1.24 or later. For more information about the changes and their scope of impact, see Product Change: Validity Period of ACK Cluster API Server Internal Certificates.

Solution

You can log on to a master node and run the following command to query the expiration time of the LoopbackClient certificate.

In the command, XX.XX.XX.XX is the local IP address of the master node.

```
curl --resolve apiserver-loopback-client:6443:XX.XX.XX.XX -k -v https://apiserver-loopback-client:6443/healthz 2>&1 | grep expire
```
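To turn the expiration date into a number of remaining days, you can parse the line that curl prints. The following is a minimal sketch that assumes curl reports the date in the form `expire date: Jan  1 00:00:00 2025 GMT` and that GNU date is available on the node. The function names expire_date and days_until are hypothetical.

```shell
# Extract the date from the first "expire date:" line of curl -v output on stdin.
expire_date() {
  sed -n 's/.*expire date: *//p' | head -n 1
}

# Print the whole days between a reference time ($2, epoch seconds)
# and the given date ($1). Requires GNU date for the -d option.
days_until() {
  end="$(date -u -d "$1" +%s)" || return 1
  echo $(( (end - $2) / 86400 ))
}

# Usage on a master node (XX.XX.XX.XX is the local IP address of the node):
# d="$(curl --resolve apiserver-loopback-client:6443:XX.XX.XX.XX -k -v \
#   https://apiserver-loopback-client:6443/healthz 2>&1 | expire_date)"
# days_until "$d" "$(date +%s)"
```

A negative or small result suggests that the certificate has expired or is about to expire.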

For clusters whose 1-year certificates have expired or are about to expire, upgrade the cluster to Kubernetes 1.24 or later. For more information, see Manually upgrade an ACK cluster. We also recommend that you migrate to an ACK Managed Cluster Pro instance. For more information, see Hot migrate a dedicated ACK cluster to an ACK Managed Cluster Pro instance.
For a dedicated cluster that cannot be upgraded in the short term, log on to each master node and manually restart the API Server to generate a new valid certificate.
- containerd node
```
crictl pods | grep kube-apiserver- | awk '{print $1}' | xargs -I '{}' crictl stopp {}
```
- Docker node
```
docker ps | grep kube-apiserver- | awk '{print $1}' | xargs -I '{}' docker restart {}
```