Usage notes and risky operations


Container Service for Kubernetes (ACK) manages the technical architecture and core components of container services. However, improper operations on unmanaged components and applications that run in ACK clusters can cause service failures. To anticipate and prevent operational risks, carefully read the recommendations and notes in this topic before you use ACK.

Index

Item

Related topics

Usage notes

Risky operations

Usage notes

Data plane components

Data plane components are system components that run on your ECS instances, such as CoreDNS, Ingress, kube-proxy, Terway, and kubelet. Because these components run on your ECS instances, Alibaba Cloud and you share responsibility for keeping them stable.

ACK provides the following support for data plane components:

  • Provides features for component parameter management, periodic feature optimization, bug fixes, and CVE patches, along with corresponding guidance.

  • Provides observability features, such as monitoring and alerts for components. For some core components, logs are provided and exposed to you through Simple Log Service (SLS).

  • Provides configuration best practices and recommendations. ACK provides component configuration recommendations based on the cluster scale.

  • Provides periodic inspection and alert notification capabilities for components. Inspections cover items such as component versions, configurations, payloads, deployment topologies, and the number of instances.

Follow these recommendations when you use data plane components:

  • Use the latest component versions. New versions are frequently released to fix bugs or provide new features. After a new version is released, choose an appropriate time to upgrade the component according to the instructions in the component upgrade guide to ensure service stability. For more information, see Components.

  • In the ACK console, set the email address and mobile phone number for your contacts and specify how to receive alert notifications. Alibaba Cloud pushes alert notifications, service notices, and other information for ACK through these channels. For more information, see Manage alerts for ACK and How do I configure message receiving?.

  • If you receive a component stability risk report, follow the relevant instructions to handle the issue and eliminate security risks promptly.

  • When you use data plane components, configure custom parameters for the components in the ACK console on the Operations > Add-ons page or using the OpenAPI. Modifying component configurations through other channels may cause component features to become abnormal. For more information, see Manage components.

  • Do not use the OpenAPI of Infrastructure as a Service (IaaS) products to change the runtime environment of components. Examples include using the ECS OpenAPI to change the running status of an ECS instance, modifying the security group configuration of a worker node, changing the network configuration of a worker node, or using the Server Load Balancer (SLB) OpenAPI to modify SLB configurations. Unauthorized changes to IaaS layer resources can cause data plane components to become abnormal.

  • Some data plane components are affected by upstream open source components and may have bugs or vulnerabilities. Upgrade the components promptly to prevent your services from being affected.

Cluster upgrades

Always use the ACK cluster upgrade feature to upgrade the Kubernetes version of your cluster. Upgrading the Kubernetes version on your own can cause stability and compatibility issues. For more information, see Upgrade a cluster and independently upgrade the control plane and node pools of a cluster.

ACK provides the following support for cluster upgrades:

  • Provides a feature to upgrade the cluster to a new Kubernetes version.

  • Provides a pre-check feature for Kubernetes version upgrades to ensure that the current state of the cluster supports the upgrade.

  • Provides version guides for new Kubernetes versions that describe the changes from the previous version.

  • Notifies you about potential risks that may occur due to resource changes when you upgrade to a new Kubernetes version.

Follow these recommendations when you use the cluster upgrade feature:

  • Run a pre-check before the cluster upgrade and fix any blocking issues based on the pre-check results.

  • Carefully read the version guide for the new Kubernetes version. Confirm the status of the cluster and your services based on the upgrade risks identified by ACK, and assess the upgrade risks on your own. For more information, see [Discontinued] Overview of Kubernetes version releases.

  • The cluster upgrade feature does not support rollbacks. Create a thorough upgrade plan and perform backups before you proceed.

  • Upgrade the Kubernetes version of your cluster promptly within the support lifecycle of the current version, according to the ACK version support policy. For more information, see Version guide.

Native Kubernetes configurations

  • Do not modify key Kubernetes configurations without authorization. Examples include the paths, links, and content of the following directories:

    • /var/lib/kubelet

    • /var/lib/docker

    • /etc/kubernetes

    • /etc/kubeadm

    • /var/lib/containerd

  • Do not use annotations that are reserved for Kubernetes clusters in YAML templates. Doing so can cause resources to become unavailable, requests to fail, or other abnormal behavior. Annotations with the kubernetes.io/ and k8s.io/ prefixes are reserved for core components. The following is an example of a violation: pv.kubernetes.io/bind-completed: "yes".
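The reserved-annotation rule above can be sketched as a small lint check. This is a minimal illustration, not an ACK tool: the function name reserved_annotation_keys is hypothetical, and treating subdomains such as pv.kubernetes.io as reserved is an assumption drawn from the violation example above.

```python
RESERVED_DOMAINS = ("kubernetes.io", "k8s.io")


def reserved_annotation_keys(annotations):
    """Return the annotation keys whose prefix falls under a reserved domain."""
    flagged = []
    for key in annotations:
        prefix, sep, _name = key.partition("/")
        if not sep:
            continue  # a key without a prefix is not in a reserved namespace
        for domain in RESERVED_DOMAINS:
            # Assumption: subdomains (for example pv.kubernetes.io) are reserved too.
            if prefix == domain or prefix.endswith("." + domain):
                flagged.append(key)
                break
    return flagged


# Hypothetical annotations taken from a YAML template's metadata.annotations.
annotations = {
    "pv.kubernetes.io/bind-completed": "yes",
    "example.com/owner": "team-a",
}
print(reserved_annotation_keys(annotations))
```

Annotations that ACK itself documents, such as service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id, also match this pattern, so treat a hit as something to review rather than as an automatic violation.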

ACK serverless clusters

In the following scenarios, compensation is not provided for ACK serverless clusters:

  • To simplify cluster operations and maintenance (O&M), ACK serverless clusters provide managed capabilities for some system components. After you enable the managed feature for components in a cluster, ACK is responsible for their deployment and maintenance. If your services are affected because you mistakenly delete Kubernetes objects that managed components depend on, or due to other similar situations, ACK Serverless does not provide compensation.

  • ACK Serverless does not provide compensation for the situations and reasons listed under "Exclusions" in the Service-Level Agreement (SLA) for Container Service for Kubernetes.

Registered clusters

  • When you connect an external Kubernetes cluster using the registered cluster feature in the ACK console, ensure network stability between the external cluster and Alibaba Cloud.

  • ACK lets you register and connect to external Kubernetes clusters, but it cannot control their stability or prevent improper operations on them. Be cautious when you use the registered cluster feature to configure labels, annotations, and tags for nodes in an external cluster. Improper configurations may cause applications to run abnormally.

App Catalog

To enrich Kubernetes applications, the App Catalog in the ACK Marketplace provides applications that are adapted and customized based on open source software. ACK cannot control the bugs inherent in the open source software. Be aware of this risk. For more information, see App Catalog.

Risky operations

Certain operations in ACK are considered high-risk and may significantly affect service stability. Before you perform these operations, make sure that you understand the operations and their impacts.

Cluster-related risky operations

Category

Risky operation

Impact

Recovery solution

API Server

Reusing the CLB instance of the API server for other purposes, such as a Service of type LoadBalancer.

The cluster becomes unavailable and service traffic is affected.

Revert to the original configuration or contact after-sales support.

Modifying the listener, vServer group, access control list (ACL), or other configurations that control SLB forwarding, or modifying the tag configuration of the SLB instance used by the API Server.

The cluster becomes abnormal.

Revert to the original configuration.

Deleting the CLB instance that the API server uses.

Operations on the cluster fail.

This operation cannot be reversed. You must re-create the cluster. For more information, see Create an ACK managed cluster.

Worker node

Modifying the security group of a node in the cluster.

The node may become unavailable.

Add the node back to the security group that was automatically created for the cluster. For more information, see Associate a security group with an instance (primary ENI).

Letting the node expire or destroying the node.

The node becomes unavailable.

This operation cannot be reversed.

Reinstalling the operating system.

Components on the node are deleted.

Remove the node from the cluster and then add it back. For more information, see Remove a node and Add existing nodes to a cluster.

Upgrading node component versions on your own.

The node may become unusable.

Roll back to the original version.

Changing the node IP address.

The node becomes unavailable.

Revert the IP address to the original one.

Modifying parameters of core components, such as kubelet, docker, and containerd, on your own.

The node may become unavailable.

Configure the parameters as recommended in the official documentation.

Modifying the operating system configuration.

The node may become unavailable.

Try to revert the configuration item. If that does not work, delete the node and purchase a new one.

Modifying the node time.

Components on the node may work abnormally.

Revert the node time.

Adding computing power to the cluster in a way that is not supported by ACK.

ACK provides methods such as the console, OpenAPI, and command-line interface (CLI) to add computing power to a cluster. For more information, see Add existing nodes to a cluster. If you add nodes to the cluster using other methods, ACK cannot identify the source of these nodes and cannot provide product capabilities such as node lifecycle management, automated O&M, and technical support. For a detailed risk description, see Why does the console show that the source of the node pool to which a node belongs is "Other Nodes"?.

We recommend that you manage computing power using node pools. If you want to continue using the current method, ensure the compatibility of the nodes with various cluster components, such as Kubernetes, network, storage, and security components.

Master node (ACK dedicated cluster)

Modifying the security group of a node in the cluster.

The master node may become unavailable.

Add the node back to the security group that was automatically created for the cluster. For more information, see Associate a security group with an instance (primary ENI).

Letting the node expire or destroying the node.

The master node becomes unavailable.

This operation cannot be reversed.

Reinstalling the operating system.

Components on the master node are deleted.

This operation cannot be reversed.

Upgrading the Master or etcd component versions on your own.

The cluster may become unusable.

Roll back to the original version.

Deleting or formatting data in core directories such as /etc/kubernetes on the node.

The master node becomes unavailable.

This operation cannot be reversed.

Changing the node IP address.

The master node becomes unavailable.

Revert the IP address to the original one.

Modifying parameters of core components, such as etcd, kube-apiserver, and docker, on your own.

The master node may become unavailable.

Configure the parameters as recommended in the official documentation.

Replacing the Master or etcd certificate on your own.

The cluster may become unusable.

This operation cannot be reversed.

Adding or removing master nodes on your own.

The cluster may become unusable.

This operation cannot be reversed.

Modifying the node time.

Components on the node may work abnormally.

Revert the node time.

Other

Changing permissions or performing modifications through RAM.

Some cluster resources, such as SLB instances, may fail to be created.

Revert to the original permissions.

Note

This applies only to clusters that run a Kubernetes version earlier than 1.26.

Modifying or deleting pre-configured PodSecurityPolicy-related resources in the cluster. This includes the PodSecurityPolicy resource named ack.privileged, and ClusterRole, ClusterRoleBinding, Role, and RoleBinding resources with names that start with ack:podsecuritypolicy:.

Core cluster components may become abnormal. You may not be able to create or update pod resources in the cluster.

Recover the related resources. For more information, see Configure or recover the default pod security policies of ACK.
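The naming rule for the pre-configured PodSecurityPolicy resources in the last row above can be sketched as a guard to run before you modify or delete a resource. This is only an illustration: the function name is hypothetical, and the resource name ack:podsecuritypolicy:example in the usage lines is made up to show the prefix match.

```python
PROTECTED_PSP_NAME = "ack.privileged"
PROTECTED_RBAC_PREFIX = "ack:podsecuritypolicy:"
RBAC_KINDS = {"ClusterRole", "ClusterRoleBinding", "Role", "RoleBinding"}


def is_preconfigured_psp_resource(kind, name):
    """Return True if the resource matches the pre-configured PSP naming rule."""
    if kind == "PodSecurityPolicy":
        return name == PROTECTED_PSP_NAME
    if kind in RBAC_KINDS:
        return name.startswith(PROTECTED_RBAC_PREFIX)
    return False


# Hypothetical resources that a cleanup script is about to delete.
candidates = [
    ("PodSecurityPolicy", "ack.privileged"),
    ("ClusterRole", "ack:podsecuritypolicy:example"),
    ("ClusterRole", "my-app-reader"),
]
for kind, name in candidates:
    if is_preconfigured_psp_resource(kind, name):
        print(f"skip {kind}/{name}: pre-configured by ACK")
```

A check like this is relevant only to clusters that run a Kubernetes version earlier than 1.26, as stated in the note above.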

Node pool-related risky operations

Risky operation

Impact

Recovery solution

Deleting a scaling group.

The node pool becomes abnormal.

This operation cannot be reversed. You can only re-create the node pool. For more information, see Create a node pool.

Removing a node using kubectl.

The displayed number of nodes in the node pool does not match the actual number.

Remove the specified node from the ACK console or using the node pool-related API (see Remove a node), or scale in by modifying the expected number of nodes in the node pool (see Create and manage node pools).

Directly releasing an ECS instance.

The node pool page in the console may display incorrect information. If an expected number of nodes is configured for the node pool, the node pool automatically scales out based on its configuration to restore that number.

This operation cannot be reversed. The correct procedure is to scale in by modifying the expected number of nodes in the node pool from the ACK console or using the node pool-related API (see Create and manage node pools) or to remove a specified node (see Remove a node).

Manually scaling out or scaling in a node pool with auto scaling enabled.

The auto scaling component automatically adjusts the number of nodes based on the scaling policy, so the resulting number of nodes may not match your manual change.

This operation cannot be reversed. Auto-scaling node pools do not require manual intervention.

Modifying the maximum or minimum number of instances in an ESS scaling group.

Scaling may become abnormal.

  • For a node pool without auto scaling enabled, change the maximum and minimum number of instances in the ESS scaling group to the default values of 2000 and 0.

  • For a node pool with auto scaling enabled, change the maximum and minimum number of instances in the ESS scaling group to be consistent with the maximum and minimum number of nodes in the node pool.

Not backing up data before adding an existing node.

Any data on the instance will be lost.

This operation cannot be reversed.

  • Before you manually add an existing node, you must back up all data that you want to keep.

  • When a node is added automatically, the system disk is replaced. You need to back up useful data that is stored on the system disk in advance.

Saving important data on the node's system disk.

The self-healing operation of a node pool may repair a node by resetting its configuration. This can lead to data loss on the system disk.

This operation cannot be reversed. The correct practice is to store important data on an extra data disk or on a cloud disk, NAS, or OSS.
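The recovery rule for ESS scaling group limits in the table above can be sketched as a small drift check. The function names and arguments are hypothetical. The values 0 and 2000 are the defaults stated above, and the current limits are assumed to have been read from the ESS console or API beforehand.

```python
DEFAULT_MIN, DEFAULT_MAX = 0, 2000


def expected_ess_bounds(auto_scaling_enabled, pool_min=None, pool_max=None):
    """Return the (min, max) instance counts that the scaling group should have."""
    if not auto_scaling_enabled:
        # Without auto scaling, the limits should be the defaults.
        return DEFAULT_MIN, DEFAULT_MAX
    if pool_min is None or pool_max is None:
        raise ValueError("node pool min and max are required when auto scaling is enabled")
    # With auto scaling, the limits should match the node pool configuration.
    return pool_min, pool_max


def ess_bounds_drifted(current_min, current_max, auto_scaling_enabled,
                       pool_min=None, pool_max=None):
    """Return True if the current scaling group limits differ from the expected ones."""
    expected = expected_ess_bounds(auto_scaling_enabled, pool_min, pool_max)
    return (current_min, current_max) != expected


# Hypothetical values: a node pool without auto scaling whose limits were edited.
print(ess_bounds_drifted(1, 50, auto_scaling_enabled=False))
```

If the check reports drift, restore the limits as described in the recovery solution instead of editing the scaling group further.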

Virtual node-related risky operations

Risky operation

Impact

Recovery solution

Uninstalling the virtual node component.

The serverless pod management feature becomes abnormal: created ECI pods and ACS pods cannot be deleted normally, and new ECI pods and ACS pods cannot be created normally.

Re-install the virtual node component.

Network and Server Load Balancer-related risky operations

Risky operation

Impact

Recovery solution

Setting the kernel parameter net.ipv4.ip_forward=0.

Network connectivity fails.

Modify the kernel parameter to net.ipv4.ip_forward=1.

Setting either of the following kernel parameters to 1 or 2:

  • net.ipv4.conf.all.rp_filter

  • net.ipv4.conf.[ethX].rp_filter

    Note

    ethX represents all network interface controllers (NICs) that start with eth.

Network connectivity fails.

Modify the kernel parameters to:

  • net.ipv4.conf.all.rp_filter = 0

  • net.ipv4.conf.[ethX].rp_filter = 0

Setting the kernel parameter net.ipv4.tcp_tw_reuse = 1.

Pod health checks become abnormal.

Modify the kernel parameter to net.ipv4.tcp_tw_reuse = 0.

Setting the kernel parameter net.ipv4.tcp_tw_recycle = 1.

NAT becomes abnormal.

Modify the kernel parameter to net.ipv4.tcp_tw_recycle = 0.

Modifying the kernel parameter net.ipv4.ip_local_port_range.

Intermittent network connectivity failures occur.

Modify the kernel parameter to the default value net.ipv4.ip_local_port_range="32768 60999".

Installing firewall software, such as Firewalld or ufw.

Container network connectivity fails.

Uninstall the firewall software and restart the node.

Configuring the node security group so that it does not allow UDP traffic on port 53 for the container CIDR block.

DNS in the cluster does not work correctly.

Configure the security group to allow traffic as recommended in the official documentation.

Modifying or deleting the tags of an SLB instance added by ACK.

The SLB instance becomes abnormal.

Revert the tags of the SLB instance.

Modifying the configuration of an SLB instance managed by ACK from the SLB console, including the SLB instance, listener, and vServer group.

The SLB instance becomes abnormal.

Revert the configuration of the SLB instance.

Removing the annotation for reusing an existing SLB instance from a Service, which is service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: ${YOUR_LB_ID}.

The SLB instance becomes abnormal.

Add the annotation for reusing an existing SLB instance to the Service.

Note

You cannot directly change a Service that reuses an existing SLB instance to a Service that uses an automatically created SLB instance. You must re-create the Service.

Deleting an SLB instance created by ACK from the SLB console.

The cluster network may become abnormal.

Delete the SLB instance by deleting the Service. For more information, see Delete a Service.

Manually deleting the nginx-ingress-lb Service in the kube-system namespace when the Nginx Ingress Controller component is installed.

The Ingress Controller does not work correctly and may crash in severe cases.

Create a new Service with the same name using the following YAML.

apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    app: nginx-ingress-lb
  name: nginx-ingress-lb
  namespace: kube-system
spec:
  externalTrafficPolicy: Local
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  - name: https
    port: 443
    protocol: TCP
    targetPort: 443
  selector:
    app: ingress-nginx
  type: LoadBalancer

Adding or modifying the nameserver option in the DNS configuration file /etc/resolv.conf on an ECS node.

If the DNS server that you specify is misconfigured, DNS resolution may fail, which affects the normal operation of the cluster.

If you want to use a self-managed DNS server as an upstream server, we recommend that you configure it on the CoreDNS side. For more information, see Configure an unmanaged CoreDNS component.

Modifying or deleting elastic network interfaces (ENIs) or Lingjun ENIs created by ACK.

The pod network is interrupted.

This operation cannot be reversed.

Modifying or deleting network-related CRDs.

podnetworkings.network.alibabacloud.com
podenis.network.alibabacloud.com
networkinterfaces.network.alibabacloud.com
nodes.network.alibabacloud.com
noderuntimes.network.alibabacloud.com
*.cilium.io
*.crd.projectcalico.org

The Terway component will not work. This may lead to network interruptions and pod abnormalities in severe cases.

This operation cannot be reversed.

Creating, modifying, or deleting network-related system CRs.

podenis.network.alibabacloud.com
networkinterfaces.network.alibabacloud.com
nodes.network.alibabacloud.com
noderuntimes.network.alibabacloud.com
*.cilium.io
*.crd.projectcalico.org

The Terway component will not work. This may lead to network interruptions and pod abnormalities in severe cases.

Delete the custom CR and re-create the associated pod.

Modifying or deleting fields that are not allowed to be modified in the Terway network configuration. For parameter declarations, see Customize Terway configurations.

The Terway component will not work. This may lead to network interruptions and pod abnormalities in severe cases.

Revert to the original configuration and restart the node.
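The kernel parameter rows in the table above can be turned into a simple node check. The following is a minimal sketch, not an ACK tool: the function names are hypothetical, the expected values are the recovery values listed above, and reading from /proc/sys assumes a Linux node. Parameters that do not exist on the node are skipped (for example, net.ipv4.tcp_tw_recycle was removed in newer Linux kernels).

```python
from fnmatch import fnmatch

# Recovery values taken from the table above.
RECOMMENDED = {
    "net.ipv4.ip_forward": "1",
    "net.ipv4.conf.all.rp_filter": "0",
    "net.ipv4.tcp_tw_reuse": "0",
    "net.ipv4.tcp_tw_recycle": "0",
    "net.ipv4.ip_local_port_range": "32768 60999",
}


def normalize(value):
    """Collapse tabs, newlines, and repeated spaces so values compare cleanly."""
    return " ".join(str(value).split())


def expected_value(key):
    """Return the expected value for a key, or None if the table does not cover it."""
    if key in RECOMMENDED:
        return RECOMMENDED[key]
    if fnmatch(key, "net.ipv4.conf.eth*.rp_filter"):
        return "0"  # applies to every NIC whose name starts with eth
    return None


def find_deviations(current):
    """Map each deviating key to a (current, expected) pair."""
    deviations = {}
    for key, value in current.items():
        expected = expected_value(key)
        if expected is not None and normalize(value) != expected:
            deviations[key] = (normalize(value), expected)
    return deviations


def read_sysctl(key):
    """Read one parameter from /proc/sys; return None if it is missing."""
    try:
        with open("/proc/sys/" + key.replace(".", "/")) as handle:
            return handle.read()
    except OSError:
        return None


# Hypothetical values as they might be collected on a node.
sample = {
    "net.ipv4.ip_forward": "0\n",
    "net.ipv4.conf.eth0.rp_filter": "1\n",
    "net.ipv4.ip_local_port_range": "32768\t60999\n",
}
for key, (got, want) in find_deviations(sample).items():
    print(f"{key}: current={got} expected={want}")
```

On a real node, you could build the input dictionary by calling read_sysctl for each key in RECOMMENDED and dropping the keys that return None.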

Storage-related risky operations

Risky operation

Impact

Recovery solution

Manually detaching a cloud disk from the console.

Pod write operations report an I/O error.

Restart the pod and manually clean up mount point residuals on the node.

Running umount on the disk mount path on the node.

The pod writes to the local disk.

Restart the pod.

Directly operating on a cloud disk on the node.

The pod writes to the local disk.

This operation cannot be reversed.

Mounting the same cloud disk to multiple pods.

The pod writes to the local disk or reports an I/O error.

Ensure that one cloud disk is used by only one pod.

Important

Cloud disks are non-shared storage provided by Alibaba Cloud and can be mounted to only one pod at a time.

Manually deleting a NAS mount directory.

Pod write operations report an I/O error.

Restart the pod.

Deleting a NAS disk or mount target that is in use.

The pod experiences an I/O hang.

Restart the ECS node. For more information, see Restart an ECS instance.
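The rule above that one cloud disk can be used by only one pod can be checked against pod manifests before deployment. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the pod dictionaries follow the standard Pod manifest layout (spec.volumes[].persistentVolumeClaim.claimName), and disk_claims is a set of (namespace, claim name) pairs that you already know are backed by cloud disks.

```python
from collections import defaultdict


def claims_shared_by_pods(pods, disk_claims):
    """Return the disk-backed claims that are referenced by more than one pod."""
    users = defaultdict(set)
    for pod in pods:
        meta = pod.get("metadata", {})
        namespace = meta.get("namespace", "default")
        for volume in pod.get("spec", {}).get("volumes", []):
            claim = (volume.get("persistentVolumeClaim") or {}).get("claimName")
            if claim and (namespace, claim) in disk_claims:
                users[(namespace, claim)].add(meta.get("name"))
    return {key: sorted(names) for key, names in users.items() if len(names) > 1}


def make_pod(name, claim):
    """Build a minimal hypothetical pod manifest that mounts one claim."""
    return {
        "metadata": {"name": name, "namespace": "default"},
        "spec": {"volumes": [{"name": "data",
                              "persistentVolumeClaim": {"claimName": claim}}]},
    }


pods = [make_pod("web-0", "disk-a"), make_pod("web-1", "disk-a"),
        make_pod("job-0", "disk-b")]
print(claims_shared_by_pods(pods, {("default", "disk-a"), ("default", "disk-b")}))
```

Claims that are not listed in disk_claims, such as claims backed by NAS, are ignored by this check.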

Log-related risky operations

Risky operation

Impact

Recovery solution

Deleting the /tmp/ccs-log-collector/pos directory on the host.

Duplicate log collection.

This operation cannot be reversed. The files in this directory record the log collection positions.

Deleting the /tmp/ccs-log-collector/buffer directory on the host.

Log data loss.

This operation cannot be reversed. This directory caches logs that are waiting to be consumed.

Deleting the aliyunlogconfig CRD resource.

Log collection fails.

Re-create the deleted CRD and its corresponding resources. However, logs from the failure period cannot be recovered.

Deleting a CRD also deletes all its corresponding instances. Even if the CRD is recovered, you must manually create the deleted instances.

Deleting the log component.

Log collection fails.

Reinstall the log component and manually recover the aliyunlogconfig CRD instances. Logs from the deletion period cannot be recovered.

Deleting the log component is equivalent to deleting the aliyunlogconfig CRD and the Logtail collector. All log collection capabilities are lost during this period.