Resolve disk volume creation, mounting, expansion, and unmounting issues in ACK clusters.
Creation
Dynamic PV creation failure: InvalidDataDiskCatagory.NotSupported
Symptom
PV creation fails, and a PVC event displays the error message InvalidDataDiskCategory.NotSupported.
Cause
The StorageClass specifies a cloud disk type unavailable in the current zone, or inventory is insufficient.
Solution
- Upgrade the CSI plug-in and use the alicloud-disk-topology-alltype StorageClass, or create a custom StorageClass with multiple cloud disk types. See Use a dynamically provisioned disk volume.
- Add multiple zones to the cluster. See High-availability configuration recommendations for disk volumes.
Dynamic PV creation failure: Insufficient AZone inventory
Symptom
PV creation fails, and a PVC event displays the error message The specified AZone inventory is insufficient.
Cause
Cloud disk creation fails due to insufficient inventory in the specified zone.
Solution
- Upgrade the CSI plug-in and use the alicloud-disk-topology-alltype StorageClass, or create a custom StorageClass with multiple cloud disk types. See Use a dynamically provisioned disk volume.
- Add multiple zones to the cluster. See High-availability configuration recommendations for disk volumes.
Dynamic PV creation failure: Disk size not supported
Symptom
Dynamic PV creation fails, and a PVC event displays the error message disk size is not supported.
Cause
The PVC specifies an invalid capacity. Minimum limits vary by cloud disk type.
Solution
Adjust the capacity declared in the PVC to a supported value.
Dynamic PV creation failure: Waiting for first consumer
Symptom
PV creation fails when using a StorageClass with the WaitForFirstConsumer binding mode, and a PVC event displays the error message persistentvolume-controller waiting for first consumer to be created before binding.
Cause
The PVC has not detected the node to which the pod is scheduled.
- If you explicitly set nodeName in the application's YAML, the pod bypasses the scheduler. This prevents the PVC from discovering the node, making it incompatible with the WaitForFirstConsumer binding mode.
- No pod is referencing the current PVC.
Solution
- Remove the nodeName field from the application's YAML file and use a scheduling method that goes through the scheduler, such as nodeSelector or node affinity.
- Create a pod that uses the current PVC.
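To check whether a workload hard-codes nodeName, read the field from the pod template rather than from a running pod, because the scheduler also fills spec.nodeName on every pod it places. A minimal sketch, assuming the workload is a Deployment or StatefulSet; the function name template_node_name is hypothetical:

```shell
# Print the nodeName hard-coded in a workload's pod template.
# Empty output means the template does not bypass the scheduler.
# Usage: template_node_name <kind> <name> [namespace]
template_node_name() {
  kubectl get "$1" "$2" -n "${3:-default}" \
    -o jsonpath='{.spec.template.spec.nodeName}'
}
```

For example, template_node_name statefulset my-app prints nothing when the StatefulSet (a placeholder name) leaves node selection to the scheduler.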
Dynamic PV creation failure: No topology key on CSINode
Symptom
PV creation fails, and a PVC event displays the error message no topology key found on CSINode node-XXXX.
Cause
- Cause 1: The CSI plug-in on the corresponding node node-XXXX fails to start.
- Cause 2: A driver that is not supported by the system is used for mounting. By default, the system supports Disk, NAS, and OSS drivers.
Solution
- Check the pod status.
  kubectl get pods -n kube-system -o wide | grep node-XXXX
  - Abnormal status: Check the error logs with kubectl logs csi-plugin-xxxx -n kube-system -c csi-plugin. This is typically caused by a node port conflict. Resolve as follows:
    - Close the process that is occupying the port.
    - Add the SERVICE_PORT environment variable to the CSI plug-in to specify a new port.
      kubectl set env -n kube-system daemonset/csi-plugin --containers="csi-plugin" SERVICE_PORT="XXX"
  - If the status is normal, go to the next step.
- Use a default system driver, such as Disk, NAS, or OSS. See the Storage documentation.
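The CSINode object of a node records which CSI drivers have registered and which topology keys they reported, so it can help confirm what the error message refers to. A minimal sketch; the function name csinode_drivers is hypothetical:

```shell
# List the CSI drivers registered on a node together with their topology keys.
# Usage: csinode_drivers <node-name>
csinode_drivers() {
  kubectl get csinode "$1" \
    -o jsonpath='{range .spec.drivers[*]}{.name}{"\t"}{.topologyKeys}{"\n"}{end}'
}
```

If diskplugin.csi.alibabacloud.com is missing from the output, or is listed without topology keys, the plug-in on that node has probably not registered correctly.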
Dynamic PV creation failure: selfLink was empty
Symptom
PV creation fails, and a PVC event displays the error message selfLink was empty, can't make reference.
Cause
- The cluster version and the CSI plug-in version do not match.
- The cluster uses the FlexVolume storage plug-in.
Solution
- Upgrade the CSI plug-in version. The plug-in version must be compatible with the cluster version. For example, a Kubernetes 1.20 cluster requires CSI version 1.20 or later.
- If your cluster uses the FlexVolume storage plug-in, see Migrate from FlexVolume to CSI.
Dynamic PV creation failure: PVC capacity < 20 GiB
Default ACK StorageClasses (e.g., alicloud-disk-topology-alltype, alicloud-disk-essd) require a minimum capacity of 20 GiB. For smaller volumes, create a custom StorageClass with a disk type supporting lower capacity, such as ESSD AutoPL or ESSD PL0.
For capacity ranges by disk type, see Block storage performance.
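A custom StorageClass for small volumes might look like the following sketch. The StorageClass name essd-pl0-small and the file name are hypothetical, and the type and performanceLevel parameter names are assumptions to verify against the CSI plug-in documentation for your version. The block only writes the manifest to a file so that it can be reviewed before it is applied.

```shell
# Write a StorageClass manifest for small ESSD PL0 volumes.
# ASSUMPTIONS: the name "essd-pl0-small" is a placeholder; check the "type"
# and "performanceLevel" parameter names against your CSI plug-in version.
cat > essd-pl0-storageclass.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: essd-pl0-small
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd
  performanceLevel: PL0
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
# After reviewing the file: kubectl apply -f essd-pl0-storageclass.yaml
```

A PVC that references this StorageClass can then request a capacity below 20 GiB, within the range the disk type supports.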
Mounting
Pod startup failure: Volume node affinity conflict
Symptom
A pod with a mounted cloud disk fails to start, and a pod event displays the error message had volume node affinity conflict.
Cause
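Every PV has a nodeAffinity property, which for a disk volume typically pins the volume to the zone of the cloud disk. This error occurs when the PV's nodeAffinity conflicts with the pod's node affinity or node selector, so no node satisfies both and the pod cannot be scheduled.
Solution
Ensure the PV's nodeAffinity and the pod's node affinity are consistent, so that at least one node matches both.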
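To compare the two sides, it can help to print the node affinity terms of the PV that is bound to the pod's PVC. A minimal sketch; the function name pv_node_affinity is hypothetical:

```shell
# Print the node affinity terms of the PV bound to a PVC.
# Usage: pv_node_affinity <pvc-name> [namespace]
pv_node_affinity() {
  pv=$(kubectl get pvc "$1" -n "${2:-default}" -o jsonpath='{.spec.volumeName}') || return 1
  kubectl get pv "$pv" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'
}
```

Compare the printed keys and values with the node labels shown by kubectl get nodes --show-labels and with the nodeSelector or affinity rules in the pod spec.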
Pod startup failure: Can't find disk
Symptom
A pod with a mounted cloud disk fails to start, and a pod event displays the error message can't find disk.
Cause
- You configured the PV with an incorrect cloud disk ID or a cloud disk ID from another region.
- Your account lacks permission to access the cloud disk, possibly because it belongs to another account.
Solution
- If you are using a statically provisioned cloud disk, check whether it meets the following requirements:
  - The region of the cloud disk is the same as the region of the cluster.
  - The cloud disk ID is correct.
  - The cloud disk and the cluster belong to the same Alibaba Cloud account.
- If the cloud disk is dynamically provisioned, check the permissions of the CSI plug-in. Start by checking whether an Addon Token exists in the cluster.
  - If an Addon Token exists, check the CSI plug-in version in the cluster, upgrade the plug-in to the latest version, and then try again.
  - If an Addon Token does not exist, the custom AccessKey of the node's worker RAM role is used by default. Check the permissions of the corresponding RAM policy.
Pod startup failure: Attach action in process
Symptom
A pod with a mounted cloud disk shows Previous attach action is still in process but starts successfully after a few seconds.
Cause
ECS instances attach one cloud disk at a time. When multiple disk-backed pods land on the same host, disks attach sequentially. This message means another attach is in progress.
Solution
No action required. The system retries automatically.
Pod startup failure: InvalidInstanceType.NotSupportDiskCategory
Symptom
When you start a pod with a cloud disk mounted, the error message InvalidInstanceType.NotSupportDiskCategory is displayed.
Cause
The pod is scheduled to an ECS node whose instance type does not support this cloud disk type.
Solution
Try one of the following:
- Verify the cluster has an ECS node whose instance type supports this cloud disk type, and that the pod can be scheduled to it.
- If no node instance type supports this cloud disk type, switch to a compatible disk type.
For cloud disk and instance type compatibility, see Instance families.
Pod startup failure: CSI driver not registered
Symptom
When you start a pod, the following warning is displayed.
Warning FailedMount 98s (x9 over 3m45s) kubelet, cn-zhangjiakou.172.20.XX.XX MountVolume.MountDevice failed for volume "d-xxxxxxx" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name diskplugin.csi.alibabacloud.com not found in the list of registered CSI drivers
Cause
- This typically occurs on new nodes where the application pod tries to mount a volume before the concurrently starting CSI plug-in finishes registration.
- The CSI plug-in on the node failed to register, possibly due to a startup failure.
Solution
- On a newly added node, no action is needed. The system retries automatically.
- If the CSI plug-in fails to register, check its status and logs. If the plug-in is normal, join the DingTalk user group (ID: 35532895) for assistance.
Pod startup failure: Multi-Attach error
Symptom
A pod mounting a cloud disk shows warning failedAttachVolume xxx xxx Multi-Attach error for volume "xxx" in events. Running kubectl describe pvc <pvc-name> reveals multiple pods referencing the same PVC.
Cause
- Cause 1: Without multi-attach enabled, a cloud disk can be mounted to only one pod.
- Cause 2: The pod using the PVC was deleted, but the cloud disk was not properly unmounted.
  In the ECS console, find the node the PVC's cloud disk is attached to. Check the CSI plug-in pod logs on that node. If the logs contain Path is mounted, no remove: /var/lib/kubelet/plugins/kubernetes.io/csi/diskplugin.csi.alibabacloud.com/xxx/globalmount, check whether the CSI plug-in directly mounts the /var/run HostPath:
  kubectl get ds -n kube-system csi-plugin -ojsonpath='{.spec.template.spec.volumes[?(@.hostPath.path=="/var/run/")]}'
  Non-empty output confirms the issue.
Solution
- Solution for Cause 1:
  Ensure only one pod references the PVC.
- Solution for Cause 2:
  Manually patch the CSI plug-in YAML file:
  kubectl patch -n kube-system daemonset csi-plugin -p '
  spec:
    template:
      spec:
        containers:
          - name: csi-plugin
            volumeMounts:
              - mountPath: /host/var/run/efc
                name: efc-metrics-dir
              - mountPath: /host/var/run/ossfs
                name: ossfs-metrics-dir
              - mountPath: /host/var/run/
                $patch: delete
        volumes:
          - name: ossfs-metrics-dir
            hostPath:
              path: /var/run/ossfs
              type: DirectoryOrCreate
          - name: efc-metrics-dir
            hostPath:
              path: /var/run/efc
              type: DirectoryOrCreate
          - name: fuse-metrics-dir
            $patch: delete'
Pod startup failure: Mount timeout
Symptom
You start a pod that has a volume mounted, and a pod event displays the error message Unable to attach or mount volumes: unmounted volumes=[xxx], unattached volumes=[xxx]: timed out waiting for the condition.
Cause
The kubelet reports this event. It periodically checks whether the volumes used by the pods on its node are Ready. If a volume is not Ready, the kubelet reports the preceding error message.
The event only indicates that the mount was incomplete at check time. Possible causes:
- Cause 1: A mount error occurred, but the detailed event has been overwritten. Only the kubelet error event remains.
- Cause 2: The kubelet times out when it retrieves the ConfigMap or the ServiceAccount default token. This is a node network issue. Try again on a different node.
- Cause 3: When securityContext.fsGroup is set, the kubelet recursively changes ownership of all files during mount. This is slow for volumes with many files.
- Cause 4: For statically provisioned volumes, the driver field of the PV is incorrect (for example, it contains a typo). An incorrect driver name prevents the volume from becoming Ready.
Solution
- Solution for Cause 1: Delete the pod so that it restarts, then find and diagnose the actual error event.
- Solution for Cause 2: Reschedule the pod to another node. See Schedule an application to a specified node.
- Solution for Cause 3: For clusters of version 1.20 or later, set fsGroupChangePolicy to OnRootMismatch. The kubelet then changes ownership recursively only when the permissions of the volume's root directory do not match, which in practice is the first mount, so mount time returns to normal for subsequent pod restarts. For more information about the fsGroupChangePolicy parameter, see Configure a Security Context for a Pod or Container. If this is not sufficient, use an initContainer to adjust permissions.
- Solution for Cause 4: Check and specify the correct driver name. Examples:
  - diskplugin.csi.alibabacloud.com
  - nasplugin.csi.alibabacloud.com
  - ossplugin.csi.alibabacloud.com
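As an illustration of the fsGroupChangePolicy setting, the following sketch writes an example pod manifest to a file. The pod name, image, group ID 1000, mount path, and PVC name disk-pvc are all placeholders and not values from this document.

```shell
# Write an example pod manifest that sets fsGroupChangePolicy.
# ASSUMPTIONS: pod name, image, group ID 1000, and the PVC name "disk-pvc"
# are placeholders to replace with your own values.
cat > fsgroup-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo
spec:
  securityContext:
    fsGroup: 1000
    # Skip the recursive ownership change when the volume root already matches.
    fsGroupChangePolicy: OnRootMismatch
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: disk-pvc
EOF
```

The policy sits in the pod-level securityContext next to fsGroup, so in a Deployment or StatefulSet it goes under spec.template.spec.securityContext.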
Pod startup failure: Invalid device format for NVMe
Symptom
A pod with a mounted cloud disk fails to start, and a pod event displays the error message validate error Device /dev/nvme1n1 has error format more than one digit locations.
Cause
The node is an NVMe-based instance type (such as g7se, r7se, c7se, or any 8th-generation ECS), but the cluster and CSI plug-in versions are too old to support it.
Solution
Ensure your ACK cluster is version 1.20+ and upgrade the CSI plug-in to v1.22.9-30eb0ee5-aliyun or later. See Manage components.
The FlexVolume plug-in is not supported. Join the DingTalk user group (ID: 35532895) for assistance with migrating from the FlexVolume plug-in to the CSI plug-in.
Pod startup failure: ECS task conflict
Symptom
A pod with a mounted cloud disk fails to start, and a pod event displays the error message ecs task is conflicted.
Cause
Some ECS tasks must be performed in sequence. When multiple requests are sent to ECS at the same time, an ECS task conflict error occurs.
Solution
Wait for the CSI plug-in to retry automatically. Once other ECS tasks complete, the mount succeeds.
Pod startup failure: Corrupted file system
Symptom
A pod with a mounted cloud disk fails to start, and a pod event displays the following error message.
wrong fs type, bad option, bad superblock on /dev/xxxxx missing codepage or helper program, or other error
Cause
The cloud disk cannot be mounted because its file system is corrupted.
Solution
This is usually caused by an improper unmount. To resolve:
- Verify the application meets these requirements:
  - No more than one pod is mounting the same cloud disk.
  - No data is written during the disk detachment process.
- Log on to the pod's host and run fsck -y /dev/xxxxx to repair the file system.
  Replace /dev/xxxxx with the device path from the pod event. The repair modifies file system metadata. If the repair fails, the file system is permanently corrupted.
Pod startup failure: Exceeded max volume count
Symptom
A pod with a mounted cloud disk stays in Pending and cannot be scheduled, even though the ECS instance type allows more disk attachments:
0/1 nodes are available: 1 node(s) exceed max volume count.
Cause
Scheduling is limited by the MAX_VOLUMES_PERNODE environment variable.
Solution
- CSI plug-in v1.26.4-e3de357-aliyun and later supports automatic disk count configuration. Delete the MAX_VOLUMES_PERNODE environment variable from the CSI plug-in DaemonSet in kube-system to enable automatic configuration based on ECS instance type:
  kubectl patch -n kube-system daemonset csi-plugin -p '
  spec:
    template:
      spec:
        containers:
          - name: csi-plugin
            env:
              - name: MAX_VOLUMES_PERNODE
                $patch: delete'
- For CSI plug-in versions before v1.26.4-e3de357-aliyun, manually set this environment variable based on the node with the fewest attachable data disks.
- Automatic configuration runs only at CSI plug-in pod startup. If you manually attach or detach a data disk, recreate the CSI plug-in pod on that node to retrigger it.
- Automatic configuration does not account for statically provisioned disk volumes. If present, the schedulable pod count may be lower than expected.
Pod startup failure: Instance disk limit reached
Symptom
A pod with a mounted cloud disk stays in ContainerCreating. Pod event:
MountVolume.MountDevice failed for volume "d-xxxx" : rpc error: code = Aborted desc = NodeStageVolume: Attach volume: d-xxxx with error: rpc error: code = Internal desc = SDK.ServerError
ErrorCode: InstanceDiskLimitExceeded
Message: The amount of the disk on instance in question reach its limits
Cause
The value of the MAX_VOLUMES_PERNODE environment variable is too high.
Solution
- CSI plug-in v1.26.4-e3de357-aliyun and later supports automatic disk count configuration. Delete the MAX_VOLUMES_PERNODE environment variable from the CSI plug-in DaemonSet in kube-system to enable automatic configuration based on ECS instance type:
  kubectl patch -n kube-system daemonset csi-plugin -p '
  spec:
    template:
      spec:
        containers:
          - name: csi-plugin
            env:
              - name: MAX_VOLUMES_PERNODE
                $patch: delete'
- For CSI plug-in versions before v1.26.4-e3de357-aliyun, manually set this environment variable based on the node with the fewest attachable data disks.
- Automatic configuration runs only at CSI plug-in pod startup. If you manually attach or detach a data disk, recreate the CSI plug-in pod on that node to retrigger it.
- Automatic configuration does not account for statically provisioned disk volumes. If present, the schedulable pod count may be lower than expected.
OperationDenied.HpnZoneMismatch error on Lingjun nodes
The cloud disk lacks the required attachToHpnZone:XX tag for the Lingjun node. This tag can only be added at disk creation time. Create a new cloud disk with the tag.
- Add the following tags during creation (cannot be added later):
  - createdByProduct:eflo
  - attachToHpnZone:XX (Replace XX with the actual HpnZone. You can run a command to query nodes to obtain the HpnZone.)
- Delete the old PV and PVC, then create new ones with the replacement cloud disk. See Use a statically provisioned disk volume or Use a dynamically provisioned disk volume.
Change default StorageClass configuration
The default StorageClass cannot be changed.
After csi-provisioner installation, a default StorageClass (e.g., alicloud-disk-topology-alltype) is created. Do not modify default StorageClasses. Create a new StorageClass with your preferred volume type or reclaim policy instead.
Use a single disk volume for multiple applications
Cloud disks provide non-shared storage. Without multi-attach enabled, a cloud disk can be mounted to only one pod. See Enable NVMe-based multi-attach for cloud disks and configure reservations.
Usage
Application I/O error on disk volume
Symptom
The cloud disk mounts normally and the application starts, but shortly after reports an input/output error.
Cause
The cloud disk used by the application has been detached or deleted.
Solution
Check the status of the cloud disk and take action based on the status.
- Identify the PVC from the pod's volumeMounts definition for the affected mount directory.
- Run kubectl get pvc <pvc-name> to check the PVC status and note the bound PV.
- Get the cloud disk ID from the volumeHandle field in the PV's YAML.
- On the EBS page of the ECS console, check the status of the cloud disk based on the cloud disk ID.
  - If the cloud disk is Available, it has been unmounted. Restart the pod to remount.
    Note: The pod is Running, meaning the disk was mounted and then unmounted, likely because multiple pods reference the same disk. Run kubectl describe pvc <pvc-name> and check Used By to confirm.
  - If the cloud disk is not found, it has been released and is unrecoverable.
    Important: When mounting an ESSD, use automatic snapshots to protect disk volume data. See Data loss due to unexpected cloud disk deletion.
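The PVC-to-disk lookup in the steps above can be scripted. A minimal sketch, assuming the PV was provisioned through CSI so that the disk ID is stored in spec.csi.volumeHandle; the function name disk_id_for_pvc is hypothetical:

```shell
# Print the cloud disk ID (volumeHandle) behind a PVC.
# Usage: disk_id_for_pvc <pvc-name> [namespace]
disk_id_for_pvc() {
  pv=$(kubectl get pvc "$1" -n "${2:-default}" -o jsonpath='{.spec.volumeName}') || return 1
  # An unbound PVC has no volumeName, so there is no disk to look up yet.
  [ -n "$pv" ] || { echo "PVC $1 is not bound to a PV" >&2; return 1; }
  kubectl get pv "$pv" -o jsonpath='{.spec.csi.volumeHandle}'
}
```

The printed ID can then be searched for on the EBS page of the ECS console.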
Set user permissions for mount directory
You cannot set user access permissions for cloud disks. If you need to set user access permissions for a mount directory, configure the securityContext for the pod when you create the application to modify the permissions. For more information, see Configure Volume Permission and Ownership Change Policy for Pods.
When you configure securityContext.fsGroup, the kubelet recursively changes file permissions (chmod/chown) when mounting a volume. This can significantly increase mount time for volumes with many files.
For clusters of version 1.20 or later, we recommend setting fsGroupChangePolicy to OnRootMismatch. This optimizes mount performance by only performing the recursive permission change when the volume is first mounted and the root directory permissions do not match. If performance remains an issue or you require more granular permission control, use an initContainer to manage permissions before the application container starts.
When a Pod is rebuilt, it remounts the original cloud disk. If other constraints prevent the Pod from being scheduled to the original zone, it will remain in the Pending state because it cannot mount the cloud disk.
Expansion
Automatic disk volume expansion
Disk volumes do not expand automatically. To expand, update the PVC storage capacity. See Online expansion of disk volumes.
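Updating the PVC storage capacity can be done with kubectl patch. A minimal sketch, assuming the PVC was dynamically provisioned and its StorageClass allows expansion; the function name expand_pvc is hypothetical:

```shell
# Request a larger size for a PVC by patching spec.resources.requests.storage.
# Usage: expand_pvc <pvc-name> <new-size> [namespace]
expand_pvc() {
  kubectl patch pvc "$1" -n "${3:-default}" --type merge \
    -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$2\"}}}}"
}
```

For example, expand_pvc data-disk 40Gi requests 40 GiB for a PVC named data-disk (a placeholder). The new size must be larger than the current one, because Kubernetes does not shrink volumes.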
To enable automatic expansion, use a CRD to expand volumes when usage exceeds a threshold. See Configure automatic expansion.
For clusters before 1.16, or when online expansion requirements are not met (e.g., basic disks), expand the cloud disk directly in the ECS console. Cluster resources are unaffected: the PVC and PV capacity stays at the pre-expansion size.
Disk expansion failure: Waiting for user to restart pod
Symptom
After updating the PVC storage capacity, the StorageCapacity field in the PVC Status does not change, and the PVC event reports:
Waiting for user to (re-)start a pod to finish file system resize of volume on node.
Cause
Cloud disk expansion has two steps: disk capacity expansion via the ResizeDisk API, then file system expansion. This error means step one succeeded but step two failed, indicating a node-level issue.
Solution
Determine the type of the current node.
- If the node is ECI, run kubectl get configmap -n kube-system eci-profile -o jsonpath="{.data.enablePVCController}" and verify the value is true. See eci-profile configuration parameters.
  If the issue persists, submit a ticket for assistance.
- If the node is ECS, run kubectl get pods -n kube-system -l app=csi-plugin --field-selector=spec.nodeName=<node-name> to check the CSI plug-in status.
  - If the CSI plug-in is normal, join the DingTalk user group (ID: 35532895) for assistance.
  - If abnormal, restart the CSI plug-in pod and retry. If the issue persists, join the DingTalk user group (ID: 35532895) for assistance.
Disk expansion failure: Static PVC or resize not allowed
Symptom
After updating the PVC storage capacity, the following error is reported:
only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize
Cause
- Cause 1: The PVC and PV of the disk volume were created manually (statically provisioned). The storageClassName parameter in the PVC is not specified, or no StorageClass with the specified name exists in the cluster.
- Cause 2: In the StorageClass that is referenced by the PVC, the allowVolumeExpansion parameter is set to false. Expansion is not supported.
Solution
- Solution for Cause 1: Check the storageClassName configuration of the PVC and ensure that a StorageClass with the same name exists in the cluster. If not, create a corresponding StorageClass based on the properties of the existing disk volume and configure allowVolumeExpansion: true.
- Solution for Cause 2: StorageClasses are immutable. Create a new StorageClass with the allowVolumeExpansion parameter set to true. Then, modify the PVC to reference the new StorageClass before you expand the PVC.
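Both causes can be checked from the command line by reading the PVC's storageClassName and then the allowVolumeExpansion field of that StorageClass. A minimal sketch; the function name pvc_expansion_allowed is hypothetical:

```shell
# Report whether the StorageClass referenced by a PVC allows expansion.
# Usage: pvc_expansion_allowed <pvc-name> [namespace]
pvc_expansion_allowed() {
  sc=$(kubectl get pvc "$1" -n "${2:-default}" -o jsonpath='{.spec.storageClassName}') || return 1
  if [ -z "$sc" ]; then
    echo "PVC $1 has no storageClassName"
    return 1
  fi
  allowed=$(kubectl get storageclass "$sc" -o jsonpath='{.allowVolumeExpansion}') || {
    echo "StorageClass $sc could not be read"
    return 1
  }
  # An unset field is treated as false.
  echo "StorageClass $sc allowVolumeExpansion=${allowed:-false}"
}
```

A missing storageClassName or an unreadable StorageClass points to Cause 1, and a value of false points to Cause 2.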
Unmounting
Pod deletion failure: Not a portable disk
Symptom
When you unmount a cloud disk, the error message The specified disk is not a portable disk is displayed.
Cause
The billing method of the cloud disk is subscription. This may result from purchasing a subscription disk or converting the billing method when upgrading an ECS instance.
Solution
Change the billing method of the cloud disk to pay-as-you-go.
Pod deletion failure: Unmount fails due to an orphaned pod
Symptom
Unmounting a pod fails, and the kubelet logs show an orphaned pod that is not managed by ACK.
Cause
The pod terminated abnormally, leaving an orphaned volume mount point. In Kubernetes before 1.22, kubelet volume GC was incomplete, requiring manual cleanup.
Solution
Run the following script on the problematic node to clean up the garbage mount points.
wget https://raw.githubusercontent.com/AliyunContainerService/kubernetes-issues-solution/master/kubelet/kubelet.sh
sh kubelet.sh
Pod restart failure: Unrecoverable mount failure
Symptom
After deletion, the pod cannot restart and reports an unrecoverable mount failure:
Warning FailedMount 9m53s (x23 over 40m) kubelet MountVolume.SetUp failed for volume "xxxxx" : rpc error: code = Internal desc = stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/xxxxx/globalmount: no such file or directory
Scope
- The ACK cluster version is 1.20.4-aliyun-1.
- The application uses a cloud disk as the storage medium.
- A StatefulSet is used and the podManagementPolicy: "Parallel" property is set.
Cause
See Pod fails to start after restarting rapidly.
Solution
- Add new nodes and remove the old ones. The faulty pod recovers automatically. See Create and manage node pools and Remove a node.
- Change the StatefulSet's podManagementPolicy to OrderedReady or remove the podManagementPolicy: "Parallel" field.
- For small clusters:
  - Mark the pod's node as unschedulable with cordon.
  - Delete the pod and wait for its status to become Pending.
  - Remove the cordon from the node and wait for the pod to restart.
- For large clusters, the pod recovers once scheduled to a different node.
Pod deletion failure: Target is busy
Symptom
When deleting a pod, the pod event or kubelet log (/var/log/messages) shows:
unmount failed, output <mount-path> target is busy
Cause
A process is still using the device. Log on to the host to identify and terminate the process.
Solution
- Find the block device under the corresponding mount path.
  mount | grep <mount-path>
  Example output:
  /dev/vdtest <mount-path>
- Find the ID of the process that is using the block device.
  fuser -m /dev/vdtest
- Terminate the process.
  The cloud disk unmounts automatically after the process terminates.
Cloud disk remains after PVC deletion
Symptom
A PVC is deleted, but the cloud disk remains in the ECS console.
Cause
- Cause 1: The PV's reclaimPolicy is Retain, so the PV and cloud disk persist after PVC deletion.
- Cause 2: The PVC and PV are deleted at the same time, or the PV is deleted before the PVC.
Solution
- Solution for Cause 1: With reclaimPolicy set to Retain, CSI does not delete the PV or cloud disk when the PVC is deleted. Delete them manually.
- Solution for Cause 2: If a PV already has a deletionTimestamp, that is, the PV itself is being deleted, CSI will not reclaim the cloud disk. See controller. Delete only the PVC instead. The bound PV is then cleaned up automatically.
PVC remains after deletion
Symptom
PVC deletion fails even with the --force flag.
Cause
A pod still uses the PVC, so its finalizer prevents deletion.
Solution
- View the pods that are referencing the PVC.
  kubectl describe pvc <pvc-name> -n kube-system
- Confirm the referencing pod is no longer in use, delete it, then retry deleting the PVC.
Other
Change billing method of a volume to subscription
Cloud disks used as volumes must use pay-as-you-go billing and cannot be converted to subscription.
Identify the cloud disk for a volume in the ECS console
Get the cloud disk ID (d-******** format) and locate it on the EBS page in the ECS console to identify the associated cloud disks.
- For dynamically created PVs, the PV name is the cloud disk ID. View it on the page of the cluster.
- If the PV name is not the cloud disk ID, run kubectl get pv <pv-name> -o yaml. The volumeHandle field contains the cloud disk ID.