Disk volume FAQ


Resolve disk volume creation, mounting, expansion, and unmounting issues in ACK clusters.

FAQ navigation

Issues are grouped by type: Creation, Mounting, Usage, Expansion, Unmounting, and Other.

Creation

Dynamic PV creation failure: InvalidDataDiskCatagory.NotSupported

Symptom

PV creation fails, and a PVC event displays the error message InvalidDataDiskCategory.NotSupported.

Cause

The StorageClass specifies a cloud disk type unavailable in the current zone, or inventory is insufficient.

Solution

Dynamic PV creation failure: Insufficient AZone inventory

Symptom

PV creation fails, and a PVC event displays the error message The specified AZone inventory is insufficient.

Cause

Cloud disk creation fails due to insufficient inventory in the specified zone.

Solution

Dynamic PV creation failure: Disk size not supported

Symptom

Dynamic PV creation fails, and a PVC event displays the error message disk size is not supported.

Cause

The PVC specifies an invalid capacity. Minimum limits vary by cloud disk type.

Solution

Adjust the capacity declared in the PVC to a supported value.

Dynamic PV creation failure: Waiting for first consumer

Symptom

PV creation fails when using a StorageClass with the WaitForFirstConsumer binding mode, and a PVC event displays the error message persistentvolume-controller waiting for first consumer to be created before binding.

Cause

The PVC has not detected the node to which the pod is scheduled.

  • If you explicitly set nodeName in the application's YAML, the pod bypasses the scheduler. This prevents the PVC from discovering the node, making it incompatible with the WaitForFirstConsumer binding mode.

  • No pod is referencing the current PVC.

Solution

  • Remove the nodeName field from the application's YAML file and use a different scheduling method.

  • Create a pod that uses the current PVC.

Dynamic PV creation failure: No topology key on CSINode

Symptom

PV creation fails, and a PVC event displays the error message no topology key found on CSINode node-XXXX.

Cause

  • Cause 1: The CSI plug-in on the corresponding node node-XXXX fails to start.

  • Cause 2: A driver that is not supported by the system is used for mounting. By default, the system supports Disk, NAS, and OSS drivers.

Solution

  1. Check the pod status.

    kubectl get pods -n kube-system -o wide | grep node-XXXX
    • Abnormal status: Check the error logs with kubectl logs csi-plugin-xxxx -n kube-system -c csi-plugin. The cause is typically a port conflict on the node. Resolve it in one of the following ways:

      • Close the process that is occupying the port.

      • Add the SERVICE_PORT environment variable to the CSI plug-in to specify a new port.

        kubectl set env -n kube-system daemonset/csi-plugin --containers="csi-plugin" SERVICE_PORT="XXX"
    • If the status is normal, go to the next step.

  2. Use a default system driver, such as Disk, NAS, or OSS. See the Storage documentation.
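To see what the error message refers to, the topology keys that each driver registered on the node can be read from the CSINode object. The following is a minimal sketch; the function name csinode_topology_keys is hypothetical, and the node name is passed as an argument.

```shell
# Print each CSI driver registered on a node together with its topology keys.
# A missing driver or an empty key list suggests the plug-in has not registered correctly.
csinode_topology_keys() {
  kubectl get csinode "$1" \
    -o jsonpath='{range .spec.drivers[*]}{.name}{"\t"}{.topologyKeys}{"\n"}{end}'
}
```

Example: csinode_topology_keys node-XXXX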

Dynamic PV creation failure: selfLink was empty

Symptom

PV creation fails, and a PVC event displays the error message selfLink was empty, can't make reference.

Cause

  1. The cluster version and the CSI plug-in version do not match.

  2. The cluster uses the FlexVolume storage plug-in.

Solution

  1. Upgrade the CSI plug-in version. The plug-in version must be compatible with the cluster version. For example, a Kubernetes 1.20 cluster requires CSI version 1.20 or later.

  2. If your cluster uses the FlexVolume storage plug-in, see Migrate from FlexVolume to CSI.

Dynamic PV creation failure: PVC capacity < 20 GiB

Default ACK StorageClasses (e.g., alicloud-disk-topology-alltype, alicloud-disk-essd) require a minimum capacity of 20 GiB. For smaller volumes, create a custom StorageClass with a disk type supporting lower capacity, such as ESSD AutoPL or ESSD PL0.
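A custom StorageClass for smaller volumes might look like the sketch below. The class name essd-pl0-small and the function name are made up for illustration. The parameter names type and performanceLevel, and the value cloud_essd, are assumptions about the disk CSI driver's StorageClass parameters; check them against the StorageClass reference for your CSI plug-in version before use.

```shell
# Render a StorageClass manifest that provisions ESSD PL0 disks.
# Assumption: parameters "type" and "performanceLevel" are accepted by the disk CSI driver.
render_small_disk_sc() {
  cat <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: essd-pl0-small
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd
  performanceLevel: PL0
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
}
```

To apply it, pipe the output to kubectl: render_small_disk_sc | kubectl apply -f -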

Mounting

Pod startup failure: Volume node affinity conflict

Symptom

A pod with a mounted cloud disk fails to start, and a pod event displays the error message had volume node affinity conflict.

Cause

Every PV has a nodeAffinity property. This error occurs when the PV's nodeAffinity conflicts with the pod's nodeAffinity, so the pod cannot be scheduled.

Solution

Ensure the PV's nodeAffinity and the pod's nodeAffinity are consistent.
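One way to compare the two sides is to print the PV's affinity terms and the labels of the candidate node. This is a sketch with hypothetical helper names; the PV name and node name are passed as arguments.

```shell
# Print the node affinity terms recorded on a PV.
pv_node_affinity() {
  kubectl get pv "$1" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}'
}

# Print a node's labels so they can be compared with the terms above.
node_labels() {
  kubectl get node "$1" --show-labels
}
```

If the zone or other keys in the PV's terms do not appear among the node's labels, the pod cannot be placed on that node.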

Pod startup failure: Can't find disk

Symptom

A pod with a mounted cloud disk fails to start, and a pod event displays the error message can't find disk.

Cause

  • You configured the PV with an incorrect cloud disk ID or a cloud disk ID from another region.

  • Your account lacks permission to access the cloud disk, possibly because it belongs to another account.

Solution

  • If you are using a statically provisioned cloud disk, check whether it meets the following requirements:

    • The region of the cloud disk is the same as the region of the cluster.

    • The cloud disk ID is correct.

    • The cloud disk and the cluster belong to the same Alibaba Cloud account.

  • If the cloud disk is dynamically mounted, check the permissions of the CSI plug-in.

    Check whether an Addon Token exists in the cluster.

    • If an Addon Token exists, check the CSI plug-in version in the cluster, upgrade the plug-in to the latest version, and then try again.

    • If an Addon Token does not exist, the custom AccessKey of the node's worker RAM role is used by default. You need to check the permissions of the corresponding RAM policy.

Pod startup failure: Attach action in process

Symptom

A pod with a mounted cloud disk shows Previous attach action is still in process but starts successfully after a few seconds.

Cause

ECS instances attach one cloud disk at a time. When multiple disk-backed pods land on the same host, disks attach sequentially. This message means another attach is in progress.

Solution

No action required. The system retries automatically.

Pod startup failure: InvalidInstanceType.NotSupportDiskCategory

Symptom

When you start a pod with a cloud disk mounted, the error message InvalidInstanceType.NotSupportDiskCategory is displayed.

Cause

The pod is scheduled to an ECS node whose instance type does not support this cloud disk type.

Solution

Try one of the following:

  • Verify the cluster has an ECS node whose instance type supports this cloud disk type, and that the pod can be scheduled to it.

  • If no node instance type supports this cloud disk type, switch to a compatible disk type.

Note

For cloud disk and instance type compatibility, see Instance families.

Pod startup failure: CSI driver not registered

Symptom

When you start a pod, the following warning is displayed.

Warning  FailedMount       98s (x9 over 3m45s)  kubelet, cn-zhangjiakou.172.20.XX.XX  MountVolume.MountDevice failed for volume "d-xxxxxxx" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name diskplugin.csi.alibabacloud.com not found in the list of registered CSI drivers

Cause

  • This typically occurs on new nodes where the application pod tries to mount a volume before the concurrently starting CSI plug-in finishes registration.

  • The CSI plug-in on the node failed to register, possibly due to a startup failure.

Solution

  • On a newly added node, no action needed. The system retries automatically.

  • If the CSI plug-in fails to register, check its status and logs. If the plug-in is normal, join the DingTalk user group (ID: 35532895) for assistance.

Pod startup failure: Multi-Attach error

Symptom

A pod mounting a cloud disk shows warning failedAttachVolume xxx xxx Multi-Attach error for volume "xxx" in events. Running kubectl describe pvc <pvc-name> reveals multiple pods referencing the same PVC.

Cause

  • Cause 1: Without multi-attach enabled, a cloud disk can be mounted to only one pod.

  • Cause 2: The pod using the PVC was deleted, but the cloud disk was not properly unmounted.

    In the ECS console, find the node the PVC's cloud disk is attached to. Check the CSI plug-in pod logs on that node. If you see Path is mounted, no remove: /var/lib/kubelet/plugins/kubernetes.io/csi/diskplugin.csi.alibabacloud.com/xxx/globalmount, check whether the CSI plug-in directly mounts the /var/run hostPath:

    kubectl get ds -n kube-system csi-plugin -ojsonpath='{.spec.template.spec.volumes[?(@.hostPath.path=="/var/run/")]}'

    Non-empty output confirms the issue.

Solution

  • Solution for Cause 1:

    Ensure only one pod references the PVC.

  • Solution for Cause 2:

    Manually patch the CSI plug-in YAML file:

    kubectl patch -n kube-system daemonset csi-plugin -p '
    spec:
      template:
        spec:
          containers:
            - name: csi-plugin
              volumeMounts:
                - mountPath: /host/var/run/efc
                  name: efc-metrics-dir
                - mountPath: /host/var/run/ossfs
                  name: ossfs-metrics-dir
                - mountPath: /host/var/run/
                  $patch: delete
          volumes:
            - name: ossfs-metrics-dir
              hostPath:
                path: /var/run/ossfs
                type: DirectoryOrCreate
            - name: efc-metrics-dir
              hostPath:
                path: /var/run/efc
                type: DirectoryOrCreate
            - name: fuse-metrics-dir
              $patch: delete'

Pod startup failure: Mount timeout

Symptom

You start a pod that has a volume mounted, and a pod event displays the error message Unable to attach or mount volumes: unmounted volumes=[xxx], unattached volumes=[xxx]: timed out waiting for the condition.

Cause

This event is reported by the kubelet. The kubelet periodically checks whether the volumes used by the pods on its node are Ready. If a volume is not Ready, the preceding error message is displayed.

This event indicates the mount was incomplete at check time. Possible causes:

  • Cause 1: A mount error occurred, but the detailed event has been overwritten. Only the kubelet error event remains.

  • Cause 2: The kubelet times out when it retrieves a ConfigMap or the ServiceAccount default token. This indicates a network issue on the node. Retry on a different node.

  • Cause 3: When securityContext.fsGroup is set, the kubelet recursively changes ownership of all files during mount. This is slow for volumes with many files.

  • Cause 4: For a statically provisioned volume, the driver field in the PV is incorrect, for example because of a typo. An incorrect driver name prevents the volume from becoming Ready.

Solution

  • Solution for Cause 1: Delete the pod so that it is recreated, then find and diagnose the actual error event.

  • Solution for Cause 2: Reschedule the pod to another node. See Schedule an application to a specified node.

  • Solution for Cause 3: For clusters version 1.20+, set fsGroupChangePolicy to OnRootMismatch. This limits recursive ownership changes to the first mount when root directory permissions differ, normalizing mount time for subsequent pod restarts. For more information about the fsGroupChangePolicy parameter, see Configure a Security Context for a Pod or Container. If insufficient, use an initContainer for permission adjustments.

  • Solution for Cause 4: Check and specify the correct driver name. Examples:

    • diskplugin.csi.alibabacloud.com

    • nasplugin.csi.alibabacloud.com

    • ossplugin.csi.alibabacloud.com
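For Cause 4, the driver name stored in a PV can be printed and compared with the names listed above. A minimal sketch follows; both function names are hypothetical.

```shell
# Print the CSI driver name recorded in a statically provisioned PV.
pv_csi_driver() {
  kubectl get pv "$1" -o jsonpath='{.spec.csi.driver}'
}

# List the CSIDriver objects defined in the cluster for comparison.
registered_csi_drivers() {
  kubectl get csidrivers
}
```

If the value printed by pv_csi_driver does not match any listed driver exactly, recreate the PV with the corrected name.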

Pod startup failure: Invalid device format for NVMe

Symptom

A pod with a mounted cloud disk fails to start, and a pod event displays the error message validate error Device /dev/nvme1n1 has error format more than one digit locations.

Cause

The node is an NVMe-based instance type (such as g7se, r7se, c7se, or any 8th-generation ECS), but the cluster and CSI plug-in versions are too old to support it.

Solution

Ensure your ACK cluster is version 1.20+ and upgrade the CSI plug-in to v1.22.9-30eb0ee5-aliyun or later. See Manage components.

Note

The FlexVolume plug-in is not supported. Join the DingTalk user group (ID: 35532895) for assistance with migrating from the FlexVolume plug-in to the CSI plug-in.

Pod startup failure: ECS task conflict

Symptom

A pod with a mounted cloud disk fails to start, and a pod event displays the error message ecs task is conflicted.

Cause

Some ECS tasks must be performed in sequence. When multiple requests are sent to ECS at the same time, an ECS task conflict error occurs.

Solution

Solutions:

Pod startup failure: Corrupted file system

Symptom

A pod with a mounted cloud disk fails to start, and a pod event displays the following error message.

wrong fs type, bad option, bad superblock on /dev/xxxxx  missing codepage or helper program, or other error

Cause

The cloud disk cannot be mounted because its file system is corrupted.

Solution

This is usually caused by an improper unmount. To resolve:

  1. Verify the application meets these requirements:

    • No more than one pod is mounting the same cloud disk.

    • No data is written to the disk while it is being detached.

  2. Log on to the pod's host and run fsck -y /dev/xxxxx to repair the file system.

    Replace /dev/xxxxx with the device path from the pod event. Repair modifies file system metadata. If repair fails, the file system becomes permanently corrupted.

Pod startup failure: Exceeded max volume count

Symptom

A pod with a mounted cloud disk stays in Pending and cannot be scheduled, even though the ECS instance type allows more disk attachments:

0/1 nodes are available: 1 node(s) exceed max volume count.

Cause

Scheduling is limited by the MAX_VOLUMES_PERNODE environment variable.

Solution

  • CSI plug-in v1.26.4-e3de357-aliyun+ supports automatic disk count configuration. Delete the MAX_VOLUMES_PERNODE environment variable from the CSI plug-in DaemonSet in kube-system to enable automatic configuration based on ECS instance type:

    kubectl patch -n kube-system daemonset csi-plugin -p '
    spec:
      template:
        spec:
          containers:
          - name: csi-plugin
            env:
            - name: MAX_VOLUMES_PERNODE
              $patch: delete'
  • For CSI plug-in versions before v1.26.4-e3de357-aliyun, manually set this environment variable based on the node with the fewest attachable data disks.

Important
  • Automatic configuration runs only at CSI plug-in pod startup. If you manually attach or detach a data disk, recreate the CSI plug-in pod on that node to retrigger it.

  • Automatic configuration does not account for statically provisioned disk volumes. If present, the schedulable pod count may be lower than expected.
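Before changing anything, it can help to see which limit is currently in effect. The sketch below reads the MAX_VOLUMES_PERNODE value from the DaemonSet and the attachable volume count that the disk driver reports on the node's CSINode object. The function names are hypothetical.

```shell
# Print the MAX_VOLUMES_PERNODE value set on the csi-plugin container (empty if unset).
csi_max_volumes_env() {
  kubectl get ds -n kube-system csi-plugin \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="csi-plugin")].env[?(@.name=="MAX_VOLUMES_PERNODE")].value}'
}

# Print the attachable volume count the disk driver reports for a node.
csinode_disk_limit() {
  kubectl get csinode "$1" \
    -o jsonpath='{.spec.drivers[?(@.name=="diskplugin.csi.alibabacloud.com")].allocatable.count}'
}
```

The scheduler uses the count on the CSINode object, so comparing it with the number of disks the ECS instance type can attach shows whether the limit is too low or too high.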

Pod startup failure: Instance disk limit reached

Symptom

A pod with a mounted cloud disk stays in ContainerCreating. Pod event:

MountVolume.MountDevice failed for volume "d-xxxx" : rpc error: code = Aborted desc = NodeStageVolume: Attach volume: d-xxxx with error: rpc error: code = Internal desc = SDK.ServerError
ErrorCode: InstanceDiskLimitExceeded
Message: The amount of the disk on instance in question reach its limits

Cause

The value of the MAX_VOLUMES_PERNODE environment variable is too high.

Solution

  • CSI plug-in v1.26.4-e3de357-aliyun+ supports automatic disk count configuration. Delete the MAX_VOLUMES_PERNODE environment variable from the CSI plug-in DaemonSet in kube-system to enable automatic configuration based on ECS instance type:

    kubectl patch -n kube-system daemonset csi-plugin -p '
    spec:
      template:
        spec:
          containers:
          - name: csi-plugin
            env:
            - name: MAX_VOLUMES_PERNODE
              $patch: delete'
  • For CSI plug-in versions before v1.26.4-e3de357-aliyun, manually set this environment variable based on the node with the fewest attachable data disks.

Important
  • Automatic configuration runs only at CSI plug-in pod startup. If you manually attach or detach a data disk, recreate the CSI plug-in pod on that node to retrigger it.

  • Automatic configuration does not account for statically provisioned disk volumes. If present, the schedulable pod count may be lower than expected.

OperationDenied.HpnZoneMismatch error on Lingjun nodes

The cloud disk lacks the required attachToHpnZone:XX tag for the Lingjun node. This tag can only be added at disk creation time. Create a new cloud disk with the tag.

  1. In the ECS console, choose Block Storage > Disks and start creating a disk.

  2. Add the following tags during creation (cannot be added later):

    • createdByProduct:eflo

    • attachToHpnZone:XX (Replace XX with the actual HpnZone. You can run a command to query nodes to obtain the HpnZone.)

    See Create an empty data disk.

  3. Delete the old PV and PVC, then create new ones with the replacement cloud disk. See Use a statically provisioned disk volume or Use a dynamically provisioned disk volume.

Change default StorageClass configuration

The default StorageClass cannot be changed.

After csi-provisioner installation, a default StorageClass (e.g., alicloud-disk-topology-alltype) is created. Do not modify default StorageClasses. Create a new StorageClass with your preferred volume type or reclaim policy instead.

Use a single disk volume for multiple applications

Cloud disks provide non-shared storage. Without multi-attach enabled, a cloud disk can be mounted to only one pod. See Enable NVMe-based multi-attach for cloud disks and configure reservations.

Usage

Application I/O error on disk volume

Symptom

The cloud disk mounts normally and the application starts, but shortly after reports an input/output error.

Cause

The cloud disk used by the application has been detached or deleted.

Solution

Check the status of the cloud disk and take action based on the status.

  1. Identify the PVC from the pod's VolumeMount definition for the affected mount directory.

  2. Run kubectl get pvc <pvc-name> to check the PVC status and note the bound PV.

  3. Get the cloud disk ID from the volumeHandle field in the PV's YAML.

  4. On the EBS page of the ECS console, check the status of the cloud disk based on the cloud disk ID.

    • If the cloud disk is Available, it has been unmounted. Restart the pod to remount.

      Note

      The pod is Running, meaning the disk was mounted then unmounted, likely due to multiple pods referencing the same disk. Run kubectl describe pvc <pvc-name> and check UsedBy to confirm.

    • If the cloud disk is not found, it has been released and is unrecoverable.

      Important

      When mounting an ESSD, use automatic snapshots to protect disk volume data. See Data loss due to unexpected cloud disk deletion.
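Steps 1 to 3 above can be sketched with kubectl as follows. The helper names are hypothetical. pod_pvcs lists every PVC the pod references rather than only the one for the affected mount directory, and it prints an empty line for volumes that are not backed by a PVC.

```shell
# Step 1: list the PVC names referenced by a pod. Arguments: namespace, pod name.
pod_pvcs() {
  kubectl get pod "$2" -n "$1" \
    -o jsonpath='{range .spec.volumes[*]}{.persistentVolumeClaim.claimName}{"\n"}{end}'
}

# Step 2: print the PV bound to a PVC. Arguments: namespace, PVC name.
pvc_bound_pv() {
  kubectl get pvc "$2" -n "$1" -o jsonpath='{.spec.volumeName}'
}

# Step 3: print the cloud disk ID stored in the PV's volumeHandle field.
pv_disk_id() {
  kubectl get pv "$1" -o jsonpath='{.spec.csi.volumeHandle}'
}
```

The ID printed by pv_disk_id is the value to look up on the EBS page in step 4.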

Set user permissions for mount directory

You cannot set user access permissions for cloud disks. If you need to set user access permissions for a mount directory, configure the securityContext for the pod when you create the application to modify the permissions. For more information, see Configure Volume Permission and Ownership Change Policy for Pods.

Important
  • When you configure securityContext.fsGroup, the kubelet recursively changes file ownership and permissions (chown/chmod) when mounting a volume. This can significantly increase mount time for volumes with many files.

    For clusters of version 1.20 or later, we recommend setting fsGroupChangePolicy to OnRootMismatch. This optimizes mount performance by only performing the recursive permission change when the volume is first mounted and the root directory permissions do not match. If performance remains an issue or you require more granular permission control, use an initContainer to manage permissions before the application container starts.

  • When a pod is rebuilt, it remounts the original cloud disk. If other constraints prevent the pod from being scheduled to the original zone, it remains in the Pending state because it cannot mount the cloud disk.
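The fsGroupChangePolicy setting recommended above is placed in the pod-level securityContext. The following sketch renders an example pod manifest; the pod name, image, group ID, mount path, and PVC name disk-pvc are all placeholders chosen for illustration.

```shell
# Render a pod manifest that sets fsGroup with the OnRootMismatch change policy.
# All names and the group ID are illustrative placeholders.
render_fsgroup_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo
spec:
  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: OnRootMismatch
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: disk-pvc
EOF
}
```

To try it, pipe the output to kubectl: render_fsgroup_pod | kubectl apply -f -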

Expansion

Automatic disk volume expansion

Disk volumes do not expand automatically by default. To expand a volume manually, update the PVC storage capacity. See Online expansion of disk volumes.

To enable automatic expansion, use a CRD to expand volumes when usage exceeds a threshold. See Configure automatic expansion.

Note

For clusters before 1.16, or when online expansion requirements are not met (e.g., basic disks), expand the cloud disk directly in the ECS console. This expansion is not reflected in cluster resources: the PVC and PV capacity stays at the pre-expansion size.
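Updating the PVC storage capacity means raising spec.resources.requests.storage on the PVC. A minimal sketch, with a hypothetical function name, is shown below; the namespace, PVC name, and new size are passed as arguments.

```shell
# Request a larger size for a PVC.
# Arguments: namespace, PVC name, new size (for example 30Gi).
expand_pvc() {
  kubectl patch pvc "$2" -n "$1" \
    -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$3\"}}}}"
}
```

Example: expand_pvc default disk-pvc 30Gi. The new size must be larger than the current one, and the StorageClass must allow expansion.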

Disk expansion failure: Waiting for user to restart pod

Symptom

After updating the PVC storage capacity, the StorageCapacity field in the PVC Status does not change, and the PVC event reports:

 Waiting for user to (re-)start a pod to finish file system resize of volume on node.

Cause

Cloud disk expansion has two steps: disk capacity expansion via the ResizeDisk API, then file system expansion. This error means step one succeeded but step two failed, indicating a node-level issue.

Solution

Determine the type of the current node.

  • If the node is ECI, run kubectl get configmap -n kube-system eci-profile -o jsonpath="{.data.enablePVCController}" and verify the value is true. See eci-profile configuration parameters.

    If the issue persists, submit a ticket for assistance.

  • If the node is ECS, run kubectl get pods -n kube-system -l app=csi-plugin --field-selector=spec.nodeName=<node-name> to check the CSI plug-in status.

    • If the CSI plug-in is normal, join the DingTalk user group (ID: 35532895) for assistance.

    • If abnormal, restart the CSI plug-in pod and retry. If the issue persists, join the DingTalk user group (ID: 35532895) for assistance.

Disk expansion failure: Static PVC or resize not allowed

Symptom

After updating the PVC storage capacity, the following error is reported:

only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize 

Cause

  • Cause 1: The PVC and PV of the disk volume are statically provisioned (created manually). The storageClassName parameter in the PVC is not specified, or no StorageClass with the specified name exists in the cluster.

  • Cause 2: In the StorageClass that is referenced by the PVC, the allowVolumeExpansion parameter is set to false. Expansion is not supported.

Solution

  • Solution for Cause 1: Check the storageClassName configuration of the PVC and ensure that a StorageClass with the same name exists in the cluster. If not, create a corresponding StorageClass based on the properties of the existing disk volume and configure allowVolumeExpansion: true.

  • Solution for Cause 2: StorageClasses are immutable. Create a new StorageClass with the allowVolumeExpansion parameter set to true. Then, modify the PVC to reference the new StorageClass before you expand the PVC.
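To tell the two causes apart, check which StorageClass the PVC references and whether that class allows expansion. The helper names in this sketch are hypothetical.

```shell
# Print the StorageClass name referenced by a PVC (empty if none is set).
# Arguments: namespace, PVC name.
pvc_storage_class() {
  kubectl get pvc "$2" -n "$1" -o jsonpath='{.spec.storageClassName}'
}

# Print allowVolumeExpansion for a StorageClass (empty output means it is not set).
sc_allows_expansion() {
  kubectl get sc "$1" -o jsonpath='{.allowVolumeExpansion}'
}
```

An empty result from pvc_storage_class, or a "not found" error from sc_allows_expansion, points to Cause 1. Output other than true from sc_allows_expansion points to Cause 2.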

Unmounting

Pod deletion failure: Not a portable disk

Symptom

When you unmount a cloud disk, the error message The specified disk is not a portable disk is displayed.

Cause

The billing method of the cloud disk is subscription. This may result from purchasing a subscription disk or converting the billing method when upgrading an ECS instance.

Solution

Change the billing method of the cloud disk to pay-as-you-go.

Pod deletion failure: Unmount fails due to an orphaned pod

Symptom

Unmounting fails when a pod is deleted, and the kubelet logs show an orphaned pod that is not managed by ACK.

Cause

The pod terminated abnormally, leaving an orphaned volume mount point. In Kubernetes before 1.22, kubelet volume GC was incomplete, requiring manual cleanup.

Solution

Run the following script on the problematic node to clean up the garbage mount points.

wget https://raw.githubusercontent.com/AliyunContainerService/kubernetes-issues-solution/master/kubelet/kubelet.sh
sh kubelet.sh

Pod restart failure: Unrecoverable mount failure

Symptom

After deletion, the pod cannot restart and reports an unrecoverable mount failure:

Warning FailedMount 9m53s (x23 over 40m) kubelet MountVolume.SetUp failed for volume "xxxxx" : rpc error: code = Internal desc = stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/xxxxx/globalmount: no such file or directory

Scope

  • The ACK cluster version is 1.20.4-aliyun-1.

  • The application uses a cloud disk as the storage medium.

  • A StatefulSet is used and the podManagementPolicy: "Parallel" property is set.

Cause

See Pod fails to start after restarting rapidly.

Solution

  • Add new nodes and remove the old ones. The faulty pod recovers automatically. See Create and manage node pools and Remove a node.

  • Change the StatefulSet's podManagementPolicy to OrderedReady or remove the podManagementPolicy: "Parallel" field.

  • For small clusters:

    1. Mark the pod's node as unschedulable with cordon.

    2. Delete the pod and wait for its status to become Pending.

    3. Remove the cordon from the node and wait for the pod to restart.

  • For large clusters, the pod recovers once scheduled to a different node.
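The small-cluster procedure above can be sketched with standard kubectl commands. The function names are hypothetical; the node name, namespace, and pod name are passed as arguments.

```shell
# Steps 1 and 2: cordon the node, then delete the pod so the StatefulSet recreates it.
# Arguments: node name, namespace, pod name.
cordon_and_delete() {
  kubectl cordon "$1" && kubectl delete pod "$3" -n "$2"
}

# Step 3: once the recreated pod is Pending, make the node schedulable again.
# Argument: node name.
uncordon_node() {
  kubectl uncordon "$1"
}
```

Check the pod status with kubectl get pod between the two calls, and run uncordon_node only after the pod shows Pending.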

Pod deletion failure: Target is busy

Symptom

When deleting a pod, the pod event or kubelet log (/var/log/messages) shows:

unmount failed, output <mount-path> target is busy

Cause

A process is still using the device. Log on to the host to identify and terminate the process.

Solution

  1. Find the block device under the corresponding mount path.

    mount | grep <mount-path>
    /dev/vdtest <mount-path>
  2. Find the ID of the process that is using the block device.

    fuser -m /dev/vdtest
  3. Terminate the process.

    The cloud disk unmounts automatically after the process terminates.
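Steps 2 and 3 can be sketched as follows, assuming the fuser utility from psmisc is installed on the node. The function names are hypothetical. Confirm what each process is before stopping it.

```shell
# Show the processes (user, PID, access type, command) using the device's file system.
# Argument: block device path, for example /dev/vdtest.
busy_processes() {
  fuser -vm "$1"
}

# Ask a process to exit with SIGTERM. Argument: process ID.
stop_process() {
  kill -TERM "$1"
}
```

If a process ignores SIGTERM, a forced kill may be needed, at the risk of losing unwritten data.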

Cloud disk remains after PVC deletion

Symptom

A PVC is deleted, but the cloud disk remains in the ECS console.

Cause

  • Cause 1: The PV's reclaimPolicy is Retain, so the PV and cloud disk persist after PVC deletion.

  • Cause 2: The PVC and PV are deleted at the same time, or the PV is deleted before the PVC.

Solution

  • Solution for Cause 1: With reclaimPolicy set to Retain, CSI does not delete the PV or cloud disk when the PVC is deleted. Delete them manually.

  • Solution for Cause 2: If a PV already has a deletionTimestamp set (that is, the PV itself is being deleted), CSI does not reclaim the cloud disk. See controller. To have the cloud disk released, delete only the PVC. The bound PV is then cleaned up automatically.
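Both causes can be checked on the PV before the PVC is deleted. This sketch uses hypothetical helper names and takes the PV name as an argument.

```shell
# Print the PV's reclaim policy (Retain or Delete).
pv_reclaim_policy() {
  kubectl get pv "$1" -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'
}

# Print the PV's metadata.deletionTimestamp (empty if the PV is not being deleted).
pv_deletion_timestamp() {
  kubectl get pv "$1" -o jsonpath='{.metadata.deletionTimestamp}'
}
```

A policy of Retain corresponds to Cause 1. A non-empty timestamp means the PV is already marked for deletion, which corresponds to Cause 2.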

PVC remains after deletion

Symptom

PVC deletion fails even with the --force flag.

Cause

A pod still uses the PVC, so its finalizer prevents deletion.

Solution

  1. View the pods that are referencing the PVC.

    kubectl describe pvc <pvc-name> -n <namespace>
  2. Confirm the referencing pod is no longer in use, delete it, then retry deleting the PVC.

Other

Change billing method of a volume to subscription

Cloud disks used as volumes must use pay-as-you-go billing and cannot be converted to subscription.

Identify the cloud disk for a volume in the ECS console

Get the cloud disk ID (d-******** format) and locate it on the EBS page in the ECS console to identify the associated cloud disks.

  • For dynamically created PVs, the PV name is the cloud disk ID. View it on the Volumes > Persistent Volumes page of the cluster.

  • If the PV name is not the cloud disk ID, run kubectl get pv <pv-name> -o yaml. The volumeHandle field contains the cloud disk ID.