Enable the descheduling feature

Enable Koordinator Descheduler to automatically evict and rebalance pods when node taints, affinity rules, or load profiles change.

The following procedure uses the RemovePodsViolatingNodeTaints plug-in as an example.

Prerequisites

Make sure you have:

Descheduling is not supported on virtual nodes.

Usage notes

  • Koordinator Descheduler only evicts pods — it does not recreate them. The workload controller (such as a Deployment or StatefulSet) recreates evicted pods, and the standard scheduler places them.

  • Old pods are evicted before new pods are created. Ensure your application has enough replicas to maintain availability during eviction.

  • ack-descheduler is discontinued. If you still use it, see How do I migrate from ack-descheduler to Koordinator Descheduler?

Choose a descheduling plug-in

Select a plug-in for your scenario:

| Scenario | Plug-in | Policy type |
| --- | --- | --- |
| Pods remain on nodes that acquired a NoSchedule taint after scheduling | RemovePodsViolatingNodeTaints | Deschedule |
| Pods violate inter-pod anti-affinity rules | RemovePodsViolatingInterPodAntiAffinity | Deschedule |
| Pods no longer satisfy node affinity rules | RemovePodsViolatingNodeAffinity | Deschedule |
| Pods restart too frequently | RemovePodsHavingTooManyRestarts | Deschedule |
| Pods have exceeded their time-to-live | PodLifeTime | Deschedule |
| Pods are in the Failed state | RemoveFailedPod | Deschedule |
| Replicated pods are unevenly spread across nodes | RemoveDuplicates | Balance |
| Nodes are unevenly utilized by resource allocation | LowNodeUtilization | Balance |
| Pods violate topology spread constraints | RemovePodsViolatingTopologySpreadConstraint | Balance |
| Nodes are overloaded by actual resource utilization | LowNodeLoad | Balance |

The examples below use RemovePodsViolatingNodeTaints. Read the descheduling concepts and Koordinator Descheduler vs. Kubernetes Descheduler before you start.

How it works

RemovePodsViolatingNodeTaints checks every node for NoSchedule taints at the configured interval. If a running pod lacks a toleration for a node's NoSchedule taint, the plug-in evicts it. The workload controller recreates the pod, and the scheduler places it on a tolerable node.

Use excludedTaints to exempt specific taints. If a taint's key or key=value pair matches an entry in excludedTaints, the plug-in ignores it.
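The matching rule can be modeled in a few lines of shell. This is only a sketch of the rule as stated above, not the plug-in's implementation, and the function name is_excluded is hypothetical.

```shell
# Hypothetical model of the excludedTaints rule described above.
# $1 is a taint written as "key=value" (or a bare "key");
# the remaining arguments are the excludedTaints entries.
is_excluded() {
  taint="$1"
  shift
  key="${taint%%=*}"   # strip "=value" to get the bare key
  for entry in "$@"; do
    # An entry matches either the full key=value pair or the key alone.
    if [ "$entry" = "$taint" ] || [ "$entry" = "$key" ]; then
      return 0
    fi
  done
  return 1
}

is_excluded "deschedule=not-allow" "deschedule=not-allow" "reserved" && echo "ignored"
is_excluded "deschedule=allow" "deschedule=not-allow" "reserved" || echo "checked"
```

The first call matches the key=value entry, so the taint is ignored. The second call matches no entry, so the taint is still checked against pod tolerations.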

Example scenario:

A three-node cluster runs a Deployment with one pod per node. An administrator adds NoSchedule taints to two nodes after deployment:

  • Node A gets deschedule=not-allow:NoSchedule. Because deschedule=not-allow is in excludedTaints, this taint is ignored — the pod stays.

  • Node B gets deschedule=allow:NoSchedule. This taint is not excluded — the pod is evicted and rescheduled to Node C (which has no NoSchedule taint).
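Because the plug-in only evicts pods that lack a matching toleration, a workload that must stay on Node B could tolerate the taint instead. The fragment below is a sketch using the standard Kubernetes toleration fields. It is not part of the original example, and adding it would change the outcome of the verification steps in this topic.

```yaml
# Deployment fragment: the pod template tolerates deschedule=allow:NoSchedule,
# so RemovePodsViolatingNodeTaints has no violation to act on for these pods.
spec:
  template:
    spec:
      tolerations:
        - key: deschedule
          operator: Equal
          value: allow
          effect: NoSchedule
```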

Step 1: Install ack-koordinator and enable descheduling

If ack-koordinator is already installed, ensure the version is 1.2.0-ack.2 or later.
  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the target cluster name. In the left navigation pane, click Add-ons.

  3. Find ack-koordinator and click Install.

  4. In the Install dialog box, select Enable Descheduler for ACK-Koordinator and complete the installation.

Koordinator Descheduler is deployed as a Deployment on cluster nodes.
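Before you continue, it can help to confirm that the Deployment is up. The following is a minimal sketch, assuming the Deployment is named ack-koord-descheduler in the kube-system namespace (the name used by the restart commands in Step 2). The function name descheduler_ready is hypothetical.

```shell
# Hypothetical helper: succeeds when at least one descheduler replica is ready.
descheduler_ready() {
  # readyReplicas is omitted from the status when no replica is ready,
  # so an empty result is treated as 0.
  ready="$(kubectl -n kube-system get deploy ack-koord-descheduler \
    -o jsonpath='{.status.readyReplicas}')"
  [ "${ready:-0}" -ge 1 ]
}

# Usage:
#   descheduler_ready && echo "Koordinator Descheduler is running"
```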

Step 2: Enable the RemovePodsViolatingNodeTaints plug-in

Configure the plug-in

Create a file named koord-descheduler-config.yaml. This ConfigMap enables RemovePodsViolatingNodeTaints and excludes the deschedule=not-allow taint.

# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    deschedulingInterval: 120s # The interval at which the descheduler runs. Set to 120 seconds here.
    dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations.
    # The preceding configuration is the system configuration.

    profiles:
    - name: koord-descheduler
      plugins:
        deschedule:
          enabled:
            - name: RemovePodsViolatingNodeTaints  # Enable the node taint verification plug-in.

      pluginConfig:
      - name: RemovePodsViolatingNodeTaints # Configure the node taint verification plug-in.
        args:
          excludedTaints:
          - deschedule=not-allow # Ignore nodes whose taint key is deschedule and taint value is not-allow.

      # Required for RemovePodsViolatingNodeTaints to take effect. Do not remove.
      - name: MigrationController # Configure the migration controller.
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly

RemovePodsViolatingNodeTaints parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| excludedTaints | list(string) |  | Taint keys or key=value pairs to ignore. Pods on nodes with these taints are not evicted. |
| includePreferNoSchedule | bool | false | When true, also checks taints with effect PreferNoSchedule, not just NoSchedule. |
| namespaces.include | list(string) |  | Restrict descheduling to specific namespaces. Mutually exclusive with namespaces.exclude. |
| namespaces.exclude | list(string) |  | Skip descheduling in specific namespaces. Mutually exclusive with namespaces.include. |
| labelSelector | map |  | Restrict descheduling to pods that match the specified labels. |

Apply the configuration

  1. Apply the ConfigMap to the cluster:

    kubectl apply -f koord-descheduler-config.yaml
  2. Restart Koordinator Descheduler to load the new configuration:

    kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0
    # Expected output:
    # deployment.apps/ack-koord-descheduler scaled
    kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1
    # Expected output:
    # deployment.apps/ack-koord-descheduler scaled
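The two scale commands return as soon as the request is accepted, not when the new pod is ready. As a sketch, the restart can be wrapped in a helper that also waits for the rollout to finish. The function name restart_descheduler and the 120-second timeout are assumptions, not part of the product.

```shell
# Hypothetical helper: restart Koordinator Descheduler and wait until it is ready.
restart_descheduler() {
  # Scale to zero, then back to one, so the new pod loads the updated ConfigMap.
  kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0 &&
  kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1 &&
  # Block until the replacement pod is available or the timeout expires.
  kubectl -n kube-system rollout status deploy ack-koord-descheduler --timeout=120s
}

# Usage:
#   restart_descheduler
```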

Step 3: Verify descheduling

This example uses a three-node cluster.

  1. Create a file named stress-demo.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: stress-demo
      namespace: default
      labels:
        app: stress-demo
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: stress-demo
      template:
        metadata:
          name: stress-demo
          labels:
            app: stress-demo
        spec:
          containers:
            - args:
                - '--vm'
                - '2'
                - '--vm-bytes'
                - '1600M'
                - '-c'
                - '2'
                - '--vm-hang'
                - '2'
              command:
                - stress
              image: registry-cn-beijing.ack.aliyuncs.com/acs/stress:v1.0.4
              imagePullPolicy: Always
              name: stress
              resources:
                limits:
                  cpu: '2'
                  memory: 4Gi
                requests:
                  cpu: '2'
                  memory: 4Gi
          restartPolicy: Always
  2. Deploy the test workload:

    kubectl create -f stress-demo.yaml
  3. Wait for pods to reach Running:

    kubectl get pod -o wide

    Expected output:

    NAME                         READY   STATUS    RESTARTS   AGE    IP              NODE                        NOMINATED NODE   READINESS GATES
    stress-demo-5f6cddf9-9****   1/1     Running   0          10s    192.XX.XX.27   cn-beijing.192.XX.XX.247   <none>           <none>
    stress-demo-5f6cddf9-h****   1/1     Running   0          10s    192.XX.XX.20   cn-beijing.192.XX.XX.249   <none>           <none>
    stress-demo-5f6cddf9-v****   1/1     Running   0          10s    192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
  4. Add NoSchedule taints to two nodes:

    • Add deschedule=not-allow:NoSchedule to cn-beijing.192.XX.XX.247 (excluded by excludedTaints — pod should stay):

      kubectl taint nodes cn-beijing.192.XX.XX.247 deschedule=not-allow:NoSchedule

      Expected output:

      node/cn-beijing.192.XX.XX.247 tainted
    • Add deschedule=allow:NoSchedule to cn-beijing.192.XX.XX.248 (not excluded — pod should be evicted):

      kubectl taint nodes cn-beijing.192.XX.XX.248 deschedule=allow:NoSchedule

      Expected output:

      node/cn-beijing.192.XX.XX.248 tainted
  5. Watch pod changes. The descheduler checks taints every deschedulingInterval (120 seconds):

    kubectl get pod -o wide -w

    Expected output:

    NAME                         READY   STATUS              RESTARTS   AGE     IP             NODE                    NOMINATED NODE   READINESS GATES
    stress-demo-5f6cddf9-9****   1/1     Running             0          5m34s   192.XX.XX.27   cn-beijing.192.XX.XX.247   <none>           <none>
    stress-demo-5f6cddf9-h****   1/1     Running             0          5m34s   192.XX.XX.20   cn-beijing.192.XX.XX.249   <none>           <none>
    stress-demo-5f6cddf9-v****   1/1     Running             0          5m34s   192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
    stress-demo-5f6cddf9-v****   1/1     Terminating         0          7m58s   192.XX.XX.32   cn-beijing.192.XX.XX.248   <none>           <none>
    stress-demo-5f6cddf9-j****   0/1     ContainerCreating   0          0s      <none>         cn-beijing.192.XX.XX.249   <none>           <none>
    stress-demo-5f6cddf9-j****   1/1     Running             0          2s      192.XX.XX.32   cn-beijing.192.XX.XX.249   <none>           <none>

    The output confirms:

    • The pod on cn-beijing.192.XX.XX.248 (taint deschedule=allow:NoSchedule, not excluded) is evicted.

    • The pod on cn-beijing.192.XX.XX.247 (taint deschedule=not-allow:NoSchedule, excluded) stays running.

    • The evicted pod is rescheduled to cn-beijing.192.XX.XX.249, which has no NoSchedule taint.

  6. Check eviction events for the evicted pod:

    kubectl get event | grep stress-demo-5f6cddf9-v****

    Expected output:

    3m24s       Normal    Evicting            podmigrationjob/b0fba65f-7fab-4a99-96a9-c71a3798****   Pod "default/stress-demo-5f6cddf9-v****" evicted from node "cn-beijing.192.XX.XX.248" by the reason "RemovePodsViolatingNodeTaints"
    2m51s       Normal    EvictComplete       podmigrationjob/b0fba65f-7fab-4a99-96a9-c71a3798****   Pod "default/stress-demo-5f6cddf9-v****" has been evicted
    3m24s       Normal    Descheduled         pod/stress-demo-5f6cddf9-v****                         Pod evicted from node "cn-beijing.192.XX.XX.248" by the reason "RemovePodsViolatingNodeTaints"
    3m24s       Normal    Killing             pod/stress-demo-5f6cddf9-v****                         Stopping container stress

    Each event maps to a migration lifecycle phase:

    | Event | Source | Meaning |
    | --- | --- | --- |
    | Evicting | PodMigrationJob | Descheduler started a migration job to evict the pod. |
    | Descheduled | Pod | Pod received the eviction signal. |
    | Killing | Pod | Container runtime stops the container. |
    | EvictComplete | PodMigrationJob | Pod fully evicted. Workload controller recreates it. |

    The pod on cn-beijing.192.XX.XX.248 had no toleration for the deschedule=allow:NoSchedule taint (not in excludedTaints), so it was evicted.
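After verification, you may want to remove the test taints and workload. The following is a sketch that assumes the node names and file name from this example. The function name cleanup_demo is hypothetical.

```shell
# Hypothetical helper: undo the changes made in Step 3.
# $1 is the node tainted with deschedule=not-allow, $2 the node tainted with deschedule=allow.
cleanup_demo() {
  # A trailing "-" removes the taint that matches the given key, value, and effect.
  kubectl taint nodes "$1" deschedule=not-allow:NoSchedule-
  kubectl taint nodes "$2" deschedule=allow:NoSchedule-
  # Delete the test Deployment created from stress-demo.yaml.
  kubectl delete -f stress-demo.yaml
}

# Usage:
#   cleanup_demo cn-beijing.192.XX.XX.247 cn-beijing.192.XX.XX.248
```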

Configure advanced parameters

Configure global behavior and template settings with a ConfigMap.

Advanced configuration example

This ConfigMap uses DeschedulerConfiguration for global settings, enables RemovePodsViolatingNodeTaints as the descheduling policy, and uses MigrationController as the evictor.

# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations.
    deschedulingInterval: 120s # The interval at which the descheduler runs. The interval is set to 120 seconds in this example.
    nodeSelector: # The nodes that are involved in descheduling. By default, all nodes are descheduled.
      matchLabels:
        alibabacloud.com/nodepool-id: nodepool-1 # Configure it based on your requirements.
    maxNoOfPodsToEvictPerNode: 10 # The maximum number of pods that can be evicted from a node. The limit takes effect on a global scale. By default, no limit is configured.
    maxNoOfPodsToEvictPerNamespace: 10 # The maximum number of pods that can be evicted from a namespace. The limit takes effect on a global scale. By default, no limit is configured.
    # The preceding configuration is the system configuration.

    # The template list.
    profiles:
    - name: koord-descheduler # The name of the template.

      # Scope: apply this template only to the specified nodes.
      # Method 1: Select nodes in one node pool.
      nodeSelector:
        matchLabels:
          alibabacloud.com/nodepool-id: nodepool-1 # Configure it based on your requirements
      # Method 2: Select nodes in multiple node pools.
      # nodeSelector:
      #   matchExpressions:
      #   - key: alibabacloud.com/nodepool-id
      #     operator: In
      #     values:
      #     - nodepool-1
      #     - nodepool-2

      plugins:
        deschedule: # All plug-ins are disabled by default. Specify the ones to enable.
          enabled:
            - name: RemovePodsViolatingNodeTaints  # Enable the node taint verification plug-in.
        balance: # All plug-ins are disabled by default.
          disabled:
            - name: "*" # Disable all Balance plug-ins.
        evict:
          enabled:
            - name: MigrationController # MigrationController is enabled by default.
        filter:
          enabled:
            - name: MigrationController # Use MigrationController's filtering policy by default.

      pluginConfig:
      - name: RemovePodsViolatingNodeTaints
        args:
          excludedTaints:
          - deschedule=not-allow # Ignore nodes whose taint key is deschedule and taint value is not-allow.
          - reserved # Ignore nodes whose taint key is reserved.
          includePreferNoSchedule: false # When false, only checks NoSchedule taints.
          namespaces:
            include: # Restrict descheduling to these namespaces.
              - "namespace1"
              - "namespace2"
            # exclude: # Alternatively, exclude these namespaces.
            #   - "namespace1"
            #   - "namespace2"
          labelSelector: # Only deschedule pods matching these labels.
            accelerator: nvidia-tesla-p100

      - name: MigrationController
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly
          evictLocalStoragePods: false # When false, pods using emptyDir or hostPath are not descheduled.
          maxMigratingPerNode: 1 # Maximum pods migrated simultaneously on a node.
          maxMigratingPerNamespace: 1  # Maximum pods migrated simultaneously in a namespace.
          maxMigratingPerWorkload: 1 # Maximum pods migrated simultaneously in a workload.
          maxUnavailablePerWorkload: 2 # Maximum unavailable replicated pods allowed in a workload.
          objectLimiters:
            workload: # Throttle workload-level migration. Default: only 1 pod per workload within 5 minutes.
              duration: 5m
              maxMigrating: 1
          evictionPolicy: Eviction # Use the Eviction API by default.

System configurations

Configure global, system-level behavior in DeschedulerConfiguration.

| Parameter | Type | Valid value | Description | Example |
| --- | --- | --- | --- | --- |
| dryRun | boolean | true / false (default: false) | Read-only mode. When enabled, no pods are migrated. | false |
| deschedulingInterval | time.Duration | >0s | How often the descheduler runs. | 120s |
| nodeSelector | Structure |  | Limit which nodes are eligible for descheduling. Accepts matchLabels (one node pool) or matchExpressions (multiple node pools). See Kubernetes labelSelector. | See example YAML above |
| maxNoOfPodsToEvictPerNode | int | ≥0 (default: 0) | Maximum pods evicted from a single node per descheduling cycle. 0 means no limit. | 10 |
| maxNoOfPodsToEvictPerNamespace | int | ≥0 (default: 0) | Maximum pods evicted from a single namespace per descheduling cycle. 0 means no limit. | 10 |
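As an illustration, a cautious first rollout could combine read-only mode with a low per-node limit. The fragment below is a sketch of the system section only. The values are illustrative, and the leaderElection block and profiles from the full example are assumed to stay unchanged.

```yaml
# Fragment of the koord-descheduler-config data (system section only).
apiVersion: descheduler/v1alpha2
kind: DeschedulerConfiguration
dryRun: true                    # Read-only mode: no pods are migrated.
deschedulingInterval: 120s
maxNoOfPodsToEvictPerNode: 1    # Illustrative value; takes effect once dryRun is set back to false.
```

After reviewing the component logs, set dryRun back to false and restart Koordinator Descheduler so that the change takes effect.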

Template configurations

Each template (profiles) groups descheduling policies and evictors with the following fields:

  • `name`: Template identifier.

  • `plugins`: Enables or disables descheduling policies (deschedule, balance), evictors (evict), and pre-eviction filters (filter).

  • `pluginConfig`: Per-plug-in arguments. Match the name field to the plug-in name and configure args. See Configure policy plug-ins and Configure evictor plug-ins.

  • `nodeSelector`: Limits the template to specific nodes. If unset, applies to all nodes.

Template-level nodeSelector requires ack-koordinator v1.6.1-ack.1.16 or later.

`plugins` field reference:

| Field | Supported plug-ins | Description |
| --- | --- | --- |
| deschedule | RemovePodsViolatingNodeTaints, RemovePodsViolatingInterPodAntiAffinity, RemovePodsViolatingNodeAffinity, RemovePodsHavingTooManyRestarts, PodLifeTime, RemoveFailedPod | All disabled by default. Specify plug-ins to enable. |
| balance | RemoveDuplicates, LowNodeUtilization, HighNodeUtilization, RemovePodsViolatingTopologySpreadConstraint, LowNodeLoad | All disabled by default. Specify plug-ins to enable. |
| evict | MigrationController, DefaultEvictor | The pod evictor. MigrationController is enabled by default. Do not enable multiple evict plug-ins simultaneously. |
| filter | MigrationController, DefaultEvictor | Pre-eviction filtering policy. MigrationController is enabled by default. Do not enable multiple filter plug-ins simultaneously. |
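As a sketch of how these fields combine, the fragment below enables one Deschedule plug-in and one Balance plug-in in the same template. Choosing RemoveDuplicates is an assumption for illustration. It would run with its default arguments because no pluginConfig entry is given for it, and the MigrationController pluginConfig from the earlier examples is assumed to remain in place.

```yaml
# Fragment of a template in profiles; pluginConfig is omitted here.
profiles:
- name: koord-descheduler
  plugins:
    deschedule:
      enabled:
        - name: RemovePodsViolatingNodeTaints
    balance:
      enabled:
        - name: RemoveDuplicates   # Illustrative choice of Balance plug-in.
    evict:
      enabled:
        - name: MigrationController
    filter:
      enabled:
        - name: MigrationController
```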

Configure policy plug-ins

Koordinator Descheduler supports the six Deschedule and four Balance plug-ins from Kubernetes Descheduler that are listed in the following table. A fifth Balance plug-in, LowNodeLoad, is provided by Koordinator. See Work with load-aware hotspot descheduling.

| Policy type | Plug-in | Description |
| --- | --- | --- |
| Deschedule | RemovePodsViolatingInterPodAntiAffinity | Evicts pods that violate inter-pod anti-affinity rules. |
| Deschedule | RemovePodsViolatingNodeAffinity | Evicts pods that no longer satisfy node affinity rules. |
| Deschedule | RemovePodsViolatingNodeTaints | Evicts pods that cannot tolerate node taints. |
| Deschedule | RemovePodsHavingTooManyRestarts | Evicts pods that restart too frequently. |
| Deschedule | PodLifeTime | Evicts pods whose TTL has expired. |
| Deschedule | RemoveFailedPod | Evicts pods in the Failed state. |
| Balance | RemoveDuplicates | Spreads replicated pods evenly across nodes. |
| Balance | LowNodeUtilization | Redistributes pods based on node resource allocation. |
| Balance | HighNodeUtilization | Consolidates pods from underutilized nodes to more utilized ones. |
| Balance | RemovePodsViolatingTopologySpreadConstraint | Evicts pods that violate topology spread constraints. |

Configure evictor plug-ins

Koordinator Descheduler supports two evictor plug-ins: DefaultEvictor and MigrationController.

MigrationController

MigrationController provides fine-grained eviction control and observability through migration jobs.

| Parameter | Type | Valid value | Description | Example |
| --- | --- | --- | --- | --- |
| evictLocalStoragePods | boolean | true / false (default: false) | When false, pods using emptyDir or hostPath are not descheduled. | false |
| maxMigratingPerNode | int64 | ≥0 (default: 2) | Maximum pods migrated simultaneously on a node. 0 means no limit. | 2 |
| maxMigratingPerNamespace | int64 | ≥0 (default: 0) | Maximum pods migrated simultaneously in a namespace. 0 means no limit. | 1 |
| maxMigratingPerWorkload | intOrString | ≥0 (default: 10%) | Maximum pods or percentage migrated simultaneously in a workload. 0 means no limit. If a workload has only one pod, it is excluded from descheduling. | 1 or 10% |
| maxUnavailablePerWorkload | intOrString | ≥0 and < replica count (default: 10%) | Maximum unavailable replicated pods allowed in a workload. 0 means no limit. | 1 or 10% |
| objectLimiters.workload | Structure | Duration >0 (default: 5m); MaxMigrating ≥0 (default: 10%) | Throttles workload-level migration within a time window. Duration sets the window length. MaxMigrating sets the maximum pods migrated within that window. | duration: 5m, maxMigrating: 1 |
| evictionPolicy | string | Eviction (default), Delete, Soft | Controls how pods are evicted. Eviction: calls the Eviction API for graceful eviction. Delete: calls the Delete API. Soft: adds the scheduling.koordinator.sh/soft-eviction annotation for custom downstream handling. | Eviction |
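The intOrString parameters accept either an absolute count or a percentage. The fragment below is a sketch that uses percentages for the per-workload limits. The values are illustrative rather than recommendations. For a 20-replica workload, 10% corresponds to 2 pods. How fractional results are rounded is not covered in this topic.

```yaml
# pluginConfig fragment for MigrationController with percentage-based limits.
- name: MigrationController
  args:
    apiVersion: descheduler/v1alpha2
    kind: MigrationControllerArgs
    defaultJobMode: EvictDirectly
    maxMigratingPerWorkload: 10%     # At most 10% of a workload's pods migrate at once.
    maxUnavailablePerWorkload: 10%   # At most 10% of a workload's pods may be unavailable.
    objectLimiters:
      workload:
        duration: 5m                 # Length of the throttling window.
        maxMigrating: 1              # At most 1 pod per workload within the window.
```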

DefaultEvictor

DefaultEvictor is the standard Kubernetes Descheduler evictor. See DefaultEvictor for configuration.

MigrationController vs. DefaultEvictor

| Capability | DefaultEvictor | MigrationController |
| --- | --- | --- |
| Eviction methods | Eviction API only | Eviction API, Delete API, or Soft annotation |
| Per-node eviction limit | Supported | Supported |
| Per-namespace eviction limit | Supported | Supported |
| Per-workload eviction limit | Not supported | Supported |
| Per-workload unavailability limit | Not supported | Supported |
| Eviction throttling | Not supported | Time window-based throttling per workload |
| Eviction observability | Component logs only | Component logs and Kubernetes events with per-pod migration status |

Next steps