Enable Koordinator Descheduler to automatically evict and rebalance pods when node taints, affinity rules, or load profiles change.
The following procedure uses the RemovePodsViolatingNodeTaints plug-in as an example.
Prerequisites
Make sure you have:
- An ACK managed Pro cluster.
- kubectl connected to the cluster.
Descheduling is not supported on virtual nodes.
Usage notes
- Koordinator Descheduler only evicts pods; it does not recreate them. The workload controller (such as a Deployment or StatefulSet) recreates evicted pods, and the standard scheduler places them.
- Old pods are evicted before new pods are created. Ensure your application has enough `replicas` to maintain availability during eviction.
- ack-descheduler is discontinued. If you still use it, see How do I migrate from ack-descheduler to Koordinator Descheduler?
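One way to protect availability during eviction is to pair a multi-replica workload with a PodDisruptionBudget. The sketch below is illustrative only: the name `web-pdb`, the label `app: web`, and the `minAvailable` value are hypothetical. It assumes evictions go through the Kubernetes Eviction API, which honors PodDisruptionBudgets; pods removed through a plain delete call are not protected this way.

```yaml
# Hypothetical PodDisruptionBudget for a workload whose pods carry the label app: web.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb            # hypothetical name
  namespace: default
spec:
  minAvailable: 2          # keep at least two pods running during voluntary evictions
  selector:
    matchLabels:
      app: web             # must match the pod labels of the workload (assumed here)
```

With three healthy replicas and `minAvailable: 2`, the Eviction API allows at most one of these pods to be evicted at a time.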
Choose a descheduling plug-in
Select a plug-in for your scenario:
| Scenario | Plug-in | Policy type |
|---|---|---|
| Pods remain on nodes that acquired a NoSchedule taint after scheduling | RemovePodsViolatingNodeTaints | Deschedule |
| Pods violate inter-pod anti-affinity rules | RemovePodsViolatingInterPodAntiAffinity | Deschedule |
| Pods no longer satisfy node affinity rules | RemovePodsViolatingNodeAffinity | Deschedule |
| Pods restart too frequently | RemovePodsHavingTooManyRestarts | Deschedule |
| Pods have exceeded their time-to-live | PodLifeTime | Deschedule |
| Pods are in the Failed state | RemoveFailedPod | Deschedule |
| Replicated pods are unevenly spread across nodes | RemoveDuplicates | Balance |
| Nodes are unevenly utilized by resource allocation | LowNodeUtilization | Balance |
| Pods violate topology spread constraints | RemovePodsViolatingTopologySpreadConstraint | Balance |
| Nodes are overloaded by actual resource utilization | LowNodeLoad | Balance |
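The policy type decides where a plug-in is enabled in the configuration: Deschedule plug-ins go under `plugins.deschedule` and Balance plug-ins under `plugins.balance`. The fragment below is a sketch of the `profiles` section only, following the structure of the full ConfigMap later in this topic. The two plug-ins are picked arbitrarily, and some plug-ins may additionally need arguments in `pluginConfig`.

```yaml
profiles:
- name: koord-descheduler
  plugins:
    deschedule:
      enabled:
      - name: RemovePodsViolatingInterPodAntiAffinity   # Policy type: Deschedule
    balance:
      enabled:
      - name: RemoveDuplicates                           # Policy type: Balance
```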
The examples below use RemovePodsViolatingNodeTaints. Read the descheduling concepts and Koordinator Descheduler vs. Kubernetes Descheduler before you start.
How it works
RemovePodsViolatingNodeTaints checks every node for NoSchedule taints at the configured interval. If a running pod lacks a toleration for a node's NoSchedule taint, the plug-in evicts it. The workload controller recreates the pod, and the scheduler places it on a node whose taints the pod tolerates.
Use excludedTaints to exempt specific taints. If a taint's key or key=value pair matches an entry in excludedTaints, the plug-in ignores it.
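For comparison, a pod that tolerates the taint is left in place regardless of `excludedTaints`. The pod spec below is a minimal sketch with a hypothetical name and a placeholder image; the toleration matches a `deschedule=allow:NoSchedule` taint.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-pod          # hypothetical name
spec:
  tolerations:
  - key: deschedule
    operator: Equal
    value: allow
    effect: NoSchedule        # tolerates deschedule=allow:NoSchedule
  containers:
  - name: app
    image: nginx:stable       # placeholder image
```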
Example scenario:
A three-node cluster runs a Deployment with one pod per node. An administrator adds NoSchedule taints to two nodes after deployment:
- Node A gets `deschedule=not-allow:NoSchedule`. Because `deschedule=not-allow` is in `excludedTaints`, this taint is ignored and the pod stays.
- Node B gets `deschedule=allow:NoSchedule`. This taint is not excluded, so the pod is evicted and rescheduled to Node C (which has no `NoSchedule` taint).
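Expressed as Node objects, the scenario looks roughly like the fragments below. The node names are hypothetical and only the relevant fields are shown; in practice the taints are usually added with `kubectl taint` rather than by editing the Node manifest.

```yaml
# Node A: the taint is listed in excludedTaints, so the pod stays.
apiVersion: v1
kind: Node
metadata:
  name: node-a                # hypothetical name
spec:
  taints:
  - key: deschedule
    value: not-allow
    effect: NoSchedule
---
# Node B: the taint is not excluded, so a pod without a matching toleration is evicted.
apiVersion: v1
kind: Node
metadata:
  name: node-b                # hypothetical name
spec:
  taints:
  - key: deschedule
    value: allow
    effect: NoSchedule
```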
Step 1: Install ack-koordinator and enable descheduling
If ack-koordinator is already installed, ensure the version is 1.2.0-ack.2 or later.
- Log on to the ACK console. In the left navigation pane, click Clusters.
- On the Clusters page, click the target cluster name. In the left navigation pane, click Add-ons.
- Find ack-koordinator and click Install.
- In the Install dialog box, select Enable Descheduler for ACK-Koordinator and complete the installation.
Koordinator Descheduler is deployed as a Deployment on cluster nodes.
Step 2: Enable the RemovePodsViolatingNodeTaints plug-in
Configure the plug-in
Create a file named koord-descheduler-config.yaml. This ConfigMap enables RemovePodsViolatingNodeTaints and excludes the deschedule=not-allow taint.
# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    deschedulingInterval: 120s # The interval at which the descheduler runs. Set to 120 seconds here.
    dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations.
    # The preceding configuration is the system configuration.
    profiles:
    - name: koord-descheduler
      plugins:
        deschedule:
          enabled:
          - name: RemovePodsViolatingNodeTaints # Enable the node taint verification plug-in.
      pluginConfig:
      - name: RemovePodsViolatingNodeTaints # Configure the node taint verification plug-in.
        args:
          excludedTaints:
          - deschedule=not-allow # Ignore the taint whose key is deschedule and whose value is not-allow.
      # Required for RemovePodsViolatingNodeTaints to take effect. Do not remove.
      - name: MigrationController # Configure the migration controller.
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly
RemovePodsViolatingNodeTaints parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| excludedTaints | list(string) | - | Taint keys or key=value pairs to ignore. Pods on nodes with these taints are not evicted. |
| includePreferNoSchedule | bool | false | When true, also checks taints with effect PreferNoSchedule, not just NoSchedule. |
| namespaces.include | list(string) | - | Restrict descheduling to specific namespaces. Mutually exclusive with namespaces.exclude. |
| namespaces.exclude | list(string) | - | Skip descheduling in specific namespaces. Mutually exclusive with namespaces.include. |
| labelSelector | map | - | Restrict descheduling to pods that match the specified labels. |
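As an illustration of how these parameters combine, the fragment below is a sketch of a `pluginConfig` entry that also checks `PreferNoSchedule` taints and skips one namespace. The namespace `kube-system` is only an example value; because `namespaces.include` and `namespaces.exclude` are mutually exclusive, only `exclude` is set. The MigrationController entry from the full ConfigMap is omitted here for brevity and should be kept in a real configuration.

```yaml
pluginConfig:
- name: RemovePodsViolatingNodeTaints
  args:
    includePreferNoSchedule: true   # also act on PreferNoSchedule taints
    excludedTaints:
    - deschedule=not-allow          # key=value form
    - reserved                      # key-only form
    namespaces:
      exclude:                      # do not set include at the same time
      - kube-system                 # example value
```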
Apply the configuration
- Apply the ConfigMap to the cluster:

      kubectl apply -f koord-descheduler-config.yaml

- Restart Koordinator Descheduler to load the new configuration:

      kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 0
      # Expected output:
      # deployment.apps/ack-koord-descheduler scaled
      kubectl -n kube-system scale deploy ack-koord-descheduler --replicas 1
      # Expected output:
      # deployment.apps/ack-koord-descheduler scaled
Configure advanced parameters
Configure global behavior and template settings with a ConfigMap.
Advanced configuration example
This ConfigMap uses DeschedulerConfiguration for global settings, enables RemovePodsViolatingNodeTaints as the descheduling policy, and uses MigrationController as the evictor.
# koord-descheduler-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: kube-system
data:
  koord-descheduler-config: |
    # Do not modify the following system configuration of koord-descheduler.
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    leaderElection:
      resourceLock: leases
      resourceName: koord-descheduler
      resourceNamespace: kube-system
    dryRun: false # The global read-only mode. After you enable this mode, koord-descheduler does not perform any operations.
    deschedulingInterval: 120s # The interval at which the descheduler runs. The interval is set to 120 seconds in this example.
    nodeSelector: # The nodes that are involved in descheduling. By default, all nodes are included.
      matchLabels:
        alibabacloud.com/nodepool-id: nodepool-1 # Configure it based on your requirements.
    maxNoOfPodsToEvictPerNode: 10 # The maximum number of pods that can be evicted from a node. The limit takes effect on a global scale. By default, no limit is configured.
    maxNoOfPodsToEvictPerNamespace: 10 # The maximum number of pods that can be evicted from a namespace. The limit takes effect on a global scale. By default, no limit is configured.
    # The preceding configuration is the system configuration.
    # The template list.
    profiles:
    - name: koord-descheduler # The name of the template.
      # Scope: apply this template only to the specified nodes.
      # Method 1: Select nodes in one node pool.
      nodeSelector:
        matchLabels:
          alibabacloud.com/nodepool-id: nodepool-1 # Configure it based on your requirements.
      # Method 2: Select nodes in multiple node pools.
      # nodeSelector:
      #   matchExpressions:
      #   - key: alibabacloud.com/nodepool-id
      #     operator: In
      #     values:
      #     - nodepool-1
      #     - nodepool-2
      plugins:
        deschedule: # All plug-ins are disabled by default. Specify the ones to enable.
          enabled:
          - name: RemovePodsViolatingNodeTaints # Enable the node taint verification plug-in.
        balance: # All plug-ins are disabled by default.
          disabled:
          - name: "*" # Disable all Balance plug-ins.
        evict:
          enabled:
          - name: MigrationController # MigrationController is enabled by default.
        filter:
          enabled:
          - name: MigrationController # Use MigrationController's filtering policy by default.
      pluginConfig:
      - name: RemovePodsViolatingNodeTaints
        args:
          excludedTaints:
          - deschedule=not-allow # Ignore the taint whose key is deschedule and whose value is not-allow.
          - reserved # Ignore taints whose key is reserved.
          includePreferNoSchedule: false # When false, only checks NoSchedule taints.
          namespaces:
            include: # Restrict descheduling to these namespaces.
            - "namespace1"
            - "namespace2"
            # exclude: # Alternatively, exclude these namespaces.
            # - "namespace1"
            # - "namespace2"
          labelSelector: # Only deschedule pods matching these labels.
            accelerator: nvidia-tesla-p100
      - name: MigrationController
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          defaultJobMode: EvictDirectly
          evictLocalStoragePods: false # When false, pods using emptyDir or hostPath are not descheduled.
          maxMigratingPerNode: 1 # Maximum pods migrated simultaneously on a node.
          maxMigratingPerNamespace: 1 # Maximum pods migrated simultaneously in a namespace.
          maxMigratingPerWorkload: 1 # Maximum pods migrated simultaneously in a workload.
          maxUnavailablePerWorkload: 2 # Maximum unavailable replicated pods allowed in a workload.
          objectLimiters:
            workload: # Throttle workload-level migration. Default: only 1 pod per workload within 5 minutes.
              duration: 5m
              maxMigrating: 1
          evictionPolicy: Eviction # Use the Eviction API by default.
System configurations
Configure global, system-level behavior in DeschedulerConfiguration.
| Parameter | Type | Valid value | Description | Example |
|---|---|---|---|---|
| dryRun | boolean | true / false (default: false) | Read-only mode. When enabled, no pods are migrated. | false |
| deschedulingInterval | time.Duration | >0s | How often the descheduler runs. | 120s |
| nodeSelector | Structure | - | Limit which nodes are eligible for descheduling. Accepts matchLabels (one node pool) or matchExpressions (multiple node pools). See Kubernetes labelSelector. | See example YAML above |
| maxNoOfPodsToEvictPerNode | int | ≥0 (default: 0) | Maximum pods evicted from a single node per descheduling cycle. 0 means no limit. | 10 |
| maxNoOfPodsToEvictPerNamespace | int | ≥0 (default: 0) | Maximum pods evicted from a single namespace per descheduling cycle. 0 means no limit. | 10 |
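When trying out a new policy, one cautious approach is to run with `dryRun: true` first and tighten the eviction caps before switching it off. The fragment below is a sketch of the system-level fields only, using the parameters from the table above; the values are examples, not recommendations. The `leaderElection` block and the `profiles` list from the full example are omitted for brevity and should be kept as they are.

```yaml
apiVersion: descheduler/v1alpha2
kind: DeschedulerConfiguration
dryRun: true                         # read-only mode: no pods are migrated
deschedulingInterval: 120s
maxNoOfPodsToEvictPerNode: 1         # at most one pod per node per cycle (example value)
maxNoOfPodsToEvictPerNamespace: 5    # at most five pods per namespace per cycle (example value)
```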
Template configurations
Each template (profiles) groups descheduling policies and evictors with the following fields:
- `name`: Template identifier.
- `plugins`: Enables or disables descheduling policies (`deschedule`, `balance`), evictors (`evict`), and pre-eviction filters (`filter`).
- `pluginConfig`: Per-plug-in arguments. Match the `name` field to the plug-in name and configure `args`. See Configure policy plug-ins and Configure evictor plug-ins.
- `nodeSelector`: Limits the template to specific nodes. If unset, applies to all nodes.
Template-level nodeSelector requires ack-koordinator v1.6.1-ack.1.16 or later.
`plugins` field reference:
| Field | Supported plug-ins | Description |
|---|---|---|
| deschedule | RemovePodsViolatingNodeTaints, RemovePodsViolatingInterPodAntiAffinity, RemovePodsViolatingNodeAffinity, RemovePodsHavingTooManyRestarts, PodLifeTime, RemoveFailedPod | All disabled by default. Specify plug-ins to enable. |
| balance | RemoveDuplicates, LowNodeUtilization, HighNodeUtilization, RemovePodsViolatingTopologySpreadConstraint, LowNodeLoad | All disabled by default. Specify plug-ins to enable. |
| evict | MigrationController, DefaultEvictor | The pod evictor. MigrationController is enabled by default. Do not enable multiple evict plug-ins simultaneously. |
| filter | MigrationController, DefaultEvictor | Pre-eviction filtering policy. MigrationController is enabled by default. Do not enable multiple filter plug-ins simultaneously. |
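Because only one evictor and one filter should be active at a time, switching to DefaultEvictor presumably means disabling MigrationController in the same fields. The fragment below is a sketch under that assumption: it reuses the `enabled` and `disabled` lists shown in the advanced example, and the exact `disabled` entries should be treated as an assumption rather than confirmed syntax.

```yaml
profiles:
- name: koord-descheduler
  plugins:
    evict:
      disabled:
      - name: MigrationController   # assumed: turn off the default evictor
      enabled:
      - name: DefaultEvictor
    filter:
      disabled:
      - name: MigrationController   # assumed: turn off the default filter
      enabled:
      - name: DefaultEvictor
```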
Configure policy plug-ins
Koordinator Descheduler supports six Deschedule plug-ins and four Balance plug-ins from Kubernetes Descheduler, listed in the following table. A fifth Balance plug-in, LowNodeLoad, is provided by Koordinator. See Work with load-aware hotspot descheduling.
| Policy type | Plug-in | Description |
|---|---|---|
| Deschedule | RemovePodsViolatingInterPodAntiAffinity | Evicts pods that violate inter-pod anti-affinity rules. |
| Deschedule | RemovePodsViolatingNodeAffinity | Evicts pods that no longer satisfy node affinity rules. |
| Deschedule | RemovePodsViolatingNodeTaints | Evicts pods that cannot tolerate node taints. |
| Deschedule | RemovePodsHavingTooManyRestarts | Evicts pods that restart too frequently. |
| Deschedule | PodLifeTime | Evicts pods whose TTL has expired. |
| Deschedule | RemoveFailedPod | Evicts pods in the Failed state. |
| Balance | RemoveDuplicates | Spreads replicated pods evenly across nodes. |
| Balance | LowNodeUtilization | Redistributes pods based on node resource allocation. |
| Balance | HighNodeUtilization | Consolidates pods from underutilized nodes to more utilized ones. |
| Balance | RemovePodsViolatingTopologySpreadConstraint | Evicts pods that violate topology spread constraints. |
Configure evictor plug-ins
Koordinator Descheduler supports two evictor plug-ins: DefaultEvictor and MigrationController.
MigrationController
MigrationController provides fine-grained eviction control and observability through migration jobs.
| Parameter | Type | Valid value | Description | Example |
|---|---|---|---|---|
| evictLocalStoragePods | boolean | true / false (default: false) | When false, pods using emptyDir or hostPath are not descheduled. | false |
| maxMigratingPerNode | int64 | ≥0 (default: 2) | Maximum pods migrated simultaneously on a node. 0 means no limit. | 2 |
| maxMigratingPerNamespace | int64 | ≥0 (default: 0) | Maximum pods migrated simultaneously in a namespace. 0 means no limit. | 1 |
| maxMigratingPerWorkload | intOrString | ≥0 (default: 10%) | Maximum pods or percentage migrated simultaneously in a workload. 0 means no limit. If a workload has only one pod, it is excluded from descheduling. | 1 or 10% |
| maxUnavailablePerWorkload | intOrString | ≥0 and < replica count (default: 10%) | Maximum unavailable replicated pods allowed in a workload. 0 means no limit. | 1 or 10% |
| objectLimiters.workload | Structure | Duration >0 (default: 5m); MaxMigrating ≥0 (default: 10%) | Throttles workload-level migration within a time window. Duration sets the window length. MaxMigrating sets the maximum pods migrated within that window. | duration: 5m, maxMigrating: 1 |
| evictionPolicy | string | Eviction (default), Delete, Soft | Controls how pods are evicted. Eviction: calls the Eviction API for graceful eviction. Delete: calls the Delete API. Soft: adds the scheduling.koordinator.sh/soft-eviction annotation for custom downstream handling. | Eviction |
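Because `maxMigratingPerWorkload`, `maxUnavailablePerWorkload`, and the `maxMigrating` field of `objectLimiters.workload` accept either an integer or a percentage, limits can scale with workload size. The fragment below is a sketch that spells out the defaults from the table explicitly; for a workload with 20 replicas, 10% corresponds to 2 pods.

```yaml
pluginConfig:
- name: MigrationController
  args:
    apiVersion: descheduler/v1alpha2
    kind: MigrationControllerArgs
    defaultJobMode: EvictDirectly
    maxMigratingPerNode: 2           # table default
    maxMigratingPerWorkload: 10%     # table default, as a percentage of replicas
    maxUnavailablePerWorkload: 10%   # table default
    objectLimiters:
      workload:
        duration: 5m                 # length of the throttling window
        maxMigrating: 10%            # migrations allowed per workload within the window
    evictionPolicy: Eviction         # table default: use the Eviction API
```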
DefaultEvictor
DefaultEvictor is the standard Kubernetes Descheduler evictor. See DefaultEvictor for configuration.
MigrationController vs. DefaultEvictor
| Capability | DefaultEvictor | MigrationController |
|---|---|---|
| Eviction methods | Eviction API only | Eviction API, Delete API, or Soft annotation |
| Per-node eviction limit | Supported | Supported |
| Per-namespace eviction limit | Supported | Supported |
| Per-workload eviction limit | Not supported | Supported |
| Per-workload unavailability limit | Not supported | Supported |
| Eviction throttling | Not supported | Time window-based throttling per workload |
| Eviction observability | Component logs only | Component logs and Kubernetes events with per-pod migration status |
Next steps
- Descheduling: concepts, features, and workflow.
- Using the community Kubernetes Descheduler? See Koordinator Descheduler and Kubernetes Descheduler for differences and migration.
- Configure load-aware descheduling with `LowNodeLoad`: Work with load-aware hotspot descheduling.
- Analyze cluster resource usage and get cost-saving recommendations: Cost Insight.
- Troubleshooting: Scheduling FAQs.
- Release notes and component overview: ack-koordinator (ack-slo-manager).