Enable container memory QoS

Use ack-koordinator to prioritize memory for latency-sensitive containers and reduce OOM errors during contention.

Note: Before using memory QoS, read "Pod Quality of Service Classes" and "Assign Memory Resources to Containers and Pods" in the Kubernetes documentation.

How it works

Kubernetes assigns each pod a memory request and limit. Two failure modes occur under memory pressure:

  • Container-level pressure: When a container's memory usage (including page cache) nears its limit, the OS triggers memcg-level direct reclamation, blocking processes. If allocation outpaces reclamation, an OOM error terminates the pod.

  • Node-level pressure: When total container memory limits exceed node physical memory, the kernel reclaims memory across containers indiscriminately, degrading performance and potentially triggering OOM errors.

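ack-koordinator addresses both by configuring the memory control group (memcg) per container, using three Alibaba Cloud Linux (Alinux) kernel features:

  • Memcg QoS: protects a floor of container memory from reclamation (memory.min, memory.low) and throttles usage as it approaches the limit (memory.high).

  • Memcg backend asynchronous reclamation: reclaims memory in the background once usage crosses a watermark (memory.wmark_high), before direct reclamation blocks the process.

  • Memcg global minimum watermark rating: adjusts the global minimum watermark per container (memory.wmark_min_adj), so that reclamation is postponed for latency-sensitive containers and expedited for best-effort ones.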

This produces fairer memory distribution and lower latency during overcommitment.

Advantages over open-source Kubernetes memory QoS

The upstream Kubernetes memory QoS feature (Kubernetes 1.22+) supports only cgroup v2, requires manual kubelet configuration, and lacks per-pod or per-namespace granularity.

ack-koordinator improves on the upstream implementation in two ways:

  • Broader kernel compatibility: Supports cgroup v1 and v2, backed by Alinux kernel features such as memcg backend asynchronous reclamation and minimum watermark rating. See Overview of kernel features and interfaces.

  • Fine-grained configuration: Configure memory QoS per pod, namespace, or cluster through pod annotations or ConfigMaps.

Configuration mechanism

ack-koordinator uses four cgroup parameters for memory QoS. Each maps to configuration options in Advanced parameters:

| cgroup parameter | Controls | Configured by |
| --- | --- | --- |
| memory.limit_in_bytes | Hard upper limit for the container | Kubernetes (from limits.memory) |
| memory.high | Throttling threshold (reclamation starts here) | throttlingPercent |
| memory.wmark_high | Async reclamation trigger | wmarkRatio |
| memory.min | Unreclaimable memory floor | minLimitPercent / lowLimitPercent |

Configuration priority

When multiple configuration sources apply to the same pod, ack-koordinator uses the following priority order (highest first):

  1. Pod annotation (koordinator.sh/memoryQOS)

  2. Namespace-level ConfigMap (ack-slo-pod-config)

  3. Cluster-level ConfigMap (ack-slo-config)
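
The ordering above can be sketched as a small lookup. This is a simplified model for reasoning about the outcome, not ack-koordinator's implementation: the function name and its three inputs are hypothetical, and it assumes that a pod with policy "default" or no annotation falls through to the namespace-level setting and then to the cluster-level one.

```python
from typing import Optional


def memory_qos_enabled(pod_policy: Optional[str],
                       namespace_enabled: Optional[bool],
                       cluster_enabled: Optional[bool]) -> bool:
    """Return whether memory QoS ends up enabled for one pod.

    pod_policy:        "policy" from koordinator.sh/memoryQOS, or None if unset
    namespace_enabled: True/False if ack-slo-pod-config lists the namespace, else None
    cluster_enabled:   "enable" from ack-slo-config for the pod's QoS class, or None
    """
    # 1. Pod annotation has the highest priority: "auto" enables, "none" disables.
    #    "default" (or no annotation) defers to the ConfigMaps.
    if pod_policy == "auto":
        return True
    if pod_policy == "none":
        return False
    # 2. Namespace-level ConfigMap.
    if namespace_enabled is not None:
        return namespace_enabled
    # 3. Cluster-level ConfigMap; treated as disabled when nothing is configured (assumption).
    return bool(cluster_enabled)
```

For example, `memory_qos_enabled("none", True, True)` is `False`, because the pod annotation overrides both ConfigMaps.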

QoS class mapping

If a pod lacks the koordinator.sh/qosClass label, ack-koordinator maps Kubernetes QoS classes automatically:

| Kubernetes QoS class | koordinator QoS class |
| --- | --- |
| Guaranteed | Default memory QoS settings |
| Burstable | LS (latency-sensitive) |
| BestEffort | BE (best-effort) |
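
A minimal sketch of this fallback, assuming the pod's labels and its Kubernetes QoS class are already known. The function name is hypothetical, and returning `None` for Guaranteed pods is only this sketch's way of saying "default memory QoS settings apply".

```python
from typing import Dict, Optional

QOS_LABEL = "koordinator.sh/qosClass"

# Fallback mapping used only when the label is absent.
K8S_TO_KOORDINATOR = {"Burstable": "LS", "BestEffort": "BE"}


def koordinator_qos_class(labels: Dict[str, str], k8s_qos_class: str) -> Optional[str]:
    """Return the koordinator QoS class, or None when default settings apply."""
    explicit = labels.get(QOS_LABEL)
    if explicit:
        # An explicit label always wins over the automatic mapping.
        return explicit
    return K8S_TO_KOORDINATOR.get(k8s_qos_class)
```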

Prerequisites

Before you begin, make sure you have:

Enable memory QoS for a specific pod

Add the following annotation to the pod spec:

annotations:
  # Enable memory QoS with recommended settings
  koordinator.sh/memoryQOS: '{"policy": "auto"}'
  # Disable memory QoS
  # koordinator.sh/memoryQOS: '{"policy": "none"}'
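
The annotation value is a JSON string, so quoting mistakes are easy to make when it is written by hand. The following sketch builds the value with `json.dumps` instead. The helper name is hypothetical, and passing advanced parameters alongside `policy` in the same JSON object is an assumption based on the Advanced parameters section.

```python
import json
from typing import Any, Dict

ANNOTATION_KEY = "koordinator.sh/memoryQOS"
POLICIES = {"auto", "default", "none"}


def memory_qos_annotation(policy: str = "auto", **params: Any) -> Dict[str, str]:
    """Build the annotation entry for a pod's metadata.annotations."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    # Extra keyword arguments (for example wmarkRatio=95) are serialized as-is.
    value = {"policy": policy, **params}
    return {ANNOTATION_KEY: json.dumps(value)}


print(memory_qos_annotation())
# {'koordinator.sh/memoryQOS': '{"policy": "auto"}'}
```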

Enable memory QoS for a cluster

Use the ack-slo-config ConfigMap to apply memory QoS to all pods in the cluster.

  1. Create configmap.yaml with the following content:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ack-slo-config
      namespace: kube-system
    data:
      resource-qos-config: |-
        {
          "clusterStrategy": {
            "lsClass": {
              "memoryQOS": {
                "enable": true
              }
            },
            "beClass": {
              "memoryQOS": {
                "enable": true
              }
            }
          }
        }
  2. Set each pod's QoS class with the koordinator.sh/qosClass label:

    apiVersion: v1
    kind: Pod
    metadata:
      name: pod-demo
      labels:
        koordinator.sh/qosClass: 'LS'
  3. Apply the ConfigMap:

    • If ack-slo-config already exists in kube-system, patch it so that other settings are preserved:

      kubectl patch cm -n kube-system ack-slo-config --patch "$(cat configmap.yaml)"

    • If it does not exist, create it:

      kubectl apply -f configmap.yaml

  4. (Optional) Configure advanced parameters.
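
Because resource-qos-config is JSON embedded in YAML, a typo can go unnoticed until the setting silently fails to apply. The sketch below is a hypothetical pre-flight check (not part of ack-koordinator) that parses the JSON string and lists the QoS classes for which memory QoS is enabled.

```python
import json
from typing import List


def classes_with_memory_qos(resource_qos_config: str) -> List[str]:
    """Return the clusterStrategy classes whose memoryQOS.enable is true."""
    # json.loads raises ValueError (json.JSONDecodeError) on malformed JSON.
    config = json.loads(resource_qos_config)
    strategy = config.get("clusterStrategy", {})
    enabled = []
    for qos_class, settings in strategy.items():
        # Require a real boolean true; strings such as "true" do not count.
        if settings.get("memoryQOS", {}).get("enable") is True:
            enabled.append(qos_class)
    return enabled
```

Running it on the JSON from step 1 returns `["lsClass", "beClass"]`.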

Enable memory QoS for a namespace

Use the ack-slo-pod-config ConfigMap to enable or disable memory QoS for pods in specific namespaces.

  1. Create ack-slo-pod-config.yaml with the following content:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ack-slo-pod-config
      namespace: kube-system
    data:
      memory-qos: |
        {
          "enabledNamespaces": ["allow-ns"],
          "disabledNamespaces": ["block-ns"]
        }

    Replace allow-ns and block-ns with the actual namespace names.

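  2. Apply the ConfigMap:

    • If ack-slo-pod-config already exists in kube-system, patch it so that other settings are preserved:

      kubectl patch cm -n kube-system ack-slo-pod-config --patch "$(cat ack-slo-pod-config.yaml)"

    • If it does not exist, create it:

      kubectl apply -f ack-slo-pod-config.yaml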
  3. (Optional) Configure advanced parameters.
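
A small sketch for checking the memory-qos value before applying it. The function is hypothetical. Because the source does not say how a namespace listed in both arrays is handled, the sketch treats that case as a configuration error rather than guessing.

```python
import json
from typing import Optional


def namespace_setting(memory_qos: str, namespace: str) -> Optional[bool]:
    """Return True/False if the namespace is listed, or None if it is not mentioned."""
    config = json.loads(memory_qos)
    enabled = set(config.get("enabledNamespaces", []))
    disabled = set(config.get("disabledNamespaces", []))
    overlap = enabled & disabled
    if overlap:
        # Ambiguous input: refuse it instead of assuming which list wins.
        raise ValueError(f"listed as both enabled and disabled: {sorted(overlap)}")
    if namespace in enabled:
        return True
    if namespace in disabled:
        return False
    return None
```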

Example: Redis under memory overcommitment

This example shows how memory QoS reduces Redis latency and increases throughput under memory overcommitment. The test uses:

  • An ACK Pro cluster with two nodes (8 vCPUs, 32 GB each)

  • One node for Redis, the other for the stress test

Run the test

  1. Create redis-demo.yaml:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: redis-demo-config
    data:
      redis-config: |
        appendonly yes
        appendfsync no
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: redis-demo
      labels:
        name: redis-demo  # Matched by the selector of the redis-demo Service below.
        koordinator.sh/qosClass: 'LS'
      annotations:
        koordinator.sh/memoryQOS: '{"policy": "auto"}'
    spec:
      containers:
      - name: redis
        image: redis:5.0.4
        command:
          - redis-server
          - "/redis-master/redis.conf"
        env:
        - name: MASTER
          value: "true"
        ports:
        - containerPort: 6379
        resources:
          limits:
            cpu: "2"
            memory: "6Gi"
          requests:
            cpu: "2"
            memory: "2Gi"
        volumeMounts:
        - mountPath: /redis-master-data
          name: data
        - mountPath: /redis-master
          name: config
      volumes:
        - name: data
          emptyDir: {}
        - name: config
          configMap:
            name: redis-demo-config
            items:
            - key: redis-config
              path: redis.conf
      nodeName: # Set to the name of the node running Redis.
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: redis-demo
    spec:
      ports:
      - name: redis-port
        port: 6379
        protocol: TCP
        targetPort: 6379
      selector:
        name: redis-demo
      type: ClusterIP
  2. Deploy Redis:

    kubectl apply -f redis-demo.yaml
  3. Simulate memory overcommitment with the Stress tool. Create stress-demo.yaml:

    apiVersion: v1
    kind: Pod
    metadata:
      name: stress-demo
      labels:
        koordinator.sh/qosClass: 'BE'
      annotations:
        koordinator.sh/memoryQOS: '{"policy": "auto"}'
    spec:
      containers:
        - args:
            - '--vm'
            - '2'
            - '--vm-bytes'
            - 11G
            - '-c'
            - '2'
            - '--vm-hang'
            - '2'
          command:
            - stress
          image: polinux/stress
          imagePullPolicy: Always
          name: stress
      restartPolicy: Always
      nodeName: # Set to the same node as redis-demo.
  4. Deploy the stress workload:

    kubectl apply -f stress-demo.yaml
  5. Verify the global minimum watermark before running the benchmark.

    Important

    In memory overcommitment scenarios, a low global minimum watermark causes the OOM killer to run before memory reclamation. For a 32 GB node, set this value to at least 4,000,000 KB.

    cat /proc/sys/vm/min_free_kbytes

    Expected output:

    4000000
  6. Deploy memtier-benchmark to send requests to the Redis pod. Save the following manifest to a file and apply it with kubectl apply -f:

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        name: memtier-demo
      name: memtier-demo
    spec:
      containers:
        - command:
            - memtier_benchmark
            - '-s'
            - 'redis-demo'
            - '--data-size'
            - '200000'
            - "--ratio"
            - "1:4"
          image: 'redislabs/memtier_benchmark:1.3.0'
          name: memtier
      restartPolicy: Never
      nodeName: # Set to the name of the node sending requests.
  7. Check the benchmark results:

    kubectl logs -f memtier-demo
  8. To compare, disable memory QoS on both pods and repeat the test:

    apiVersion: v1
    kind: Pod
    metadata:
      name: redis-demo
      labels:
        koordinator.sh/qosClass: 'LS'
      annotations:
        koordinator.sh/memoryQOS: '{"policy": "none"}'
    spec:
      ...
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: stress-demo
      labels:
        koordinator.sh/qosClass: 'BE'
      annotations:
        koordinator.sh/memoryQOS: '{"policy": "none"}'
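
Step 5 checks the global minimum watermark by hand. When the test is repeated, the same check can be scripted on the node. This is a sketch only: the function names are hypothetical, and the 4,000,000 KB threshold is the value the note in step 5 gives for a 32 GB node.

```python
from pathlib import Path

REQUIRED_KB = 4_000_000  # Threshold from step 5 for a 32 GB node.


def min_free_kbytes_ok(raw: str, required_kb: int = REQUIRED_KB) -> bool:
    """Check the text read from min_free_kbytes against the required minimum."""
    return int(raw.strip()) >= required_kb


def check_node(path: str = "/proc/sys/vm/min_free_kbytes") -> bool:
    """Read the live value on a Linux node and check it."""
    return min_free_kbytes_ok(Path(path).read_text())
```

Run `check_node()` on the node that hosts redis-demo before starting memtier-benchmark.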

Test results

Important

Results are for reference only. Actual values depend on cluster configuration and workload.

| Metric | Memory QoS disabled | Memory QoS enabled |
| --- | --- | --- |
| Latency-avg | 51.32 ms | 47.25 ms |
| Throughput-avg | 149.0 MB/s | 161.9 MB/s |

Enabling memory QoS reduced Redis latency by 7.9% and increased throughput by 8.7% under memory overcommitment.
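
The percentages follow directly from the table values. A short check of the arithmetic, using the disabled run as the baseline:

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


latency = percent_change(51.32, 47.25)      # ms, disabled -> enabled
throughput = percent_change(149.0, 161.9)   # MB/s, disabled -> enabled

print(f"latency: {latency:+.1f}%")        # latency: -7.9%
print(f"throughput: {throughput:+.1f}%")  # throughput: +8.7%
```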

Advanced parameters

Configure these parameters in pod annotations or the ack-slo-config ConfigMap. Pod annotations take precedence.

Note: The Pod annotation and ConfigMap columns indicate whether each parameter can be configured through pod annotations and through the ConfigMap.

<table>
<thead>
<tr> <td><p><b>Parameter</b></p></td> <td><p><b>Type</b></p></td> <td><p><b>Value range</b></p></td> <td><p><b>Description</b></p></td> <td><p><b>Pod annotation</b></p></td> <td><p><b>ConfigMap</b></p></td> </tr>
</thead>
<tbody>
<tr> <td><p><code>enable</code></p></td> <td><p>Boolean</p></td> <td> <ul> <li><p><code>true</code></p></li> <li><p><code>false</code></p></li> </ul></td> <td> <ul> <li><p><code>true</code>: enables memory QoS for all containers with the recommended memcg settings.</p></li> <li><p><code>false</code>: disables memory QoS for all containers and restores the default memcg settings.</p></li> </ul></td> <td><p><img></p></td> <td><p><img></p></td> </tr>
<tr> <td><p><code>policy</code></p></td> <td><p>String</p></td> <td> <ul> <li><p><code>auto</code></p></li> <li><p><code>default</code></p></li> <li><p><code>none</code></p></li> </ul></td> <td> <ul> <li><p><code>auto</code>: enables memory QoS with the recommended settings, overriding the ack-slo-pod-config ConfigMap.</p></li> <li><p><code>default</code>: inherits settings from the ack-slo-pod-config ConfigMap.</p></li> <li><p><code>none</code>: disables memory QoS and restores the default memcg settings, overriding the ack-slo-pod-config ConfigMap.</p></li> </ul></td> <td><p><img></p></td> <td><p><img></p></td> </tr>
<tr> <td><p><code>minLimitPercent</code></p></td> <td><p>Int</p></td> <td><p>0~100</p></td> <td><p>Unit: %. Default value: <code>0</code> (disabled).</p><p>Unreclaimable proportion of the pod memory request. Use this to keep cached files in memory for page-cache-sensitive applications. See <a href="https://www.alibabacloud.com/help/en/document_detail/169536.html#concept-2482889">Memcg QoS feature of the cgroup v1 interface</a>.</p><p>Formula: <code>memory.min = memory request × minLimitPercent / 100</code>. For example, with a memory request of 100 MiB and <code>minLimitPercent=100</code>, <code>memory.min</code> is <code>104857600</code> (100 MiB).</p></td> <td><p><img></p></td> <td><p><img></p></td> </tr>
<tr> <td><p><code>lowLimitPercent</code></p></td> <td><p>Int</p></td> <td><p>0~100</p></td> <td><p>Unit: %. Default value: <code>0</code> (disabled).</p><p>Relatively unreclaimable proportion of the pod memory request. See <a href="https://www.alibabacloud.com/help/en/document_detail/169536.html#concept-2482889">Memcg QoS feature of the cgroup v1 interface</a>.</p><p>Formula: <code>memory.low = memory request × lowLimitPercent / 100</code>. For example, with a memory request of 100 MiB and <code>lowLimitPercent=100</code>, <code>memory.low</code> is <code>104857600</code> (100 MiB).</p></td> <td><p><img></p></td> <td><p><img></p></td> </tr>
<tr> <td><p><code>throttlingPercent</code></p></td> <td><p>Int</p></td> <td><p>0~100</p></td> <td><p>Unit: %. Default value: <code>0</code> (disabled).</p><p>Memory throttling threshold, expressed as a ratio of container memory usage to the memory limit. When usage exceeds the threshold, memory is reclaimed. This prevents cgroup-level OOM in overcommitment scenarios. See <a href="https://www.alibabacloud.com/help/en/document_detail/169536.html#concept-2482889">Memcg QoS feature of the cgroup v1 interface</a>.</p><p>Formula: <code>memory.high = memory limit × throttlingPercent / 100</code>. For example, with a memory limit of 100 MiB and <code>throttlingPercent=80</code>, <code>memory.high</code> is <code>83886080</code> (80 MiB).</p></td> <td><p><img></p></td> <td><p><img></p></td> </tr>
<tr> <td><p><code>wmarkRatio</code></p></td> <td><p>Int</p></td> <td><p>0~100</p></td> <td><p>Unit: %. Default value: <code>95</code>. <code>0</code> disables this parameter.</p><p>Asynchronous reclamation threshold, expressed as a ratio of usage to the memory limit or, if throttling is enabled, of usage to <code>memory.high</code>. When usage exceeds the threshold, memcg backend asynchronous reclamation is triggered. See <a href="https://www.alibabacloud.com/help/en/document_detail/169535.html#task-2487938">Memcg backend asynchronous reclaim</a>.</p><p>If <code>throttlingPercent</code> is disabled, the formula is <code>memory.wmark_high = memory limit × wmarkRatio / 100</code>. If <code>throttlingPercent</code> is enabled, the formula is <code>memory.wmark_high = memory.high × wmarkRatio / 100</code>. For example, with a memory limit of 100 MiB, <code>wmarkRatio=95</code>, and <code>throttlingPercent=80</code>: <code>memory.high</code> is <code>83886080</code> (80 MiB), <code>memory.wmark_ratio</code> is <code>95</code>, and <code>memory.wmark_high</code> is <code>79691776</code> (76 MiB).</p></td> <td><p><img></p></td> <td><p><img></p></td> </tr>
<tr> <td><p><code>wmarkMinAdj</code></p></td> <td><p>Int</p></td> <td><p>-25~50</p></td> <td><p>Unit: %. Default value: <code>-25</code> for the <code>LS</code> QoS class and <code>50</code> for the <code>BE</code> QoS class. <code>0</code> disables this parameter.</p><p>Adjusts the global minimum watermark per container. Negative values postpone reclamation; positive values expedite it. See <a href="https://www.alibabacloud.com/help/en/document_detail/169537.html#task-2492619">Memcg global minimum watermark rating</a>.</p><p>For example, an LS pod defaults to <code>memory.wmark_min_adj=-25</code>, which decreases the minimum watermark for its containers by 25%.</p></td> <td><p><img></p></td> <td><p><img></p></td> </tr>
</tbody>
</table>
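
The formulas in the table can be combined into one calculation. The sketch below is illustrative only: the function name is hypothetical, `None` is merely this sketch's marker for a disabled parameter (the table does not state what the kernel file holds in that case), and integer arithmetic is used so that the results match the byte values quoted in the table.

```python
from typing import Dict, Optional

MIB = 1024 * 1024


def memcg_values(request: int, limit: int, *,
                 min_limit_percent: int = 0,
                 low_limit_percent: int = 0,
                 throttling_percent: int = 0,
                 wmark_ratio: int = 95) -> Dict[str, Optional[int]]:
    """Compute memcg values in bytes from a container's memory request and limit."""

    def share(base: int, percent: int) -> Optional[int]:
        # A percentage of 0 means the parameter is disabled.
        return base * percent // 100 if percent else None

    memory_high = share(limit, throttling_percent)
    # wmark_high is relative to memory.high when throttling is enabled,
    # otherwise relative to the memory limit.
    wmark_base = memory_high if memory_high is not None else limit
    return {
        "memory.min": share(request, min_limit_percent),
        "memory.low": share(request, low_limit_percent),
        "memory.high": memory_high,
        "memory.wmark_high": share(wmark_base, wmark_ratio),
    }
```

With a 100 MiB request and limit, `throttling_percent=80`, and the default `wmark_ratio=95`, this gives `memory.high = 83886080` and `memory.wmark_high = 79691776`, matching the examples in the table.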

FAQ

Is the memory QoS configuration from ack-slo-manager still valid after upgrading to ack-koordinator?

Yes. ack-koordinator is backward compatible with the annotation-based protocol used in ack-slo-manager 0.8.0 and earlier:

  • alibabacloud.com/qosClass — sets the QoS class

  • alibabacloud.com/memoryQOS — configures memory QoS

The following table shows which protocols each version supports:

| Component version | alibabacloud.com protocol | koordinator.sh protocol |
| --- | --- | --- |
| ≥ 0.3.0 and < 0.8.0 | ✓ | × |
| ≥ 0.8.0 | ✓ | ✓ |

Support for the alibabacloud.com protocol ended on July 30, 2023. Migrate to the koordinator.sh protocol.
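
Before migrating, it helps to know which workloads still carry the legacy keys. The sketch below only detects them in a pod's metadata. It makes no assumption about how values should be converted, and the function name is hypothetical.

```python
from typing import Dict, List

LEGACY_KEYS = ("alibabacloud.com/qosClass", "alibabacloud.com/memoryQOS")


def legacy_keys_in_use(labels: Dict[str, str], annotations: Dict[str, str]) -> List[str]:
    """Return the legacy protocol keys found in a pod's labels or annotations."""
    present = set(labels) | set(annotations)
    return [key for key in LEGACY_KEYS if key in present]
```

An empty list means the pod already relies only on the koordinator.sh protocol, or sets no memory QoS options at all.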

Billing

No fee is charged for installing or using the ack-koordinator component. However, costs may apply in the following cases:

Next steps