Configure elastic scaling for single-node/multi-node inference

When operating LLM inference services, you must cope with the highly dynamic load fluctuations that occur during model inference. This topic combines custom metrics exposed by the inference framework with the Kubernetes HPA (Horizontal Pod Autoscaler) mechanism to automatically and flexibly adjust the number of inference service pods, which effectively improves the quality and stability of the inference service.

Prerequisites

Configure metrics collection

LLM inference services differ significantly from traditional microservices: a single inference request takes much longer, and the resource bottleneck is usually GPU compute and GPU memory capacity. However, due to limitations in how GPU utilization and GPU memory usage are currently measured, these two metrics do not accurately reflect node load. Therefore, this topic uses the performance metrics exposed by the inference engine itself (such as request latency and queue depth) as the basis for scaling decisions.
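
vLLM and SGLang can expose these engine metrics in Prometheus text format under the /metrics path of the serving port (for SGLang, this typically requires starting the server with metrics enabled). As a quick sanity check, you can read the raw metrics directly from a running inference pod. The pod name vllm-inference-0 and port 8000 below are placeholders; substitute the values from your own deployment.

    # Forward the inference pod's serving port to your local machine.
    kubectl port-forward pod/vllm-inference-0 8000:8000

    # In another terminal, list the queue- and cache-related metrics used later in this topic.
    curl -s http://localhost:8000/metrics | grep -E "num_requests_waiting|num_requests_running|kv_cache_usage"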

Billing

After the LLM inference service is connected to Alibaba Cloud Managed Service for Prometheus, the relevant components automatically send monitoring metrics to the Alibaba Cloud Prometheus service. These metrics are treated as custom metrics.

Using custom metrics incurs additional fees. These fees vary with factors such as your cluster size, the number of applications, and the data volume. You can use usage query to monitor and manage your resource usage.

Step 1: Collect inference engine monitoring metrics

If you have already configured Prometheus monitoring for the inference service by following Configure monitoring for LLM inference services, skip this step.
  1. Create podmonitor.yaml.

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: llm-serving-podmonitor
      namespace: default
      annotations:
        arms.prometheus.io/discovery: "true"
        arms.prometheus.io/resource: "arms"
    spec:
      selector:
        matchExpressions:
        - key: alibabacloud.com/inference-workload
          operator: Exists
      namespaceSelector:
        any: true
      podMetricsEndpoints:
      - interval: 15s
        path: /metrics
        port: "http"
        relabelings:
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_name
          targetLabel: pod_name
        - action: replace
          sourceLabels:
          - __meta_kubernetes_namespace
          targetLabel: pod_namespace
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_rolebasedgroup_workloads_x_k8s_io_role
          regex: (.+)
          targetLabel: rbg_role
        # Allow to override workload-name with specific label
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_alibabacloud_com_inference_workload
          regex: (.+)
          targetLabel: workload_name
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_alibabacloud_com_inference_backend
          regex: (.+)
          targetLabel: backend
    
  2. Run the following command to create the PodMonitor.

    kubectl apply -f ./podmonitor.yaml
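
    (Optional) To confirm that the PodMonitor was created in the expected namespace, query it by the name and namespace used in the example manifest:

    kubectl get podmonitor llm-serving-podmonitor -n default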

Step 2: Modify the configuration of the ack-alibaba-cloud-metrics-adapter component

  1. Log on to the Container Service Management Console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Applications > Helm.

  3. On the Helm page, find ack-alibaba-cloud-metrics-adapter and click Update in the Actions column.

  4. In the Update Release panel, configure the following YAML and click OK. The metrics in the YAML are examples only; modify them based on your actual requirements.

    For the list of vLLM metrics, see vLLM Metrics. For the list of SGLang metrics, see SGLang Metrics. For the list of Dynamo metrics, see Dynamo Metrics.

    AlibabaCloudMetricsAdapter:
    
      prometheus:
        enabled: true    # Set to true to enable the overall Prometheus adapter feature.
        # Enter the endpoint of your Alibaba Cloud Managed Service for Prometheus instance.
        url: http://cn-beijing.arms.aliyuncs.com:9090/api/v1/prometheus/xxxx/xxxx/xxx/cn-beijing
        # If token authentication is enabled for Alibaba Cloud Prometheus, configure prometheusHeader Authorization.
    #    prometheusHeader:
    #    - Authorization: xxxxxxx
    
        adapter:
          rules:
            default: false    # Default metric collection rules. Keeping this false is recommended.
            custom:
    
            # ** Example 1: vLLM **
            # vllm:num_requests_waiting: number of queued requests.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_waiting"
            - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # vllm:num_requests_running: number of requests currently being processed.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_running"
            - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # vllm:kv_cache_usage_perc: KV cache utilization.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:kv_cache_usage_perc"
            - seriesQuery: 'vllm:kv_cache_usage_perc{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # ** Example 2: SGLang **
            # sglang:num_queue_reqs: number of queued requests.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:num_queue_reqs"
            - seriesQuery: 'sglang:num_queue_reqs{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
            # sglang:num_running_reqs: number of requests currently being processed.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:num_running_reqs"
            - seriesQuery: 'sglang:num_running_reqs{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
            # sglang:token_usage: token usage in the system, which reflects KV cache utilization.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:token_usage"
            - seriesQuery: 'sglang:token_usage{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # ** Example 3: Dynamo **
            # nv_llm_http_service_inflight_requests: number of requests currently being processed.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/nv_llm_http_service_inflight_requests"
            - seriesQuery: 'nv_llm_http_service_inflight_requests{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
    

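    After the component update takes effect, you can verify that the adapter serves the configured metrics through the custom metrics API. The check below assumes the adapter registers the standard v1beta1 custom metrics APIService and uses the vLLM queue-depth metric from the example rules; replace the namespace and metric name as needed.

    # Confirm that the custom metrics APIService registered by the adapter is available.
    kubectl get apiservice v1beta1.custom.metrics.k8s.io

    # Query one of the configured metrics.
    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_waiting"
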
Configure elastic scaling

The parameter values in the following scaling policies are for demonstration only. Configure them based on your actual business scenario, taking both resource costs and service SLOs into account.

  1. Create hpa.yaml. Sample YAML is shown below; choose the example that matches the inference framework you use.

    vLLM framework

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: vllm-inference # Replace with the name of your vLLM inference service.
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: vllm:num_requests_waiting
          target:
            type: AverageValue
            averageValue: 5
    

    SGLang framework

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: sgl-inference
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: sglang:num_queue_reqs
          target:
            type: AverageValue
            averageValue: 5
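
    Queue-depth metrics can fluctuate quickly, so abrupt scale-in is a common concern. As a minimal sketch (the window and policy values below are illustrative assumptions, not recommendations), either example can be extended with the optional autoscaling/v2 behavior field under spec to slow down scale-in:

      # Optional addition under spec in either HPA example above.
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # Wait for 5 minutes of low load before scaling in.
          policies:
          - type: Pods
            value: 1                        # Remove at most 1 replica per 60 seconds.
            periodSeconds: 60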

    Run the following command to create the HPA object.

    kubectl apply -f hpa.yaml
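
    You can check the current state of the HPA (observed metric value and replica counts) at any time:

    kubectl get hpa llm-inference-hpa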
  2. Use a benchmark tool to stress test the service.

    For details about the benchmark tools and how to use them, see vLLM Benchmark and SGLang Benchmark.
    1. Create the benchmark.yaml file based on the sample below.

      • image: the vLLM/SGLang container image used to deploy the inference service. Choose one of the following:

        • vLLM container image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0

        • SGLang container image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104

      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        labels:
          app: llm-benchmark
        name: llm-benchmark
      spec:
        selector:
          matchLabels:
            app: llm-benchmark
        template:
          metadata:
            labels:
              app: llm-benchmark
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            containers:
            - command:
              - sh
              - -c
              - sleep inf
              image: # Specify the vLLM/SGLang container image used to deploy the inference service.
              imagePullPolicy: IfNotPresent
              name: llm-benchmark
              resources:
                limits:
                  cpu: "8"
                  memory: 40Gi
                requests:
                  cpu: "8"
                  memory: 40Gi
              volumeMounts:
              - mountPath: /models/Qwen3-32B
                name: llm-model
            volumes:
            - name: llm-model
              persistentVolumeClaim:
                claimName: llm-model
    2. Run the following command to create the benchmark service instance.

      kubectl create -f benchmark.yaml
    3. After the instance is running, execute the following command in the instance to start the stress test:

      vLLM framework

      python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
              --model /models/Qwen3-32B \
              --host inference-service \
              --port 8000 \
              --dataset-name random \
              --random-input-len 1500 \
              --random-output-len 100 \
              --random-range-ratio 1 \
              --num-prompts 400 \
              --max-concurrency 20

      SGLang framework

      python3 -m sglang.bench_serving --backend sglang \
              --model /models/Qwen3-32B \
              --host inference-service \
              --port 8000 \
              --dataset-name random \
              --random-input-len 1500 \
              --random-output-len 100 \
              --random-range-ratio 1 \
              --num-prompts 400 \
              --max-concurrency 20

    During the stress test, open another terminal and run the following command to observe how the service scales.

    kubectl describe hpa llm-inference-hpa

    In the expected output, the Events field records a SuccessfulRescale event, indicating that the HPA has scaled the inference service from 1 replica to 3 based on the number of waiting requests. With an average of 11 waiting requests against a target of 5, the HPA computes ceil(1 × 11 / 5) = 3 desired replicas, which does not exceed maxReplicas.

    Name:                                   llm-inference-hpa
    Namespace:                              default
    Labels:                                 <none>
    Annotations:                            <none>
    CreationTimestamp:                      Fri, 25 Jul 2025 11:29:20 +0800
    Reference:                              StatefulSet/vllm-inference
    Metrics:                                ( current / target )
      "vllm:num_requests_waiting" on pods:  11 / 5
    Min replicas:                           1
    Max replicas:                           3
    StatefulSet pods:                       1 current / 3 desired
    Conditions:
      Type            Status  Reason              Message
      ----            ------  ------              -------
      AbleToScale     True    SucceededRescale    the HPA controller was able to update the target scale to 3
      ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric vllm:num_requests_waiting
      ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
    Events:
      Type    Reason             Age   From                       Message
      ----    ------             ----  ----                       -------
      Normal  SuccessfulRescale  1s    horizontal-pod-autoscaler  New size: 3; reason: pods metric vllm:num_requests_waiting above target
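
    When the stress test ends and the number of waiting requests falls back below the target, the HPA scales the service in again after the scale-down stabilization window (5 minutes by default, unless overridden by a behavior configuration). You can watch the replica count change over time with:

    kubectl get hpa llm-inference-hpa -w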