Configure elastic scaling for single-node/multi-node inference

When operating LLM inference services, you must cope with the highly dynamic load fluctuations that occur during model inference. This topic combines custom metrics exposed by the inference framework with the Kubernetes HPA (Horizontal Pod Autoscaler) mechanism to automatically and flexibly adjust the number of inference service pods, which effectively improves the quality and stability of the inference service.

Prerequisites

Configure metrics collection

LLM inference services differ significantly from traditional microservices: a single inference request takes much longer, and the resource bottleneck is usually GPU compute and GPU memory capacity. However, due to limitations in how GPU utilization and GPU memory usage are currently measured, these two metrics do not accurately reflect node load. Therefore, this topic uses the performance metrics exposed by the inference engine itself (such as request latency and queue depth) as the basis for scaling decisions.
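
vLLM and SGLang can expose these engine metrics in Prometheus text format under the /metrics path of the serving port (for SGLang, this typically requires starting the server with metrics enabled). As a quick sanity check, you can read the raw metrics directly from a running inference pod. The pod name vllm-inference-0 and port 8000 below are placeholders; substitute the values from your own deployment.

    # Forward the inference pod's serving port to your local machine.
    kubectl port-forward pod/vllm-inference-0 8000:8000

    # In another terminal, list the queue- and cache-related metrics used later in this topic.
    curl -s http://localhost:8000/metrics | grep -E "num_requests_waiting|num_requests_running|kv_cache_usage"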

Billing

After the LLM inference service is connected to Alibaba Cloud Managed Service for Prometheus, the relevant components automatically send monitoring metrics to the Alibaba Cloud Prometheus service. These metrics are treated as custom metrics.

Using custom metrics incurs additional fees. These fees vary with factors such as your cluster size, the number of applications, and the data volume. You can use usage query to monitor and manage your resource usage.

Step 1: Collect inference engine monitoring metrics

If you have already configured Prometheus monitoring for the inference service by following Configure monitoring for LLM inference services, skip this step.
  1. Create podmonitor.yaml.

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: llm-serving-podmonitor
      namespace: default
      annotations:
        arms.prometheus.io/discovery: "true"
        arms.prometheus.io/resource: "arms"
    spec:
      selector:
        matchExpressions:
        - key: alibabacloud.com/inference-workload
          operator: Exists
      namespaceSelector:
        any: true
      podMetricsEndpoints:
      - interval: 15s
        path: /metrics
        port: "http"
        relabelings:
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_name
          targetLabel: pod_name
        - action: replace
          sourceLabels:
          - __meta_kubernetes_namespace
          targetLabel: pod_namespace
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_rolebasedgroup_workloads_x_k8s_io_role
          regex: (.+)
          targetLabel: rbg_role
        # Allow to override workload-name with specific label
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_alibabacloud_com_inference_workload
          regex: (.+)
          targetLabel: workload_name
        - action: replace
          sourceLabels:
          - __meta_kubernetes_pod_label_alibabacloud_com_inference_backend
          regex: (.+)
          targetLabel: backend
    
  2. Run the following command to create the PodMonitor.

    kubectl apply -f ./podmonitor.yaml
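
    (Optional) To confirm that the PodMonitor was created in the expected namespace, query it by the name and namespace used in the example manifest:

    kubectl get podmonitor llm-serving-podmonitor -n default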

Step 2: Modify the configuration of the ack-alibaba-cloud-metrics-adapter component

  1. Log on to the Container Service Management Console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the target cluster. In the left-side navigation pane, choose Applications > Helm.

  3. On the Helm page, find ack-alibaba-cloud-metrics-adapter and click Update in the Actions column.

  4. In the Update Release panel, configure the following YAML and click OK. The metrics in the YAML are examples only; modify them based on your actual requirements.

    For the list of vLLM metrics, see vLLM Metrics. For the list of SGLang metrics, see SGLang Metrics. For the list of Dynamo metrics, see Dynamo Metrics.

    AlibabaCloudMetricsAdapter:
    
      prometheus:
        enabled: true    # Set to true to enable the overall Prometheus adapter feature.
        # Enter the endpoint of your Alibaba Cloud Managed Service for Prometheus instance.
        url: http://cn-beijing.arms.aliyuncs.com:9090/api/v1/prometheus/xxxx/xxxx/xxx/cn-beijing
        # If token authentication is enabled for Alibaba Cloud Prometheus, configure prometheusHeader Authorization.
    #    prometheusHeader:
    #    - Authorization: xxxxxxx
    
        adapter:
          rules:
            default: false    # Default metric collection rules. Keeping this false is recommended.
            custom:
    
            # ** Example 1: vLLM **
            # vllm:num_requests_waiting: number of queued requests.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_waiting"
            - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # vllm:num_requests_running: number of requests currently being processed.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_running"
            - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # vllm:kv_cache_usage_perc: KV cache utilization.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:kv_cache_usage_perc"
            - seriesQuery: 'vllm:kv_cache_usage_perc{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # ** Example 2: SGLang **
            # sglang:num_queue_reqs: number of queued requests.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:num_queue_reqs"
            - seriesQuery: 'sglang:num_queue_reqs{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
            # sglang:num_running_reqs: number of requests currently being processed.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:num_running_reqs"
            - seriesQuery: 'sglang:num_running_reqs{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
            # sglang:token_usage: token usage in the system, which reflects KV cache utilization.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/sglang:token_usage"
            - seriesQuery: 'sglang:token_usage{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
            # ** Example 3: Dynamo **
            # nv_llm_http_service_inflight_requests: number of requests currently being processed.
            # Run the following command to confirm that the metric is collected:
            # kubectl get --raw  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/nv_llm_http_service_inflight_requests"
            - seriesQuery: 'nv_llm_http_service_inflight_requests{namespace!="",pod!=""}'
              resources:
                overrides:
                  namespace: { resource: "namespace" }
                  pod: { resource: "pod" }
              metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    
    

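    After the component update takes effect, you can verify that the adapter serves the configured metrics through the custom metrics API. The check below assumes the adapter registers the standard v1beta1 custom metrics APIService and uses the vLLM queue-depth metric from the example rules; replace the namespace and metric name as needed.

    # Confirm that the custom metrics APIService registered by the adapter is available.
    kubectl get apiservice v1beta1.custom.metrics.k8s.io

    # Query one of the configured metrics.
    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/vllm:num_requests_waiting"
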
Configure elastic scaling

The parameter values in the following scaling policies are for demonstration only. Configure them based on your actual business scenario, taking both resource costs and service SLOs into account.

  1. Create hpa.yaml. Sample YAML is shown below; choose the example that matches the inference framework you use.

    vLLM framework

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: vllm-inference # Replace with the name of your vLLM inference service.
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: vllm:num_requests_waiting
          target:
            type: AverageValue
            averageValue: 5
    

    SGLang framework

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: llm-inference-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: StatefulSet
        name: sgl-inference
      minReplicas: 1
      maxReplicas: 3
      metrics:
      - type: Pods
        pods:
          metric:
            name: sglang:num_queue_reqs
          target:
            type: AverageValue
            averageValue: 5
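
    Queue-depth metrics can fluctuate quickly, so abrupt scale-in is a common concern. As a minimal sketch (the window and policy values below are illustrative assumptions, not recommendations), either example can be extended with the optional autoscaling/v2 behavior field under spec to slow down scale-in:

      # Optional addition under spec in either HPA example above.
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300   # Wait for 5 minutes of low load before scaling in.
          policies:
          - type: Pods
            value: 1                        # Remove at most 1 replica per 60 seconds.
            periodSeconds: 60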

    Run the following command to create the HPA object.

    kubectl apply -f hpa.yaml
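
    You can check the current state of the HPA (observed metric value and replica counts) at any time:

    kubectl get hpa llm-inference-hpa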
  2. Use a benchmark tool to stress test the service.

    For details about the benchmark tools and how to use them, see vLLM Benchmark and SGLang Benchmark.
    1. Create the benchmark.yaml file based on the sample below.

      • image: the vLLM/SGLang container image used to deploy the inference service. Choose one of the following:

        • vLLM container image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0

        • SGLang container image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/anolis-docker-images/docker-temp:0.3.4.post2-sglang0.4.10.post2-pytorch2.7.1.8-cuda12.8.1-py312-alinux3.2104

      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        labels:
          app: llm-benchmark
        name: llm-benchmark
      spec:
        selector:
          matchLabels:
            app: llm-benchmark
        template:
          metadata:
            labels:
              app: llm-benchmark
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            containers:
            - command:
              - sh
              - -c
              - sleep inf
              image: # Specify the vLLM/SGLang container image used to deploy the inference service.
              imagePullPolicy: IfNotPresent
              name: llm-benchmark
              resources:
                limits:
                  cpu: "8"
                  memory: 40Gi
                requests:
                  cpu: "8"
                  memory: 40Gi
              volumeMounts:
              - mountPath: /models/Qwen3-32B
                name: llm-model
            volumes:
            - name: llm-model
              persistentVolumeClaim:
                claimName: llm-model
    2. Run the following command to create the benchmark service instance.

      kubectl create -f benchmark.yaml
    3. After the instance is running, execute the following command in the instance to start the stress test:

      vLLM framework

      python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
              --model /models/Qwen3-32B \
              --host inference-service \
              --port 8000 \
              --dataset-name random \
              --random-input-len 1500 \
              --random-output-len 100 \
              --random-range-ratio 1 \
              --num-prompts 400 \
              --max-concurrency 20

      SGLang framework

      python3 -m sglang.bench_serving --backend sglang \
              --model /models/Qwen3-32B \
              --host inference-service \
              --port 8000 \
              --dataset-name random \
              --random-input-len 1500 \
              --random-output-len 100 \
              --random-range-ratio 1 \
              --num-prompts 400 \
              --max-concurrency 20

    During the stress test, open another terminal and run the following command to observe how the service scales.

    kubectl describe hpa llm-inference-hpa

    In the expected output, the Events field records a SuccessfulRescale event, indicating that the HPA has scaled the inference service from 1 replica to 3 based on the number of waiting requests. With an average of 11 waiting requests against a target of 5, the HPA computes ceil(1 × 11 / 5) = 3 desired replicas, which does not exceed maxReplicas.

    Name:                                   llm-inference-hpa
    Namespace:                              default
    Labels:                                 <none>
    Annotations:                            <none>
    CreationTimestamp:                      Fri, 25 Jul 2025 11:29:20 +0800
    Reference:                              StatefulSet/vllm-inference
    Metrics:                                ( current / target )
      "vllm:num_requests_waiting" on pods:  11 / 5
    Min replicas:                           1
    Max replicas:                           3
    StatefulSet pods:                       1 current / 3 desired
    Conditions:
      Type            Status  Reason              Message
      ----            ------  ------              -------
      AbleToScale     True    SucceededRescale    the HPA controller was able to update the target scale to 3
      ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric vllm:num_requests_waiting
      ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
    Events:
      Type    Reason             Age   From                       Message
      ----    ------             ----  ----                       -------
      Normal  SuccessfulRescale  1s    horizontal-pod-autoscaler  New size: 3; reason: pods metric vllm:num_requests_waiting above target
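
    When the stress test ends and the number of waiting requests falls back below the target, the HPA scales the service in again after the scale-down stabilization window (5 minutes by default, unless overridden by a behavior configuration). You can watch the replica count change over time with:

    kubectl get hpa llm-inference-hpa -w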