Configure an autoscaling policy for a PD disaggregated inference service

In a prefill-decode (PD) disaggregated LLM inference architecture, the prefill and decode phases have very different resource requirements, so traditional CPU/GPU utilization metrics cannot effectively drive autoscaling. Using the Dynamo framework as an example, this topic describes how to use KEDA to configure an independent autoscaling policy for the prefill role based on the backlog of the NATS message queue, allocating resources on demand and balancing service cost against performance.

Prerequisites

Limitations

  • The autoscaling solution in this topic applies only to the prefill role in the PD disaggregated architecture. The decode role requires its own scaling policy (GPU memory utilization is recommended for decode).

  • The examples in this topic are based on the Dynamo inference framework. If you use a different framework, adjust the related settings (such as the NATS stream name and consumer name) accordingly.

Procedure

For PD disaggregated inference services deployed through RoleBasedGroup (RBG), RBG can scale each role independently. Using the Dynamo PD disaggregation framework as an example, this topic demonstrates how to use KEDA (Kubernetes Event-driven Autoscaling) to configure a dedicated autoscaling policy for the prefill role of a PD disaggregated inference service.

In Dynamo's PD disaggregated architecture, pending inference requests are pushed as messages to the dynamo_prefill_queue stream of the NATS message queue. Prefill instances act as consumers and pull messages from this queue according to their processing capacity. The number of pending messages in the queue is therefore an effective indicator of the load on the prefill role. The NATS JetStream scaler provided by KEDA can monitor the backlog of this queue and trigger scaling based on it, precisely adjusting the number of prefill instances.
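
The HPA that KEDA creates compares the backlog reported by the scaler against lagThreshold as an average value per replica. As a rough mental model (a simplification that ignores the HPA's stabilization windows and scaling rate limits), the desired replica count works out to:

```python
import math

def desired_replicas(pending_messages: int, lag_threshold: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate the HPA decision for an AverageValue external metric:
    desired = ceil(total_metric / target), clamped to [min, max]."""
    desired = math.ceil(pending_messages / lag_threshold)
    return max(min_replicas, min(desired, max_replicas))

# With lagThreshold=5 and a replica range of 1 to 6, as configured later in this topic:
print(desired_replicas(28, 5, 1, 6))  # → 6  (a backlog of 28 messages asks for 6 replicas)
print(desired_replicas(3, 5, 1, 6))   # → 1  (clamped to minReplicaCount)
```

This is why lagThreshold directly controls how aggressively the prefill role scales: a lower threshold yields more replicas for the same backlog.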

Before you apply this autoscaling policy in a production environment, we strongly recommend that you run sufficient stress tests in a test environment to determine the lagThreshold (pending-message threshold) and pollingInterval (polling interval) that best fit your workload. Improper settings can delay scale-out and degrade service performance, or over-scale and waste resources.

Step 1: Create a ScalingAdapter for the RBG role

To allow KEDA to independently control the replica count of a specific role in an RBG, enable ScalingAdapter for the target role when you create the RBG. A RoleBasedGroupScalingAdapter resource bound to that role is then created automatically.

  1. Create a file named rbg.yaml. In it, setting scalingAdapter.enable to true under the prefill role enables ScalingAdapter for that role of the RBG.

    Expand to view the sample YAML.

    apiVersion: workloads.x-k8s.io/v1alpha1
    kind: RoleBasedGroup
    metadata:
      name: dynamo-pd
      namespace: default
    spec:
      roles:
        - name: processor
          replicas: 1
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve graphs.pd_disagg:Frontend -f ./configs/qwen3.yaml
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: # Address of the Dynamo Runtime image built in Step 2
                  name: processor
                  ports:
                    - containerPort: 8000
                      name: health
                      protocol: TCP
                    - containerPort: 9345
                      name: request
                      protocol: TCP
                    - containerPort: 443
                      name: api
                      protocol: TCP
                    - containerPort: 9347
                      name: metrics
                      protocol: TCP
                  readinessProbe:
                    initialDelaySeconds: 30
                    periodSeconds: 30
                    tcpSocket:
                      port: 8000
                  resources:
                    limits:
                      cpu: "8"
                      memory: 12Gi
                    requests:
                      cpu: "8"
                      memory: 12Gi
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
                    - mountPath: /workspace/examples/llm/graphs/pd_disagg.py
                      name: dynamo-configs
                      subPath: pd_disagg.py
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
        - name: prefill
          replicas: 2
          scalingAdapter:
            enable: true
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve components.prefill_worker:PrefillWorker -f ./configs/qwen3.yaml
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: # Address of the Dynamo Runtime image built in Step 2
                  name: prefill-worker
                  resources:
                    limits:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                    requests:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
        - name: decoder
          replicas: 1
          template:
            spec:
              containers:
                - command:
                    - sh
                    - -c
                    - cd /workspace/examples/llm; dynamo serve components.worker:VllmWorker -f ./configs/qwen3.yaml --service-name VllmWorker
                  env:
                    - name: DYNAMO_NAME
                      value: dynamo
                    - name: DYNAMO_NAMESPACE
                      value: default
                    - name: ETCD_ENDPOINTS
                      value: http://etcd:2379
                    - name: NATS_SERVER
                      value: nats://nats:4222
                    - name: DYNAMO_RP_TIMEOUT
                      value: "60"
                  image: # Address of the Dynamo Runtime image built in Step 2
                  name: vllm-worker
                  resources:
                    limits:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                    requests:
                      cpu: "12"
                      memory: 50Gi
                      nvidia.com/gpu: "2"
                  volumeMounts:
                    - mountPath: /models/Qwen3-32B/
                      name: model
                    - mountPath: /workspace/examples/llm/configs/qwen3.yaml
                      name: dynamo-configs
                      subPath: qwen3.yaml
              volumes:
                - name: model
                  persistentVolumeClaim:
                    claimName: llm-model
                - configMap:
                    name: dynamo-configs
                  name: dynamo-configs
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: dynamo-service
    spec:
      type: ClusterIP
      ports:
        - port: 8000
          protocol: TCP
          targetPort: 8000
      selector:
        rolebasedgroup.workloads.x-k8s.io/name: dynamo-pd
        rolebasedgroup.workloads.x-k8s.io/role: processor
  2. Run the following command to create the resources.

    kubectl apply -f ./rbg.yaml
  3. When the RBG is created, the system automatically creates a custom resource of kind RoleBasedGroupScalingAdapter for each role that has ScalingAdapter enabled, and binds it to that role. The RoleBasedGroupScalingAdapter provides the Scale subresource for the bound role.

    • Run the following command to view the RoleBasedGroupScalingAdapter automatically created for the role.

      kubectl get rolebasedgroupscalingadapter

      Expected output:

      NAME                  PHASE   REPLICAS
      dynamo-pd-prefill     Bound   2
    • Run the following command to confirm the status of the dynamo-pd-prefill ScalingAdapter.

      kubectl describe rolebasedgroupscalingadapter dynamo-pd-prefill

      In the expected output, Status.Phase should be Bound, indicating that the ScalingAdapter is bound to the prefill role of the RBG you created.

      Name:         dynamo-pd-prefill
      Namespace:    default
      Labels:       <none>
      Annotations:  <none>
      API Version:  workloads.x-k8s.io/v1alpha1
      Kind:         RoleBasedGroupScalingAdapter
      Metadata:
        Creation Timestamp:  2025-07-25T06:10:37Z
        Generation:          2
        Owner References:
          API Version:           workloads.x-k8s.io/v1alpha1
          Block Owner Deletion:  true
          Kind:                  RoleBasedGroup
          Name:                  dynamo-pd
          UID:                   5dd61668-79f3-4197-a5db-b778ce460270
        Resource Version:        1157485
        UID:                     edbb8373-2b9c-4ad1-8b6b-d5dfff71e769
      Spec:
        Replicas:  2
        Scale Target Ref:
          Name:  dynamo-pd
          Role:  prefill
      Status:
        Phase:     Bound
        Replicas:  2
        Selector:  rolebasedgroup.workloads.x-k8s.io/name=dynamo-pd,rolebasedgroup.workloads.x-k8s.io/role=prefill
      Events:
        Type    Reason           Age   From                          Message
        ----    ------           ----  ----                          -------
        Normal  SuccessfulBound  25s   RoleBasedGroupScalingAdapter  Succeed to find scale target role [prefill] of rbg [dynamo-pd]

Step 2: Create a KEDA ScaledObject to monitor the message queue

Create a ScaledObject resource that defines the scaling rules and associates them with the RoleBasedGroupScalingAdapter created in the previous step.

  1. Create a file named scaledobject.yaml with the following content. The configuration sets the scale target to the dynamo-pd-prefill ScalingAdapter and defines a trigger based on the number of pending messages in the NATS queue.

    The parameter values in the following scaling policy are for demonstration only. Adjust them based on your actual workload.
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: dynamo-prefill-scaledobject
    spec:
      pollingInterval: 30 # For demo. Default: 30 seconds
      minReplicaCount: 1 # For demo. Default: 0
      maxReplicaCount: 6 # For demo. Default: 100
      scaleTargetRef:
        apiVersion: workloads.x-k8s.io/v1alpha1
        kind: RoleBasedGroupScalingAdapter
        name: dynamo-pd-prefill # Scale target: the prefill role of the RoleBasedGroup
      triggers:
      - type: nats-jetstream
        metadata:
          natsServerMonitoringEndpoint: "nats.default.svc.cluster.local:8222" # NATS monitoring endpoint
          account: "$G" # Default account name when NATS has no accounts configured
          stream: "dynamo_prefill_queue" # Name of the Dynamo prefill queue stream
          consumer: "worker-group" # Durable name of the Dynamo consumer
          lagThreshold: "5" # Scaling threshold on the number of pending messages in the queue
          useHttps: "false" # Whether to use HTTPS
  2. Run the following command to create the resource.

    kubectl apply -f ./scaledobject.yaml
  3. Run the following command to check the status of the KEDA ScaledObject resource.

    kubectl describe so dynamo-prefill-scaledobject

    Expand to view the expected output.

    Name:         dynamo-prefill-scaledobject
    Namespace:    default
    Labels:       scaledobject.keda.sh/name=dynamo-prefill-scaledobject
    Annotations:  <none>
    API Version:  keda.sh/v1alpha1
    Kind:         ScaledObject
    Metadata:
      ...
    Spec:
      Cooldown Period:    300
      Max Replica Count:  6
      Min Replica Count:  1
      Polling Interval:   30
      Scale Target Ref:
        API Version:  workloads.x-k8s.io/v1alpha1
        Kind:         RoleBasedGroupScalingAdapter
        Name:         dynamo-pd-prefill
      Triggers:
        Metadata:
          Account:                          $G
          Consumer:                         worker-group
          Lag Threshold:                    5
          Nats Server Monitoring Endpoint:  nats.default.svc.cluster.local:8222
          Stream:                           dynamo_prefill_queue
          Use Https:                        false
        Type:                               nats-jetstream
    Status:
      Conditions:
        Message:  ScaledObject is defined correctly and is ready for scaling
        Reason:   ScaledObjectReady
        Status:   True
        Type:     Ready
        Message:  Scaling is not performed because triggers are not active
        Reason:   ScalerNotActive
        Status:   False
        Type:     Active
        Status:   Unknown
        Type:     Fallback
      External Metric Names:
        s0-nats-jetstream-dynamo_prefill_queue
      Hpa Name:                keda-hpa-dynamo-prefill-scaledobject
      Original Replica Count:  1
      Scale Target GVKR:
        Group:            workloads.x-k8s.io
        Kind:             RoleBasedGroupScalingAdapter
        Resource:         rolebasedgroupscalingadapters
        Version:          v1alpha1
      Scale Target Kind:  workloads.x-k8s.io/v1alpha1.RoleBasedGroupScalingAdapter
    Events:
      Type    Reason              Age   From           Message
      ----    ------              ----  ----           -------
      Normal  KEDAScalersStarted  3s    keda-operator  Started scalers watch
      Normal  ScaledObjectReady   3s    keda-operator  ScaledObject is ready for scaling

    In the expected output, the Ready condition in Status.Conditions should be True.

    KEDA also automatically creates an HPA resource, whose name is recorded in the Status.HpaName field. Run the following command to view it.

    kubectl get hpa keda-hpa-dynamo-prefill-scaledobject
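
Note that the nats-jetstream trigger does not connect to the NATS client port (4222); it polls the HTTP monitoring endpoint (8222) set in natsServerMonitoringEndpoint. The following is a minimal sketch of the lookup the scaler performs, extracting a consumer's pending count from a sample /jsz-style payload. The field names follow the NATS monitoring API as commonly documented, and the sample values are invented, so verify both against your NATS version:

```python
def pending_for_consumer(jsz: dict, account: str, stream: str, consumer: str) -> int:
    """Walk a /jsz?accounts=true&consumers=true payload and return the
    pending-message count for one durable consumer (0 if not found)."""
    for acc in jsz.get("account_details", []):
        if acc.get("name") != account:
            continue
        for st in acc.get("stream_detail", []):
            if st.get("name") != stream:
                continue
            for con in st.get("consumer_detail", []):
                if con.get("name") == consumer:
                    return con.get("num_pending", 0)
    return 0

# Sample payload shaped like a monitoring response (values invented for illustration).
sample = {
    "account_details": [{
        "name": "$G",
        "stream_detail": [{
            "name": "dynamo_prefill_queue",
            "consumer_detail": [{"name": "worker-group", "num_pending": 28}],
        }],
    }]
}
print(pending_for_consumer(sample, "$G", "dynamo_prefill_queue", "worker-group"))  # → 28
```

If this value stays above lagThreshold across a polling interval, the trigger becomes active and the HPA scales out the prefill role.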

Step 3: (Optional) Run a stress test and verify scaling behavior

  1. Create a service instance for stress testing, then use the benchmark tool to stress the service.

    For details about the benchmark tool and how to use it, see vLLM Benchmark.
    1. Create a file named benchmark.yaml.

      Expand to view the sample YAML.

      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        labels:
          app: llm-benchmark
        name: llm-benchmark
      spec:
        selector:
          matchLabels:
            app: llm-benchmark
        template:
          metadata:
            labels:
              app: llm-benchmark
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            containers:
            - command:
              - sh
              - -c
              - sleep inf
              image: # Dynamo container image used to deploy the inference service
              imagePullPolicy: IfNotPresent
              name: llm-benchmark
              resources:
                limits:
                  cpu: "8"
                  memory: 40Gi
                requests:
                  cpu: "8"
                  memory: 40Gi
              volumeMounts:
              - mountPath: /models/Qwen3-32B
                name: llm-model
            volumes:
            - name: llm-model
              persistentVolumeClaim:
                claimName: llm-model
    2. Run the following command to create the stress-test instance.

      kubectl create -f benchmark.yaml
    3. After the instance is running, run the following command inside it to start the stress test:

      python3 $VLLM_ROOT_DIR/benchmarks/benchmark_serving.py \
              --backend openai-chat \
              --model /models/Qwen3-32B/ \
              --served-model-name qwen \
              --trust-remote-code \
              --dataset-name random \
              --random-input-len 1500 \
              --random-output-len 100 \
              --num-prompts 320 \
              --max-concurrency 32 \
              --host dynamo-service \
              --port 8000 \
              --endpoint /v1/chat/completions 
  2. During the stress test, open a new terminal and run the following command to observe HPA scaling events.

    kubectl describe hpa keda-hpa-dynamo-prefill-scaledobject

    In the expected output, the Events field records SuccessfulRescale events, indicating that KEDA has scaled out based on the NATS queue backlog.

    Name:                               keda-hpa-dynamo-prefill-scaledobject
    Namespace:                          default
    Reference:                          RoleBasedGroupScalingAdapter/dynamo-pd-prefill
    Min replicas:                       1
    Max replicas:                       6
    RoleBasedGroupScalingAdapter pods:  6 current / 6 desired
    Events:
      Type     Reason             Age                   From                       Message
      ----     ------             ----                  ----                       -------
      Normal  SuccessfulRescale  2m1s  horizontal-pod-autoscaler  New size: 4; reason: external metric s0-nats-jetstream-dynamo_prefill_queue(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: dynamo-prefill-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
      Normal  SuccessfulRescale  106s  horizontal-pod-autoscaler  New size: 6; reason: external metric s0-nats-jetstream-dynamo_prefill_queue(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: dynamo-prefill-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target
  3. You can also observe the change in the replica count of the RoleBasedGroupScalingAdapter.

    kubectl describe rolebasedgroupscalingadapter dynamo-pd-prefill

    In the expected output, the values of Spec.Replicas and Status.Replicas increase from the initial value to the scaled-out value (for example, 6).

    Name:         dynamo-pd-prefill
    Namespace:    default
    API Version:  workloads.x-k8s.io/v1alpha1
    Kind:         RoleBasedGroupScalingAdapter
    Metadata:
      Owner References:
        API Version:           workloads.x-k8s.io/v1alpha1
        Block Owner Deletion:  true
        Kind:                  RoleBasedGroup
        Name:                  dynamo-pd
    Spec:
      Replicas:  6
      Scale Target Ref:
        Name:  dynamo-pd
        Role:  prefill
    Status:
      Last Scale Time:  2025-08-04T02:08:10Z
      Phase:            Bound
      Replicas:         6
      Selector:         rolebasedgroup.workloads.x-k8s.io/name=dynamo-pd,rolebasedgroup.workloads.x-k8s.io/role=prefill
    Events:
      Type    Reason           Age    From                          Message
      ----    ------           ----   ----                          -------
      Normal  SuccessfulBound  6m9s   RoleBasedGroupScalingAdapter  Succeed to find scale target role [prefill] of rbg [dynamo-pd]
      Normal  SuccessfulScale  4m40s  RoleBasedGroupScalingAdapter  Succeed to scale target role [prefill] of rbg [dynamo-pd] from 1 to 4 replicas
      Normal  SuccessfulScale  4m25s  RoleBasedGroupScalingAdapter  Succeed to scale target role [prefill] of rbg [dynamo-pd] from 4 to 6 replicas
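
As a sanity check on why this benchmark drives the prefill role hard, the prefill work it generates can be estimated from the parameters above. This is a rough model; actual token counts depend on the tokenizer and the randomly generated prompts:

```python
def prefill_load(num_prompts: int, input_len: int, max_concurrency: int):
    """Rough prefill demand of a benchmark run: every prompt's full input must
    be prefilled once, and requests arrive in waves of up to max_concurrency."""
    total_prefill_tokens = num_prompts * input_len
    waves = -(-num_prompts // max_concurrency)  # ceiling division
    return total_prefill_tokens, waves

# Parameters from the benchmark_serving.py command above.
tokens, waves = prefill_load(num_prompts=320, input_len=1500, max_concurrency=32)
print(tokens)  # → 480000 input tokens to prefill in total
print(waves)   # → 10 full-concurrency waves of requests
```

With nearly half a million input tokens arriving in sustained bursts, the pending count in dynamo_prefill_queue quickly exceeds lagThreshold, which matches the scale-out events shown above.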