Configure elastic scaling policies for a service based on KServe

When you deploy and manage model services with KServe, you must handle the highly dynamic load fluctuations that inference services face. KServe integrates the Kubernetes-native HPA (Horizontal Pod Autoscaler) and scaling controllers to automatically adjust the number of model service pods based on CPU utilization, memory usage, GPU utilization, and custom performance metrics, keeping the service performant and stable. This topic uses the Qwen-7B-Chat-Int8 model on V100 GPUs as an example to describe how to configure elastic scaling for a service based on KServe.

Configure autoscaling based on CPU or memory

Autoscaling in Raw Deployment mode relies on the Kubernetes HPA (Horizontal Pod Autoscaler) mechanism. This is the most basic form of autoscaling: the HPA dynamically adjusts the number of pod replicas in a ReplicaSet based on the pods' CPU or memory utilization.

The following steps show how to configure autoscaling based on CPU utilization. For details about the HPA mechanism, see the Kubernetes documentation Horizontal Pod Autoscaling.
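
For reference, Raw Deployment mode is ultimately driven by a standard HPA object that arena and KServe manage for the predictor Deployment. A minimal sketch of the roughly equivalent resource, using the flag values from step 1 below (illustrative only; you do not create it yourself):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: sklearn-iris-predictor    # managed for you; the name mirrors the predictor Deployment
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: sklearn-iris-predictor
      minReplicas: 1                  # --min-replicas
      maxReplicas: 10                 # --max-replicas
      metrics:
      - type: Resource
        resource:
          name: cpu                   # --scale-metric
          target:
            type: Utilization
            averageUtilization: 10    # --scale-target, as a percentage of the CPU request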

  1. Run the following command to submit the service.

    arena serve kserve \
        --name=sklearn-iris \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/ai-sample/kserve-sklearn-server:v0.12.0 \
        --cpu=1 \
        --memory=200Mi \
        --scale-metric=cpu \
        --scale-target=10 \
        --min-replicas=1 \
        --max-replicas=10 \
        "python -m sklearnserver --model_name=sklearn-iris --model_dir=/models --http_port=8080"

    The parameters are described as follows:

    --scale-metric: The metric to scale on. Both cpu and memory are supported. This example uses cpu.

    --scale-target: The scaling threshold, as a percentage.

    --min-replicas: The minimum number of replicas. The value must be an integer greater than 0; the HPA policy does not support scaling down to zero.

    --max-replicas: The maximum number of replicas. The value must be an integer greater than minReplicas.

    Expected output:

    inferenceservice.serving.kserve.io/sklearn-iris created
    INFO[0002] The Job sklearn-iris has been submitted successfully 
    INFO[0002] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status 

    The output indicates that the sklearn-iris service has been created.
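
    Optionally, you can confirm that the predictor pod is running before sending requests. The label selector here is an assumption based on the labels shown in the HPA description later in this walkthrough:

    kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris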

  2. Run the following command to prepare the inference request input.

    Create a file named iris-input.json and write the following JSON content into it; this is the input data for model prediction.

    cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8,  2.8,  4.8,  1.4],
        [6.0,  3.4,  4.5,  1.6]
      ]
    }
    EOF
  3. Run the following commands to access the service and run inference.

    # Get the load balancer IP address of the nginx-ingress-lb service in the kube-system namespace. This is the external entry point of the service.
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Get the URL of the sklearn-iris InferenceService and extract its hostname for later use.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to the model service with curl. The headers set the target hostname (the SERVICE_HOSTNAME obtained above) and a JSON content type. -d @./iris-input.json supplies the request body from the local iris-input.json file, which contains the model input.
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" \
         http://$NGINX_INGRESS_IP:80/v1/models/sklearn-iris:predict -d @./iris-input.json

    Expected output:

    {"predictions":[1,1]}

    The output indicates that the request ran inference on both input instances and that both returned the same prediction.

  4. Run the following command to start a load test.

    Note

    For a detailed introduction to the hey load testing tool, see Hey.

    hey -z 2m -c 20 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -D ./iris-input.json http://${NGINX_INGRESS_IP}:80/v1/models/sklearn-iris:predict
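
    While the load test runs, you can also watch the replica count change in real time, as a lighter-weight alternative to the describe command in the next step:

    kubectl get hpa sklearn-iris-predictor --watch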
  5. While the load test is running, open another terminal and run the following command to observe how the service scales in and out.

    kubectl describe hpa sklearn-iris-predictor

    Expected output:

    Name:                                                  sklearn-iris-predictor
    Namespace:                                             default
    Labels:                                                app=isvc.sklearn-iris-predictor
                                                           arena.kubeflow.org/uid=3399d840e8b371ed7ca45dda29debeb1
                                                           chart=kserve-0.1.0
                                                           component=predictor
                                                           heritage=Helm
                                                           release=sklearn-iris
                                                           serving.kserve.io/inferenceservice=sklearn-iris
                                                           servingName=sklearn-iris
                                                           servingType=kserve
    Annotations:                                           arena.kubeflow.org/username: kubecfg:certauth:admin
                                                           serving.kserve.io/deploymentMode: RawDeployment
    CreationTimestamp:                                     Sat, 11 May 2024 17:15:47 +0800
    Reference:                                             Deployment/sklearn-iris-predictor
    Metrics:                                               ( current / target )
      resource cpu on pods  (as a percentage of request):  0% (2m) / 10%
    Min replicas:                                          1
    Max replicas:                                          10
    Behavior:
      Scale Up:
        Stabilization Window: 0 seconds
        Select Policy: Max
        Policies:
          - Type: Pods     Value: 4    Period: 15 seconds
          - Type: Percent  Value: 100  Period: 15 seconds
      Scale Down:
        Select Policy: Max
        Policies:
          - Type: Percent  Value: 100  Period: 15 seconds
    Deployment pods:       10 current / 10 desired
    Conditions:
      Type            Status  Reason               Message
      ----            ------  ------               -------
      AbleToScale     True    ScaleDownStabilized  recent recommendations were higher than current one, applying the highest recent recommendation
      ScalingActive   True    ValidMetricFound     the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
      ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range
    Events:
      Type    Reason             Age                  From                       Message
      ----    ------             ----                 ----                       -------
      Normal  SuccessfulRescale  38m                  horizontal-pod-autoscaler  New size: 8; reason: cpu resource utilization (percentage of request) above target
      Normal  SuccessfulRescale  28m                  horizontal-pod-autoscaler  New size: 7; reason: All metrics below target
      Normal  SuccessfulRescale  27m                  horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

    The Events section of the expected output shows that the HPA automatically adjusted the replica count based on CPU usage, for example to 8, then 7, then 1 at different points in time. In other words, the HPA scales the service in and out automatically according to CPU utilization.
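
If you no longer need the test service, you can clean it up before moving on. This assumes the service was deployed with the arena CLI in the default namespace, as above:

    arena serve delete sklearn-iris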

Configure custom-metric elastic scaling based on GPU utilization

Autoscaling on custom metrics relies on the ack-alibaba-cloud-metrics-adapter component provided by ACK together with the Kubernetes HPA mechanism. For details, see Horizontal pod scaling based on Alibaba Cloud Prometheus metrics.
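
After the adapter is configured (step 2 below), the GPU metric should be queryable through the Kubernetes custom metrics API. A quick, optional way to verify this, assuming the adapter registers DCGM_CUSTOM_PROCESS_SM_UTIL as a pods metric in the default namespace:

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/DCGM_CUSTOM_PROCESS_SM_UTIL"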

The following example shows how to configure autoscaling on a custom metric based on the pods' GPU utilization.

  1. Prepare the Qwen-7B-Chat-Int8 model data. For details, see Deploy a vLLM inference service.

  2. Configure custom GPU metrics. For details, see Implement auto scaling based on GPU metrics.

  3. Run the following command to deploy the vLLM service.

    arena serve kserve \
        --name=qwen \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --scale-metric=DCGM_CUSTOM_PROCESS_SM_UTIL \
        --scale-target=50 \
        --min-replicas=1 \
        --max-replicas=2 \
        --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    Expected output:

    inferenceservice.serving.kserve.io/qwen created
    INFO[0002] The Job qwen has been submitted successfully 
    INFO[0002] You can run `arena serve get qwen --type kserve -n default` to check the job status 

    The output indicates that the inference service has been deployed.

  4. Run the following commands to access the inference service through the Nginx Ingress gateway address and verify that the vLLM service works.

    # Get the IP address of the Nginx Ingress.
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Get the hostname of the InferenceService.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to the inference service.
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "测试一下"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"cmpl-77088b96abe744c89284efde2e779174","object":"chat.completion","created":1715590010,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"好的,请问您有什么需要测试的?<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}

    The output indicates that the request was delivered to the server and that the server returned the expected JSON response.

  5. Run the following command to load test the service.

    Note

    For a detailed introduction to the hey load testing tool, see Hey.

    hey -z 2m -c 5 -m POST -host $SERVICE_HOSTNAME -H "Content-Type: application/json" -d '{"model": "qwen", "messages": [{"role": "user", "content": "测试一下"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}' http://$NGINX_INGRESS_IP:80/v1/chat/completions 
  6. While the load test is running, open another terminal and run the following command to observe how the service scales in and out.

    kubectl describe hpa qwen-hpa

    Expected output:

    Name:                                     qwen-hpa
    Namespace:                                default
    Labels:                                   <none>
    Annotations:                              <none>
    CreationTimestamp:                        Tue, 14 May 2024 14:57:03 +0800
    Reference:                                Deployment/qwen-predictor
    Metrics:                                  ( current / target )
      "DCGM_CUSTOM_PROCESS_SM_UTIL" on pods:  0 / 50
    Min replicas:                             1
    Max replicas:                             2
    Deployment pods:                          1 current / 1 desired
    Conditions:
      Type            Status  Reason            Message
      ----            ------  ------            -------
      AbleToScale     True    ReadyForNewScale  recommended size matches current size
      ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from pods metric DCGM_CUSTOM_PROCESS_SM_UTIL
      ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
    Events:
      Type    Reason             Age   From                       Message
      ----    ------             ----  ----                       -------
      Normal  SuccessfulRescale  43m   horizontal-pod-autoscaler  New size: 2; reason: pods metric DCGM_CUSTOM_PROCESS_SM_UTIL above target
      Normal  SuccessfulRescale  34m   horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

    The expected output indicates that the pod count scales out to 2 during the load test and, after the test ends and a stabilization period of about five minutes passes, scales back in to 1. In other words, KServe can scale on a custom metric based on the pods' GPU utilization.
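
For reference, a minimal sketch of the qwen-hpa object implied by the describe output above (illustrative only; arena creates the real object for you):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: qwen-hpa
      namespace: default
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: qwen-predictor
      minReplicas: 1
      maxReplicas: 2
      metrics:
      - type: Pods                            # per-pod custom metric served by the metrics adapter
        pods:
          metric:
            name: DCGM_CUSTOM_PROCESS_SM_UTIL
          target:
            type: AverageValue
            averageValue: "50"                # --scale-target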

Configure a scheduled scaling policy

Scheduled scaling is implemented with the ack-kubernetes-cronhpa-controller component provided by ACK. With this component, you can change the number of application replicas at specific points in time or on a recurring schedule to handle predictable load changes.

  1. Install the CronHPA component. For details, see Use CronHPA for scheduled horizontal scaling.

  2. Prepare the Qwen-7B-Chat-Int8 model data. For details, see Deploy a vLLM inference service.

  3. Run the following command to deploy the vLLM service. The serving.kserve.io/autoscalerClass=external annotation tells KServe not to create an HPA for this service, leaving the replica count to an external autoscaler (here, CronHPA).

    arena serve kserve \
        --name=qwen-cronhpa \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpus=1 \
        --cpu=4 \
        --memory=12Gi \
        --annotation="serving.kserve.io/autoscalerClass=external" \
        --data="llm-model:/mnt/models/Qwen-7B-Chat-Int8" \
       "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144"

    Expected output:

    inferenceservice.serving.kserve.io/qwen-cronhpa created
    INFO[0004] The Job qwen-cronhpa has been submitted successfully 
    INFO[0004] You can run `arena serve get qwen-cronhpa --type kserve -n default` to check the job status 
  4. Run the following commands to verify that the vLLM service works.

    # Get the IP address of the Nginx Ingress.
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Get the hostname of the InferenceService.
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen-cronhpa -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to the inference service.
    curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
         http://$NGINX_INGRESS_IP:80/v1/chat/completions -X POST \
         -d '{"model": "qwen", "messages": [{"role": "user", "content": "你好"}], "max_tokens": 512, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stop":["<|endoftext|>", "<|im_end|>", "<|im_start|>"]}'

    Expected output:

    {"id":"cmpl-b7579597aa284f118718b22b83b726f8","object":"chat.completion","created":1715589652,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"好的,请问您有什么需要测试的?<|im_end|>"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}

    The output indicates that the request was delivered to the service and that the service returned the expected JSON response.

  5. Run the following command to configure scheduled scaling.

    kubectl apply -f- <<EOF
    apiVersion: autoscaling.alibabacloud.com/v1beta1
    kind: CronHorizontalPodAutoscaler
    metadata:
      name: qwen-cronhpa
      namespace: default
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: qwen-cronhpa-predictor
      jobs:
      # Scale out to 2 replicas at 10:30 every day. The schedule is a six-field
      # cron expression: seconds, minutes, hours, day of month, month, day of week.
      - name: "scale-up"
        schedule: "0 30 10 * * *"
        targetSize: 2
        runOnce: false
      # Scale in to 1 replica at 12:00 every day.
      - name: "scale-down"
        schedule: "0 0 12 * * *"
        targetSize: 1
        runOnce: false
    EOF

    After the object is created, you can run kubectl describe cronhorizontalpodautoscaler qwen-cronhpa to view the configured schedule. Expected output:

    Name:         qwen-cronhpa
    Namespace:    default
    Labels:       <none>
    Annotations:  <none>
    API Version:  autoscaling.alibabacloud.com/v1beta1
    Kind:         CronHorizontalPodAutoscaler
    Metadata:
      Creation Timestamp:  2024-05-12T14:06:49Z
      Generation:          2
      Resource Version:    9205625
      UID:                 b9e72da7-262e-4***-b***-26586b7****c
    Spec:
      Jobs:
        Name:         scale-up
        Schedule:     0 30 10 * * *
        Target Size:  2
        Name:         scale-down
        Schedule:     0 0 12 * * *
        Target Size:  1
      Scale Target Ref:
        API Version:  apps/v1
        Kind:         Deployment
        Name:         qwen-cronhpa-predictor
    Status:
      Conditions:
        Job Id:           3972f7cc-bab0-482e-8cbe-7c4*******5
        Last Probe Time:  2024-05-12T14:06:49Z
        Message:          
        Name:             scale-up
        Run Once:         false
        Schedule:         0 30 10 * * *
        State:            Submitted
        Target Size:      2
        Job Id:           36a04605-0233-4420-967c-ac2********6
        Last Probe Time:  2024-05-12T14:06:49Z
        Message:          
        Name:             scale-down
        Run Once:         false
        Schedule:         0 0 12 * * *
        State:            Submitted
        Target Size:      1
      Scale Target Ref:
        API Version:  apps/v1
        Kind:         Deployment
        Name:         qwen-cronhpa-predictor
    Events:           <none>
    

    The output indicates that the qwen-cronhpa CronHorizontalPodAutoscaler resource is configured with a scaling plan: at the scheduled times each day, it automatically adjusts the number of pods in the qwen-cronhpa-predictor Deployment to match the preset scaling requirements.
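
To tear down this example, you can delete both the CronHPA object and the service; the names match those used above:

    kubectl delete cronhorizontalpodautoscaler qwen-cronhpa
    arena serve delete qwen-cronhpa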

Related documentation

For more information about elastic scaling in ACK, see Auto scaling overview.