Deploy a vLLM inference application with Knative

Traditional autoscaling policies based on GPU utilization do not accurately reflect the actual load of a large language model (LLM) inference service: GPU utilization can reach 100% even when the service is not under heavy load. The Knative Pod Autoscaler (KPA) instead scales the number of pods based on request concurrency or requests per second (RPS), which more directly reflects the load on an inference service. This topic describes how to deploy a vLLM inference service in Knative, using the Qwen-7B-Chat-Int8 model and a V100 GPU as an example.

vLLM (Vectorized Large Language Model) is a high-performance inference library for LLMs. It supports multiple model formats and backend acceleration, which makes it suitable for deploying large-scale inference services. vLLM achieves high inference efficiency through optimization techniques such as PagedAttention, continuous batching, and model quantization. For more information about the vLLM framework, see the vLLM GitHub repository.

Prerequisites

Step 1: Prepare model data and upload to OSS

You can use OSS or NAS to prepare model data. For more information, see Use an ossfs 1.0 static volume or Use a NAS static volume. The following steps use OSS as an example.

  1. Download the model files. This example uses the Qwen-7B-Chat-Int8 model.

    1. Run the following command to install Git.

      # On RPM-based systems, use yum. On Debian or Ubuntu, run: sudo apt install git
      sudo yum install git
    2. Run the following command to install the Git LFS (Large File Storage) extension.

      # On RPM-based systems, use yum. On Debian or Ubuntu, run: sudo apt install git-lfs
      sudo yum install git-lfs
    3. Run the following command to clone the Qwen-7B-Chat-Int8 repository from ModelScope to your local machine.

      GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat-Int8.git
    4. Run the following commands to change to the Qwen-7B-Chat-Int8 directory and download the large files managed by Git LFS.

      cd Qwen-7B-Chat-Int8
      git lfs pull
  2. Upload the downloaded Qwen-7B-Chat-Int8 files to OSS.

    1. Log on to the OSS Console and take note of the name of the bucket you created.

      For information about how to create a bucket, see Create buckets.

    2. Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.

    3. Run the following command to create a directory named models/Qwen-7B-Chat-Int8 in your OSS bucket.

      ossutil mkdir oss://<Your-Bucket-Name>/models/Qwen-7B-Chat-Int8
    4. Run the following command to upload the model files to OSS.

      ossutil cp -r ./Qwen-7B-Chat-Int8 oss://<Your-Bucket-Name>/models/Qwen-7B-Chat-Int8
  3. Configure a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) for the target cluster.

    The following tables provide sample parameters. For more information, see Use an ossfs 1.0 static volume.

    • Example PV parameters

      Parameter          Description
      Volume Type        OSS
      Name               llm-model
      Access Credential  Configure the AccessKey ID and AccessKey Secret used to access OSS.
      Bucket             Select the created OSS bucket.
      OSS Path           Select the path where the model is stored, such as /models/Qwen-7B-Chat-Int8.

    • Example PVC parameters

      Parameter                     Description
      Persistent Volume Claim Type  OSS
      Name                          llm-model
      Allocation Mode               Select Existing Volume.
      Existing Volume               Click Select an Existing Volume and select the PV you created.
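
Because the repository in step 1 is cloned with GIT_LFS_SKIP_SMUDGE=1, the large weight files exist only as small Git LFS pointer stubs until git lfs pull completes. If the pull is skipped or interrupted, the stubs are uploaded to OSS and the model fails to load from the mounted volume. The following is a minimal sketch for checking the local directory before or after the upload. It assumes the standard Git LFS pointer format, in which a pointer file begins with the text "version https://git-lfs.github.com/spec/v1". The function names is_lfs_pointer and find_lfs_pointers are hypothetical helpers, not part of Git or ossutil.

```shell
#!/bin/sh
# Hypothetical helpers: detect Git LFS pointer stubs that were never replaced by real files.

LFS_PREFIX="version https://git-lfs.github.com/spec/v1"

# Return 0 if the file starts with the Git LFS pointer prefix, 1 otherwise.
is_lfs_pointer() {
  # Read only the first 42 bytes (the length of the prefix) and drop NUL bytes
  # so that binary weight files do not confuse the shell.
  head_bytes=$(head -c 42 "$1" | tr -d '\0')
  [ "$head_bytes" = "$LFS_PREFIX" ]
}

# Print every pointer stub under a directory, skipping Git's own metadata.
# Prints nothing when all files contain real content.
find_lfs_pointers() {
  find "$1" -type f ! -path '*/.git/*' | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      echo "$f"
    fi
  done
}

# Check the model directory from the steps above, if it exists here.
if [ -d ./Qwen-7B-Chat-Int8 ]; then
  find_lfs_pointers ./Qwen-7B-Chat-Int8
fi
```

If the script prints any file names, run git lfs pull again in the model directory and re-upload those files.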

Step 2: Deploy the Knative inference service

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Knative.

  3. Click the Services tab, set namespace to default, and then click Create from Template. Paste the following YAML into the template and click Create to deploy the Knative inference service.

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      labels:
        release: qwen
      name: qwen
      namespace: default
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/metric: "concurrency" # Specifies concurrency as the autoscaling metric.
            autoscaling.knative.dev/target: "2" # Sets the target concurrency to 2.
            autoscaling.knative.dev/max-scale: "3" # Limits the service to at most 3 pods.
          labels:
            release: qwen
        spec:
          containers:
          - command:
            - sh
            - -c
            - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
              --served-model-name qwen --model /models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
              0.95 --quantization gptq --max-model-len=6144
            image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
            imagePullPolicy: IfNotPresent
            name: vllm-container
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 5
              periodSeconds: 5
            resources:
              limits:
                cpu: "32"
                memory: 64Gi
                nvidia.com/gpu: "1"
              requests:
                cpu: "8"
                memory: 16Gi
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /models/Qwen-7B-Chat-Int8 # The path where the model is stored.
              name: llm-model
          volumes:
          - name: llm-model
            persistentVolumeClaim:
              claimName: llm-model

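    Parameter                          Description
    autoscaling.knative.dev/metric     The autoscaling metric. Valid values are concurrency and rps. The default is concurrency.
    autoscaling.knative.dev/target     The target value of the metric per pod that the autoscaler tries to maintain. In this example, 2 concurrent requests per pod.
    autoscaling.knative.dev/max-scale  The maximum number of pods that the service can scale out to. In this example, 3.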

  4. On the Services tab, check whether the service is successfully deployed. Then, obtain the default domain name and the access gateway IP.

    Get the gateway IP from the access gateway field at the top of the page. Get the domain name from the Default Domain Name column, for example, qwen.default.example.com.
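
To make the target and max-scale annotations in the YAML above more concrete, the following is a simplified sketch of the scaling arithmetic, not the actual KPA implementation. It assumes that the desired pod count is the total number of in-flight requests divided by the per-pod target, rounded up and capped at max-scale. The real autoscaler also applies averaging windows, a panic mode, and a target utilization percentage, all of which are ignored here. The function name desired_replicas is hypothetical.

```shell
#!/bin/sh
# Simplified illustration only: desired pods = ceil(observed / target), capped at max-scale.

# Usage: desired_replicas OBSERVED_CONCURRENCY TARGET MAX_SCALE
desired_replicas() {
  observed=$1
  target=$2
  max=$3
  # Ceiling division using integer arithmetic.
  want=$(( (observed + target - 1) / target ))
  # Never exceed the configured maximum number of pods.
  if [ "$want" -gt "$max" ]; then
    want=$max
  fi
  echo "$want"
}

desired_replicas 1 2 3    # prints 1: one pod handles a single request
desired_replicas 5 2 3    # prints 3: ceil(5 / 2) = 3
desired_replicas 20 2 3   # prints 3: ceil(20 / 2) = 10, capped by max-scale
```

With the values used in this topic (target 2, max-scale 3), any sustained load above roughly four concurrent requests would be expected to drive the service to its maximum of three pods.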

Step 3: Verify the inference service

Run the following command to send a request to the inference service.

# Replace 1XX.XX.XX.XXX with your access gateway IP.
curl -H "Host: qwen.default.example.com" -H "Content-Type: application/json" http://1XX.XX.XX.XXX:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hangzhou West Lake"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

Expected output:

{"id":"cmpl-e914ef54331e4a5f9b858425321a82ed","object":"chat.completion","created":1733191642,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"West Lake is located in Hangzhou, Zhejiang Province, China, and is a famous scenic spot"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}

The output indicates that the model can generate a response based on the given input.
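
A single request does not exercise autoscaling. To observe the service scaling out, you can send several requests in parallel so that the concurrency exceeds the configured target. The following is a rough sketch under stated assumptions: the helper names build_payload and send_concurrent and the variables GATEWAY_IP and HOST_HEADER are hypothetical, and the prompt must not contain double quotes or backslashes because the sketch does no JSON escaping. The script sends traffic only when GATEWAY_IP is set.

```shell
#!/bin/sh
# Hypothetical load sketch: send several chat completion requests in parallel.

# Default domain name from the Services tab; override it if yours differs.
HOST_HEADER="${HOST_HEADER:-qwen.default.example.com}"

# Print the JSON request body for one chat completion.
build_payload() {
  printf '{"model": "qwen", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 10}\n' "$1"
}

# Send N requests in the background, print each HTTP status code, and wait for all to finish.
send_concurrent() {
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -w '%{http_code}\n' \
      -H "Host: ${HOST_HEADER}" -H "Content-Type: application/json" \
      "http://${GATEWAY_IP}:80/v1/chat/completions" \
      -d "$(build_payload "Hangzhou West Lake")" &
    i=$((i + 1))
  done
  wait
}

# Generate load only when the access gateway IP has been provided.
if [ -n "${GATEWAY_IP:-}" ]; then
  send_concurrent 6
fi
```

While the requests are in flight, you can watch the pods in another terminal, for example with kubectl get pods -l serving.knative.dev/service=qwen -w, assuming the pods carry the standard Knative service label. With a target of 2, six concurrent requests would be expected to push the service toward its max-scale of 3, provided that enough GPU nodes are available.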

(Optional) Step 4: Clean up resources

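If you no longer need the created resources, delete them. Run the following commands to delete the Knative Service first, so that no pod still mounts the volume, and then delete the PVC and the PV.

kubectl delete ksvc qwen
kubectl delete pvc llm-model
kubectl delete pv llm-model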