Deploy a vLLM inference application with Knative

Traditional autoscaling policies based on GPU utilization do not accurately reflect the actual load of a large language model (LLM) inference service: GPU utilization can reach 100% even when the service is not under heavy load. The Knative Pod Autoscaler (KPA) instead scales on request-level metrics, either concurrency or requests per second (RPS), which reflect inference service performance more directly. This topic describes how to deploy a vLLM inference service in Knative, using the Qwen-7B-Chat-Int8 model and a V100 GPU as an example.

vLLM is a high-performance inference library for LLMs. It supports multiple model formats and backend acceleration, which makes it suitable for deploying large-scale inference services. vLLM achieves high inference efficiency through optimization techniques such as PagedAttention, continuous batching, and model quantization. For more information about the vLLM framework, see the vLLM GitHub repository.

Prerequisites

You have created an ACK Managed Cluster or an ACK Dedicated Cluster that contains GPU nodes. The cluster version must be 1.22 or later, and the GPU type must be V100, A10, or T4. For more information, see Create an ACK Managed Cluster or Create an ACK Dedicated Cluster (new ACK Dedicated Clusters can no longer be created).

You can add the label ack.aliyun.com/nvidia-driver-version:550.144.03 to the GPU node pool to set the driver version to 550.144.03. For more information, see Customize the GPU driver version for nodes by specifying a version number.
You have deployed Knative on the cluster. For more information, see Deploy and manage Knative components.
You have obtained the cluster KubeConfig and connected to the cluster using kubectl.

Step 1: Prepare model data and upload to OSS

You can use OSS or NAS to prepare model data. For more information, see Use an ossfs 1.0 static volume or Use a NAS static volume. The following steps use OSS as an example.

Download the model files. This example uses the Qwen-7B-Chat-Int8 model.
1. Run the following command to install Git.
```
# You can run yum install git or apt install git.
sudo yum install git
```
2. Run the following command to install the Git LFS (Large File Storage) extension.
```
# You can run yum install git-lfs or apt install git-lfs.
sudo yum install git-lfs
```
3. Run the following command to clone the Qwen-7B-Chat-Int8 repository from ModelScope to your local machine.
```
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat-Int8.git
```
4. Run the following commands to change to the Qwen-7B-Chat-Int8 directory and download the large files managed by Git LFS.
```
cd Qwen-7B-Chat-Int8
git lfs pull
```
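If `git lfs pull` is skipped or interrupted, the weight files remain small Git LFS pointer files, and the model cannot be loaded from them later. The following sketch checks for leftover pointers before you upload anything. It relies on the fact that a pointer file begins with the line `version https://git-lfs.github.com/spec/v1`; the helper names `is_lfs_pointer` and `find_lfs_pointers` are hypothetical and chosen only for this example.

```shell
# Return 0 if the given file is still a Git LFS pointer rather than real content.
is_lfs_pointer() {
  # A pointer file starts with this spec line; real weight files do not.
  head -c 200 "$1" 2>/dev/null | head -n 1 \
    | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Print every file under a directory that is still a pointer, skipping .git metadata.
find_lfs_pointers() {
  find "$1" -type f ! -path '*/.git/*' | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      echo "$f"
    fi
  done
}
```

Running `find_lfs_pointers ./Qwen-7B-Chat-Int8` should print nothing once all large files are downloaded; any path it prints still needs `git lfs pull`.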
Upload the downloaded Qwen-7B-Chat-Int8 files to OSS.
1. Log on to the OSS Console and take note of the name of the bucket you created.
  
  For information about how to create a bucket, see Create buckets.
2. Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.
3. Run the following command to create the directory models/Qwen-7B-Chat-Int8 in your OSS bucket.
```
ossutil mkdir oss://<Your-Bucket-Name>/models/Qwen-7B-Chat-Int8
```
4. Run the following command to upload the model files to OSS.
```
ossutil cp -r ./Qwen-7B-Chat-Int8 oss://<Your-Bucket-Name>/models/Qwen-7B-Chat-Int8
```

Configure a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) for the target cluster.

The following tables provide sample parameters. For more information, see Use an ossfs 1.0 static volume.

Example PV parameters

Parameter	Description
Volume Type	OSS
Name	llm-model
Access Credential	Configure the AccessKey ID and AccessKey Secret used to access OSS.
Bucket	Select the created OSS bucket.
OSS Path	Select the path where the model is stored, such as /models/Qwen-7B-Chat-Int8.

Example PVC parameters

Parameter	Description
Persistent Volume Claim Type	OSS
Name	llm-model
Allocation Mode	Select Existing Volume.
Existing Volume	Click Select an Existing Volume and select the PV you created.
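
If you prefer kubectl to the console, the table values above map roughly onto a static OSS volume manifest. The following is a sketch, not a verified configuration: it assumes the ossfs 1.0 CSI conventions (the `ossplugin.csi.alibabacloud.com` driver and a Secret with `akId` and `akSecret` keys). The Secret name `oss-secret`, the file name, the 30Gi capacity, the endpoint placeholder, and the mount options are illustrative assumptions, so check them against Use an ossfs 1.0 static volume before applying.

```shell
# Write a static OSS PV/PVC manifest to a file. Replace every <...> placeholder.
cat > llm-model-volume.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret              # hypothetical Secret name
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi               # nominal value, assumed for this example
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: <Your-Bucket-Endpoint>
      path: /models/Qwen-7B-Chat-Int8
      otherOpts: "-o umask=022 -o allow_other"   # assumed mount options
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  storageClassName: ""          # bind to the static PV, not a dynamic class
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF
```

After replacing the placeholders, the file could be applied with `kubectl apply -f llm-model-volume.yaml`. The PV and PVC are both named llm-model to match the claim name used by the inference service.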

Step 2: Deploy the Knative inference service

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Knative.

Click the Services tab, set namespace to default, and then click Create from Template. Paste the following YAML into the template and click Create to deploy the Knative inference service.

```
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency" # Specifies concurrency as the autoscaling metric.
        autoscaling.knative.dev/target: "2" # Sets the target concurrency to 2.
        autoscaling.knative.dev/max-scale: "3" # Sets the maximum number of replicas to 3.
      labels:
        release: qwen
    spec:
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
        imagePullPolicy: IfNotPresent
        name: vllm-container
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          limits:
            cpu: "32"
            memory: 64Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 16Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /models/Qwen-7B-Chat-Int8 # The path where the model is stored.
          name: llm-model
      volumes:
      - name: llm-model
        persistentVolumeClaim:
          claimName: llm-model
```

Parameter	Description
`autoscaling.knative.dev/metric`	The autoscaling metric. Valid values are `concurrency` and `rps`. The default is `concurrency`.
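`autoscaling.knative.dev/target`	The target value of the metric per pod. In this example, the autoscaler aims for 2 concurrent requests per pod.
`autoscaling.knative.dev/max-scale`	The maximum number of pods that the service can scale out to. In this example, 3.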
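
To make the effect of `target` and `max-scale` concrete, the following sketch estimates how many pods the autoscaler would aim for at a given number of in-flight requests. It is a deliberately simplified model, ceil(in-flight / target) capped at max-scale. The real KPA averages metrics over time windows and applies a target utilization percentage, so actual pod counts can differ. The function name `estimate_pods` is hypothetical.

```shell
# Estimate desired pods as ceil(inflight / target), capped at max_scale.
# Simplified model for illustration; not the exact KPA algorithm.
estimate_pods() {
  inflight=$1
  target=$2
  max_scale=$3
  pods=$(( (inflight + target - 1) / target ))   # integer ceiling division
  if [ "$pods" -gt "$max_scale" ]; then
    pods=$max_scale
  fi
  echo "$pods"
}

# With the values from this topic (target 2, max-scale 3):
estimate_pods 1 2 3    # prints 1
estimate_pods 4 2 3    # prints 2
estimate_pods 10 2 3   # prints 3 (capped by max-scale)
```

Under this model, any sustained load above 4 concurrent requests already asks for the maximum of 3 pods, so raise `max-scale` only if the cluster has enough GPU nodes to back it.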

On the Services tab, check whether the service is successfully deployed. Then, obtain the default domain name and the access gateway IP.

Get the gateway IP from the access gateway field at the top of the page. Get the domain name from the Default Domain Name column, for example, qwen.default.example.com.

Step 3: Verify the inference service

Run the following command to send a request to the inference service.

```
# Replace 1XX.XX.XX.XXX with your access gateway IP.
curl -H "Host: qwen.default.example.com" -H "Content-Type: application/json" http://1XX.XX.XX.XXX:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hangzhou West Lake"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
```

Expected output:

```
{"id":"cmpl-e914ef54331e4a5f9b858425321a82ed","object":"chat.completion","created":1733191642,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"West Lake is located in Hangzhou, Zhejiang Province, China, and is a famous scenic spot"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}
```

The output indicates that the model can generate a response based on the given input.
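
A single request confirms that the service works but does not exercise the autoscaler. The following sketch sends several requests in parallel so that in-flight concurrency exceeds the target of 2. The helper names `build_payload` and `send_concurrent`, the `GATEWAY_IP` and `HOST_HEADER` variables, and the `max_tokens` value of 256 are assumptions for illustration; the Host header and endpoint are taken from the verification command in this step.

```shell
GATEWAY_IP="${GATEWAY_IP:-1XX.XX.XX.XXX}"   # replace with your access gateway IP
HOST_HEADER="qwen.default.example.com"

# Build a chat completion request body for the given prompt.
# Note: prompts containing double quotes or backslashes are not escaped here.
build_payload() {
  printf '{"model": "qwen", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 256}' "$1"
}

# Send N requests in parallel, print each HTTP status code, and wait for all to finish.
send_concurrent() {
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -w '%{http_code}\n' \
      -H "Host: ${HOST_HEADER}" -H "Content-Type: application/json" \
      "http://${GATEWAY_IP}:80/v1/chat/completions" \
      -d "$(build_payload "Hangzhou West Lake")" &
    i=$((i + 1))
  done
  wait
}
```

For example, run `send_concurrent 6` in one terminal and `kubectl get pods -l release=qwen -w` in another. If spare GPU nodes are available, additional pods should appear, up to the max-scale of 3.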

(Optional) Step 4: Clean up resources

If you no longer need the created resources, delete them. Run the following commands to delete the PV and PVC.

```
kubectl delete pvc llm-model
kubectl delete pv llm-model
```
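
These commands remove only the storage objects. While a pod still mounts the claim, Kubernetes keeps the PVC in the Terminating state, so it can help to delete the Knative Service first. The following sketch wraps that order in a hypothetical `cleanup_qwen` function. It assumes the resource names and the default namespace used in this topic, and the `ksvc` short name for Knative Services.

```shell
# Delete the inference service and its storage objects in dependency order.
cleanup_qwen() {
  # Remove the Knative Service first so that no pod still mounts the PVC.
  kubectl delete ksvc qwen -n default
  kubectl delete pvc llm-model -n default
  # PVs are cluster-scoped, so no namespace flag is needed.
  kubectl delete pv llm-model
}
```

Call `cleanup_qwen` once you are sure the service is no longer needed. The model files in the OSS bucket are not affected by these commands.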