Traditional autoscaling policies based on GPU utilization do not accurately reflect the actual load of a large language model (LLM) inference service: GPU utilization can reach 100% even when the service is not under heavy load. The Knative Pod Autoscaler (KPA) instead scales the service on request-level metrics, namely the number of concurrent requests (concurrency) or requests per second (RPS), which reflect inference service performance more directly. This topic describes how to deploy a vLLM inference service in Knative, using the Qwen-7B-Chat-Int8 model and a V100 GPU as an example.
vLLM is a high-performance inference library for large language models (LLMs). It supports multiple model formats and backend acceleration, which makes it suitable for deploying large-scale inference services. vLLM achieves high inference efficiency through optimization techniques such as PagedAttention, continuous batching, and model quantization. For more information about the vLLM framework, see the vLLM GitHub repository.
Prerequisites
- You have created an ACK Managed Cluster or an ACK Dedicated Cluster that contains GPU nodes. The cluster version must be 1.22 or later, and the GPU type must be V100, A10, or T4. For more information, see Create an ACK Managed Cluster or Create an ACK Dedicated Cluster (Creating new clusters is no longer supported).
  You can add the label ack.aliyun.com/nvidia-driver-version:550.144.03 to the GPU node pool to set the driver version to 550.144.03. For more information, see Customize the GPU driver version for nodes by specifying a version number.
- You have deployed Knative on the cluster. For more information, see Deploy and manage Knative components.
- You have obtained the cluster KubeConfig and connected to the cluster using kubectl.
Step 1: Prepare model data and upload to OSS
You can use OSS or NAS to prepare model data. For more information, see Use an ossfs 1.0 static volume or Use a NAS static volume. The following steps use OSS as an example.
- Download the model files. This example uses the Qwen-7B-Chat-Int8 model.
  - Run the following command to install Git.
        # You can run yum install git or apt install git.
        sudo yum install git
  - Run the following command to install the Git LFS (Large File Storage) extension.
        # You can run yum install git-lfs or apt install git-lfs.
        sudo yum install git-lfs
  - Run the following command to clone the Qwen-7B-Chat-Int8 repository from ModelScope to your local machine. Setting GIT_LFS_SKIP_SMUDGE=1 skips the large files during the clone; they are downloaded in the next step.
        GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen-7B-Chat-Int8.git
  - Run the following commands to change to the Qwen-7B-Chat-Int8 directory and download the large files managed by Git LFS.
        cd Qwen-7B-Chat-Int8
        git lfs pull
- Upload the downloaded Qwen-7B-Chat-Int8 files to OSS.
  - Log on to the OSS Console and take note of the name of the bucket you created. For information about how to create a bucket, see Create buckets.
  - Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.
  - Run the following command to create a directory named Qwen-7B-Chat-Int8 in your OSS bucket.
        ossutil mkdir oss://<Your-Bucket-Name>/models/Qwen-7B-Chat-Int8
  - Run the following command to upload the model files to OSS.
        ossutil cp -r ./Qwen-7B-Chat-Int8 oss://<Your-Bucket-Name>/models/Qwen-7B-Chat-Int8
- Configure a PersistentVolume (PV) and a PersistentVolumeClaim (PVC) for the target cluster. The following tables provide sample parameters. For more information, see Use an ossfs 1.0 static volume.

  Example PV parameters

  | Parameter | Description |
  | --- | --- |
  | Volume Type | OSS |
  | Name | llm-model |
  | Access Credential | Configure the AccessKey ID and AccessKey Secret used to access OSS. |
  | Bucket | Select the created OSS bucket. |
  | OSS Path | Select the path where the model is stored, such as /models/Qwen-7B-Chat-Int8. |

  Example PVC parameters

  | Parameter | Description |
  | --- | --- |
  | Persistent Volume Claim Type | OSS |
  | Name | llm-model |
  | Allocation Mode | Select Existing Volume. |
  | Existing Volume | Click Select an Existing Volume and select the PV you created. |
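As a sanity check on the download step above, it can help to confirm before uploading that `git lfs pull` replaced every Git LFS pointer stub with the real file, since a stub uploaded to OSS would leave the model unloadable. The following is a minimal sketch, not part of the official procedure: the function name `find_lfs_pointers` is hypothetical, and the check relies only on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir: str) -> list[str]:
    """Return relative paths of files that are still Git LFS pointer stubs.

    Hypothetical helper for illustration; not part of Git LFS or ossutil.
    """
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"not a directory: {model_dir}")
    stubs = []
    for path in sorted(root.rglob("*")):
        # Skip directories and Git's own metadata under .git.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as fh:
            head = fh.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            stubs.append(path.relative_to(root).as_posix())
    return stubs

# Example (run from the directory that contains the clone):
#   leftover = find_lfs_pointers("./Qwen-7B-Chat-Int8")
#   An empty list suggests every large file was downloaded.
```

If the returned list is not empty, rerunning `git lfs pull` inside the repository is the likely fix.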
Step 2: Deploy the Knative inference service
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click .
- Click the Services tab, set namespace to default, and then click Create from Template. Paste the following YAML into the template and click Create to deploy the Knative inference service.

      apiVersion: serving.knative.dev/v1
      kind: Service
      metadata:
        labels:
          release: qwen
        name: qwen
        namespace: default
      spec:
        template:
          metadata:
            annotations:
              autoscaling.knative.dev/metric: "concurrency" # Specifies concurrency as the autoscaling metric.
              autoscaling.knative.dev/target: "2" # Sets the target concurrency to 2.
              autoscaling.knative.dev/max-scale: "3"
            labels:
              release: qwen
          spec:
            containers:
            - command:
              - sh
              - -c
              - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /models/Qwen-7B-Chat-Int8 --gpu-memory-utilization 0.95 --quantization gptq --max-model-len=6144
              image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:v0.10.0
              imagePullPolicy: IfNotPresent
              name: vllm-container
              readinessProbe:
                tcpSocket:
                  port: 8080
                initialDelaySeconds: 5
                periodSeconds: 5
              resources:
                limits:
                  cpu: "32"
                  memory: 64Gi
                  nvidia.com/gpu: "1"
                requests:
                  cpu: "8"
                  memory: 16Gi
                  nvidia.com/gpu: "1"
              volumeMounts:
              - mountPath: /models/Qwen-7B-Chat-Int8 # The path where the model is stored.
                name: llm-model
            volumes:
            - name: llm-model
              persistentVolumeClaim:
                claimName: llm-model

  | Parameter | Description |
  | --- | --- |
  | autoscaling.knative.dev/metric | The autoscaling metric. Valid values are concurrency and rps. The default is concurrency. |
  | autoscaling.knative.dev/target | The autoscaling threshold. |
  | autoscaling.knative.dev/max-scale | The maximum number of replicas. |
- On the Services tab, check whether the service is successfully deployed. Then, obtain the default domain name and the access gateway IP. Get the gateway IP from the access gateway field at the top of the page. Get the domain name from the Default Domain Name column, for example, qwen.default.example.com.
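To make the effect of the autoscaling annotations in the manifest above more concrete, the sketch below estimates how many replicas a concurrency target of 2 and a max-scale of 3 would ask for at different load levels. It is a deliberately simplified model, not the KPA implementation: the real autoscaler also applies averaging windows, a panic mode, and a target utilization factor, all of which are ignored here. The function name `desired_replicas` and the `min_scale` default of 0 are assumptions made for illustration.

```python
import math


def desired_replicas(observed_concurrency: float, target: float,
                     max_scale: int, min_scale: int = 0) -> int:
    """Simplified estimate: ceil(total in-flight requests / per-replica target).

    Illustrative only; the real KPA algorithm has further refinements.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(observed_concurrency / target)
    # Clamp to the configured bounds (max-scale is "3" in the manifest above).
    return max(min_scale, min(wanted, max_scale))


# Walk through a few load levels with target=2 and max-scale=3.
for in_flight in (0, 1, 3, 5, 20):
    print(in_flight, "->", desired_replicas(in_flight, target=2, max_scale=3))
```

Under this model, 5 concurrent requests already call for the maximum of 3 replicas, and any further load stays capped at 3, so requests beyond that point would queue rather than trigger more GPU pods.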
Step 3: Verify the inference service
Run the following command to send a request to the inference service.
# Replace 1XX.XX.XX.XXX with your access gateway IP.
curl -H "Host: qwen.default.example.com" -H "Content-Type: application/json" http://1XX.XX.XX.XXX:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hangzhou West Lake"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
{"id":"cmpl-e914ef54331e4a5f9b858425321a82ed","object":"chat.completion","created":1733191642,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"West Lake is located in Hangzhou, Zhejiang Province, China, and is a famous scenic spot"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":10,"total_tokens":20,"completion_tokens":10}}
The output indicates that the model can generate a response based on the given input.
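For scripted checks, the same verification can be sketched in Python using only the standard library. The example below mirrors the curl command above: it sends the request to the gateway IP while setting the Host header to the service's default domain name, then pulls the assistant text and completion token count out of a response shaped like the expected output. The helper names `build_chat_request` and `extract_reply` are hypothetical, and the gateway IP in the usage comment is a placeholder.

```python
import json
import urllib.request


def build_chat_request(gateway_ip: str, host: str, prompt: str,
                       max_tokens: int = 10) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = {
        "model": "qwen",  # Matches --served-model-name in the manifest.
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_p": 0.9,
        "seed": 10,
    }
    return urllib.request.Request(
        url=f"http://{gateway_ip}:80/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        # The Host header routes the request to the Knative service.
        headers={"Host": host, "Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: str) -> tuple[str, int]:
    """Return (assistant text, completion token count) from a chat.completion body."""
    body = json.loads(response_body)
    choice = body["choices"][0]
    return choice["message"]["content"], body["usage"]["completion_tokens"]

# Example usage (replace the placeholder IP with your access gateway IP):
#   req = build_chat_request("1XX.XX.XX.XXX", "qwen.default.example.com",
#                            "Hangzhou West Lake")
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       print(extract_reply(resp.read().decode("utf-8")))
```

Keeping request construction separate from the network call makes it possible to inspect the headers and payload without a running cluster.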
(Optional) Step 4: Clean up resources
If you no longer need the created resources, delete them. Run the following commands to delete the PV and PVC.
kubectl delete pvc llm-model
kubectl delete pv llm-model