Run multiple inference services on a single GPU by slicing its memory with shared GPU scheduling.
How it works
The shared GPU scheduling add-on slices GPU memory across pods. Each service specifies its memory need with the --gpumemory flag. The scheduler co-locates pods on the same GPU node until total requested memory reaches the node's physical capacity.
Use shared GPU scheduling when maximizing GPU utilization matters more than fault isolation. For strict isolation, use dedicated GPU nodes.
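The co-location rule can be sketched as simple arithmetic: requests fit while their sum stays within the GPU's physical memory. The snippet below is only an illustration, not the scheduler's actual logic. The helper name gpu_fits is hypothetical, and the 16 GB capacity assumes a node with a single 16 GB GPU, such as the Tesla V100 that appears in the verification output of this guide.

```shell
#!/bin/sh
# gpu_fits CAPACITY REQUEST...: sum the requests (in GB) and report whether
# they fit within CAPACITY. Hypothetical helper for illustration only; it is
# not part of arena or the shared GPU scheduling component.
gpu_fits() {
  capacity=$1
  shift
  total=0
  for request in "$@"; do
    total=$((total + request))
  done
  if [ "$total" -le "$capacity" ]; then
    echo "fits: ${total} of ${capacity} GB requested"
  else
    echo "does not fit: ${total} GB requested, ${capacity} GB available"
    return 1
  fi
}

# Two 6 GB services on an assumed 16 GB GPU.
gpu_fits 16 6 6
# A third 6 GB service would push the total past the assumed capacity.
gpu_fits 16 6 6 6 || true
```

With these assumed numbers, two 6 GB services leave 4 GB unallocated, which is not enough for a third service of the same size.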
Limitations
- Total GPU memory requested across all pods on a node must not exceed the node's physical GPU memory.
- GPU-accelerated nodes use CUDA 11 by default. This guide requires CUDA 12.0 or later.
- ack-kserve must be in Raw Deployment mode.
Prerequisites
Ensure the following:
- An ACK managed or dedicated cluster with GPU-accelerated nodes, running Kubernetes 1.22 or later. See Add GPU-accelerated nodes to a cluster or Create an ACK dedicated cluster with GPU-accelerated nodes.
- CUDA 12.0 or later on GPU nodes. The default is CUDA 11; for CUDA 12, add the ack.aliyun.com/nvidia-driver-version:525.105.17 tag to the node pool. See Customize the NVIDIA GPU driver version on nodes.
- The shared GPU scheduling component installed on the cluster.
- Arena client 0.9.15 or later. See Configure the Arena client.
- cert-manager and ack-kserve installed, with ack-kserve in Raw Deployment mode.
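The minimum versions listed above can be checked with a small comparison helper. This is a sketch that assumes a sort command supporting the -V (version sort) option, as GNU coreutils provides. The function name version_ge is hypothetical, and the version strings in the example calls are literals chosen for illustration, not values read from a cluster.

```shell
#!/bin/sh
# version_ge A B: succeed when version A is greater than or equal to B.
# Assumes `sort -V` is available. Compare versions written with the same
# number of components where possible ("1.22" sorts before "1.22.0").
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Literal example values; substitute the versions that `arena version` and
# `kubectl version` report in your environment.
version_ge "0.9.15" "0.9.15" && echo "Arena client OK"
version_ge "1.22.3" "1.22" && echo "Kubernetes OK"
version_ge "11.4" "12.0" || echo "CUDA too old"
```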
Step 1: Prepare model data
Store the model in an OSS bucket or NAS file system. This guide uses OSS. See Use an ossfs 1.0 statically provisioned volume or Mount a statically provisioned NAS volume.
- Download the Qwen1.5-0.5B-Chat model.

  git lfs install
  GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen1.5-0.5B-Chat.git
  cd Qwen1.5-0.5B-Chat
  git lfs pull

- Upload model files to your OSS bucket. See Install ossutil.

  ossutil mkdir oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
  ossutil cp -r ./Qwen1.5-0.5B-Chat oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat

- Create a persistent volume (PV) with the following configuration.

  | Configuration item | Value |
  |---|---|
  | Persistent volume type | OSS |
  | Name | llm-model |
  | Certificate Access | AccessKey ID and AccessKey secret for the OSS bucket |
  | Bucket ID | The OSS bucket created in the previous step |
  | OSS path | /Qwen1.5-0.5B-Chat |

- Create a persistent volume claim (PVC) bound to the PV.

  | Configuration item | Value |
  |---|---|
  | Persistent volume claim type | OSS |
  | Name | llm-model |
  | Allocation mode | Select Existing persistent volume |
  | Existing persistent volume | Click Select Existing persistent volume and select the PV created in the previous step |
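Because the clone in the download step skips large files and relies on git lfs pull to fetch them, it can be worth confirming before the upload that no Git LFS pointer stubs remain. The sketch below uses a hypothetical helper, list_lfs_pointers, and relies on the fact that a pointer stub is a small text file whose first line starts with "version https://git-lfs.github.com/spec/v1". The 1024-byte size cutoff is an assumption chosen to skip real weight files quickly.

```shell
#!/bin/sh
# list_lfs_pointers DIR: print files under DIR that still look like Git LFS
# pointer stubs instead of real content. Hypothetical helper; only files
# smaller than 1024 bytes (an assumed cutoff) are inspected.
list_lfs_pointers() {
  find "$1" -type f ! -path '*/.git/*' -size -1024c | while IFS= read -r file; do
    if head -n 1 "$file" | grep -q '^version https://git-lfs.github.com/spec/v1'; then
      echo "$file"
    fi
  done
}

# Usage (run from the parent of the cloned directory):
#   list_lfs_pointers ./Qwen1.5-0.5B-Chat
# No output suggests that every LFS-tracked file was fetched.
```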
Step 2: Deploy the inference services
Deploy two Qwen inference services, each requesting 6 GB of GPU memory. Only --name differs between commands.
Start the first service:
arena serve kserve \
--name=qwen1 \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
--gpumemory=6 \
--cpu=3 \
--memory=8Gi \
--data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
"python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"
Start the second service by running the same command with --name=qwen2.
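Since only the name changes, the two deployments can also be driven from a loop. The following is a sketch rather than the guide's prescribed method: deploy_qwen, DRY_RUN, and MODEL_DIR are hypothetical names, while the arena flags and values are copied from the command above. The script defaults to a dry run that prints each command; set DRY_RUN=0 to actually call arena.

```shell
#!/bin/sh
# Defaults to a dry run so the commands can be reviewed before anything is
# deployed. Run with DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
MODEL_DIR=/mnt/models/Qwen1.5-0.5B-Chat

# deploy_qwen NAME: build the arena command for one service, then print or
# run it depending on DRY_RUN. Hypothetical helper.
deploy_qwen() {
  name=$1
  set -- arena serve kserve \
    --name="$name" \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpumemory=6 \
    --cpu=3 \
    --memory=8Gi \
    --data="llm-model:$MODEL_DIR" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model $MODEL_DIR --dtype=half --max-model-len=4096"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

for name in qwen1 qwen2; do
  deploy_qwen "$name"
done
```

Keeping the model directory in one variable ensures that the mount path in --data and the --model path passed to vLLM stay identical.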
Key parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| --name | String | Yes | The service name. Must be globally unique. |
| --image | String | Yes | The container image. |
| --gpumemory | Integer (GB) | No | GPU memory allocation in GB, such as --gpumemory=6 for 6 GB. The total across all services on a node must not exceed the physical GPU memory. |
| --cpu | Integer | No | The vCPU count. |
| --memory | String | No | RAM allocation, such as 8Gi. |
| --data | String | No | PVC-to-container mount in <pvc-name>:<container-path> format. Here, llm-model mounts to /mnt/models/Qwen1.5-0.5B-Chat. |
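To make the --data format concrete, the value used in this guide can be split at its first colon with POSIX parameter expansion. This is only an illustration of the format; the variable names are hypothetical and arena does its own parsing.

```shell
#!/bin/sh
# Split a --data value of the form <pvc-name>:<container-path> at the first
# colon. Variable names are illustrative only.
data_spec="llm-model:/mnt/models/Qwen1.5-0.5B-Chat"
pvc_name=${data_spec%%:*}    # text before the first colon
mount_path=${data_spec#*:}   # text after the first colon
echo "PVC: $pvc_name"
echo "Mount path: $mount_path"
```

The mount path is the same directory that the vLLM command receives through --model, so the two values need to stay in sync.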
Step 3: Verify the inference services
- Check that both pods are running on the same GPU node.

  kubectl get pod -owide | grep qwen

  Expected output:

  qwen1-predictor-856568bdcf-5pfdq   1/1   Running   0   7m10s   10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>   <none>
  qwen2-predictor-6b477b587d-dpdnj   1/1   Running   0   4m3s    10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>   <none>

  Both pods run on the same node (cn-beijing.172.16.XX.XX), confirming GPU sharing is active.

- Check GPU memory per pod (one command per service):

  kubectl exec -it qwen1-predictor-856568bdcf-5pfdq -- nvidia-smi # First service
  kubectl exec -it qwen2-predictor-6b477b587d-dpdnj -- nvidia-smi # Second service

  Expected output for each pod:

  GPU memory allocated to the first inference service:

  Fri Jun 28 06:20:43 2024
  +---------------------------------------------------------------------------------------+
  | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
  |-----------------------------------------+----------------------+----------------------+
  | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
  |                                         |                      |               MIG M. |
  |=========================================+======================+======================|
  |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
  | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
  |                                         |                      |                  N/A |
  +-----------------------------------------+----------------------+----------------------+

  +---------------------------------------------------------------------------------------+
  | Processes:                                                                            |
  |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
  |        ID   ID                                                             Usage      |
  |=======================================================================================|
  +---------------------------------------------------------------------------------------+

  GPU memory allocated to the second inference service:

  Fri Jun 28 06:40:17 2024
  +---------------------------------------------------------------------------------------+
  | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
  |-----------------------------------------+----------------------+----------------------+
  | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
  |                                         |                      |               MIG M. |
  |=========================================+======================+======================|
  |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
  | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
  |                                         |                      |                  N/A |
  +-----------------------------------------+----------------------+----------------------+

  +---------------------------------------------------------------------------------------+
  | Processes:                                                                            |
  |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
  |        ID   ID                                                             Usage      |
  |=======================================================================================|
  +---------------------------------------------------------------------------------------+

  Each pod sees a 6 GB (6,144 MiB) limit, confirming GPU memory sharing works as configured.

- Send a test request through the NGINX Ingress gateway.

  curl -H "Host: $(kubectl get inferenceservice qwen1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
    -H "Content-Type: application/json" \
    http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/chat/completions \
    -d '{
      "model": "qwen",
      "messages": [{"role": "user", "content": "This is a test."}],
      "max_tokens": 10,
      "temperature": 0.7,
      "top_p": 0.9,
      "seed": 10
    }'

  Expected output:

  {"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.completion","created":1719303373,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}

  A response confirms the inference service works correctly.
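The first two checks above can be scripted instead of read by eye. The helpers below are a sketch with hypothetical names (same_node, mem_total_mib). They assume the column layout shown in the expected output, where NODE is the seventh whitespace-separated field and RESTARTS is a single field, and they assume that the first "used / total" pair in the nvidia-smi table is the one of interest.

```shell
#!/bin/sh
# same_node: read `kubectl get pod -owide` rows on stdin and succeed when at
# least two rows whose name starts with "qwen" report the same NODE value.
# Assumes NODE is field 7, as in the expected output of this guide.
same_node() {
  awk '$1 ~ /^qwen/ { nodes[$7] = 1; pods++ }
       END { n = 0; for (k in nodes) n++; exit !(pods >= 2 && n == 1) }'
}

# mem_total_mib: read `nvidia-smi` output on stdin and print the total from
# the first "used / total" pair, for example 6144 for "5382MiB / 6144MiB".
mem_total_mib() {
  sed -n 's/.*[0-9][0-9]*MiB *\/ *\([0-9][0-9]*\)MiB.*/\1/p' | head -n 1
}

# Usage against a live cluster (replace <pod-name> with a real pod name):
#   kubectl get pod -owide | same_node && echo "pods share a node"
#   kubectl exec <pod-name> -- nvidia-smi | mem_total_mib
```

Parsing the human-readable table is fragile. Where available, nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits prints the total directly and may be a sturdier choice.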
(Optional) Step 4: Clean up
Delete resources when no longer needed.
Delete the inference services:
arena serve delete qwen1
arena serve delete qwen2
Delete the PVC and PV:
kubectl delete pvc llm-model
kubectl delete pv llm-model