Deploy GPU-shared inference services

Run multiple inference services on a single GPU by slicing its memory with shared GPU scheduling.

How it works

The shared GPU scheduling add-on slices GPU memory across pods. Each service specifies its memory need with the --gpumemory flag. The scheduler co-locates pods on the same GPU node until the total requested memory reaches the node's physical capacity.
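
The packing rule above comes down to simple arithmetic. The following is a minimal sketch, not part of the add-on: gpu_fit is a hypothetical helper, and it assumes a single GPU whose entire physical memory is schedulable and services that all request the same amount.

```shell
#!/bin/sh
# Sketch: how many pods with the same --gpumemory request fit on one GPU.
# Usage: gpu_fit <gpu_capacity> <per_pod_request>
# Both values must use the same unit as --gpumemory.
# Assumption: the full physical memory is available for scheduling.
gpu_fit() {
    capacity=$1
    request=$2
    if [ "$request" -le 0 ]; then
        echo "request must be positive" >&2
        return 1
    fi
    pods=$((capacity / request))               # integer division rounds down
    leftover=$((capacity - pods * request))    # memory no further pod can use
    echo "pods=$pods leftover=$leftover"
}

# Example: a 16 GB GPU with 6 GB requested per service.
gpu_fit 16 6
```

With these inputs the helper prints pods=2 leftover=4: two services fit, and a third 6 GB request would not be scheduled on that GPU.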

Use shared GPU scheduling when maximizing GPU utilization matters more than fault isolation. For strict isolation, use dedicated GPU nodes.

Limitations

  • Total GPU memory requested across all pods on a node must not exceed the node's physical GPU memory.

  • GPU-accelerated nodes use CUDA 11 by default. This guide requires CUDA 12.0 or later.

  • ack-kserve must be in Raw Deployment mode.
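
One way to check the CUDA requirement is to read the version from the nvidia-smi header on a GPU node. The sketch below is illustrative only: cuda_major and cuda_ok are hypothetical helper names, and the version that nvidia-smi prints is the highest CUDA version the installed driver supports, not the toolkit version inside a container image.

```shell
#!/bin/sh
# cuda_major: reads nvidia-smi output on stdin and prints the major CUDA version.
cuda_major() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# cuda_ok: succeeds only if the reported CUDA major version is 12 or later.
cuda_ok() {
    major=$(cuda_major)
    [ -n "$major" ] && [ "$major" -ge 12 ]
}

# Usage on a GPU node (not executed here):
#   nvidia-smi | cuda_ok && echo "CUDA 12 or later" || echo "CUDA too old or not found"
```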

Prerequisites

Ensure the following:

Step 1: Prepare model data

Store the model in an OSS bucket or NAS file system. This guide uses OSS. For mounting details, see Use an ossfs 1.0 statically provisioned volume (for OSS) or Mount a statically provisioned NAS volume (for NAS).

  1. Download the Qwen1.5-0.5B-Chat model.

    git lfs install
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/Qwen/Qwen1.5-0.5B-Chat.git
    cd Qwen1.5-0.5B-Chat
    git lfs pull
  2. Upload model files to your OSS bucket. To install ossutil, see Install ossutil.

    ossutil mkdir oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
    ossutil cp -r ./Qwen1.5-0.5B-Chat oss://<your-bucket-name>/models/Qwen1.5-0.5B-Chat
  3. Create a persistent volume (PV) with the following configuration.

    Configuration item       Value
    Persistent volume type   OSS
    Name                     llm-model
    Access certificate       AccessKey ID and AccessKey secret used to access the OSS bucket
    Bucket ID                The OSS bucket that holds the uploaded model files
    OSS path                 /models/Qwen1.5-0.5B-Chat
  4. Create a persistent volume claim (PVC) bound to the PV.

    Configuration item             Value
    Persistent volume claim type   OSS
    Name                           llm-model
    Allocation mode                Select Existing persistent volume
    Existing persistent volume     Click Select Existing persistent volume, then select the PV created in the previous step
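
Because the clone in step 1 uses GIT_LFS_SKIP_SMUDGE=1, the large model files are small Git LFS pointer stubs until git lfs pull replaces them. Before uploading, it can be worth checking that no stubs remain. The following is a minimal sketch; is_lfs_pointer and list_lfs_pointers are hypothetical helper names, and the check relies on the fact that every Git LFS pointer file begins with the line "version https://git-lfs.github.com/spec/v1".

```shell
#!/bin/sh
# is_lfs_pointer <file>: succeeds if the file is still a Git LFS pointer stub.
is_lfs_pointer() {
    head -c 64 "$1" 2>/dev/null \
        | grep -q '^version https://git-lfs\.github\.com/spec/v1'
}

# list_lfs_pointers <dir>: prints every file under <dir> (outside .git)
# that is still a pointer stub. Empty output means all files were pulled.
list_lfs_pointers() {
    find "$1" -type f ! -path '*/.git/*' | while IFS= read -r f; do
        if is_lfs_pointer "$f"; then
            echo "$f"
        fi
    done
}

# Usage (not executed here):
#   list_lfs_pointers ./Qwen1.5-0.5B-Chat
```

If the command prints any paths, run git lfs pull again in the model directory before uploading.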

Step 2: Deploy the inference services

Deploy two Qwen inference services, each requesting 6 GB of GPU memory. Only --name differs between commands.

Start the first service:

arena serve kserve \
    --name=qwen1 \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
    --gpumemory=6 \
    --cpu=3 \
    --memory=8Gi \
    --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"

Start the second service with --name=qwen2.
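
Since only --name changes, the command can be wrapped in a small function to avoid copy-and-paste mistakes. This is a sketch rather than part of Arena: deploy_qwen is a hypothetical wrapper, and ARENA_BIN is a hypothetical override that allows a dry run (for example, ARENA_BIN=echo prints the arguments instead of deploying). All flags are copied unchanged from the first service.

```shell
#!/bin/sh
# deploy_qwen <service-name>: runs the Step 2 command with only --name changed.
# ARENA_BIN (hypothetical) defaults to "arena"; set ARENA_BIN=echo for a dry run.
deploy_qwen() {
    "${ARENA_BIN:-arena}" serve kserve \
        --name="$1" \
        --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
        --gpumemory=6 \
        --cpu=3 \
        --memory=8Gi \
        --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
        "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"
}

# Usage (not executed here):
#   deploy_qwen qwen2
```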

Key parameters:

Parameter     Type           Required   Description
--name        String         Yes        The service name. Must be globally unique.
--image       String         Yes        The container image.
--gpumemory   Integer (GB)   No         GPU memory allocation in GB, such as --gpumemory=6 for 6 GB. The total across all services on a node must not exceed the node's physical GPU memory.
--cpu         Integer        No         The vCPU count.
--memory      String         No         RAM allocation, such as 8Gi.
--data        String         No         PVC-to-container mount in <pvc-name>:<container-path> format. Here, the llm-model PVC mounts to /mnt/models/Qwen1.5-0.5B-Chat.

Step 3: Verify the inference services

  1. Check that both pods are running on the same GPU node.

    kubectl get pod -owide | grep qwen

    Expected output:

    qwen1-predictor-856568bdcf-5pfdq   1/1     Running   0          7m10s   10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>
    qwen2-predictor-6b477b587d-dpdnj   1/1     Running   0          4m3s    10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>

    Both pods run on the same node (cn-beijing.172.16.XX.XX), confirming GPU sharing is active.

  2. Check GPU memory per pod (one command per service):

    kubectl exec -it qwen1-predictor-856568bdcf-5pfdq -- nvidia-smi   # First service
    kubectl exec -it qwen2-predictor-6b477b587d-dpdnj -- nvidia-smi   # Second service

    Expected output for the first inference service:

    Fri Jun 28 06:20:43 2024
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
    | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    +---------------------------------------------------------------------------------------+

    Expected output for the second inference service:

    Fri Jun 28 06:40:17 2024
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
    | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    +---------------------------------------------------------------------------------------+

    Each pod sees a 6,144 MiB (6 GiB) limit, matching its --gpumemory=6 request, which confirms that GPU memory sharing works as configured.

  3. Send a test request through the NGINX Ingress gateway.

    curl -H "Host: $(kubectl get inferenceservice qwen1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)" \
         -H "Content-Type: application/json" \
         http://$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'):80/v1/chat/completions \
         -d '{
                "model": "qwen",
                "messages": [{"role": "user", "content": "This is a test."}],
                "max_tokens": 10,
                "temperature": 0.7,
                "top_p": 0.9,
                "seed": 10
             }'

    Expected output:

    {"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.completion","created":1719303373,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}

    A response confirms the inference service works correctly.
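
The same-node check in step 1 can also be scripted instead of read by eye. The sketch below uses a hypothetical helper, same_node, and assumes the default kubectl get pod -owide column layout, in which NODE is the seventh whitespace-separated field. That assumption breaks if the RESTARTS column contains extra text such as "1 (5m ago)".

```shell
#!/bin/sh
# same_node <name-pattern>: reads `kubectl get pod -owide` output on stdin.
# Succeeds only if at least two pods match the pattern and all of them
# report the same value in the NODE column (assumed to be field 7).
same_node() {
    awk -v pat="$1" '
        $1 ~ pat {
            pods++
            if (!($7 in seen)) { seen[$7] = 1; nodes++ }
        }
        END { exit !(pods >= 2 && nodes == 1) }
    '
}

# Usage (not executed here):
#   kubectl get pod -owide | same_node '^qwen[12]-predictor' \
#       && echo "pods share a node" || echo "pods are on different nodes"
```

A successful result only shows that the pods share a node. The nvidia-smi output in step 2 remains the check for the per-pod memory limit.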

(Optional) Step 4: Clean up

Delete resources when no longer needed.

Delete the inference services:

arena serve delete qwen1
arena serve delete qwen2

Delete the PVC and PV:

kubectl delete pvc llm-model
kubectl delete pv llm-model