ACK managed Pro clusters give you a ready-to-use environment for running large language model (LLM) inference services—no local GPU hardware required, no complex dependency setup. This guide covers two deployment paths: a quick option for validating a model in about 15 minutes, and a production-grade option that pre-loads model files onto persistent storage to reduce cold-start times and bandwidth costs.
Prerequisites
Before you begin, ensure that you have:
-
An ACK managed Pro cluster running Kubernetes 1.22 or later
-
At least one GPU-accelerated node with 16 GB or more of GPU memory
-
NVIDIA driver version 535 or later installed on the GPU node pool (this guide uses 550.144.03, set via the ack.aliyun.com/nvidia-driver-version label)
-
The Arena client installed
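The driver requirement above can be checked from a shell before you deploy. This is a minimal sketch, not part of the ACK tooling: `driver_ok` is a hypothetical helper name, and the commented usage assumes `nvidia-smi` is available on the GPU node.

```shell
# Hypothetical helper: succeeds when the driver's major version is 535 or later.
driver_ok() {
  major=${1%%.*}          # "550.144.03" -> "550"
  [ "$major" -ge 535 ]
}

# On a GPU node, the installed version can be read with nvidia-smi, for example:
#   version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_ok "$version" && echo "driver OK" || echo "driver too old"
```

The check compares only the major version, which is enough for the "535 or later" threshold stated above.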
Choose a deployment path
| | Option 1: Quick test | Option 2: Production |
|---|---|---|
| Setup time | ~15 minutes | Longer (model pre-upload required) |
| Model storage | Downloaded into the container at startup | Pre-loaded on Object Storage Service (OSS) |
| Cold-start | Slow — model re-downloads on every pod restart | Fast — model is already on the mounted volume |
| Best for | Validating inference capabilities | Stable, repeatable production workloads |
Option 1: Quick deployment for testing
Use Arena to deploy qwen/Qwen1.5-4B-Chat from ModelScope. The container downloads the model at startup, so the GPU node needs at least 30 GB of free disk space.
-
Run the Arena command to deploy the inference service:
arena serve custom \
--name=modelscope \
--version=v1 \
--gpus=1 \
--replicas=1 \
--restful-port=8000 \
--readiness-probe-action="tcpSocket" \
--readiness-probe-action-option="port: 8000" \
--readiness-probe-option="initialDelaySeconds: 30" \
--readiness-probe-option="periodSeconds: 30" \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
"MODEL_ID=qwen/Qwen1.5-4B-Chat python3 server.py"
The following output confirms the Kubernetes resources for modelscope-v1 were created:
service/modelscope-v1 created
deployment.apps/modelscope-v1-custom-serving created
INFO[0002] The Job modelscope has been submitted successfully
INFO[0002] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
-
Check the service status. The pod stays in ContainerCreating while the model downloads. Depending on network conditions, this can take 5–15 minutes:
arena serve get modelscope
Once the pod status shows Running, the inference service is ready.
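Because the download can take several minutes, the status check can be wrapped in a small retry loop instead of being rerun by hand. The sketch below is one possible approach: `wait_until` is a hypothetical helper, and the commented usage assumes the deployment name shown in the output above.

```shell
# Hypothetical helper: runs a command up to ATTEMPTS times, sleeping DELAY
# seconds between tries, and succeeds as soon as the command succeeds.
wait_until() {
  wu_attempts=$1
  wu_delay=$2
  shift 2
  wu_i=0
  while [ "$wu_i" -lt "$wu_attempts" ]; do
    if "$@"; then
      return 0
    fi
    wu_i=$((wu_i + 1))
    sleep "$wu_delay"
  done
  return 1
}

# Example: poll every 30 seconds for up to 20 minutes.
#   wait_until 40 30 kubectl rollout status \
#     deployment/modelscope-v1-custom-serving --timeout=10s
```

`kubectl rollout status` exits with a non-zero code when its timeout expires, so the loop keeps polling until the deployment reports ready or the attempts run out.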
Option 2: Production-ready deployment with persistent storage
Pre-loading model files on OSS avoids re-downloading files larger than 10 GB every time a pod restarts. This reduces cold-start times, lowers bandwidth costs, and improves service stability.
Step 1: Download the model files
-
Install Git and Git Large File Storage (LFS).
macOS
brew install git
brew install git-lfs
Windows
Download and install Git from the official Git website. Git LFS is bundled with Git for Windows, so installing the latest version is enough.
Linux (Red Hat-based)
yum install git
yum install git-lfs
For other Linux distributions, see the official Git website.
-
Clone the Qwen1.5-4B-Chat model repository and pull the large files:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
cd Qwen1.5-4B-Chat
git lfs pull
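With GIT_LFS_SKIP_SMUDGE=1, the clone initially contains small pointer files, which git lfs pull then replaces with the real weights. If the pull is interrupted, some pointers may remain. The sketch below looks for leftovers. The helper names are hypothetical, and the check relies on LFS pointer files starting with the line `version https://git-lfs.github.com/spec/v1`.

```shell
# Hypothetical helper: succeeds when FILE is still a Git LFS pointer stub
# rather than the real file content.
is_lfs_pointer() {
  head -c 200 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}

# Prints every file under DIR (excluding .git) that is still a pointer stub.
find_lfs_pointers() {
  find "$1" -type f ! -path '*/.git/*' | while IFS= read -r f; do
    if is_lfs_pointer "$f"; then
      echo "$f"
    fi
  done
}

# Example: no output means every large file was pulled.
#   find_lfs_pointers ./Qwen1.5-4B-Chat
```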
Step 2: Upload model files to OSS
-
Create a bucket. To reduce model pull latency, create the bucket in the same region as your cluster:
ossutil mb oss://<your-bucket-name>
-
Create a folder in the bucket for the model files:
ossutil mkdir oss://<your-bucket-name>/Qwen1.5-4B-Chat
-
Upload the model files:
ossutil cp -r ./Qwen1.5-4B-Chat oss://<your-bucket-name>/Qwen1.5-4B-Chat
Step 3: Configure a persistent volume (PV)
-
Log on to the ACK console and click the target cluster. In the left navigation pane, choose Volumes > Persistent Volumes.
-
Click Create. In the Create PV dialog box, set the following parameters and click Create:
| Parameter | Value |
|---|---|
| PV type | OSS |
| Volume name | llm-model |
| Capacity | 20Gi |
| Access mode | ReadOnlyMany |
| Access certificate | Select Create Secret |
| Optional parameters | -o umask=022 -o max_stat_cache_size=0 -o allow_other |
| Bucket ID | Click Select Bucket and select your bucket |
| OSS path | /Qwen1.5-4B-Chat |
| Endpoint | Select Public Endpoint |
Step 4: Configure a persistent volume claim (PVC)
-
In the left navigation pane, choose Volumes > Persistent Volume Claims.
-
On the Persistent Volume Claims page, set the following parameters and click Create:
| Parameter | Value |
|---|---|
| PVC type | OSS |
| Name | llm-model |
| Allocation mode | Select Existing Volumes |
| Existing volumes | Select the llm-model PV created in the previous step |
| Capacity | 20Gi |
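If you prefer kubectl to the console, the PV and PVC settings from Steps 3 and 4 can be approximated with manifests. Treat the following as a sketch under stated assumptions rather than the exact objects the console creates:

- The CSI driver name and the `volumeAttributes` keys are recalled from the ACK OSS CSI plugin and should be verified against its documentation.
- The Secret name, bucket name, and endpoint are placeholders that you supply.
- The PVC binds to the PV through `volumeName` with an empty `storageClassName`, which is standard Kubernetes static binding.

```shell
# Hypothetical helper: prints PV and PVC manifests that mirror the console settings.
# Arguments: bucket name, OSS endpoint, name of a Secret holding the AccessKey pair.
print_llm_model_manifests() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com   # assumed ACK OSS CSI driver name
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: $3
      namespace: default
    volumeAttributes:
      bucket: "$1"
      url: "$2"
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/Qwen1.5-4B-Chat"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  storageClassName: ""
  volumeName: llm-model
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 20Gi
EOF
}

# Example (review the output before applying it):
#   print_llm_model_manifests <your-bucket-name> <oss-endpoint> <secret-name> | kubectl apply -f -
```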
Step 5: Deploy the inference service
Run the Arena command to deploy the service. The --data flag mounts the PVC containing the pre-loaded model files. Because the model is already on the mounted volume, the pod starts without downloading anything:
arena serve custom \
--name=modelscope \
--version=v1 \
--gpus=1 \
--replicas=1 \
--restful-port=8000 \
--readiness-probe-action="tcpSocket" \
--readiness-probe-action-option="port: 8000" \
--readiness-probe-option="initialDelaySeconds: 30" \
--readiness-probe-option="periodSeconds: 30" \
--data=llm-model:/Qwen1.5-4B-Chat \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
"MODEL_ID=/Qwen1.5-4B-Chat python3 server.py"
The following output confirms the inference service was submitted:
service/modelscope-v1 created
deployment.apps/modelscope-v1-custom-serving created
INFO[0001] The Job modelscope has been submitted successfully
INFO[0001] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
Check the service status:
arena serve get modelscope
Once the pod status shows Running, the inference service is ready.
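The Option 1 and Option 2 commands differ only in the --data flag and the MODEL_ID value. The sketch below makes that difference explicit by printing either variant without running it. `arena_cmd` is a hypothetical helper, and it uses only the flags that appear in this guide.

```shell
# Hypothetical helper: prints the arena command for a service name, a model ID,
# and an optional "<pvc-name>:<mount-path>" volume. It does not run anything.
arena_cmd() {
  ac_name=$1
  ac_model_id=$2
  ac_data=${3:-}
  printf 'arena serve custom --name=%s --version=v1 --gpus=1 --replicas=1 --restful-port=8000' "$ac_name"
  printf ' --readiness-probe-action="tcpSocket" --readiness-probe-action-option="port: 8000"'
  printf ' --readiness-probe-option="initialDelaySeconds: 30" --readiness-probe-option="periodSeconds: 30"'
  if [ -n "$ac_data" ]; then
    printf ' --data=%s' "$ac_data"
  fi
  printf ' --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1'
  printf ' "MODEL_ID=%s python3 server.py"\n' "$ac_model_id"
}

# Option 1 (download at startup):
#   arena_cmd modelscope qwen/Qwen1.5-4B-Chat
# Option 2 (model pre-loaded on the mounted PVC):
#   arena_cmd modelscope /Qwen1.5-4B-Chat llm-model:/Qwen1.5-4B-Chat
```

Printing the command first gives you a chance to review it before pasting it into a terminal.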
Validate the inference service
-
Set up port forwarding to the inference service:
Important: kubectl port-forward is for development and debugging only. It is not reliable, secure, or scalable in production. For production networking, see Ingress management.
kubectl port-forward svc/modelscope-v1 8000:8000
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
-
In a new terminal, send a test inference request:
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
  "text_input": "What is artificial intelligence? Artificial intelligence is",
  "parameters": {
    "stream": false,
    "temperature": 0.9,
    "seed": 10
  }
}'
A successful response contains the model's generated text:
{"model_name":"/Qwen1.5-4B-Chat","text_output":"What is artificial intelligence? Artificial intelligence is a branch of computer science that studies how to make computers have intelligent behavior."}
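The manual check above can be turned into a scripted smoke test. The sketch below assumes the response shape shown in this guide (a JSON object with a text_output field) and that the port-forward is still running. Both function names are hypothetical.

```shell
# Hypothetical helper: succeeds when the JSON on stdin has a non-empty text_output field.
has_text_output() {
  grep -q '"text_output"[[:space:]]*:[[:space:]]*"[^"]'
}

# Sends the same request as the curl example and checks the reply.
smoke_test() {
  curl -sS --max-time 120 -X POST http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"text_input": "What is artificial intelligence? Artificial intelligence is", "parameters": {"stream": false, "temperature": 0.9, "seed": 10}}' \
    | has_text_output
}

# Example:
#   smoke_test && echo "inference OK" || echo "inference failed"
```

The pattern match is deliberately simple. For anything beyond a smoke test, a real JSON parser is the safer choice.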
(Optional) Clean up
Delete the inference service and storage resources when you're done:
# Delete the inference service
arena serve del modelscope
# Delete the PVC and PV (Option 2 only)
kubectl delete pvc llm-model
kubectl delete pv llm-model
FAQ
How can I pull model files from Hugging Face instead of ModelScope?
Make sure the container runtime can reach the Hugging Face repository, then set MODEL_SOURCE=Huggingface in the Arena command. The GPU node needs at least 30 GB of free disk space to accommodate the downloaded files:
arena serve custom \
--name=huggingface \
--version=v1 \
--gpus=1 \
--replicas=1 \
--restful-port=8000 \
--readiness-probe-action="tcpSocket" \
--readiness-probe-action-option="port: 8000" \
--readiness-probe-option="initialDelaySeconds: 30" \
--readiness-probe-option="periodSeconds: 30" \
--image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
"MODEL_ID=Qwen/Qwen1.5-4B-Chat MODEL_SOURCE=Huggingface python3 server.py"
The following output confirms the resources were created:
service/huggingface-v1 created
deployment.apps/huggingface-v1-custom-serving created
INFO[0003] The Job huggingface has been submitted successfully
INFO[0003] You can run `arena serve get huggingface --type custom-serving -n default` to check the job status
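Before deploying with MODEL_SOURCE=Huggingface, it can help to confirm that Hugging Face is reachable from inside the cluster network. The sketch below is one possible check: `is_reachable_status` is a hypothetical helper, and the commented curl call should be run from a node or pod in the cluster rather than from your workstation.

```shell
# Hypothetical helper: treats any 2xx or 3xx HTTP status as "reachable".
is_reachable_status() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (curl reports 000 when no connection could be made):
#   code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 https://huggingface.co)
#   is_reachable_status "$code" && echo "reachable" || echo "not reachable (HTTP $code)"
```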
Appendix: command parameter reference
| Parameter | Description | Example |
|---|---|---|
| serve custom | Arena subcommand. Deploys a custom model service rather than a preset type such as tfserving or triton. | N/A |
| --name | Service name. A unique identifier used for subsequent operations such as checking logs and deleting the service. | modelscope |
| --version | Service version. A version label for the service, useful for version management and phased releases. | v1 |
| --gpus | GPU count. The number of GPUs allocated to each pod. Required when the model needs GPUs for inference. | 1 |
| --replicas | Replica count. The number of pods to run. More replicas increase concurrent throughput and availability. | 1 |
| --restful-port | RESTful API port. The port on which the service exposes its RESTful API to receive inference requests. | 8000 |
| --readiness-probe-action | Readiness probe type. The check method used by the Kubernetes readiness probe to determine whether the container is ready to receive traffic. | tcpSocket |
| --readiness-probe-action-option | Probe type options. Parameters for the chosen probe type. For tcpSocket, specifies the port to check. | port: 8000 |
| --readiness-probe-option | Additional probe settings. Extra parameters for the readiness probe. This flag can be repeated. Sets the initial delay and check interval. | initialDelaySeconds: 30, periodSeconds: 30 |
| --data | Volume mount. Mounts a PVC at a specified path inside the container, in the format <pvc-name>:<mount-path>. Used to mount pre-loaded model files. | llm-model:/Qwen1.5-4B-Chat |
| --image | Container image. The full URL of the container image that defines the runtime environment for the service. | kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 |
| [COMMAND] | Startup command. The command to run after the container starts. Sets the MODEL_ID environment variable and launches server.py. | "MODEL_ID=/Qwen1.5-4B-Chat python3 server.py" |
What's next
-
To specify an NVIDIA driver version for GPU nodes, see Specify an NVIDIA driver version for nodes by adding a label.
-
To use production-grade inference frameworks such as vLLM or Triton, see Deploy a Qwen model inference service using vLLM and Deploy a Qwen model inference service using Triton.