Quickly deploy a large language model inference service in Container Service for Kubernetes (ACK)

Prerequisites

Before you begin, ensure that you have:

An ACK managed Pro cluster running Kubernetes 1.22 or later
At least one GPU-accelerated node with 16 GB or more of GPU memory
NVIDIA driver version 535 or later installed on the GPU node pool (this guide uses 550.144.03, set via the ack.aliyun.com/nvidia-driver-version label)
The Arena client installed
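
Most of these prerequisites can be checked from a terminal. The following is a minimal sketch rather than part of the official procedure: the function name `check_prerequisites` is hypothetical, and it assumes `kubectl` is already configured for the target cluster. GPU memory is not reported by these commands and still has to be confirmed from the instance type.

```shell
# Hypothetical helper: print the information needed to confirm the prerequisites.
check_prerequisites() {
  # Client and server versions; the server should report v1.22 or later.
  kubectl version || return 1

  # One row per node: name, driver-version label, allocatable GPU count.
  kubectl get nodes -o custom-columns='NAME:.metadata.name,DRIVER:.metadata.labels.ack\.aliyun\.com/nvidia-driver-version,GPUS:.status.allocatable.nvidia\.com/gpu' || return 1

  # Confirms that the Arena client is installed and on the PATH.
  arena version
}
```

Run `check_prerequisites` after defining it. Nodes that do not carry the driver label, or that expose no GPUs, show `<none>` in the corresponding column.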

Choose a deployment path

Criterion	Option 1: Quick test	Option 2: Production
Setup time	~15 minutes	Longer (model pre-upload required)
Model storage	Downloaded into the container at startup	Pre-loaded on Object Storage Service (OSS)
Cold-start	Slow — model re-downloads on every pod restart	Fast — model is already on the mounted volume
Best for	Validating inference capabilities	Stable, repeatable production workloads

Option 1: Quick deployment for testing

Use Arena to deploy qwen/Qwen1.5-4B-Chat from ModelScope. The container downloads the model at startup, so the GPU node needs at least 30 GB of free disk space.

Run the Arena command to deploy the inference service:

```
arena serve custom \
    --name=modelscope \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
    "MODEL_ID=qwen/Qwen1.5-4B-Chat python3 server.py"
```

The following output confirms the Kubernetes resources for modelscope-v1 were created:

```
service/modelscope-v1 created
deployment.apps/modelscope-v1-custom-serving created
INFO[0002] The Job modelscope has been submitted successfully
INFO[0002] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
```

Check the service status. The pod stays in ContainerCreating while the model downloads. Depending on network conditions, this can take 5–15 minutes:
```
arena serve get modelscope
```
Once the pod status shows Running, the inference service is ready.
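
In Kubernetes, a pod can report Running before its readiness probe passes, so a scripted wait is often more dependable than watching the status by eye. The sketch below blocks until the Deployment named in the Arena output (`modelscope-v1-custom-serving`) reports all replicas ready. The function name `wait_for_service` and the 20-minute default timeout are illustrative assumptions, and the `default` namespace is assumed.

```shell
# Hypothetical helper: wait until the Deployment's pods pass the readiness probe.
wait_for_service() {
  local deployment="${1:-modelscope-v1-custom-serving}"
  local timeout="${2:-20m}"   # assumed upper bound; raise it for slow networks
  kubectl rollout status "deployment/${deployment}" --timeout="${timeout}"
}
```

Calling `wait_for_service` with no arguments returns a nonzero exit code if the rollout does not finish within the timeout. While waiting, `kubectl logs -f deployment/modelscope-v1-custom-serving` streams the container log.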

Option 2: Production-ready deployment with persistent storage

Pre-loading model files on OSS avoids re-downloading files larger than 10 GB every time a pod restarts. This reduces cold-start times, lowers bandwidth costs, and improves service stability.

Step 1: Download the model files

Install Git and Git Large File Storage (LFS).

macOS
```
brew install git
brew install git-lfs
```
Windows

Download and install Git from the official Git website. Git LFS is bundled with Git for Windows, so installing the latest version is sufficient.

Linux (Red Hat-based)
```
yum install git
yum install git-lfs
```
For other Linux distributions, see the official Git website.

Clone the Qwen1.5-4B-Chat model repository and pull the large files:

```
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
cd Qwen1.5-4B-Chat
git lfs pull
```
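
Because the clone skips LFS content (`GIT_LFS_SKIP_SMUDGE=1`), any file that `git lfs pull` fails to fetch remains a small text pointer instead of real model weights. The sketch below lists such leftover pointers so that an incomplete download is caught before upload. The function name `find_lfs_pointers` is hypothetical; the check relies on the Git LFS pointer format, whose first line starts with `version https://git-lfs.github.com/spec/v1`.

```shell
# Hypothetical helper: print every file that is still a Git LFS pointer.
find_lfs_pointers() {
  local repo_dir="${1:-.}"
  # Pointer files are tiny text stubs, so only files of at most 1 KiB are inspected.
  find "$repo_dir" -path "$repo_dir/.git" -prune -o -type f -size -2k -print |
    while IFS= read -r file; do
      # The first line of a pointer file names the LFS spec version.
      if head -c 64 "$file" | grep -q '^version https://git-lfs.github.com/spec/v1'; then
        printf '%s\n' "$file"
      fi
    done
}
```

Run `find_lfs_pointers .` inside the cloned directory. Empty output means every LFS file was replaced with its real content; otherwise, rerun `git lfs pull`.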

Step 2: Upload model files to OSS

Install and configure ossutil.
Create a bucket. To reduce model pull latency, create the bucket in the same region as your cluster:
```
ossutil mb oss://<your-bucket-name>
```

Create a folder in the bucket for the model files:

```
ossutil mkdir oss://<your-bucket-name>/Qwen1.5-4B-Chat
```

Upload the model files:

```
ossutil cp -r ./Qwen1.5-4B-Chat oss://<your-bucket-name>/Qwen1.5-4B-Chat
```
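
Before creating the volume, it is worth confirming that the objects landed under the expected prefix, because the PV in Step 3 mounts `/Qwen1.5-4B-Chat` directly. A minimal sketch, assuming ossutil is configured as in the first step of this section; the function name `list_model_objects` is hypothetical:

```shell
# Hypothetical helper: list the uploaded objects under the model prefix.
list_model_objects() {
  local bucket="$1"
  if [ -z "$bucket" ]; then
    echo "usage: list_model_objects <bucket-name>" >&2
    return 2
  fi
  ossutil ls "oss://${bucket}/Qwen1.5-4B-Chat/"
}
```

The listing should show the model's configuration and weight files directly under `Qwen1.5-4B-Chat/`. If they appear one level deeper, adjust the OSS path of the PV to match.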

Step 3: Configure a persistent volume (PV)

Log on to the ACK console and click the target cluster. In the left navigation pane, choose Volumes > Persistent Volumes.

Click Create. In the Create PV dialog box, set the following parameters and click Create:

Parameter	Value
PV type	`OSS`
Volume name	`llm-model`
Capacity	`20Gi`
Access mode	`ReadOnlyMany`
Access certificate	Select Create Secret
Optional parameters	`-o umask=022 -o max_stat_cache_size=0 -o allow_other`
Bucket ID	Click Select Bucket and select your bucket
OSS path	`/Qwen1.5-4B-Chat`
Endpoint	Select Public Endpoint

Step 4: Configure a persistent volume claim (PVC)

In the left navigation pane, choose Volumes > Persistent Volume Claims.

On the Persistent Volume Claims page, click Create. In the dialog box that appears, set the following parameters and click Create:

Parameter	Value
PVC type	`OSS`
Name	`llm-model`
Allocation mode	Select Existing Volumes
Existing volumes	Select the `llm-model` PV created in the previous step
Capacity	`20Gi`
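
The claim is usable only after both objects reach the `Bound` phase. The following sketch checks this from the command line; it assumes the PVC was created in the `default` namespace, and the function name `check_volume_bound` is hypothetical.

```shell
# Hypothetical helper: succeed only if the PV and PVC are both Bound.
check_volume_bound() {
  local name="${1:-llm-model}"
  local pv_phase pvc_phase
  pv_phase="$(kubectl get pv "$name" -o jsonpath='{.status.phase}')" || return 1
  pvc_phase="$(kubectl get pvc "$name" -o jsonpath='{.status.phase}')" || return 1
  echo "PV ${name}: ${pv_phase}, PVC ${name}: ${pvc_phase}"
  [ "$pv_phase" = "Bound" ] && [ "$pvc_phase" = "Bound" ]
}
```

If either phase is not `Bound`, `kubectl describe pvc llm-model` usually shows the reason in its events.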

Step 5: Deploy the inference service

Run the Arena command to deploy the service. The --data flag mounts the PVC containing the pre-loaded model files. Because the model is already on the mounted volume, the pod starts without downloading anything:

```
arena serve custom \
    --name=modelscope \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --data=llm-model:/Qwen1.5-4B-Chat \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
    "MODEL_ID=/Qwen1.5-4B-Chat python3 server.py"
```

The following output confirms the inference service was submitted:

```
service/modelscope-v1 created
deployment.apps/modelscope-v1-custom-serving created
INFO[0001] The Job modelscope has been submitted successfully
INFO[0001] You can run `arena serve get modelscope --type custom-serving -n default` to check the job status
```

Check the service status:

```
arena serve get modelscope
```

Once the pod status shows Running, the inference service is ready.
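
If the pod starts but the server cannot find the model, a useful first check is whether the OSS files are visible at the mount path. A minimal sketch, assuming the image provides `ls` and using the Deployment name from the Arena output; the function name `list_mounted_model` is hypothetical:

```shell
# Hypothetical helper: list the model files as seen from inside the container.
list_mounted_model() {
  local deployment="${1:-modelscope-v1-custom-serving}"
  local mount_path="${2:-/Qwen1.5-4B-Chat}"
  kubectl exec "deployment/${deployment}" -- ls -lh "$mount_path"
}
```

An empty listing suggests that the OSS path in the PV does not match the prefix used for the upload.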

Validate the inference service

Important
kubectl port-forward is for development and debugging only. It is not reliable, secure, or scalable in production. For production networking, see Ingress management.

Set up port forwarding to the inference service:
```
kubectl port-forward svc/modelscope-v1 8000:8000
```
Expected output:
```
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
```

In a new terminal, send a test inference request:

```
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "What is artificial intelligence? Artificial intelligence is",
    "parameters": {
      "stream": false,
      "temperature": 0.9,
      "seed": 10
    }
  }'
```

A successful response contains the model's generated text:

```
{"model_name":"/Qwen1.5-4B-Chat","text_output":"What is artificial intelligence? Artificial intelligence is a branch of computer science that studies how to make computers have intelligent behavior."}
```
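
For repeated tests, the request and the extraction of `text_output` can be wrapped in a small function. This is a sketch under stated assumptions: the function name `generate` is hypothetical, `python3` is assumed to be available on the local machine for building and parsing JSON, and the request fields mirror the curl example above.

```shell
# Hypothetical helper: send one prompt and print only the generated text.
generate() {
  local prompt="$1"
  local url="${2:-http://localhost:8000/generate}"
  local payload
  # Build the JSON body with python3 so quotes in the prompt are escaped correctly.
  payload="$(python3 -c 'import json, sys; print(json.dumps({"text_input": sys.argv[1], "parameters": {"stream": False, "temperature": 0.9, "seed": 10}}))' "$prompt")" || return 1
  # Post the request and extract the text_output field from the response.
  curl -sS -X POST "$url" -H "Content-Type: application/json" -d "$payload" |
    python3 -c 'import json, sys; print(json.load(sys.stdin)["text_output"])'
}
```

With port forwarding still running, `generate "What is artificial intelligence?"` prints the model's answer as plain text.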

(Optional) Clean up

Delete the inference service and storage resources when you're done:

```
# Delete the inference service
arena serve del modelscope

# Delete the PVC and PV (Option 2 only)
kubectl delete pvc llm-model
kubectl delete pv llm-model
```

FAQ

How can I pull model files from Hugging Face instead of ModelScope?

Make sure the container runtime can reach the Hugging Face repository, then set MODEL_SOURCE=Huggingface in the Arena command. The GPU node needs at least 30 GB of free disk space to accommodate the downloaded files:

```
arena serve custom \
    --name=huggingface \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1 \
    "MODEL_ID=Qwen/Qwen1.5-4B-Chat MODEL_SOURCE=Huggingface python3 server.py"
```

The following output confirms the resources were created:

```
service/huggingface-v1 created
deployment.apps/huggingface-v1-custom-serving created
INFO[0003] The Job huggingface has been submitted successfully
INFO[0003] You can run `arena serve get huggingface --type custom-serving -n default` to check the job status
```

Appendix: command parameter reference

Parameter	Description	Example
`serve custom`	Arena subcommand. Deploys a custom model service rather than a preset type such as `tfserving` or `triton`.	—
`--name`	Service name. A unique identifier used for subsequent operations such as checking logs and deleting the service.	`modelscope`
`--version`	Service version. A version label for the service, useful for version management and phased releases.	`v1`
`--gpus`	GPU count. The number of GPUs allocated to each pod. Required when the model needs GPUs for inference.	`1`
`--replicas`	Replica count. The number of pods to run. More replicas increase concurrent throughput and availability.	`1`
`--restful-port`	RESTful API port. The port on which the service exposes its RESTful API to receive inference requests.	`8000`
`--readiness-probe-action`	Readiness probe type. The check method used by the Kubernetes readiness probe to determine whether the container is ready to receive traffic.	`tcpSocket`
`--readiness-probe-action-option`	Probe type options. Parameters for the chosen probe type. For `tcpSocket`, specifies the port to check.	`port: 8000`
`--readiness-probe-option`	Additional probe settings. Extra parameters for the readiness probe, such as the initial delay and the check interval. Repeat the flag to set more than one option.	`initialDelaySeconds: 30`, `periodSeconds: 30`
`--data`	Volume mount. Mounts a PVC at a specified path inside the container, in the format `<pvc-name>:<mount-path>`. Used to mount pre-loaded model files.	`llm-model:/Qwen1.5-4B-Chat`
`--image`	Container image. The full URL of the container image that defines the runtime environment for the service.	`kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/quick-deploy-llm:v1`
`[COMMAND]`	Startup command. The command to run after the container starts. Sets the `MODEL_ID` environment variable and launches `server.py`.	`"MODEL_ID=/Qwen1.5-4B-Chat python3 server.py"`

What's next

To specify an NVIDIA driver version for GPU nodes, see Specify an NVIDIA driver version for nodes by adding a label.
To use production-grade inference frameworks such as vLLM or Triton, see Deploy a Qwen model inference service using vLLM and Deploy a Qwen model inference service using Triton.