FunModel lets you deploy your own or open source pre-trained models as online API services. This topic guides you on how to select a suitable deployment plan and use vLLM, SGLang, or a custom image to deploy, invoke, and manage your model.
Preparations
Before you begin, make sure that you have an Alibaba Cloud account and are logged on to the FunModel console.
Switch to the new console: If you are using the old version, click New Console in the upper-right corner of the page.
Complete authorization: When you log on for the first time, follow the on-screen instructions to configure settings, such as RAM role authorization.
Choosing a deployment plan
Before you deploy a model, you should understand the following concepts to help you select the right plan for your business needs.
Deployment mode comparison: Elastic instances vs. provisioned instances
FunModel provides two instance types to suit scenarios with different workload characteristics. Elastic instances are a flexible, pay-as-you-go model that focuses on cost control. Provisioned instances are a stable, long-term reservation model that focuses on performance assurance. For more information, see Instance types and specifications.
Inference framework comparison: vLLM vs. SGLang
FunModel has built-in support for vLLM and SGLang, two mainstream high-performance inference frameworks. vLLM and SGLang deployments currently support only large language models (LLMs).
Comparison | vLLM | SGLang | Recommendation |
Core features | Achieves high throughput and efficient GPU memory management with PagedAttention technology. | Designed for complex LLM programs. Optimizes structured output and multi-path parallel inference with RadixAttention. | vLLM is an excellent choice for general-purpose, high-performance inference. It has an active community and supports a wide range of models. SGLang has an advantage in scenarios that require complex prompt engineering, parallel generation, or structured output such as JSON. |
Model compatibility | Supports most mainstream Transformer-based models. | Also supports a wide range of Transformer models. May offer better compatibility for certain complex structures. | Both are highly compatible. vLLM is usually sufficient. If you encounter a specific model or advanced use case that vLLM does not support, try SGLang. |
Scenarios | General LLM inference tasks such as text generation, dialogue, and summarization. | Agent applications, multi-role dialogue simulation, JSON mode output, and chain-of-thought (CoT) inference. | - |
Note: vLLM is a general-purpose, high-performance inference solution with broad community support. SGLang may provide better support when you need to handle complex prompt engineering or structured output.
Model source description
FunModel supports loading models from different sources. The model files are ultimately mounted to the /mnt/ directory of the instance.
Model source | Model path description |
ModelScope Model ID | Provide the model ID, such as |
Object Storage Service (OSS) | Provide the full OSS path. The files are mounted to |
File Storage NAS (NAS) | Provide the absolute NAS path. The path is mounted to |
Deployment steps
Method 1: Deploy using a pre-built framework (vLLM/SGLang)
This method is suitable for most standard LLMs. It does not require you to build an image and is the recommended way to quickly deploy a model.
In the FunModel console, click Custom Development.
Select a Model Environment and enter a Model Name and Model Description.
Select a Model Source and enter the path. For more information, see Model source description.
Select an Instance Type (Elastic Instance or Provisioned Instance) and the required instance specification.
Note: The selected Instance Type, especially its GPU memory, must meet the model's runtime requirements. Otherwise, the deployment may fail due to an out-of-memory (OOM) error.
Configure the Start Command and Listening Port. The start command is executed after the container starts, loading the model and starting the inference service. A default start command is provided based on the selected model source, which you can modify as needed.
vLLM start command example
# Deploy the Qwen3-0.6B model
# /mnt/Qwen3-0.6B is the model's mount path in the container. It must match the mount path from the model source.
# --served-model-name: The model name used in API calls.
# --tensor-parallel-size: The tensor parallelism size, usually set to the number of GPUs in the instance.
vllm serve /mnt/Qwen3-0.6B \
  --served-model-name Qwen/Qwen3-0.6B \
  --port 9000 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192
Note: For more information about start parameters, see the official vLLM documentation.
SGLang start command example
# Deploy the Llama3-8B model
# --model-path: Specifies the model's mount path in the container.
# --host: The service listening address. It must be set to 0.0.0.0.
# --tp-size: The tensor parallelism size.
python3 -m sglang.launch_server \
  --model-path /mnt/Llama3-8B-Instruct \
  --host 0.0.0.0 \
  --port 9000 \
  --tp-size 1
Note: For more information about start parameters, see the official SGLang documentation.
Role Name: The RAM role for accessing cloud resources. This role must have the necessary permissions. Select AliyunFCDefaultRole.
After you confirm the configurations, click Create Model Service.
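The note in the steps above about GPU memory can be made concrete with a rough estimate. The following is a minimal sketch, not a FunModel tool: the helper names are hypothetical, it assumes that model weights dominate memory use and that tensor parallelism splits them evenly across GPUs, and it ignores the KV cache and activation memory. Treat a passing result as necessary but not sufficient.

```python
# Rough lower-bound check: do the model weights fit in the GPU memory budget?
# All names here are hypothetical; the numbers are estimates, not guarantees.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}
GIB = 1024 ** 3


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


def weights_fit(
    num_params: float,
    gpu_memory_gib: float,
    dtype: str = "bf16",
    gpu_memory_utilization: float = 0.9,
    tensor_parallel_size: int = 1,
) -> bool:
    """Return True if each GPU's share of the weights fits in its usable memory.

    The usable memory is gpu_memory_gib scaled by gpu_memory_utilization,
    mirroring the --gpu-memory-utilization start parameter.
    """
    per_gpu = weight_memory_gib(num_params, dtype) / tensor_parallel_size
    budget = gpu_memory_gib * gpu_memory_utilization
    return per_gpu < budget


if __name__ == "__main__":
    # Assumes roughly 0.6 billion parameters, as the model name Qwen3-0.6B suggests.
    print(f"Weights: {weight_memory_gib(0.6e9):.2f} GiB")
    print("8B model in bf16 on a 16 GiB GPU fits:", weights_fit(8e9, gpu_memory_gib=16))
```

Because the KV cache grows with --max-model-len and the number of concurrent requests, leave generous headroom beyond the weight size when you choose an instance specification.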
Method 2: Deploy using a custom image
When the default runtime environment cannot meet your specific needs, you can deploy a model using a custom image for maximum flexibility. Use this method in the following scenarios:
You need a specific framework version: You need to use a specific version of an inference framework, such as vLLM or SGLang, instead of the version pre-installed on the platform.
You have special package dependencies: Your application requires specific operating system dependencies, such as libraries installed with apt, Python packages, or special environment configurations.
You want to integrate the model and code for offline deployment: You want to package the model files directly into the image to enable offline deployment or faster startup. This is suitable for scenarios where the model is part of the application.
Example: The offline service image for PaddleOCR-VL is serverless-registry.cn-hangzhou.cr.aliyuncs.com/functionai/paddlex-genai-vllm-server:20251111-offline.
Procedure
Prepare a custom image
Write a Dockerfile to build a Docker image that contains all dependencies and code. Use an official CUDA image as the base image. The following is an example Dockerfile:
# 1. Base image
# Select a CUDA 12.1 image compatible with vLLM. The 'devel' version includes build tools, which is helpful for models that require just-in-time compilation.
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

# 2. Set environment variables
# - Prevents apt-get from asking interactive questions during the build
# - Sets the time zone for easier log viewing
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=Asia/Shanghai

# 3. Install system dependencies
# - Run apt-get update, install, and clean the cache in one step to reduce the image layer size
# - Install python3, pip, git (some models may need to clone from Hugging Face), and build-essential (for compiling dependencies)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    python3.10 \
    python3-pip \
    git \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# 4. Set the working directory
WORKDIR /app

# 5. Install Python dependencies
# - Upgrade pip
# - Use --no-cache-dir to avoid caching and reduce image size
# - Do not specify a vllm version number to install the latest stable version from PyPI by default
# - Also install transformers, as vLLM often needs it to handle the tokenizer
RUN pip3 install --no-cache-dir --upgrade pip && \
    pip3 install --no-cache-dir \
    vllm \
    transformers

# 6. Expose the service port
# The default port for the new vLLM OpenAI-compatible service is 8000
EXPOSE 8000

# 7. Define the container start command
# - Use `python -m vllm.entrypoints.openai.api_server` as the new standard startup method
# - Pass the model path, host, port, and other arguments
# - The `--model` parameter points to the model directory you will mount
CMD [ \
    "python3", \
    "-m", "vllm.entrypoints.openai.api_server", \
    "--host", "0.0.0.0", \
    "--port", "8000", \
    "--model", "/data/models/your-model-name", \
    "--trust-remote-code" \
]
Build and push the image to Alibaba Cloud ACR
Push the built image to your Alibaba Cloud Container Registry (ACR) repository. For more information, see Push and pull images using a Personal Edition instance.
Deploy in FunModel
When you create the model service, set Model Environment to Custom Environment.
For Container Image, enter the path to your ACR image, such as serverless-registry.cn-hangzhou.cr.aliyuncs.com/functionai/vllm-openai:v0.11.0.
Enter the Model Source and related path information. For more information, see Model source description. For example, select OSS as the model source and specify the path to the model that you prepared in OSS.
Select an Instance Type: Elastic Instance or Provisioned Instance.
Configure the Start Command and Listening Port. For example:
vllm serve /mnt/fun-model-test/Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --port 9000 --trust-remote-code
For more information about the parameters, see Method 1: Deploy using a pre-built framework (vLLM/SGLang). If your Dockerfile already includes a CMD or ENTRYPOINT, you can leave this parameter empty.
Note: Whether in the Dockerfile or in the Start Command field in the console, your service listening address host must be set to 0.0.0.0.
Click Create Model Service and wait for the service to start.
Develop a custom image using DevPod
If a model cannot be run directly from the command line, we recommend that you use a custom image to start a DevPod. In the DevPod, you can develop and debug the model, and then package the debugged environment into an image to complete the deployment. For more information, see DeepSeek-OCR Quick Start Guide.
Official pre-built images
To fine-tune the official environment, you can use a FunModel pre-built image as your base image:
Regions in China: serverless-registry.cn-hangzhou.cr.aliyuncs.com/functionai/vllm-openai:<tag>
Regions outside China: serverless-registry.ap-southeast-1.cr.aliyuncs.com/functionai/vllm-openai:<tag>
Available <tag> values include v0.10.1, v0.10.2, and v0.11.0. To make sure that you are using the latest environment, see the following link for a complete list of image versions: FunModel Pre-built Image List.
Service O&M
After the service is deployed, you can invoke the service, monitor its performance, and troubleshoot issues.
Service invocation
On the service details page, obtain the endpoint and authentication token. Services deployed with vLLM and SGLang are usually compatible with the OpenAI API format.
cURL call example
# Replace <YOUR_ENDPOINT> and <YOUR_TOKEN> with your actual information.
ENDPOINT_URL="<YOUR_ENDPOINT>"
TOKEN="<YOUR_TOKEN>"
# The model value must match the --served-model-name in the start command.
curl "$ENDPOINT_URL/v1/chat/completions" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "stream": false
  }'
Python SDK call example
import requests

# Replace <YOUR_ENDPOINT> and <YOUR_TOKEN> with your actual information.
ENDPOINT_URL = "<YOUR_ENDPOINT>/v1/chat/completions"
TOKEN = "<YOUR_TOKEN>"

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}
data = {
    # The model name here must match the --served-model-name in the start command.
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    "stream": False,
}

# json=data serializes the payload to JSON for the request body.
response = requests.post(ENDPOINT_URL, headers=headers, json=data, timeout=60)
if response.status_code == 200:
    print(response.json())
else:
    print(f"Error: {response.status_code}, {response.text}")
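Both examples above set stream to false. When a service follows the OpenAI API format, setting stream to true typically returns server-sent events: lines that start with `data:` and carry a JSON chunk, ending with `data: [DONE]`. The following is a minimal sketch under that assumption. The function names are hypothetical, and you should verify the exact response shape against your deployed service.

```python
import json
from typing import Iterable, Iterator, Optional


def parse_sse_line(line: str) -> Optional[str]:
    """Return the text delta carried by one SSE line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, or other SSE fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream marker
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content")


def iter_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the non-empty text deltas from a sequence of SSE lines."""
    for line in lines:
        text = parse_sse_line(line)
        if text:
            yield text


def stream_chat(endpoint: str, token: str, model: str, prompt: str) -> Iterator[str]:
    """Send a streaming chat request and yield text pieces as they arrive."""
    import requests  # imported here so the parser has no third-party dependency

    response = requests.post(
        f"{endpoint}/v1/chat/completions",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
        stream=True,
        timeout=60,
    )
    response.raise_for_status()
    # iter_lines() yields bytes; decode explicitly as UTF-8.
    yield from iter_text(raw.decode("utf-8") for raw in response.iter_lines())
```

To print a reply as it is generated, loop over `stream_chat("<YOUR_ENDPOINT>", "<YOUR_TOKEN>", "Qwen/Qwen3-0.6B", "Hello")` and print each piece with `end=""`.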
Logs and monitoring
Deployment logs: Records the key steps of a service deployment, including model download, image pull, instance startup, and model loading. These logs are the primary source for troubleshooting deployment failures.
Logs: Records of request handling and error messages generated after the service starts. Use these logs to troubleshoot invocation exceptions.
Monitoring: View service performance metrics such as QPS, latency, and GPU utilization.
FAQ and troubleshooting
Deployment failed
Check deployment logs: First, check the deployment logs for error messages.
Check the model path: Confirm that the model path referenced in the start command (/mnt/...) matches the mount rule of the model source.
Check the instance type: If the logs show an OutOfMemoryError or OOM, it usually means the instance has insufficient GPU memory. You can try switching to an instance with higher specifications.
Check RAM permissions: If you see permission denied or 403 Forbidden, it is usually because the RAM user that performs the operation lacks the necessary Devs-related permissions, which prevents the service configuration from being saved.
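The checklist above can be applied mechanically to a copied log excerpt. The following is a minimal sketch that maps the error strings named above to the matching hint. The function name and hint wording are hypothetical, the "out of memory" and "No such file or directory" patterns are assumptions about typical log output rather than documented FunModel messages, and real logs may phrase errors differently.

```python
import re
from typing import List

OOM_HINT = "GPU memory is likely insufficient; try an instance with higher specifications."
PERMISSION_HINT = "The RAM user may lack the necessary permissions."
PATH_HINT = "The model path in the start command may not match the mount path under /mnt/."

# Each rule pairs a case-insensitive pattern with a hint.
# \bOOM\b avoids false matches inside words such as "room".
RULES = [
    (re.compile(r"OutOfMemoryError|out of memory|\bOOM\b", re.IGNORECASE), OOM_HINT),
    (re.compile(r"permission denied|403 Forbidden", re.IGNORECASE), PERMISSION_HINT),
    (re.compile(r"No such file or directory", re.IGNORECASE), PATH_HINT),
]


def diagnose(log_text: str) -> List[str]:
    """Return the hints whose patterns appear anywhere in the log text."""
    return [hint for pattern, hint in RULES if pattern.search(log_text)]


if __name__ == "__main__":
    excerpt = "torch.OutOfMemoryError: CUDA out of memory."
    for hint in diagnose(excerpt):
        print(hint)
```

An empty result only means that none of these patterns matched, so read the full deployment log before you rule anything out.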
Service startup failed
Check service logs: Confirm whether the service start command executed successfully and check for any errors.
Check port configuration: Ensure that the port specified in the start command, such as --port 9000, matches the port configured in the console.
Check the listening address: For custom images, ensure that the service is listening on host 0.0.0.0.
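The port and listening address checks above can be run against a start command before you save it. The following is a minimal sketch that assumes the command passes these settings through `--port` and `--host` flags, as the vLLM and SGLang examples in this topic do. The helper names are hypothetical, and commands that configure the port another way are not covered.

```python
import shlex
from typing import List, Optional


def get_flag(command: str, flag: str) -> Optional[str]:
    """Return the value of `flag` in a shell-style command, or None if absent."""
    # Join backslash-continued lines before tokenizing.
    tokens = shlex.split(command.replace("\\\n", " "))
    for i, token in enumerate(tokens):
        if token == flag and i + 1 < len(tokens):
            return tokens[i + 1]
        if token.startswith(flag + "="):
            return token.split("=", 1)[1]
    return None


def check_start_command(command: str, listening_port: int) -> List[str]:
    """Return a list of problems found; an empty list means no mismatch was detected."""
    problems = []
    port = get_flag(command, "--port")
    if port is None:
        problems.append("no --port flag found; confirm the framework default matches the listening port")
    elif int(port) != listening_port:
        problems.append(f"--port {port} does not match listening port {listening_port}")
    host = get_flag(command, "--host")
    if host is not None and host != "0.0.0.0":
        problems.append(f"--host {host} should be 0.0.0.0")
    return problems


if __name__ == "__main__":
    cmd = "vllm serve /mnt/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --port 9000"
    print(check_start_command(cmd, listening_port=9000))
```

The check only inspects the command text. It cannot confirm that the process inside the container actually bound to the port, so the service logs remain the authoritative source.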
Performance issues
Optimize start parameters: For multi-GPU instances, make sure to set --tensor-parallel-size to enable parallel computing.
Upgrade the instance type: If monitoring shows that resources such as GPU utilization have reached a bottleneck, you can consider upgrading to a higher-performance instance.
Use model quantization: Model quantization techniques, such as AWQ and GPTQ, can reduce GPU memory usage and computation. This operation may require a custom image.