Custom model deployment

FunModel lets you deploy your own models or open-source pre-trained models as online API services. This topic describes how to choose a deployment plan and how to deploy, invoke, and manage a model with vLLM, SGLang, or a custom image.

Preparations

Before you begin, make sure that you have an Alibaba Cloud account and are logged on to the FunModel console.

  1. Switch to the new console: If you are using the old version, click New Console in the upper-right corner of the page.

  2. Complete authorization: When you log on for the first time, follow the on-screen instructions to configure settings, such as RAM role authorization.

Choosing a deployment plan

Before you deploy a model, review the following concepts so that you can select the plan that fits your business needs.

Deployment mode comparison: Elastic instances vs. provisioned instances

FunModel provides two instance types to suit scenarios with different workload characteristics. Elastic instances are a flexible, pay-as-you-go model that focuses on cost control. Provisioned instances are a stable, long-term reservation model that focuses on performance assurance. For more information, see Instance types and specifications.

Inference framework comparison: vLLM vs. SGLang

FunModel has built-in support for vLLM and SGLang, two mainstream high-performance inference frameworks. vLLM and SGLang deployments currently support only large language models (LLMs).

The following list compares the two frameworks by dimension, with a recommendation where one applies.

  • Core features

    • vLLM: Achieves high throughput and efficient GPU memory management with PagedAttention technology.

    • SGLang: Designed for complex LLM programs. Optimizes structured output and multi-path parallel inference with RadixAttention.

    • Recommendation: vLLM is an excellent choice for general-purpose, high-performance inference. It has an active community and supports a wide range of models. SGLang has an advantage in scenarios that require complex prompt engineering, parallel generation, or structured output such as JSON.

  • Model compatibility

    • vLLM: Supports most mainstream Transformer-based models.

    • SGLang: Also supports a wide range of Transformer models. May offer better compatibility for certain complex structures.

    • Recommendation: Both are highly compatible. vLLM is usually sufficient. If you encounter a specific model or advanced use case that vLLM does not support, try SGLang.

  • Scenarios

    • vLLM: General LLM inference tasks such as text generation, dialogue, and summarization.

    • SGLang: Agent applications, multi-role dialogue simulation, JSON mode output, and chain-of-thought (CoT) inference.

Note: vLLM is a general-purpose, high-performance inference solution with broad community support. SGLang may provide better support when you need to handle complex prompt engineering or structured output.

Model source description

FunModel supports loading models from different sources. The model files are ultimately mounted to the /mnt/ directory of the instance.

The path that you provide and the resulting mount path depend on the model source:

  • ModelScope model ID: Provide the model ID, such as Qwen/Qwen3-0.6B. FunModel automatically downloads the model and mounts it to /mnt/<model_name>.

  • Object Storage Service (OSS): Provide the full OSS path. The files are mounted to /mnt/<OSS_Bucket_Name>/<OSS_Path>.

  • NAS mount: Provide the absolute NAS path. The path is mounted to /mnt/<NAS_Mount_Target_Name>/<NAS_Path>.
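
The mount rules above can be sketched as small helper functions, which is handy when you need to work out the path to put in a start command. This is an illustrative sketch only: the function names are hypothetical, and it assumes that <model_name> for a ModelScope ID is the part after the namespace (as in the Qwen/Qwen3-0.6B to /mnt/Qwen3-0.6B example in this topic).

```python
import posixpath

MOUNT_ROOT = "/mnt"  # All model sources are mounted under this directory.


def modelscope_mount(model_id: str) -> str:
    """Expected mount path for a ModelScope model ID such as 'Qwen/Qwen3-0.6B'.

    Assumption: <model_name> is the last segment of the model ID.
    """
    model_name = model_id.strip("/").split("/")[-1]
    return posixpath.join(MOUNT_ROOT, model_name)


def oss_mount(bucket: str, path: str) -> str:
    """Expected mount path for files stored at <bucket>/<path> in OSS."""
    return posixpath.join(MOUNT_ROOT, bucket.strip("/"), path.strip("/"))


def nas_mount(mount_target: str, path: str) -> str:
    """Expected mount path for an absolute NAS path under a mount target name."""
    return posixpath.join(MOUNT_ROOT, mount_target.strip("/"), path.strip("/"))


if __name__ == "__main__":
    print(modelscope_mount("Qwen/Qwen3-0.6B"))
    print(oss_mount("fun-model-test", "Qwen/Qwen3-0.6B"))
    print(nas_mount("my-nas", "/models/llama"))  # "my-nas" is a placeholder name
```

Always confirm the actual mount path in the console, because the start command must reference the path exactly.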

Deployment steps

Method 1: Deploy using a pre-built framework (vLLM/SGLang)

This method is suitable for most standard LLMs. It does not require you to build an image and is the recommended way to quickly deploy a model.

  1. In the FunModel console, click Custom Development.

  2. Select a Model Environment and enter a Model Name and Model Description.

  3. Select a Model Source and enter the path. For more information, see Model source description.

  4. Select an Instance Type (Elastic Instance or Provisioned Instance), and then select the instance specification.

    • Note: The selected specification, especially its GPU memory, must meet the model's runtime requirements. Otherwise, the deployment may fail due to an out-of-memory (OOM) error.

  5. Configure the Start Command and Listening Port. The start command is executed after the container starts, loading the model and starting the inference service. A default start command is provided based on the selected model source, which you can modify as needed.

    • vLLM start command example

      # Deploy the Qwen3-0.6B model
      # /mnt/Qwen3-0.6B is the model's mount path in the container. It must match the mount path from the model source.
      # --served-model-name: The model name used in API calls.
      # --tensor-parallel-size: The tensor parallelism size, usually set to the number of GPUs in the instance.
      vllm serve /mnt/Qwen3-0.6B \
        --served-model-name Qwen/Qwen3-0.6B \
        --port 9000 \
        --trust-remote-code \
        --tensor-parallel-size 1 \
        --gpu-memory-utilization 0.9 \
        --max-model-len 8192

      Note: For more information about start parameters, see the official vLLM documentation.

    • SGLang start command example

      # Deploy a Llama 3 8B Instruct model mounted at /mnt/Llama3-8B-Instruct
      # --model-path: Specifies the model's mount path in the container.
      # --host: The service listening address. It must be set to 0.0.0.0.
      # --tp-size: The tensor parallelism size.
      python3 -m sglang.launch_server \
        --model-path /mnt/Llama3-8B-Instruct \
        --host 0.0.0.0 \
        --port 9000 \
        --tp-size 1

      Note: For more information about start parameters, see the official SGLang documentation.

  6. Select a Role Name. This is the RAM role that the service uses to access cloud resources, and it must have the necessary permissions. Select AliyunFCDefaultRole.

  7. After you confirm the configurations, click Create Model Service.
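
If you generate start commands from a script, the vLLM example in step 5 can be assembled programmatically. The following is a minimal sketch, not part of FunModel: the function name and the /mnt/ check are assumptions made for illustration, and only flags that already appear in this topic (plus vLLM's --host) are used.

```python
import shlex


def build_vllm_command(model_path, served_name, port=9000,
                       tensor_parallel_size=1, max_model_len=None):
    """Build a `vllm serve` start command as a single shell-safe string."""
    # Model sources are mounted under /mnt/, so reject other paths early.
    if not model_path.startswith("/mnt/"):
        raise ValueError("model path should be under /mnt/")
    args = [
        "vllm", "serve", model_path,
        "--served-model-name", served_name,   # name that clients send as "model"
        "--host", "0.0.0.0",                  # listen on all interfaces
        "--port", str(port),                  # must match the Listening Port
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    if max_model_len is not None:
        args += ["--max-model-len", str(max_model_len)]
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


if __name__ == "__main__":
    print(build_vllm_command("/mnt/Qwen3-0.6B", "Qwen/Qwen3-0.6B",
                             max_model_len=8192))
```

Paste the printed string into the Start Command field, and set Listening Port to the same port value.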

Method 2: Deploy using a custom image

When the default runtime environment cannot meet your specific needs, you can deploy a model using a custom image for maximum flexibility. Use this method in the following scenarios:

  • You need a specific framework version: You need to use a specific version of an inference framework, such as vLLM or SGLang, instead of the version pre-installed on the platform.

  • You have special package dependencies: Your application requires specific operating system dependencies, such as libraries installed with apt, Python packages, or special environment configurations.

  • You want to integrate the model and code for offline deployment: You want to package the model files directly into the image to enable offline deployment or faster startup. This is suitable for scenarios where the model is part of the application.

    • Example: The offline service image for PaddleOCR-VL is serverless-registry.cn-hangzhou.cr.aliyuncs.com/functionai/paddlex-genai-vllm-server:20251111-offline.

Procedure

  1. Prepare a custom image

    Write a Dockerfile to build a Docker image that contains all dependencies and code. Use an official CUDA image as the base image. The following is an example Dockerfile:

    # 1. Base image
    # Select a CUDA 12.1 image compatible with vLLM. The 'devel' version includes build tools, which is helpful for models that require just-in-time compilation.
    FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
    
    # 2. Set environment variables
    # - Prevents apt-get from asking interactive questions during the build
    # - Sets the time zone for easier log viewing
    ENV DEBIAN_FRONTEND=noninteractive
    ENV TZ=Asia/Shanghai
    
    # 3. Install system dependencies
    # - Run apt-get update, install, and clean the cache in one step to reduce the image layer size
    # - Install python3, pip, git (some models may need to clone from Hugging Face), and build-essential (for compiling dependencies)
    RUN apt-get update && \
        apt-get install -y --no-install-recommends \
        python3.10 \
        python3-pip \
        git \
        build-essential \
        && rm -rf /var/lib/apt/lists/*
    
    # 4. Set the working directory
    WORKDIR /app
    
    # 5. Install Python dependencies
    # - Upgrade pip
    # - Use --no-cache-dir to avoid caching and reduce image size
    # - Do not specify a vllm version number to install the latest stable version from PyPI by default
    # - Also install transformers, as vLLM often needs it to handle the tokenizer
    RUN pip3 install --no-cache-dir --upgrade pip && \
        pip3 install --no-cache-dir \
        vllm \
        transformers
    
    # 6. Expose the service port
    # The default port for the new vLLM OpenAI-compatible service is 8000
    EXPOSE 8000
    
    # 7. Define the container start command
    # - Start the vLLM OpenAI-compatible API server as a Python module
    # - Pass the model path, host, port, and other arguments
    # - The `--model` parameter points to the directory where the model is mounted (under /mnt/ in FunModel)
    CMD [ \
        "python3", \
        "-m", "vllm.entrypoints.openai.api_server", \
        "--host", "0.0.0.0", \
        "--port", "8000", \
        "--model", "/mnt/your-model-name", \
        "--trust-remote-code" \
    ]
    
  2. Build and push the image to Alibaba Cloud ACR

    Push the built image to your Alibaba Cloud Container Registry (ACR) repository. For more information, see Push and pull images using a Personal Edition instance.

  3. Deploy in FunModel

    • When you create the model service, set Model Environment to Custom Environment.

    • For Container Image, enter the path to your ACR image, such as serverless-registry.cn-hangzhou.cr.aliyuncs.com/functionai/vllm-openai:v0.11.0.

    • Enter the Model Source and related path information. For more information, see Model source description. For example, select OSS as the model source and specify the path to the model that you prepared in OSS.

    • You can select Elastic Instance or Provisioned Instance.

    • Configure the Start Command and Listening Port. For example: vllm serve /mnt/fun-model-test/Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --port 9000 --trust-remote-code. For more information about the parameters, see Method 1: Deploy using a pre-built framework (vLLM/SGLang). If your Dockerfile already includes a CMD or ENTRYPOINT, you can leave this parameter empty.

      Note

      In both the Dockerfile and the Start Command field in the console, the service must listen on 0.0.0.0.

  4. Click Create Model Service and wait for the service to start.
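
Before you create the service, it can help to check that the start command agrees with the console settings. The following sketch is a hypothetical pre-flight check, not a FunModel feature: it parses a start command, compares --port with the Listening Port, and flags an explicit --host other than 0.0.0.0. It does not flag a missing --host, which is a simplification of this sketch.

```python
import shlex


def check_start_command(command: str, console_port: int) -> list[str]:
    """Return a list of detected problems; an empty list means none were found."""
    tokens = shlex.split(command)
    options = {}
    for i, tok in enumerate(tokens):
        if not tok.startswith("--"):
            continue
        if "=" in tok:                      # form: --port=9000
            key, value = tok.split("=", 1)
            options[key] = value
        elif i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
            options[tok] = tokens[i + 1]    # form: --port 9000
        else:
            options[tok] = ""               # boolean flag such as --trust-remote-code

    problems = []
    port = options.get("--port")
    if port is None:
        problems.append("no --port option found")
    elif port != str(console_port):
        problems.append(f"--port {port} does not match console port {console_port}")

    host = options.get("--host")
    if host is not None and host != "0.0.0.0":
        problems.append(f"--host {host} should be 0.0.0.0")
    return problems


if __name__ == "__main__":
    cmd = ("vllm serve /mnt/fun-model-test/Qwen/Qwen3-0.6B "
           "--served-model-name Qwen/Qwen3-0.6B --port 9000 --trust-remote-code")
    print(check_start_command(cmd, console_port=9000))
```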

Develop a custom image using DevPod

If a model cannot be run directly from the command line, we recommend that you use a custom image to start a DevPod. In the DevPod, you can develop and debug the model, and then package the debugged environment into an image to complete the deployment. For more information, see DeepSeek-OCR Quick Start Guide.

Official pre-built images

To customize the official environment, you can use a FunModel pre-built image as your base image:

  • Regions in China: serverless-registry.cn-hangzhou.cr.aliyuncs.com/functionai/vllm-openai:<tag>

  • Regions outside China: serverless-registry.ap-southeast-1.cr.aliyuncs.com/functionai/vllm-openai:<tag>

Available <tag> values include v0.10.1, v0.10.2, and v0.11.0. For the complete and current list of image versions, see FunModel Pre-built Image List.
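
As a small illustration of how the two registry addresses and a tag combine into the first line of a Dockerfile, consider the following sketch. The function name and the in_china parameter are hypothetical; the registry hosts and repository path are the ones listed above. The tag is not validated, because the list of available tags can change.

```python
REGISTRY_CHINA = "serverless-registry.cn-hangzhou.cr.aliyuncs.com"
REGISTRY_INTL = "serverless-registry.ap-southeast-1.cr.aliyuncs.com"
REPOSITORY = "functionai/vllm-openai"


def base_image_line(tag: str, in_china: bool = True) -> str:
    """Return a Dockerfile FROM line for a FunModel pre-built vLLM image."""
    if not tag:
        raise ValueError("a tag such as 'v0.11.0' is required")
    registry = REGISTRY_CHINA if in_china else REGISTRY_INTL
    return f"FROM {registry}/{REPOSITORY}:{tag}"


if __name__ == "__main__":
    print(base_image_line("v0.11.0"))
    print(base_image_line("v0.11.0", in_china=False))
```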

Service O&M

After the service is deployed, you can invoke the service, monitor its performance, and troubleshoot issues.

Service invocation

On the service details page, go to Overview > Access Information to obtain the endpoint and authentication token. Services deployed with vLLM and SGLang are usually compatible with the OpenAI API format.

cURL call example

# Replace <YOUR_ENDPOINT> and <YOUR_TOKEN> with your actual information.
# The "model" value must match the --served-model-name in the start command.
ENDPOINT_URL="<YOUR_ENDPOINT>"
TOKEN="<YOUR_TOKEN>"

curl "$ENDPOINT_URL/v1/chat/completions" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "stream": false
  }'

Python call example

import requests
import json

# Replace <YOUR_ENDPOINT> and <YOUR_TOKEN> with your actual information.
ENDPOINT_URL = "<YOUR_ENDPOINT>/v1/chat/completions"
TOKEN = "<YOUR_TOKEN>"

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}

data = {
    "model": "Qwen/Qwen3-0.6B",  # The model name here must match the --served-model-name in the start command.
    "messages": [
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    "stream": False
}

response = requests.post(ENDPOINT_URL, headers=headers, data=json.dumps(data))

if response.status_code == 200:
    print(response.json())
else:
    print(f"Error: {response.status_code}, {response.text}")
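
When the service follows the OpenAI chat completions format, the generated text is normally found under choices[0].message.content, and a streaming response ("stream": true) arrives as lines of the form `data: {...}` that end with `data: [DONE]`. The following sketch shows one way to extract the text in both cases. The helper names are hypothetical, and the sample payloads in the usage section are illustrative rather than captured from a real service.

```python
import json


def extract_reply(payload: dict) -> str:
    """Return the assistant text from a non-streaming chat completion response."""
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    message = choices[0].get("message") or {}
    return message.get("content") or ""


def join_stream(lines) -> str:
    """Concatenate the text deltas from streamed 'data: ...' lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                       # skip blank lines and comments
        body = line[len("data:"):].strip()
        if body == "[DONE]":               # end-of-stream marker
            break
        chunk = json.loads(body)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            parts.append(delta.get("content") or "")
    return "".join(parts)


if __name__ == "__main__":
    # Illustrative payloads only.
    sample = {"choices": [{"message": {"role": "assistant", "content": "Hi!"}}]}
    print(extract_reply(sample))
    stream = [
        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
        'data: {"choices": [{"delta": {"content": "Hel"}}]}',
        'data: {"choices": [{"delta": {"content": "lo"}}]}',
        "data: [DONE]",
    ]
    print(join_stream(stream))
```

With the requests library, extract_reply can be applied to response.json(), and join_stream to the decoded lines of a streamed response.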

Logs and monitoring

  • Operation Record > Deployment Logs: Records the key steps of a service deployment, including model download, image pull, instance startup, and model loading. These logs are the primary source for troubleshooting deployment failures.

  • Logs: Records request handling and error messages generated after the service starts. Use these logs to troubleshoot invocation exceptions.

  • Monitoring: View service performance metrics such as QPS, latency, and GPU utilization.

FAQ and troubleshooting

  1. Deployment failed

    • Check deployment logs: First, check the deployment logs for error messages.

    • Check the model path: Confirm that the model path referenced in the start command (/mnt/...) matches the mount rule of the model source.

    • Check the instance type: If the logs show an OutOfMemoryError or OOM, it usually means the instance has insufficient GPU memory. You can try switching to an instance with higher specifications.

    • Check RAM permissions: If you see permission denied or 403 Forbidden, it is usually because the RAM user that performs the operation lacks the necessary Devs-related permissions, which prevents the service configuration from being saved.

  2. Service startup failed

    • Check service logs: Confirm whether the service start command executed successfully and check for any errors.

    • Check port configuration: Ensure that the port specified in the start command, such as --port 9000, matches the port configured in the console.

    • Check the listening address: For custom images, ensure that the service is listening on host 0.0.0.0.

  3. Performance issues

    • Optimize start parameters: For multi-GPU instances, make sure to set --tensor-parallel-size to enable parallel computing.

    • Upgrade the instance type: If monitoring shows that resources such as GPU utilization have reached a bottleneck, you can consider upgrading to a higher-performance instance.

    • Use model quantization: Model quantization techniques, such as AWQ and GPTQ, can reduce GPU memory usage and computation. This operation may require a custom image.
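
To judge whether an OOM error is likely, or how much quantization might help, a rough estimate of the memory needed for the model weights alone is parameter count × bits per parameter / 8. The following sketch applies that arithmetic. It is a lower bound under stated assumptions: it ignores the KV cache, activations, and framework overhead, and quantized formats also store extra scaling data, so real usage is higher. The function names are hypothetical, and the 0.9 default mirrors the --gpu-memory-utilization value used in the vLLM example.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def weights_fit(num_params: float, bits_per_param: int,
                gpu_memory_gb: float, utilization: float = 0.9) -> bool:
    """Check whether the weights alone fit in the usable share of GPU memory.

    A True result is necessary but not sufficient: the KV cache and
    activations need additional memory on top of the weights.
    """
    budget_gb = gpu_memory_gb * utilization
    return weight_memory_gb(num_params, bits_per_param) <= budget_gb


if __name__ == "__main__":
    params_8b = 8e9
    print(weight_memory_gb(params_8b, 16))   # 16-bit weights
    print(weight_memory_gb(params_8b, 4))    # 4-bit quantized weights
    print(weights_fit(params_8b, 16, gpu_memory_gb=16))
    print(weights_fit(params_8b, 4, gpu_memory_gb=16))
```

For an 8-billion-parameter model, this gives about 16 GB of weights at 16-bit precision and about 4 GB at 4-bit precision, which is why a quantized model can fit on an instance where the full-precision model does not.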