Custom model deployment

Use FunModel to deploy your own models or open-source pre-trained models as online API services. This topic guides you through selecting a deployment plan and using vLLM, SGLang, or a custom image to deploy, invoke, and manage your model.

Preparations

Before you begin, make sure you have an Alibaba Cloud account and are logged on to the FunModel console.

  1. Switch to the new console: If you are using the old version, click New Console in the upper-right corner of the page.

  2. Complete authorization: When you log on for the first time, follow the on-screen instructions to complete configurations such as RAM role authorization.

Choosing a deployment plan

Before you deploy a model, review the following concepts to help you choose the right plan for your needs.

Deployment mode comparison: Elastic vs. provisioned instances

FunModel provides two instance types to suit business scenarios with different load characteristics. Elastic instances are a flexible, pay-as-you-go model focused on cost control. Provisioned instances are a stable, long-term reserved model focused on performance assurance. For more information, see Instance types and specifications.

Inference framework comparison: vLLM vs. SGLang

FunModel includes built-in support for vLLM and SGLang. These are two popular, high-performance inference frameworks. Note that vLLM and SGLang deployments currently support only large language models (LLMs).

| Comparison dimension | vLLM | SGLang | Recommendation |
| --- | --- | --- | --- |
| Core features | Achieves high throughput and efficient GPU memory management through PagedAttention technology. | Designed for complex LLM programs. Optimizes structured output and multi-path parallel inference through RadixAttention. | vLLM is an excellent choice for general-purpose, high-performance inference. It has an active community and supports a wide range of models. SGLang has an advantage in scenarios that require complex prompt engineering, parallel generation, or structured output, such as JSON. |
| Model compatibility | Supports most mainstream Transformer-based models. | Also supports a wide range of Transformer models. It may offer better compatibility for certain complex structures. | Both have good compatibility. vLLM usually meets most needs. If you encounter a specific model or advanced use case that vLLM does not support, try SGLang. |
| Scenarios | General LLM inference tasks such as text generation, dialogue, and summarization. | Agent applications, multi-role dialogue simulation, JSON mode output, and chain-of-thought (CoT) inference. | - |

Note: vLLM is a general-purpose, high-performance inference solution with broad community support. SGLang is better suited for complex prompt engineering or structured output.

Model sources

FunModel supports loading models from different sources. The model files are mounted to the /mnt/ directory of the instance.

| Model source | Model path description |
| --- | --- |
| ModelScope Model ID | Provide a model ID, such as Qwen/Qwen3-0.6B. FunModel automatically downloads and mounts the model to /mnt/<model_name>. |
| Object Storage Service (OSS) | Provide the full OSS path. The files are mounted to /mnt/<OSS_bucket_name>/<OSS_path>. |
| NAS Mount | Provide the absolute NAS path. The path is mounted to /mnt/<NAS_path>. |
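
The mounting rules above can be sketched as a small helper that predicts the path to use in the startup command. This is an illustration only, not FunModel code: the function name mount_path and the source keys are hypothetical, and it assumes that <model_name> is the last segment of the ModelScope model ID and that the OSS path is written as oss://<bucket>/<path>.

```python
from pathlib import PurePosixPath

MOUNT_ROOT = PurePosixPath("/mnt")


def mount_path(source: str, value: str) -> str:
    """Return the expected in-container path for a model source (hypothetical helper)."""
    if source == "modelscope":
        # Assumption: <model_name> is the last segment of the model ID.
        return str(MOUNT_ROOT / value.split("/")[-1])
    if source == "oss":
        # Assumption: the OSS path is written as oss://<bucket>/<path>.
        bucket_and_path = value.removeprefix("oss://").strip("/")
        return str(MOUNT_ROOT / bucket_and_path)
    if source == "nas":
        # The absolute NAS path is appended to the mount root.
        return str(MOUNT_ROOT / value.strip("/"))
    raise ValueError(f"unknown model source: {source}")


print(mount_path("modelscope", "Qwen/Qwen3-0.6B"))
print(mount_path("oss", "oss://fun-model-test/Qwen/Qwen3-0.6B"))
print(mount_path("nas", "/models/Llama3-8B-Instruct"))
```

Always confirm the resulting path against the default startup command that the console generates, because that command reflects the actual mount location.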

Deployment steps

Method 1: Deploy using a pre-built framework (vLLM/SGLang)

This method is suitable for most standard LLMs and does not require you to build a custom image. We recommend this method for quickly deploying a model.

  1. In the FunModel console, click Custom Development.

  2. Select a Model Environment, and then enter a Model Name and Model Description.

  3. Select a Model Source and enter the path information as described in the Model sources section.

  4. Select an Instance Type (Elastic Instance or Provisioned Instance), and then select a specific instance specification.

    • Note: The selected instance specification (especially the GPU memory) must meet the model's operational requirements. Otherwise, the deployment may fail due to an out-of-memory (OOM) error.

  5. Configure the Startup Command and Listener Port. This command is executed after the container starts to load the model and start the inference service. The system provides a default startup command based on the model source that you select, which you can modify to meet your requirements.

    • vLLM startup command example

      # Deploy the Qwen3-0.6B model
      # /mnt/Qwen3-0.6B is the model's mount path inside the container. It must match the mount path of the model source.
      # --served-model-name: The model name used in API calls.
      # --tensor-parallel-size: The number of GPUs that the model is sharded across. This is usually set to the number of GPUs in the instance.
      vllm serve /mnt/Qwen3-0.6B \
        --served-model-name Qwen/Qwen3-0.6B \
        --port 9000 \
        --trust-remote-code \
        --tensor-parallel-size 1 \
        --gpu-memory-utilization 0.9 \
        --max-model-len 8192
      Note: For more startup parameters, see the official vLLM documentation.
    • SGLang startup command example

      # Deploy the Llama3-8B-Instruct model
      # --model-path: Specifies the model's mount path inside the container.
      # --host: The service listener address. Must be set to 0.0.0.0.
      # --tp-size: The tensor parallel size, that is, the number of GPUs that the model is sharded across.
      python3 -m sglang.launch_server \
        --model-path /mnt/Llama3-8B-Instruct \
        --host 0.0.0.0 \
        --port 9000 \
        --tp-size 1
      Note: For more startup parameters, see the official SGLang documentation.
  6. For Role Name, select AliyunFCDefaultRole. This RAM role is used to access cloud resources and must have the necessary permissions.

  7. After you confirm that all configurations are correct, click Create Model Service.
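
If you generate startup commands for several models, the vLLM example in step 5 can be assembled programmatically. The following is a minimal sketch, not part of FunModel: the function name build_vllm_command is hypothetical, and it only uses the flags shown in the example above.

```python
import shlex


def build_vllm_command(model_path, served_model_name, port=9000,
                       tensor_parallel_size=1, gpu_memory_utilization=0.9,
                       max_model_len=8192, trust_remote_code=True):
    """Assemble a single-line `vllm serve` command (hypothetical helper)."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    args = [
        "vllm", "serve", model_path,
        "--served-model-name", served_model_name,
        "--port", str(port),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
    ]
    if trust_remote_code:
        args.append("--trust-remote-code")
    # shlex.join quotes any argument that contains shell-special characters.
    return shlex.join(args)


print(build_vllm_command("/mnt/Qwen3-0.6B", "Qwen/Qwen3-0.6B"))
```

Paste the printed command into the Startup Command field, and make sure that the port value matches the Listener Port.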

Method 2: Deploy using a custom image

If the default runtime environment does not meet your specific needs, use a custom image for maximum flexibility. This method is recommended for the following scenarios:

  • Need a specific framework version: You need to use a specific version of an inference framework, such as vLLM or SGLang, instead of the pre-built version provided by the platform.

  • Depend on special packages: Your application requires specific operating system dependencies, such as libraries installed with apt, Python packages, or special environment configurations.

  • Integrate model and code (offline deployment): You want to package the model files directly into the image for offline deployment or faster startup. This is suitable for scenarios where the model is integrated into the application.

    • Example: The offline service image for PaddleOCR-VL is serverless-registry.cn-hangzhou.cr.aliyuncs.com/functionai/paddlex-genai-vllm-server:20251111-offline.

Procedure

  1. Prepare a custom image

    Write a Dockerfile to build a Docker image that contains all dependencies and code. We recommend using an official CUDA image as the base image. The following is an example Dockerfile:

    # 1. Base image
    # Select a CUDA 12.1 image compatible with vLLM. The 'devel' version includes build tools, which is helpful for some models that require just-in-time compilation.
    FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
    
    # 2. Set environment variables
    # - Prevents apt-get from asking interactive questions during the build
    # - Sets the time zone for easier log viewing
    ENV DEBIAN_FRONTEND=noninteractive
    ENV TZ=Asia/Shanghai
    
    # 3. Install system dependencies
    # - Run apt-get update, install, and clean the cache in a single step to reduce the image layer size
    # - Install python3, pip, git (some models may need to clone from Hugging Face), and build-essential (for compilation dependencies)
    RUN apt-get update && \
        apt-get install -y --no-install-recommends \
        python3.10 \
        python3-pip \
        git \
        build-essential \
        && rm -rf /var/lib/apt/lists/*
    
    # 4. Set the working directory
    WORKDIR /app
    
    # 5. Install Python dependencies
    # - Upgrade pip
    # - Use --no-cache-dir to avoid caching and reduce the image size
    # - Do not specify a vllm version number to install the latest stable version from PyPI by default
    # - Also install transformers, as vLLM often needs it to handle the tokenizer
    RUN pip3 install --no-cache-dir --upgrade pip && \
        pip3 install --no-cache-dir \
        vllm \
        transformers
    
    # 6. Expose the service port
    # The new OpenAI-compatible service in vLLM defaults to port 8000
    EXPOSE 8000
    
    # 7. Define the container startup command
    # - Start the OpenAI-compatible API server through the `vllm.entrypoints.openai.api_server` module
    # - Pass the model path, host, port, and other arguments
    # - The --model argument points to the model directory you will mount (FunModel mounts model sources under /mnt/)
    CMD [ \
        "python3", \
        "-m", "vllm.entrypoints.openai.api_server", \
        "--host", "0.0.0.0", \
        "--port", "8000", \
        "--model", "/mnt/your-model-name", \
        "--trust-remote-code" \
    ]
    
  2. Build and push the image to Alibaba Cloud ACR

    Push the built image to your Alibaba Cloud Container Registry (ACR) repository. For more information, see Push and pull images using an instance of a Personal Edition repository.

  3. Deploy in FunModel

    • When you create the model service, set Model Environment to Custom Environment.

    • For Container Image, enter your ACR image. For example: serverless-registry.cn-hangzhou.cr.aliyuncs.com/functionai/vllm-openai:v0.11.0.

    • Enter the Model Source and related path information as described in the Model sources section. For example, set Model Source to Object Storage Service (OSS) and specify the path to your model in OSS.

    • Select Elastic Instance or Provisioned Instance.

    • Configure the Startup Command and Health Check Port. For example: vllm serve /mnt/fun-model-test/Qwen/Qwen3-0.6B --served-model-name Qwen/Qwen3-0.6B --port 9000 --trust-remote-code. For a description of the parameters, see Method 1: Deploy using a pre-built framework (vLLM/SGLang). If your Dockerfile already includes a CMD or ENTRYPOINT instruction, you can leave this field empty.

      Note

      Whether in the Dockerfile or in the Startup Command in the console, you must set the service listener host to 0.0.0.0.

  4. Click Create Model Service and wait for the model service to start.
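
The two constraints in this procedure, a listener host of 0.0.0.0 and a port that matches the console setting, can be checked before you create the service. The following sketch is an illustration only: the function names are hypothetical, and it inspects only the --host and --port flags of a single-line command. A command without --host is not flagged, because the default listener address depends on the framework.

```python
import shlex


def flag_value(tokens, flag):
    """Return the value given as `--flag value` or `--flag=value`, else None."""
    for i, tok in enumerate(tokens):
        if tok == flag and i + 1 < len(tokens):
            return tokens[i + 1]
        if tok.startswith(flag + "="):
            return tok.split("=", 1)[1]
    return None


def check_startup_command(command, console_port):
    """Return a list of human-readable problems (empty if none were found)."""
    tokens = shlex.split(command)
    problems = []
    port = flag_value(tokens, "--port")
    if port is None:
        problems.append("no --port given; the framework default port applies")
    elif port != str(console_port):
        problems.append(f"--port {port} does not match console port {console_port}")
    host = flag_value(tokens, "--host")
    if host is not None and host != "0.0.0.0":
        problems.append(f"--host {host} should be 0.0.0.0")
    return problems


cmd = ("vllm serve /mnt/fun-model-test/Qwen/Qwen3-0.6B "
       "--served-model-name Qwen/Qwen3-0.6B --port 9000 --trust-remote-code")
print(check_startup_command(cmd, 9000))
```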

Develop a custom image using DevPod

If the model cannot run directly from the command line, you can use a custom image to start a DevPod. In the DevPod, you can develop and debug the model. Then, you can package the debugged environment into an image for deployment. For detailed instructions, see DeepSeek-OCR Quick Start Guide.

Official pre-built images

To customize the official environment, you can use a FunModel pre-built image as your base image:

  • China regions: serverless-registry.cn-hangzhou.cr.aliyuncs.com/functionai/vllm-openai:<tag>

  • Regions outside China: serverless-registry.ap-southeast-1.cr.aliyuncs.com/functionai/vllm-openai:<tag>

Available tags include v0.10.1, v0.10.2, and v0.11.0. For a complete list of image versions, see the FunModel Pre-built Image List.
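
As a small illustration, the base image reference for a Dockerfile FROM line can be composed from the two registries listed above. The helper name prebuilt_image and the area keys are hypothetical. The tag is not validated, because the complete tag list is maintained in the FunModel Pre-built Image List.

```python
# Registry hosts taken from the list of official pre-built images above.
REGISTRIES = {
    "china": "serverless-registry.cn-hangzhou.cr.aliyuncs.com",
    "international": "serverless-registry.ap-southeast-1.cr.aliyuncs.com",
}


def prebuilt_image(area: str, tag: str) -> str:
    """Return the vllm-openai image reference for an area key (hypothetical helper)."""
    if area not in REGISTRIES:
        raise ValueError(f"area must be one of {sorted(REGISTRIES)}")
    return f"{REGISTRIES[area]}/functionai/vllm-openai:{tag}"


print("FROM " + prebuilt_image("china", "v0.11.0"))
```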

Service O&M

After the service is deployed, you can invoke the service, monitor its performance, and troubleshoot issues.

Service invocation

On the service details page, go to the Overview > Access Information section to obtain the endpoint and authentication token. Services deployed with vLLM and SGLang are usually compatible with the OpenAI API format.

cURL invocation example

# Replace <YOUR_ENDPOINT> and <YOUR_TOKEN> with your actual information
ENDPOINT_URL="<YOUR_ENDPOINT>"
TOKEN="<YOUR_TOKEN>"

# The "model" value must match --served-model-name in the startup command
curl "$ENDPOINT_URL/v1/chat/completions" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "user", "content": "Hello"}
    ],
    "stream": false
  }'

Python invocation example

import requests

# Replace <YOUR_ENDPOINT> and <YOUR_TOKEN> with your actual information
ENDPOINT_URL = "<YOUR_ENDPOINT>/v1/chat/completions"
TOKEN = "<YOUR_TOKEN>"

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json"
}

data = {
    "model": "Qwen/Qwen3-0.6B",  # The model name here must match the --served-model-name in the startup command
    "messages": [
        {"role": "user", "content": "Hello, please introduce yourself"}
    ],
    "stream": False
}

# json= serializes the payload; timeout keeps the call from hanging indefinitely
response = requests.post(ENDPOINT_URL, headers=headers, json=data, timeout=60)

if response.status_code == 200:
    print(response.json())
else:
    print(f"Error: {response.status_code}, {response.text}")
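
Because the service follows the OpenAI API format, the reply text can be pulled out of the response body, and a streamed response (a request sent with "stream": true) arrives as server-sent event lines that start with "data:". The following is a minimal sketch under that assumption. The function names are hypothetical, and the sample lines are made up for illustration rather than captured from a real service.

```python
import json


def reply_text(response_body: dict) -> str:
    """Extract the assistant message from a non-streaming chat completion."""
    return response_body["choices"][0]["message"]["content"]


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Made-up sample lines, shaped like an OpenAI-compatible stream.
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

With the requests library, you would typically pass stream=True to requests.post and feed response.iter_lines() to iter_stream_text.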

Logs and monitoring

  • Operation Record > Deployment Log: Records key steps during service deployment, including model download, image pull, instance startup, and model loading. This is the primary source for troubleshooting deployment failures.

  • Log: Records request handling and error messages after the service starts. Use this log to troubleshoot invocation errors.

  • Monitoring: View metrics such as queries per second (QPS), latency, and GPU utilization to evaluate service performance.

FAQ and troubleshooting

  1. Deployment failed

    • Check the deployment log: Review the deployment log for error messages.

    • Check the model path: Confirm that the model path referenced in the startup command (/mnt/...) matches the mounting rule of the model source.

    • Check the instance type: If the log shows an OutOfMemoryError or OOM error, it usually means the instance has insufficient GPU memory. Try switching to an instance type with more memory.

    • Check RAM permissions: If you see a permission denied or 403 Forbidden error, this is usually because the RAM user performing the operation lacks the required Devs-related permissions. This prevents the service configuration from being saved.

  2. Service startup failed

    • Check the service log: Confirm whether the service startup command executed successfully and check for any errors.

    • Check the port configuration: Make sure the port specified in the startup command, such as --port 9000, matches the port configured in the console.

    • Check the listener address: For custom images, make sure the service is listening on host 0.0.0.0.

  3. Performance issues

    • Optimize startup parameters: For multi-GPU instances, set --tensor-parallel-size (vLLM) or --tp-size (SGLang) to the number of GPUs so that the model is sharded across all of them.

    • Upgrade the instance type: If monitoring shows that resources such as GPU utilization have reached a bottleneck, consider upgrading to a higher-performance instance.

    • Use model quantization: Model quantization techniques, such as AWQ and GPTQ, can reduce GPU memory usage and computational load. This operation may require a custom image.
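
To reason about OOM errors and the effect of quantization before you pick an instance, a rough lower bound on GPU memory is the size of the model weights: the number of parameters multiplied by the bytes per parameter. The following sketch makes several assumptions: the helper names are hypothetical, 4-bit formats such as AWQ and GPTQ are approximated as 0.5 bytes per parameter without their scaling metadata, and the KV cache, activations, and CUDA overhead are not counted. Treat a passing result as necessary, not sufficient.

```python
# Approximate storage cost per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Estimate the memory taken by model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


def weights_fit(num_params, dtype, gpu_memory_gib,
                gpu_memory_utilization=0.9, num_gpus=1):
    """Check whether the weights fit in the memory budget across all GPUs."""
    budget = gpu_memory_gib * gpu_memory_utilization * num_gpus
    return weight_memory_gib(num_params, dtype) < budget


# An 8-billion-parameter model on GPUs with 16 GiB of memory each.
print(round(weight_memory_gib(8e9, "bf16"), 1), "GiB in bf16")
print("1 GPU, bf16:", weights_fit(8e9, "bf16", 16))
print("1 GPU, int4:", weights_fit(8e9, "int4", 16))
print("2 GPUs, bf16:", weights_fit(8e9, "bf16", 16, num_gpus=2))
```

The num_gpus factor reflects the idea that tensor parallelism shards the weights across GPUs. Actual headroom depends on --max-model-len and the request load, so verify the choice against the deployment log and the monitoring metrics.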