Quick Start

This guide walks you through deploying a model on the FunModel platform and calling its API. It covers selecting and configuring compute instances, managing service credentials, making inference requests, and basic troubleshooting, so that you can integrate the AI model capabilities of FunModel into your applications.

Preparations

Before you begin, make sure that you have a valid Alibaba Cloud account and are logged on to the FunModel console.

  1. Switch to the new console: If you are using the old version, click New Console in the upper-right corner of the page.

  2. Complete authorization: The first time you log on, follow the on-screen instructions to configure RAM role authorization.

Deploy and call a model service

The following steps describe how to deploy a model as an online service and call it. This process applies to traditional models, such as OCR and speech recognition, and large language models (LLMs).

Step 1: Select a model

In the Model Marketplace, select a model that fits your business scenario. For example:

  • Traditional model: iic/cv_convnextTiny_ocr-recognition-general_damo (OCR).

  • Large language model (LLM): Qwen/Qwen3-8B (Qwen 8B model).

Step 2 (optional, available for some model services): Test the model

Before deployment, you can use the Quick Experience feature to check if the model's performance meets your expectations.

  1. Select a model to open its details page.

  2. In the Quick Experience section, click Run Test. This runs an inference using preset test data.

  3. Review the output to determine whether the model's features meet your needs.

Step 3: Configure and deploy the model

In this step, you will deploy the model as an online service and allocate the required compute resources.

  1. On the model details page, click Deploy Now.

  2. On the configuration page, the core configuration items are Instance Type and GPU Specification. These specifications determine the performance and cost of the service. For more information about instance types, see Instance types and specifications.

    Instance type descriptions and recommendations:

    | Instance type | Specification (vCPU / Memory / GPU memory) | Scenarios |
    | --- | --- | --- |
    | Basic GPU | 4 vCPUs / 16 GB / 8 GB | Feature validation for traditional models, low-frequency calls |
    | Advanced GPU | 8 vCPUs / 32 GB / 16 GB | Traditional models in production environments, lightweight LLMs |
    | Performance GPU | 8 vCPUs / 64 GB / 48 GB | LLM inference, image generation, and other GPU-intensive tasks |
    | Performance GPU (Multi-card) | 16 vCPUs / 128 GB / 48 GB × 2 | High-performance inference for large-scale LLMs |

  3. Click Deploy Now. Once the service deployment is complete, the page automatically redirects to the service details page.
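
As a rough aid for choosing an instance type, the following Python sketch estimates the GPU memory that a model's weights need and picks the smallest instance type from the table above that fits. The estimation rule (parameter count × bytes per parameter × a headroom factor of 1.2) is a common rule of thumb and an assumption here, not a FunModel sizing method. The function names are hypothetical, and treating the multi-card type as 96 GB in total assumes that the serving framework can split the model across both cards.

```python
from typing import Optional

# GPU memory per instance type in GB, copied from the table above.
# The multi-card entry is 48 GB x 2 (assumes the model can span both cards).
INSTANCE_GPU_MEMORY_GB = [
    ("Basic GPU", 8),
    ("Advanced GPU", 16),
    ("Performance GPU", 48),
    ("Performance GPU (Multi-card)", 96),
]


def estimate_gpu_memory_gb(params_billions: float,
                           bytes_per_param: int = 2,
                           overhead: float = 1.2) -> float:
    """Rough estimate: weight size multiplied by a headroom factor.

    bytes_per_param=2 corresponds to FP16/BF16 weights. The 1.2 factor
    (KV cache, activations) is an assumed rule of thumb, not a measurement.
    """
    # (params_billions * 1e9 params) * bytes / (1e9 bytes per GB)
    return params_billions * bytes_per_param * overhead


def smallest_fitting_instance(required_gb: float) -> Optional[str]:
    """Return the smallest instance type whose GPU memory is sufficient."""
    for name, gpu_gb in INSTANCE_GPU_MEMORY_GB:
        if gpu_gb >= required_gb:
            return name
    return None  # Nothing in the table is large enough.


need = estimate_gpu_memory_gb(8)  # a model with about 8 billion parameters
print(f"{need:.1f} GB -> {smallest_fitting_instance(need)}")
```

Real memory usage also depends on context length, batch size, and quantization, so treat the result as a starting point and confirm it with the GPU Memory Usage metric after deployment.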

Step 4: Call the model service

After the service is deployed, you can interact with the model in one of two ways.

Method 1: Online call

Use this method to quickly verify in the console that the deployed service is processing input and generating output correctly.

  1. On the service details page, click the Online Debugging tab.

  2. The system automatically populates the request with sample parameters, which you can modify as needed.

  3. Click Send Request.

  4. In the Response section on the right, you can view the model's response.

Method 2: API call

Use this method to integrate the model's capabilities into your application by making standard HTTP requests.

  1. Obtain the service credentials and endpoint

    Before making an API call, you need to obtain two key pieces of information from the Overview > Access Information section on the model details page:

    • API endpoint: The exclusive access URL for the service.

    • Bearer Token: Used for identity authentication in API calls.

    Note

    To ensure service security, FunModel recommends that you enable authenticated access.

    With authentication enabled, every HTTP request must carry a valid Authorization: Bearer <YOUR_TOKEN> header, which prevents unauthorized access. With authentication disabled, anyone who knows your service's API endpoint can call it. This poses a security risk, so disable authentication only for temporary testing in a trusted internal network.

  2. Construct and send the request

    Models on FunModel may follow different API specifications.

    • Large language models (LLMs) that are compatible with the OpenAI API

      LLM services that are deployed on FunModel provide an API endpoint that is compatible with the OpenAI v1/chat/completions interface. This compatibility makes it easy to migrate existing applications.

      The following curl example shows how to call the Qwen/Qwen3-8B model. Replace the --url and Authorization values with your service's API endpoint and Bearer Token.

      curl --request POST \
        --url https://YOUR_SERVICE_URL/v1/chat/completions \
        --header 'Authorization: Bearer YOUR_BEARER_TOKEN' \
        --header 'content-type: application/json' \
        --data '{
          "model": "Qwen/Qwen3-8B",
          "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, please tell me about Hangzhou."}
          ],
          "stream": false,
          "temperature": 0.8,
          "max_tokens": 1024
        }'
      

      Key request parameters:

      | Parameter | Type | Required | Description |
      | --- | --- | --- | --- |
      | model | string | Yes | Specifies the ID of the model to call. It must match the model you deployed. |
      | messages | array | Yes | The conversation history. Each message contains a role and content. |
      | stream | boolean | No | Specifies whether to return a stream. The default is false. |
      | temperature | float | No | Controls the randomness of the generated text. The value is between 0 and 2. A higher value makes the response more creative. |
      | max_tokens | integer | No | Controls the maximum length, in tokens, of a single generated response. |

    • Traditional models

      The API format for traditional models is typically simpler. The following curl example shows how to call an OCR model.

      curl --request POST \
        --url https://YOUR_OCR_SERVICE_URL/ \
        --header 'Authorization: Bearer YOUR_BEARER_TOKEN' \
        --header 'content-type: application/json' \
        --data '{"input":{"image":"http://modelscope.oss-cn-beijing.aliyuncs.com/demo/images/image_ocr_recognition.jpg"}}'
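
The two curl calls above can also be made from application code. The following Python sketch uses only the standard library to build the same requests. The environment variable names and helper functions are hypothetical, and the response path choices[0].message.content is an assumption based on the OpenAI chat completions response format that the LLM endpoint is described as compatible with.

```python
import json
import os
import urllib.request

# Hypothetical environment variable names. Copy the real values from the
# service's Access Information section.
SERVICE_URL = os.environ.get("FUNMODEL_SERVICE_URL", "https://YOUR_SERVICE_URL")
BEARER_TOKEN = os.environ.get("FUNMODEL_BEARER_TOKEN", "YOUR_BEARER_TOKEN")


def build_request(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat_payload(user_text: str, model: str = "Qwen/Qwen3-8B") -> dict:
    """Request body matching the LLM curl example."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "stream": False,
        "temperature": 0.8,
        "max_tokens": 1024,
    }


def ocr_payload(image_url: str) -> dict:
    """Request body matching the OCR curl example."""
    return {"input": {"image": image_url}}


def send(request: urllib.request.Request, timeout: float = 60) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def first_reply(response_json: dict) -> str:
    """Extract the assistant text (assumes the OpenAI response shape)."""
    return response_json["choices"][0]["message"]["content"]


chat_request = build_request(
    SERVICE_URL.rstrip("/") + "/v1/chat/completions",
    BEARER_TOKEN,
    chat_payload("Hello, please tell me about Hangzhou."),
)
print(chat_request.get_method(), chat_request.full_url)
```

Nothing is sent until you call send(chat_request). With a deployed service and real credentials, first_reply(send(chat_request)) should return the generated text if the response follows the OpenAI format. For an OCR service, pass ocr_payload(...) and the OCR endpoint to build_request instead.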

Billing information

The FunModel platform is free to use. However, you are charged for the underlying cloud resources that are consumed when you deploy and call services. These fees are settled in your Alibaba Cloud account and primarily include the following:

  • Function Compute (FC) fees: These are the core computing costs for running the model. You are billed on a pay-as-you-go basis according to the instance type and running time that you select.

  • Apsara File Storage NAS (NAS) fees: Model files are stored in NAS. You are billed on a pay-as-you-go basis according to the amount of storage space that is used.

  • Simple Log Service (SLS) fees: Service logs are collected in SLS for querying. You are billed based on usage.

To prevent service interruptions due to overdue payments, make sure that your Alibaba Cloud account has a sufficient balance. The related Alibaba Cloud services typically offer a free quota. Usage that exceeds the free quota is billed on a pay-as-you-go basis. For more information, see the official pricing documentation for each cloud product.

Troubleshooting

Logs are the primary source of information for troubleshooting. When you encounter a problem, check the logs first to identify the cause.

  • Deployment failures

    If a model deployment fails, click Operation Record > View Details on the model details page to view the detailed error messages.

    • OOMKilled (Out of Memory): This error indicates insufficient system memory or GPU memory. It typically occurs when the selected instance type is too small for a large model. To resolve this issue, upgrade to an instance type with more memory or GPU memory.

    • ImagePullBackOff / ErrImagePull: The image failed to pull. Check your network configuration or contact technical support.

    • Download timeout: The model file download timed out. This error is typically caused by a large model file or network fluctuations. Try to deploy the model again.

  • Call failures

    If a model service call fails, first check the returned HTTP status code. Then, use the request ID (x-fc-request-id) to find the detailed logs on the model details page.

    • 403 Forbidden: This error indicates that authentication failed, which typically means your API key (Bearer Token) is invalid. Check the following:

      • The format of the Authorization request header must be Bearer sk-xxxxxxxx.

      • The provided Bearer Token must be complete, correct, and not have expired.

      • Check the Message field in the response body. It provides the specific reason for the failure, such as access denied due to invalid bearer token.

    • 429 Too Many Requests: The call frequency exceeds the service's concurrency limit. To resolve this issue, increase the number of instances in the service's advanced settings or optimize your call logic.

    • 502 Bad Gateway / 504 Gateway Timeout: A backend service error or timeout occurred. Check the operational logs for information about program crashes or inference timeouts.
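
The call-failure checks above can be partly automated on the client side. The following Python sketch logs the status code together with the x-fc-request-id response header and retries the transient errors (429, 502, 504) with exponential backoff. The retry policy, the delay values, and the helper names are assumptions for illustration; FunModel does not prescribe them. A 403 response is not retried, because a new attempt cannot fix an invalid token.

```python
import time
import urllib.error
import urllib.request

# Status codes treated as transient here (an assumption, tune as needed).
RETRYABLE = {429, 502, 504}

# Short hints that summarize the troubleshooting list above.
HINTS = {
    403: "Authentication failed: check the Authorization header and Bearer Token.",
    429: "Concurrency limit exceeded: add instances or reduce the call rate.",
    502: "Backend error: check the logs for program crashes.",
    504: "Backend timeout: check the logs for inference timeouts.",
}


def backoff_delays(max_retries: int, base: float = 1.0) -> list:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base * (2 ** attempt) for attempt in range(max_retries)]


def call_with_retry(request, max_retries=3,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Send the request and retry transient HTTP errors.

    opener and sleep are injectable so the logic can be tested offline.
    """
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            with opener(request, timeout=60) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # The request ID helps you find the matching logs in the console.
            request_id = err.headers.get("x-fc-request-id") if err.headers else None
            hint = HINTS.get(err.code, "See the response body for details.")
            print(f"HTTP {err.code} (request id: {request_id}): {hint}")
            if err.code not in RETRYABLE or attempt == max_retries:
                raise  # Not transient, or retries exhausted.
            sleep(delays[attempt])
```

Pass a urllib.request.Request object as the first argument. If 429 responses persist after the retries, the service likely needs more instances rather than more client-side retries.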

Best practices

  • Cost control: Before deployment, use the Quick Experience feature to verify the model's performance and avoid unnecessary resource overhead. For compute-intensive tasks, such as large model inference, select a suitable compute-optimized instance to balance cost and efficiency.

  • Performance monitoring: On the Monitoring tab of the service details page, track core metrics such as Function Invocations, Function Running Time, Memory Usage, and GPU Memory Usage. You can also configure alert rules to promptly detect and address performance issues.

  • Continuous optimization and configuration adjustment: Dynamically adjust service configurations based on your business load and monitoring data to balance performance and cost. For example, you can increase the number of instances to handle high concurrency or upgrade the instance type to reduce inference latency.