This guide walks you through the process of deploying a model and calling its API on the FunModel platform. You will learn how to select and configure compute instances, manage service credentials, send inference requests, and perform basic troubleshooting. This will help you integrate FunModel's AI model capabilities into your application.
Preparations
Before you begin, make sure you have an active Alibaba Cloud account and are logged in to the FunModel console.
Switch to the new console: If you are using the legacy console, click New Console in the upper-right corner of the page.
Complete authorization: On your first login, follow the on-screen instructions to complete RAM role authorization and other required configurations.
Deploy and call a model service
The following steps show how to deploy a model as an online service and call it. This workflow applies to both traditional models, such as OCR and speech recognition, and Large Language Models (LLMs).
Step 1: Select a model
In the Model Market, choose a model that fits your use case. For example:
Traditional model: iic/cv_convnextTiny_ocr-recognition-general_damo (OCR text recognition).
Large Language Model (LLM): Qwen/Qwen3-8B (Qwen 8B model).
Step 2 (optional for some models): Try the model quickly
Before deploying the model, you can use the Quick Experience feature to verify whether the model meets your expectations.
Select the model and go to its product page.
In the Quick Experience section, click Run Test. The system runs a single inference using preset test data.
View the output to determine whether the model meets your needs.
Step 3: Configure and deploy the model
This step deploys the model as an online service. You must assign appropriate compute resources.
On the model product page, click Deploy Now.
In the configuration dialog box, the key settings are Instance Type and GPU Specifications. Different specifications affect performance and cost. For more information, see Instance Types and Specifications.
Instance specifications and selection guidance:
| Instance Type | Specifications (vCPU / Memory / GPU memory) | Use Cases |
|---|---|---|
| GPU Basic | 4 vCPUs, 16 GB memory, 8 GB GPU memory | Functional validation of traditional models, low-frequency calls |
| GPU Advanced | 8 vCPUs, 32 GB memory, 16 GB GPU memory | Production deployment of traditional models, lightweight LLMs |
| GPU Compute-Optimized | 8 vCPUs, 64 GB memory, 48 GB GPU memory | LLM inference, image generation, and other GPU-intensive tasks |
| GPU Compute-Optimized (Multi-GPU) | 16 vCPUs, 128 GB memory, 2 × 48 GB GPU memory | High-performance LLM inference at scale |
Click Deploy Now and wait for the deployment to complete. You will be automatically redirected to the service product page.
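As a rough aid for choosing an instance type in this step, the following Python sketch estimates the GPU memory that a model's weights need and picks the smallest instance type that fits. It is not a FunModel tool. It assumes that the third value in each row of the specifications table is GPU memory, that weights are stored in 16-bit precision (2 bytes per parameter), and that a hypothetical 20% overhead factor covers activations and caches. Actual requirements depend on the model, the context length, and the serving framework.

```python
from typing import Optional

# GPU memory per instance type in GB, read from the specifications table.
# Ordered from smallest to largest; dicts preserve insertion order.
INSTANCE_GPU_MEMORY_GB = {
    "GPU Basic": 8,
    "GPU Advanced": 16,
    "GPU Compute-Optimized": 48,
    "GPU Compute-Optimized (Multi-GPU)": 96,  # 2 x 48 GB
}


def estimate_weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone: parameters x bytes per parameter.

    One billion parameters at 1 byte each is roughly 1 GB, so the result
    is simply params_billions * bytes_per_param.
    """
    return params_billions * bytes_per_param


def pick_instance(params_billions: float, overhead_factor: float = 1.2) -> Optional[str]:
    """Return the smallest instance type whose GPU memory covers the estimate.

    overhead_factor is an assumed safety margin, not a documented value.
    Returns None when no listed instance type is large enough.
    """
    needed_gb = estimate_weight_memory_gb(params_billions) * overhead_factor
    for name, gpu_gb in INSTANCE_GPU_MEMORY_GB.items():
        if gpu_gb >= needed_gb:
            return name
    return None


if __name__ == "__main__":
    # An 8B-parameter model such as Qwen/Qwen3-8B in 16-bit precision.
    print(pick_instance(8))
```

Treat the result as a starting point only, and confirm it with the Quick Experience feature or a test deployment.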
Step 4: Call the model service
After the service is successfully deployed, you can interact with the model in the following two ways.
Method 1: Online call
Use this method in the console to quickly test the input and output of the deployed service.
On the service product page, click the Online Debugging tab.
The system automatically provides sample request parameters, which you can modify as needed.
Click Send Request.
View the model response in the Response Result section on the right.
Method 2: API call
Use this method to integrate the model's capabilities into your application using standard HTTP requests.
Obtain service credentials and endpoint
Before making API calls, obtain the following two pieces of information from the model product page:
API Endpoint: The dedicated URL to access the service.
Authentication Token (Bearer Token): Used for identity authentication during API calls.
Note: To secure your service, FunModel recommends enabling authenticated access.
When authentication is enabled, include a valid Authorization: Bearer <YOUR_TOKEN> header in every HTTP request. This prevents unauthorized access. Disabling authentication means that anyone who knows your API endpoint can call the service. This poses a security risk, so disable authentication only for temporary testing in trusted private networks.
Construct and send the request
Different models on FunModel may follow different API specifications.
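Whatever the model type, authenticated requests carry the same two headers. The following Python sketch shows one way to build them without hard-coding the token in source code. The environment variable name FUNMODEL_TOKEN and the helper name build_headers are illustrative assumptions, not part of FunModel.

```python
import os
from typing import Dict, Optional


def build_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Build the HTTP headers for a FunModel API call.

    The Authorization header is added only when a token is supplied,
    which matches a service that has authentication enabled.
    """
    headers = {"Content-Type": "application/json"}
    if token and token.strip():
        headers["Authorization"] = f"Bearer {token.strip()}"
    return headers


if __name__ == "__main__":
    # FUNMODEL_TOKEN is a hypothetical variable name chosen for this example.
    headers = build_headers(os.environ.get("FUNMODEL_TOKEN"))
    print(sorted(headers))
```

Reading the token from the environment keeps it out of version control.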
Large Language Model (LLM) — OpenAI API Compatible
LLM services deployed on FunModel provide API endpoints that are compatible with the OpenAI v1/chat/completions interface. This makes it easy to migrate existing applications.
The following is a curl example for calling the Qwen/Qwen3-8B model. Replace the url and Authorization values with your actual service values.
curl --request POST \
  --url https://YOUR_SERVICE_URL/v1/chat/completions \
  --header 'Authorization: Bearer YOUR_BEARER_TOKEN' \
  --header 'content-type: application/json' \
  --data '{
    "model": "Qwen/Qwen3-8B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, please introduce Hangzhou."}
    ],
    "stream": false,
    "temperature": 0.8,
    "max_tokens": 1024
  }'
Key request parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | The model ID to call. It must match the deployed model. |
| messages | array | Yes | The conversation history. Each message contains role and content. |
| stream | boolean | No | Whether to return responses in streaming mode. Default is false. |
| temperature | float | No | Controls randomness in generated text. Valid range is 0 to 2. Higher values produce more creative responses. |
| max_tokens | integer | No | Maximum length of generated content per request. |
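The same request can be sent from application code. The following Python sketch uses only the standard library and mirrors the curl example and the parameters listed above. The function names are illustrative. The response parsing assumes the OpenAI-style layout (choices[0].message.content), which follows from the stated OpenAI compatibility but is worth verifying against your own service.

```python
import json
import urllib.request


def build_chat_request(base_url, token, messages, model="Qwen/Qwen3-8B",
                       temperature=0.8, max_tokens=1024):
    """Build a non-streaming POST request for the v1/chat/completions endpoint."""
    if not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")
    payload = {
        "model": model,            # must match the deployed model
        "messages": messages,      # list of {"role": ..., "content": ...}
        "stream": False,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (replace the placeholders with your endpoint and token):
# request = build_chat_request(
#     "https://YOUR_SERVICE_URL", "YOUR_BEARER_TOKEN",
#     [{"role": "user", "content": "Hello, please introduce Hangzhou."}],
# )
# print(send_chat_request(request))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.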
Traditional models
Traditional model APIs usually use simpler formats. The following is a curl example for calling an OCR model.
curl --request POST \
  --url https://YOUR_OCR_SERVICE_URL/ \
  --header 'Authorization: Bearer YOUR_BEARER_TOKEN' \
  --header 'content-type: application/json' \
  --data '{"input":{"image":"http://modelscope.oss-cn-beijing.aliyuncs.com/demo/images/image_ocr_recognition.jpg"}}'
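A Python equivalent of the OCR call might look like the following sketch. It reproduces the request body from the curl example. Because the response format of the OCR model is not described here, the sketch returns the parsed JSON as-is instead of assuming particular fields. The function names are illustrative.

```python
import json
import urllib.request


def build_ocr_request(service_url, token, image_url):
    """Build a POST request whose body matches the curl example: {"input": {"image": ...}}."""
    payload = {"input": {"image": image_url}}
    return urllib.request.Request(
        url=service_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_ocr(request, timeout=60):
    """Send the request and return the decoded JSON response without interpreting it."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (replace the placeholders with your endpoint and token):
# request = build_ocr_request(
#     "https://YOUR_OCR_SERVICE_URL/", "YOUR_BEARER_TOKEN",
#     "http://modelscope.oss-cn-beijing.aliyuncs.com/demo/images/image_ocr_recognition.jpg",
# )
# print(call_ocr(request))
```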
Billing information
FunModel is free to use for deployment and management. However, deploying and calling services consumes other Alibaba Cloud resources, which are charged to your account. These charges appear on your Alibaba Cloud bill and include the following:
Function Compute (FC) fees: The core computing cost for model execution. Billed on a pay-as-you-go basis based on your selected instance type and running time.
File storage (NAS) fees: Model files are stored in NAS. Billed on a pay-as-you-go basis based on the storage space used.
Simple Log Service (SLS) fees: Service logs are collected in SLS for querying. Billed based on usage.
To avoid service interruption due to an overdue payment, ensure that you have a sufficient account balance. Most Alibaba Cloud services offer a free quota. Usage beyond the free quota is billed on a pay-as-you-go basis. For more information, see the official pricing documentation for each service.
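The free-quota rule described above reduces to simple arithmetic: only usage beyond the quota is charged. The following Python sketch illustrates that calculation. It contains no real prices or quotas. All values are inputs that you would take from the official pricing documentation, and actual billing rules for each service may be more detailed than this model.

```python
def billable_usage(usage, free_quota):
    """Usage that exceeds the free quota; never negative."""
    return max(0.0, usage - free_quota)


def estimate_cost(usage, free_quota, unit_price):
    """Pay-as-you-go cost for one service: billable usage x unit price."""
    return billable_usage(usage, free_quota) * unit_price


def estimate_total(items):
    """Sum the cost over several services.

    items maps a service name (for example "FC", "NAS", "SLS") to a
    (usage, free_quota, unit_price) tuple in that service's own units.
    """
    return sum(estimate_cost(*values) for values in items.values())
```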
Troubleshooting
Logs are the primary source of information for diagnosing issues. Always check the logs first to identify the root cause of a problem.
Deployment failure
If a model deployment fails, check the logs on the model product page for detailed error messages.
OOMKilled (Out of Memory): Insufficient memory or GPU memory. This error often occurs when a large model is assigned to an instance that is too small. Upgrade to an instance with higher specifications.
ImagePullBackoff/ErrImagePull: Failed to pull the runtime image. Check your network configuration or contact technical support.
Download timeout: The model file download timed out. This usually happens because the model is large or the network is unstable. Try redeploying the model.
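If you copy deployment logs out of the console, a small script can flag the errors listed above. The following Python sketch does a case-insensitive keyword search. The exact log format is an assumption, and the suggested actions only restate the guidance in this section.

```python
# Error keywords and suggested actions, taken from the list above.
DEPLOYMENT_ERROR_HINTS = {
    "OOMKilled": "Upgrade to an instance with higher specifications.",
    "ImagePullBackoff": "Check your network configuration or contact technical support.",
    "ErrImagePull": "Check your network configuration or contact technical support.",
    "Download timeout": "Try redeploying the model.",
}


def diagnose_deployment_log(log_text):
    """Return (keyword, suggested action) pairs for every known error found.

    Matching ignores case, so variants such as "ImagePullBackOff" also match.
    """
    lowered = log_text.lower()
    return [
        (keyword, hint)
        for keyword, hint in DEPLOYMENT_ERROR_HINTS.items()
        if keyword.lower() in lowered
    ]


if __name__ == "__main__":
    # A made-up log line, used only to demonstrate the function.
    for keyword, hint in diagnose_deployment_log("container exited: OOMKilled"):
        print(f"{keyword}: {hint}")
```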
Call failure
If a model service call fails, first check the HTTP status code in the response. Then, in the service logs on the model product page, search for detailed logs using the request ID (x-fc-request-id).
403 Forbidden: Authentication failed. This error usually means your API key (Bearer Token) is invalid. Check the following:
Is the Authorization request header formatted correctly? It must be Bearer sk-xxxxxxxx.
Is the provided Bearer Token complete, correct, and not expired?
Check the Message field in the response body. It provides the exact reason, such as access denied due to invalid bearer token.
429 Too Many Requests: Your call frequency exceeds the service's concurrency limit. You can increase the number of instances in the Advanced Settings or optimize your calling logic.
502 Bad Gateway/504 Gateway Timeout: Backend service error or timeout. Check the service logs for crashes or inference timeouts.
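One way to "optimize your calling logic" is to handle these status codes in the client. The following Python sketch retries 429, 502, and 504 responses with exponential backoff and fails immediately on other errors such as 403. Two points are assumptions: that x-fc-request-id is returned as a response header, and that these three codes are worth retrying for your workload. The retry count and delays are illustrative.

```python
import time
import urllib.error
import urllib.request

# Hints paraphrased from the troubleshooting list above.
STATUS_HINTS = {
    403: "Authentication failed: check the Authorization header and the Bearer Token.",
    429: "Concurrency limit exceeded: add instances or slow down the caller.",
    502: "Backend error: check the service logs for crashes.",
    504: "Backend timeout: check the service logs for inference timeouts.",
}
RETRYABLE_STATUS = {429, 502, 504}  # assumption: 403 will not succeed on retry


def describe_failure(status_code, request_id=None):
    """Format a log line with the status code, request ID, and a hint."""
    hint = STATUS_HINTS.get(status_code, "Unexpected status: check the service logs.")
    return f"HTTP {status_code} (x-fc-request-id={request_id}): {hint}"


def backoff_delay(attempt, base_delay=1.0):
    """Exponential backoff: base_delay, then double it for each further attempt."""
    return base_delay * 2 ** (attempt - 1)


def call_with_retry(request, max_attempts=3):
    """Send a urllib request, retrying retryable HTTP errors up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(request, timeout=60) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            request_id = err.headers.get("x-fc-request-id") if err.headers else None
            print(describe_failure(err.code, request_id))
            if err.code not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            time.sleep(backoff_delay(attempt))
```

The printed request ID can then be used to search the service logs, as described above.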
Best practices
Cost control: Before deployment, validate model performance using the Quick Experience feature to avoid unnecessary resource costs. For compute-intensive tasks such as LLM inference, choose a compute-optimized instance to balance cost and efficiency.
Performance monitoring: On the service product page, use the Monitoring tab to track key metrics such as function invocation count, function execution time, memory usage, and GPU memory usage. You can also set up alert rules to detect and resolve performance issues promptly.
Continuous optimization and configuration tuning: Adjust service configurations dynamically based on workload and monitoring data to balance performance and cost. For example, you can increase the instance count to handle high concurrency or upgrade the instance type to reduce inference latency.
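When sizing the instance count for high concurrency, Little's law gives a useful first estimate: the average number of requests in flight equals the request rate multiplied by the average latency. The following Python sketch applies it. The per-instance concurrency and the headroom factor are assumptions that you would replace with values from your own monitoring data.

```python
import math


def required_instances(requests_per_second, avg_latency_seconds,
                       concurrency_per_instance=1, headroom=1.2):
    """Estimate the instance count needed for a given load.

    in-flight requests = request rate x average latency (Little's law).
    headroom is an assumed safety margin for traffic spikes.
    Always returns at least 1.
    """
    in_flight = requests_per_second * avg_latency_seconds
    return max(1, math.ceil(in_flight * headroom / concurrency_per_instance))


if __name__ == "__main__":
    # Illustrative load: 5 requests per second, 2 seconds per inference,
    # 4 concurrent requests per instance.
    print(required_instances(5, 2.0, concurrency_per_instance=4))
```

Compare the estimate with the function execution time and invocation count shown on the Monitoring tab, and adjust it as the workload changes.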