Deploy large language models on PAI-EAS

Quick start: Deploy an open source model

This example shows how to deploy the open source model Qwen3-8B. The same process applies to other supported models.

Step 1: Create a service

Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
Click Deploy Service. In the Scenario-based Model Deployment area, click LLM Deployment.

Configure the following key parameters:

Parameter	Value
Model Settings	Select Public Model, then search for and select Qwen3-8B.
Inference Engine	Select vLLM (Recommended, OpenAI API-compatible).
Deployment Template	Select Single-Node. The system automatically fills in recommended parameters, such as instance type and image, based on the template.

Click Deploy. The service deployment takes about 5 minutes. When the service status changes to Running, the deployment is successful.

Note
If the service deployment fails, see Service deployment and status issues for troubleshooting.

Step 2: Verify the service

Once deployed, use online debugging to verify that the service is running correctly.

Click the service name to open its details page, then switch to the Online Debugging tab.

Configure the following request parameters:

Parameter	Value
Request method	POST
URL path	Enter the complete URL path in the format `/api/predict/{service_name}/v1/chat/completions`, where `{service_name}` is the name of your deployed service. Example: `/api/predict/llm_qwen3_8b_test/v1/chat/completions`.
Body	`{ "model": "Qwen3-8B", "messages": [ {"role": "user", "content": "Hello!"} ], "max_tokens": 1024 }`
Headers	Ensure that the request includes `Content-Type: application/json`.

Click Send Request.

If the request returns status code 200 and a JSON response whose object type is chat.completion, the service is deployed and running correctly. The model's reply is in the content field of the response body.
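
The same check can be scripted. The following is a minimal sketch, assuming the response follows the OpenAI chat completion schema. The helper names `build_debug_path` and `extract_reply` are hypothetical and not part of any PAI SDK, and the sample response is hand-written for illustration.

```python
import json


def build_debug_path(service_name: str) -> str:
    """Return the Online Debugging URL path for a service, in the format shown in the table above."""
    return f"/api/predict/{service_name}/v1/chat/completions"


def extract_reply(status_code: int, body: str) -> str:
    """Return the model's reply, or raise ValueError if the response does not look healthy."""
    if status_code != 200:
        raise ValueError(f"unexpected status code: {status_code}")
    payload = json.loads(body)
    if payload.get("object") != "chat.completion":
        raise ValueError(f"unexpected object type: {payload.get('object')!r}")
    return payload["choices"][0]["message"]["content"]


# Hand-written sample response (assumption); real responses carry more fields.
sample = json.dumps({
    "object": "chat.completion",
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hi there!"}}],
})

print(build_debug_path("llm_qwen3_8b_test"))
print(extract_reply(200, sample))
```

Raising an error on anything other than a 200 status and a chat.completion object keeps a failed deployment from being mistaken for an empty reply.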

Call the API

Before calling the service, go to the View Endpoint Information tab on the service details page to get the endpoint and token.

The following examples show how to call the service API:

cURL

```
curl -X POST <EAS_ENDPOINT>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "hello"
            }
        ],
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_p": 0.8,
        "stream": true
    }'
```

In the preceding code, replace the following parameters:

Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.
Replace <model_name> with the name of the model. For vLLM or SGLang, you can get the model name from the model list API at <EAS_ENDPOINT>/v1/models.
```
curl -X GET <EAS_ENDPOINT>/v1/models -H "Authorization: <EAS_TOKEN>"
```
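
The model list API returns JSON, so the model name can be extracted programmatically. The following is a minimal sketch, assuming the response follows the OpenAI list-models schema (a `data` array of objects with an `id` field). The function name `model_ids` is hypothetical, and the sample payload is hand-written.

```python
import json


def model_ids(models_response: str) -> list[str]:
    """Return the model names found in a /v1/models response body."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


# Hand-written sample payload (assumption), shaped like an OpenAI-compatible model list.
sample = '{"object": "list", "data": [{"id": "Qwen3-8B", "object": "model"}]}'

print(model_ids(sample))  # ['Qwen3-8B']
```

For a single-model deployment, the first entry of the returned list is the value to use for the model field.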

OpenAI SDK

We recommend the official OpenAI Python SDK for interacting with the service. Install it first by running pip install openai.

```
from openai import OpenAI

# 1. Configure the client.
# Replace <EAS_TOKEN> with the token of your deployed service.
openai_api_key = "<EAS_TOKEN>"
# Replace <EAS_ENDPOINT> with the endpoint of your deployed service.
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# 2. Get the model name.
# For BladeLLM, set model = "". BladeLLM does not require a model parameter
# and does not support getting the model name by using client.models.list().
# Set it to an empty string to meet the mandatory parameter requirement of the OpenAI SDK.
models = client.models.list()
model = models.data[0].id
print(model)

# 3. Send a chat request.
# Both streaming (stream=True) and non-streaming (stream=False) outputs are supported.
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    model=model,
    top_p=0.8,
    temperature=0.7,
    max_tokens=1024,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        # Some chunks carry no text (for example, the role-only first chunk); skip them
        # so that the literal string "None" is never printed.
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()
else:
    result = chat_completion.choices[0].message.content
    print(result)
```

Python requests

If you do not want to add a dependency on the OpenAI SDK, you can use the requests library.

```
import json
import requests

# Replace <EAS_ENDPOINT> with the endpoint of your deployed service.
EAS_ENDPOINT = "<EAS_ENDPOINT>"
# Replace <EAS_TOKEN> with the token of your deployed service.
EAS_TOKEN = "<EAS_TOKEN>"
# Replace <model_name> with the model name. You can get it from the <EAS_ENDPOINT>/v1/models API.
# For BladeLLM, this API is not supported. You can omit the "model" field or set it to "".
model = "<model_name>"

url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

stream = True
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]

req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024,
    "model": model,
}
response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)
# Fail early on HTTP errors instead of trying to parse an error body as a completion.
response.raise_for_status()

if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        # Process Server-Sent Events (SSE) lines of the form "data: <payload>".
        if msg.startswith("data:"):
            # Strip the prefix and surrounding whitespace; the space after "data:" is optional.
            info = msg[len("data:"):].strip()
            if info == "[DONE]":
                break
            resp = json.loads(info)
            choices = resp.get("choices") or []
            content = choices[0]["delta"].get("content") if choices else None
            if content is not None:
                print(content, end="", flush=True)
else:
    resp = response.json()
    print(resp["choices"][0]["message"]["content"])
```
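
The SSE handling in the requests example can be factored into a small generator, which makes it testable without a live service. The following is a minimal sketch, assuming the stream uses the OpenAI chunk format (`data:` lines that carry a `choices[0].delta.content` field and a final `[DONE]` marker). The function name `iter_sse_content` is hypothetical, and the sample stream is hand-written.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragments carried by the SSE lines of a streamed chat completion."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content


# Hand-written sample stream (assumption), shaped like OpenAI-compatible chunks.
sample_stream = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'data:{"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]

reply = "".join(iter_sse_content(sample_stream))
print(reply)  # Hello
```

With a live service, the generator can consume `response.iter_lines()` directly, and joining the fragments yields the complete reply.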

Build a local web UI

Gradio is a Python UI library that helps you quickly build interactive interfaces for models. Follow these steps to run a Gradio web UI locally.

Download the code

GitHub link | OSS link

Prepare the environment

Use Python 3.10 or later, and install the dependencies by running pip install openai gradio.

Start the web application

In a terminal, run the following command. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of the deployed service.
```
python webui_client.py --eas_endpoint "<EAS_ENDPOINT>" --eas_token "<EAS_TOKEN>"
```
After a successful startup, the terminal prints a local URL (typically http://127.0.0.1:7860). Open this URL in a browser to access the web UI.

Integrate with third-party applications

You can integrate a PAI-EAS service with various clients and development tools that support the OpenAI API. The key configuration settings are the service endpoint, token, and model name.
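
Because each client expects the endpoint in a slightly different form (Dify and Chatbox expect /v1 appended, whereas Cherry Studio expects it omitted), it can help to keep the three settings in one place and derive the URL per client. The following is a minimal sketch; the class `EasClientSettings` and the example endpoint are hypothetical placeholders, not real PAI-EAS values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EasClientSettings:
    """The three values every OpenAI-compatible client needs."""
    endpoint: str  # public endpoint of the service, with or without a trailing /v1
    token: str
    model: str

    def base_url(self, include_version: bool = True) -> str:
        """Return the endpoint with exactly one /v1 suffix, or with none."""
        root = self.endpoint.rstrip("/")
        if root.endswith("/v1"):
            root = root[: -len("/v1")]
        return f"{root}/v1" if include_version else root


# Placeholder values (assumption); replace them with your own endpoint, token, and model name.
settings = EasClientSettings(
    endpoint="http://example.invalid/api/predict/llm_qwen3_8b_test/",
    token="<EAS_TOKEN>",
    model="Qwen3-8B",
)

print(settings.base_url())                       # for clients that expect /v1
print(settings.base_url(include_version=False))  # for clients that append /v1 themselves
```

Normalizing the suffix in one helper avoids the common mistake of ending up with a doubled /v1/v1 path.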

Dify

Install the OpenAI-API-compatible model provider

Click your avatar in the upper-right corner and select Settings. In the left pane, select Model Providers. If OpenAI-API-compatible is not in the Model List, find it in the list below and click Install.

Add a model

Click Add Model in the lower-right corner of the OpenAI-API-compatible card and configure the following parameters:
- Model type: Select LLM.
- Model name: For a vLLM deployment, send a GET request to the /v1/models endpoint to get the name. For this example, enter Qwen3-8B.
- API key: Enter your PAI-EAS service token.
- API endpoint URL: Enter the public endpoint of your PAI-EAS service and append /v1 to the end.

Test the integration

1. On the Dify main page, click Create Blank App, select the Chatflow type, enter an application name and other information, and then click Create.
2. Click the LLM node, select the model that you added, and set the context and prompts. Select Qwen3-8B CHAT from the OpenAI-API-compatible category. In the SYSTEM prompt, enter a role description and reference the context variable sys.query. In the USER area, add the sys.query and sys.files variables. You can adjust model parameters such as Max Tokens (default is 512), Presence Penalty, and Thinking Mode as needed.
3. Click Preview in the upper-right corner and enter a question.

The bot processes the question through the workflow and returns a reply. The top of the conversation area displays a Deep thinking status and the elapsed time.

Chatbox

Go to Chatbox to download and install the client for your device, or click Launch Web App for the web version. This example uses a Mac with an M3 chip.
Add a model provider. Click Settings, add a model provider, enter a name such as pai, and select OpenAI API Compatible as the API mode.
Select the pai provider and configure the following parameters:
- API key: Enter your PAI-EAS service token.
- API host: Enter the public endpoint of your PAI-EAS service and append /v1 to the end.
- API path: Leave this field empty.
- Model: Click Get to add the model. The BladeLLM inference engine does not support API-based model retrieval. If you are using BladeLLM, add the model manually by clicking New.
Test the chat. Click New Chat and select the model service in the lower-right corner of the text input box.

The top of the chat interface displays the system prompt, such as "You are a helpful assistant." Enter a question such as "Who are you?" in the text box and send it. The assistant's reply area shows a collapsed Deep thinking section with the thinking time, followed by the model's response.

Cherry Studio

Install the client

Visit Cherry Studio to download and install the client. You can also download it from https://github.com/CherryHQ/cherry-studio/releases.

Configure the model service

1. Click the settings icon in the lower-left corner. Under the Model Service section, click Add. In Provider name, enter a custom name, such as PAI, and set the provider type to OpenAI. Click OK.
2. In the API key field, enter your PAI-EAS service token. In the API address field, enter the public endpoint of your PAI-EAS service, but omit the v1 version number at the end. If the URL ends with a # character, the system forces the use of the /v1/chat/completions path.
3. Click Add and enter the model name in the Model ID field. For vLLM deployments, you can obtain the name by sending a GET request to the /v1/models API. For example, enter Qwen3-8B. The name is case-sensitive. After the configuration is complete, the Model section automatically identifies and lists available models, such as Qwen3-8B.
4. Click Test next to the API key input field to verify connectivity.

Quickly test the model

Return to the chat interface, select the model at the top, and start a conversation.

For example, if you enter Who are you? and the Qwen3-8B model responds with a self-introduction, it indicates that the model is deployed and working correctly.

Billing

Costs include but are not limited to the following. For more information, see Billing of PAI-EAS.

Compute fees: This is the primary cost. When you create a PAI-EAS service, choose pay-as-you-go or subscription resources based on your needs.
Storage fees: If you use a custom model, the model files are stored in Object Storage Service (OSS), which incurs storage fees.

Production use

Choose the right model

Define your application scenario:
- General conversation: Always choose an instruction-tuned model, not a foundation model, to ensure the model understands and follows your instructions.
- Code generation: Choose specialized code models, such as the Qwen3-Coder series, because they usually perform much better than general-purpose models on code-related tasks.
- Domain-specific tasks: If your task is highly specialized, such as in finance or law, consider finding a model that has been fine-tuned for that domain, or fine-tuning a general-purpose model yourself.
Balance performance and cost: Larger models are generally more capable but require more computing resources and have higher inference costs. Start with a smaller model, such as a 7B model, for validation. If its performance does not meet your needs, try a larger model.
Refer to authoritative benchmarks: You can consult leaderboards such as OpenCompass and LMSys Chatbot Arena. These leaderboards evaluate models on multiple dimensions, including reasoning, coding, and math, and can provide valuable guidance for model selection.
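
To make the performance and cost trade-off concrete, the GPU memory needed just to hold a model's weights can be estimated as parameter count times bytes per parameter. The following is a rough sketch under stated assumptions: it uses nominal parameter counts, covers weights only, and ignores the KV cache, activations, and engine overhead, so real deployments need additional headroom. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Estimate the memory (GiB) needed to hold model weights only.

    bytes_per_param: 2.0 for 16-bit weights (FP16/BF16), 1.0 for 8-bit, 0.5 for 4-bit.
    """
    return num_params_billion * 1e9 * bytes_per_param / 2**30


# Nominal sizes (assumption); actual parameter counts differ slightly from the names.
for size in (8, 32, 72):
    print(f"{size}B at 16-bit: {weight_memory_gib(size):.1f} GiB")

print(f"8B at 4-bit: {weight_memory_gib(8, 0.5):.1f} GiB")
```

By this estimate, an 8B model at 16-bit precision needs about 14.9 GiB for weights alone, which is why starting validation with a smaller model keeps the instance type and cost modest.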

Choose the right inference engine

vLLM/SGLang: As mainstream choices in the open-source community, they offer broad model support and extensive community documentation and examples, which makes them easy to integrate and troubleshoot.
BladeLLM: An inference engine developed by the Alibaba Cloud PAI team. It is deeply optimized for specific models, especially the Qwen series, and may achieve higher performance and lower GPU memory usage.

Inference optimization

Deploy an LLM with intelligent routing: Dynamically distributes requests based on real-time metrics such as token throughput and GPU memory usage. This balances compute and memory load across inference instances, which improves cluster resource utilization and system stability. It is suitable for deployments with multiple inference instances and unevenly distributed request loads.
Deploy Mixture of Experts (MoE) models by using expert parallelism and PD separation: Uses techniques such as expert parallelism (EP) and Prefill-Decode (PD) separation to improve inference throughput and reduce deployment costs.
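
The idea behind metric-based routing can be illustrated with a toy example. The following sketch is not the PAI-EAS routing algorithm. It only shows the general principle of sending each request to the least-loaded instance. The class `InstanceMetrics`, the equal weighting of the two metrics, and the sample numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstanceMetrics:
    name: str
    tokens_per_second: float      # token throughput the instance is currently serving
    gpu_memory_used_ratio: float  # 0.0 (empty) to 1.0 (full)


def load_score(m: InstanceMetrics, max_tokens_per_second: float) -> float:
    """Combine both metrics into one load score in [0, 1]; lower means less loaded."""
    throughput_ratio = min(m.tokens_per_second / max_tokens_per_second, 1.0)
    return 0.5 * throughput_ratio + 0.5 * m.gpu_memory_used_ratio


def pick_instance(instances: list[InstanceMetrics], max_tokens_per_second: float = 2000.0) -> InstanceMetrics:
    """Route the next request to the instance with the lowest load score."""
    return min(instances, key=lambda m: load_score(m, max_tokens_per_second))


# Made-up metrics for three instances (assumption).
instances = [
    InstanceMetrics("instance-a", tokens_per_second=1800, gpu_memory_used_ratio=0.90),
    InstanceMetrics("instance-b", tokens_per_second=400, gpu_memory_used_ratio=0.35),
    InstanceMetrics("instance-c", tokens_per_second=900, gpu_memory_used_ratio=0.50),
]

print(pick_instance(instances).name)  # instance-b
```

Compared with round-robin distribution, a load-aware choice like this avoids piling long-running requests onto an instance that is already near its memory limit.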

FAQ

Q: Service is stuck in the Pending state

Follow these steps to troubleshoot the issue:

Check the instance status: On the service list page, click the service name to go to the details page. In the Service Instance section, check the instance status. If it shows insufficient resources, the current public resource group does not have enough capacity.
Solutions (in order of priority):
1. Change the instance type. Return to the deployment page and select a different GPU model.
2. Use dedicated resources. For Resource Type, select an EAS resource group to use dedicated resources (the resource group must be created in advance).
Preventive measures:
1. Create a dedicated resource group to avoid the limitations of public resources.
2. During peak hours, test in multiple regions.

Q: Call errors

The API call returns the following error: Unsupported Media Type: Only 'application/json' is allowed

Ensure that the request headers contain Content-Type: application/json.

The API call returns the following error: The model '<model_name>' does not exist.

The vLLM inference engine requires that the model field is specified correctly. You can obtain the model name by calling the /v1/models endpoint with a GET request.
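
Both errors can be caught before a request is sent. The following is a minimal sketch of such a pre-flight check. The function name `preflight_issues` is hypothetical, and `available_models` is assumed to hold the names returned by the /v1/models endpoint.

```python
def preflight_issues(headers: dict[str, str], body: dict, available_models: list[str]) -> list[str]:
    """Return human-readable problems that would trigger the two errors above."""
    issues = []
    # Header names are case-insensitive, so compare them in lowercase.
    content_type = next((v for k, v in headers.items() if k.lower() == "content-type"), "")
    if not content_type.lower().startswith("application/json"):
        issues.append("Set the Content-Type header to application/json.")
    model = body.get("model")
    if model not in available_models:
        issues.append(f"Model {model!r} is not served; available models: {available_models}.")
    return issues


# Example: wrong content type and a model name with the wrong case.
problems = preflight_issues(
    headers={"Content-Type": "text/plain"},
    body={"model": "qwen3-8b", "messages": []},
    available_models=["Qwen3-8B"],
)
for problem in problems:
    print(problem)
```

An empty list means the request passes both checks. Because model names are case-sensitive, the check compares them exactly.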

For more questions, see EAS FAQ.