PAI-EAS provides a one-click solution to deploy popular large language models (LLMs) like DeepSeek and Qwen. This eliminates the complex environment configuration, performance tuning, and cost management required for manual deployment.
Quick start: Deploy an open source model
This example deploys the open source model Qwen3-8B. The same process applies to other supported models.
Step 1: Create a service
Log on to the PAI console. Select a region on the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
Click Deploy Service. In the Scenario-based Model Deployment section, click LLM Deployment.
Configure the following key parameters:
Model Settings: Select Public Model, then search for and select Qwen3-8B.
Inference Engine: Select vLLM (recommended, OpenAI API-compatible).
Deployment Template: Select Single Machine. The system automatically populates the recommended instance type and image.
Click Deploy. Deployment takes about 5 minutes. When the service status changes to Running, the deployment is successful.
Note: If the service fails to deploy, see Service deployment and status issues.
Step 2: Verify with online debugging
After the service deploys successfully, use online debugging to verify that it is running correctly.
Click the service name to go to the details page, and then switch to the Online Debugging tab.
Configure the following request parameters:
Request method: POST
URL path: Append /v1/chat/completions to the existing URL. For example: /api/predict/llm_qwen3_8b_test/v1/chat/completions.
Body: { "model": "Qwen3-8B", "messages": [ {"role": "user", "content": "Hello!"} ], "max_tokens": 1024 }
Headers: Ensure the request headers include Content-Type: application/json.
Click Send Request. You should receive a response that contains the model's reply.
A 200 status code and a response body containing a chat.completion JSON object indicate that the service is running correctly. The model's reply is in the content field.
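The same check can be scripted when you save the debugging response. The following is a minimal sketch, assuming the response body follows the OpenAI-compatible chat.completion shape. The function name extract_reply and the sample body are hypothetical and used only for illustration.

```python
import json


def extract_reply(status_code: int, body: str) -> str:
    """Return the model's reply, or raise ValueError if the response looks wrong."""
    if status_code != 200:
        raise ValueError(f"unexpected status code: {status_code}")
    payload = json.loads(body)
    if payload.get("object") != "chat.completion":
        raise ValueError(f"unexpected object type: {payload.get('object')!r}")
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # The reply text is in the content field of the first choice's message.
    return choices[0]["message"]["content"]


# Hypothetical sample body, shaped like an OpenAI-compatible chat.completion object.
sample_body = json.dumps(
    {
        "object": "chat.completion",
        "model": "Qwen3-8B",
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": "Hello! How can I help you today?",
                },
                "finish_reason": "stop",
            }
        ],
    }
)

print(extract_reply(200, sample_body))
```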
Call the service via API
Before making a call, go to the Overview tab on the service details page, click View Endpoint Information, and get the endpoint and token.
The following code shows how to call the service.
cURL
curl -X POST <EAS_ENDPOINT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: <EAS_TOKEN>" \
  -d '{
    "model": "<model_name>",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "hello"
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.8,
    "stream": true
  }'
Where:
Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your service's endpoint and token.
Replace <model_name> with the model name. For vLLM/SGLang, get the model name by calling the <EAS_ENDPOINT>/v1/models API endpoint:
curl -X GET <EAS_ENDPOINT>/v1/models -H "Authorization: <EAS_TOKEN>"
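If you script the model lookup, the /v1/models response can be parsed as shown in the following minimal sketch. It assumes that the response follows the OpenAI-compatible list shape (a data array whose entries carry an id field). The helper name model_ids and the sample body are hypothetical.

```python
import json


def model_ids(models_body: str) -> list[str]:
    """Extract model names from the JSON body returned by GET /v1/models."""
    payload = json.loads(models_body)
    return [entry["id"] for entry in payload.get("data", [])]


# Hypothetical sample body; a real response typically contains more fields per entry.
sample = '{"object": "list", "data": [{"id": "Qwen3-8B", "object": "model"}]}'
print(model_ids(sample))
```

Use one of the returned names as the value of the model field in the chat request.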
OpenAI SDK
We recommend using the official OpenAI Python SDK to interact with the service. Install it first: pip install openai.
from openai import OpenAI

# 1. Configure the client.
# Replace <EAS_TOKEN> with the token of your deployed service.
openai_api_key = "<EAS_TOKEN>"
# Replace <EAS_ENDPOINT> with the endpoint of your deployed service.
openai_api_base = "<EAS_ENDPOINT>/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# 2. Get the model name.
# For BladeLLM, set model = "". BladeLLM does not use the model parameter or support retrieval with client.models.list().
# The empty string only satisfies the OpenAI SDK's requirement that the parameter be set.
models = client.models.list()
model = models.data[0].id
print(model)

# 3. Send a chat request.
# Streaming (stream=True) and non-streaming (stream=False) outputs are supported.
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "hello"},
    ],
    model=model,
    top_p=0.8,
    temperature=0.7,
    max_tokens=1024,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        # Some chunks carry no text (for example, a role-only chunk), so skip empty deltas.
        if chunk.choices and chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()
else:
    result = chat_completion.choices[0].message.content
    print(result)
Python requests library
If you prefer not to use the OpenAI SDK, you can use the requests library.
import json

import requests

# Replace <EAS_ENDPOINT> with the endpoint of your deployed service.
EAS_ENDPOINT = "<EAS_ENDPOINT>"
# Replace <EAS_TOKEN> with the token of your deployed service.
EAS_TOKEN = "<EAS_TOKEN>"
# Replace <model_name> with the model name. You can get the model name by calling the <EAS_ENDPOINT>/v1/models API endpoint.
# (For BladeLLM, this API is not supported. You can omit the "model" field or set it to "".)
model = "<model_name>"

url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

stream = True
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]
req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024,
    "model": model,
}

response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)
# Fail early on HTTP errors instead of trying to parse an error body.
response.raise_for_status()

if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        # The following code handles streaming responses in Server-Sent Events (SSE) format.
        if msg.startswith("data:"):
            # Strip the "data:" prefix and any whitespace that follows it.
            info = msg[len("data:"):].strip()
            if info == "[DONE]":
                break
            resp = json.loads(info)
            choices = resp.get("choices") or []
            if choices and choices[0]["delta"].get("content") is not None:
                print(choices[0]["delta"]["content"], end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])
Build a local web UI with Gradio
Gradio is a user-friendly Python library for quickly creating interactive user interfaces for your machine learning models.
Download the code
Prepare the environment
Python 3.10 or later is required. Install the dependencies: pip install openai gradio.
Start the web application
Run the following command in your terminal. Replace <EAS_ENDPOINT> and <EAS_TOKEN> with the endpoint and token of your deployed service.
python webui_client.py --eas_endpoint "<EAS_ENDPOINT>" --eas_token "<EAS_TOKEN>"
After the application starts, a local URL, typically http://127.0.0.1:7860, is provided. Open this URL in a browser to access the web UI.
Integrate with third-party applications
You can integrate PAI-EAS services with various clients and development tools that support the OpenAI API. The key configuration parameters are the service endpoint, token, and model name.
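Some clients in this section expect the API address to end with /v1, while others expect it without the suffix. The following minimal sketch normalizes an endpoint either way. The helper names ensure_v1 and strip_v1 and the example endpoint are hypothetical and are not part of any PAI-EAS SDK.

```python
def strip_v1(endpoint: str) -> str:
    """Return the endpoint without a trailing slash or /v1 suffix."""
    trimmed = endpoint.rstrip("/")
    if trimmed.endswith("/v1"):
        trimmed = trimmed[: -len("/v1")]
    return trimmed


def ensure_v1(endpoint: str) -> str:
    """Return the endpoint with exactly one /v1 suffix."""
    return strip_v1(endpoint) + "/v1"


# Hypothetical endpoint used only for illustration.
example = "http://example.invalid/api/predict/llm_qwen3_8b_test"
print(ensure_v1(example))
print(strip_v1(example + "/v1/"))
```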
Dify
Install the OpenAI-API-compatible model provider
In the upper-right corner of the page, click your profile picture and select Settings. In the left-side navigation pane, click Model Providers. If OpenAI-API-compatible is not in the Model List, find it in the list below and click Install.
Add a model
On the OpenAI-API-compatible card, click Add Model in the lower-right corner and configure the following parameters:
Model type: Select LLM.
Model name: For vLLM deployments, send a GET request to the /v1/models API endpoint to get the model name. For this example, enter Qwen3-8B.
API key: Enter the token of the PAI-EAS service.
API endpoint URL: Enter the public endpoint of the PAI-EAS service. Note: The URL must end with /v1.
Test the service
On the Dify main page, click Create Blank App, select the Chatflow type, enter the application name and other information, and then click Create.
Click the LLM node and select the model you added. To configure the node, select Qwen3-8B CHAT from the OpenAI-API-compatible category. In the SYSTEM prompt, enter the role description and reference the context variable sys.query. In the USER area, add the sys.query and sys.files variables. You can adjust model parameters such as Max Tokens (default is 512), Presence Penalty, and Thought Mode as needed.
Click Preview in the upper-right corner and enter a question.
The bot processes the question through the workflow and returns a reply. The top of the conversation area displays Deeply thought and the processing time.
Chatbox
Go to Chatbox to download and install the client for your device, or click Launch Web App to use the web version. This example uses a Mac with an M3 chip.
Add a model provider. Click Settings, add a model provider, enter a name such as pai, and select OpenAI API Compatible as the API Mode.
Select the pai model provider and configure the following parameters.
API key: Enter the token of the PAI-EAS service.
API host: Enter the public endpoint of the PAI-EAS service. Note: The URL must end with /v1.
API path: Leave this empty.
Model: Click Get to add the model. If the inference engine is BladeLLM, which does not support API retrieval, click New to enter the model name manually.
Test the conversation. Click New Chat and select the model service in the lower-right corner of the text input box.
The top of the conversation interface displays the system prompt, such as "You are a helpful assistant." Enter a question such as "Who are you?" in the text input box and send it. A collapsible Deeply thought area and the processing time are displayed at the top of the assistant's reply area, followed by the model's reply.
Cherry Studio
Install the client
Visit Cherry Studio to download and install the client. You can also download it from https://github.com/CherryHQ/cherry-studio/releases.
Configure the model service
Click the settings icon in the lower-left corner. Under the Model Service section, click Add. For Provider Name, enter a custom name such as PAI, and set the provider type to OpenAI. Click OK.
In the API key field, enter the token of the PAI-EAS service. In the API address field, enter the public endpoint of the PAI-EAS service. Do not include the /v1 suffix in the API address. If your URL ends with #, the client automatically uses the /v1/chat/completions path.
Click Add. In the Model ID field, enter the model name. For vLLM deployments, send a GET request to the /v1/models API endpoint to get the model name. For this example, enter Qwen3-8B. Note that the name is case-sensitive. After the configuration is complete, the Model area automatically identifies and lists available models, such as Qwen3-8B.
Click Test next to the API key input box to verify connectivity.
Quickly test the model
Return to the dialog box, select the model at the top, and start a conversation.
For example, enter Who are you? and send the message. The Qwen3-8B model replies with a self-introduction, confirming that the model is ready.
Billing
Charges include but are not limited to the following. For more information, see Billing of Elastic Algorithm Service (EAS).
Compute fees: This is the primary cost component. When you create a PAI-EAS service, select either the pay-as-you-go or subscription billing method for the resources, depending on your needs.
Storage fees: If you use a custom model, the model files are stored in Object Storage Service (OSS). You are charged for OSS storage based on your storage usage.
Using in production
Choose the right model
Define your use case:
General conversation: Make sure to select an instruction-tuned model, not a base model, to ensure that the model can understand and follow your instructions.
Code generation: Select a specialized code model, such as a model from the Qwen3-Coder series. These models typically outperform general-purpose models on code-related tasks.
Domain-specific tasks: If your task is highly specialized, such as in finance or law, consider finding a model that has been fine-tuned in that domain, or fine-tuning a general-purpose model yourself.
Performance and cost: Models with more parameters are generally more powerful, but they also require more computing resources to deploy and incur higher inference costs. We recommend starting with a smaller model, such as a 7B-parameter model, to validate its performance. If the performance does not meet your requirements, you can then try progressively larger models.
Consult authoritative benchmarks: Refer to industry-recognized benchmarks such as OpenCompass and LMSys Chatbot Arena. They evaluate models across reasoning, coding, math, and more, which can help guide your selection.
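To make the trade-off between model size and resources concrete, the following sketch gives a rough lower bound on the GPU memory needed just to hold the model weights: the parameter count multiplied by the bytes per parameter. This is a back-of-the-envelope estimate only. It ignores the KV cache, activations, and framework overhead, so actual requirements are higher. The function name weight_memory_gib is hypothetical.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Rough memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Illustrative parameter counts; check the model card for the exact value.
for label, params in [("8B", 8e9), ("32B", 32e9)]:
    print(f"{label} parameters in bf16: about {weight_memory_gib(params):.1f} GiB for weights")
```

Choose an instance type whose GPU memory leaves clear headroom above this figure.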
Choose the right inference engine
vLLM/SGLang: As mainstream choices in the open source community, these engines have broad model support and extensive community documentation, making them easy to integrate and troubleshoot.
BladeLLM: This is a proprietary inference engine developed by the Alibaba Cloud PAI team. It is deeply optimized for specific models, especially the Qwen series, and may achieve higher performance and lower GPU memory usage.
Optimize inference
LLM intelligent routing deployment: This feature dynamically distributes requests based on real-time metrics such as token throughput and GPU memory utilization. It balances the computing power and memory allocation across multiple inference instances. This feature is ideal for deployments with multiple inference instances and uneven request loads, as it improves cluster resource utilization and system stability.
Deploy MoE models based on expert parallelism and Prefill-Decode separation: For Mixture-of-Experts (MoE) models, you can use technologies like expert parallelism (EP) and Prefill-Decode (PD) separation to improve inference throughput and reduce deployment costs.
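To illustrate the idea behind load-aware routing, the following toy sketch picks the least-loaded instance from a set of per-instance metrics. It is not the PAI-EAS routing algorithm. The InstanceMetrics fields, the equal weighting, and the max_pending_tokens value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstanceMetrics:
    name: str
    gpu_memory_utilization: float  # fraction between 0.0 and 1.0
    pending_tokens: int  # tokens queued or in flight (hypothetical metric)


def pick_instance(
    instances: list[InstanceMetrics], max_pending_tokens: int = 10_000
) -> InstanceMetrics:
    """Return the instance with the lowest combined load score."""

    def load(m: InstanceMetrics) -> float:
        # Normalize the queue depth to the range 0..1, then weight both signals equally.
        queue_pressure = min(m.pending_tokens / max_pending_tokens, 1.0)
        return 0.5 * m.gpu_memory_utilization + 0.5 * queue_pressure

    return min(instances, key=load)


fleet = [
    InstanceMetrics("instance-a", gpu_memory_utilization=0.9, pending_tokens=8000),
    InstanceMetrics("instance-b", gpu_memory_utilization=0.4, pending_tokens=2000),
    InstanceMetrics("instance-c", gpu_memory_utilization=0.5, pending_tokens=9000),
]
print(pick_instance(fleet).name)
```

A plain round-robin policy would send the next request to whichever instance is next in turn, even if it is nearly saturated. Scoring on live metrics avoids that.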
FAQ
Q: Service stuck in "Pending" state
Follow these steps to troubleshoot the issue:
Check the instance status: On the service list page, click the service name to open the service details page. In the Service Instance section, check the instance status. If the status is Out of Stock, the public resource group has insufficient resources.
Solutions (in order of priority):
Option 1: Change the instance type. Return to the deployment page and select a different GPU model.
Option 2: Use dedicated resources. For Resource Type, select a dedicated resource group. You must create this resource group in advance.
Preventive measures:
We recommend that enterprise users create dedicated resource groups to avoid availability issues in the public resource group.
During peak hours, we recommend testing in multiple regions.
Q: API call errors
An API call returns the error: Unsupported Media Type: Only 'application/json' is allowed
Ensure that the request headers include Content-Type: application/json.
An API call returns the error: The model '<model_name>' does not exist.
The vLLM inference engine requires the model field to be specified correctly. Call the /v1/models endpoint with a GET request to get the model name.
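Both errors can be caught before the request is sent. The following minimal sketch checks the headers and the model field on the client side. The function name preflight_problems is hypothetical, and available_models is assumed to be the list of names returned by the /v1/models endpoint.

```python
def preflight_problems(headers: dict, body: dict, available_models: list[str]) -> list[str]:
    """Return a list of likely problems with a chat request. An empty list means none were found."""
    problems = []

    # Header names are case-insensitive, so normalize them before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    content_type = normalized.get("content-type", "").split(";")[0].strip().lower()
    if content_type != "application/json":
        problems.append("Set the Content-Type header to application/json.")

    # Model names are case-sensitive, so compare them exactly.
    model = body.get("model")
    if model not in available_models:
        problems.append(f"Model {model!r} is not served. Available models: {available_models}.")

    return problems


print(preflight_problems({}, {"model": "qwen3-8b"}, ["Qwen3-8B"]))
```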
For more questions, see EAS FAQ.