Manually deploying DeepSeek models involves complex tasks, including compute environment configuration, model loading, and inference optimization. PAI Model Gallery simplifies this process with a one-click deployment feature. In just a few steps, you can create an OpenAI API-compatible model service and integrate it into your applications.
Solution architecture
Core components
This solution is built on PAI and includes the following core components:
-
Model Gallery: Serves as the entry point for model distribution and deployment. It provides pre-configured DeepSeek models and their corresponding deployment configurations.
-
Elastic Algorithm Service (EAS): The core service that hosts model deployment and inference. It automatically manages underlying compute resources, such as GPUs, and starts model service instances.
-
Inference acceleration engine (SGLang/vLLM/BladeLLM): Optimizes model inference performance.
-
SGLang/vLLM: Provides interfaces that are fully compatible with the OpenAI API, which simplifies migrating existing applications.
-
BladeLLM: A proprietary high-performance inference framework that provides superior inference performance in specific scenarios.
-
-
API gateway: Provides a secure channel to access the model service. It supports calling the model service with a service endpoint and an authentication token.
Deployment method
For large-scale models, in addition to single-node deployment, Model Gallery also offers one-click deployment solutions such as distributed deployment and EP+PD separation.
For example, DeepSeek-V4-Pro-FP8 and DeepSeek-V4-Flash-FP8 can both be deployed by using the EP+PD separation method.
Quick deployment and validation
Step 1: Deploy the model service
-
Log on to the PAI console and select the target region in the upper-left corner. From the navigation pane on the left, go to Workspaces and select the target workspace.
-
In the workspace, go to QuickStart > Model Gallery.
-
In the model list, search for and select the target model, such as DeepSeek-R1-Distill-Qwen-7B, to open the model details page.
-
Click Deploy in the upper-right corner, and then configure the following parameters.
-
Inference Engine: SGLang or vLLM is recommended.
-
Deployment Resource: Choose a public resource or dedicated resource, and select the appropriate GPU specification based on the model's requirements.
-
By default, a public resource is used, and a recommended specification is provided. If no specifications are available, try switching the region.
Important: When you deploy by using a public resource, billing starts as soon as the service instance provisions the resource, with charges based on duration, even if there are no calls. Stop the service promptly after testing.
-
If you select Resource Quota, make sure to choose the corresponding inference engine and deployment template for your instance type. For example, if you use a GP7V instance type, you can select SGLang for the Inference Engine and must select Single-Node-GP7V for the Deployment Template.
-
The Deployment Template is set to Single-Node by default. An example of a recommended GPU specification is ecs.gn7i-c16g1.4xlarge (16 vCPUs, 60 GiB, 1 × NVIDIA A10), which costs approximately CNY 11.1 per hour.
-
-
After confirming that all configurations are correct, click Deploy. The system begins creating the service.
Note: For a large model, such as the full-version DeepSeek-R1, the model loading process might take 20 to 30 minutes.
-
You can view the status of the deployment job on the Model Gallery > Job Management > Deployment Jobs page. Click the service name to open the service details page. You can also click More Information in the upper-right corner to view the model service details page in EAS.
On the service details page, click View Deployment Events next to the Status field for more details. To obtain the service endpoint and token, click View Call Information in the Call Information section.
Step 2: Online debugging
On the Model Gallery > Job Management > Deployment Jobs page, click the name of the deployed service and switch to the Online Debugging tab, which supports both Conversation Debugging and API Debugging.
The official usage recommendations for the DeepSeek-R1 model series are as follows:
-
Set temperature to a value between 0.5 and 0.7 (0.6 is recommended) to prevent repetitive or incoherent output.
-
Do not add a system prompt. Place all instructions in the user prompt.
-
For math-related questions, include "Please reason step by step and put the final answer in a \boxed{}." in the prompt.
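The recommendations above can be applied when a request body is built. The following is a minimal sketch rather than an official helper: the function name `build_r1_request` and its parameters are hypothetical, and the code only assembles the JSON payload without sending it.

```python
MATH_HINT = "Please reason step by step and put the final answer in a \\boxed{}."


def build_r1_request(model, prompt, is_math=False, temperature=0.6, max_tokens=2048):
    """Assemble a chat request body that follows the DeepSeek-R1 usage recommendations."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("temperature should be between 0.5 and 0.7 for DeepSeek-R1 models")
    # Math questions get the recommended reasoning hint appended to the user prompt.
    content = f"{prompt}\n{MATH_HINT}" if is_math else prompt
    # No system message is added: all instructions go in the user prompt.
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


body = build_r1_request("DeepSeek-R1-Distill-Qwen-7B", "What is 3 + 5?", is_math=True)
print(body["messages"][0]["content"])
```

The resulting dictionary can be pasted into the API debugging page as JSON or passed to an HTTP client.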
If only the API debugging page is available, the following steps provide an example using the chat API of a large language model (LLM) service:
-
Confirm the request path: <EAS_ENDPOINT>/v1/chat/completions. In this path, <EAS_ENDPOINT> is the service endpoint, which is usually pre-filled.
For services deployed with SGLang/vLLM, you can retrieve more supported APIs from <EAS_ENDPOINT>/openapi.json.
-
Construct the request body.
If the prompt is "What is 3 + 5?", the request body is formatted as follows.
The value for the model parameter is the model name obtained from the model list API <EAS_ENDPOINT>/v1/models. This example uses DeepSeek-R1-Distill-Qwen-7B.
{
    "model": "DeepSeek-R1-Distill-Qwen-7B",
    "messages": [
        {
            "role": "user",
            "content": "What is 3 + 5?"
        }
    ]
}
-
Send the request.
On the Online Debugging tab, select POST as the HTTP method and enter the service endpoint in the URL field, for example, http://<service_address>/v1/chat/completions. In the Body section, select raw and enter the JSON request body, including the model field (for example, DeepSeek-R1-Distill-Qwen-7B) and the messages field. Click Send Request. If the call succeeds, the Response area on the right displays status code 200, and the body shows the model's response in JSON format.
API call
-
Get the service endpoint and token.
-
On the Model Gallery > Job Management > Deployment Jobs page, click the name of the deployed service to go to the service details page.
-
Click View Call Information to get the service endpoint and token.
-
-
The following example shows how to make a chat API call.
Replace <EAS_ENDPOINT> with the service endpoint and <EAS_TOKEN> with the service token.
OpenAI SDK
Note:
-
Append /v1 to the end of the endpoint.
-
BladeLLM-accelerated deployments do not support client.models.list(). As a workaround, set the model parameter to an empty string ("").
SGLang/vLLM
from openai import OpenAI

# 1. Configure the client.
# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your actual service endpoint and token.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# 2. Get the model name.
try:
    model = client.models.list().data[0].id
    print(model)
except Exception as e:
    # Stop here: without a model name, the request below cannot be sent.
    raise SystemExit(f"Failed to get the model list. Check the endpoint and token. Error: {e}")

# 3. Construct and send the request.
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    model=model,
    max_tokens=2048,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        # Some chunks carry no text (content is None), so skip them.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)
BladeLLM
from openai import OpenAI

##### API Configuration #####
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# BladeLLM accelerated deployment does not currently support using client.models.list() to get the model name.
# You can set the model value to "" for compatibility.
model = ""
stream = True

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    model=model,
    max_tokens=2048,
    stream=stream,
)

if stream:
    for chunk in chat_completion:
        # Some chunks carry no text (content is None), so skip them.
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)
HTTP
SGLang/vLLM
Replace <model_name> with the model name obtained from the model list API <EAS_ENDPOINT>/v1/models.
curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
            {
                "role": "user",
                "content": "hello!"
            }
        ]
    }' \
    <EAS_ENDPOINT>/v1/chat/completions

import json
import requests

# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
EAS_ENDPOINT = "<EAS_ENDPOINT>"
EAS_TOKEN = "<EAS_TOKEN>"

url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

# Replace <model_name> with the model name obtained from the model list API <EAS_ENDPOINT>/v1/models.
model = "<model_name>"
stream = True
messages = [
    {"role": "user", "content": "Hello, please introduce yourself."},
]

req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.6,
    "top_p": 0.5,
    "top_k": 10,
    "max_tokens": 300,
    "model": model,
}

response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)

if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        if msg.startswith("data"):
            info = msg[6:]  # Strip the "data: " prefix.
            if info == "[DONE]":
                break
            resp = json.loads(info)
            # Some chunks carry no text, so print only when content is present.
            content = resp["choices"][0]["delta"].get("content")
            if content is not None:
                print(content, end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])
BladeLLM
curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "messages": [
            {
                "role": "user",
                "content": "hello!"
            }
        ]
    }' \
    <EAS_ENDPOINT>/v1/chat/completions

import json
import requests

# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
EAS_ENDPOINT = "<EAS_ENDPOINT>"
EAS_TOKEN = "<EAS_TOKEN>"

url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}

stream = True
messages = [
    {"role": "user", "content": "Hello, please introduce yourself."},
]

# When you use BladeLLM for accelerated deployment, if you do not specify the max_tokens parameter,
# the output is truncated to 16 tokens by default. We recommend that you adjust max_tokens as needed.
req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.6,
    "top_p": 0.5,
    "top_k": 10,
    "max_tokens": 300,
}

response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)

if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        if msg.startswith("data"):
            info = msg[6:]  # Strip the "data: " prefix.
            if info == "[DONE]":
                break
            resp = json.loads(info)
            if resp["choices"][0]["delta"].get("content") is not None:
                print(resp["choices"][0]["delta"]["content"], end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])
-
-
Because different models and deployment frameworks have distinct behaviors, you should consult the detailed API call instructions on the model's introduction page in Model Gallery.
For example, the introduction page for the DeepSeek-R1-Distill-Qwen-7B model in Model Gallery specifies the resource requirements for BladeLLM accelerated deployment. This deployment requires 24 GB of GPU memory and supports only ecs.gn7 and later instance families. The page also states that the model is compatible with OpenAI's v1/completions and v1/chat/completions endpoints and provides call examples using curl commands and Python scripts.
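The streaming loops in the HTTP examples above can be factored into a small parsing function that is easy to test offline. This is a sketch under the assumption that the service streams OpenAI-style server-sent events ("data: {json}" lines that end with "data: [DONE]"); the function name `extract_delta` and the sample lines are hypothetical.

```python
import json


def extract_delta(line):
    """Return the text carried by one streamed line, or None if the line has none."""
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    if not line.startswith("data:"):
        return None  # Blank separator lines and anything that is not a data event.
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # End-of-stream marker.
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    # Some chunks (for example, the first one) carry a role but no content.
    return (choices[0].get("delta") or {}).get("content")


# Hand-written sample lines that stand in for response.iter_lines().
sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    b'data: {"choices": [{"delta": {"content": "!"}}]}',
    b"data: [DONE]",
]
text = "".join(piece for piece in map(extract_delta, sample) if piece)
print(text)
```

Returning None for chunks without text keeps the calling loop free of per-engine special cases.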
Third-party integration
To connect to Chatbox, Dify, or Cherry Studio, see Integrate third-party clients.
Local Web UI
Resource cleanup
For instances deployed using public resources, billing is based on the duration of use and starts as soon as resources are provisioned. Usage of less than one hour is charged by the minute.
-
Go to the Job Management > Deployment Jobs page.
-
Find the service that you want to stop and click Stop or Delete in the Actions column.
-
Stop: The service instance is released, and billing stops. The service configuration is retained, and you can restart the service later.
-
Delete: Both the service configuration and the instance are permanently deleted.
-
Model and resource selection
Your choice of model determines the required compute resources and deployment costs. DeepSeek models are available as "full-version" and "distilled version" variants, which have vastly different resource requirements.
-
Development and testing: We recommend that you use a distilled version model, such as DeepSeek-R1-Distill-Qwen-7B. These models have a smaller resource footprint (typically a single GPU with 24 GB of GPU memory), deploy quickly, and are cost-effective, which makes them ideal for rapid feature validation.
-
Production environment: Evaluate based on a balance of performance and cost. The DeepSeek-R1-Distill-Qwen-32B model strikes a good balance between effectiveness and cost. If you require higher model performance, choose a full-version model. This requires multiple high-end GPUs, such as eight GPUs with 96 GB of GPU memory each, which significantly increases costs.
The following tables list the minimum configuration for each model version and the maximum number of tokens supported by each instance type and inference engine.
Full-version models
| Model | Deployment method | Max tokens (SGLang) | Max tokens (vLLM) | Minimum configuration |
| --- | --- | --- | --- | --- |
| DeepSeek-V4-Pro | Single-node/Distributed | 1M | 1M | Single-node with 8 × H20-3e (8 × 141 GB of GPU memory) |
| DeepSeek-V4-Flash | Single-node | 1M | 1M | Single-node with 8 × H20-3e (8 × 141 GB of GPU memory) |
| DeepSeek-V4-Pro-FP8 | Single-node/PD separation | 1M | / | Single-node with 8 × H20-3e (8 × 141 GB of GPU memory) |
| DeepSeek-V4-Flash-FP8 | Single-node/PD separation | 1M | / | Single-node with 4 × H20-3e (4 × 141 GB of GPU memory) |
| DeepSeek-V3 | Single-node - NVIDIA GPU | 56,000 | 65,536 | Single-node 8 × GU120 (8 × 96 GB of GPU memory) |
| | Single-node - GP7V instance type | 56,000 | 16,384 | |
| | Distributed - PAI Lingjun Intelligent Computing Service | 163,840 | 163,840 | |
| DeepSeek-R1 | Single-node - NVIDIA GPU | 56,000 | 65,536 | Single-node 8 × GU120 (8 × 96 GB of GPU memory) |
| | Single-node - GP7V instance type | 56,000 | 16,384 | |
| | Distributed - PAI Lingjun Intelligent Computing Service | 163,840 | 163,840 | |
Single-node deployment instance type notes:
-
NVIDIA GPU:
-
ml.gu8v.c192m1024.8-gu120 and ecs.gn8v-8x.48xlarge are available as public resources, but their inventory might be limited.
-
ecs.ebmgn8v.48xlarge cannot be used as a public resource. You must purchase dedicated EAS resources.
-
-
GP7V instance type: ml.gp7vf.16.40xlarge is a public resource and can only be used as a preemptible instance. If NVIDIA GPU resources are scarce, switch to the China (Ulanqab) region to find GP7V instance type resources. You must configure a VPC when deploying.
Distributed deployment instance type notes (recommended when high performance is required):
Distributed deployment relies on high-speed networking and must use PAI Lingjun Intelligent Computing Service, which provides high-performance, elastic heterogeneous computing power. You must also configure a VPC during deployment. To use PAI Lingjun Intelligent Computing Service, switch the region to China (Ulanqab).
-
Lingjun public resources:
-
ml.gu7xf.8xlarge-gu108: Requires four machines for a single-instance deployment and can be used only as a preemptible instance.
-
GP7V instance type: Requires two machines for a single-instance deployment.
-
-
Lingjun prepaid resources: You must be added to a whitelist to use these resources. Contact your sales manager or submit a ticket for consultation.
Distilled version models
| Model | Max tokens (SGLang) | Max tokens (vLLM) | Max tokens (BladeLLM) | Minimum configuration |
| --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | 131,072 | 131,072 | 131,072 | 1 × A10 GPU (24 GB of GPU memory) |
| DeepSeek-R1-Distill-Qwen-7B | 131,072 | 32,768 | 131,072 | 1 × A10 GPU (24 GB of GPU memory) |
| DeepSeek-R1-Distill-Llama-8B | 131,072 | 32,768 | 131,072 | 1 × A10 GPU (24 GB of GPU memory) |
| DeepSeek-R1-Distill-Qwen-14B | 131,072 | 32,768 | 131,072 | 1 × GPU L (48 GB of GPU memory) |
| DeepSeek-R1-Distill-Qwen-32B | 131,072 | 32,768 | 131,072 | 2 × GPU L (2 × 48 GB of GPU memory) |
| DeepSeek-R1-Distill-Llama-70B | 131,072 | 32,768 | 131,072 | 2 × GU120 (2 × 96 GB of GPU memory) |
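To shortlist distilled models for a given GPU budget, the minimum configurations above can be encoded as data. The sketch below is a rough, conservative filter rather than a sizing tool: the name `models_that_fit` is hypothetical, and it assumes a model fits only when both the GPU count and the memory per GPU meet the listed minimum.

```python
# (number of GPUs, memory per GPU in GB), copied from the distilled-model minimum configurations.
MIN_CONFIG = {
    "DeepSeek-R1-Distill-Qwen-1.5B": (1, 24),
    "DeepSeek-R1-Distill-Qwen-7B": (1, 24),
    "DeepSeek-R1-Distill-Llama-8B": (1, 24),
    "DeepSeek-R1-Distill-Qwen-14B": (1, 48),
    "DeepSeek-R1-Distill-Qwen-32B": (2, 48),
    "DeepSeek-R1-Distill-Llama-70B": (2, 96),
}


def models_that_fit(gpu_count, gpu_memory_gb):
    """List the distilled models whose minimum configuration is met by the given hardware."""
    return [
        name
        for name, (min_gpus, min_memory_gb) in MIN_CONFIG.items()
        if gpu_count >= min_gpus and gpu_memory_gb >= min_memory_gb
    ]


print(models_that_fit(1, 24))  # one A10-class GPU
print(models_that_fit(2, 48))  # two 48 GB GPUs
```

Always confirm the final choice against the specifications that the deployment page recommends, because longer context lengths need more memory than the minimum.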
PAI-optimized models
Model Gallery provides one-click deployment for the following PAI-optimized DeepSeek-related models:
-
DeepSeek-R1-PAI-optimized
-
DeepSeek-R1-0528-PAI-optimized
-
DeepSeek-V3-0324-PAI-optimized
Costs and risks
Cost breakdown
For services that use public resources, billing is calculated by the minute, starting from when an instance is provisioned until it is stopped or deleted. Bills are settled hourly. Charges accrue even when the service is idle. Stopping the service halts billing.
For more information, see Billing of Elastic Algorithm Service (EAS).
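To see how quickly idle time adds up, the per-minute billing rule can be turned into a rough estimate. This is an illustrative sketch only: the function name `estimate_cost_cny` is hypothetical, the proration per started minute is an assumption, and the actual bill follows the EAS billing documentation.

```python
import math


def estimate_cost_cny(hourly_price_cny, running_seconds):
    """Estimate charges, assuming the hourly price is prorated per started minute."""
    minutes = math.ceil(running_seconds / 60)
    return round(hourly_price_cny * minutes / 60, 2)


# Example price from the deployment step: about CNY 11.1 per hour for a single A10 instance.
PRICE = 11.1
print("45 minutes of testing:", estimate_cost_cny(PRICE, 45 * 60))
print("left running for 3 days:", estimate_cost_cny(PRICE, 72 * 3600))
```

Comparing the two figures shows why stopping the service right after testing matters more than the choice of test duration.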
Cost control
-
Clean up promptly: After development and testing, immediately stop or delete the service to effectively control costs.
-
Use trial resources: If you are using EAS for the first time, you can go to Alibaba Cloud Free Tier to claim PAI-EAS trial resources. After you claim the resources, you can choose to deploy a model with a minimum configuration of A10, such as DeepSeek-R1-Distill-Qwen-7B, and modify the resource specification to the instance type provided in the trial event during deployment.
-
Select an appropriate model: In non-production environments, prioritize lower-cost distilled version models.
-
Use preemptible instances: For non-production tasks, you can enable the preemptible mode during deployment. Note that certain conditions must be met for a successful bid, and there is a risk of resource instability.
-
Long-term usage discounts: For long-running production services, you can reduce costs by purchasing a savings plan or prepaid resources.
Key risks
-
Unexpected costs: Forgetting to stop the service results in continuous billing. Always clean up resources immediately after use.
-
BladeLLM output truncation: When you use the BladeLLM engine, if the max_tokens parameter is not specified in the API request, the output is truncated to 16 tokens, so responses appear cut off. Set max_tokens explicitly in every request.
-
Incorrect API usage:
-
When you call a DeepSeek-R1 series model, including a system prompt in the messages might cause unexpected behavior.
-
The API request URL must end with a path such as /v1/chat/completions. Otherwise, a 404 error is returned.
-
-
Resource inventory: Limited inventory of high-end GPU resources in a specific region can lead to deployment failures or long waiting times. You can try switching to another region.
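Several of the API pitfalls above can be caught before a request is sent. The following is a minimal sketch, assuming a chat request: the function `check_request` and its parameters are hypothetical, and it checks only the chat path, the BladeLLM max_tokens default, and the presence of a system prompt.

```python
def check_request(url, body, engine="sglang", r1_series=True):
    """Return a list of warnings for common request mistakes (empty if none are found)."""
    warnings = []
    if not url.rstrip("/").endswith("/v1/chat/completions"):
        warnings.append("URL does not end with /v1/chat/completions; expect a 404 error")
    if engine.lower() == "bladellm" and "max_tokens" not in body:
        warnings.append("max_tokens is not set; BladeLLM truncates the output to 16 tokens")
    roles = [message.get("role") for message in body.get("messages", [])]
    if r1_series and "system" in roles:
        warnings.append("system prompt found; put instructions in the user prompt for DeepSeek-R1 models")
    return warnings


bad_body = {"messages": [{"role": "system", "content": "Be brief."}, {"role": "user", "content": "Hi"}]}
for warning in check_request("http://<service_address>", bad_body, engine="BladeLLM"):
    print("WARNING:", warning)
```

Such a check is cheap to run in tests or at client start-up and does not replace reading the model's introduction page.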
Model deployment FAQ
Choosing an inference engine
-
Recommended: SGLang. It delivers high performance while being fully compatible with the OpenAI API standard, which makes it a great fit for the mainstream application ecosystem. In most scenarios, it supports a longer maximum context length than vLLM.
-
Alternative: vLLM. As a popular framework in the industry, it also offers excellent API compatibility.
-
Specific scenarios: BladeLLM. Use BladeLLM, a high-performance inference framework developed in-house by Alibaba Cloud PAI, only if you require higher inference performance and can accept API differences from the OpenAI standard, such as the lack of support for client.models.list() and the default truncation of output to 16 tokens when the max_tokens parameter is not specified.
Service stuck waiting
Possible reasons:
-
Insufficient machine resources in the current region.
-
The model is large, and model loading takes a long time. For large models such as DeepSeek-R1 and DeepSeek-V3, this can take 20 to 30 minutes.
You can wait for a period of time. If the service still fails to start after an extended period, we recommend that you try the following steps:
-
Go to the Job Management > Deployment Jobs page to view the deployment job details. In the upper-right corner, click More Information to go to the PAI-EAS model service details page and check the service instance status.
If the Instance Status column shows Insufficient Inventory, it means the instance cannot be scheduled due to a lack of resources.
-
Stop the current service, and then switch to another region in the upper-left corner of the console to redeploy the service.
Note: Very large models such as DeepSeek-R1 and DeepSeek-V3 require 8 GPUs to start the service, and inventory for these resources is tight. Alternatively, deploy a smaller distilled model such as DeepSeek-R1-Distill-Qwen-7B, for which inventory is abundant.
Model call FAQ
API call returns 404
Check whether an OpenAI API suffix, such as v1/chat/completions, has been added to the URL. For more information, refer to the API call instructions on the model's introduction page.
If you are using a vLLM-accelerated deployment, check that the model parameter in the request body of the conversation API is set to the correct model name. You can obtain the model name from v1/models.
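A model name can be validated against the model list before the chat API is called. The sketch below assumes that <EAS_ENDPOINT>/v1/models returns the OpenAI list format (a "data" array whose items have an "id"); the function name `resolve_model_name` and the sample response are hypothetical.

```python
def resolve_model_name(models_response, requested=None):
    """Pick a valid model name from a /v1/models response in the OpenAI list format."""
    available = [item["id"] for item in models_response.get("data", [])]
    if not available:
        raise ValueError("the service reported no models")
    if requested is None:
        return available[0]  # Default to the first served model.
    if requested not in available:
        raise ValueError(f"{requested!r} is not served; available models: {available}")
    return requested


# Hand-written sample; fetch the real response from <EAS_ENDPOINT>/v1/models.
sample = {"object": "list", "data": [{"id": "DeepSeek-R1-Distill-Qwen-7B", "object": "model"}]}
print(resolve_model_name(sample))
```

Failing early with the list of served names is usually easier to debug than a 404 returned by the chat endpoint.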
Request timeout
The default gateway request timeout is 180 seconds. To extend it, configure a Dedicated Gateway and submit a ticket to adjust the timeout. The maximum is 600 seconds.
No "web search" feature
The "web search" feature is not enabled by deploying the model service alone; it requires building a separate AI application (Agent) on top of the service.
Use LangStudio, PAI's large model application development platform, to build a web search AI application. For more information, see Build a DeepSeek-based web search application flow using LangStudio and Alibaba Cloud Information Query Service.
Model skips thinking
If the DeepSeek-R1 model sometimes skips the thinking process, use the updated chat template from DeepSeek that forces thinking:
-
Modify the startup command.
In the service configuration, edit the JSON configuration. Modify the containers-script field to add --chat-template /model_dir/template_force_thinking.jinja, which can be added after --served-model-name DeepSeek-R1.
For an already deployed service, go to Model Gallery > Job Management > Deployment Jobs, click the deployed service name, and then click Update service in the upper-right corner to go to the configuration page.
-
Modify the request body. In each request, add {"role": "assistant", "content": "<think>\n"} as the last item in the messages array.
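The request-body change in step 2 can be applied by a small wrapper so that callers do not have to remember it. This is a sketch with a hypothetical helper name, `with_forced_thinking`; it only edits the message list, and whether the model continues from the trailing assistant message depends on the chat template configured in step 1.

```python
FORCE_THINK = {"role": "assistant", "content": "<think>\n"}


def with_forced_thinking(messages):
    """Return a copy of messages that ends with the assistant '<think>' prefix."""
    if messages and messages[-1] == FORCE_THINK:
        return list(messages)  # Already present, so do not add it twice.
    return list(messages) + [dict(FORCE_THINK)]


history = [{"role": "user", "content": "What is 3 + 5?"}]
request_messages = with_forced_thinking(history)
print(request_messages[-1])
```

The original list is left unchanged, so the stored conversation history does not accumulate the prefix.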
Disabling thinking mode
DeepSeek-R1 series models do not support disabling the thinking process.
Multi-turn conversations
The model service does not save conversation history. The client application must store the history and include it in subsequent requests. The following example shows a multi-turn conversation with a service deployed by using SGLang.
curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
            {
                "role": "user",
                "content": "Hello"
            },
            {
                "role": "assistant",
                "content": "Hello! I'\''m glad to see you. What can I help you with?"
            },
            {
                "role": "user",
                "content": "What was my previous question?"
            }
        ]
    }' \
    <EAS_ENDPOINT>/v1/chat/completions
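On the client side, the same pattern amounts to appending every user turn and every reply to one list and resending that list. The following is a minimal sketch: the `Conversation` class is hypothetical, and `fake_send` stands in for a real call to <EAS_ENDPOINT>/v1/chat/completions so that the example runs without a deployed service.

```python
class Conversation:
    """Keep the chat history on the client and resend it with every request."""

    def __init__(self, send):
        # send: a callable that takes the full message list and returns the reply text.
        self.send = send
        self.messages = []

    def ask(self, user_text):
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send(list(self.messages))
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages):
    # Stand-in for the HTTP call; it reports how many messages the service would receive.
    return f"(reply to a request with {len(messages)} messages)"


chat = Conversation(fake_send)
chat.ask("Hello")
print(chat.ask("What was my previous question?"))
print([message["role"] for message in chat.messages])
```

In a real client, `send` would post the messages together with the model name and return the content of the first choice.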