Deploy DeepSeek-V4 and DeepSeek-R1

Manually deploying DeepSeek models involves complex tasks, including compute environment configuration, model loading, and inference optimization. PAI Model Gallery simplifies this process with a one-click deployment feature. In just a few steps, you can create an OpenAI API-compatible model service and integrate it into your applications.

Solution architecture

Core components

This solution is built on PAI and includes the following core components:

Model Gallery: Serves as the entry point for model distribution and deployment. It provides pre-configured DeepSeek models and their corresponding deployment configurations.
Elastic Algorithm Service (EAS): The core service that hosts model deployment and inference. It automatically manages underlying compute resources, such as GPUs, and starts model service instances.
Inference acceleration engine (SGLang/vLLM/BladeLLM): Optimizes model inference performance.
- SGLang/vLLM: Provides interfaces that are fully compatible with the OpenAI API, which simplifies migrating existing applications.
- BladeLLM: A proprietary high-performance inference framework that provides superior inference performance in specific scenarios.
API gateway: Provides a secure channel to access the model service. It supports calling the model service with a service endpoint and an authentication token.

Deployment method

For large-scale models, Model Gallery offers one-click options beyond single-node deployment, including distributed deployment and EP+PD separation (expert parallelism combined with prefill-decode separation).

For example, DeepSeek-V4-Pro-FP8 and DeepSeek-V4-Flash-FP8 can both be deployed by using the EP+PD separation method.

Quick deployment and validation

Step 1: Deploy the model service

Log on to the PAI console and select the target region in the upper-left corner. From the navigation pane on the left, go to Workspaces and select the target workspace.
In the workspace, go to QuickStart > Model Gallery.
In the model list, search for and select the target model, such as DeepSeek-R1-Distill-Qwen-7B, to open the model details page.
Click Deploy in the upper-right corner, and then configure the following parameters.
- Inference Engine: SGLang or vLLM is recommended.
- Deployment Resource: Choose a public resource or dedicated resource, and select the appropriate GPU specification based on the model's requirements.
  - By default, a public resource is used, and a recommended specification is provided. If no specifications are available, try switching the region.
    Important
    When you deploy by using a public resource, billing starts as soon as the service instance provisions the resource, with charges based on duration, even if there are no calls. Stop the service promptly after testing.
  - If you select Resource Quota, make sure to choose the corresponding inference engine and deployment template for your instance type. For example, if you use a GP7V instance type, you can select SGLang for the Inference Engine and must select Single-Node-GP7V for the Deployment Template.
The Deployment Template is set to Single-Node by default. An example of a recommended GPU specification is ecs.gn7i-c16g1.4xlarge (16 vCPUs, 60 GiB, 1 × NVIDIA A10), which costs approximately CNY 11.1 per hour.
After confirming that all configurations are correct, click Deploy. The system begins creating the service.
Note
For a large model, such as the full-version DeepSeek-R1, the model loading process might take 20 to 30 minutes.
You can view the status of the deployment job on the Model Gallery > Job Management > Deployment Jobs page. Click the service name to open the service details page. You can also click More > More Information in the upper-right corner to view the model service details page in EAS.
On the service details page, click View Deployment Events next to the Status field for more details. To obtain the service endpoint and token, click View Call Information in the Call Information section.

Step 2: Online debugging

On the Model Gallery > Job Management > Deployment Jobs page, click the name of the deployed service and switch to the Online Debugging tab, which supports both Conversation Debugging and API Debugging.

Note

The official usage recommendations for the DeepSeek-R1 model series are as follows:

Set temperature to a value between 0.5 and 0.7 (0.6 is recommended) to prevent repetitive or incoherent output.
Do not add a system prompt. Place all instructions in the user prompt.
For math-related questions, include "Please reason step by step and put the final answer in a \boxed{}." in the prompt.
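As a minimal sketch of how these recommendations might be applied in client code, the helper below builds a chat request body that uses no system message, keeps the temperature in the recommended range, and optionally appends the math hint. The function name `build_r1_request` and its parameters are illustrative assumptions and are not part of any PAI or DeepSeek SDK.

```python
# Hypothetical helper: applies the DeepSeek-R1 usage recommendations listed above.
MATH_HINT = "Please reason step by step and put the final answer in a \\boxed{}."


def build_r1_request(model, user_prompt, instructions="", is_math=False, temperature=0.6):
    """Return a chat request body with no system message and a checked temperature."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("temperature should be between 0.5 and 0.7 for DeepSeek-R1")
    # Instructions that would normally go into a system prompt are merged into the user prompt.
    parts = [p for p in (instructions, user_prompt) if p]
    if is_math:
        parts.append(MATH_HINT)
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": "\n\n".join(parts)}],
    }


body = build_r1_request("DeepSeek-R1-Distill-Qwen-7B", "What is 3 + 5?", is_math=True)
print(body["messages"][0]["content"])
```

The resulting dictionary can be sent as the JSON body of a chat request to a service deployed with SGLang or vLLM.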

If only the API debugging page is available, the following steps provide an example using the chat API of a large language model (LLM) service:

Confirm the request path: <EAS_ENDPOINT>/v1/chat/completions. In this path, <EAS_ENDPOINT> is the service endpoint, which is usually pre-filled. Its standard format is as follows:
http://<uid>.<region-id>.pai-eas.aliyuncs.com/api/predict/<service-name>.
For services deployed with SGLang/vLLM, you can retrieve more supported APIs from <EAS_ENDPOINT>/openapi.json.
Construct the request body.
If the prompt is "What is 3 + 5?", the request body is formatted as follows.
The value for the model parameter is the model name obtained from the model list API <EAS_ENDPOINT>/v1/models. This example uses DeepSeek-R1-Distill-Qwen-7B.
```
{
    "model": "DeepSeek-R1-Distill-Qwen-7B",
    "messages": [
        {
            "role": "user",
            "content": "What is 3 + 5?"
        }
    ]
}
```
Send the request.
On the Online Debugging tab, select POST as the HTTP method and enter the service endpoint in the URL field, for example, http://<service_address>/v1/chat/completions. In the Body section, select raw and enter the JSON request body, including the model, such as DeepSeek-R1-Distill-Qwen-7B, and messages fields. Click Send Request. The Response area on the right displays a status code of 200, and the model's response is shown in the body in JSON format.
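The same steps can be sketched in Python without sending anything over the network. The function names below are illustrative assumptions, and `sample_models` is a made-up stand-in for a /v1/models response in the OpenAI list format.

```python
import json


def chat_completions_url(endpoint):
    """Join the service endpoint and the chat path without doubling slashes."""
    return endpoint.rstrip("/") + "/v1/chat/completions"


def first_model_id(models_response):
    """Pick the first model ID from a /v1/models style response."""
    data = models_response.get("data", [])
    if not data:
        raise ValueError("the model list is empty")
    return data[0]["id"]


def chat_body(model, prompt):
    """Build the JSON request body shown in the steps above."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


# Illustrative values only; replace them with your own endpoint and model list.
sample_models = {"object": "list", "data": [{"id": "DeepSeek-R1-Distill-Qwen-7B"}]}
url = chat_completions_url(
    "http://<uid>.<region-id>.pai-eas.aliyuncs.com/api/predict/<service-name>/"
)
payload = json.dumps(chat_body(first_model_id(sample_models), "What is 3 + 5?"), indent=4)
print(url)
print(payload)
```

Stripping the trailing slash before appending the path avoids a malformed URL when the endpoint is copied with a trailing slash.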

BladeLLM API request sample

Important

When using the BladeLLM accelerated deployment method, if you do not specify the max_tokens parameter, the output is truncated to 16 tokens by default. Adjust the max_tokens request parameter based on your needs.

{
    "messages": [
        {
            "role": "user",
            "content": "What is 3 + 5?"
        }
    ],
    "max_tokens": 2000
}

API call

Get the service endpoint and token.
1. On the Model Gallery > Job Management > Deployment Jobs page, click the name of the deployed service to go to the service details page.
2. Click View Call Information to get the service endpoint and token.

The following example shows how to make a chat API call.

Replace <EAS_ENDPOINT> with the service endpoint and <EAS_TOKEN> with the service token.

OpenAI SDK

Note:

Append /v1 to the end of the endpoint.
BladeLLM-accelerated deployments do not support client.models.list(). As a workaround, set the model parameter to an empty string ("").

SGLang/vLLM

from openai import OpenAI
# 1. Configure the client
# Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your actual service endpoint and token.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
# 2. Get the model name
try:
    model = client.models.list().data[0].id
    print(model)
except Exception as e:
    # Stop here: without a model name, the request below cannot be built.
    raise SystemExit(f"Failed to get the model list. Check the endpoint and token. Error: {e}")
# 3. Construct and send the request
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    model=model,
    max_tokens=2048,
    stream=stream,
)
if stream:
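    for chunk in chat_completion:
        # Skip chunks that carry no text (delta.content can be None).
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")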
else:
    result = chat_completion.choices[0].message.content
    print(result)

BladeLLM

from openai import OpenAI
##### API Configuration #####
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
openai_api_key = "<EAS_TOKEN>"
openai_api_base = "<EAS_ENDPOINT>/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
# BladeLLM accelerated deployment does not currently support using client.models.list() to get the model name. You can set the model value to "" for compatibility.
model=""
stream = True
chat_completion = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    model=model,
    max_tokens=2048,
    stream=stream,
)
if stream:
    for chunk in chat_completion:
        # Skip chunks that carry no text (delta.content can be None).
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
else:
    result = chat_completion.choices[0].message.content
    print(result)

HTTP

SGLang/vLLM

Replace <model_name> with the model name obtained from the model list API <EAS_ENDPOINT>/v1/models.

curl

curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
        {
            "role": "user",
            "content": "hello!"
        }
        ]
    }' \
    <EAS_ENDPOINT>/v1/chat/completions

Python

import json
import requests
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
EAS_ENDPOINT = "<EAS_ENDPOINT>"
EAS_TOKEN = "<EAS_TOKEN>"
url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}
# Replace <model_name> with the model name obtained from the model list API <EAS_ENDPOINT>/v1/models.
model = "<model_name>"
stream = True
messages = [
    {"role": "user", "content": "Hello, please introduce yourself."},
]
req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.6,
    "top_p": 0.5,
    "top_k": 10,
    "max_tokens": 300,
    "model": model,
}
response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)
if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        if msg.startswith("data:"):
            info = msg[len("data:"):].strip()
            if info == "[DONE]":
                break
            resp = json.loads(info)
            # The first chunk may carry only the role, so "content" can be missing or null.
            content = resp["choices"][0]["delta"].get("content")
            if content:
                print(content, end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])

BladeLLM

curl

curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "messages": [
        {
            "role": "user",
            "content": "hello!"
        }
        ]
    }' \
    <EAS_ENDPOINT>/v1/chat/completions

Python

import json
import requests
# Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
EAS_ENDPOINT = "<EAS_ENDPOINT>"
EAS_TOKEN = "<EAS_TOKEN>"
url = f"{EAS_ENDPOINT}/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": EAS_TOKEN,
}
stream = True
messages = [
    {"role": "user", "content": "Hello, please introduce yourself."},
]
# When you use BladeLLM for accelerated deployment, if you do not specify the max_tokens parameter, the output is truncated to 16 tokens by default. We recommend that you adjust the max_tokens request parameter as needed.
req = {
    "messages": messages,
    "stream": stream,
    "temperature": 0.6,
    "top_p": 0.5,
    "top_k": 10,
    "max_tokens": 300,
}
response = requests.post(
    url,
    json=req,
    headers=headers,
    stream=stream,
)
if stream:
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
        msg = chunk.decode("utf-8")
        if msg.startswith("data:"):
            info = msg[len("data:"):].strip()
            if info == "[DONE]":
                break
            resp = json.loads(info)
            # Some chunks carry no text, so "content" can be missing or null.
            content = resp["choices"][0]["delta"].get("content")
            if content:
                print(content, end="", flush=True)
else:
    resp = json.loads(response.text)
    print(resp["choices"][0]["message"]["content"])
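The streaming loops above read lines that start with `data:` and stop at `[DONE]`. As a sketch, that parsing can be factored into one small function and exercised offline. The name `extract_delta` and the sample lines are assumptions for illustration; they are not captured from a real service.

```python
import json


def extract_delta(line):
    """Return the text carried by one streamed line, or None if there is none.

    `line` is a raw bytes line, as yielded by response.iter_lines().
    """
    msg = line.decode("utf-8").strip()
    if not msg.startswith("data:"):
        return None  # blank lines carry no payload
    payload = msg[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream marker
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    # "delta" may hold only a role, or a null content value.
    return (choices[0].get("delta") or {}).get("content")


# Made-up sample lines in the same shape the loops above expect.
sample_stream = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
text = "".join(piece for piece in map(extract_delta, sample_stream) if piece)
print(text)
```

Keeping the parsing in a pure function makes it possible to test the handling of role-only chunks and the end marker without a deployed service.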

Because different models and deployment frameworks have distinct behaviors, you should consult the detailed API call instructions on the model's introduction page in Model Gallery.
For example, the introduction page for the DeepSeek-R1-Distill-Qwen-7B model in Model Gallery specifies the resource requirements for BladeLLM accelerated deployment. This deployment requires 24 GB of GPU memory and supports only ecs.gn7 and later instance families. The page also states that the model is compatible with OpenAI's v1/completions and v1/chat/completions endpoints and provides call examples using curl commands and Python scripts.

Third-party integration

To connect to Chatbox, Dify, or Cherry Studio, see Integrate third-party clients.

Local Web UI

See Build a local Web UI by using Gradio.

Resource cleanup

For instances deployed using public resources, billing is based on the duration of use and starts as soon as resources are provisioned. Usage of less than one hour is charged by the minute.

Go to the Job Management > Deployment Jobs page.
Find the service that you want to stop and click Stop or Delete in the Actions column.
- Stop: The service instance is released, and billing stops. The service configuration is retained, and you can restart the service later.
- Delete: Both the service configuration and the instance are permanently deleted.

Model and resource selection

Your choice of model determines the required compute resources and deployment costs. DeepSeek models are available as "full-version" and "distilled version" variants, which have vastly different resource requirements.

Development and testing: We recommend that you use a distilled version model, such as DeepSeek-R1-Distill-Qwen-7B. These models have a smaller resource footprint, typically a single GPU with 24 GB of GPU memory, deploy quickly, and are cost-effective, which makes them ideal for rapid feature validation.
Production environment: Evaluate based on a balance of performance and cost. The DeepSeek-R1-Distill-Qwen-32B model strikes a good balance between effectiveness and cost. If you require higher model performance, choose a full-version model. This requires multiple high-end GPUs, such as eight GPUs with 96 GB of GPU memory each, which significantly increases costs.

The following table lists the minimum configurations for different model versions and the maximum number of tokens supported by different instance types and inference engines.

Full-version models

Model	Deployment method	Max tokens (SGLang)	Max tokens (vLLM)	Minimum configuration
DeepSeek-V4-Pro	Single-node/Distributed	1M	1M	Single-node with 8 × H20-3e (8 × 141 GB of GPU memory)
DeepSeek-V4-Flash	Single-node	1M	1M	Single-node with 8 × H20-3e (8 × 141 GB of GPU memory)
DeepSeek-V4-Pro-FP8	Single-node/PD separation	1M	/	Single-node with 8 × H20-3e (8 × 141 GB of GPU memory)
DeepSeek-V4-Flash-FP8	Single-node/PD separation	1M	/	Single-node with 4 × H20-3e (4 × 141 GB of GPU memory)
DeepSeek-V3	Single-node - NVIDIA GPU	56,000	65,536	Single-node 8 × GU120 (8 × 96 GB of GPU memory)
	Single-node - GP7V instance type	56,000	16,384
	Distributed - PAI Lingjun Intelligent Computing Service	163,840	163,840
DeepSeek-R1	Single-node - NVIDIA GPU	56,000	65,536	Single-node 8 × GU120 (8 × 96 GB of GPU memory)
	Single-node - GP7V instance type	56,000	16,384
	Distributed - PAI Lingjun Intelligent Computing Service	163,840	163,840

Single-node deployment instance type notes:

NVIDIA GPU:
- ml.gu8v.c192m1024.8-gu120 and ecs.gn8v-8x.48xlarge are available as public resources, but their inventory might be limited.
- ecs.ebmgn8v.48xlarge cannot be used as a public resource. You must purchase dedicated EAS resources.
GP7V instance type: ml.gp7vf.16.40xlarge is a public resource and can only be used as a preemptible instance. If NVIDIA GPU resources are scarce, switch to the China (Ulanqab) region to find GP7V instance type resources. You must configure a VPC when deploying.

Distributed deployment instance type notes (recommended when high performance is required):

Distributed deployment relies on high-speed networking and must use PAI Lingjun Intelligent Computing Service, which provides high-performance, elastic heterogeneous computing power. You must also configure a VPC during deployment. To use PAI Lingjun Intelligent Computing Service, switch the region to China (Ulanqab).

Lingjun public resources:
- ml.gu7xf.8xlarge-gu108: Requires four machines for a single-instance deployment and can be used only as a preemptible instance.
- GP7V instance type: Requires two machines for a single-instance deployment.
Lingjun prepaid resources: You must be added to a whitelist to use these resources. Contact your sales manager or submit a ticket for consultation.

Distilled version models

Model	Max tokens (SGLang)	Max tokens (vLLM)	Max tokens (BladeLLM)	Minimum configuration
DeepSeek-R1-Distill-Qwen-1.5B	131,072	131,072	131,072	1 × A10 GPU (24 GB of GPU memory)
DeepSeek-R1-Distill-Qwen-7B	131,072	32,768	131,072	1 × A10 GPU (24 GB of GPU memory)
DeepSeek-R1-Distill-Llama-8B	131,072	32,768	131,072	1 × A10 GPU (24 GB of GPU memory)
DeepSeek-R1-Distill-Qwen-14B	131,072	32,768	131,072	1 × GPU L (48 GB of GPU memory)
DeepSeek-R1-Distill-Qwen-32B	131,072	32,768	131,072	2 × GPU L (2 × 48 GB of GPU memory)
DeepSeek-R1-Distill-Llama-70B	131,072	32,768	131,072	2 × GU120 (2 × 96 GB of GPU memory)
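As an illustration of how the table might be used when choosing an instance, the sketch below encodes the minimum configurations listed above and filters the models that a given GPU setup can host. The function name and the matching rule are assumptions: an instance is treated as sufficient when it has at least as many GPUs, and at least as much memory per GPU, as the listed minimum.

```python
# Minimum configurations copied from the distilled-model table above:
# model name -> (GPU count, memory per GPU in GB).
MIN_CONFIG = {
    "DeepSeek-R1-Distill-Qwen-1.5B": (1, 24),
    "DeepSeek-R1-Distill-Qwen-7B": (1, 24),
    "DeepSeek-R1-Distill-Llama-8B": (1, 24),
    "DeepSeek-R1-Distill-Qwen-14B": (1, 48),
    "DeepSeek-R1-Distill-Qwen-32B": (2, 48),
    "DeepSeek-R1-Distill-Llama-70B": (2, 96),
}


def models_that_fit(gpu_count, memory_per_gpu_gb):
    """List distilled models whose minimum configuration the instance meets.

    Assumption: more GPUs or more memory per GPU than the minimum also works.
    """
    return [
        name
        for name, (min_gpus, min_mem) in MIN_CONFIG.items()
        if gpu_count >= min_gpus and memory_per_gpu_gb >= min_mem
    ]


print(models_that_fit(1, 24))  # a single 24 GB GPU, such as one A10
print(models_that_fit(2, 48))  # two 48 GB GPUs
```

This only checks the minimum configuration. The maximum context length still depends on the inference engine, as shown in the table.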

PAI-optimized models

Model Gallery provides one-click deployment for the following PAI-optimized DeepSeek-related models:

DeepSeek-R1-PAI-optimized
DeepSeek-R1-0528-PAI-optimized
DeepSeek-V3-0324-PAI-optimized

Costs and risks

Cost breakdown

For services that use public resources, billing is calculated by the minute, starting from when an instance is provisioned until it is stopped or deleted. Bills are settled hourly. Charges accrue even when the service is idle. Stopping the service halts billing.

For more information, see Billing of Elastic Algorithm Service (EAS).
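To make the per-minute billing concrete, the following sketch estimates the charge for a running instance from its hourly price. It uses the example price of about CNY 11.1 per hour from the deployment step. The function name is hypothetical, and the result is a rough estimate only, because actual prices and rounding rules are defined by EAS billing.

```python
def estimate_cost(minutes_running, hourly_price_cny):
    """Estimate the charge for a public-resource instance billed by the minute."""
    if minutes_running < 0:
        raise ValueError("minutes_running must not be negative")
    return round(minutes_running * hourly_price_cny / 60, 2)


# Example price from the deployment step: about CNY 11.1 per hour for one A10 instance.
A10_HOURLY_CNY = 11.1

print(estimate_cost(90, A10_HOURLY_CNY))       # a 90-minute test session
print(estimate_cost(24 * 60, A10_HOURLY_CNY))  # an idle service left running for a day
```

Under these assumptions, a 90-minute test costs roughly CNY 16.65, while a service left idle for a full day costs roughly CNY 266.40, which is why stopping the service after testing matters.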

Cost control

Clean up promptly: After development and testing, immediately stop or delete the service to effectively control costs.
Use trial resources: If you are using EAS for the first time, you can claim PAI-EAS trial resources from Alibaba Cloud Free Tier. You can then deploy a model whose minimum configuration is a single A10 GPU, such as DeepSeek-R1-Distill-Qwen-7B. During deployment, change the resource specification to the instance type provided by the trial.
Select an appropriate model: In non-production environments, prioritize lower-cost distilled version models.
Use preemptible instances: For non-production tasks, you can enable the preemptible mode during deployment. Note that certain conditions must be met for a successful bid, and there is a risk of resource instability.
Long-term usage discounts: For long-running production services, you can reduce costs by purchasing a savings plan or prepaid resources.

Key risks

Unexpected costs: Forgetting to stop the service results in continuous billing. Always clean up resources immediately after use.
BladeLLM output truncation: When you use the BladeLLM engine, if the max_tokens parameter is not specified in the API request, the output is truncated to 16 tokens, which might prevent the feature from working as expected.
Incorrect API usage:
- When you call a DeepSeek-R1 series model, including a system prompt in the messages might cause unexpected behavior.
- The API request URL must end with a path such as /v1/chat/completions. Otherwise, a 404 error is returned.
Resource inventory: Limited inventory of high-end GPU resources in a specific region can lead to deployment failures or long waiting times. You can try switching to another region.

Model deployment FAQ

Choosing an inference engine

Recommended: SGLang. It delivers high performance while being fully compatible with the OpenAI API standard, which makes it a great fit for the mainstream application ecosystem. In most scenarios, it supports a longer maximum context length than vLLM.
Alternative: vLLM. As a popular framework in the industry, it also offers excellent API compatibility.
Specific scenarios: BladeLLM. Use BladeLLM, a high-performance inference framework developed in-house by Alibaba Cloud PAI, only if you require higher inference performance and can accept API differences from the OpenAI standard, such as the lack of support for client.models.list() and output truncation to 16 tokens when max_tokens is not specified.

Service stuck waiting

Possible reasons:

Insufficient machine resources in the current region.
The model is large, and model loading takes a long time. For large models such as DeepSeek-R1 and DeepSeek-V3, this can take 20 to 30 minutes.

You can wait for a period of time. If the service still fails to start after an extended period, we recommend that you try the following steps:

Go to the Job Management > Deployment Jobs page to view the deployment job details. In the upper-right corner, click More > More Information to go to the PAI-EAS model service details page and check the service instance status.
If the Instance Status column shows Insufficient Inventory, it means the instance cannot be scheduled due to a lack of resources.
Stop the current service, and then switch to another region in the upper-left corner of the console to redeploy the service.
Note
Models with very large parameter counts, such as DeepSeek-R1 and DeepSeek-V3, require 8 GPUs to start the service, and inventory for these resources is tight. As an alternative, you can deploy a smaller distilled model such as DeepSeek-R1-Distill-Qwen-7B, for which inventory is abundant.

Model call FAQ

API call returns 404

Check whether an OpenAI API suffix, such as v1/chat/completions, has been added to the URL. For more information, refer to the API call instructions on the model's introduction page.

If you are using a vLLM-accelerated deployment, check that the model parameter in the request body of the chat API is set to the correct model name. You can obtain the model name from <EAS_ENDPOINT>/v1/models.
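The two checks above can be expressed as a small pre-flight function that runs before a request is sent. This is a sketch only: `diagnose_404` is a hypothetical helper, and the model check applies to deployments that expect a model name in the request body, such as vLLM.

```python
def diagnose_404(request_url, model, available_models):
    """Return human-readable hints for the two 404 causes described above."""
    hints = []
    path = request_url.split("?", 1)[0].rstrip("/")
    if not path.endswith(("/v1/chat/completions", "/v1/completions")):
        hints.append("URL is missing an OpenAI API suffix such as /v1/chat/completions")
    if model not in available_models:
        hints.append(
            f"model {model!r} is not served; expected one of {sorted(available_models)}"
        )
    return hints


# For example, the IDs returned by <EAS_ENDPOINT>/v1/models.
served = ["DeepSeek-R1-Distill-Qwen-7B"]
print(diagnose_404("<EAS_ENDPOINT>", "deepseek-r1", served))
print(diagnose_404("<EAS_ENDPOINT>/v1/chat/completions", "DeepSeek-R1-Distill-Qwen-7B", served))
```

An empty list means neither of the two common causes was detected, so the next place to look is the API call instructions on the model's introduction page.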

Request timeout

The default gateway request timeout is 180 seconds. To extend it, configure a Dedicated Gateway and submit a ticket to adjust the timeout. The maximum is 600 seconds.

No "web search" feature

The "web search" feature is not enabled by deploying the model service alone; it requires building a separate AI application (Agent) on top of the service.

Use LangStudio, PAI's large model application development platform, to build a web search AI application. For more information, see Build a DeepSeek-based web search application flow using LangStudio and Alibaba Cloud Information Query Service.

Model skips thinking

If the DeepSeek-R1 model sometimes skips the thinking process, use the updated chat template from DeepSeek that forces thinking:

Modify the startup command.
In the service configuration, edit the JSON configuration. Modify the containers-script field to add --chat-template /model_dir/template_force_thinking.jinja, which can be added after --served-model-name DeepSeek-R1.
For an already deployed service, go to Model Gallery > Job Management > Deployment Jobs, click the deployed service name, and then click Update service in the upper-right corner to go to the configuration page.
Modify the request body. In each request, append {"role": "assistant", "content": "<think>\n"} as the last element of the messages array.
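Both options can be sketched as small string and list manipulations. The helper names are hypothetical, and the command passed to `add_chat_template_flag` is a made-up placeholder, not the real startup script of an EAS service. Only the flag, the template path, and the assistant message come from the steps above.

```python
FORCE_THINKING_FLAG = "--chat-template /model_dir/template_force_thinking.jinja"
THINK_PREFILL = {"role": "assistant", "content": "<think>\n"}


def add_chat_template_flag(script, served_model_name="DeepSeek-R1"):
    """Insert the chat-template flag after --served-model-name in a startup command."""
    if "--chat-template" in script:
        return script  # already configured; leave the command unchanged
    anchor = f"--served-model-name {served_model_name}"
    if anchor in script:
        return script.replace(anchor, f"{anchor} {FORCE_THINKING_FLAG}", 1)
    # Fall back to appending the flag when the anchor is not found.
    return f"{script} {FORCE_THINKING_FLAG}"


def with_forced_thinking(messages):
    """Return a copy of the messages with the assistant prefill appended."""
    return [*messages, dict(THINK_PREFILL)]


# Placeholder command for illustration only.
print(add_chat_template_flag("<launch command> --served-model-name DeepSeek-R1 <other options>"))
print(with_forced_thinking([{"role": "user", "content": "What is 3 + 5?"}]))
```

`with_forced_thinking` returns a new list, so the caller's conversation history is not modified by the prefill message.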

Disabling thinking mode

DeepSeek-R1 series models do not support disabling the thinking process.

Multi-turn conversations

The model service does not save conversation history. The client application must store the history and include it in subsequent requests. The following example shows a multi-turn conversation with a service deployed by using SGLang.

curl

curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
        {
            "role": "user",
            "content": "Hello"
        },
        {
            "role": "assistant",
            "content": "Hello! I'\''m glad to see you. What can I help you with?"
        },
        {
            "role": "user",
            "content": "What was my previous question?"
        }
        ]
    }' \
    <EAS_ENDPOINT>/v1/chat/completions
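On the client side, keeping the history can be as simple as a list that grows with every turn. The class below is a minimal sketch with hypothetical names. It only builds request bodies and leaves sending them to whichever client you use.

```python
class Conversation:
    """Keeps chat history on the client, because the service does not store it."""

    def __init__(self, model):
        self.model = model
        self.messages = []

    def ask(self, user_text):
        """Record a user turn and return the request body to send."""
        self.messages.append({"role": "user", "content": user_text})
        # Copy the list so that later turns do not change an already built request.
        return {"model": self.model, "messages": list(self.messages)}

    def record_reply(self, assistant_text):
        """Store the assistant's answer so that the next request includes it."""
        self.messages.append({"role": "assistant", "content": assistant_text})


chat = Conversation("<model_name>")
first_request = chat.ask("Hello")
chat.record_reply("Hello! I'm glad to see you. What can I help you with?")
second_request = chat.ask("What was my previous question?")
print([m["role"] for m in second_request["messages"]])
```

The second request carries all three messages, which mirrors the curl example above.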
