Deploy DeepSeek-V4 and DeepSeek-R1

Manually deploying DeepSeek models involves complex tasks, including compute environment configuration, model loading, and inference optimization. PAI Model Gallery simplifies this process with a one-click deployment feature. In just a few steps, you can create an OpenAI API-compatible model service and integrate it into your applications.

Solution architecture

Core components

This solution is built on PAI and includes the following core components:

  • Model Gallery: Serves as the entry point for model distribution and deployment. It provides pre-configured DeepSeek models and their corresponding deployment configurations.

  • Elastic Algorithm Service (EAS): The core service that hosts model deployment and inference. It automatically manages underlying compute resources, such as GPUs, and starts model service instances.

  • Inference acceleration engine (SGLang/vLLM/BladeLLM): Optimizes model inference performance.

    • SGLang/vLLM: Provides interfaces that are fully compatible with the OpenAI API, which simplifies migrating existing applications.

    • BladeLLM: A proprietary high-performance inference framework that provides superior inference performance in specific scenarios.

  • API gateway: Provides a secure channel to access the model service. It supports calling the model service with a service endpoint and an authentication token.

Deployment method

For large-scale models, Model Gallery offers one-click distributed deployment and EP+PD separation (expert parallelism combined with prefill-decode separation) in addition to single-node deployment.

For example, DeepSeek-V4-Pro-FP8 and DeepSeek-V4-Flash-FP8 can both be deployed by using the EP+PD separation method.

Quick deployment and validation

Step 1: Deploy the model service

  1. Log on to the PAI console and select the target region in the upper-left corner. From the navigation pane on the left, go to Workspaces and select the target workspace.

  2. In the workspace, go to QuickStart > Model Gallery.

  3. In the model list, search for and select the target model, such as DeepSeek-R1-Distill-Qwen-7B, to open the model details page.

  4. Click Deploy in the upper-right corner, and then configure the following parameters.

    • Inference Engine: SGLang or vLLM is recommended.

    • Deployment Resource: Choose a public resource or dedicated resource, and select the appropriate GPU specification based on the model's requirements.

      • By default, a public resource is used, and a recommended specification is provided. If no specifications are available, try switching the region.

        Important

        When you deploy by using a public resource, billing starts as soon as the service instance provisions the resource, with charges based on duration, even if there are no calls. Stop the service promptly after testing.

      • If you select Resource Quota, make sure to choose the corresponding inference engine and deployment template for your instance type. For example, if you use a GP7V instance type, you can select SGLang for the Inference Engine and must select Single-Node-GP7V for the Deployment Template.

    The Deployment Template is set to Single-Node by default. An example of a recommended GPU specification is ecs.gn7i-c16g1.4xlarge (16 vCPUs, 60 GiB, 1 × NVIDIA A10), which costs approximately CNY 11.1 per hour.

  5. After confirming that all configurations are correct, click Deploy. The system begins creating the service.

    Note

    For a large model, such as the full-version DeepSeek-R1, the model loading process might take 20 to 30 minutes.

  6. You can view the status of the deployment job on the Model Gallery > Job Management > Deployment Jobs page. Click the service name to open the service details page. You can also click More Information in the upper-right corner to view the model service details page in EAS.

    On the service details page, click View Deployment Events next to the Status field for more details. To obtain the service endpoint and token, click View Call Information in the Call Information section.

Step 2: Online debugging

On the Model Gallery > Job Management > Deployment Jobs page, click the name of the deployed service and switch to the Online Debugging tab, which supports both Conversation Debugging and API Debugging.

Note

The official usage recommendations for the DeepSeek-R1 model series are as follows:

  • Set temperature between 0.5 and 0.7 (0.6 is recommended) to prevent repetitive or incoherent output.

  • Do not add a system prompt. Place all instructions in the user prompt.

  • For math-related questions, include "Please reason step by step and put the final answer in a \boxed{}." in the prompt.
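The recommendations above can be folded into a small request builder. The following is a minimal sketch, not an official helper: the function name build_r1_request and its parameters are hypothetical, and rejecting temperatures outside the recommended range is a design choice of this sketch rather than a requirement of the service.

```python
from typing import Any, Dict

# Instruction recommended for math questions in the DeepSeek-R1 usage notes.
MATH_HINT = "Please reason step by step and put the final answer in a \\boxed{}."


def build_r1_request(prompt: str, model: str, is_math: bool = False,
                     temperature: float = 0.6) -> Dict[str, Any]:
    """Build a chat request body that follows the DeepSeek-R1 recommendations."""
    if not 0.5 <= temperature <= 0.7:
        raise ValueError("temperature should stay between 0.5 and 0.7")
    content = f"{prompt}\n{MATH_HINT}" if is_math else prompt
    return {
        "model": model,
        # No system message: all instructions live in the user prompt.
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
    }


body = build_r1_request("What is 3 + 5?", "DeepSeek-R1-Distill-Qwen-7B", is_math=True)
print(body["messages"][0]["content"])
```

The returned dictionary can be sent as the JSON body of a chat request.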

If only the API debugging page is available, the following steps provide an example using the chat API of a large language model (LLM) service:

  1. Confirm the request path: <EAS_ENDPOINT>/v1/chat/completions. In this path, <EAS_ENDPOINT> is the service endpoint, which is usually pre-filled.

    For services deployed with SGLang/vLLM, you can retrieve more supported APIs from <EAS_ENDPOINT>/openapi.json.
  2. Construct the request body.

    If the prompt is "What is 3 + 5?", the request body is formatted as follows.

    The value for the model parameter is the model name obtained from the model list API <EAS_ENDPOINT>/v1/models. This example uses DeepSeek-R1-Distill-Qwen-7B.

    {
        "model": "DeepSeek-R1-Distill-Qwen-7B",
        "messages": [
            {
                "role": "user",
                "content": "What is 3 + 5?"
            }
        ]
    }
  3. Send the request.

    On the Online Debugging tab, select POST as the HTTP method and enter the request path in the URL field, for example, <EAS_ENDPOINT>/v1/chat/completions. In the Body section, select raw and enter the JSON request body, which includes the model field (such as DeepSeek-R1-Distill-Qwen-7B) and the messages field. Click Send Request. If the request succeeds, the Response area on the right shows status code 200 and the model's reply in JSON format in the body.
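Because the model value must match a name returned by <EAS_ENDPOINT>/v1/models, it can help to validate it before sending a request. The sketch below assumes the response follows the OpenAI-style list shape (a "data" array of objects with an "id" field). The helper names and the sample payload are illustrative and not part of the service.

```python
from typing import Any, Dict, List


def model_ids(models_response: Dict[str, Any]) -> List[str]:
    """Return the model IDs from an OpenAI-style /v1/models response."""
    return [item["id"] for item in models_response.get("data", []) if "id" in item]


def check_model(requested: str, models_response: Dict[str, Any]) -> str:
    """Return `requested` if the service serves it, otherwise raise ValueError."""
    available = model_ids(models_response)
    if requested not in available:
        raise ValueError(f"Model {requested!r} is not served; available: {available}")
    return requested


# Sample payload shaped like an OpenAI-compatible model list (illustrative values).
sample = {"object": "list", "data": [{"id": "DeepSeek-R1-Distill-Qwen-7B", "object": "model"}]}
print(check_model("DeepSeek-R1-Distill-Qwen-7B", sample))
```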

BladeLLM API request sample

Important

When using the BladeLLM accelerated deployment method, if you do not specify the max_tokens parameter, the output is truncated to 16 tokens by default. Adjust the max_tokens request parameter based on your needs.

{
    "messages": [
        {
            "role": "user",
            "content": "What is 3 + 5?"
        }
    ],
    "max_tokens": 2000
}
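To avoid the 16-token truncation by accident, a client can set max_tokens whenever the caller leaves it out. The following is a minimal sketch. The helper name with_max_tokens is hypothetical, and the default of 2000 simply mirrors the sample request above.

```python
from typing import Any, Dict


def with_max_tokens(body: Dict[str, Any], default: int = 2000) -> Dict[str, Any]:
    """Return a copy of `body` that always carries an explicit max_tokens value."""
    patched = dict(body)
    # Keep the caller's value if one is present; otherwise fill in the default.
    patched.setdefault("max_tokens", default)
    return patched


request = {"messages": [{"role": "user", "content": "What is 3 + 5?"}]}
print(with_max_tokens(request)["max_tokens"])
```

The original dictionary is left unchanged, so the helper can wrap any request body just before it is sent.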

API call

  1. Get the service endpoint and token.

    1. On the Model Gallery > Job Management > Deployment Jobs page, click the name of the deployed service to go to the service details page.

    2. Click View Call Information to get the service endpoint and token.

  2. The following example shows how to make a chat API call.

    Replace <EAS_ENDPOINT> with the service endpoint and <EAS_TOKEN> with the service token.

    OpenAI SDK

    Note:

    • Append /v1 to the end of the endpoint.

    • BladeLLM-accelerated deployments do not support client.models.list(). As a workaround, set the model parameter to an empty string ("").

    SGLang/vLLM

    from openai import OpenAI

    # 1. Configure the client.
    # Replace <EAS_ENDPOINT> and <EAS_TOKEN> with your actual service endpoint and token.
    openai_api_key = "<EAS_TOKEN>"
    openai_api_base = "<EAS_ENDPOINT>/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # 2. Get the model name. Stop here if the lookup fails, because the request below needs it.
    try:
        model = client.models.list().data[0].id
        print(model)
    except Exception as e:
        raise SystemExit(f"Failed to get the model list. Check the endpoint and token. Error: {e}")

    # 3. Construct and send the request.
    stream = True
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        model=model,
        max_tokens=2048,
        stream=stream,
    )
    if stream:
        for chunk in chat_completion:
            # Some chunks carry no text, so skip empty deltas instead of printing "None".
            if chunk.choices and chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="")
    else:
        result = chat_completion.choices[0].message.content
        print(result)

    BladeLLM

    from openai import OpenAI

    ##### API Configuration #####
    # Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
    openai_api_key = "<EAS_TOKEN>"
    openai_api_base = "<EAS_ENDPOINT>/v1"
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )
    # BladeLLM accelerated deployment does not currently support using client.models.list() to get the model name. You can set the model value to "" for compatibility.
    model = ""
    stream = True
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        model=model,
        max_tokens=2048,
        stream=stream,
    )
    if stream:
        for chunk in chat_completion:
            # Some chunks carry no text, so skip empty deltas instead of printing "None".
            if chunk.choices and chunk.choices[0].delta.content:
                print(chunk.choices[0].delta.content, end="")
    else:
        result = chat_completion.choices[0].message.content
        print(result)

    HTTP

    SGLang/vLLM

    Replace <model_name> with the model name obtained from the model list API <EAS_ENDPOINT>/v1/models.

    curl -X POST \
        -H "Content-Type: application/json" \
        -H "Authorization: <EAS_TOKEN>" \
        -d '{
            "model": "<model_name>",
            "messages": [
            {
                "role": "user",
                "content": "hello!"
            }
            ]
        }' \
        <EAS_ENDPOINT>/v1/chat/completions
    
    import json
    import requests
    # Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
    EAS_ENDPOINT = "<EAS_ENDPOINT>"
    EAS_TOKEN = "<EAS_TOKEN>"
    url = f"{EAS_ENDPOINT}/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": EAS_TOKEN,
    }
    # Replace <model_name> with the model name obtained from the model list API <EAS_ENDPOINT>/v1/models.
    model = "<model_name>"
    stream = True
    messages = [
        {"role": "user", "content": "Hello, please introduce yourself."},
    ]
    req = {
        "messages": messages,
        "stream": stream,
        "temperature": 0.6,
        "top_p": 0.5,
        "top_k": 10,
        "max_tokens": 300,
        "model": model,
    }
    response = requests.post(
        url,
        json=req,
        headers=headers,
        stream=stream,
    )
    if stream:
        for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
            msg = chunk.decode("utf-8")
            if msg.startswith("data"):
                info = msg[6:]
                if info == "[DONE]":
                    break
                else:
                    resp = json.loads(info)
                    # Skip chunks whose delta has no text, such as a role-only chunk.
                    content = resp["choices"][0]["delta"].get("content")
                    if content is not None:
                        print(content, end="", flush=True)
    else:
        resp = json.loads(response.text)
        print(resp["choices"][0]["message"]["content"])
    
    BladeLLM
    curl -X POST \
        -H "Content-Type: application/json" \
        -H "Authorization: <EAS_TOKEN>" \
        -d '{
            "messages": [
            {
                "role": "user",
                "content": "hello!"
            }
            ]
        }' \
        <EAS_ENDPOINT>/v1/chat/completions
    
    import json
    import requests
    # Replace <EAS_ENDPOINT> with the endpoint of the deployed service and <EAS_TOKEN> with the token of the deployed service.
    EAS_ENDPOINT = "<EAS_ENDPOINT>"
    EAS_TOKEN = "<EAS_TOKEN>"
    url = f"{EAS_ENDPOINT}/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": EAS_TOKEN,
    }
    stream = True
    messages = [
        {"role": "user", "content": "Hello, please introduce yourself."},
    ]
    # When you use BladeLLM for accelerated deployment, if you do not specify the max_tokens parameter, the output is truncated to 16 tokens by default. We recommend that you adjust the max_tokens request parameter as needed.
    req = {
        "messages": messages,
        "stream": stream,
        "temperature": 0.6,
        "top_p": 0.5,
        "top_k": 10,
        "max_tokens": 300,
    }
    response = requests.post(
        url,
        json=req,
        headers=headers,
        stream=stream,
    )
    if stream:
        for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False):
            msg = chunk.decode("utf-8")
            if msg.startswith("data"):
                info = msg[6:]
                if info == "[DONE]":
                    break
                else:
                    resp = json.loads(info)
                    if resp["choices"][0]["delta"].get("content") is not None:
                        print(resp["choices"][0]["delta"]["content"], end="", flush=True)
    else:
        resp = json.loads(response.text)
        print(resp["choices"][0]["message"]["content"])
  3. Because different models and deployment frameworks have distinct behaviors, you should consult the detailed API call instructions on the model's introduction page in Model Gallery.

    For example, the introduction page for the DeepSeek-R1-Distill-Qwen-7B model in Model Gallery specifies the resource requirements for BladeLLM accelerated deployment. This deployment requires 24 GB of GPU memory and supports only ecs.gn7 and later instance families. The page also states that the model is compatible with OpenAI's v1/completions and v1/chat/completions endpoints and provides call examples using curl commands and Python scripts.
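The streaming loops in the HTTP examples above can be factored into a single line parser, which also makes the parsing testable without a live service. The following is a minimal sketch under the assumption that the service streams OpenAI-style "data: {...}" lines ending with "data: [DONE]". The function name extract_delta and the sample lines are illustrative.

```python
import json
from typing import Optional


def extract_delta(line: bytes) -> Optional[str]:
    """Return the text carried by one streamed line, or None if it has none."""
    msg = line.decode("utf-8").strip()
    if not msg.startswith("data:"):
        return None  # blank separator lines and non-data fields
    payload = msg[len("data:"):].strip()
    if payload == "[DONE]":
        return None  # end-of-stream marker
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    # A delta may lack "content" or set it to null, for example in a role-only chunk.
    return (choices[0].get("delta") or {}).get("content")


lines = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
text = "".join(piece for piece in map(extract_delta, lines) if piece)
print(text)
```

With requests, the same function can be applied to each item yielded by response.iter_lines().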

Third-party integration

To connect to Chatbox, Dify, or Cherry Studio, see Integrate third-party clients.

Local Web UI

See Build a local Web UI by using Gradio.

Resource cleanup

For instances deployed using public resources, billing is based on the duration of use and starts as soon as resources are provisioned. Usage of less than one hour is charged by the minute.

  1. Go to the Job Management > Deployment Jobs page.

  2. Find the service that you want to stop and click Stop or Delete in the Actions column.

    • Stop: The service instance is released, and billing stops. The service configuration is retained, and you can restart the service later.

    • Delete: Both the service configuration and the instance are permanently deleted.

Model and resource selection

Your choice of model determines the required compute resources and deployment costs. DeepSeek models are available as "full-version" and "distilled version" variants, which have vastly different resource requirements.

  • Development and testing: Use a distilled model, such as DeepSeek-R1-Distill-Qwen-7B. Distilled models need fewer resources (typically a single GPU with 24 GB of GPU memory), deploy quickly, and cost less, which makes them well suited to rapid feature validation.

  • Production environment: Evaluate based on a balance of performance and cost. The DeepSeek-R1-Distill-Qwen-32B model strikes a good balance between effectiveness and cost. If you require higher model performance, choose a full-version model. This requires multiple high-end GPUs, such as eight GPUs with 96 GB of GPU memory each, which significantly increases costs.

The following table lists the minimum configurations for different model versions and the maximum number of tokens supported by different instance types and inference engines.

Full-version models

| Model | Deployment method | Max tokens (SGLang) | Max tokens (vLLM) | Minimum configuration |
| --- | --- | --- | --- | --- |
| DeepSeek-V4-Pro | Single-node/Distributed | 1M | 1M | Single-node with 8 × H20-3e (8 × 141 GB of GPU memory) |
| DeepSeek-V4-Flash | Single-node | 1M | 1M | Single-node with 8 × H20-3e (8 × 141 GB of GPU memory) |
| DeepSeek-V4-Pro-FP8 | Single-node/PD separation | 1M | / | Single-node with 8 × H20-3e (8 × 141 GB of GPU memory) |
| DeepSeek-V4-Flash-FP8 | Single-node/PD separation | 1M | / | Single-node with 4 × H20-3e (4 × 141 GB of GPU memory) |
| DeepSeek-V3 | Single-node - NVIDIA GPU | 56,000 | 65,536 | Single-node 8 × GU120 (8 × 96 GB of GPU memory) |
| | Single-node - GP7V instance type | 56,000 | 16,384 | |
| | Distributed - PAI Lingjun Intelligent Computing Service | 163,840 | 163,840 | |
| DeepSeek-R1 | Single-node - NVIDIA GPU | 56,000 | 65,536 | Single-node 8 × GU120 (8 × 96 GB of GPU memory) |
| | Single-node - GP7V instance type | 56,000 | 16,384 | |
| | Distributed - PAI Lingjun Intelligent Computing Service | 163,840 | 163,840 | |

Single-node deployment instance type notes:

  • NVIDIA GPU:

    • ml.gu8v.c192m1024.8-gu120 and ecs.gn8v-8x.48xlarge are available as public resources, but their inventory might be limited.

    • ecs.ebmgn8v.48xlarge cannot be used as a public resource. You must purchase dedicated EAS resources.

  • GP7V instance type: ml.gp7vf.16.40xlarge is a public resource and can only be used as a preemptible instance. If NVIDIA GPU resources are scarce, switch to the China (Ulanqab) region to find GP7V instance type resources. You must configure a VPC when deploying.

Distributed deployment instance type notes (recommended when high performance is required):

Distributed deployment relies on high-speed networking and must use PAI Lingjun Intelligent Computing Service, which provides high-performance, elastic heterogeneous computing power. You must also configure a VPC during deployment. To use PAI Lingjun Intelligent Computing Service, switch the region to China (Ulanqab).

  • Lingjun public resources:

    • ml.gu7xf.8xlarge-gu108: Requires four machines for a single-instance deployment and can be used only as a preemptible instance.

    • GP7V instance type: Requires two machines for a single-instance deployment.

  • Lingjun prepaid resources: You must be added to a whitelist to use these resources. Contact your sales manager or submit a ticket for consultation.

Distilled version models

| Model | Max tokens (SGLang) | Max tokens (vLLM) | Max tokens (BladeLLM) | Minimum configuration |
| --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | 131,072 | 131,072 | 131,072 | 1 × A10 GPU (24 GB of GPU memory) |
| DeepSeek-R1-Distill-Qwen-7B | 131,072 | 32,768 | 131,072 | 1 × A10 GPU (24 GB of GPU memory) |
| DeepSeek-R1-Distill-Llama-8B | 131,072 | 32,768 | 131,072 | 1 × A10 GPU (24 GB of GPU memory) |
| DeepSeek-R1-Distill-Qwen-14B | 131,072 | 32,768 | 131,072 | 1 × GPU L (48 GB of GPU memory) |
| DeepSeek-R1-Distill-Qwen-32B | 131,072 | 32,768 | 131,072 | 2 × GPU L (2 × 48 GB of GPU memory) |
| DeepSeek-R1-Distill-Llama-70B | 131,072 | 32,768 | 131,072 | 2 × GU120 (2 × 96 GB of GPU memory) |
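The token limits for the distilled models can be encoded as a lookup table to check which inference engines cover a given token budget. This is an illustrative sketch: the values are copied from the table above in the column order SGLang, vLLM, BladeLLM. The function name engines_for is hypothetical, and needed_tokens is whatever total the caller budgets for a request.

```python
from typing import Dict, List

# Maximum tokens per engine, copied from the distilled-model table.
MAX_TOKENS: Dict[str, Dict[str, int]] = {
    "DeepSeek-R1-Distill-Qwen-1.5B": {"SGLang": 131072, "vLLM": 131072, "BladeLLM": 131072},
    "DeepSeek-R1-Distill-Qwen-7B": {"SGLang": 131072, "vLLM": 32768, "BladeLLM": 131072},
    "DeepSeek-R1-Distill-Llama-8B": {"SGLang": 131072, "vLLM": 32768, "BladeLLM": 131072},
    "DeepSeek-R1-Distill-Qwen-14B": {"SGLang": 131072, "vLLM": 32768, "BladeLLM": 131072},
    "DeepSeek-R1-Distill-Qwen-32B": {"SGLang": 131072, "vLLM": 32768, "BladeLLM": 131072},
    "DeepSeek-R1-Distill-Llama-70B": {"SGLang": 131072, "vLLM": 32768, "BladeLLM": 131072},
}


def engines_for(model: str, needed_tokens: int) -> List[str]:
    """List the engines whose maximum token count covers `needed_tokens`."""
    limits = MAX_TOKENS[model]
    return [engine for engine, limit in limits.items() if limit >= needed_tokens]


print(engines_for("DeepSeek-R1-Distill-Qwen-7B", 64000))
```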

PAI-optimized models

Model Gallery provides one-click deployment for the following PAI-optimized DeepSeek-related models:

  • DeepSeek-R1-PAI-optimized

  • DeepSeek-R1-0528-PAI-optimized

  • DeepSeek-V3-0324-PAI-optimized

Costs and risks

Cost breakdown

For services that use public resources, billing is calculated by the minute, starting from when an instance is provisioned until it is stopped or deleted. Bills are settled hourly. Charges accrue even when the service is idle. Stopping the service halts billing.

For more information, see Billing of Elastic Algorithm Service (EAS).
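As a rough illustration of per-minute billing, the sketch below estimates the charge for a test session. It uses the approximate CNY 11.1 per hour quoted earlier for the example A10 specification. Rounding a partial minute up to a full minute is an assumption of this sketch, not a documented rule, so treat the result as an estimate and consult the billing documentation for exact charges.

```python
import math


def estimate_cost(seconds_running: float, hourly_rate_cny: float = 11.1) -> float:
    """Estimate the charge for a public-resource instance billed per minute."""
    # Assumption: a partial minute is billed as a full minute.
    minutes = math.ceil(seconds_running / 60)
    return round(minutes * hourly_rate_cny / 60, 2)


# A 90-minute test session, idle time included.
print(estimate_cost(90 * 60))
```

The estimate counts all running time, because charges accrue even when the service receives no calls.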

Cost control

  • Clean up promptly: After development and testing, immediately stop or delete the service to effectively control costs.

  • Use trial resources: If you are new to EAS, claim PAI-EAS trial resources from Alibaba Cloud Free Tier. You can then deploy a model whose minimum configuration is an A10 GPU, such as DeepSeek-R1-Distill-Qwen-7B. During deployment, change the resource specification to the instance type provided by the trial.

  • Select an appropriate model: In non-production environments, prioritize lower-cost distilled version models.

  • Use preemptible instances: For non-production tasks, you can enable the preemptible mode during deployment. Note that certain conditions must be met for a successful bid, and there is a risk of resource instability.

  • Long-term usage discounts: For long-running production services, you can reduce costs by purchasing a savings plan or prepaid resources.

Key risks

  • Unexpected costs: Forgetting to stop the service results in continuous billing. Always clean up resources immediately after use.

  • BladeLLM output truncation: When you use the BladeLLM engine, if the max_tokens parameter is not specified in the API request, the output is truncated to 16 tokens, which might prevent the feature from working as expected.

  • Incorrect API usage:

    • When you call a DeepSeek-R1 series model, including a system prompt in the messages might cause unexpected behavior.

    • The API request URL must end with a path such as /v1/chat/completions. Otherwise, a 404 error is returned.

  • Resource inventory: Limited inventory of high-end GPU resources in a specific region can lead to deployment failures or long waiting times. You can try switching to another region.

Model deployment FAQ

Choosing an inference engine

  • Recommended: SGLang. It delivers high performance while being fully compatible with the OpenAI API standard, which makes it a great fit for the mainstream application ecosystem. In most scenarios, it supports a longer maximum context length than vLLM.

  • Alternative: vLLM. As a popular framework in the industry, it also offers excellent API compatibility.

  • Specific scenarios: BladeLLM. Use this high-performance inference framework, developed in-house by Alibaba Cloud PAI, only if you need higher inference performance and can accept API differences from the OpenAI standard, such as no support for client.models.list() and output truncated to 16 tokens when max_tokens is not specified.

Service stuck waiting

Possible reasons:

  • Insufficient machine resources in the current region.

  • The model is large, and model loading takes a long time. For large models such as DeepSeek-R1 and DeepSeek-V3, this can take 20 to 30 minutes.

Large models can take 20 to 30 minutes to load, so wait for that period first. If the service still has not started afterward, try the following steps:

  1. Go to the Job Management > Deployment Jobs page to view the deployment job details. In the upper-right corner, click More > More Information to go to the PAI-EAS model service details page and check the service instance status.

    If the Instance Status column shows Insufficient Inventory, it means the instance cannot be scheduled due to a lack of resources.

  2. Stop the current service, and then switch to another region in the upper-left corner of the console to redeploy the service.

    Note

    Very large models such as DeepSeek-R1 and DeepSeek-V3 need 8 GPUs to start, and inventory for these resources is tight. Smaller distilled models such as DeepSeek-R1-Distill-Qwen-7B run on resources that are in abundant supply, so consider deploying one of them instead.

Model call FAQ

API call returns 404

Check whether an OpenAI API suffix, such as v1/chat/completions, has been added to the URL. For more information, refer to the API call instructions on the model's introduction page.

If you are using a vLLM-accelerated deployment, check that the model parameter in the request body of the conversation API is set to the correct model name. You can obtain the model name from v1/models.
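A small helper can rule out the missing-suffix cause of 404 errors by building the request URL the same way every time. The following is a minimal sketch. The function name chat_completions_url and the example endpoint are hypothetical.

```python
def chat_completions_url(endpoint: str) -> str:
    """Append the OpenAI-style chat path to a service endpoint exactly once."""
    base = endpoint.rstrip("/")
    suffix = "/v1/chat/completions"
    if base.endswith(suffix):
        return base  # the caller already supplied the full path
    if base.endswith("/v1"):
        return base + "/chat/completions"
    return base + suffix


# Hypothetical endpoint used only for illustration.
print(chat_completions_url("http://example.invalid/api/predict/demo/"))
```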

Request timeout

The default gateway request timeout is 180 seconds. To extend it, configure a Dedicated Gateway and submit a ticket to adjust the timeout. The maximum is 600 seconds.

No "web search" feature

The "web search" feature is not enabled by deploying the model service alone; it requires building a separate AI application (Agent) on top of the service.

Use LangStudio, PAI's large model application development platform, to build a web search AI application. For more information, see Build a DeepSeek-based web search application flow using LangStudio and Alibaba Cloud Information Query Service.

Model skips thinking

If the DeepSeek-R1 model sometimes skips the thinking process, use the updated chat template from DeepSeek that forces thinking:

  1. Modify the startup command.

    In the service configuration, edit the JSON configuration. Modify the containers-script field to add --chat-template /model_dir/template_force_thinking.jinja, which can be added after --served-model-name DeepSeek-R1.

    For an already deployed service, go to Model Gallery > Job Management > Deployment Jobs, click the deployed service name, and then click Update service in the upper-right corner to go to the configuration page.

  2. Modify the request body. In each request, append {"role": "assistant", "content": "<think>\n"} as the last element of the messages array.
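The request-body change in step 2 can be applied by a helper so that callers do not have to remember it. This is a sketch under the assumption that messages is a list of role/content dictionaries, as in the earlier request examples. The name force_thinking is hypothetical.

```python
from typing import Dict, List

# Assistant prefix from step 2 that makes the reply start inside a thinking block.
THINK_PREFIX = {"role": "assistant", "content": "<think>\n"}


def force_thinking(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a copy of `messages` that ends with the assistant think prefix."""
    if messages and messages[-1] == THINK_PREFIX:
        return list(messages)  # already present, do not add it twice
    return list(messages) + [dict(THINK_PREFIX)]


history = [{"role": "user", "content": "What is 3 + 5?"}]
patched = force_thinking(history)
print(patched[-1]["role"], len(patched))
```

The helper returns a new list, so the stored conversation history is not modified.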

Disabling thinking mode

DeepSeek-R1 series models do not support disabling the thinking process.

Multi-turn conversations

The model service does not save conversation history. The client application must store the history and include it in subsequent requests. The following example shows a multi-turn conversation with a service deployed by using SGLang.

curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: <EAS_TOKEN>" \
    -d '{
        "model": "<model_name>",
        "messages": [
        {
            "role": "user",
            "content": "Hello"
        },
        {
            "role": "assistant",
            "content": "Hello! I am glad to see you. What can I help you with?"
        },
        {
            "role": "user",
            "content": "What was my previous question?"
        }
        ]
    }' \
    <EAS_ENDPOINT>/v1/chat/completions
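On the client side, the history can be kept in a small object that builds each request body. The following is a minimal Python sketch of that idea. The Conversation class and its method names are hypothetical, and sending the body to the service is left out.

```python
from typing import Any, Dict, List


class Conversation:
    """Keeps chat history on the client and builds each request body from it."""

    def __init__(self, model: str) -> None:
        self.model = model
        self.messages: List[Dict[str, str]] = []

    def request_for(self, user_text: str) -> Dict[str, Any]:
        """Record the user turn and return the body to POST to /v1/chat/completions."""
        self.messages.append({"role": "user", "content": user_text})
        return {"model": self.model, "messages": list(self.messages)}

    def record_reply(self, assistant_text: str) -> None:
        """Store the model's answer so that the next request includes it."""
        self.messages.append({"role": "assistant", "content": assistant_text})


chat = Conversation("<model_name>")
chat.request_for("Hello")
chat.record_reply("Hello! I am glad to see you. What can I help you with?")
body = chat.request_for("What was my previous question?")
print([m["role"] for m in body["messages"]])
```

Each call to request_for returns the full history, which matches the shape of the curl request above.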

Related documentation