Deploy MoE models with expert parallelism

Deploy MoE models on EAS with expert parallelism (EP) and Prefill-Decode (PD) separation for higher throughput and lower costs.

Architecture

PAI EAS combines PD separation, large-scale EP, computation-communication co-optimization, and multi-token prediction (MTP) for production-grade EP deployment.

Key benefits:

  • One-click deployment: EAS EP templates include built-in images, optional resources, and run commands for wizard-based distributed deployment.

  • Unified service management: Monitor, scale, and manage Prefill, Decode, and LLM intelligent router sub-services independently from a single view.

Deploy EP service

The following example deploys DeepSeek-R1-0528-PAI-optimized, which provides higher throughput and lower latency.

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service. In the Scenario-based Model Deployment section, click LLM Deployment.

  3. In Model Settings, select the public model DeepSeek-R1-0528-PAI-optimized.

  4. Set Inference Engine to vLLM and Deployment Template to EP+PD Separation-PAI Optimized.

  5. Configure deployment resources for Prefill and Decode services. Select public resources or a resource quota.

    • Public resources: For trials and development testing. Available specifications: ml.gu8tea.8.48xlarge or ml.gu8tef.8.46xlarge.

    • Resource quota: Recommended for production to ensure resource stability and isolation.

  6. (Optional) Adjust deployment parameters to optimize performance.

    • Number of instances: Adjust Prefill and Decode instance counts to change the PD ratio. Default: 1 per service.

    • Parallelism parameters: Adjust EP_SIZE (expert parallelism), DP_SIZE (data parallelism), and TP_SIZE (tensor parallelism) for Prefill and Decode services in environment variables. Defaults: TP_SIZE=8 for Prefill, EP_SIZE=8 and DP_SIZE=8 for Decode.

      Note

      To protect model weights of DeepSeek-R1-0528-PAI-optimized, the run command is not exposed. Use environment variables to modify key parameters.

  7. Click Deploy and wait for the service to start. Deployment takes approximately 40 minutes.

  8. After deployment, go to the Online Debugging tab on the service details page to verify the service.

    Note

    API access and third-party integration are covered in Call an LLM service.

    Construct an OpenAI-format request. Append /v1/chat/completions to the URL path. Example request body:

    {
        "model": "",
        "messages": [
            {
                "role": "user",
                "content": "Hello!"
            }
        ],
        "max_tokens": 1024
    }

    Click Send Request. A 200 status code with a valid model response confirms the service is running.
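
The verification request from the last step can also be sent from a script. The following is a minimal sketch using only the Python standard library, not an official client. It assumes the service endpoint and token are copied from the service details page, that the token is passed in the Authorization header, and that the response follows the OpenAI chat completions format. The environment variable names EAS_ENDPOINT and EAS_TOKEN and the helper function names are hypothetical; see Call an LLM service for the supported access methods.

```python
import json
import os
import urllib.request


def build_chat_url(endpoint: str) -> str:
    """Append the OpenAI-format chat path to the service endpoint."""
    return endpoint.rstrip("/") + "/v1/chat/completions"


def build_payload(prompt: str, max_tokens: int = 1024, model: str = "") -> dict:
    """Build the same request body as the Online Debugging example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat(endpoint: str, token: str, prompt: str, timeout: float = 120.0) -> dict:
    """POST the request and return the decoded JSON response."""
    request = urllib.request.Request(
        build_chat_url(endpoint),
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the service token goes in the Authorization header.
            "Authorization": token,
        },
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx status codes.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical variable names; set them to your own endpoint and token.
    endpoint = os.environ.get("EAS_ENDPOINT")
    token = os.environ.get("EAS_TOKEN")
    if endpoint and token:
        reply = send_chat(endpoint, token, "Hello!")
        # Assumption: OpenAI-format responses keep the text here.
        print(reply["choices"][0]["message"]["content"])
    else:
        print("Set EAS_ENDPOINT and EAS_TOKEN to send a request.")
```

As in Online Debugging, a response that returns without an HTTP error and contains a model reply indicates the service is running.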

Manage EP service

  1. On the service list page, click a service name to open its details page. This page provides views for the aggregated service and individual sub-services (Prefill, Decode, and LLM intelligent router).

  2. View monitoring data and logs, and configure auto-scaling policies.