Deploy MoE models on EAS with expert parallelism (EP) and Prefill-Decode (PD) separation for higher throughput and lower costs.
Architecture
PAI EAS combines PD separation, large-scale EP, computation-communication co-optimization, and MTP for production-grade EP deployment.

Key benefits:
- One-click deployment: EAS EP templates include built-in images, optional resources, and run commands for wizard-based distributed deployment.
- Unified service management: Monitor, scale, and manage the Prefill, Decode, and LLM intelligent router sub-services independently from a single view.
Deploy EP service
The following example deploys DeepSeek-R1-0528-PAI-optimized, which provides higher throughput and lower latency.
- Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
- On the Inference Service tab, click Deploy Service. In the Scenario-based Model Deployment section, click LLM Deployment.
- In Model Settings, select the public model DeepSeek-R1-0528-PAI-optimized.

- Set Inference Engine to vLLM and Deployment Template to EP+PD Separation-PAI Optimized.

- Configure deployment resources for the Prefill and Decode services. Select public resources or a resource quota.
  - Public resources: For trials and development testing. Available specifications: ml.gu8tea.8.48xlarge or ml.gu8tef.8.46xlarge.
  - Resource quota: Recommended for production to ensure resource stability and isolation.

- (Optional) Adjust deployment parameters to optimize performance.
  - Number of instances: Adjust the Prefill and Decode instance counts to change the PD ratio. Default: 1 per service.
  - Parallelism parameters: Adjust EP_SIZE, DP_SIZE, and TP_SIZE for the Prefill and Decode services in environment variables. Defaults: TP_SIZE=8 for Prefill, EP_SIZE=8 and DP_SIZE=8 for Decode.
    Note: To protect the model weights of DeepSeek-R1-0528-PAI-optimized, the run command is not exposed. Use environment variables to modify key parameters.

- Click Deploy and wait for the service to start. Deployment takes approximately 40 minutes.
- After deployment, go to the Online Debugging tab on the service details page to verify the service.
  Note: API access and third-party integration are covered in Call an LLM service.
  Construct an OpenAI-format request. Append /v1/chat/completions to the URL path. Example request body:
  { "model": "", "messages": [ { "role": "user", "content": "Hello!" } ], "max_tokens": 1024 }
  Click Send Request. A 200 status code with a valid model response confirms the service is running.
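
The verification request above can also be sent from a script. The following is a minimal sketch using only the Python standard library. The endpoint URL and the Authorization header value are placeholders that are not given in this topic, so treat both as assumptions and copy the real values from the invocation information of your service. The request body mirrors the example above, including the empty "model" field.

```python
import json
import urllib.request


def build_chat_request(endpoint, auth_header_value, prompt, max_tokens=1024):
    """Build an OpenAI-format chat request for the deployed service.

    endpoint and auth_header_value are hypothetical placeholders; the exact
    header format expected by your service is an assumption to verify.
    """
    # Append the OpenAI-compatible path to the service URL, as in the console step.
    url = endpoint.rstrip("/") + "/v1/chat/completions"
    body = {
        "model": "",  # left empty, as in the example request body
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": auth_header_value,
        },
        method="POST",
    )


def send_chat_request(request, timeout=60.0):
    """Send the request and return (HTTP status code, parsed JSON body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


# Usage (requires a running service; the values below are placeholders):
# request = build_chat_request("<service-endpoint>", "<token>", "Hello!")
# status, payload = send_chat_request(request)
# A status of 200 with a model response in payload indicates the service is running.
```

Building the request separately from sending it makes it possible to inspect the URL and body before any network call is made.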

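For the optional parameter step above, it can help to record the intended environment variable values and instance counts before entering them in the console. The sketch below is illustrative only: the helper names resolve_env and pd_ratio are hypothetical, the defaults are the ones stated in this topic, and the check is limited to "positive integer" because this topic does not state which value combinations are valid.

```python
import math

# Defaults as stated in this topic: TP_SIZE=8 for Prefill, EP_SIZE=8 and DP_SIZE=8 for Decode.
DEFAULTS = {
    "prefill": {"TP_SIZE": 8},
    "decode": {"EP_SIZE": 8, "DP_SIZE": 8},
}
ALLOWED_NAMES = {"EP_SIZE", "DP_SIZE", "TP_SIZE"}


def resolve_env(role, overrides=None):
    """Merge overrides onto the documented defaults for one sub-service.

    Returns string values, since environment variables are strings.
    Hypothetical helper: it does not check hardware or engine constraints.
    """
    if role not in DEFAULTS:
        raise ValueError(f"unknown role: {role!r}")
    env = dict(DEFAULTS[role])
    for name, value in (overrides or {}).items():
        if name not in ALLOWED_NAMES:
            raise ValueError(f"unsupported variable: {name!r}")
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer")
        env[name] = value
    return {name: str(value) for name, value in env.items()}


def pd_ratio(prefill_instances, decode_instances):
    """Return the Prefill:Decode instance ratio in lowest terms."""
    if prefill_instances < 1 or decode_instances < 1:
        raise ValueError("each service needs at least one instance")
    divisor = math.gcd(prefill_instances, decode_instances)
    return prefill_instances // divisor, decode_instances // divisor
```

For example, resolve_env("decode", {"EP_SIZE": 16}) keeps DP_SIZE at its default of 8 while overriding EP_SIZE, and pd_ratio(2, 4) returns (1, 2).
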
Manage EP service
- On the service list page, click a service name to open its details page. This page provides views for the aggregated service and individual sub-services (Prefill, Decode, and LLM intelligent router).
- View monitoring data and logs, and configure auto-scaling policies.




