After training a model, you can use Elastic Algorithm Service (EAS) to quickly deploy it as an online inference service or an AI web application. EAS supports heterogeneous resources and integrates capabilities such as elastic scaling, one-click stress testing, canary release, and real-time monitoring. Together, these capabilities help maintain service stability and business continuity in high-concurrency scenarios while reducing costs.
Product architecture

Core capabilities
EAS provides end-to-end capabilities that cover resource management, model deployment, and managed O&M to ensure your services run stably and efficiently.
Flexible resource and cost management
Heterogeneous hardware support: Supports CPUs, GPUs, and specialized AI accelerator instances to meet the performance needs of different models.
Cost optimization: You can use preemptible instances to significantly reduce computing costs. The scheduled scaling feature lets you define policies based on business cycles to precisely control resource allocation.
Elastic resource pool: When a dedicated resource group is fully utilized, EAS automatically schedules new instances to a public resource group. This approach balances cost control with service stability.
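The overflow behavior of the elastic resource pool can be pictured with a small sketch. This is only an illustration of the idea described above, not how EAS schedules instances internally; the function name `split_placement` and its parameters are hypothetical.

```python
def split_placement(requested, dedicated_free):
    """Split new instances between a dedicated group and public resources.

    Hypothetical illustration: fill the free dedicated capacity first,
    then overflow whatever is left to the public resource group.
    """
    if requested < 0:
        raise ValueError("requested must not be negative")
    on_dedicated = min(requested, max(dedicated_free, 0))
    on_public = requested - on_dedicated
    return on_dedicated, on_public


if __name__ == "__main__":
    # Five new instances requested, three free slots in the dedicated group.
    print(split_placement(5, 3))
```

Filling dedicated capacity first keeps the steady-state workload on isolated resources, and only the excess is exposed to public-resource pricing and contention.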
Comprehensive stability and high availability
Elastic scaling: Automatically adjusts the number of service replicas based on the real-time workload, helping you manage unpredictable traffic spikes and prevent resource underutilization or service overload.
High-availability mechanism: An automatic fault recovery mechanism ensures service continuity. Dedicated resources are physically isolated, which eliminates the risk of resource contention.
Safe releases: Supports canary release, which lets you direct a percentage of traffic to a new version for validation. It also supports traffic mirroring, which copies production traffic to a test service for reliability verification without affecting real user requests.
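To make the canary release idea concrete, the following sketch simulates weighted traffic splitting between two versions. It is a conceptual model only and assumes nothing about how EAS routes traffic; the names `route_request` and `simulate` and the version labels are hypothetical.

```python
import random
from collections import Counter


def route_request(weights, rng):
    """Pick a version with probability proportional to its traffic weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


def simulate(weights, n_requests, seed=0):
    """Route n_requests simulated requests and count hits per version."""
    rng = random.Random(seed)  # seeded so that repeated runs are reproducible
    return Counter(route_request(weights, rng) for _ in range(n_requests))


if __name__ == "__main__":
    # Send roughly 10% of traffic to the new version for validation.
    print(simulate({"stable": 90, "canary": 10}, 10_000))
```

Raising the canary weight step by step while watching the monitoring metrics is the usual way to use such a split. Traffic mirroring differs in that every mirrored request is still answered by the production service.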
Efficient deployment and O&M
One-click stress testing: Dynamically increases the load to find the service's performance limits automatically. You can view per-second monitoring data and stress test reports in real time to quickly assess the capacity of your service.
Real-time monitoring: Tracks key metrics such as QPS, response time, and CPU utilization in real time. You can also enable service monitoring alerts to get a complete view of your service's health.
Multiple deployment methods: Supports deploying a service from an image (recommended) or with a processor, to meet the needs of different technology stacks.
Diverse inference modes
Real-time synchronous inference: Features high throughput and low latency, making it suitable for latency-sensitive scenarios like search recommendations and conversational bots.
Near real-time asynchronous inference: Includes a built-in message queue and is ideal for long-running tasks such as text-to-image generation and video processing. It supports elastic scaling based on the queue backlog to prevent requests from piling up.
Offline batch inference: Suitable for batch processing scenarios that are not sensitive to response times, such as batch conversion of voice data. It also supports preemptible instances to control costs.
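Scaling on queue backlog, as mentioned for asynchronous inference, can be sketched as a simple target-tracking rule. This is a minimal illustration under stated assumptions, not the EAS scaling algorithm; `desired_replicas` and its parameters are hypothetical names.

```python
import math


def desired_replicas(queue_length, target_per_replica, min_replicas, max_replicas):
    """Return a replica count that keeps the backlog per replica near a target.

    Assumed rule: replicas = ceil(backlog / target), clamped to [min, max].
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    needed = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    for backlog in (0, 45, 500):
        print(backlog, desired_replicas(backlog, 10, 1, 8))
```

The upper bound caps cost during a burst, and the lower bound keeps at least one replica ready so that new requests do not wait for a cold start.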
Procedure
Step 1: Prepare resources and files
Prepare inference resources: Select a suitable EAS resource type based on your model size, concurrency needs, and budget. For guidance on selecting, purchasing, and configuring resources, see Overview of EAS deployment resources.
Note: You can use public resources directly without purchasing them first. Other resource types, such as EAS resource groups and resource quotas, must be purchased before use.
Prepare files: Upload your trained model, processing code, and dependencies to a cloud storage service like Object Storage Service (OSS). You can then access these files from your service by using storage mounting.
Step 2: Deploy the service
You can deploy and manage services using the console, the EASCMD command-line tool, or an SDK. For more information, see Service deployment.
Console: Provides custom deployment and scenario-based deployment options, which are easy to use and suitable for beginners.
For models for which EAS does not provide a ready-made deployment solution (such as MinerU and other non-preset or third-party models), prepare the model files and related configuration files yourself, upload them to OSS, and then use the Custom Deployment option in the console to create the service.
EASCMD command-line tool: Supports operations such as service creation, updates, and viewing. It is suitable for algorithm developers who are familiar with EAS deployment.
SDK: Suitable for unified scheduling and operations at scale.
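Whichever method you use, an image-based deployment is ultimately described by a service configuration. The sketch below assembles such a configuration as JSON, including an OSS storage mount. The field names (`metadata`, `containers`, `storage`, `mount_path`, and so on) are assumptions modeled on typical EAS JSON examples, and all values are placeholders; check Service deployment for the authoritative schema before using it.

```python
import json


def build_service_config(name, image, command, port, oss_path, mount_path, replicas=1):
    """Assemble an illustrative service description as a plain dict.

    The key names are assumptions for illustration, not a verified schema.
    """
    if not oss_path.startswith("oss://"):
        raise ValueError("oss_path should look like oss://<bucket>/<prefix>/")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return {
        "metadata": {"name": name, "instance": replicas},
        "containers": [{"image": image, "script": command, "port": port}],
        # Files under oss_path become visible inside the container at mount_path.
        "storage": [{"oss": {"path": oss_path}, "mount_path": mount_path}],
    }


if __name__ == "__main__":
    config = build_service_config(
        name="demo_service",
        image="<your-image-uri>",
        command="python app.py",
        port=8000,
        oss_path="oss://<your-bucket>/models/",
        mount_path="/mnt/models/",
    )
    print(json.dumps(config, indent=2))
```

Keeping the configuration in a file under version control makes it easier to reproduce a deployment or review changes between versions.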
Step 3: Invoke and stress test the service
Web application: If you deploy your service as an AI web application, you can open the interactive page directly in a browser to test it.
API service: You can use the online debugging feature to verify service functionality, or make synchronous or asynchronous calls through the API. For more information, see Service invocation.
Service stress testing: Use the built-in one-click stress testing tool to evaluate your service's performance under load. For more information, see Service stress testing.
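A synchronous API call is an ordinary HTTP request. The sketch below uses only the Python standard library and assumes that the service accepts a JSON body and that its token is passed in the `Authorization` header. The endpoint, token, and payload shape are placeholders that depend on your service; take the real values from the service's invocation information and see Service invocation for details.

```python
import json
import urllib.request


def build_invoke_request(endpoint, token, payload):
    """Build a POST request carrying a JSON payload and the service token."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )


def invoke(endpoint, token, payload, timeout=10.0):
    """Send the request and return the raw response body as bytes."""
    request = build_invoke_request(endpoint, token, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Placeholder values: replace them with your service's endpoint and token.
    req = build_invoke_request("http://<your-endpoint>/<path>", "<your-token>", {"prompt": "hello"})
    print(req.get_method(), req.full_url)
```

Separating request construction from sending makes it possible to inspect or log the request without touching the network.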
Step 4: Monitor and manage the service
Monitoring and alerts: View the service status in the Inference Services list. Enable service monitoring alerts to track the health of your service in real time.
Elastic scaling: Configure elastic scaling or scheduled scaling policies based on your business requirements to dynamically manage compute resources.
Service updates: In the Actions column, click Update to deploy a new version. After the update is complete, you can view the version information or switch between versions.
Warning: Service updates cause a temporary interruption, and requests that depend on the service may fail. Proceed with caution.
Service migration: To migrate an EAS service configuration from one region to another (for example, from China (Beijing) to China (Shanghai)), use the EASCMD command-line tool to export the service configuration from the source region, and then use the exported configuration file to create a new service in the target region. This enables cross-region migration and remote deployment without manual reconfiguration.
Resource group filtering: In the Inference Services list, you can directly view the resource group used by each service task, allowing you to quickly locate or filter services running on a specific resource group (such as a Lingjun resource group).
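A scheduled scaling policy, as mentioned under Elastic scaling above, maps time windows to replica counts. The sketch below shows one way to model such a policy. It is a conceptual illustration, not the EAS policy format; `replicas_at` and the schedule layout are hypothetical.

```python
from datetime import time


def replicas_at(schedule, now, default):
    """Return the replica count for the window that contains `now`.

    schedule: list of (start, end, replicas) tuples with start < end,
    all within a single day. The first matching window wins.
    """
    for start, end, replicas in schedule:
        if start <= now < end:
            return replicas
    return default


if __name__ == "__main__":
    # Hypothetical business cycle: busy daytime, lighter evening, idle at night.
    schedule = [
        (time(9, 0), time(18, 0), 6),
        (time(18, 0), time(23, 0), 3),
    ]
    for moment in (time(2, 0), time(10, 30), time(18, 0)):
        print(moment, replicas_at(schedule, moment, default=1))
```

Scheduled policies suit predictable cycles, while metric-based elastic scaling handles traffic that the schedule does not anticipate.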
Important notes
If an EAS service remains in a non-Running state for 180 consecutive days, the system automatically deletes it.
For the regions that support EAS, see Regions and Availability Zones.
Billing
For more information, see Billing of Elastic Algorithm Service (EAS).
Quick start
Use cases
LLM: Deploy a large language model (LLM) | Deploy a Mixture-of-Experts (MoE) model by using expert parallelism and pipeline-data parallelism separation
AIGC: AI video generation - Deploy ComfyUI | AI art generation - Deploy SD-WebUI
RAG: Build an RAG-based LLM chatbot | Build a security-enhanced RAG application with an encrypted knowledge base
Other: Best practices for accessing a dedicated gateway across VPCs by using CEN | Best practices for PAI-EAS Spot Instances
FAQ
Q: How do I choose between dedicated resources and public resources?
Public resources: Suitable for development, testing, or small-scale applications that are cost-sensitive and can tolerate performance fluctuations. Public resources are low-cost but may experience resource contention during peak hours.
Dedicated resources: Suitable for production services that require high stability and performance. These resources are physically isolated, which eliminates the risk of resource contention. The elastic resource pool feature allows workloads to overflow to public resources when dedicated capacity is full, balancing cost and stability during peak hours. To reserve instance types that have limited inventory, you must purchase them as dedicated resources.
Q: What are the advantages of EAS over self-managed services?
EAS provides managed O&M. It automatically handles resource scheduling, fault recovery, and monitoring, and offers standardized features such as elastic scaling and canary release. This frees developers to focus on model development, reducing operational costs and accelerating time-to-market.
References
API documentation: API overview
FAQ: FAQ about EAS