EAS (Elastic Algorithm Service) custom deployment provides a flexible and comprehensive way to host AI inference services, letting you package any algorithm or model into an online service.
We recommend that you first try scenario-based deployment for use cases like LLMs and ComfyUI. If that approach does not meet your needs, use custom deployment.
Quick start: Deploy a simple web service
This section shows how to quickly deploy a simple web service by using image-based deployment.
Step 1: Prepare the code file
Save the following Flask application code as an app.py file. Note that the service listens on port 8000.
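A minimal sketch of such an application is shown here. The single route at / and its response text are illustrative assumptions. The only details fixed by this guide are the file name app.py, the flask dependency, and port 8000.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    # Hypothetical response body, used only to confirm the service is reachable.
    return "Hello, EAS!"


if __name__ == "__main__":
    # Listen on all interfaces so traffic from outside the container can reach
    # the process; port 8000 matches the port named in this step.
    app.run(host="0.0.0.0", port=8000)
```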
Step 2: Upload the code to OSS
Upload the app.py file to an OSS bucket. Make sure that the OSS bucket and the EAS workspace are in the same region. For example, upload the file to the oss://examplebucket/code/ directory.
Step 3: Configure and deploy the service
Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).
On the Inference Service tab, click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
On the configuration page, configure the key parameters in the Environment Information and Resource Information sections as follows:
Deployment Method: Select Image-based Deployment.
Image Configuration: Select the Alibaba Cloud Image python-inference:3.9-ubuntu2004.
Storage Mounting: Mount the OSS directory that contains the app.py file to the /mnt/data/ path in the container.
    URI: Select the OSS directory where your code is located. In this example, the directory is oss://examplebucket/code/.
    Mount Path: Specify a local path in the container for the directory. In this example, the path is /mnt/data/.
Command: Because app.py has been mounted to the /mnt/data/ directory of the container, the startup command is: python /mnt/data/app.py.
Third-party Library Settings: The sample code depends on the flask library, which is not included in the selected official image. You can add flask to the Third-party Libraries. EAS automatically installs the library when the service starts.
Resource Configuration: Allocate appropriate compute resources for the service. For this simple example, a small CPU instance is sufficient.
    Resource Type: Public resources.
    Resource Specification: Select ecs.c7.large.
After you complete the configuration, click Deploy. When the service status changes to Running, you can invoke the service.
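As a rough illustration of invoking the running service, the sketch below builds an HTTP request with the Python standard library. It assumes the service uses token authentication with the token sent in the Authorization header. The endpoint and token values are placeholders that you copy from the service's invocation information in the console, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_request(endpoint, token, payload=None):
    """Build a request for the service; send a JSON body only if a payload is given."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Authorization": token}  # assumption: token-based authentication
    if data is not None:
        headers["Content-Type"] = "application/json"
    method = "GET" if data is None else "POST"
    return urllib.request.Request(endpoint, data=data, headers=headers, method=method)


def call_service(endpoint, token, payload=None, timeout=10):
    """Send the request and return the HTTP status code and response text."""
    request = build_request(endpoint, token, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholders: replace both values with the ones shown for your service.
    status, body = call_service("<service-endpoint>", "<service-token>")
    print(status, body)
```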
Additional configuration
Runtime environment
In the Environment Information section, configure the runtime environment and dependencies for your service.
Parameter | Description |
Image Configuration | The base runtime environment for the service. You can use official images provided by PAI, or use your own image by selecting Custom Image or entering an Image Address. For more information, see Custom images. Note: If the image includes a Web UI, select Enable Web App. EAS automatically starts the web server, allowing you to access the front-end page directly. |
Mount storage | Mount models, code, or data from cloud storage services like OSS and NAS to a local path in the container. This practice decouples code and data from the environment and simplifies independent updates. For more information, see storage mounting. |
Mount dataset | To manage versions of your models or data, you can use the Dataset feature for mounting. For more information, see Create and manage datasets. |
Command | Set the startup command for the image. In the quick start above, the command is python /mnt/data/app.py. |
Port Number | Set the service's listening port. The port is optional when the service does not receive traffic from the EAS gateway, for example when it subscribes to a message queue instead. Important: The EAS engine reserves ports 8080 and 9090. To avoid startup conflicts, do not use these ports when you deploy a new service. |
Third-party Library Settings | If you only need to install a few extra Python libraries, you can add the library names directly or specify a Path of requirements.txt to avoid rebuilding the image. |
Environment Variables | Set environment variables for the service instance as key-value pairs. |
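One way service code might consume the Environment Variables and Port Number settings is sketched below. The variable names APP_PORT and LOG_LEVEL are hypothetical and are not defined by EAS. The check simply mirrors the reserved ports 8080 and 9090 listed in the table.

```python
import os

# Ports that the table above says the EAS engine reserves.
RESERVED_PORTS = {8080, 9090}


def load_settings(env=None):
    """Read service settings from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    port = int(env.get("APP_PORT", "8000"))
    if port in RESERVED_PORTS:
        raise ValueError(f"port {port} is reserved by the EAS engine")
    return {"port": port, "log_level": env.get("LOG_LEVEL", "INFO")}


if __name__ == "__main__":
    print(load_settings())
```

Failing fast on a reserved port at startup makes the conflict visible in the service logs instead of surfacing later as an unexplained startup failure.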
Compute resources
In the Resource Information section, you can configure compute resources for your service, including selecting a resource type and instance specification, configuring the system disk, and setting the number of replicas and scheduling policies. For more information, see resource configuration.
Service networking
VPC: Configure a VPC if your service needs to access public network resources, access databases or message queues within a VPC, or if you want clients to call the service directly through a VPC. For more information, see Accessing public or internal network resources.
Service invocation: After deployment, choose an appropriate invocation method based on your business scenario, such as a gateway, NLB, Nacos, or gRPC. For more information, see service invocation.
Service security
In the Features section, configure authentication and data security for your service.
Parameter | Description |
Custom Authentication | If you do not want to use the system-generated token, you can customize the authentication token for service access. |
Configure Secure Encryption Environment | Integrate with the system trust management service to securely encrypt data, models, and code during deployment and invocation, enabling trusted inference. This feature primarily applies to mounted storage files. Enable this feature after you configure storage mounting. For more information, see Secure encrypted inference service. |
Instance RAM Role | Associating an Instance RAM Role with an instance allows the service's code to use Security Token Service (STS) temporary credentials to access other cloud resources. This method eliminates the need for fixed AccessKeys, reducing the risk of key leakage. For more information, see Configure an EAS RAM role. |
AI Safety Guardrail | This feature performs content safety checks on the input and output of LLM inference services to intercept harmful content. It supports OpenAI Completions (text-to-text), Chat Completions (text-to-text), and Image Generation (text-to-image) APIs. For more information, see Best practices for risk detection with safety guardrails. Note: This feature is available only in the China (Shanghai) region. |
Visibility | Controls the visibility of the service in the service list. |
Service stability
In the Features section, you can configure the following stability-related settings:
Service Response Timeout Period: Configure the timeout period for each request. The default value is 5 seconds. For time-consuming scenarios such as large model inference or streaming output, increase this value to prevent requests from being truncated.
Health Check: The system periodically probes instance liveness, automatically replacing abnormal instances to enable fault self-healing. For more information, see health check.
Compute monitoring & fault tolerance: Monitors the health status of compute resources for distributed inference services in real time, enabling automatic fault detection and intelligent self-healing. For more information, see Compute power monitoring and fault tolerance.
Deployment and update strategies: Use features like canary release, rolling updates, graceful shutdown, and update schedules to ensure uninterrupted service during service version upgrades. For more information, see Release management.
Service performance
Use storage acceleration and intelligent scheduling to improve service performance, accelerate startup speed, increase throughput, and reduce latency.
Distributed cache acceleration: Caches model or data files from mounted storage, such as OSS, to local instance storage to reduce I/O latency. For more information, see Model cache acceleration.
Model Weight Service (MoWS): Significantly improves scaling efficiency and service startup speed for large-scale instance deployments by caching model weights locally and sharing them across instances. For more information, see Model Weight Service.
LLM Intelligent Router: For LLM services with multiple backend instances, this feature dynamically distributes requests based on the backend load. This balances the compute power and GPU memory usage across instances and improves cluster resource utilization. For more information, see LLM intelligent router deployment.
Service observability
In the Features section, enable the following observability features to gain insights into service status and troubleshoot issues:
Parameter | Description |
Save Call Records | Persist all service request and response records in MaxCompute or Simple Log Service for auditing, analysis, or troubleshooting. |
Tracing Analysis | Some official images have a built-in collection component that allows you to enable tracing with a single click. For other images, you can integrate an ARMS agent through simple configuration for end-to-end monitoring of the service call chain. For more information, see Enable tracing for LLM applications in EAS. |
Asynchronous and elastic services
Asynchronous inference: For long-running inference scenarios such as AIGC and video processing, we recommend that you enable an Asynchronous Queue. The client receives an immediate response and can retrieve the final result by polling or using a callback. For more information, see Deploy an asynchronous inference service.
Elastic Job Service: In the Features section, enable Task Mode to deploy the inference service as an on-demand job service. This is suitable for batch data processing and scheduled tasks. Resources are automatically released after task completion to save costs. For more information, see Elastic Job Service.
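The polling approach mentioned for asynchronous inference can be sketched generically as follows. This is not the EAS queue API: fetch is a hypothetical callable that you supply, which returns None while the result is pending and the final result once it is ready.

```python
import time


def poll_for_result(fetch, interval=1.0, max_attempts=30, sleep=time.sleep):
    """Call fetch() until it returns a non-None result or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        # Wait before the next attempt, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"no result after {max_attempts} attempts")


if __name__ == "__main__":
    # Simulated backend: the result becomes available on the third poll.
    pending = iter([None, None, "done"])
    print(poll_for_result(lambda: next(pending), interval=0.1))
```

A callback avoids repeated polling requests but requires the client to expose an endpoint that the service can reach. Polling works from any client at the cost of extra requests.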
JSON configuration
In the Service Configuration section, you can view and directly edit the complete JSON configuration for the current UI settings.
For automation and fine-grained configuration, you can also define and deploy services directly by using a JSON file. For more information, see Deploy services by using JSON.
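To give a sense of what such a file could contain, the sketch below generates a JSON document that mirrors the quick start settings. The field names (metadata, cloud.computing.instance_type, containers, storage) are assumptions about the EAS JSON schema and should be checked against the JSON deployment reference before use. The service name and image address are placeholders.

```python
import json


def build_service_config(name, image, instance_type="ecs.c7.large"):
    """Assemble a service definition mirroring the quick start (field names assumed)."""
    return {
        "metadata": {"name": name, "instance": 1},
        "cloud": {"computing": {"instance_type": instance_type}},
        "containers": [
            {
                "image": image,  # placeholder: full address of the selected image
                "script": "python /mnt/data/app.py",
                "port": 8000,
            }
        ],
        "storage": [
            {
                "mount_path": "/mnt/data/",
                "oss": {"path": "oss://examplebucket/code/"},
            }
        ],
    }


if __name__ == "__main__":
    config = build_service_config("quickstart_demo", "<image-address>")
    print(json.dumps(config, indent=2))
```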