Deploy a service with JSON configuration

In EAS, you can define and deploy an online inference service with a JSON configuration file.

Quick start

1. Prepare a JSON configuration file

To deploy a service, you need a JSON file that defines its configuration. If you are new to EAS, we recommend that you first configure parameters under Custom Model Deployment > Custom Deployment. The system automatically generates the corresponding JSON configuration, which you can then use as a template.

The following code is a sample service.json file. For a complete list of parameters, see Appendix: JSON parameters.

{
    "metadata": {
        "name": "demo",
        "instance": 1,
        "workspace_id": "your-workspace-id"
    },
    "cloud": {
        "computing": {
            "instances": [
                {
                    "type": "ecs.c7a.large"
                }
            ]
        }
    },
    "containers": [
        {
            "image": "eas-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-eas/python-inference:py39-ubuntu2004",
            "script": "python app.py",
            "port": 8000
        }
    ]
}
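The sample starts the container with python app.py and routes traffic to port 8000, but the topic does not show app.py itself. The following is a minimal sketch of such a file using only the Python standard library. It assumes that the image provides Python 3.7 or later and that clients send HTTP POST requests that EAS forwards to the container port; the echo reply is a placeholder for real inference logic.

```python
# app.py: a minimal HTTP server for the quick-start service.json (a sketch).
# Assumptions: Python 3.7+ in the image; requests arrive as HTTP POSTs on the
# container port. The echo reply stands in for real model inference.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PORT = 8000  # must match "port" in service.json; avoid 8080 and 9090


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        reply = b"echo: " + body  # placeholder "inference"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


def make_server(port: int = PORT) -> ThreadingHTTPServer:
    # Listen on all interfaces so that requests from outside the container arrive.
    return ThreadingHTTPServer(("0.0.0.0", port), InferenceHandler)


def main() -> None:
    make_server().serve_forever()
```

Saved as app.py, the file would end with `if __name__ == "__main__": main()` so that the script in service.json starts the server. GET requests are not handled in this sketch.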

2. Deploy the service with JSON

  1. Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service. Then, in the Custom Model Deployment section, select JSON Deployment.

  3. Paste your JSON configuration and click Deploy. A service status of Running indicates a successful deployment.

Appendix: JSON parameters

Parameter

Required

Description

metadata

Yes

The service's metadata. For more information, see metadata parameters.

cloud

No

The compute resource and VPC configurations. For more information, see cloud parameters.

containers

No

The image configuration. For more information, see containers parameters.

dockerAuth

No

This parameter is required to access a private repository that requires authentication. The value is the Base64-encoded string of username:password.

networking

No

The service invocation configuration. For more information, see networking parameters.

storage

No

Mounts data from storage services such as OSS or NAS into the container. For configuration details, see storage mount.

token

No

The access token for service authentication. If not specified, the system automatically generates one.

aimaster

No

Enables computing power check and fault tolerance for multi-node distributed inference services.

model_path

Conditional

Required when deploying a service with a processor. The model_path and processor_path parameters specify the input data source locations for the model and the processor, respectively. The following formats are supported:

  • OSS path: The URL can point to a specific file or a directory.

  • HTTP URL: The URL must point to a compressed archive, such as a TAR.GZ, TAR, BZ2, or ZIP file.

  • Local path: Supported only for local debugging with the test command.

oss_endpoint

No

The OSS endpoint, for example, oss-cn-beijing.aliyuncs.com. For other valid values, see Regions and endpoints.

Note

By default, you do not need to specify this parameter: the service uses the internal OSS endpoint of the current region to download model files and processor files. You must specify this parameter to access OSS across regions. For example, if you deploy a service in the Hangzhou region and set model_path to an OSS path in the Beijing region, set this parameter to the public OSS endpoint of the Beijing region.

model_entry

No

The model's entry file, which can be any file within the model package. If unspecified, it defaults to the filename from model_path. The path to this entry file is passed to the initialize() function of the processor.

model_config

No

The configuration for the model, which can be any text. This value is passed as the second argument to the processor's initialize() function.

processor

No

  • If using a pre-built processor, specify its code. For the codes of pre-built processors available in eascmd, see pre-built processors.

  • If using a custom processor, configure the processor_path, processor_entry, processor_mainclass, and processor_type parameters instead.

processor_path

No

The path to the processor package. For supported path formats, see the description of the model_path parameter.

processor_entry

No

The entry file of the processor, such as libprocessor.so or app.py. This file must implement the initialize() and process() functions required for inference.

This parameter is required if processor_type is set to cpp or python.

processor_mainclass

No

The main class of the processor in the JAR package. For example, com.aliyun.TestProcessor.

This parameter is required if processor_type is set to java.

processor_type

No

The implementation language of the processor. The valid values are as follows:

  • cpp

  • java

  • python

warm_up_data_path

No

The path to the request file used for model warm-up. For more information about this feature, see model warm-up.

runtime.enable_crash_block

No

Specifies whether to block the automatic restart of an instance that crashes because of a processor code exception. Valid values:

  • true: The instance does not restart automatically, which preserves the runtime environment for troubleshooting.

  • false (Default): The instance restarts automatically.

autoscaler

No

The configuration for horizontal auto scaling. For detailed parameter descriptions, see horizontal auto scaling.

labels

No

The labels to apply to the service. Use the key:value format.

unit.size

No

The number of machines per instance in a distributed inference configuration. The default value is 2.

sinker

No

Persists all service requests and responses to MaxCompute or Log Service (SLS). For detailed parameter descriptions, see sinker parameters.

confidential

No

Configures Trustee to ensure that information such as data, models, and code remains encrypted during service deployment and invocation. This enables a secure and encrypted inference service. The format is as follows:

"confidential": {
    "trustee_endpoint": "xxxx",
    "decryption_key": "xxxx"
}

  • trustee_endpoint: The URI of Trustee.

  • decryption_key: The KBS URI of the decryption key. For example, kbs:///default/key/test-key.

Note

The secure encryption environment primarily protects mounted storage files. Ensure that you have mounted these files before enabling this feature.

For more information, see secure and encrypted inference service.
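The table describes a two-function contract for processor_entry: initialize() receives the model entry path (model_entry) and the model_config text, and process() handles each request. The following Python sketch mirrors only that call order. The class name, the JSON format of model_config, and the (body, status) return value are hypothetical; this is not the EAS processor SDK, whose base classes and signatures may differ.

```python
# Hypothetical processor shape, NOT the EAS processor SDK. It only mirrors the
# contract in the table: initialize() runs once with the model entry path and
# the model_config text, then process() runs per request.
# Assumptions: model_config holds JSON; process() returns (body, status).
import json
from typing import Tuple


class ScaleProcessor:
    def initialize(self, model_entry: str, model_config: str) -> None:
        # A real processor would load the model from model_entry here.
        self.model_entry = model_entry
        config = json.loads(model_config) if model_config else {}
        self.scale = float(config.get("scale", 1.0))

    def process(self, data: bytes) -> Tuple[bytes, int]:
        # Stand-in "inference": multiply each number in a JSON array.
        try:
            values = json.loads(data)
            result = [v * self.scale for v in values]
        except (ValueError, TypeError) as exc:
            return str(exc).encode("utf-8"), 400
        return json.dumps(result).encode("utf-8"), 200
```

The split lets expensive work such as model loading happen once in initialize(), while process() stays a cheap per-request call.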

Metadata parameters

General parameters

Parameter

Required

Description

name

Yes

The name of the service. Must be unique within a region.

instance

Yes

The number of instances for the service.

workspace_id

No

The ID of the PAI workspace. If specified, this parameter restricts the service to the workspace. For example: 1405**.

cpu

No

The number of CPU cores required for each instance.

memory

No

The amount of memory required for each instance, in MB. The value must be an integer. For example, "memory": 4096 indicates that each instance requires 4 GB of memory.

gpu

No

The number of GPUs required for each instance.

gpu_memory

gpu_core_percentage

No

Enables GPU slicing, which allows multiple instances to share a single GPU. These parameters can be configured only with dedicated resource groups or resource quotas.

qos

No

Specifies the Quality of Service (QoS) for the instance. Valid values: BestEffort or omitted. When qos is set to BestEffort, the instance enters CPU sharing mode: instances are scheduled based only on GPU memory and system memory, not on the number of CPU cores on the node, and all instances on the node share the CPU resources. In this mode, the cpu parameter specifies the maximum CPU quota that a single instance can use.

resource

No

The ID of the resource group. The deployment policy is as follows:

  • If deployed in a public resource group, omit this parameter. The service is then billed on a pay-as-you-go basis.

  • If deployed in a dedicated resource group, set this parameter to the resource group ID. For example: eas-r-6dbzve8ip0xnzt****.

cuda

No

The CUDA version that the service requires. At runtime, the specified CUDA version is automatically mounted to the /usr/local/cuda directory of the instance.

Supported CUDA versions: 8.0, 9.0, 10.0, 10.1, 10.2, 11.0, 11.1, and 11.2. For example: "cuda":"11.2".

rdma

No

Specifies whether to enable RDMA networking for distributed inference. Set the value to 1 to enable RDMA networking. If omitted, this feature is disabled.

Note

Currently, RDMA networking is available only for services that are deployed using Lingjun intelligent computing resources.

enable_grpc

No

Specifies whether to enable gRPC connections for the service gateway. Valid values:

  • false (Default): Disables gRPC connections. The gateway supports HTTP requests by default.

  • true: Enables gRPC connections.

Note

If you deploy a service using a custom image with a gRPC-based server, you must set this parameter to switch the gateway protocol to gRPC.

enable_webservice

No

Specifies whether to enable a web server to deploy the service as an AI-Web application.

  • false (Default): The web server is not enabled.

  • true: The web server is enabled.

type

No

Set this parameter to LLMGatewayService to deploy an LLM intelligent router service. For more information, see Deploy an LLM intelligent router.
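Several rules in the general parameters table can be checked locally before you paste a configuration into the console. The following Python sketch encodes only rules stated above: name and instance are required, memory is an integer number of MB, cuda must be a listed version, and qos is BestEffort or omitted. Treating an empty qos string as omitted (as in the full sample at the end of this topic) is an assumption, and EAS still performs its own validation, including name uniqueness per region.

```python
# Local pre-flight check for a few general metadata rules (a sketch).
# EAS validates the configuration itself; this only catches obvious mistakes.
from typing import Any, Dict, List

SUPPORTED_CUDA = {"8.0", "9.0", "10.0", "10.1", "10.2", "11.0", "11.1", "11.2"}


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def check_metadata(meta: Dict[str, Any]) -> List[str]:
    problems = []
    if not isinstance(meta.get("name"), str) or not meta["name"]:
        problems.append("name is required")
    if not _is_int(meta.get("instance")):
        problems.append("instance is required and must be an integer")
    if "memory" in meta and not _is_int(meta["memory"]):
        problems.append("memory must be an integer number of MB, e.g. 4096 for 4 GB")
    if "cuda" in meta and meta["cuda"] not in SUPPORTED_CUDA:
        problems.append(f"unsupported cuda version: {meta['cuda']!r}")
    # Assumption: an empty string is treated like an omitted qos.
    if meta.get("qos", "") not in ("", "BestEffort"):
        problems.append("qos must be BestEffort or omitted")
    return problems
```

An empty list means no problems were found. Note that cuda is compared as a string, matching the "cuda":"11.2" form in the table.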

Advanced parameters

Important

Modify these advanced parameters with caution.

Parameter

Required

Description

rpc

batching

No

Enables server-side batching to accelerate GPU model inference. This feature is supported only in pre-built processor mode. Valid values:

  • false (Default): Disables server-side batching.

  • true: Enables server-side batching.

keepalive

No

The maximum processing time for a single request, in milliseconds. If the processing time exceeds this value, the server returns a 408 Timeout error and closes the connection. The default value is 600000 for dedicated gateways. This parameter is not supported for Application Load Balancer (ALB)-based dedicated gateways.

io_threads

No

The number of threads used to process network I/O requests in each instance. The default value is 4.

max_batch_size

No

The maximum size of each batch. The default value is 16. This parameter takes effect only when rpc.batching is set to true. This feature is supported only in pre-built processor mode.

max_batch_timeout

No

The maximum timeout period for each batch, in milliseconds. The default value is 50. This parameter takes effect only when rpc.batching is set to true. This feature is supported only in pre-built processor mode.

max_queue_size

No

The maximum length of the queue for an asynchronous inference service. The default value is 64. If the queue is full, the server returns a 450 error and closes the connection. This allows the client to retry on other instances and prevent server overload. For services with long response times (RTs), you can reduce the queue length to prevent requests from piling up and causing timeouts.

worker_threads

No

The number of threads in each instance that are used to concurrently process requests. The default value is 5. This feature is supported only in pre-built processor mode.

rate_limit

No

Enables QPS rate limiting and specifies the maximum QPS that an instance can process. The default value is 0, which indicates that QPS rate limiting is disabled.

For example, if you set this parameter to 2000, requests are rejected with a 429 (Too Many Requests) error when the QPS exceeds 2,000.

enable_sigterm

No

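Specifies whether to send a SIGTERM signal to the main process when a service instance enters the terminating state. Valid values: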

  • false (Default): The system does not send a SIGTERM signal when an instance enters the terminating state.

  • true: When a service instance enters the terminating state, the system immediately sends a SIGTERM signal to the main process. The process within the service must handle this signal to perform a custom graceful termination. If the signal is not handled, the main process may exit immediately, preventing a graceful termination.

rolling_strategy

max_surge

No

The maximum number of additional instances created beyond the desired count during a rolling update. The value can be a positive integer that indicates the number of instances, or a percentage, such as 2%. The default value is 2%. A larger value accelerates service updates.

For example, if the service instance count is 100 and you set this parameter to 20, 20 new instances are created immediately after the service update starts.

max_unavailable

No

The maximum number of unavailable instances during a rolling update. This parameter can free up resources for new instances during an update and prevent the update from stalling due to insufficient resources. The default value is 1 for dedicated resource groups and 0 for public resource groups.

For example, if you set this parameter to N, N instances are stopped immediately after the service update starts.

Note

If idle resources are sufficient, you can set this parameter to 0. A large value may affect service stability because the number of available instances decreases during the update, which increases the traffic load on a single instance. Balance service stability with resource availability when you configure this parameter.

eas.termination_grace_period

No

The graceful termination period of an instance, in seconds. The default value is 30.

EAS services use a rolling update strategy. An instance first enters the Terminating state, and the service routes traffic away from it. The instance then waits for the grace period (30 seconds by default) to finish processing the requests it has already received, and then exits. If requests take a long time to process, increase this value so that all in-flight requests complete during a service update.

Important

A smaller value may affect service stability, while a larger value may slow down service updates. Only change this parameter if necessary.

scheduling

spread.policy

No

The spread policy for scheduling service instances. The following policies are supported:

  • host: Spreads instances across different nodes.

  • zone: Spreads instances across different availability zones.

  • default: Schedules instances by using the system's default placement strategy.

Configuration example:

{
  "metadata": {
    "scheduling": {
      "spread": {
        "policy": "host"
      }
    }
  }
}

resource_rebalancing

No

Specifies whether to enable resource rebalancing. Valid values:

  • false (Default): This feature is disabled.

  • true: EAS periodically creates probe instances on high-priority resources. If a probe instance is scheduled successfully, EAS creates exponentially more probe instances until scheduling fails. When a successfully scheduled probe instance finishes initialization and becomes ready, it replaces an instance that runs on a lower-priority resource.

This feature helps resolve the following issues:

  • Prevents new instances from being temporarily scheduled to a public resource group during a rolling update. This can occur when terminating instances in a dedicated resource group have not yet freed their resources.

  • When using both spot and regular instances, the system periodically checks for available spot instances and migrates regular instances to them.

resource_burstable

No

Enables the elastic resource pool feature for an EAS service that is deployed in a dedicated resource group.

  • true: Enables the feature.

  • false: Disables the feature.

shm_size

No

The size of the shared memory for each instance, in GB. Shared memory allows direct read and write operations, eliminating the need for data copying or transfer.
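The max_surge and max_unavailable entries above include worked numbers (100 instances with max_surge set to 20). The following Python sketch reproduces that arithmetic and also accepts percentage strings such as "2%". How EAS rounds a percentage is not stated in this topic; the sketch assumes rounding up, as Kubernetes Deployments do for maxSurge.

```python
# Worked arithmetic for rolling_strategy (a sketch). Assumption: a percentage
# max_surge is resolved against the desired instance count and rounded up.
from typing import Tuple, Union


def surge_instances(max_surge: Union[int, str], desired: int) -> int:
    """Extra instances allowed above `desired` during a rolling update."""
    if isinstance(max_surge, str) and max_surge.endswith("%"):
        percent = int(max_surge[:-1])
        return (percent * desired + 99) // 100  # integer ceiling of percent% of desired
    return int(max_surge)


def update_bounds(desired: int, max_surge: Union[int, str],
                  max_unavailable: int) -> Tuple[int, int]:
    """(most instances that may exist at once, fewest instances kept available)."""
    peak = desired + surge_instances(max_surge, desired)
    floor = max(desired - max_unavailable, 0)
    return peak, floor
```

For the table's example, update_bounds(100, 20, 1) gives (120, 99): up to 120 instances may exist at once, and at least 99 stay available, which is why a larger max_surge needs more spare resources.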

Cloud parameters

Parameter

Required

Description

computing

instances

No

Specifies a list of instance types to use when deploying the service in a public resource group. If a bid for a spot instance fails or an instance type is out of stock, the system creates the service by using the next instance type in the list.

  • type: The instance type.

  • spot_price_limit: Optional.

    • If you specify this parameter, the instance type becomes a pay-as-you-go spot instance, and this value is its maximum price in CNY.

    • If you omit this parameter, a regular pay-as-you-go instance is created.

  • capacity: The maximum number of instances of this type to create. You can specify a number, such as 500, or a percentage in a string, such as "20%". After the capacity limit is reached, the system stops creating instances of this type, even if resources are available.

    For example, if the total number of instances for a service is 200 and you set the capacity of an instance type to 20%, the system launches a maximum of 40 instances of this type. The remaining instances are launched by using other specified instance types.

disable_spot_protection_period

No

Specifies whether to disable the protection period for a spot instance. This parameter applies only to spot instances. Valid values:

  • false (Default): The spot instance has a 1-hour protection period after it is created. During the protection period, the system does not reclaim the instance even if the market price exceeds your bid.

  • true: Disables the protection period. Instances without a protection period are typically about 10% cheaper than those with a protection period.

networking

vpc_id

No

The ID of the VPC.

vswitch_id

No

The ID of the VSwitch.

security_group_id

No

The ID of the security group.

destination_cidrs

No

If the CIDR block of the configured VSwitch conflicts with the EAS management CIDR blocks (10.224.0.0/16 or 10.240.0.0/12), you must explicitly set this parameter to the CIDR block of your VSwitch. Example:

"cloud": {
    "networking": {
        "destination_cidrs": "10.241.28.0/22"
    }
}

Replace 10.241.28.0/22 with the actual CIDR block of your VSwitch.

Example:

{
    "cloud": {
        "computing": {
            "instances": [
                {
                    "type": "ecs.c8i.2xlarge",
                    "spot_price_limit": 1
                },
                {
                    "type": "ecs.c8i.xlarge",
                    "capacity": "20%"
                }
            ],
            "disable_spot_protection_period": false
        },
        "networking": {
            "vpc_id": "vpc-bp1oll7xawovg9*****",
            "vswitch_id": "vsw-bp1jjgkw51nsca1e****",
            "security_group_id": "sg-bp1ej061cnyfn0b*****"
        }
    }
}
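The instances list is tried in order, and each type can be capped by capacity. The following Python sketch models that behavior to show how the 20% cap from the table plays out. It is a simplification: `available` is a hypothetical input standing for how many instances of a type can actually be created (for example, 0 after a failed spot bid), and the sketch assumes percentage capacities are rounded down.

```python
# Simplified model of how cloud.computing.instances is consumed (a sketch).
# Types are tried in list order; each is capped by its optional capacity.
# `available` is hypothetical: how many instances of a type can be created.
from typing import Dict, List, Optional, Union


def capacity_limit(capacity: Optional[Union[int, str]], total: int) -> int:
    if capacity is None:
        return total  # no capacity configured: the type may cover everything
    if isinstance(capacity, str) and capacity.endswith("%"):
        return int(capacity[:-1]) * total // 100  # assumed: round down
    return int(capacity)  # absolute count, given as 500 or "500"


def plan_instances(total: int, types: List[dict],
                   available: Dict[str, int]) -> Dict[str, int]:
    plan: Dict[str, int] = {}
    remaining = total
    for entry in types:
        if remaining == 0:
            break
        limit = capacity_limit(entry.get("capacity"), total)
        count = min(remaining, limit, available.get(entry["type"], 0))
        if count > 0:
            plan[entry["type"]] = count
            remaining -= count
    return plan  # may sum to less than total if every type runs out
```

With the example's two types and a total of 200, if only 150 spot instances of ecs.c8i.2xlarge can be created, ecs.c8i.xlarge is capped at 40 (20% of 200), so 10 instances stay unplaced in this model even if more ecs.c8i.xlarge stock exists.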

Container parameters

To deploy a service using a custom image, see Custom Images.

Parameter

Required

Description

image

Yes

The image address for the model service. Required when you deploy using an image.

env

name

No

The name of the environment variable.

value

No

The value of the environment variable.

command

You must specify either command or script.

The entry point command for the image. This parameter supports only a single command. For complex scripts, such as cd xxx && python app.py, use the script parameter. Use the command parameter if the image lacks the /bin/sh command.

script

The entry point script for the image. You can specify complex scripts with multiple lines. Separate commands with \n or a semicolon (;).

port

No

The container port.

Important

  • The EAS engine listens on fixed ports 8080 and 9090. To avoid port conflicts, ensure the container port is not 8080 or 9090.

  • This port must match the port that your application listens on, for example, the port configured in the xxx.py file started by command.

prepare

pythonRequirements

No

A list of Python requirements to install before the instance starts. The image must have the python and pip commands available in the system PATH. For example:

"prepare": {
  "pythonRequirements": [
    "numpy==1.16.4",
    "absl-py==0.11.0"
  ]
}

pythonRequirementsPath

No

The path to a requirements.txt file for installing Python packages before the instance starts. The image must have the python and pip commands available in the system PATH. This file can be included in the image or mounted from external storage. For example:

"prepare": {
  "pythonRequirementsPath": "/data_oss/requirements.txt"
}
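When a containers entry is generated by a script instead of written by hand, the rules above are easy to encode. The following Python sketch builds one entry, rejects the engine ports 8080 and 9090, requires a startup command, and joins multi-step startup logic into script with "\n". The function name, its arguments, and the choice to always emit script rather than command are illustrative assumptions.

```python
# Assembles one containers entry (a sketch). It checks two rules from the table:
# a startup command is required, and the port must not be 8080 or 9090.
# Multi-step startup logic goes into script, joined with "\n".
from typing import List, Optional

RESERVED_PORTS = {8080, 9090}  # used by the EAS engine


def container_entry(image: str, port: int, script_lines: List[str],
                    requirements: Optional[List[str]] = None) -> dict:
    if not script_lines:
        raise ValueError("specify either command or script")
    if port in RESERVED_PORTS:
        raise ValueError(f"port {port} conflicts with the EAS engine ports")
    entry = {"image": image, "script": "\n".join(script_lines), "port": port}
    if requirements:
        # Installed before the instance starts (needs python and pip in PATH).
        entry["prepare"] = {"pythonRequirements": requirements}
    return entry
```

Serializing the result with json.dumps writes the newline as the \n escape that the script parameter expects. Use command instead if the image lacks /bin/sh.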

Networking parameters

Parameter

Required

Description

gateway

No

The ID of the dedicated gateway for the EAS service. For example, gw-m2vkzbpixm7mo****.

gateway_policy

No

  • rate_limit: Sets the maximum number of requests per second (QPS) for global rate limiting.

    • enable: Set to true to enable rate limiting, or false to disable it.

    • limit: The maximum QPS.

      Note

      Services on a shared gateway default to 1,000 QPS per service and 10,000 QPS per server group. Dedicated gateways have no default value.

  • concurrency_limit: Sets the maximum number of concurrent requests for global concurrency control. This setting is not supported for ALB-based dedicated gateways.

    • enable: Set to true to enable concurrency control, or false to disable it.

    • limit: The maximum number of concurrent requests.

Example configuration:

{
    "networking": {
        "gateway_policy": {
            "rate_limit": {
                "enable": true,
                "limit": 100
            },
            "concurrency_limit": {
                "enable": true,
                "limit": 50
            }
        }
    }
}
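To make "maximum QPS" concrete, the following Python sketch counts requests in fixed one-second windows and rejects those beyond the limit. This is only a conceptual illustration under that assumption; the gateway's actual rate-limiting algorithm is not described in this topic. concurrency_limit is a different measure: it caps requests in flight at the same time, not requests per second.

```python
# Conceptual illustration of a maximum-QPS limit (an assumption, not the
# gateway's documented algorithm): a fixed one-second window that rejects
# requests beyond `limit`.
import time
from typing import Callable, Optional


class FixedWindowLimiter:
    def __init__(self, limit: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit          # maximum requests per second
        self.clock = clock          # injectable clock, handy for tests
        self.window: Optional[int] = None
        self.count = 0

    def allow(self) -> bool:
        window = int(self.clock())  # index of the current one-second window
        if window != self.window:
            self.window, self.count = window, 0  # new second: reset the counter
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # over the limit: the request would be rejected
```

With a limit of 100, as in the example configuration, the 101st request within the same second is rejected and counting restarts in the next second.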

Sinker parameters

Parameter

Required

Description

type

No

Specifies the destination storage service. Supported values:

  • maxcompute: MaxCompute.

  • sls: Log Service (SLS).

config

maxcompute.project

No

The MaxCompute project name.

maxcompute.table

No

The MaxCompute table name.

sls.project

No

The Log Service (SLS) project name.

sls.logstore

No

The Logstore name.

Example configurations:

Sink to MaxCompute

"sinker": {
    "type": "maxcompute",
    "config": {
        "maxcompute": {
            "project": "cl****",
            "table": "te****"
        }
    }
}

Sink to SLS

"sinker": {
    "type": "sls",
    "config": {
        "sls": {
            "project": "k8s-log-****",
            "logstore": "d****"
        }
    }
}
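In both examples, type is lowercase and the key under config matches it. The following Python sketch builds either form and rejects other spellings such as "MaxCompute". Treating both fields (project and table, or project and logstore) as needed for the chosen destination is an assumption of this sketch, and the project and Logstore names in the usage below are hypothetical.

```python
# Builds a sinker object (a sketch). The type value is lowercase and the nested
# config key matches it. Requiring both fields per destination is an assumption.
from typing import Dict

REQUIRED_FIELDS = {"maxcompute": ("project", "table"), "sls": ("project", "logstore")}


def sinker_config(kind: str, **fields: str) -> Dict[str, object]:
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unsupported sinker type: {kind!r}; use maxcompute or sls")
    missing = [name for name in REQUIRED_FIELDS[kind] if name not in fields]
    if missing:
        raise ValueError(f"missing {', '.join(missing)} for {kind}")
    wanted = REQUIRED_FIELDS[kind]
    return {"type": kind, "config": {kind: {name: fields[name] for name in wanted}}}
```

For example, sinker_config("sls", project="my-project", logstore="my-logstore") returns the same shape as the SLS example above.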

JSON configuration example

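The following sample combines most of the parameters described in this topic for reference. Replace the placeholder values and remove the parameters that you do not need before you deploy a service from it: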

{
  "token": "****M5Mjk0NDZhM2EwYzUzOGE0OGMx****",
  "processor": "tensorflow_cpu_1.12",
  "model_path": "oss://examplebucket/exampledir/",
  "oss_endpoint": "oss-cn-beijing.aliyuncs.com",
  "model_entry": "",
  "model_config": "",
  "processor_path": "",
  "processor_entry": "",
  "processor_mainclass": "",
  "processor_type": "",
  "warm_up_data_path": "",
  "runtime": {
    "enable_crash_block": false
  },
  "unit": {
    "size": 2
  },
  "sinker": {
    "type": "maxcompute",
    "config": {
      "maxcompute": {
        "project": "cl****",
        "table": "te****"
      }
    }
  },
  "cloud": {
    "computing": {
      "instances": [
        {
          "capacity": 800,
          "type": "dedicated_resource"
        },
        {
          "capacity": 200,
          "type": "ecs.c7.4xlarge",
          "spot_price_limit": 3.6
        }
      ],
      "disable_spot_protection_period": true
    },
    "networking": {
      "vpc_id": "vpc-bp1oll7xawovg9t8****",
      "vswitch_id": "vsw-bp1jjgkw51nsca1e****",
      "security_group_id": "sg-bp1ej061cnyfn0b****"
    }
  },
  "autoscaler": {
    "min": 2,
    "max": 5,
    "strategies": {
      "qps": 10
    }
  },
  "storage": [
    {
      "mount_path": "/data_oss",
      "oss": {
        "endpoint": "oss-cn-shanghai-internal.aliyuncs.com",
        "path": "oss://bucket/path/"
      }
    }
  ],
  "confidential": {
    "trustee_endpoint": "xx",
    "decryption_key": "xx"
  },
  "metadata": {
    "name": "test_eascmd",
    "resource": "eas-r-9lkbl2jvdm0puv****",
    "instance": 1,
    "workspace_id": "1405**",
    "gpu": 0,
    "cpu": 1,
    "memory": 2000,
    "gpu_memory": 10,
    "gpu_core_percentage": 10,
    "qos": "",
    "cuda": "11.2",
    "enable_grpc": false,
    "enable_webservice": false,
    "rdma": 1,
    "rpc": {
      "batching": false,
      "keepalive": 5000,
      "io_threads": 4,
      "max_batch_size": 16,
      "max_batch_timeout": 50,
      "max_queue_size": 64,
      "worker_threads": 5,
      "rate_limit": 0,
      "enable_sigterm": false
    },
    "rolling_strategy": {
      "max_surge": 1,
      "max_unavailable": 1
    },
    "eas.termination_grace_period": 30,
    "scheduling": {
      "spread": {
        "policy": "host"
      }
    },
    "resource_rebalancing": false,
    "shm_size": 100
  },
  "features": {
    "eas.aliyun.com/extra-ephemeral-storage": "100Gi",
    "eas.aliyun.com/gpu-driver-version": "tesla=550.127.08"
  },
  "networking": {
    "gateway": "gw-m2vkzbpixm7mo****"
  },
  "containers": [
    {
      "image": "registry-vpc.cn-shanghai.aliyuncs.com/xxx/yyy:zzz",
      "prepare": {
        "pythonRequirements": [
          "numpy==1.16.4",
          "absl-py==0.11.0"
        ]
      },
      "command": "python app.py",
      "port": 8000
    }
  ],
  "dockerAuth": "dGVzdGNhbzoxM*******"
}
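The dockerAuth value in the sample is the Base64-encoded string of username:password for a private image repository. The following Python sketch computes it with the standard library. Base64 is an encoding, not encryption, so treat the resulting value as a secret, just like the password itself.

```python
# Computes a dockerAuth value: the Base64 encoding of "username:password".
import base64


def docker_auth(username: str, password: str) -> str:
    raw = f"{username}:{password}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")
```

For example, docker_auth("user", "pass") returns dXNlcjpwYXNz, which you would place in the dockerAuth field.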