Create an E-HPC managed cluster to run HPC workloads. E-HPC provisions and maintains the management node, so you only configure compute nodes and job queues.
Creating a cluster provisions billable resources such as ECS instances. For details, see Billing overview.
Cluster architecture
A managed cluster has three components:
- Compute nodes: ECS instances organized into scalable queues. Node count adjusts based on workload demand.
- Logon node: A single ECS instance with the Login addon and an EIP for remote access.
- Shared file system: A NAS or CPFS file system shared across all nodes for job and application data.
Do not use the ECS console to manage nodes in an E-HPC cluster unless necessary. Use the E-HPC console instead.
Prerequisites
Make sure that you have the following:
- An E-HPC service-linked role. The role is created automatically the first time you log on to the E-HPC console.
- A VPC and a vSwitch. For details, see Create and manage a VPC and Create a vSwitch.
Procedure
Step 1: Open the Create Cluster page
Go to the Create Cluster page in the E-HPC console.
Step 2: Configure the cluster
On the Cluster Configuration step, configure network, cluster type, and scheduler settings.
Basic settings
| Parameter | Description |
|---|---|
| Region | Region where the cluster is created. |
| Network and Availability Zone | VPC and vSwitch for the cluster. Nodes use IP addresses from the vSwitch. Make sure the vSwitch has more available IP addresses than the number of cluster nodes. |
| Security group | Controls traffic for cluster nodes. Options: Automatically create a normal security group, Automatically create enterprise security groups, or Select Existing Security Group. Inter-node communication rules are created automatically. A normal (basic) security group supports up to 2,000 nodes; use an enterprise (advanced) security group for larger clusters. For details, see Basic security groups and advanced security groups. |
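The vSwitch sizing rule above can be checked before you start the wizard. The following is a minimal sketch using only the Python standard library. The CIDR block and node counts are hypothetical, and the count of four reserved addresses per vSwitch (the first address and the last three) is an assumption to verify against your VPC documentation.

```python
import ipaddress

# Assumption: four addresses per vSwitch CIDR block are reserved by the
# platform (the first address and the last three). Adjust if this differs.
RESERVED_ADDRESSES = 4


def usable_addresses(cidr: str) -> int:
    """Return how many addresses in a vSwitch CIDR block nodes can use."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - RESERVED_ADDRESSES, 0)


def has_room_for(cidr: str, planned_nodes: int, addresses_in_use: int = 0) -> bool:
    """Check that free addresses exceed the planned number of cluster nodes."""
    free = usable_addresses(cidr) - addresses_in_use
    return free > planned_nodes


if __name__ == "__main__":
    # Hypothetical vSwitch CIDR block and planned node count.
    print(usable_addresses("192.168.0.0/24"))
    print(has_room_for("192.168.0.0/24", planned_nodes=200))
```

The check uses a strict "greater than" comparison because the table asks for more available addresses than cluster nodes, not an equal number.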
Cluster type
E-HPC creates and maintains the management node. You only manage compute nodes.
| Parameter | Description |
|---|---|
| Series | Select Managed Edition. |
| Deployment Mode | Select Public cloud cluster. |
| Cluster Type | Select Slurm. This is the only supported option. |
Custom options
| Parameter | Description |
|---|---|
| Scheduler | Scheduler software to deploy. Only Slurm 22 is supported. |
| Domain Account | Domain account service for the cluster. Only NIS (Network Information Service) is supported for managed clusters. |
| Domain name resolution | Use the default value. |
| Maximum number of cluster nodes | Maximum number of nodes the cluster can contain. Works with Maximum number of cores in the cluster to control cluster size. |
| Maximum number of cores in the cluster | Maximum number of vCPUs available to compute nodes. Works with Maximum number of cluster nodes to control cluster size. |
| Cluster Deletion Protection | Prevents accidental cluster deletion. When enabled, the cluster cannot be released until you disable this setting. |
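To illustrate how the node limit and the core limit work together, the sketch below computes how many additional nodes fit under both caps at once. It is a simplified model, not the service's actual accounting: the `ClusterLimits` class, the function name, and the sample numbers are all hypothetical, and it assumes every new node has the same vCPU count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLimits:
    max_nodes: int  # "Maximum number of cluster nodes"
    max_cores: int  # "Maximum number of cores in the cluster" (vCPUs)


def nodes_allowed(limits: ClusterLimits, current_nodes: int, current_cores: int,
                  cores_per_node: int, requested: int) -> int:
    """Return how many of the requested nodes fit under both limits."""
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    by_nodes = limits.max_nodes - current_nodes
    by_cores = (limits.max_cores - current_cores) // cores_per_node
    # Whichever limit is reached first bounds the cluster size.
    return max(0, min(requested, by_nodes, by_cores))


if __name__ == "__main__":
    limits = ClusterLimits(max_nodes=100, max_cores=1000)
    # 10 nodes with 16 vCPUs each already exist; 80 more are requested.
    print(nodes_allowed(limits, current_nodes=10, current_cores=160,
                        cores_per_node=16, requested=80))
```

In this example the core limit is the binding one: 90 more nodes would fit under the node limit, but the remaining 840 vCPUs allow only 52 nodes of 16 vCPUs each.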
Resource group
Assign the cluster to a resource group. If you do not select one, the cluster is assigned to the default resource group. For details, see Resource groups.
Step 3: Configure compute nodes and queues
On the Compute Node and Queue step, set up queues and compute nodes.
Nodes are organized into queues, and you specify a target queue when you submit a job. Each cluster has a default queue named comp. To add more queues, click Add more queues.
Configure each queue:
Basic settings
| Parameter | Description |
|---|---|
| Automatic queue scaling | Enable automatic scaling. Select Auto Grow and/or Auto Shrink to add or remove compute nodes based on workload. |
| Queue Compute Nodes | Initial, maximum, and minimum node counts. Without auto-scaling: set only the initial count. With auto-scaling: set the minimum and maximum. |
Setting Minimal Nodes to a non-zero value retains that number of nodes during scale-in, even when idle. Set this value carefully to avoid unnecessary costs.
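The effect of the minimum node count on scale-in can be sketched as follows. This is an illustrative model of the behavior described above, not the scheduler's implementation. Both function names are hypothetical, and the node-hours helper assumes that retained nodes continue to be billed while idle.

```python
def nodes_after_scale_in(current_nodes: int, idle_nodes: int, min_nodes: int) -> int:
    """Return the node count after idle nodes are released, keeping the floor."""
    if min(current_nodes, idle_nodes, min_nodes) < 0:
        raise ValueError("counts must be non-negative")
    if idle_nodes > current_nodes:
        raise ValueError("idle_nodes cannot exceed current_nodes")
    # Only idle nodes above the configured minimum can be removed.
    removable = min(idle_nodes, max(current_nodes - min_nodes, 0))
    return current_nodes - removable


def idle_floor_node_hours(min_nodes: int, idle_hours: int) -> int:
    """Node-hours consumed by the retained floor while the queue is idle."""
    return min_nodes * idle_hours


if __name__ == "__main__":
    # A fully idle queue of 10 nodes shrinks to the minimum of 2.
    print(nodes_after_scale_in(current_nodes=10, idle_nodes=10, min_nodes=2))
    # Two retained nodes over an idle day.
    print(idle_floor_node_hours(min_nodes=2, idle_hours=24))
```

With the minimum set to 0, the same idle queue would shrink to zero nodes and accrue no idle node-hours.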
Queue node configuration
These settings are required when automatic scaling is enabled or the initial node count is greater than 0.
| Parameter | Description |
|---|---|
| Inter-node interconnection | Communication mode between compute nodes. Options: VPC Network (standard) or eRDMA Network (for instances that support elastic RDMA interfaces, or ERIs). For details, see eRDMA overview and Configure eRDMA on an enterprise-level instance. |
| Use Preset Node Pool | Reuse pre-allocated resources during scale-out. For details, see Use reserved node pools in clusters. |
| Virtual Switch | vSwitch for compute nodes. The system assigns IP addresses from the vSwitch CIDR block. |
| Instance type Group | Click Add Instance to select instance types. Without auto-scaling: one instance type. With auto-scaling: multiple instance types. |
Specify multiple vSwitches and instance types as fallbacks for inventory shortages. The system attempts to create nodes in the order of specified instance types and zones. The first vSwitch determines the initial zone.
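The fallback behavior can be pictured as an ordered walk over zone and instance type combinations. The sketch below is only one possible reading: it assumes that all instance types are tried within the zone of the first vSwitch before the next zone is considered, which the text above does not confirm. The zone IDs, instance type names, and the `has_stock` callback are hypothetical.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple

Candidate = Tuple[str, str]  # (zone, instance_type)


def creation_attempts(zones: Iterable[str], instance_types: Iterable[str]):
    """Yield (zone, instance_type) pairs in the configured order.

    Assumption: zones form the outer loop, so the zone of the first
    vSwitch is exhausted before falling back to the next one.
    """
    return product(list(zones), list(instance_types))


def first_available(zones, instance_types,
                    has_stock: Callable[[str, str], bool]) -> Optional[Candidate]:
    """Return the first candidate with inventory, or None if all are sold out."""
    for zone, instance_type in creation_attempts(zones, instance_types):
        if has_stock(zone, instance_type):
            return zone, instance_type
    return None


if __name__ == "__main__":
    zones = ["zone-a", "zone-b"]          # zones of the configured vSwitches
    types = ["type-large", "type-small"]  # configured instance types
    in_stock = {("zone-b", "type-small")}
    print(first_available(zones, types, lambda z, t: (z, t) in in_stock))
```

Listing more than one vSwitch and instance type lengthens this candidate list, which is what makes a scale-out more likely to succeed during an inventory shortage.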
Auto scale
Available when automatic scaling is enabled.
| Parameter | Description |
|---|---|
| Scaling Policy | Only Supply Priority Strategy is supported. Nodes are created in specified zones in the order of configured vSwitches. |
| Maximum number of single expansion nodes | Maximum number of nodes to add or remove in a single scaling cycle. Default: 99. Lower this value to limit how quickly compute costs can grow. |
| Prefix of Hostnames | Hostname prefix that distinguishes nodes in different queues. |
| Hostname Suffix | Hostname suffix that distinguishes nodes in different queues. |
| Instance RAM role | RAM role granting nodes access to Alibaba Cloud services. Default: AliyunECSInstanceForEHPCRole (recommended). |
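The per-cycle limit determines how many scaling cycles a large scale-out needs. The sketch below splits a requested node count into per-cycle batches. It assumes that each cycle adds as many nodes as the limit permits, which is a simplification, and the function name is hypothetical.

```python
DEFAULT_MAX_NODES_PER_CYCLE = 99  # default value from the table above


def scale_out_batches(nodes_needed: int,
                      per_cycle: int = DEFAULT_MAX_NODES_PER_CYCLE) -> list:
    """Split a scale-out request into the batch sizes of successive cycles."""
    if nodes_needed < 0:
        raise ValueError("nodes_needed must be non-negative")
    if per_cycle <= 0:
        raise ValueError("per_cycle must be positive")
    full_cycles, remainder = divmod(nodes_needed, per_cycle)
    return [per_cycle] * full_cycles + ([remainder] if remainder else [])


if __name__ == "__main__":
    # 250 nodes with the default limit take three cycles.
    print(scale_out_batches(250))
    # A lower limit spreads the same request over more cycles.
    print(len(scale_out_batches(250, per_cycle=25)))
```

A smaller limit slows the rate at which nodes, and therefore costs, are added, at the price of a longer ramp-up for large jobs.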
Step 4: Configure shared file storage
On the Shared File Storage step, configure the file system shared across cluster nodes.
By default, the file system mounts to /home and /opt on the management node. Click Add more storage to mount additional directories.
You cannot mount different file system directories to /home and /opt.
| Parameter | Description |
|---|---|
| Type | File system type: General-purpose NAS, Extreme NAS, or Parallel file CPFS. |
| File System | ID and mount point of the file system. Make sure the file system has sufficient mount points. |
| File System Directory | Directory of the file system to mount. |
| Mount Options | Mount protocol settings. |
Step 5: Configure software and addons
On the Software and Service Component step, install software and configure addons.
- Click Add software. In the dialog box, select the HPC applications to install.
- Click Add Service Component. In the dialog box, select and configure an addon.
Only the Login addon is supported. It is enabled by default and allows access to the cluster over the internet.
The Login addon has the following parameters:
| Category | Parameter | Description |
|---|---|---|
| Custom parameters | SSH | Port number, protocol, and allowed CIDR blocks for SSH connections. |
| Custom parameters | VNC | Port number, protocol, and allowed CIDR blocks for VNC connections. |
| Custom parameters | Web Portal | Port number, protocol, and allowed CIDR blocks for client connections. |
| Addon deployment resources | EIP | EIP bound to the Login addon ECS instance for internet access. Select an existing EIP or create a new one. |
| Addon deployment resources | ECS Instance | Instance type for the ECS instance that runs the Login addon. |
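The allowed CIDR blocks act as an allowlist for client addresses. The following sketch shows how to test whether a given client IP address falls inside a planned allowlist before you enter it in the console. It uses only the Python standard library, and the addresses are documentation-range examples rather than values from this guide.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any of the allowed CIDR blocks."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in allowed_cidrs
    )


if __name__ == "__main__":
    # Hypothetical allowlist for SSH: one office network.
    ssh_allowlist = ["203.0.113.0/24"]
    print(is_allowed("203.0.113.10", ssh_allowlist))
    print(is_allowed("198.51.100.7", ssh_allowlist))
```

An entry of 0.0.0.0/0 matches every IPv4 address, so a narrower block limits exposure of the SSH, VNC, and Web Portal ports.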
Step 6: Confirm and create
On the Confirm configuration step, review settings and set a cluster name and credentials.
| Parameter | Description |
|---|---|
| Cluster Name | Display name on the Clusters page. |
| Login Credentials | Only Custom Password is supported. |
| Set Password and Repeat Password | Root user password for all cluster nodes. |
Accept the service agreement, confirm fees, and click Create Cluster.
What's next
After the cluster is created, create a cluster user to submit jobs. For details, see Manage users and Job overview.
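Because the cluster runs Slurm, a job submission names its target queue. The sketch below builds an `sbatch` command line for the default comp queue. It assumes that each E-HPC queue is exposed as a Slurm partition with the same name. The script name `job.sh` and the helper function are hypothetical, and the code only assembles the command without running it.

```python
import shlex
from typing import List, Optional


def build_sbatch_command(script: str, queue: str = "comp", nodes: int = 1,
                         job_name: Optional[str] = None) -> List[str]:
    """Assemble an sbatch command that targets one queue (Slurm partition)."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    command = ["sbatch", f"--partition={queue}", f"--nodes={nodes}"]
    if job_name:
        command.append(f"--job-name={job_name}")
    command.append(script)
    return command


if __name__ == "__main__":
    # Print the command so it can be reviewed or pasted on the logon node.
    print(shlex.join(build_sbatch_command("job.sh", nodes=2, job_name="demo")))
```

Run the resulting command as the cluster user on the logon node. The `--partition`, `--nodes`, and `--job-name` flags are standard `sbatch` options.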