A public cloud Standard Edition cluster runs in the cloud and consists of components such as Elastic Compute Service (ECS) instances and shared file systems. You manage the availability of cluster services yourself. This topic describes how to create a public cloud cluster in the console.
Background information
An E-HPC public cloud Standard Edition cluster includes the following components:
-
Control plane node: One ECS instance that runs the scheduler and domain account service to manage job scheduling and user information.
-
Compute nodes: Multiple ECS instances grouped into queues for management. These nodes support scaling and run jobs.
-
Logon node: One ECS instance with the Login component deployed and bound to an elastic IP address (EIP) for remote cluster access.
-
Shared storage: Supports mounting NAS and CPFS file systems to share data such as job data and software data.
-
When you create an E-HPC cluster, the system automatically creates resources such as ECS instances, which may incur fees. For more information, see Billing overview.
-
After creating an E-HPC cluster, do not adjust individual cluster nodes in the ECS console unless necessary. Use the E-HPC console instead.
For more information about E-HPC clusters, see Overview.
Prerequisites
-
You have created a service-linked role. The first time you log on to the E-HPC console, the system prompts you to create the E-HPC service-linked role.
-
You have created a virtual private cloud (VPC) and vSwitches. For more information, see Create a VPC and Create a vSwitch.
-
You have created storage resources. E-HPC clusters support mounting NAS and CPFS file systems. Choose as needed.
-
To mount NAS: Activate the NAS service and create a NAS file system and mount target. For more information, see Create a file system and Add a mount target.
-
To mount CPFS-NFS: Activate the CPFS service and create a CPFS file system, protocol service, and export directory. For more information, see Create a file system, Manage protocol services, and Manage export directories.
-
Create manually
Step 1: Open the Create Cluster page
Go to the Create Cluster page.
Step 2: Configure the cluster
On the Cluster Configuration page, configure the cluster network, type, scheduler, and other settings.
-
Basic settings
Configuration item
Description
Region
Select the region where the cluster resides.
Network and Availability Zone
Select the VPC and vSwitch for the cluster.
Note: Cluster nodes consume IP addresses from the selected vSwitch. Ensure the number of available IP addresses in the vSwitch exceeds the required number of nodes.
Security Group
Security groups control inbound and outbound traffic for the cluster and its nodes. Automatically created security groups include rules that enable communication between cluster nodes.
Select the type of automatically created security group as needed. For differences between basic and advanced security groups, see Basic security groups vs. advanced security groups.
-
Cluster type
This cluster type includes one control plane node and multiple compute nodes. Select a scheduler type and configure the control plane node.
Configuration item
Description
Series
Select Standard Edition.
Deployment Mode
Select Public cloud cluster.
Cluster Type
Select a scheduler type. Supported schedulers for HPC scenarios include Slurm and OpenPBS.
Management node
The control plane node is an ECS instance that runs the scheduler and domain account service. Select an appropriate configuration based on your business scenario and cluster scale.
-
Billing Method
Select how to pay for the control plane node. For billing details, see Instance billing.
-
Pay-as-you-go: Post-payment based on actual usage duration. Spot instances are not supported.
-
Subscription: Prepayment billed by week, month, or year.
-
-
Instance Specification
Select an appropriate control plane node type. Recommended configurations for different cluster scales are as follows:
-
If the number of compute nodes ≤ 100, use at least 16 vCPUs and 64 GiB memory.
-
If 100 < number of compute nodes ≤ 500, use at least 32 vCPUs and 128 GiB memory.
-
If the number of compute nodes > 500, use at least 64 vCPUs and 256 GiB memory.
-
-
Image
After selecting an image type, choose a specific image. Different images correspond to different operating systems. The system deploys cluster nodes based on your selected image.
Note: Custom images have the following limitations:
-
Only custom images created from Alibaba Cloud official images or imported CentOS images are supported. When importing an image, select Run validation after import. Otherwise, the E-HPC console cannot recognize the image.
-
Do not use custom images created from existing E-HPC cluster nodes. Doing so causes errors when creating compute nodes.
-
Do not modify the yum repository configuration in the custom image. Doing so prevents cluster creation or scaling.
-
The mount paths for NAS file systems (mounted using the mount command) in custom images must not include the /home or /opt directories.
-
-
Storage
Select the system disk type for the control plane node and whether to attach a data disk. For information about disk types and performance, see Disks overview.
-
Hyper-Threading
CPU hyper-threading is enabled by default. Disable it if your workload runs faster with one thread per physical core.
Note: After cluster creation, the control plane node automatically binds to the instance RAM role AliyunECSInstanceForEHPCRole. This role enables core features such as automatic scaling. Do not unbind or replace this role in the ECS console. To extend API call permissions, see E-HPC service role.
-
Custom Options
Configuration item
Description
Scheduler
Select the scheduler software to deploy based on your cluster type and the image configured for the control plane node.
Domain Account
Select the domain account service to deploy for the cluster.
Domain name resolution
Keep the default setting.
Specify a cluster post-processing script
Specify a script to process result data or perform other operations after compute tasks complete.
Maximum number of cluster nodes
Set the maximum number of nodes allowed in the cluster. This works together with maximum cores to control cluster size.
Maximum number of cores in the cluster
Set the maximum number of cores allowed in the cluster. This works together with maximum nodes to control cluster size.
Cluster Deletion Protection
Enable or disable deletion protection, which prevents accidental deletion. If it is enabled, you must disable it before you can release the cluster.
-
Resource group
Resource groups help organize resources. For more information, see Resource groups. By default, clusters belong to the default resource group. You can change this as needed.
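The sizing recommendations for the control plane node and the vSwitch address requirement from this step can be checked before you open the console. The following is a minimal sketch, not part of E-HPC: the function names are hypothetical, and the default of four reserved addresses per vSwitch is an assumption that you should verify against the VPC documentation. The capacity figure is only an upper bound because addresses already in use are not counted.

```python
import ipaddress


def recommended_control_plane(compute_nodes: int) -> tuple[int, int]:
    """Return (minimum vCPUs, minimum memory in GiB) for a cluster size.

    Thresholds follow the recommendations listed under Instance Specification.
    """
    if compute_nodes < 0:
        raise ValueError("compute_nodes must be non-negative")
    if compute_nodes <= 100:
        return (16, 64)
    if compute_nodes <= 500:
        return (32, 128)
    return (64, 256)


def vswitch_capacity(cidr: str, reserved: int = 4) -> int:
    """Upper bound on assignable addresses in a vSwitch CIDR block.

    `reserved` is an assumption (addresses held back by the platform).
    Addresses already assigned to other resources are not accounted for.
    """
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - reserved, 0)


def has_enough_addresses(cidr: str, required_nodes: int, reserved: int = 4) -> bool:
    """True if the capacity exceeds the number of nodes that need an address."""
    return vswitch_capacity(cidr, reserved) > required_nodes
```

For example, `recommended_control_plane(300)` returns `(32, 128)`. When you call `has_enough_addresses`, count every node that takes an address from the vSwitch, not only the compute nodes.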
Step 3: Configure compute nodes and queues
On the Compute Node and Queue page, configure queues.
Queues group compute nodes for management. You can specify a queue when running jobs. The cluster includes one default queue (comp). Click Add more queues to add more queues. Configure the following for each queue:
-
Basic settings
Configuration item
Description
Automatic queue scaling
Enable or disable Auto Scale. If enabled, you can further enable Auto Grow and Auto Shrink as needed.
With auto scaling enabled, the system automatically adds or removes compute nodes based on configuration and real-time load.
Queue Compute Nodes
Set the number of nodes in the queue.
-
If auto scaling is disabled, set the initial number of compute nodes.
-
If auto scaling is enabled, set the minimum and maximum number of nodes.
Important: If you set the minimum node count to a non-zero value, the queue retains that number of nodes during scale-in, even if the nodes are idle. Set this value carefully to avoid unnecessary resource waste and costs.
-
-
Select queue node configuration
If auto scaling is enabled or if the initial node count is greater than zero, configure the following so the system can create compute nodes.
Configuration item
Description
Inter-node interconnection
Select the network connectivity method between nodes.
-
VPC network: Nodes communicate over the VPC network.
-
eRDMA network: If nodes use ERI-supported instance types, they communicate over the eRDMA network.
Note: Only some instance types support ERI. For more information, see eRDMA overview and Enable eRDMA on enterprise-grade instances.
Use Preset Node Pool
Select a preset node pool. The system automatically assigns IP addresses and hostnames from unallocated nodes in the pool to create compute nodes.
Note: Using a preset node pool enables fast reuse of preallocated resources during scale-out. For more information, see Use preset node pools in clusters.
vSwitch
Select the vSwitch for the nodes. The system automatically assigns IP addresses from the available vSwitch CIDR block.
Instance type Group
Click Add an instance specification to select node specifications.
If auto scaling is disabled, you can add only one instance type. If enabled, you can add multiple instance types.
Important: You can select multiple vSwitches and instance types as alternatives to avoid creation failures due to inventory shortages. During node creation, the system attempts to create nodes starting from the first vSwitch's zone and in the order of specified instance types until it meets the required node count. The final instance types may vary based on inventory availability.
-
-
Auto scaling
Configuration item
Description
Scaling Policy
Select a scaling strategy. Only Supply Priority Strategy is supported. This strategy attempts to create compute nodes in the order of the configured vSwitches and zones.
Maximum number of single expansion nodes
Set the maximum number of nodes added or removed in each scale-out or scale-in cycle. The default value is 0, which means no limit.
If you have cost constraints, set this value to ensure scale-out stays within expected limits.
Hostname Prefix
Specify the starting characters for node hostnames to distinguish nodes.
Hostname Suffix
Specify the ending characters for node hostnames to distinguish nodes.
Host RAM role
Bind a RAM role to nodes so they can access Alibaba Cloud services.
We recommend using the default role AliyunECSInstanceForEHPCRole created by the system.
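The queue limits and the supply priority order described in this step can be illustrated with a small model. This is a sketch under stated assumptions, not the E-HPC implementation: the function names, the identifiers in the usage example, and the inventory dictionary are all hypothetical, and the loop order (each vSwitch in turn, then each instance type within it) is one reading of the behavior described above.

```python
def clamp_node_count(wanted, current, min_nodes, max_nodes, max_step=0):
    """Target node count for one scaling cycle.

    The queue never drops below min_nodes or grows beyond max_nodes.
    max_step limits how many nodes are added or removed per cycle;
    0 means no limit, matching the documented default.
    """
    target = max(min_nodes, min(wanted, max_nodes))
    if max_step > 0:
        low, high = current - max_step, current + max_step
        target = max(low, min(target, high))
    return target


def allocate(required, vswitches, instance_types, inventory):
    """Plan node creation by walking vSwitches, then instance types, in order.

    inventory maps (vswitch, instance_type) to a hypothetical available count.
    Returns a list of (vswitch, instance_type, count) tuples. The plan may
    cover fewer nodes than required if inventory runs out.
    """
    plan = []
    remaining = required
    for vswitch in vswitches:
        for instance_type in instance_types:
            if remaining == 0:
                return plan
            take = min(inventory.get((vswitch, instance_type), 0), remaining)
            if take > 0:
                plan.append((vswitch, instance_type, take))
                remaining -= take
    return plan
```

With a minimum of 2 nodes, `clamp_node_count(0, 5, 2, 10)` keeps 2 nodes even when no jobs are waiting, which is the cost effect that the Important note under Queue Compute Nodes warns about. The `allocate` result also shows why the final instance types can differ from the first choice when inventory is short.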
Step 4: Configure shared file storage
On the Shared File Storage page, configure storage.
By default, the control plane node mounts file systems to /home and /opt as shared storage directories. To mount file systems to other directories, click Add more storage and configure accordingly. For each directory, configure the following file system information:
The /home and /opt directories do not support mounting different file system directories.
| Configuration item | Description |
| --- | --- |
| Type | Select the file system type to mount. |
| File System | Select the file system ID and mount target. Ensure the file system has available mount target capacity. |
| File System Directory | Enter the file system directory to mount. |
| Mount Options | Select the mount protocol. |
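Because /home and /opt are the default shared storage directories, the custom image limitations in Step 2 require that NAS mount paths inside a custom image stay out of them. The following sketch shows one way to check a list of mount paths before you build an image. The function name is hypothetical, and treating any path at or below /home or /opt as a conflict is an assumption about how the limitation is applied.

```python
from pathlib import PurePosixPath

# Default shared storage directories named in this topic.
SHARED_DIRS = (PurePosixPath("/home"), PurePosixPath("/opt"))


def conflicts_with_shared_dirs(mount_path: str) -> bool:
    """True if mount_path is /home or /opt, or lies below one of them."""
    path = PurePosixPath(mount_path)
    if not path.is_absolute():
        raise ValueError(f"mount path must be absolute: {mount_path}")
    # Compare whole path components so that /homework does not match /home.
    return any(path == shared or shared in path.parents for shared in SHARED_DIRS)
```

For example, `conflicts_with_shared_dirs("/home/data")` is true, while `conflicts_with_shared_dirs("/mnt/nas")` is false.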
Step 5: Configure software and service components
On the Software and Service Component page, configure software and components.
-
Click Add software and select the software to install in the dialog box. E-HPC provides commonly used HPC software. Select as needed.
-
Click Add Service Component and select a component in the dialog box. Then configure component parameters.
Note: Only the Login component is currently supported.
Public cloud clusters include the Login component by default for public network remote access. Component parameters are as follows:
Configuration
Configuration item
Description
Login component custom parameters
SSH
Set the port, protocol, and allowed IP CIDR block for SSH connections to the cluster.
VNC
Set the port, protocol, and allowed IP CIDR block for VNC connections to the cluster.
CLIENT
Set the port, protocol, and allowed IP CIDR block for client connections to the cluster.
Component deployment resources
EIP instance
Bind an EIP to the ECS instance that runs the Login component for public network access. You can automatically create an EIP or select an existing one.
ECS instance
Set the instance type for the ECS instance that runs the Login component.
Note: After creation, the logon node automatically binds to the instance RAM role AliyunECSInstanceForEHPCRole. This role ensures features like Web Portal work properly. Do not unbind or replace this role in the ECS console. To extend API call permissions, see E-HPC service role.
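Each Login component entry (SSH, VNC, CLIENT) combines a port, a protocol, and an allowed IP CIDR block. The sketch below shows how such a rule could be sanity-checked before you enter it in the console. The `AccessRule` class and `validate_rule` function are hypothetical and are not an E-HPC API. The protocol field is carried through without validation because this topic does not list the accepted values.

```python
import ipaddress
from dataclasses import dataclass

KNOWN_SERVICES = {"SSH", "VNC", "CLIENT"}


@dataclass(frozen=True)
class AccessRule:
    service: str   # one of the Login component entries
    port: int
    protocol: str  # not validated here; accepted values are not documented
    cidr: str      # allowed source IP CIDR block


def validate_rule(rule: AccessRule) -> list[str]:
    """Return a list of problems found in the rule (empty if none)."""
    problems = []
    if rule.service not in KNOWN_SERVICES:
        problems.append(f"unknown service: {rule.service}")
    if not 1 <= rule.port <= 65535:
        problems.append(f"port out of range: {rule.port}")
    try:
        network = ipaddress.ip_network(rule.cidr, strict=False)
    except ValueError:
        problems.append(f"invalid CIDR block: {rule.cidr}")
    else:
        # A zero-length prefix exposes the logon node to every address.
        if network.prefixlen == 0:
            problems.append("CIDR block allows every address")
    return problems
```

For example, `validate_rule(AccessRule("SSH", 22, "TCP", "203.0.113.0/24"))` returns an empty list. Flagging a zero-length prefix is a design choice of this sketch: the logon node is bound to an EIP, so a narrow CIDR block limits who can reach it from the public network.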
Step 6: Confirm configuration
On the Confirm configuration page, review your settings and configure the cluster name and logon credentials.
| Configuration item | Description |
| --- | --- |
| Cluster Name | Enter a name. This name appears in the cluster list for easy identification. |
| Cluster Password-free | Allow root users to switch from the control plane node to compute nodes without a password. Important: Enabling this feature configures one-way passwordless logon from the control plane node to all compute nodes. Passwordless logon from compute nodes to the control plane node is not supported. Proceed with caution. |
| Login Credentials | Select credentials for cluster logon. Only Custom Password is supported. |
| Set Password, Confirm Password | Enter a password for cluster logon. All cluster nodes use this password as the root user logon password by default. |
After completing the configuration, read the Service Agreement, confirm the billing information, and click Create Cluster.
Create from template
E-HPC supports creating clusters quickly and in batches using templates. Templates define basic parameters for cluster creation. You can use E-HPC-provided templates or create custom templates.
Create a cluster from a public template
-
Go to the Cluster List page.
-
Log on to the E-HPC console.
-
In the left part of the top navigation bar, select a region.
-
In the left-side navigation pane, click Cluster.
-
-
On the Cluster List page, click Cluster Templates.
-
In the dialog box, select a template and click Create Cluster.
Templates are grouped as follows: The General-purpose clusters group includes SLURM and SLURM Serverless templates. The Big data clusters group includes the FastMR template. The Climate and weather clusters group includes the WRF template. Each template card includes a version selector and a Create Cluster button.
-
Review the configuration and enter information such as the cluster name.
-
Under Configuration Summary, the default template configuration appears. To modify settings, click Edit and update the relevant parameters.
-
Under Manage Settings, complete the configuration as prompted.
-
-
Read the Service Agreement, confirm the billing information, and click Create Cluster.
Create a cluster from a custom template
-
Create a custom template locally.
-
Go to the Cluster List page.
-
Log on to the E-HPC console.
-
In the left part of the top navigation bar, select a region.
-
In the left-side navigation pane, click Cluster.
-
-
On the Cluster List page, click Cluster Templates.
-
In the dialog box, click Import local template and upload your custom template file.
-
In the Edit Cluster Template dialog box, verify your custom template and click Confirm and Create.
-
On the Create Cluster page, verify the configuration and click Create Cluster.
References
After you create a cluster, you need to create a user to submit jobs. For detailed steps, see User Management and Job Overview.