Manage Lingjun clusters and nodes

A Lingjun cluster is a collection of high-performance Lingjun compute nodes equipped with the Lingjun optimization suite. Each Lingjun node is a GPU compute server used to deploy heterogeneous computing services. This topic describes how to manage Lingjun clusters and Lingjun nodes, such as viewing cluster information and node details, and scaling out a Lingjun cluster.

Manage Lingjun clusters

A Lingjun cluster can be in one of the following states:

  • Initialization failed: To view details about the failed task, see O&M Task Center.

  • Initializing: The system configures the Lingjun network and initializes the Lingjun compute nodes.

  • Running: You can scale out or scale in a cluster, or reinstall or restart a node only when the cluster is in the Running state.

    Important

    You can run scale-out, scale-in, reinstall, and restart tasks in parallel if they involve different Lingjun compute nodes.
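
The rules above can be sketched as a small pre-flight check: a batch of tasks may start together only if the cluster is in the Running state and no node appears in more than one task. This is an illustrative model only, not a Lingjun API. The names `Task` and `can_run_in_parallel` and the state string constant are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

# Hypothetical names for illustration only; not part of any Lingjun API.
RUNNING = "Running"


@dataclass(frozen=True)
class Task:
    kind: str                 # "scale-out", "scale-in", "reinstall", or "restart"
    node_ids: FrozenSet[str]  # IDs of the nodes that the task involves


def can_run_in_parallel(cluster_state: str, tasks: Iterable[Task]) -> bool:
    """Return True if all tasks may run at the same time.

    Models two rules from the text: the cluster must be Running, and
    no node may be involved in more than one task.
    """
    if cluster_state != RUNNING:
        return False
    seen = set()
    for task in tasks:
        if seen & task.node_ids:   # a node is already claimed by an earlier task
            return False
        seen |= task.node_ids
    return True
```

For example, a restart of node `n1` and a reinstall of node `n2` pass the check, whereas a restart of `n1` combined with a scale-in that also removes `n1` does not.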

Cluster information

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources & Nodes > Cluster Management.

  3. Click Details next to the cluster ID to go to the Cluster Details page.

    1. View basic information about the cluster, such as the cluster name, number of Node Groups, and creation information.

    2. View detailed cluster information on the Node Group, Monitoring and Alerting, Basic Metrics, RDMA, and GPU tabs.

Scale out

Note

When you scale out a cluster, you must install a Cloud Parallel File Storage (CPFS) client on each new GPU node and add the node to the CPFS cluster.

When you scale out a cluster, you must also add tags to the new nodes.

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources & Nodes > Cluster Management.

  3. Click Scale out next to the target cluster ID.

    1. In the Original Group Details section, click Scale out next to the name of the target Node Group.

    2. In the dialog box that appears, enter the node name prefix and the logon password, and then confirm the password.

    3. Select the check boxes next to unused node instances or purchase new nodes, and then click Yes.

  4. In the Detailed configurations for scale-out section, click Confirm Submission.

  5. Return to the Cluster Management page. The cluster state changes to Scaling out. Wait for the scale-out process to complete.
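
If you track the cluster state from a script instead of refreshing the console, the final wait can be sketched as a polling loop. This is a minimal sketch under assumptions: `get_state` is a hypothetical callable that you supply to return the current cluster state as a string, and the sketch assumes that the cluster returns to the Running state when the scale-out completes. No specific Lingjun API is implied.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],          # hypothetical: returns the current cluster state
    target: str = "Running",               # assumed state after the task completes
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll get_state() until it returns target or the timeout is reached.

    The timeout is measured as accumulated sleep time, not wall-clock time,
    which keeps the function easy to test with a fake sleep.
    """
    waited = 0.0
    while True:
        if get_state() == target:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

The same loop applies to a scale-in. A return value of False means only that the target state was not observed in time, so check the task details before you retry.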

Scale in

Warning
  • Scaling in a cluster reinstalls the removed nodes, which erases all their local data. Back up the data on these nodes before you proceed.

  • When you scale in a cluster, you must remove the nodes being scaled in from the Cloud Parallel File Storage (CPFS) cluster.

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources & Nodes > Cluster Management.

  3. Click Scale-in next to the cluster ID.

    1. In the Original Group Details section, select the check boxes next to the nodes that you want to remove, and then click Batch Remove from Cluster.

    2. In the section that displays the detailed configurations for the scale-in, click Confirm Submission.

  4. On the Confirm scale-in page, enter DELETE in the text box and click Yes to scale in the cluster.

  5. Return to the Cluster Management page. The cluster state changes to Scaling in. Wait for the scale-in process to complete.

Delete cluster

Important
  • Before you delete a cluster, you must scale in the cluster and remove all nodes from it.

  • Deleting a cluster does not delete the associated CPFS cluster.

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources & Nodes > Cluster Management.

  3. Click the Cluster ID/Name of the cluster that you want to delete. On the Cluster Details page that appears, click Delete in the upper-right corner.

  4. In the dialog box that appears, click OK to delete the cluster.

Create node group

You can create a Node Group for a Lingjun cluster in one of the following ways:

  • Create a Node Group when you create the cluster. For more information, see Configure clusters and Node Groups.

  • Create a Node Group for an existing cluster.

    1. Log on to the Intelligent Computing Lingjun console.

    2. In the left-side navigation pane, choose Resources & Nodes > Cluster Management.

    3. Click the Cluster ID/Name of the target cluster.

    4. Click the Node Group tab.

    5. Click Create Group. Enter information for the Node Group, such as its name and default instance type.

    6. (Optional) After creating a Node Group, you can edit its name or delete the group.

Manage Lingjun nodes

Important

A Lingjun compute node can run only one of the following operations at a time: cluster scale-out, cluster scale-in, node reinstall, or node restart.

Purchase node

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources & Nodes > Node Management.

  3. On the Node Management page, click Purchase Node to go to the purchase page.

  4. Follow the on-screen instructions to purchase a new node.

Node details

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources & Nodes > Node Management to open the Node Management page.

  3. Click the All tab to view all nodes.

    • You can view basic information about the nodes, such as Node ID/Name, Image Name, and Zone.

    • To search for nodes, select a filter criterion from the drop-down list, such as Image Name, Zone, or IP Address, and then enter a search term in the text box.

  4. Click the Unused tab to view unused nodes. You can view their basic information, such as Node Model and Resource Group.

Log on

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources & Nodes > Node Management.

  3. In the Actions column, click Remote Logon next to the target node ID.

Reinstall node

Important
  • Reinstalling a node deletes all data on it. Proceed with caution.

  • You can reinstall a node only when the Lingjun cluster is in the Running state.

  • When you reinstall a node, you must first remove it from the Cloud Parallel File Storage (CPFS) cluster. After the reinstallation, add the node's new information to the CPFS cluster.

You may need to reinstall a node in the following scenarios:

  • You want to redeploy your workloads.

  • You want to change the OS version.

  • You need to perform O&M on the node.

Procedure:

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources & Nodes > Node Management.

  3. On the Node Management page, click Reinstall next to the instance ID. In the dialog box that appears, select an image version, modify the node name, enter and confirm the root password, and then click Reinstall.
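
The prerequisites for a reinstall can be collected into a simple checklist before you start the procedure. The following sketch is illustrative only: the function name `reinstall_blockers` and its parameters are hypothetical, and you must determine the input values yourself, for example from the console. It does not call any Lingjun or CPFS API.

```python
from typing import List


def reinstall_blockers(
    cluster_state: str,          # current state of the Lingjun cluster
    node_busy: bool,             # True if the node is already running another operation
    node_in_cpfs_cluster: bool,  # True if the node is still part of the CPFS cluster
) -> List[str]:
    """Return the reasons why a reinstall should not start yet (empty if none)."""
    blockers = []
    if cluster_state != "Running":
        blockers.append("cluster is not in the Running state")
    if node_busy:
        blockers.append("node is already running another operation")
    if node_in_cpfs_cluster:
        blockers.append("node has not been removed from the CPFS cluster")
    return blockers
```

An empty list means that the documented prerequisites are met. After the reinstallation, you still need to add the node's new information to the CPFS cluster.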

Restart node

Important
  • Restarting a node can disrupt your services.

  • You can restart a node only when the Lingjun cluster is in the Running state.

You may need to restart a node in the following scenarios:

  • You want to deploy a new application or service.

  • You want to apply system configuration changes.

  • You need to perform O&M on the node.

Procedure:

  1. Log on to the Intelligent Computing Lingjun console.

  2. In the left-side navigation pane, choose Resources & Nodes > Node Management.

  3. On the Node Management page, click Restart next to the instance ID.