Cloud-native AI Suite O&M guide

The Cloud-native AI Suite lets you install AI/ML components on an ACK Pro cluster, monitor GPU resource usage across multiple views, and allocate resources fairly across teams — all from AI Dashboard or the ACK console.

This guide walks through three core tasks: installing the suite, viewing resource dashboards, and managing users and quotas with capacity scheduling.

Prerequisites

Before you begin, ensure that you have:

  • An ACK Pro cluster running Kubernetes 1.18 or later

  • Monitoring Agents and Simple Log Service enabled when the cluster was created (set on the Component Configurations page of the cluster creation wizard). For details, see Create an ACK Pro cluster.

Key concepts

Security boundary

Kubernetes is a single-tenant orchestrator: a single control plane instance is shared across all tenants in a cluster. The cluster itself is the only construct that provides a hard security boundary. Quota trees, user groups, and namespaces provide organizational guardrails and logical isolation — they do not provide the same security guarantees as cluster-level separation. Design your multi-tenant architecture with this constraint in mind.

How capacity scheduling works

Each quota node in a quota tree has two parameters:

  • Min: the guaranteed minimum resources the namespace can always claim, even when the cluster is under pressure

  • Max: the maximum resources the namespace can use when idle capacity is available

When a namespace is idle, other namespaces can borrow its unused capacity up to their Max. When the owning namespace needs its guaranteed minimum back, the scheduler reclaims resources from borrowing namespaces, considering workload priority, availability, and creation time.
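The borrowing rule above can be sketched as a single calculation. This is a minimal illustration under stated assumptions, not the scheduler's implementation: the function name `usable_cores`, the `idle_cores` input, and the sample numbers are all hypothetical.

```python
def usable_cores(min_cores: int, max_cores: int, idle_cores: int) -> int:
    """Cores a namespace may hold under the Min/Max rule (illustrative only).

    idle_cores is a hypothetical input: capacity currently unused by
    other namespaces that this namespace could borrow.
    """
    # Min is always claimable; idle capacity is borrowed on top, capped at Max.
    return min(max_cores, min_cores + max(0, idle_cores))


# A namespace with Min=10 and Max=20 under three cluster conditions:
print(usable_cores(10, 20, idle_cores=30))  # plenty of idle capacity
print(usable_cores(10, 20, idle_cores=5))   # partial borrowing
print(usable_cores(10, 20, idle_cores=0))   # cluster under pressure
```

The three calls print 20, 15, and 10: the namespace never drops below its Min and never exceeds its Max, whatever the idle capacity is.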

Resource objects

Object | Role
Quota tree | Hierarchical structure defining resource allocation across the organization
Quota node | A node in the tree; each leaf node maps to one or more namespaces
User group | The smallest allocation unit; maps to a leaf quota node
User | Holds a Kubernetes service account used to submit jobs and log in to the console
Namespace | Kubernetes namespace bound to a leaf quota node

User roles

Role | Permissions
admin | Log in to AI Dashboard, manage cluster components, includes all researcher permissions
researcher | Submit jobs, use cluster resources, log in to AI Developer Console

Step 1: Install the Cloud-native AI Suite

The Cloud-native AI Suite consists of six component categories. Install only the components your workloads require. AI Dashboard and AI Developer Console are installed separately and require additional RAM permissions.

Component category | Purpose | Install separately?
Task elasticity | Scale AI workloads dynamically | No
Data acceleration | Speed up dataset access | No
AI task scheduling | Schedule AI workloads with capacity-aware policies | No
AI task lifecycle management | Manage training job lifecycles | No
AI Dashboard | Monitor GPU resources and quotas | Yes (requires RAM permissions)
AI Developer Console | Submit and manage AI jobs | Yes (requires RAM permissions)

Deploy the suite

  1. Log in to the ACK console and click Clusters in the left navigation pane.

  2. Click the cluster name, then choose Applications > Cloud-native AI Suite in the left pane.

  3. Click Deploy.

  4. Select the components to install, then click Deploy Cloud-native AI Suite at the bottom of the page. The system checks environment dependencies before deploying the selected components. After installation, the Components list shows component names and versions. From this list you can Deploy, Upgrade, or Uninstall individual components.

  5. After installing ack-ai-dashboard and ack-ai-dev-console, links to AI Dashboard and AI Developer Console appear in the upper-left corner of the Cloud-native AI Suite page.

Configure AI Dashboard

Important

Alibaba Cloud rolled out the AI Console (AI Dashboard and AI Developer Console) through a whitelist mechanism starting January 22, 2025. If you deployed AI Dashboard or AI Developer Console before that date, your deployment is unaffected. If you are not whitelisted, install and configure the AI Console through the open-source community — see Open-source AI Console.

Grant RAM permissions to the worker role

Before AI Dashboard can access cluster data, attach a custom RAM policy to the cluster's worker role.

  1. Create a custom policy.

    1. Log in to the RAM console and choose Permissions > Policies in the left navigation pane.

    2. Click Create Policy, select the JSON tab, and add the following policy:

      {
          "Version": "1",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "cs:*",
                      "log:GetProject",
                      "log:GetLogStore",
                      "log:GetConfig",
                      "log:GetMachineGroup",
                      "log:GetAppliedMachineGroups",
                      "log:GetAppliedConfigs",
                      "log:GetIndex",
                      "log:GetSavedSearch",
                      "log:GetDashboard",
                      "log:GetJob",
                      "ecs:DescribeInstances",
                      "ecs:DescribeSpotPriceHistory",
                      "ecs:DescribePrice",
                      "eci:DescribeContainerGroups",
                      "eci:DescribeContainerGroupPrice",
                      "log:GetLogStoreLogs",
                      "ims:CreateApplication",
                      "ims:UpdateApplication",
                      "ims:GetApplication",
                      "ims:ListApplications",
                      "ims:DeleteApplication",
                      "ims:CreateAppSecret",
                      "ims:GetAppSecret",
                      "ims:ListAppSecretIds",
                      "ims:ListUsers"
                  ],
                  "Resource": "*"
              }
          ]
      }
    3. Name the policy using the format k8sWorkerRolePolicy-{ClusterID} and click OK.

  2. Attach the policy to the cluster's worker role.

    1. In the RAM console, choose Identities > Roles and search for the role in the format KubernetesWorkerRole-{ClusterID}.

    2. Click Grant Permission for the role.

    3. In the Select Policy section, click Custom Policy, search for the policy you created (k8sWorkerRolePolicy-{ClusterID}), select it, and click OK.
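Before pasting a policy into the RAM console, it can help to check that the JSON parses and to see which services it touches. The sketch below is an illustration only: it uses an abbreviated copy of the policy above, and the helper names `actions_by_service` and `policy_name` as well as the cluster ID are hypothetical.

```python
import json
from collections import defaultdict

# Abbreviated copy of the policy above; paste the full document in practice.
POLICY_TEXT = """
{
  "Version": "1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["cs:*", "log:GetProject", "ecs:DescribeInstances", "ims:ListUsers"],
      "Resource": "*"
    }
  ]
}
"""


def actions_by_service(policy_text: str) -> dict[str, list[str]]:
    """Group each 'service:Operation' action under its service prefix."""
    policy = json.loads(policy_text)  # raises ValueError on malformed JSON
    grouped = defaultdict(list)
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):  # tolerate a bare string instead of a list
            actions = [actions]
        for action in actions:
            service, _, operation = action.partition(":")
            grouped[service].append(operation)
    return dict(grouped)


def policy_name(cluster_id: str) -> str:
    """Policy name in the k8sWorkerRolePolicy-{ClusterID} format used above."""
    return f"k8sWorkerRolePolicy-{cluster_id}"


grouped = actions_by_service(POLICY_TEXT)
wildcards = [service for service, ops in grouped.items() if "*" in ops]
print(grouped)
print("wildcard services:", wildcards)
print(policy_name("c1234example"))  # hypothetical cluster ID
```

With the abbreviated sample, the only service granted a wildcard is `cs`, which matches the `cs:*` entry in the policy.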

Complete the AI Dashboard setup

  1. On the Cloud-native AI Suite page, select Sample Console in the Interaction Mode section. The Note dialog box appears.

    • If Authorized is shown, continue with step 2.

    • If Unauthorized is shown in red and OK is dimmed, complete the RAM permissions above, then click Authorization Check. After authorization succeeds, Authorized is displayed.

  2. Set the Console Data Storage parameter. Select Pre-installed MySQL for testing or ApsaraDB RDS for production. For details, see Install and configure AI Dashboard and AI Developer Console.

  3. Click Deploy Cloud-native AI Suite. AI Dashboard is ready when its status changes to Ready.

(Optional) Create and accelerate a dataset

Algorithm developers can mount datasets from Object Storage Service (OSS) as persistent volumes (PVs) and accelerate access through AI Dashboard.

Create a PV and PVC

  1. Create a namespace:

    kubectl create ns demo-ns
  2. Create a file named fashion-mnist.yaml with the following content:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: fashion-demo-pv
    spec:
      accessModes:
      - ReadWriteMany
      capacity:
        storage: 10Gi
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeAttributes:
          bucket: fashion-mnist
          otherOpts: "-o max_stat_cache_size=0 -o allow_other"
          url: oss-cn-beijing.aliyuncs.com
          akId: "AKID"
          akSecret: "AKSECRET"
        volumeHandle: fashion-demo-pv
      persistentVolumeReclaimPolicy: Retain
      storageClassName: oss
      volumeMode: Filesystem
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: fashion-demo-pvc
      namespace: demo-ns
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
      selector:
        matchLabels:
          alicloud-pvname: fashion-demo-pv
      storageClassName: oss
      volumeMode: Filesystem
      volumeName: fashion-demo-pv

    Replace the following placeholders:

    Placeholder | Description | Example
    fashion-mnist | OSS bucket name | my-dataset-bucket
    oss-cn-beijing.aliyuncs.com | OSS endpoint for the bucket's region | oss-cn-hangzhou.aliyuncs.com
    AKID | AccessKey ID | LTAI5tXxx
    AKSECRET | AccessKey secret | xXxXxXx

  3. Apply the manifest:

    kubectl create -f fashion-mnist.yaml
  4. Verify the PV and PVC are bound:

    kubectl get pv fashion-demo-pv
    kubectl get pvc fashion-demo-pvc -n demo-ns

    Expected output for the PV:

    NAME              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS   AGE
    fashion-demo-pv   10Gi       RWX            Retain           Bound    demo-ns/fashion-demo-pvc     oss            8h

    Expected output for the PVC:

    NAME               STATUS   VOLUME            CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    fashion-demo-pvc   Bound    fashion-demo-pv   10Gi       RWX            oss            8h
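The Bound check in step 4 can also be scripted. The sketch below parses the tabular output shown above, assuming columns are separated by two or more spaces; the `parse_table` helper is hypothetical and the sample text is copied from the expected PV output.

```python
import re

# Sample copied from the expected PV output above.
PV_OUTPUT = """\
NAME              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                      STORAGECLASS   AGE
fashion-demo-pv   10Gi       RWX            Retain           Bound    demo-ns/fashion-demo-pvc   oss            8h
"""


def parse_table(text: str) -> list[dict[str, str]]:
    """Parse kubectl-style columns, assuming 2+ spaces between columns."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Splitting on runs of spaces keeps headers such as "ACCESS MODES" intact.
    header = re.split(r"\s{2,}", lines[0].strip())
    return [dict(zip(header, re.split(r"\s{2,}", line.strip()))) for line in lines[1:]]


rows = parse_table(PV_OUTPUT)
unbound = [row["NAME"] for row in rows if row["STATUS"] != "Bound"]
print(rows[0]["STATUS"], rows[0]["CLAIM"])
print("not bound:", unbound)
```

An empty cell shifts every later column in this kind of parsing, so for anything beyond a quick check the structured output of `kubectl get -o json` is likely the more robust input.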

Accelerate the dataset

  1. Log in to AI Dashboard as an administrator.

  2. Choose Dataset > Dataset List in the left navigation pane.

  3. Find the dataset (fashion-demo-pvc) and click Accelerate in the Operator column.

Step 2: View resource dashboards

AI Dashboard provides four dashboard views. Each view answers a different question about GPU resource health.

Important

The AI Console whitelist restriction applies here as well. If you are not whitelisted, access dashboards through the open-source community — see Open-source AI Console.

Cluster dashboard

After logging in, AI Dashboard opens the cluster dashboard by default. It shows cluster-wide GPU health and allocation.

Metric | What it shows
GPU Summary Of Cluster | Total GPU nodes, allocated GPU nodes, unhealthy GPU nodes
Total GPU Nodes | Total number of GPU-accelerated nodes
Unhealthy GPU Nodes | Number of GPU nodes with detected issues
GPU Memory (Used/Total) | Ratio of GPU memory in use to total GPU memory
GPU Memory (Allocated/Total) | Ratio of allocated GPU memory to total GPU memory
GPU Utilization | Average GPU utilization across the cluster
GPUs (Allocated/Total) | Ratio of allocated GPUs to total GPUs
Training Job Summary Of Cluster | Count of training jobs by status: Running, Pending, Succeeded, Failed

GPU Utilization shows whether the GPU executed any work during the sample window; it does not indicate how efficiently the hardware was used. A node showing 100% GPU Utilization may be running lightweight kernels rather than heavy parallel workloads. Pair this metric with GPU Memory (Used/Total) to get a fuller picture of actual compute load.
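As a rough illustration of pairing the two metrics, the sketch below flags a GPU that reports high utilization while using little memory. The function names and both thresholds (90% utilization, 20% memory) are hypothetical values chosen for the example, not recommendations from AI Dashboard.

```python
def ratio(part: float, total: float) -> float:
    """Return part/total, or 0.0 when total is zero."""
    return part / total if total else 0.0


def busy_but_light(utilization_pct: float, mem_used_gib: float, mem_total_gib: float,
                   util_floor: float = 90.0, mem_ceiling: float = 0.2) -> bool:
    """Flag high GPU Utilization paired with low memory use (hypothetical thresholds)."""
    return utilization_pct >= util_floor and ratio(mem_used_gib, mem_total_gib) <= mem_ceiling


print(busy_but_light(100.0, 2.0, 32.0))   # busy GPU, little memory in use
print(busy_but_light(100.0, 28.0, 32.0))  # busy GPU, memory mostly in use
```

The first call prints True and the second prints False. A True result is only a hint to look closer, since low memory use alone does not prove the workload is lightweight.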

Node dashboard

Click Nodes in the upper-right corner of the Cluster page to open the node dashboard. It breaks down GPU metrics per node and per GPU device.

Metric | What it shows
GPU Node Details | Per-node table: name, IP, role, GPU mode (exclusive or shared), GPU count, total GPU memory, allocated GPUs, allocated GPU memory, used GPU memory, average GPU utilization
GPU Duty Cycle | Utilization per GPU device per node
GPU Memory Usage | Memory used per GPU device per node
GPU Memory Usage Percentage | Memory usage percentage per GPU per node
Allocated GPUs Per Node | Number of allocated GPUs per node
GPU Number Per Node | Total GPUs per node
Total GPU Memory Per Node | Total GPU memory per node

Training job dashboard

Click TrainingJobs in the upper-right corner of the Nodes page to open the training job dashboard. It shows resource consumption and GPU efficiency per job.

Metric | What it shows
Training Jobs | Per-job table: namespace, name, type, status, duration, requested GPUs, requested GPU memory, used GPU memory, average GPU utilization
Job Instance Used GPU Memory | GPU memory used per job instance
Job Instance Used GPU Memory Percentage | Percentage of GPU memory used per job instance
Job Instance GPU Duty Cycle | GPU utilization per job instance

Resource quota dashboard

Click Quota in the upper-right corner of the Training Jobs page to open the quota dashboard. It displays quota consumption by resource type (CPU, memory, nvidia.com/gpu, aliyun.com/gpu-mem, aliyun.com/gpu).

Column | What it shows
Elastic Quota Name | Name of the quota group
Namespace | Namespace the quota applies to
Resource Name | Resource type
Max Quota | Maximum resources available in the namespace
Min Quota | Guaranteed minimum resources, honored when the cluster is under pressure
Used Quota | Resources currently in use in the namespace
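One way to read a row of this dashboard is to derive how much of the usage is borrowed and how much headroom remains. The sketch below assumes the three quota columns have been read into numbers; the function name `quota_status` and the state labels are hypothetical.

```python
def quota_status(min_quota: float, max_quota: float, used_quota: float) -> dict:
    """Interpret one dashboard row from its Min, Max, and Used values."""
    borrowed = max(0.0, used_quota - min_quota)   # usage above the guarantee
    headroom = max(0.0, max_quota - used_quota)   # room left before Max
    if used_quota <= min_quota:
        state = "within guarantee"
    elif used_quota < max_quota:
        state = "borrowing"
    else:
        state = "at max"
    return {"state": state, "borrowed": borrowed, "headroom": headroom}


print(quota_status(min_quota=10, max_quota=20, used_quota=15))
```

For the sample row, 5 units are borrowed and 5 remain before Max. Borrowed usage is the part that the scheduler may reclaim when another namespace needs its guaranteed minimum.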

Step 3: Manage users and quotas

Cloud-native AI Suite uses a quota tree to enforce hierarchical resource limits and enable resource sharing across teams.

[Figure: concept relationships]

The organizational structure maps to the quota tree:

[Figure: organization chart mapped to quota tree nodes]

Each department or project team corresponds to a quota tree branch. Leaf nodes map to namespaces. Setting Min and Max at each level lets teams share idle resources while guaranteeing each team's minimum allocation.

Quota trees and namespace-level controls provide organizational guardrails, not hard security boundaries. If your threat model requires strong tenant isolation, use separate clusters. For more context, see the security boundary note in Key concepts.

Set up a quota tree

  1. Create namespaces for each team. If namespaces already exist, make sure they contain no running pods before associating them with a quota node.

    kubectl create ns namespace1
    kubectl create ns namespace2
    kubectl create ns namespace3
    kubectl create ns namespace4
  2. In AI Dashboard, create quota nodes and associate each leaf node with a namespace. Set Min (guaranteed minimum) and Max (maximum when the cluster has idle capacity) for each node.
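The check for running pods in step 1 can be scripted before a namespace is associated with a quota node. This is a minimal sketch that shells out to kubectl and assumes kubectl is installed and configured for the cluster; the helper names are hypothetical.

```python
import subprocess


def running_pods_command(namespace: str) -> list[str]:
    """Build a kubectl command that lists Running pods in a namespace."""
    return [
        "kubectl", "get", "pods",
        "--namespace", namespace,
        "--field-selector", "status.phase=Running",
        "--output", "name",
    ]


def has_running_pods(namespace: str) -> bool:
    """Return True if kubectl reports at least one Running pod (needs cluster access)."""
    result = subprocess.run(
        running_pods_command(namespace), capture_output=True, text=True, check=True
    )
    return bool(result.stdout.strip())


# Printing the command does not require a cluster; has_running_pods() does.
print(" ".join(running_pods_command("namespace1")))
```

Calling `has_running_pods("namespace1")` against a live cluster returns False when the namespace is safe to associate under this check.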

Create users and user groups

A user can belong to multiple user groups. A user group can contain multiple users. Associate users with user groups to grant access to the resources allocated to that group.

  1. Create a user. See Generate the kubeconfig file and logon token for a new user.

  2. Create a user group. See Add a user group.

Capacity scheduling example

This example shows how the scheduler shares and reclaims CPU resources across four namespaces. The quota tree has this structure:

root
  root.a
    root.a.1 (namespace1)
    root.a.2 (namespace2)
  root.b
    root.b.1 (namespace3)
    root.b.2 (namespace4)

Quota configuration:

Quota node | Min (CPU cores) | Max (CPU cores)
root | 40 | 40
root.a | 20 | 40
root.b | 20 | 40
root.a.1 | 10 | 20
root.a.2 | 10 | 20
root.b.1 | 10 | 20
root.b.2 | 10 | 20
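The quota configuration can be modeled as a small tree and checked for internal consistency. The sketch below is illustrative: the `QuotaNode` class and `check` function are hypothetical, and the three rules it applies (Min does not exceed Max, children's Min values sum to at most the parent's Min, a child's Max does not exceed the parent's Max) are assumptions that happen to hold for this example rather than rules stated in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class QuotaNode:
    name: str
    min_cores: int
    max_cores: int
    children: list["QuotaNode"] = field(default_factory=list)


def check(node: QuotaNode) -> list[str]:
    """Return a list of consistency problems under the assumed rules."""
    problems = []
    if node.min_cores > node.max_cores:
        problems.append(f"{node.name}: Min {node.min_cores} exceeds Max {node.max_cores}")
    child_min = sum(child.min_cores for child in node.children)
    if child_min > node.min_cores:
        problems.append(f"{node.name}: children Min total {child_min} exceeds Min {node.min_cores}")
    for child in node.children:
        if child.max_cores > node.max_cores:
            problems.append(f"{child.name}: Max {child.max_cores} exceeds parent Max {node.max_cores}")
        problems.extend(check(child))  # recurse into the subtree
    return problems


# The tree from the quota configuration table.
root = QuotaNode("root", 40, 40, [
    QuotaNode("root.a", 20, 40, [QuotaNode("root.a.1", 10, 20), QuotaNode("root.a.2", 10, 20)]),
    QuotaNode("root.b", 20, 40, [QuotaNode("root.b.1", 10, 20), QuotaNode("root.b.2", 10, 20)]),
])
print(check(root))

# A deliberately inconsistent tree for comparison.
broken = QuotaNode("x", 10, 20, [QuotaNode("x.1", 15, 30)])
print(check(broken))
```

The example tree produces an empty list, while the deliberately broken tree reports two problems.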

Walk-through:

Without elastic quotas, each leaf namespace can only use its Min (10 cores, so 2 pods at 5 cores/pod). With elastic quotas enabled and 40 cores available in the cluster, namespaces can borrow idle capacity up to their Max.

Step 1: Deploy five pods to namespace1, each requesting 5 CPU cores (25 cores total).

With root.a.1 Max set to 20 cores, 4 pods run (20 cores). The fifth pod stays Pending.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx1
  namespace: namespace1
  labels:
    app: nginx1
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx1
  template:
    metadata:
      name: nginx1
      labels:
        app: nginx1
    spec:
      containers:
      - name: nginx1
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5

Step 2: Deploy five pods to namespace2, each requesting 5 CPU cores.

With 20 cores remaining (40 total minus 20 used by namespace1), 4 pods run. The fifth stays Pending. Together, namespace1 and namespace2 now consume all 40 root cores.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx2
  namespace: namespace2
  labels:
    app: nginx2
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx2
  template:
    metadata:
      name: nginx2
      labels:
        app: nginx2
    spec:
      containers:
      - name: nginx2
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5

Step 3: Deploy five pods to namespace3, each requesting 5 CPU cores.

The cluster has no idle capacity. The scheduler reclaims 10 cores from root.a to guarantee root.b.1's minimum. It reclaims 5 cores from root.a.1 (reducing namespace1 from 4 running pods to 3) and 5 cores from root.a.2 (reducing namespace2 from 4 running pods to 3). With 10 reclaimed cores, 2 pods run in namespace3.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx3
  namespace: namespace3
  labels:
    app: nginx3
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx3
  template:
    metadata:
      name: nginx3
      labels:
        app: nginx3
    spec:
      containers:
      - name: nginx3
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5

Step 4: Deploy five pods to namespace4, each requesting 5 CPU cores.

The scheduler reclaims another 10 cores from root.a to guarantee root.b.2's minimum: 5 cores from root.a.1 and 5 from root.a.2. After reclamation, namespace1 and namespace2 each have 2 running pods (10 cores), and namespace4 gets 2 running pods.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx4
  namespace: namespace4
  labels:
    app: nginx4
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx4
  template:
    metadata:
      name: nginx4
      labels:
        app: nginx4
    spec:
      containers:
      - name: nginx4
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5

Result:

Namespace | Quota node | Running pods | CPU cores in use
namespace1 | root.a.1 | 2 | 10
namespace2 | root.a.2 | 2 | 10
namespace3 | root.b.1 | 2 | 10
namespace4 | root.b.2 | 2 | 10

Each team's minimum guarantee was honored. Teams that borrowed idle capacity had resources reclaimed when other teams needed their guaranteed minimums back.
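The walk-through can be replayed with a toy model. The sketch below is not the scheduler's algorithm: it treats the four leaf quotas as a flat list, ignores workload priority and creation time, and uses a hypothetical tie-break that evicts one pod at a time from whichever namespace has borrowed the most. Under those assumptions it reproduces the pod counts described in each step.

```python
from dataclasses import dataclass

POD_CORES = 5       # each pod in the walk-through requests 5 CPU cores
CLUSTER_CORES = 40  # root Max in the quota configuration


@dataclass
class Leaf:
    namespace: str
    min_cores: int
    max_cores: int
    running: int = 0  # pods currently running

    @property
    def used(self) -> int:
        return self.running * POD_CORES

    @property
    def borrowed(self) -> int:
        return max(0, self.used - self.min_cores)


def deploy(leaves: list[Leaf], target: Leaf, replicas: int) -> None:
    """Try to start `replicas` pods in `target`; pods that do not fit stay Pending."""
    for _ in range(replicas):
        if target.used + POD_CORES > target.max_cores:
            break  # Max reached
        free = CLUSTER_CORES - sum(leaf.used for leaf in leaves)
        if free < POD_CORES:
            # No idle capacity: reclaim only while the target is below its Min.
            if target.used + POD_CORES > target.min_cores:
                break
            donors = [leaf for leaf in leaves
                      if leaf is not target and leaf.borrowed >= POD_CORES]
            if not donors:
                break
            # Hypothetical tie-break: evict from the largest borrower first.
            max(donors, key=lambda leaf: leaf.borrowed).running -= 1
        target.running += 1


leaves = [Leaf(f"namespace{i}", min_cores=10, max_cores=20) for i in range(1, 5)]
history = []
for step, leaf in enumerate(leaves, start=1):
    deploy(leaves, leaf, replicas=5)
    history.append([item.running for item in leaves])
    print(f"after step {step}: {history[-1]}")
```

The running pod counts per namespace after each step are [4, 0, 0, 0], [4, 4, 0, 0], [3, 3, 2, 0], and [2, 2, 2, 2], matching the result table above.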

What's next