The Cloud-native AI Suite lets you install AI/ML components on an ACK Pro cluster, monitor GPU resource usage across multiple views, and allocate resources fairly across teams — all from AI Dashboard or the ACK console.
This guide walks through three core tasks: installing the suite, viewing resource dashboards, and managing users and quotas with capacity scheduling.
Prerequisites
Before you begin, ensure that you have:
An ACK Pro cluster running Kubernetes 1.18 or later
Monitoring Agents and Simple Log Service enabled when the cluster was created (set on the Component Configurations page of the cluster creation wizard). For details, see Create an ACK Pro cluster
Key concepts
Security boundary
Kubernetes is a single-tenant orchestrator: a single control plane instance is shared across all tenants in a cluster. The cluster itself is the only construct that provides a hard security boundary. Quota trees, user groups, and namespaces provide organizational guardrails and logical isolation — they do not provide the same security guarantees as cluster-level separation. Design your multi-tenant architecture with this constraint in mind.
How capacity scheduling works
Each quota node in a quota tree has two parameters:
Min: the guaranteed minimum resources the namespace can always claim, even when the cluster is under pressure
Max: the maximum resources the namespace can use when idle capacity is available
When a namespace is idle, other namespaces can borrow its unused capacity up to their Max. When the owning namespace needs its guaranteed minimum back, the scheduler reclaims resources from borrowing namespaces, considering workload priority, availability, and creation time.
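The Min/Max rules above can be sketched in a few lines of Python. This is a simplified model for illustration only, not the scheduler's implementation; the names `Quota`, `fits_without_reclaim`, and `borrowed` are hypothetical, and the model ignores workload priority, availability, and creation time.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical, simplified view of one quota node (units: CPU cores)."""
    min_cores: int       # guaranteed minimum
    max_cores: int       # upper bound when idle capacity exists
    used_cores: int = 0  # cores currently in use


def fits_without_reclaim(quota: Quota, request: int, cluster_idle: int) -> bool:
    """True if the request stays under Max and idle capacity can cover it."""
    stays_under_max = quota.used_cores + request <= quota.max_cores
    return stays_under_max and request <= cluster_idle


def borrowed(quota: Quota) -> int:
    """Cores in use beyond Min, which are candidates for reclamation."""
    return max(0, quota.used_cores - quota.min_cores)


if __name__ == "__main__":
    q = Quota(min_cores=10, max_cores=20, used_cores=15)
    print(borrowed(q))                                    # cores above Min
    print(fits_without_reclaim(q, request=5, cluster_idle=5))
```

In this model, anything counted by `borrowed` is what the owning namespaces can take back when they claim their guaranteed minimum.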
Resource objects
| Object | Role |
|---|---|
| Quota tree | Hierarchical structure defining resource allocation across the organization |
| Quota node | A node in the tree; each leaf node maps to one or more namespaces |
| User group | The smallest allocation unit; maps to a leaf quota node |
| User | Holds a Kubernetes service account used to submit jobs and log in to the console |
| Namespace | Kubernetes namespace bound to a leaf quota node |
User roles
| Role | Permissions |
|---|---|
| admin | Log in to AI Dashboard and manage cluster components; includes all researcher permissions |
| researcher | Submit jobs, use cluster resources, log in to AI Developer Console |
Step 1: Install the Cloud-native AI Suite
The Cloud-native AI Suite consists of six component categories. Install only the components your workloads require. AI Dashboard and AI Developer Console are installed separately and require additional RAM permissions.
| Component category | Purpose | Install separately? |
|---|---|---|
| Task elasticity | Scale AI workloads dynamically | No |
| Data acceleration | Speed up dataset access | No |
| AI task scheduling | Schedule AI workloads with capacity-aware policies | No |
| AI task lifecycle management | Manage training job lifecycles | No |
| AI Dashboard | Monitor GPU resources and quotas | Yes (requires RAM permissions) |
| AI Developer Console | Submit and manage AI jobs | Yes (requires RAM permissions) |
Deploy the suite
Log in to the ACK console and click Clusters in the left navigation pane.
Click the cluster name, then choose Applications > Cloud-native AI Suite in the left pane.
Click Deploy.
Select the components to install, then click Deploy Cloud-native AI Suite at the bottom of the page. The system checks environment dependencies before deploying the selected components. After installation, the Components list shows component names and versions. From this list you can Deploy, Upgrade, or Uninstall individual components.
After installing ack-ai-dashboard and ack-ai-dev-console, links to AI Dashboard and AI Developer Console appear in the upper-left corner of the Cloud-native AI Suite page.
Configure AI Dashboard
Alibaba Cloud rolled out the AI Console (AI Dashboard and AI Developer Console) through a whitelist mechanism starting January 22, 2025. If you deployed AI Dashboard or AI Developer Console before that date, your deployment is unaffected. If you are not whitelisted, install and configure the AI Console through the open-source community — see Open-source AI Console.
Grant RAM permissions to the worker role
Before AI Dashboard can access cluster data, attach a custom RAM policy to the cluster's worker role.
Create a custom policy.
Log in to the RAM console and choose Permissions > Policies in the left navigation pane.
Click Create Policy, select the JSON tab, and add the following policy:
```json
{
  "Version": "1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cs:*",
        "log:GetProject",
        "log:GetLogStore",
        "log:GetConfig",
        "log:GetMachineGroup",
        "log:GetAppliedMachineGroups",
        "log:GetAppliedConfigs",
        "log:GetIndex",
        "log:GetSavedSearch",
        "log:GetDashboard",
        "log:GetJob",
        "ecs:DescribeInstances",
        "ecs:DescribeSpotPriceHistory",
        "ecs:DescribePrice",
        "eci:DescribeContainerGroups",
        "eci:DescribeContainerGroupPrice",
        "log:GetLogStoreLogs",
        "ims:CreateApplication",
        "ims:UpdateApplication",
        "ims:GetApplication",
        "ims:ListApplications",
        "ims:DeleteApplication",
        "ims:CreateAppSecret",
        "ims:GetAppSecret",
        "ims:ListAppSecretIds",
        "ims:ListUsers"
      ],
      "Resource": "*"
    }
  ]
}
```
Name the policy using the format k8sWorkerRolePolicy-{ClusterID} and click OK.
Attach the policy to the cluster's worker role.
In the RAM console, choose Identities > Roles and search for the role in the format KubernetesWorkerRole-{ClusterID}.
Click Grant Permission for the role.
In the Select Policy section, click Custom Policy, search for the policy you created (k8sWorkerRolePolicy-{ClusterID}), select it, and click OK.
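Before pasting the policy into the RAM console, it can help to review what it grants. The following sketch groups the allowed actions by service and flags wildcard grants; it also derives the policy and role names from a cluster ID using the formats given above. The helper names and the abbreviated `SAMPLE_POLICY` are illustrative assumptions, not part of any Alibaba Cloud SDK.

```python
import json
from collections import defaultdict

# Abbreviated copy of the policy above, used only as sample input.
SAMPLE_POLICY = """
{
  "Version": "1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["cs:*", "log:GetProject", "ecs:DescribeInstances", "ims:ListUsers"],
      "Resource": "*"
    }
  ]
}
"""


def actions_by_service(policy_json: str) -> dict:
    """Group allowed action names by their service prefix (the part before ':')."""
    grouped = defaultdict(list)
    for statement in json.loads(policy_json)["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):  # a single action may be given as a string
            actions = [actions]
        for action in actions:
            service, _, name = action.partition(":")
            grouped[service].append(name)
    return dict(grouped)


def wildcard_services(policy_json: str) -> list:
    """Services for which the policy allows every action ('service:*')."""
    return sorted(s for s, names in actions_by_service(policy_json).items() if "*" in names)


def policy_name(cluster_id: str) -> str:
    return f"k8sWorkerRolePolicy-{cluster_id}"


def worker_role_name(cluster_id: str) -> str:
    return f"KubernetesWorkerRole-{cluster_id}"


if __name__ == "__main__":
    print(wildcard_services(SAMPLE_POLICY))
    print(policy_name("c123"), worker_role_name("c123"))  # "c123" is a made-up cluster ID
```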
Complete the AI Dashboard setup
On the Cloud-native AI Suite page, select Sample Console in the Interaction Mode section. The Note dialog box appears.
If Authorized is shown, skip to step 3.
If Unauthorized is shown in red and OK is dimmed, complete the RAM permissions above, then click Authorization Check. After authorization succeeds, Authorized is displayed.

Set the Console Data Storage parameter. Select Pre-installed MySQL for testing or ApsaraDB RDS for production. For details, see Install and configure AI Dashboard and AI Developer Console.
Click Deploy Cloud-native AI Suite. AI Dashboard is ready when its status changes to Ready.
(Optional) Create and accelerate a dataset
Algorithm developers can mount datasets from Object Storage Service (OSS) as persistent volumes (PVs) and accelerate access through AI Dashboard.
Create a PV and PVC
Create a namespace:
```shell
kubectl create ns demo-ns
```
Create a file named fashion-mnist.yaml with the following content:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fashion-demo-pv
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 10Gi
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeAttributes:
      bucket: fashion-mnist
      otherOpts: "-o max_stat_cache_size=0 -o allow_other"
      url: oss-cn-beijing.aliyuncs.com
      akId: "AKID"
      akSecret: "AKSECRET"
    volumeHandle: fashion-demo-pv
  persistentVolumeReclaimPolicy: Retain
  storageClassName: oss
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fashion-demo-pvc
  namespace: demo-ns
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  selector:
    matchLabels:
      alicloud-pvname: fashion-demo-pv
  storageClassName: oss
  volumeMode: Filesystem
  volumeName: fashion-demo-pv
```
Replace the following placeholders:
| Placeholder | Description | Example |
|---|---|---|
| fashion-mnist | OSS bucket name | my-dataset-bucket |
| oss-cn-beijing.aliyuncs.com | OSS endpoint for the bucket's region | oss-cn-hangzhou.aliyuncs.com |
| AKID | AccessKey ID | LTAI5tXxx |
| AKSECRET | AccessKey secret | xXxXxXx |
Apply the manifest:
```shell
kubectl create -f fashion-mnist.yaml
```
Verify the PV and PVC are bound:
```shell
kubectl get pv fashion-demo-pv
kubectl get pvc fashion-demo-pvc -n demo-ns
```
Expected output for the PV:
```
NAME              CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                      STORAGECLASS   AGE
fashion-demo-pv   10Gi       RWX            Retain           Bound    demo-ns/fashion-demo-pvc   oss            8h
```
Expected output for the PVC:
```
NAME               STATUS   VOLUME            CAPACITY   ACCESS MODES   STORAGECLASS   AGE
fashion-demo-pvc   Bound    fashion-demo-pv   10Gi       RWX            oss            8h
```
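To check the binding from a script rather than by eye, one option is to request JSON output (for example with `kubectl get pv fashion-demo-pv -o json`) and inspect the `status.phase` field. The sketch below assumes the JSON has already been captured as a string; the helper names and the trimmed sample documents are illustrative.

```python
import json


def is_bound(resource_json: str) -> bool:
    """True if a PV or PVC JSON document reports status.phase == 'Bound'."""
    resource = json.loads(resource_json)
    return resource.get("status", {}).get("phase") == "Bound"


def pvc_bound_to(pvc_json: str, pv_name: str) -> bool:
    """True if the PVC is Bound and its spec.volumeName matches the given PV."""
    pvc = json.loads(pvc_json)
    return is_bound(pvc_json) and pvc.get("spec", {}).get("volumeName") == pv_name


if __name__ == "__main__":
    # Trimmed sample documents; real kubectl output contains many more fields.
    sample_pv = '{"kind": "PersistentVolume", "status": {"phase": "Bound"}}'
    sample_pvc = (
        '{"kind": "PersistentVolumeClaim",'
        ' "spec": {"volumeName": "fashion-demo-pv"},'
        ' "status": {"phase": "Bound"}}'
    )
    print(is_bound(sample_pv))
    print(pvc_bound_to(sample_pvc, "fashion-demo-pv"))
```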
Accelerate the dataset
Log in to AI Dashboard as an administrator.
Choose Dataset > Dataset List in the left navigation pane.
Find the dataset (fashion-demo-pvc) and click Accelerate in the Operator column.
Step 2: View resource dashboards
AI Dashboard provides four dashboard views. Each view answers a different question about GPU resource health.
The AI Console whitelist restriction applies here as well. If you are not whitelisted, access dashboards through the open-source community — see Open-source AI Console.
Cluster dashboard
After logging in, AI Dashboard opens the cluster dashboard by default. It shows cluster-wide GPU health and allocation.
| Metric | What it shows |
|---|---|
| GPU Summary Of Cluster | Total GPU nodes, allocated GPU nodes, unhealthy GPU nodes |
| Total GPU Nodes | Total number of GPU-accelerated nodes |
| Unhealthy GPU Nodes | Number of GPU nodes with detected issues |
| GPU Memory (Used/Total) | Ratio of GPU memory in use to total GPU memory |
| GPU Memory (Allocated/Total) | Ratio of allocated GPU memory to total GPU memory |
| GPU Utilization | Average GPU utilization across the cluster |
| GPUs (Allocated/Total) | Ratio of allocated GPUs to total GPUs |
| Training Job Summary Of Cluster | Count of training jobs by status: Running, Pending, Succeeded, Failed |
GPU Utilization shows whether the GPU executed any work during the sample window — it does not indicate how efficiently the hardware was used. A node showing 100% GPU Utilization may be running lightweight kernels rather than heavy parallel workloads. Pair this metric with GPU Memory (Used/Total) to get a fuller picture of actual compute load.
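The difference between "was busy" and "was used efficiently" can be illustrated with a toy calculation. In the sketch below, each sample is a hypothetical fraction of the GPU's compute units that were active during one sampling interval. The dashboard does not expose such a value; it is assumed here purely to show why a 100% utilization reading can coexist with a light load.

```python
def duty_cycle(busy_fractions):
    """Share of intervals in which the GPU did any work at all."""
    return sum(1 for f in busy_fractions if f > 0) / len(busy_fractions)


def mean_busy_fraction(busy_fractions):
    """Average share of compute units active across all intervals."""
    return sum(busy_fractions) / len(busy_fractions)


if __name__ == "__main__":
    # Ten intervals, each with a small kernel that occupies 5% of the device.
    samples = [0.05] * 10
    print(f"duty cycle:         {duty_cycle(samples):.0%}")
    print(f"mean busy fraction: {mean_busy_fraction(samples):.0%}")
```

Here the duty cycle is 100% even though only a small part of the hardware is active in each interval.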
Node dashboard
Click Nodes in the upper-right corner of the Cluster page to open the node dashboard. It breaks down GPU metrics per node and per GPU device.
| Metric | What it shows |
|---|---|
| GPU Node Details | Per-node table: name, IP, role, GPU mode (exclusive or shared), GPU count, total GPU memory, allocated GPUs, allocated GPU memory, used GPU memory, average GPU utilization |
| GPU Duty Cycle | Utilization per GPU device per node |
| GPU Memory Usage | Memory used per GPU device per node |
| GPU Memory Usage Percentage | Memory usage percentage per GPU per node |
| Allocated GPUs Per Node | Number of allocated GPUs per node |
| GPU Number Per Node | Total GPUs per node |
| Total GPU Memory Per Node | Total GPU memory per node |
Training job dashboard
Click TrainingJobs in the upper-right corner of the Nodes page to open the training job dashboard. It shows resource consumption and GPU efficiency per job.
| Metric | What it shows |
|---|---|
| Training Jobs | Per-job table: namespace, name, type, status, duration, requested GPUs, requested GPU memory, used GPU memory, average GPU utilization |
| Job Instance Used GPU Memory | GPU memory used per job instance |
| Job Instance Used GPU Memory Percentage | Percentage of GPU memory used per job instance |
| Job Instance GPU Duty Cycle | GPU utilization per job instance |
Resource quota dashboard
Click Quota in the upper-right corner of the Training Jobs page to open the quota dashboard. It displays quota consumption by resource type (CPU, memory, nvidia.com/gpu, aliyun.com/gpu-mem, aliyun.com/gpu).
| Column | What it shows |
|---|---|
| Elastic Quota Name | Name of the quota group |
| Namespace | Namespace the quota applies to |
| Resource Name | Resource type |
| Max Quota | Maximum resources available in the namespace |
| Min Quota | Guaranteed minimum resources, honored when the cluster is under pressure |
| Used Quota | Resources currently in use in the namespace |
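One way to read a row of this table is to compare Used Quota against Min Quota and Max Quota. The sketch below models a row with a hypothetical `QuotaRow` class whose fields mirror the column names, and labels the row accordingly; the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class QuotaRow:
    """One row of the quota dashboard (field names mirror the columns above)."""
    elastic_quota_name: str
    namespace: str
    resource_name: str
    max_quota: float
    min_quota: float
    used_quota: float


def describe(row: QuotaRow):
    """Return (state, headroom), where headroom is what remains below Max Quota."""
    if row.used_quota <= row.min_quota:
        state = "within guaranteed minimum"
    elif row.used_quota <= row.max_quota:
        state = "borrowing idle capacity"
    else:
        state = "above max"
    headroom = max(0.0, row.max_quota - row.used_quota)
    return state, headroom


if __name__ == "__main__":
    row = QuotaRow("root.a.1", "namespace1", "cpu", max_quota=20, min_quota=10, used_quota=15)
    print(describe(row))
```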
Step 3: Manage users and quotas
Cloud-native AI Suite uses a quota tree to enforce hierarchical resource limits and enable resource sharing across teams.

The organizational structure maps to the quota tree:

Each department or project team corresponds to a quota tree branch. Leaf nodes map to namespaces. Setting Min and Max at each level lets teams share idle resources while guaranteeing each team's minimum allocation.
Quota trees and namespace-level controls provide organizational guardrails, not hard security boundaries. If your threat model requires strong tenant isolation, use separate clusters. For more context, see the security boundary note in Key concepts.
Set up a quota tree
Create namespaces for each team. If namespaces already exist, make sure they contain no running pods before associating them with a quota node.
```shell
kubectl create ns namespace1
kubectl create ns namespace2
kubectl create ns namespace3
kubectl create ns namespace4
```
In AI Dashboard, create quota nodes and associate each leaf node with a namespace. Set Min (guaranteed minimum) and Max (maximum when the cluster has idle capacity) for each node.
Create users and user groups
A user can belong to multiple user groups. A user group can contain multiple users. Associate users with user groups to grant access to the resources allocated to that group.
Create a user. See Generate the kubeconfig file and logon token for a new user.
Create a user group. See Add a user group.
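Because users and user groups form a many-to-many relationship, the namespaces a user can reach are the union of the namespaces behind each of their groups. A minimal sketch of that lookup follows; the user names, group names, and mappings are made-up examples, not objects the suite creates.

```python
def namespaces_for(user, memberships, group_namespaces):
    """Union of namespaces reachable through every group the user belongs to."""
    reachable = set()
    for group in memberships.get(user, ()):
        reachable.update(group_namespaces.get(group, ()))
    return sorted(reachable)


if __name__ == "__main__":
    # Hypothetical data: which groups each user is in, and which
    # namespaces each group's leaf quota node is associated with.
    memberships = {"alice": ["team-a", "team-b"], "bob": ["team-b"]}
    group_namespaces = {"team-a": ["namespace1"], "team-b": ["namespace3", "namespace4"]}
    print(namespaces_for("alice", memberships, group_namespaces))
    print(namespaces_for("bob", memberships, group_namespaces))
```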
Capacity scheduling example
This example shows how the scheduler shares and reclaims CPU resources across four namespaces. The quota tree has this structure:

Quota configuration:
| Quota node | Min (CPU cores) | Max (CPU cores) |
|---|---|---|
| root | 40 | 40 |
| root.a | 20 | 40 |
| root.b | 20 | 40 |
| root.a.1 | 10 | 20 |
| root.a.2 | 10 | 20 |
| root.b.1 | 10 | 20 |
| root.b.2 | 10 | 20 |
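A quick sanity check on a configuration like this is to confirm that each node's Min does not exceed its Max and that the children's Min values add up to no more than the parent's Min. This guide does not state these rules explicitly, so treat them as assumed consistency checks; the sketch below encodes the table as a dictionary keyed by dotted node name.

```python
# Quota node -> (Min, Max) in CPU cores, copied from the table above.
QUOTAS = {
    "root": (40, 40),
    "root.a": (20, 40),
    "root.b": (20, 40),
    "root.a.1": (10, 20),
    "root.a.2": (10, 20),
    "root.b.1": (10, 20),
    "root.b.2": (10, 20),
}


def children(name, quotas):
    """Direct children of a node, derived from the dotted naming scheme."""
    prefix = name + "."
    return [n for n in quotas if n.startswith(prefix) and "." not in n[len(prefix):]]


def check_tree(quotas):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for name, (min_cores, max_cores) in quotas.items():
        if min_cores > max_cores:
            problems.append(f"{name}: Min {min_cores} exceeds Max {max_cores}")
        kids = children(name, quotas)
        if kids:
            kids_min = sum(quotas[k][0] for k in kids)
            if kids_min > min_cores:
                problems.append(f"{name}: children Min total {kids_min} exceeds Min {min_cores}")
    return problems


if __name__ == "__main__":
    print(children("root", QUOTAS))
    print(check_tree(QUOTAS) or "no problems found")
```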
Walk-through:
Without elastic quotas, each leaf namespace can only use its Min (10 cores, so 2 pods at 5 cores/pod). With elastic quotas enabled and 40 cores available in the cluster, namespaces can borrow idle capacity up to their Max.
Step 1: Deploy five pods to namespace1, each requesting 5 CPU cores (25 cores total).
With root.a.1 Max set to 20 cores, 4 pods run (20 cores). The fifth pod stays Pending.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx1
  namespace: namespace1
  labels:
    app: nginx1
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx1
  template:
    metadata:
      name: nginx1
      labels:
        app: nginx1
    spec:
      containers:
      - name: nginx1
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5
```
Step 2: Deploy five pods to namespace2, each requesting 5 CPU cores.
With 20 cores remaining (40 total minus 20 used by namespace1), 4 pods run (20 cores, which is also the Max for root.a.2). The fifth stays Pending. Together, namespace1 and namespace2 now consume all 40 cores of the root quota.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx2
  namespace: namespace2
  labels:
    app: nginx2
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx2
  template:
    metadata:
      name: nginx2
      labels:
        app: nginx2
    spec:
      containers:
      - name: nginx2
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5
```
Step 3: Deploy five pods to namespace3, each requesting 5 CPU cores.
The cluster has no idle capacity. The scheduler reclaims 10 cores from root.a to guarantee root.b.1's minimum. It reclaims 5 cores from root.a.1 (reducing namespace1 from 4 running pods to 3) and 5 cores from root.a.2 (reducing namespace2 from 4 running pods to 3). With 10 reclaimed cores, 2 pods run in namespace3.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx3
  namespace: namespace3
  labels:
    app: nginx3
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx3
  template:
    metadata:
      name: nginx3
      labels:
        app: nginx3
    spec:
      containers:
      - name: nginx3
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5
```
Step 4: Deploy five pods to namespace4, each requesting 5 CPU cores.
The scheduler reclaims another 10 cores from root.a to guarantee root.b.2's minimum: 5 cores from root.a.1 and 5 from root.a.2. After reclamation, namespace1 and namespace2 each have 2 running pods (10 cores), and namespace4 gets 2 running pods.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx4
  namespace: namespace4
  labels:
    app: nginx4
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx4
  template:
    metadata:
      name: nginx4
      labels:
        app: nginx4
    spec:
      containers:
      - name: nginx4
        image: nginx
        resources:
          limits:
            cpu: 5
          requests:
            cpu: 5
```
Result:
| Namespace | Quota node | Running pods | CPU cores in use |
|---|---|---|---|
| namespace1 | root.a.1 | 2 | 10 |
| namespace2 | root.a.2 | 2 | 10 |
| namespace3 | root.b.1 | 2 | 10 |
| namespace4 | root.b.2 | 2 | 10 |
Each team's minimum guarantee was honored. Teams that borrowed idle capacity had resources reclaimed when other teams needed their guaranteed minimums back.
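The arithmetic of the walk-through can be reproduced with a small simulation. This is a deliberately simplified model and not the scheduler's algorithm: it tracks only the four leaf quotas, admits a pod when it stays under Max and idle cores exist, and otherwise reclaims one pod from the namespace that has borrowed the most, but only for requests that stay within the requester's Min. It ignores intermediate quota nodes, workload priority, and creation time, and all names in it are illustrative.

```python
POD_CORES = 5     # each pod requests 5 CPU cores
TOTAL_CORES = 40  # root quota (Min = Max = 40)

# Leaf namespace -> (Min, Max) in CPU cores, from the quota configuration table.
LEAVES = {
    "namespace1": (10, 20),
    "namespace2": (10, 20),
    "namespace3": (10, 20),
    "namespace4": (10, 20),
}


def deploy(used, namespace, replicas):
    """Try to start `replicas` pods in `namespace`, mutating `used` (cores per namespace)."""
    min_cores, max_cores = LEAVES[namespace]
    for _ in range(replicas):
        if used[namespace] + POD_CORES > max_cores:
            break  # over Max: the remaining pods stay Pending
        idle = TOTAL_CORES - sum(used.values())
        if idle < POD_CORES:
            if used[namespace] + POD_CORES > min_cores:
                break  # nothing idle and the request exceeds the guaranteed Min
            borrowers = [n for n in used if used[n] > LEAVES[n][0]]
            if not borrowers:
                break
            # Reclaim one pod from whichever namespace borrowed the most.
            victim = max(borrowers, key=lambda n: used[n] - LEAVES[n][0])
            used[victim] -= POD_CORES
        used[namespace] += POD_CORES


def run_walkthrough():
    """Run steps 1 to 4 and return the running-pod counts after each step."""
    used = {namespace: 0 for namespace in LEAVES}
    snapshots = []
    for namespace in LEAVES:
        deploy(used, namespace, replicas=5)
        snapshots.append({n: cores // POD_CORES for n, cores in used.items()})
    return snapshots


if __name__ == "__main__":
    for step, pods in enumerate(run_walkthrough(), start=1):
        print(f"after step {step}: {pods}")
```

Under these assumptions the model ends with two running pods per namespace, matching the result table.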