You can containerize the software environment required for your AI training jobs and run the jobs on Elastic Container Instance (ECI). This simplifies environment setup and lets you pay only for the time that a job runs, which reduces costs and improves efficiency. This topic uses a GPU-based TensorFlow training job from GitHub as an example to show you how to run a training job on an ACK Serverless cluster by using ECI.
Background
Artificial intelligence and machine learning are now widely used, leading to a variety of training models and an increase in cloud-based training jobs. However, after you migrate to the cloud, you may encounter the following challenges when you run training jobs:
- You have to purchase a GPU instance and install a GPU driver. Even after containerizing the training job, you still need to install a GPU runtime hook.
- To save costs, you typically release resources after a job is complete. The next time you start the job, however, you have to recreate the instance and reconfigure the environment. If compute nodes have insufficient resources, you must manually scale out, recreate, and reconfigure.
To address these issues, we recommend that you use an ACK Serverless cluster and ECI to run your training jobs. This solution provides the following benefits:
- Pay-as-you-go and O&M-free.
- Configure once and reuse indefinitely.
- The image cache feature accelerates instance creation to start training jobs faster.
- Data is decoupled from the training model and can be persistently stored.
Prerequisites
- Prepare the training data and container image.
  - Training data: This topic uses a TensorFlow training job from GitHub as an example. For more information, see TensorFlow training job.
  - Container image: ECI provides a sample image that is uploaded to Alibaba Cloud Container Registry (ACR). You can use the image directly or customize it for your needs.
    - Internal address: registry-vpc.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0
    - Public address: registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0
- Create an ACK Serverless cluster.
  Create an ACK Serverless cluster in the Container Service for Kubernetes console. For more information, see Create an ACK Serverless cluster.
  Important: If you need to pull images from the internet or your training job requires internet access, configure a NAT Gateway.
  You can use kubectl to manage and access the ACK Serverless cluster as follows:
  - To manage the cluster from your local computer, install and configure the kubectl client. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
  - You can also use kubectl in Cloud Shell to manage the cluster. For more information, see Use kubectl to manage a Kubernetes cluster in Cloud Shell.
- Create a NAS file system and add a mount target.
  Create a file system and add a mount target in the File Storage NAS console. The file system must be in the same VPC as the ACK Serverless cluster. For more information, see Manage file systems and Manage mount targets.
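The two sample image addresses in the prerequisites differ only in the registry host. The following is a minimal sketch of choosing between them by region, assuming the rule stated in this topic: clusters in China (Hangzhou) use the internal address, and all other clusters use the public address. The function name image_address is hypothetical and not part of any Alibaba Cloud tool.

```shell
# Hypothetical helper: print the sample image address to use for a region ID.
# Assumption: only clusters in cn-hangzhou use the internal (VPC) address;
# clusters in every other region fall back to the public address.
image_address() {
  region="$1"
  if [ "$region" = "cn-hangzhou" ]; then
    echo "registry-vpc.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0"
  else
    echo "registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0"
  fi
}

# Example calls: one region inside Hangzhou, one outside.
image_address "cn-hangzhou"
image_address "cn-beijing"
```

The printed address is the value to place in the image fields of the YAML files used in the steps that follow.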
Steps
Create an image cache
The image cache feature is integrated into ACK Serverless clusters as a Kubernetes CRD to accelerate pulling container images.
- Create a YAML file for the image cache.
  The following code provides a sample imagecache.yaml file:
  Note: If your cluster is in China (Hangzhou), use the internal image address. Otherwise, use the public address.

      apiVersion: eci.alibabacloud.com/v1
      kind: ImageCache
      metadata:
        name: tensorflow
      spec:
        images:
        - registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0

- Create the image cache.

      kubectl create -f imagecache.yaml

  Creating an image cache involves pulling the image, which may take some time depending on its size and network conditions. You can run the following command to check the creation progress of the image cache:

      kubectl get imagecache tensorflow

  The image cache is created when output similar to the following is returned.

      NAME         AGE   CACHEID                  PHASE   PROGRESS
      tensorflow   13m   imc-2zei4b7k43lxxxbvoz   Ready   100%
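Because creating the image cache can take several minutes, you may want to script the check instead of rerunning the command by hand. The following is a minimal sketch, assuming the column layout in the sample output above (NAME, AGE, CACHEID, PHASE, PROGRESS). The function names, the 10-second interval, and the 60-attempt limit are illustrative assumptions.

```shell
# Hypothetical helper: read `kubectl get imagecache` table output on stdin
# and print the PHASE column (fourth field) for the named image cache.
# The header row is skipped.
imagecache_phase() {
  awk -v name="$1" 'NR > 1 && $1 == name { print $4 }'
}

# Poll until the image cache reports Ready.
# Returns 0 when Ready, or 1 if the attempts run out first.
wait_for_imagecache() {
  name="$1"
  attempts=0
  while [ "$attempts" -lt 60 ]; do
    phase=$(kubectl get imagecache "$name" | imagecache_phase "$name")
    if [ "$phase" = "Ready" ]; then
      return 0
    fi
    attempts=$((attempts + 1))
    sleep 10
  done
  return 1
}
```

For example, wait_for_imagecache tensorflow returns once the PHASE column shows Ready, so a later step can be chained after it with &&.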
Create the training job
- Create a PV and a PVC for the NAS file system.
  - Prepare a YAML file.
    The following code provides a sample nas.yaml file:

        apiVersion: v1
        kind: PersistentVolume
        metadata:
          name: pv-nas
          labels:
            alicloud-pvname: pv-nas
        spec:
          capacity:
            storage: 100Gi
          accessModes:
          - ReadWriteMany
          csi:
            driver: nasplugin.csi.alibabacloud.com
            volumeHandle: pv-nas
            volumeAttributes:
              server: 15e1d4****-gt***.cn-beijing.nas.aliyuncs.com   # The mount target of the NAS file system.
              path: /
          mountOptions:
          - nolock,tcp,noresvport
          - vers=3
        ---
        kind: PersistentVolumeClaim
        apiVersion: v1
        metadata:
          name: pvc-nas
        spec:
          accessModes:
          - ReadWriteMany
          resources:
            requests:
              storage: 100Gi
          selector:
            matchLabels:
              alicloud-pvname: pv-nas

  - Create the PV and PVC.

        kubectl create -f nas.yaml
- Create an ECI pod to run the training job.
  - Prepare a YAML file.
    The following code provides a sample tensorflow.yaml file:

        apiVersion: v1
        kind: Pod
        metadata:
          name: tensorflow
          labels:
            app: tensorflow
            alibabacloud.com/eci: "true"
          annotations:
            k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c4g1.xlarge"  # Specify the GPU instance type.
            k8s.aliyun.com/eci-auto-imc: "true"  # Enable automatic image cache matching.
        spec:
          restartPolicy: OnFailure
          containers:
          - name: tensorflow
            image: registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0  # Use the image address that corresponds to the image cache.
            command:
            - python
            args:
            - /home/classify_image/classify_image.py  # The training script to run after the container starts.
            resources:
              limits:
                nvidia.com/gpu: "1"  # The number of GPUs required by the container.
            volumeMounts:  # Mount the NAS file system to persist training results.
            - name: pvc-nas
              mountPath: /tmp/classify_image_model
          volumes:
          - name: pvc-nas
            persistentVolumeClaim:
              claimName: pvc-nas

  - Create the pod.

        kubectl create -f tensorflow.yaml
- Check the status of the training job.

      kubectl get pod

  When the pod status changes to 'Completed', the training job is finished.

      NAME         READY   STATUS      RESTARTS   AGE
      tensorflow   0/1     Completed   0          118s

  Note: You can also view pod details by running kubectl describe pod <pod name> or view logs by running kubectl logs <pod name>.
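The status check can also be scripted. The following is a minimal sketch that polls the pod phase through the standard kubectl jsonpath output option. It relies on the Kubernetes pod phases (Pending, Running, Succeeded, Failed, Unknown); a pod whose phase is Succeeded is the one that kubectl get pod lists as Completed. The function names, the 10-second interval, and the 180-attempt limit are illustrative assumptions.

```shell
# Hypothetical helper: classify a pod phase string.
# Succeeded means the job finished, Failed means it ended with an error,
# and any other phase (Pending, Running, Unknown, or empty) means keep waiting.
job_outcome() {
  case "$1" in
    Succeeded) echo "done" ;;
    Failed)    echo "failed" ;;
    *)         echo "waiting" ;;
  esac
}

# Poll the pod phase until the job finishes or the attempts run out.
# Returns 0 only if the pod reached the Succeeded phase.
wait_for_pod() {
  name="$1"
  attempts=0
  outcome="waiting"
  while [ "$attempts" -lt 180 ]; do
    phase=$(kubectl get pod "$name" -o jsonpath='{.status.phase}')
    outcome=$(job_outcome "$phase")
    if [ "$outcome" != "waiting" ]; then
      break
    fi
    attempts=$((attempts + 1))
    sleep 10
  done
  [ "$outcome" = "done" ]
}
```

For example, wait_for_pod tensorflow && kubectl logs tensorflow prints the training logs once the job has finished. Because the sample pod uses restartPolicy: OnFailure, a failing container is restarted rather than moving the pod to Failed right away, so the attempt limit is what bounds the wait in that case.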
View results
You can view the results of the training job.
- In the File Storage NAS console, you can see that the training results are stored on the NAS file system. After you remount the file system, you can view the result data in the corresponding path.
  In the file system list, the Used Capacity column displays the used storage capacity, for example, 45.71 MiB, and the file system status is Running.
- In the Elastic Container Instance console, you can see the ECI instance that corresponds to the pod.
  A 'Succeeded' status indicates that the container in the instance has stopped running. The system then reclaims the underlying compute resources and stops billing for the pod.
Related topics
This topic uses the image cache feature to accelerate image pulling and a NAS file system for persistent storage. For more information, see the following topics: