Run TensorFlow tasks with ECI

You can containerize the software environment required for your AI training jobs and run the jobs on ECI. This simplifies environment setup and lets you pay only for the runtime, reducing costs and improving efficiency. This topic uses a GPU-based TensorFlow training job from GitHub as an example to show you how to run a training job on an ACK Serverless cluster by using ECI.

Background

Artificial intelligence and machine learning are now widely used, leading to a variety of training models and an increase in cloud-based training jobs. However, after you migrate to the cloud, you may encounter the following challenges when you run training jobs:

You have to purchase a GPU instance and install a GPU driver. Even after containerizing the training job, you still need to install a GPU runtime hook.
To save costs, you typically release resources after a job is complete. The next time you start the job, however, you have to recreate the instance and reconfigure the environment. If compute nodes have insufficient resources, you must manually scale out the cluster, then recreate the instances and reconfigure the environment.

To address these issues, we recommend that you use an ACK Serverless cluster and ECI to run your training jobs. This solution provides the following benefits:

Pay-as-you-go and O&M-free.
Configure once and reuse indefinitely.
The image cache feature accelerates instance creation to start training jobs faster.
Data is decoupled from the training model and can be persistently stored.

Prerequisites

Prepare the training data and container image.
- Training data: This topic uses a TensorFlow training job from GitHub as an example. For more information, see TensorFlow training job.
- Container image: ECI provides a sample image that is uploaded to Alibaba Cloud Container Registry (ACR). You can use the image directly or customize it for your needs.
  - Internal address: registry-vpc.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0
  - Public address: registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0
Create an ACK Serverless cluster.
Create an ACK Serverless cluster in the Container Service for Kubernetes console. For more information, see Create an ACK Serverless cluster.

Important
If you need to pull images from the internet or your training job requires internet access, configure a NAT Gateway.

You can use kubectl to manage and access the ACK Serverless cluster as follows:
- To manage the cluster from your local computer, install and configure the kubectl client. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
- You can also use kubectl in Cloud Shell to manage the cluster. For more information, see Use kubectl to manage a Kubernetes cluster in Cloud Shell.
Create a NAS file system and add a mount target.

Create a file system and add a mount target in the File Storage NAS console. The file system must be in the same VPC as the ACK Serverless cluster. For more information, see Manage file systems and Manage mount targets.

Steps

Create an image cache

The image cache feature is integrated into ACK Serverless clusters as a Kubernetes CRD to accelerate pulling container images.

Create a YAML file for the image cache.

The following code provides a sample imagecache.yaml file:

Note
If your cluster is in China (Hangzhou), use the internal image address. Otherwise, use the public address.
```
apiVersion: eci.alibabacloud.com/v1
kind: ImageCache
metadata:
  name: tensorflow
spec:
  images:
  - registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0
```
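The note on image addresses can be expressed as a small shell helper. This is only a sketch: the function name `tensorflow_image` is hypothetical, and it assumes that `cn-hangzhou` is the region ID of China (Hangzhou). The two image addresses are the ones listed in the prerequisites.

```shell
# Print the sample image address to use for a given region ID.
# Hypothetical helper: only the two addresses come from this topic.
tensorflow_image() {
  region="$1"
  if [ "$region" = "cn-hangzhou" ]; then
    # Cluster is in China (Hangzhou): use the internal (VPC) address.
    echo "registry-vpc.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0"
  else
    # Any other region: use the public address (requires internet access).
    echo "registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0"
  fi
}
```

For example, `tensorflow_image cn-hangzhou` prints the internal address. Whichever address you put in imagecache.yaml should also be the one used in the pod spec, so that the pod matches the cached image.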
Create the image cache.
```
kubectl create -f imagecache.yaml
```
Creating an image cache involves pulling the image, which may take some time depending on its size and network conditions. You can run the following command to check the creation progress of the image cache:
```
kubectl get imagecache tensorflow
```
The image cache is created when output similar to the following is returned.
```
kubectl get imagecache tensorflow
NAME          AGE   CACHEID                    PHASE   PROGRESS
tensorflow    13m   imc-2zei4b7k43lxxxbvoz     Ready   100%
```
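If you script this step, you may prefer to wait for the Ready phase instead of rerunning the command by hand. The following is a minimal sketch, not an official tool: the function names and the `POLL_INTERVAL` variable are hypothetical, and the parser assumes the column order shown in the sample output above (NAME, AGE, CACHEID, PHASE, PROGRESS).

```shell
# Extract the PHASE column (4th field) from `kubectl get imagecache` table output.
# Assumes the column order shown in the sample output: NAME AGE CACHEID PHASE PROGRESS.
imagecache_phase() {
  awk 'NR > 1 { print $4 }'
}

# Poll until the image cache named $1 reports Ready, or give up after $2 attempts.
wait_for_imagecache() {
  name="$1"
  attempts="${2:-60}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase="$(kubectl get imagecache "$name" | imagecache_phase)"
    if [ "$phase" = "Ready" ]; then
      echo "image cache $name is Ready"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-10}"   # seconds between checks (hypothetical default)
  done
  echo "image cache $name not Ready after $attempts attempts" >&2
  return 1
}
```

With the defaults, `wait_for_imagecache tensorflow` checks every 10 seconds for up to 60 attempts. Adjust both values to the size of your image and your network conditions.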

Create the training job

Create a PV and a PVC for the NAS file system.

Prepare a YAML file.

The following code provides a sample nas.yaml file:

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-nas
  labels:
    alicloud-pvname: pv-nas
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: pv-nas
    volumeAttributes:
      server: 15e1d4****-gt***.cn-beijing.nas.aliyuncs.com    # The mount target of the NAS file system.
      path: /
  mountOptions:
    - nolock,tcp,noresvport
    - vers=3
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-nas
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  selector:
    matchLabels:
      alicloud-pvname: pv-nas
```

Create the PV and PVC.
```
kubectl create -f nas.yaml
```
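Before you create the pod, it can help to confirm that the claim is bound to the volume, because a pod that references an unbound PVC does not start. The sketch below reads the standard `.status.phase` field of the PVC. The function names are hypothetical.

```shell
# Print the phase of a PersistentVolumeClaim (for example, Pending or Bound).
pvc_phase() {
  kubectl get pvc "$1" -o jsonpath='{.status.phase}'
}

# Succeed only if the claim named $1 is bound to a volume.
require_pvc_bound() {
  phase="$(pvc_phase "$1")"
  if [ "$phase" != "Bound" ]; then
    echo "PVC $1 is in phase '${phase:-unknown}', expected Bound" >&2
    return 1
  fi
  echo "PVC $1 is Bound"
}
```

For the sample nas.yaml, run `require_pvc_bound pvc-nas`. If the claim stays in Pending, check that the label selector in the PVC matches the label on the PV.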

Create an ECI pod to run the training job.

Prepare a YAML file.

The following code provides a sample tensorflow.yaml file:

```
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow
  labels:
    app: tensorflow
    alibabacloud.com/eci: "true"
  annotations:
    k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c4g1.xlarge"   # Specify the GPU instance type.
    k8s.aliyun.com/eci-auto-imc: "true"                    # Enable automatic image cache matching.
spec:
  restartPolicy: OnFailure
  containers:
    - name: tensorflow
      image: registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0  # Use the image address that corresponds to the image cache.
      command:
        - python
      args:
        - /home/classify_image/classify_image.py      # The training script to run after the container starts.
      resources:
        limits:
          nvidia.com/gpu: "1"   # The number of GPUs required by the container.
      volumeMounts:             # Mount the NAS file system to persist training results.
        - name: pvc-nas
          mountPath: /tmp/classify_image_model
  volumes:
    - name: pvc-nas
      persistentVolumeClaim:
        claimName: pvc-nas
```

Create the pod.
```
kubectl create -f tensorflow.yaml
```

Check the status of the training job.
```
kubectl get pod
```
When the pod status changes to 'Completed', the training job is finished.
```
kubectl get pod
NAME            READY   STATUS      RESTARTS   AGE
tensorflow      0/1     Completed   0          118s
```
Note
You can also view pod details by running kubectl describe pod <pod name> or view logs by running kubectl logs <pod name>.
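To wait for the job in a script, you can poll the pod phase instead of reading the STATUS column. Note that the 'Completed' status printed by kubectl get pod corresponds to the pod phase `Succeeded`. The following is a minimal sketch under those assumptions. The function name `wait_for_pod` and the `POLL_INTERVAL` variable are hypothetical.

```shell
# Poll the phase of pod $1 until it is Succeeded or Failed,
# or give up after $2 attempts (default 120).
wait_for_pod() {
  pod="$1"
  attempts="${2:-120}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase="$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')"
    case "$phase" in
      Succeeded)
        echo "pod $pod finished"
        return 0 ;;
      Failed)
        echo "pod $pod failed; inspect it with: kubectl logs $pod" >&2
        return 1 ;;
    esac
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-5}"   # seconds between checks (hypothetical default)
  done
  echo "pod $pod still in phase '$phase' after $attempts attempts" >&2
  return 2
}
```

For the sample pod, run `wait_for_pod tensorflow`. Because the sample uses restartPolicy: OnFailure, a failing container is normally restarted, so the pod may remain in the Running phase until the attempts run out instead of reaching Failed.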

View results

You can view the results of the training job.

In the File Storage NAS console, you can see that the training results are stored on the NAS file system. After you remount the file system, you can view the result data in the corresponding path.

In the file system list, the Used Capacity column displays the used storage capacity, for example, 45.71 MiB, and the file system status is Running.
In the Elastic Container Instance console, you can see the ECI instance that corresponds to the pod.

A 'Succeeded' status indicates that the container in the instance has exited successfully and is no longer running. The system then reclaims the underlying compute resources and stops billing for the pod.