Run TensorFlow tasks with ECI

You can containerize the software environment required for your AI training jobs and run the jobs on ECI. This simplifies environment setup and lets you pay only for the runtime, reducing costs and improving efficiency. This topic uses a GPU-based TensorFlow training job from GitHub as an example to show you how to run a training job on an ACK Serverless cluster by using ECI.

Background

Artificial intelligence and machine learning are now widely used, leading to a variety of training models and an increase in cloud-based training jobs. However, after you migrate to the cloud, you may encounter the following challenges when you run training jobs:

  • You have to purchase a GPU instance and install a GPU driver. Even after containerizing the training job, you still need to install a GPU runtime hook.

  • To save costs, you typically release resources after a job is complete. The next time you start the job, however, you have to recreate the instance and reconfigure the environment. If compute nodes have insufficient resources, you must manually scale out the cluster, then recreate the instances and reconfigure the environment.

To address these issues, we recommend that you use an ACK Serverless cluster and ECI to run your training jobs. This solution provides the following benefits:

  • Pay-as-you-go and O&M-free.

  • Configure once and reuse indefinitely.

  • The image cache feature accelerates instance creation to start training jobs faster.

  • Data is decoupled from the training model and can be persistently stored.

Prerequisites

  1. Prepare the training data and container image.

    • Training data: This topic uses a TensorFlow training job from GitHub as an example. For more information, see TensorFlow training job.

    • Container image: ECI provides a sample image that is uploaded to Alibaba Cloud Container Registry (ACR). You can use the image directly or customize it for your needs.

      • Internal address: registry-vpc.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0

      • Public address: registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0

  2. Create an ACK Serverless cluster.

    Create an ACK Serverless cluster in the Container Service for Kubernetes console. For more information, see Create an ACK Serverless cluster.

    Important

    If you need to pull images from the internet or your training job requires internet access, configure a NAT Gateway.

    You can use kubectl to manage and access the ACK Serverless cluster.

  3. Create a NAS file system and add a mount target.

    Create a file system and add a mount target in the File Storage NAS console. The file system must be in the same VPC as the ACK Serverless cluster. For more information, see Manage file systems and Manage mount targets.

Steps

Create an image cache

The image cache feature is integrated into ACK Serverless clusters as a Kubernetes CRD to accelerate pulling container images.

  1. Create a YAML file for the image cache.

    The following code provides a sample imagecache.yaml file:

    Note

    If your cluster is in China (Hangzhou), use the internal image address. Otherwise, use the public address. The following sample uses the public address.

    apiVersion: eci.alibabacloud.com/v1
    kind: ImageCache
    metadata:
      name: tensorflow
    spec:
      images:
      - registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0

  2. Create the image cache.

    kubectl create -f imagecache.yaml

    Creating an image cache involves pulling the image, which may take some time depending on its size and network conditions. You can run the following command to check the creation progress of the image cache:

    kubectl get imagecache tensorflow

    The image cache is ready when the output shows Ready in the PHASE column, similar to the following:

    kubectl get imagecache tensorflow
    NAME          AGE   CACHEID                    PHASE   PROGRESS
    tensorflow    13m   imc-2zei4b7k43lxxxbvoz     Ready   100%
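
The address choice from the note and the readiness check above can be scripted. The following is a minimal sketch under stated assumptions, not an official tool: the function names sample_image_for_region, imagecache_phase, and wait_for_imagecache are hypothetical, the region ID cn-hangzhou is taken to mean China (Hangzhou), and the parser assumes the five-column output layout shown in the sample output.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of kubectl or ECI.

# Print the sample image address for a region ID, following the note above:
# the internal (VPC) address in China (Hangzhou), the public address elsewhere.
sample_image_for_region() {
  case "$1" in
    cn-hangzhou) echo "registry-vpc.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0" ;;
    *)           echo "registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0" ;;
  esac
}

# Read `kubectl get imagecache <name>` output on stdin and print the PHASE
# column of the first data row. Assumes the layout NAME AGE CACHEID PHASE PROGRESS.
imagecache_phase() {
  awk 'NR == 2 { print $4 }'
}

# Poll until the image cache reports Ready.
# Arguments: cache name, max attempts (default 60), seconds between attempts (default 10).
# Returns 0 when Ready, 1 if the attempts run out first.
wait_for_imagecache() {
  name=$1; attempts=${2:-60}; interval=${3:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl get imagecache "$name" | imagecache_phase)
    [ "$phase" = "Ready" ] && return 0
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

For example, running wait_for_imagecache tensorflow before you create the pod blocks until the cache is Ready or the default 60 attempts are used up. The attempt count and interval are arbitrary defaults; adjust them to the size of your image.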

Create the training job

  1. Create a PV and a PVC for the NAS file system.

    1. Prepare a YAML file.

      The following code provides a sample nas.yaml file:

      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: pv-nas
        labels:
          alicloud-pvname: pv-nas
      spec:
        capacity:
          storage: 100Gi
        accessModes:
          - ReadWriteMany
        csi:
          driver: nasplugin.csi.alibabacloud.com
          volumeHandle: pv-nas
          volumeAttributes:
            server: 15e1d4****-gt***.cn-beijing.nas.aliyuncs.com    # The mount target of the NAS file system.
            path: /
        mountOptions:
          - nolock,tcp,noresvport
          - vers=3
      ---
      kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: pvc-nas
      spec:
        accessModes:
          - ReadWriteMany
        resources:
          requests:
            storage: 100Gi
        selector:
          matchLabels:
            alicloud-pvname: pv-nas

    2. Create the PV and PVC.

      kubectl create -f nas.yaml

  2. Create an ECI pod to run the training job.

    1. Prepare a YAML file.

      The following code provides a sample tensorflow.yaml file:

      apiVersion: v1
      kind: Pod
      metadata:
        name: tensorflow
        labels:
          app: tensorflow
          alibabacloud.com/eci: "true"
        annotations:
          k8s.aliyun.com/eci-use-specs: "ecs.gn6i-c4g1.xlarge"   # Specify the GPU instance type.
          k8s.aliyun.com/eci-auto-imc: "true"                    # Enable automatic image cache matching.
      spec:
        restartPolicy: OnFailure
        containers:
          - name: tensorflow
            image: registry.cn-hangzhou.aliyuncs.com/eci_open/tensorflow:1.0  # Use the image address that corresponds to the image cache.
            command:
              - python
            args:
              - /home/classify_image/classify_image.py      # The training script to run after the container starts.
            resources:
              limits:
                nvidia.com/gpu: "1"   # The number of GPUs required by the container.
            volumeMounts:             # Mount the NAS file system to persist training results.
              - name: pvc-nas
                mountPath: /tmp/classify_image_model
        volumes:
          - name: pvc-nas
            persistentVolumeClaim:
              claimName: pvc-nas

    2. Create the pod.

      kubectl create -f tensorflow.yaml

  3. Check the status of the training job.

    kubectl get pod

    When the pod status changes to 'Completed', the training job is finished.

    kubectl get pod
    NAME            READY   STATUS      RESTARTS   AGE
    tensorflow      0/1     Completed   0          118s

    Note

    You can also view pod details by running kubectl describe pod <pod name> or view logs by running kubectl logs <pod name>.
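
Instead of rerunning kubectl get pod by hand, the status check above can be wrapped in a polling loop. The following is a minimal sketch, assuming the pod is named tensorflow as in the sample: pod_phase and wait_for_pod are hypothetical helper names, and the KUBECTL variable is an illustrative hook for substituting another command. The loop reads the pod's status.phase field, which is Succeeded for a pod that kubectl lists as Completed.

```shell
#!/bin/sh
# Sketch only: helper names and the KUBECTL override are illustrative assumptions.
KUBECTL=${KUBECTL:-kubectl}

# Print the phase of a pod (for example Pending, Running, Succeeded, or Failed).
pod_phase() {
  "$KUBECTL" get pod "$1" -o jsonpath='{.status.phase}'
}

# Poll until the pod reaches a terminal phase.
# Arguments: pod name, max attempts (default 120), seconds between attempts (default 5).
# Returns 0 on Succeeded, 1 on Failed, 2 if the attempts run out first.
wait_for_pod() {
  name=$1; attempts=${2:-120}; interval=${3:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(pod_phase "$name")
    case "$phase" in
      Succeeded) return 0 ;;
      Failed)    return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  return 2
}
```

A possible use is wait_for_pod tensorflow && kubectl logs tensorflow, which prints the logs only after the job has finished successfully. The timeout values are arbitrary defaults, not recommendations.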

View results

After the training job finishes, you can check its results in the following consoles:

  • In the File Storage NAS console, you can see that the training results are stored on the NAS file system. After you remount the file system, you can view the result data in the corresponding path.

    In the file system list, the Used Capacity column displays the used storage capacity, for example, 45.71 MiB, and the file system status is Running.

  • In the Elastic Container Instance console, you can see the ECI instance that corresponds to the pod.

    A 'Succeeded' status indicates that the container in the instance has stopped running. The system then reclaims the underlying compute resources and stops billing for the pod.

Related topics

This topic uses the image cache feature to accelerate image pulling and a NAS file system for persistent storage. For more information, see the following topics: