This tutorial shows how to run deep learning training in a cloud-native environment.

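What you will try

  • Training a handwritten digit recognition model
  • Using the Arena tool

Prerequisite knowledge

  • Familiarity with the Python language
  • Using Jupyter Notebook
  • Using TensorBoard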

Prerequisites

Before you start this tutorial, install Notebook in your cluster. See "Deploy a data scientist working environment in a Kubernetes cluster".

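Background

Arena (Alibaba Cloud's deep learning tool for Kubernetes) is a command-line tool for running and monitoring machine learning training jobs and checking their results. It is compatible with several deep learning frameworks (TensorFlow, Caffe, Horovod, PyTorch) and integrates a full deep learning production pipeline: training data management, experiment management, model development, continuous training, evaluation, and online prediction. In this tutorial we use Arena to complete the following tasks:

  1. Download and prepare the data.
  2. Submit a single-node training job with Arena, check its status and logs, and inspect the training run in TensorBoard.
  3. Deploy an online model prediction service.
  4. Call the online service from Notebook to verify the model's accuracy.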

Step 1: Install Arena

  • Install Arena in the cluster.

    Run the following command:

    curl -s https://raw.githubusercontent.com/AliyunContainerService/ai-starter/master/scripts/install_arena.sh | \
    bash -s -- \
    --prometheus
  • Install Arena's Python dependency in Notebook.

    Run the following command:

    ! pip install arena
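
The later steps call the arena command from Notebook cells, so it can help to confirm that the command is visible from the Notebook first. The following is a minimal sketch, assuming the arena executable should be on the PATH of the Notebook kernel; the helper name cli_available is hypothetical and not part of Arena.

```python
import shutil


def cli_available(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


# Hypothetical check before running the remaining steps.
if cli_available("arena"):
    print("arena CLI found")
else:
    print("arena CLI not found on PATH; rerun the install step")
```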

Step 2: Download the code and data

  1. Download the TensorFlow sample source code to the /root/models directory.
    ! mkdir  -p /root/models
    ! git clone https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git /root/models/tensorflow-sample-code
    Cloning into '/root/models/tensorflow-sample-code'...
    remote: Enumerating objects: 358, done.
    remote: Counting objects: 100% (358/358), done.
    remote: Total 358 (delta 93), reused 358 (delta 93)
    Receiving objects: 100% (358/358), 11.34 MiB | 22.17 MiB/s, done.
    Resolving deltas: 100% (93/93), done.
    Checking connectivity... done.
  2. Download the MNIST data to the /root/dataset/mnist directory.
    ! mkdir -p /root/output
    ! mkdir -p /root/dataset/mnist && \
      cd /root/dataset/mnist && \
      curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
      curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
      curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
      curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 1610k    0 1610k    0     0  3233k      0 --:--:-- --:--:-- --:--:-- 3233k
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  4542    0  4542    0     0  15353      0 --:--:-- --:--:-- --:--:-- 15396
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 9680k    0 9680k    0     0  14.4M      0 --:--:-- --:--:-- --:--:-- 14.4M
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 28881    0 28881    0     0  89717      0 --:--:-- --:--:-- --:--:-- 89971
  3. Check the directory structure.
    ! tree -I ai-starter -L 3 /root
    /root
    |-- ai-starter1.zip
    |-- dataset
    |   `-- mnist
    |       |-- t10k-images-idx3-ubyte.gz
    |       |-- t10k-labels-idx1-ubyte.gz
    |       |-- train-images-idx3-ubyte.gz
    |       `-- train-labels-idx1-ubyte.gz
    |-- models
    |   `-- tensorflow-sample-code
    |       |-- README.md
    |       |-- data
    |       |-- mnist-tf
    |       |-- models
    |       |-- mpijob
    |       `-- tfjob
    `-- output
    
    10 directories, 6 files

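    In the directory structure above:

    • dataset is the data directory.
    • models is the model code directory.
    • output is the training results directory.
  4. Check the available GPU resources.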
    ! arena top node
    NAME                     IPADDRESS     ROLE    GPU(Total)  GPU(Allocated)
    cn-beijing.192.168.0.90  192.168.0.90  master  0           0
    cn-beijing.192.168.0.91  192.168.0.91  master  0           0
    cn-beijing.192.168.0.92  192.168.0.92  master  0           0
    cn-beijing.192.168.0.93  192.168.0.93  <none>  1           0
    cn-beijing.192.168.0.94  192.168.0.94  <none>  1           1
    cn-beijing.192.168.0.95  192.168.0.95  <none>  1           0
    -----------------------------------------------------------------------------------------
    Allocated/Total GPUs In Cluster:
    1/3 (33%) 
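
The node table above can also be read programmatically, for example to decide whether a job that requests one GPU can be scheduled right away. This is a minimal sketch, assuming the text layout of arena top node matches the output shown above (five whitespace-separated columns per node); the layout may differ between Arena versions, and the function names are hypothetical.

```python
def parse_gpu_nodes(text):
    """Parse node rows from `arena top node` output (layout assumed as shown above)."""
    nodes = []
    for line in text.splitlines():
        parts = line.split()
        # Node rows have exactly 5 columns and end with two integers;
        # the header, separator, and summary lines do not.
        if len(parts) == 5 and parts[3].isdigit() and parts[4].isdigit():
            name, ip, role, total, allocated = parts
            nodes.append({
                "name": name,
                "ip": ip,
                "role": role,
                "total": int(total),
                "allocated": int(allocated),
            })
    return nodes


def free_gpus(nodes):
    """Number of GPUs that are present but not yet allocated."""
    return sum(n["total"] - n["allocated"] for n in nodes)


# Sample text copied from the output above.
sample = """\
NAME                     IPADDRESS     ROLE    GPU(Total)  GPU(Allocated)
cn-beijing.192.168.0.90  192.168.0.90  master  0           0
cn-beijing.192.168.0.91  192.168.0.91  master  0           0
cn-beijing.192.168.0.92  192.168.0.92  master  0           0
cn-beijing.192.168.0.93  192.168.0.93  <none>  1           0
cn-beijing.192.168.0.94  192.168.0.94  <none>  1           1
cn-beijing.192.168.0.95  192.168.0.95  <none>  1           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)
"""

nodes = parse_gpu_nodes(sample)
print(free_gpus(nodes))  # 2
```

In a Notebook, the text could be captured with subprocess.run(["arena", "top", "node"], capture_output=True, text=True).stdout instead of the pasted sample.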

Step 3: Train the model

  1. Submit a training job with Arena.
    %env JOB_NAME=tf-mnist
    %env USER_DATA_NAME=training-data
    ! arena submit tf \
                 --name=$JOB_NAME \
                 --gpus=1 \
                 --data=$USER_DATA_NAME:/training \
                 --tensorboard \
                 --image=tensorflow/tensorflow:1.11.0-gpu-py3 \
                 --logdir=/training/output/mnist \
                 "python /training/models/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 5000 --data_dir /training/dataset/mnist --log_dir /training/output/mnist"
    env: JOB_NAME=tf-mnist
    env: USER_DATA_NAME=training-data
    configmap/tf-mnist-tfjob created
    configmap/tf-mnist-tfjob labeled
    service/tf-mnist-tensorboard created
    deployment.extensions/tf-mnist-tensorboard created
    tfjob.kubeflow.org/tf-mnist created
    INFO[0004] The Job tf-mnist has been submitted successfully 
    INFO[0004] You can run `arena get tf-mnist --type tfjob` to check the job status

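    In the command above:

    • JOB_NAME: the job name. When several people share the cluster, choose a JOB_NAME that is unique to you.
    • USER_DATA_NAME: the shared storage used by Notebook. Its root directory corresponds to /root in Notebook.
      • USER_DATA_NAME is the shared storage that holds your private data; its contents correspond to the /root directory in Notebook.
      • PUBLIC_DATA_NAME is the shared storage that holds public data; its contents correspond to the /root/public directory in Notebook. To use files from the public storage in an arena command, pass --data=$PUBLIC_DATA_NAME:/public and have the training code read the code or data from the /public directory in the container.
    • --data=$USER_DATA_NAME:/training mounts the shared storage at the /training directory of the training job.
    • --logdir tells TensorBoard which directory of the training job to read events from.

    For the full list of parameters, see the command-line documentation.

  2. Check the training status of the model.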
    ! arena get $JOB_NAME -e
    STATUS: RUNNING
    NAMESPACE: default
    TRAINING DURATION: 2m
    
    NAME      STATUS   TRAINER  AGE  INSTANCE          NODE
    tf-mnist  RUNNING  TFJOB    2m   tf-mnist-chief-0  192.168.0.95
    
    Your tensorboard will be available on:
    http://192.168.0.90:31310
    
    Events: 
    No events for pending pod    
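    Note: Once the job status changes from Pending to Running, you can view its logs and GPU utilization. The -e flag is used here to make it easier to see why a job is Pending. An event such as [Pulling] pulling image "tensorflow/tensorflow:1.11.0-gpu-py3" usually means the container image is large and still being pulled, which keeps the job Pending. In that case, rerun the command above until the status becomes Running.
  3. Check the live logs.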
    ! arena logs --tail=50 $JOB_NAME
    2020-02-27T07:36:45.83305473Z WARNING:tensorflow:From /training/models/tensorflow-sample-code/tfjob/docker/mnist/main.py:41: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    2020-02-27T07:36:45.833106285Z Instructions for updating:
    2020-02-27T07:36:45.833112673Z Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
    2020-02-27T07:36:45.882516276Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
    2020-02-27T07:36:45.882534876Z Instructions for updating:
    2020-02-27T07:36:45.882539299Z Please write your own downloading logic.
    2020-02-27T07:36:45.891543112Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    2020-02-27T07:36:45.891560333Z Instructions for updating:
    2020-02-27T07:36:45.891564792Z Please use tf.data to implement this functionality.
    2020-02-27T07:36:46.487076723Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    2020-02-27T07:36:46.48711123Z Instructions for updating:
    2020-02-27T07:36:46.487116886Z Please use tf.data to implement this functionality.
    2020-02-27T07:36:46.495455748Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:110: dense_to_one_hot (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    2020-02-27T07:36:46.495477172Z Instructions for updating:
    2020-02-27T07:36:46.49548407Z Please use tf.one_hot on tensors.
    2020-02-27T07:36:46.565767475Z WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
    2020-02-27T07:36:46.565782106Z Instructions for updating:
    2020-02-27T07:36:46.565786217Z Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
    2020-02-27T07:36:46.852551962Z WARNING:tensorflow:From /training/models/tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
    2020-02-27T07:36:46.852582614Z Instructions for updating:
    2020-02-27T07:36:46.852587774Z 
    2020-02-27T07:36:46.852593114Z Future major versions of TensorFlow will allow gradients to flow
    2020-02-27T07:36:46.852598255Z into the labels input on backprop by default.
    2020-02-27T07:36:46.852603359Z 
    2020-02-27T07:36:46.852608128Z See `tf.nn.softmax_cross_entropy_with_logits_v2`.
    2020-02-27T07:36:46.85262585Z 
    2020-02-27T07:36:47.008582996Z 2020-02-27 07:36:47.008412: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    2020-02-27T07:36:47.169451856Z 2020-02-27 07:36:47.169260: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2020-02-27T07:36:47.170947708Z 2020-02-27 07:36:47.170803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: 
    2020-02-27T07:36:47.17096618Z name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
    2020-02-27T07:36:47.170971096Z pciBusID: 0000:00:08.0
    2020-02-27T07:36:47.170988394Z totalMemory: 15.90GiB freeMemory: 15.64GiB
    2020-02-27T07:36:47.170993093Z 2020-02-27 07:36:47.170856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
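    Note: By default all logs are printed (the command above limits the output to the last 50 lines with --tail=50). Use the -f flag to follow the logs in real time.
  4. Check the GPU usage of the running training job.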
    ! arena top job $JOB_NAME
    INSTANCE NAME     GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)         STATUS   NODE
    tf-mnist-chief-0  0                  3%               513.0MiB / 16280.9MiB   Running  192.168.0.95
  5. View the training results.
    1. View the training trend in TensorBoard.
      ! arena get $JOB_NAME
      STATUS: SUCCEEDED
      NAMESPACE: default
      TRAINING DURATION: 4m
      
      NAME      STATUS     TRAINER  AGE  INSTANCE          NODE
      tf-mnist  SUCCEEDED  TFJOB    7m   tf-mnist-chief-0  192.168.0.95
      
      Your tensorboard will be available on:
      http://192.168.0.90:32764
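      You can access TensorBoard at 192.168.0.90:32764. If you cannot reach TensorBoard directly from your laptop, consider running sshuttle on the laptop, for example: sshuttle -r root@41.82.59.51 192.168.0.0/16. Here 41.82.59.51 is the public IP of a node in the cluster, and that IP must be reachable over ssh.
    2. View the result files produced by training. The training results are generated under /root/output.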
      ! tree -I ai-starter -L 3 /root/output
      /root/output
      `-- mnist
          |-- checkpoint
          |-- model.ckpt-4500.data-00000-of-00001
          |-- model.ckpt-4500.index
          |-- model.ckpt-4500.meta
          |-- model.ckpt-4600.data-00000-of-00001
          |-- model.ckpt-4600.index
          |-- model.ckpt-4600.meta
          |-- model.ckpt-4700.data-00000-of-00001
          |-- model.ckpt-4700.index
          |-- model.ckpt-4700.meta
          |-- model.ckpt-4800.data-00000-of-00001
          |-- model.ckpt-4800.index
          |-- model.ckpt-4800.meta
          |-- model.ckpt-4900.data-00000-of-00001
          |-- model.ckpt-4900.index
          |-- model.ckpt-4900.meta
          |-- test
          |   `-- events.out.tfevents.1582797664.tf-mnist-chief-0
          `-- train
              `-- events.out.tfevents.1582797663.tf-mnist-chief-0
      
      3 directories, 18 files 
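      Note: The /root/output/mnist directory contains the checkpoint files produced during training. They record the state of the model's variables, with the latest one (step 4900 here) reflecting the end of training.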
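
The export command in Step 4 takes a checkpoint step number, which can be read off the file names listed above. The following is a minimal sketch, assuming checkpoint index files are named model.ckpt-<step>.index as in the listing; the helper name checkpoint_steps is hypothetical.

```python
import re

# Matches index files such as "model.ckpt-4900.index" (naming taken from the listing above).
_CKPT_RE = re.compile(r"^model\.ckpt-(\d+)\.index$")


def checkpoint_steps(filenames):
    """Return the sorted step numbers that have a checkpoint index file."""
    steps = []
    for name in filenames:
        match = _CKPT_RE.match(name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


# A few names from the listing; in the Notebook this could be os.listdir("/root/output/mnist").
files = [
    "checkpoint",
    "model.ckpt-4500.index",
    "model.ckpt-4900.index",
    "model.ckpt-4900.meta",
    "model.ckpt-4700.index",
]
steps = checkpoint_steps(files)
print(steps[-1])  # 4900
```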

Step 4: Export the model

  1. Convert the checkpoint produced during training into a model file.
    ! arena submit tf \
         --name=export-model \
         --workers=1 \
         --gpus=1 \
         --data=$USER_DATA_NAME:/training \
         --image=tensorflow/tensorflow:1.11.0-gpu-py3 \
         "python /training/models/tensorflow-sample-code/tfjob/docker/mnist/export_model.py \
          --checkpoint_step=4900 \
         --checkpoint_path=/training/output/mnist /training/output/mnist-model/ "
    configmap/export-model-tfjob created
    configmap/export-model-tfjob labeled
    tfjob.kubeflow.org/export-model created
    INFO[0004] The Job export-model has been submitted successfully 
    INFO[0004] You can run `arena get export-model --type tfjob` to check the job status 
  2. Check the status of the export job.
    ! arena get export-model
    STATUS: SUCCEEDED
    NAMESPACE: default
    TRAINING DURATION: 8s
    
    NAME          STATUS     TRAINER  AGE  INSTANCE              NODE
    export-model  SUCCEEDED  TFJOB    26s  export-model-chief-0  192.168.0.95
    After the export job finishes, the exported model files appear in the output/mnist-model directory.
    ! tree -I ai-starter -L 3 /root/output/mnist-model
    /root/output/mnist-model
    `-- 1
        |-- saved_model.pb
        `-- variables
            |-- variables.data-00000-of-00001
            `-- variables.index
    
    2 directories, 3 files
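
The export job writes to /training/output/mnist-model inside its container, while the Notebook lists the same files under /root/output/mnist-model. This follows from --data=$USER_DATA_NAME:/training, which mounts the storage behind the Notebook's /root at /training in the job. The mapping can be sketched as follows; the helper names are hypothetical, and the two roots are the ones used in this tutorial.

```python
from pathlib import PurePosixPath

NOTEBOOK_ROOT = PurePosixPath("/root")       # where the Notebook sees the shared storage
CONTAINER_ROOT = PurePosixPath("/training")  # mount point from --data=$USER_DATA_NAME:/training


def to_container_path(notebook_path: str) -> str:
    """Translate a Notebook path under /root to the path seen by the job."""
    rel = PurePosixPath(notebook_path).relative_to(NOTEBOOK_ROOT)
    return str(CONTAINER_ROOT / rel)


def to_notebook_path(container_path: str) -> str:
    """Translate a job path under /training back to the Notebook path."""
    rel = PurePosixPath(container_path).relative_to(CONTAINER_ROOT)
    return str(NOTEBOOK_ROOT / rel)


print(to_container_path("/root/output/mnist-model"))   # /training/output/mnist-model
print(to_notebook_path("/training/dataset/mnist"))     # /root/dataset/mnist
```

Both helpers raise ValueError for a path outside the expected root, which catches the common mistake of passing a Notebook path to a job.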

Step 5: Model prediction

  1. Deploy the prediction service.
    ! arena serve tensorflow \
        --servingName=mnist \
        --modelName=mnist \
        --image=tensorflow/serving:1.13.0  \
        --data=$USER_DATA_NAME:/training \
        --modelPath=/training/output/mnist-model
    configmap/mnist-tf-serving created
    configmap/mnist-tf-serving labeled
    configmap/mnist-tensorflow-serving-cm created
    service/mnist-tensorflow-serving created
    deployment.extensions/mnist-tensorflow-serving created
  2. Check the prediction service.
    ! arena serve list
    NAME   TYPE        VERSION  STATUS   CLUSTER-IP
    mnist  Tensorflow           RUNNING  172.21.1.144
  3. Define a function that calls the prediction service over HTTP.
    The following code defines a Python function, pick_image_and_predict. It picks a random image from the MNIST test set, sends it as the request payload to the prediction service over HTTP, and gets back the model's prediction. It displays the image with its true label and prints the value inferred by the prediction service, so you can judge whether the model's accuracy meets your requirements.
    import json
    import random

    import matplotlib.pyplot as plt
    import numpy as np
    import requests
    from tensorflow.examples.tutorials.mnist import input_data
    %matplotlib inline

    # Load the MNIST data downloaded in Step 2.
    data_dir = "/root/dataset/mnist/"
    mnist = input_data.read_data_sets(data_dir, one_hot=True)
    test_images = mnist.test.images
    test_labels = mnist.test.labels
    digits = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

    def show(idx, title):
        """Display test image idx with the given title."""
        plt.figure()
        plt.imshow(test_images[idx].reshape(28, 28))
        plt.axis('off')
        plt.title('\n\n{}'.format(title), fontdict={'size': 16})

    def predict(url, num):
        """Send test image num to the model at url and print the predicted digit."""
        test_cls = np.argmax(test_labels, axis=1)  # one-hot labels -> digit indices
        show(num, 'The Picture is {}'.format(test_cls[num]))
        headers = {"content-type": "application/json"}
        data = json.dumps({"signature_name": "predict_images", "dropout/Placeholder": 1.0, "inputs": test_images[num].reshape(1, 784).tolist()})
        # POST to the predict endpoint of the model served at url.
        json_response = requests.post(url + ':predict', data=data, headers=headers)
        scores = json.loads(json_response.text)['outputs']
        predicted_digits_idx = np.argmax(scores)
        print('Predicted digit: {}'.format(digits[predicted_digits_idx]))
        return scores

    def pick_image_and_predict(model_api):
        """Pick a random test image and run it through the prediction service."""
        random_image = random.randint(0, len(test_images) - 1)
        predict(model_api, random_image)
    Extracting /root/dataset/mnist/train-images-idx3-ubyte.gz
    Extracting /root/dataset/mnist/train-labels-idx1-ubyte.gz
    Extracting /root/dataset/mnist/t10k-images-idx3-ubyte.gz
    Extracting /root/dataset/mnist/t10k-labels-idx1-ubyte.gz
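  4. Call the prediction service. [Figure: handwritten digit recognition]
    Note: Here mnist-tensorflow-serving is the service domain name of the prediction service. You can also use the IP address instead; run arena serve list to get the service IP of the prediction service.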
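
The predict function defined earlier appends ':predict' to the URL it receives, so pick_image_and_predict expects the base URL of the model. The sketch below shows one way such a URL and request body could be built. It follows the TensorFlow Serving REST path /v1/models/<model name>, and it assumes the service listens on port 8501, TensorFlow Serving's conventional REST port, which this tutorial does not confirm for the Arena deployment. The helper names are hypothetical.

```python
import json

REST_PORT = 8501  # assumption: conventional TensorFlow Serving REST port; verify for your deployment


def model_api_url(host, model_name, port=REST_PORT):
    """Base URL of a served model; the caller appends ':predict'."""
    return "http://{}:{}/v1/models/{}".format(host, port, model_name)


def predict_payload(pixels, signature_name="predict_images"):
    """JSON body for one flattened 28x28 image (784 floats)."""
    return json.dumps({"signature_name": signature_name, "inputs": [list(pixels)]})


model_api = model_api_url("mnist-tensorflow-serving", "mnist")
print(model_api)  # http://mnist-tensorflow-serving:8501/v1/models/mnist

# In the Notebook, the call might then look like:
# pick_image_and_predict(model_api)
```

To use the service IP instead of the domain name, pass the CLUSTER-IP value reported by arena serve list as host.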

Step 6: Clean up

After the experiment, use the delete commands to remove the training job, the model export job, and the prediction service.

! arena delete $JOB_NAME
! arena delete export-model
! arena serve delete mnist
service "tf-mnist-tensorboard" deleted
deployment.extensions "tf-mnist-tensorboard" deleted
tfjob.kubeflow.org "tf-mnist" deleted
configmap "tf-mnist-tfjob" deleted
INFO[0005] The Job tf-mnist has been deleted successfully 
tfjob.kubeflow.org "export-model" deleted
configmap "export-model-tfjob" deleted
INFO[0003] The Job export-model has been deleted successfully 
configmap "mnist-tensorflow-serving-cm" deleted
service "mnist-tensorflow-serving" deleted
deployment.extensions "mnist-tensorflow-serving" deleted
configmap "mnist-tf-serving" deleted
INFO[0000] The Serving job mnist has been deleted successfully
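
When several jobs and services need to be removed, the same delete commands can be generated from lists of names. This is a minimal sketch that only uses the arena delete and arena serve delete commands shown above; the function names are hypothetical, and the dry-run default prints the commands without running them.

```python
import subprocess


def cleanup_commands(job_names, serving_names):
    """Build the argument lists for deleting jobs and serving jobs."""
    cmds = [["arena", "delete", name] for name in job_names]
    cmds += [["arena", "serve", "delete", name] for name in serving_names]
    return cmds


def run_cleanup(cmds, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=False)


cmds = cleanup_commands(["tf-mnist", "export-model"], ["mnist"])
run_cleanup(cmds)  # prints the three commands used in this step
```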