Deploy an NGC container environment for deep learning development

NVIDIA GPU Cloud (NGC) is a deep learning ecosystem developed by NVIDIA that provides free access to deep learning software stacks for building development environments. This topic walks through deploying an NGC environment on a GPU-accelerated instance, using TensorFlow as an example.

Background information

  • The NGC website provides images of different versions of mainstream deep learning frameworks, such as Caffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, TensorFlow, Theano, and Torch. Select an image to deploy an NGC container environment based on your requirements. This topic uses TensorFlow as an example.

  • Alibaba Cloud provides NGC container images optimized for NVIDIA Pascal GPUs in Alibaba Cloud Marketplace. You can select these images when creating GPU-accelerated instances to quickly deploy NGC container environments with optimized, regularly updated deep learning frameworks.

Limits

NGC environments can be deployed on instances of the following instance families:

  • gn5i, gn6v, gn6i, gn6e, gn7i, gn7e, and gn7s

  • ebmgn6i, ebmgn6v, ebmgn6e, ebmgn7i, ebmgn7e, ebmgn7ex, and sccgn7ex

For more information, see GPU-accelerated compute-optimized instance families.
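
The family check above can be sketched as a small helper. This is only an illustration: it assumes instance type names follow the pattern ecs.<family>[-<spec>].<size>, and the two type names in the example are used purely as sample inputs.

```python
# Instance families listed in the Limits section above.
SUPPORTED_FAMILIES = {
    "gn5i", "gn6v", "gn6i", "gn6e", "gn7i", "gn7e", "gn7s",
    "ebmgn6i", "ebmgn6v", "ebmgn6e", "ebmgn7i", "ebmgn7e", "ebmgn7ex",
    "sccgn7ex",
}


def instance_family(instance_type: str) -> str:
    """Return the family part of an instance type name.

    Assumption: names look like 'ecs.<family>[-<spec>].<size>'.
    """
    parts = instance_type.split(".")
    if len(parts) < 3 or parts[0] != "ecs":
        raise ValueError(f"unexpected instance type format: {instance_type!r}")
    # Drop an optional '-<spec>' suffix such as '-c8g1'.
    return parts[1].split("-")[0]


def supports_ngc(instance_type: str) -> bool:
    """True if the instance type belongs to a family listed above."""
    return instance_family(instance_type) in SUPPORTED_FAMILIES


if __name__ == "__main__":
    for name in ("ecs.gn7i-c8g1.2xlarge", "ecs.g7.large"):
        print(name, supports_ngc(name))
```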

Prerequisites

Note

Before you begin, make sure that you have created an NGC account on the NGC website.

Obtain the URL of the TensorFlow container image:

  1. Log on to the NGC website.

  2. In the search box, enter TensorFlow.

    In the search results, click the TensorFlow container card.

  3. On the TensorFlow page, click the Tags tab and copy the path of the desired TensorFlow container image version.

    In this example, the 22.05-tf1-py3 image URL is nvcr.io/nvidia/tensorflow:22.05-tf1-py3, which you will use to download the image on a GPU-accelerated instance.

    Important

    The CUDA version in the TensorFlow image must be compatible with the GPU driver version of the GPU-accelerated instance. Otherwise, deployment fails. For information about version compatibility, see TensorFlow Release Notes.
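
The image URL copied in step 3 has the form registry/repository:tag. As a rough sketch, the following helper splits it into those parts, which is convenient when scripting the later steps. The names ImageRef and parse_image_ref are hypothetical and not part of any NGC or Docker tooling.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: str


def parse_image_ref(ref: str) -> ImageRef:
    """Split 'registry/repository:tag' into its three parts."""
    name, sep, tag = ref.rpartition(":")
    # A '/' after the last ':' means the colon belongs to a registry port,
    # so the reference carries no tag.
    if not sep or "/" in tag:
        raise ValueError(f"image reference has no tag: {ref!r}")
    registry, _, repository = name.partition("/")
    if not repository:
        raise ValueError(f"image reference has no registry: {ref!r}")
    return ImageRef(registry, repository, tag)


if __name__ == "__main__":
    print(parse_image_ref("nvcr.io/nvidia/tensorflow:22.05-tf1-py3"))
```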

Procedure

The following example uses a gn7i instance to deploy an NGC environment.

  1. Create a GPU-accelerated instance.

    For more information, see Create an instance on the Custom Launch tab. Configure the following key parameters:

    • Region: Select a region that offers GPU-accelerated instances. To check the availability of GPU-accelerated instances in each region, see Instance types available for each region.

    • Instance: Select an instance type. In this example, gn7i is used.

    • Images:

      1. On the Marketplace Images tab, click Select from Alibaba Cloud Marketplace.

      2. In the dialog box that appears, enter NVIDIA GPU Cloud VM Image and click Search.

      3. Find the image and click Use.

    • Public IP Address: Select Assign Public IPv4 Address.

      Note

      If no public IP address is assigned, associate an elastic IP address (EIP) with the instance after it is created. For more information, see Associate one or more EIPs with an instance.

    • Security Group: Select a security group with TCP port 22 enabled. If your instance requires HTTPS or Deep Learning GPU Training System (DIGITS) 6, also enable TCP port 443 for HTTPS or TCP port 5000 for DIGITS 6.
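
The security group rule above can be expressed as a short sketch that lists the required TCP ports and, optionally, probes them from your own machine. The helper names are hypothetical, and the probe only reports whether a TCP connection succeeds, which also depends on a service actually listening on the port.

```python
import socket
from typing import List


def required_ports(https: bool = False, digits: bool = False) -> List[int]:
    """TCP ports to enable in the security group, per the rule above."""
    ports = [22]            # SSH, always required
    if https:
        ports.append(443)   # HTTPS
    if digits:
        ports.append(5000)  # DIGITS 6
    return ports


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(required_ports(https=True, digits=True))
    # To probe a real instance, pass its public IP address, for example:
    # port_is_open("<public-ip-of-your-instance>", 22)
```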

  2. Connect to the instance by using one of the following methods:

    • Workbench: see Connect to a Linux instance by using a password or key.

    • VNC: see Connect to an instance by using VNC.

  3. Run the nvidia-smi command to view information about the current GPU.

    The output shows that the Driver Version is 515.48.07, which is compatible with CUDA 11.7 as required by the 22.05-tf1-py3 TensorFlow image.

    root@xxxszfZ:~# nvidia-smi
    Sun Apr  7 11:01:55 2024
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A10          On   | 00000000:00:07.0 Off |                    0 |
    | N/A   32C    P8    14W / 150W |      2MiB / 23028MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
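
The version check in this step can also be scripted. The sketch below parses the header line of the nvidia-smi output and compares the reported CUDA version, which is the highest CUDA version the installed driver supports, with the version an image needs. The function names are hypothetical, and the required version is passed in by the caller; look it up in the release notes for your image.

```python
import re
from typing import Tuple

# Header line copied from the sample output above.
SAMPLE_HEADER = (
    "| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |"
)

_HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_versions(smi_output: str) -> Tuple[str, str]:
    """Return (driver_version, cuda_version) from nvidia-smi text output."""
    match = _HEADER_RE.search(smi_output)
    if match is None:
        raise ValueError("no driver/CUDA version found in nvidia-smi output")
    return match.group("driver"), match.group("cuda")


def version_tuple(version: str) -> Tuple[int, ...]:
    """Convert '11.7' to (11, 7) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def cuda_at_least(smi_output: str, required_cuda: str) -> bool:
    """True if the driver supports at least the required CUDA version."""
    _, cuda = parse_versions(smi_output)
    return version_tuple(cuda) >= version_tuple(required_cuda)


if __name__ == "__main__":
    print(parse_versions(SAMPLE_HEADER))
    print(cuda_at_least(SAMPLE_HEADER, "11.7"))
```

On an instance, the text passed to parse_versions would typically come from subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout.
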
  4. Run the following command to download the TensorFlow container image:

    docker pull nvcr.io/nvidia/tensorflow:22.05-tf1-py3

    Important

    The TensorFlow container image in this example is about 14.4 GB in size, so the download may take a long time.

  5. Run the following command to view information about the downloaded TensorFlow container image:

    docker image ls

    The TAG column shows 22.05-tf1-py3, confirming that the image was downloaded.

    root@xxx lz:~# docker image ls
    REPOSITORY                    TAG              IMAGE ID       CREATED        SIZE
    nvcr.io/nvidia/tensorflow     22.05-tf1-py3    7c92d95961e9   2 years ago    14.4GB
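
If you want to confirm the download from a script rather than by eye, a minimal sketch is to scan the docker image ls output for a row whose first two columns match the repository and tag. The helper name has_image is hypothetical. The sample text mirrors the output above.

```python
SAMPLE_LS = """\
REPOSITORY                    TAG              IMAGE ID       CREATED        SIZE
nvcr.io/nvidia/tensorflow     22.05-tf1-py3    7c92d95961e9   2 years ago    14.4GB
"""


def has_image(ls_output: str, repository: str, tag: str) -> bool:
    """True if 'docker image ls' output lists repository with the given tag."""
    for line in ls_output.splitlines():
        fields = line.split()
        # REPOSITORY and TAG are the first two whitespace-separated columns.
        if len(fields) >= 2 and fields[0] == repository and fields[1] == tag:
            return True
    return False


if __name__ == "__main__":
    print(has_image(SAMPLE_LS, "nvcr.io/nvidia/tensorflow", "22.05-tf1-py3"))
```
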
  6. Run the following command to deploy the TensorFlow development environment by running the container:

    docker run --gpus all --rm -it nvcr.io/nvidia/tensorflow:22.05-tf1-py3

    On success, output similar to the following is displayed:

    ================
    == TensorFlow ==
    ================
    NVIDIA Release 22.05-tf1 (build 18410160)
    TensorFlow Version 1.15.5
    Container image Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
    Copyright 2017-2020 The TensorFlow Authors.  All rights reserved.
    NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
    Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
    NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
    NOTE: MOFED driver for multi-node communication was not detected.
          Multi-node communication performance may be reduced.
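
To make the options of the docker run command explicit, the following sketch assembles the same command line from its parts. The function docker_run_args is a hypothetical helper; the flags themselves are the ones used in the command above.

```python
import shlex
from typing import List

IMAGE = "nvcr.io/nvidia/tensorflow:22.05-tf1-py3"


def docker_run_args(image: str, gpus: str = "all",
                    remove_on_exit: bool = True,
                    interactive: bool = True) -> List[str]:
    """Build the argument list for the docker run command used above."""
    args = ["docker", "run", "--gpus", gpus]  # expose GPUs to the container
    if remove_on_exit:
        args.append("--rm")  # delete the container when it exits
    if interactive:
        args.append("-it")   # interactive session with a terminal
    args.append(image)
    return args


if __name__ == "__main__":
    print(shlex.join(docker_run_args(IMAGE)))
```

Because --rm deletes the container on exit, any changes made inside it must be committed to an image from a second session before you leave the container.
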
  7. Run the following commands in sequence to run a simple test for TensorFlow:

    python
    import tensorflow as tf
    hello = tf.constant('Hello, TensorFlow!')
    with tf.compat.v1.Session() as sess:
        result = sess.run(hello)
        print(result.decode())
    

    If TensorFlow loads the GPU device as expected, the Hello, TensorFlow! string is printed, as shown in the following output.

    2024-08-21 04:17:14.076698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 20427 MB memory) -> physical GPU (device: 0, name: NVIDIA A10, pci bus id: 0000:00:07.0, compute capability: 8.6)
    2024-08-21 04:17:14.077064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-08-21 04:17:14.077724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 20427 MB memory) -> physical GPU (device: 1, name: NVIDIA A10, pci bus id: 0000:00:08.0, compute capability: 8.6)
    2024-08-21 04:17:14.078021: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-08-21 04:17:14.078688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 20427 MB memory) -> physical GPU (device: 2, name: NVIDIA A10, pci bus id: 0000:00:09.0, compute capability: 8.6)
    2024-08-21 04:17:14.078955: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2024-08-21 04:17:14.079624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 20427 MB memory) -> physical GPU (device: 3, name: NVIDIA A10, pci bus id: 0000:00:0a.0, compute capability: 8.6)
    Hello, TensorFlow!
    >>>
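
To check how many GPUs TensorFlow picked up without reading the log by eye, one option is to scan the log for "Created TensorFlow device" lines. The sketch below assumes the log format shown above; the helper name and the shortened sample lines are illustrative only.

```python
import re
from typing import List, Tuple

_DEVICE_RE = re.compile(
    r"Created TensorFlow device \(\S*/device:GPU:(\d+) with (\d+) MB memory\)"
    r" -> physical GPU \(device: \d+, name: ([^,]+),"
)

# Two lines shortened from the sample output above (timestamps removed).
SAMPLE_LOG = (
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 "
    "with 20427 MB memory) -> physical GPU (device: 0, name: NVIDIA A10, "
    "pci bus id: 0000:00:07.0, compute capability: 8.6)\n"
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 "
    "with 20427 MB memory) -> physical GPU (device: 1, name: NVIDIA A10, "
    "pci bus id: 0000:00:08.0, compute capability: 8.6)\n"
)


def created_gpu_devices(log_text: str) -> List[Tuple[int, int, str]]:
    """Return (gpu_index, memory_mb, gpu_name) for each device in the log."""
    return [
        (int(index), int(memory_mb), name.strip())
        for index, memory_mb, name in _DEVICE_RE.findall(log_text)
    ]


if __name__ == "__main__":
    for device in created_gpu_devices(SAMPLE_LOG):
        print(device)
```
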
  8. Save the modified TensorFlow image.

    1. Open a new remote connection window to the GPU-accelerated instance. Keep the window in which the container is running open.

    2. Run the following command to query the CONTAINER_ID:

      docker ps
      root@xxxg9dZ:~# docker ps
      CONTAINER ID   IMAGE                                        COMMAND                  CREATED          STATUS          PORTS                                NAMES
      f76a5a4347df   nvcr.io/nvidia/tensorflow:22.05-tf1-py3      "/opt/nvidia/nvidia_..."  46 seconds ago   Up 45 seconds   6006/tcp, 6064/tcp, 8888/tcp         reverent_brattain
      68xxxxxxfba    nvcr.io/nvidia/tensorflow:22.05-tf1-py3      "/opt/nvidia/nvidia_..."  9 minutes ago    Up 9 minutes    6006/tcp, 6064/tcp, 8888/tcp         jolly_liskov
    3. Run the following command to save the modified TensorFlow image:

      # Replace CONTAINER_ID with the container ID returned by the docker ps command, such as f76a5a4347df.
      docker commit -m "commit docker" CONTAINER_ID nvcr.io/nvidia/tensorflow:22.05-tf1-py3
      Important

      Make sure that you save the modified TensorFlow image before you exit the container. The container was started with the --rm option and is removed when it exits, so changes that are not committed are lost.
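
The lookup and commit in step 8 can be sketched together as follows. The helper names are hypothetical, the parser assumes the default docker ps table layout in which columns are separated by two or more spaces, and the target image name in the example is a placeholder that you would replace with the name you want to save under.

```python
import re
from typing import List

SAMPLE_PS = """\
CONTAINER ID   IMAGE                                        COMMAND                  CREATED          STATUS          PORTS                                NAMES
f76a5a4347df   nvcr.io/nvidia/tensorflow:22.05-tf1-py3      "/opt/nvidia/nvidia_..."  46 seconds ago   Up 45 seconds   6006/tcp, 6064/tcp, 8888/tcp         reverent_brattain
"""


def container_ids(ps_output: str, image: str) -> List[str]:
    """Return IDs of running containers created from image, in listed order."""
    ids = []
    for line in ps_output.splitlines():
        columns = re.split(r"\s{2,}", line.strip())
        # Column 0 is CONTAINER ID and column 1 is IMAGE.
        if len(columns) >= 2 and columns[1] == image:
            ids.append(columns[0])
    return ids


def commit_args(container_id: str, target_image: str,
                message: str = "commit docker") -> List[str]:
    """Build the argument list for docker commit."""
    return ["docker", "commit", "-m", message, container_id, target_image]


if __name__ == "__main__":
    ids = container_ids(SAMPLE_PS, "nvcr.io/nvidia/tensorflow:22.05-tf1-py3")
    # "<target-image>" is a placeholder, not a real image name.
    print(commit_args(ids[0], "<target-image>"))
```

If several containers were started from the same image, the function returns all of them, and you still need to decide which one holds the changes you want to keep.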