NVIDIA GPU Cloud (NGC) is a deep learning ecosystem developed by NVIDIA that provides free access to deep learning software stacks for building development environments. This topic walks through deploying an NGC environment on a GPU-accelerated instance, using TensorFlow as an example.
Background information
- The NGC website provides images of different versions of mainstream deep learning frameworks, such as Caffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, TensorFlow, Theano, and Torch. Select an image to deploy an NGC container environment based on your requirements. This topic uses TensorFlow as an example.
- Alibaba Cloud provides NGC container images optimized for NVIDIA Pascal GPUs in Alibaba Cloud Marketplace. You can select these images when creating GPU-accelerated instances to quickly deploy NGC container environments with optimized, regularly updated deep learning frameworks.
Limits
NGC environments can be deployed on instances of the following instance families:
- gn5i, gn6v, gn6i, gn6e, gn7i, gn7e, and gn7s
- ebmgn6i, ebmgn6v, ebmgn6e, ebmgn7i, ebmgn7e, ebmgn7ex, and sccgn7ex
For more information, see GPU-accelerated compute-optimized instance families.
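The family list above can be checked programmatically before you create an instance. The following is a minimal sketch, not an official tool: it assumes that instance type names follow the pattern ecs.<family>[-<variant>].<size>, and the type names in the usage lines are only illustrative.

```python
# Instance families listed in the Limits section of this topic.
SUPPORTED_FAMILIES = {
    "gn5i", "gn6v", "gn6i", "gn6e", "gn7i", "gn7e", "gn7s",
    "ebmgn6i", "ebmgn6v", "ebmgn6e", "ebmgn7i", "ebmgn7e", "ebmgn7ex",
    "sccgn7ex",
}


def instance_family(instance_type: str) -> str:
    """Return the family part of an instance type name.

    Assumption: the name looks like 'ecs.<family>[-<variant>].<size>'.
    """
    name = instance_type.strip().lower()
    if name.startswith("ecs."):
        name = name[len("ecs."):]
    # The family is everything before the first '-' or '.'.
    for separator in ("-", "."):
        name = name.split(separator, 1)[0]
    return name


def supports_ngc(instance_type: str) -> bool:
    """Check whether the instance type belongs to a listed family."""
    return instance_family(instance_type) in SUPPORTED_FAMILIES


# Illustrative type names, used here only to exercise the helper.
print(supports_ngc("ecs.gn7i-c8g1.2xlarge"))
print(supports_ngc("ecs.g7.large"))
```

The set is copied from the list above, so it needs to be updated by hand if the supported families change.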
Prerequisites
Before you begin, complete the following tasks:
- Create an NGC account on the NGC website.
- Obtain the URL of the TensorFlow container image from the NGC website:
  - In the search box, enter TensorFlow. In the search results, click the TensorFlow container card.
  - On the TensorFlow page, click the Tags tab and copy the path of the desired TensorFlow container image version.
  In this example, the URL of the 22.05-tf1-py3 image is nvcr.io/nvidia/tensorflow:22.05-tf1-py3. You will use this URL to download the image on a GPU-accelerated instance.
  Important: The CUDA version in the TensorFlow image must be compatible with the GPU driver version of the GPU-accelerated instance. Otherwise, deployment fails. For information about version compatibility, see TensorFlow Release Notes.
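The image URL has three parts: the registry host, the repository path, and the tag that selects the version. The following sketch shows one way to split it, for example to log which version a script is about to pull. It assumes the simple registry/repository:tag form used in this topic and does not handle digests or references without a registry host.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: str


def parse_image_ref(url: str) -> ImageRef:
    """Split 'registry/repository:tag' into its three parts."""
    path, separator, tag = url.rpartition(":")
    # A '/' after the last ':' means the ':' belonged to a port, not a tag.
    if not separator or "/" in tag:
        raise ValueError(f"no tag in image reference: {url!r}")
    registry, _, repository = path.partition("/")
    return ImageRef(registry, repository, tag)


ref = parse_image_ref("nvcr.io/nvidia/tensorflow:22.05-tf1-py3")
print(ref.registry)    # nvcr.io
print(ref.repository)  # nvidia/tensorflow
print(ref.tag)         # 22.05-tf1-py3
```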
Procedure
The following example uses a gn7i instance to deploy an NGC environment.
- Create a GPU-accelerated instance.
  For more information, see Create an instance on the Custom Launch tab. Configure the following key parameters:
  - Region: Select a region that offers GPU-accelerated instances. To check the availability of GPU-accelerated instances in each region, see Instance types available for each region.
  - Instance: Select an instance type. In this example, gn7i is used.
  - Images:
    - On the Marketplace Images tab, click Select from Alibaba Cloud Marketplace.
    - In the dialog box that appears, enter NVIDIA GPU Cloud VM Image and click Search.
    - Find the image and click Use.
  - Public IP Address: Select Assign Public IPv4 Address.
    Note: If no public IP address is assigned, associate an elastic IP address (EIP) with the instance after it is created. For more information, see Associate one or more EIPs with an instance.
  - Security Group: Select a security group with TCP port 22 enabled. If your instance requires HTTPS or Deep Learning GPU Training System (DIGITS) 6, also enable TCP port 443 for HTTPS or TCP port 5000 for DIGITS 6.
- Connect to the instance by using one of the following methods:
  - Workbench
  - VNC
- Run the nvidia-smi command to view information about the current GPU.
  The output shows that the Driver Version is 515.48.07, which is compatible with CUDA 11.7 as required by the 22.05-tf1-py3 TensorFlow image.

      root@xxxszfZ:~# nvidia-smi
      Sun Apr  7 11:01:55 2024
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A10          On   | 00000000:00:07.0 Off |                    0 |
      | N/A   32C    P8    14W / 150W |      2MiB / 23028MiB |      0%      Default |
      |                               |                      |                  N/A |
      +-------------------------------+----------------------+----------------------+

      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+
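The version check in this step can also be scripted. The following sketch reads the Driver Version and CUDA Version fields from the nvidia-smi header and compares the reported CUDA version with the one the image needs. It assumes the header layout shown in the output above, and the required version (11.7) is taken from this topic rather than queried from the image.

```python
import re
from typing import Tuple

# Matches the header line of the nvidia-smi output shown above.
HEADER = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_versions(smi_output: str) -> Tuple[str, str]:
    """Return (driver_version, cuda_version) from nvidia-smi output."""
    match = HEADER.search(smi_output)
    if match is None:
        raise ValueError("no Driver Version / CUDA Version header found")
    return match.group("driver"), match.group("cuda")


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '11.7' into (11, 7) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


sample = "| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |"
driver, cuda = parse_versions(sample)

# CUDA 11.7 is the version this topic states for the 22.05-tf1-py3 image.
required_cuda = (11, 7)
print(driver, cuda, version_tuple(cuda) >= required_cuda)
```

On an instance, the sample string could be replaced with the captured output of the nvidia-smi command.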
- Run the following command to download the TensorFlow container image:

      docker pull nvcr.io/nvidia/tensorflow:22.05-tf1-py3

  Important: Downloading the TensorFlow container image may take a long time.
- Run the following command to view information about the downloaded TensorFlow container image:

      docker image ls

  The TAG column shows 22.05-tf1-py3, confirming that the image was downloaded.

      root@xxx lz:~# docker image ls
      REPOSITORY                  TAG             IMAGE ID       CREATED       SIZE
      nvcr.io/nvidia/tensorflow   22.05-tf1-py3   7c92d95961e9   2 years ago   14.4GB
- Run the following command to deploy the TensorFlow development environment by running the container:

      docker run --gpus all --rm -it nvcr.io/nvidia/tensorflow:22.05-tf1-py3

  On success, output similar to the following is displayed:

      ================
      == TensorFlow ==
      ================

      NVIDIA Release 22.05-tf1 (build 18410160)
      TensorFlow Version 1.15.5

      Container image Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
      Copyright 2017-2020 The TensorFlow Authors. All rights reserved.

      NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

      Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
      NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

      NOTE: MOFED driver for multi-node communication was not detected.
            Multi-node communication performance may be reduced.
- Run the following commands in sequence to run a simple test for TensorFlow:

      python

      import tensorflow as tf
      hello = tf.constant('Hello, TensorFlow!')
      with tf.compat.v1.Session() as sess:
          result = sess.run(hello)
          print(result.decode())

  If TensorFlow loads the GPU device as expected, the Hello, TensorFlow! string is printed, as shown in the following output.

      2024-08-21 04:17:14.076698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 20427 MB memory) -> physical GPU (device: 0, name: NVIDIA A10, pci bus id: 0000:00:07.0, compute capability: 8.6)
      2024-08-21 04:17:14.077064: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
      2024-08-21 04:17:14.077724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 20427 MB memory) -> physical GPU (device: 1, name: NVIDIA A10, pci bus id: 0000:00:08.0, compute capability: 8.6)
      2024-08-21 04:17:14.078021: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
      2024-08-21 04:17:14.078688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 20427 MB memory) -> physical GPU (device: 2, name: NVIDIA A10, pci bus id: 0000:00:09.0, compute capability: 8.6)
      2024-08-21 04:17:14.078955: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
      2024-08-21 04:17:14.079624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 20427 MB memory) -> physical GPU (device: 3, name: NVIDIA A10, pci bus id: 0000:00:0a.0, compute capability: 8.6)
      Hello, TensorFlow!
      >>>
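The "Created TensorFlow device" lines in the log are what confirm that TensorFlow attached to the GPUs. As a rough sketch, the following helper extracts the GPU index, name, and memory from such lines. It assumes the log wording shown in this topic, which may differ in other TensorFlow versions, and the sample contains two of the log lines from the output above.

```python
import re

# Pattern for the "Created TensorFlow device" log lines shown in this topic.
DEVICE = re.compile(
    r"Created TensorFlow device \(\S*/device:GPU:(?P<index>\d+) with "
    r"(?P<mem_mb>\d+) MB memory\) -> physical GPU \(device: \d+, "
    r"name: (?P<name>[^,]+),"
)


def gpus_from_log(log_text):
    """Return a list of (index, name, memory_mb) for each GPU in the log."""
    return [
        (int(m["index"]), m["name"], int(m["mem_mb"]))
        for m in DEVICE.finditer(log_text)
    ]


SAMPLE_LOG = (
    "2024-08-21 04:17:14.076698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] "
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 20427 MB memory) "
    "-> physical GPU (device: 0, name: NVIDIA A10, pci bus id: 0000:00:07.0, compute capability: 8.6)\n"
    "2024-08-21 04:17:14.077724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1351] "
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 20427 MB memory) "
    "-> physical GPU (device: 1, name: NVIDIA A10, pci bus id: 0000:00:08.0, compute capability: 8.6)\n"
)

for index, name, mem_mb in gpus_from_log(SAMPLE_LOG):
    print(f"GPU {index}: {name}, {mem_mb} MB")
```

An empty result for a real log would suggest that TensorFlow fell back to the CPU.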
- Save the modified TensorFlow image.
  - Open a new remote connection window to the GPU-accelerated instance, and leave the container running in the first window.
  - Run the following command to query the CONTAINER_ID:

        docker ps

        root@xxxg9dZ:~# docker ps
        CONTAINER ID   IMAGE                                     COMMAND                    CREATED          STATUS          PORTS                          NAMES
        f76a5a4347df   nvcr.io/nvidia/tensorflow:22.05-tf1-py3   "/opt/nvidia/nvidia_..."   46 seconds ago   Up 45 seconds   6006/tcp, 6064/tcp, 8888/tcp   reverent_brattain
        68xxxxxxfba    nvcr.io/nvidia/tensorflow:22.05-tf1-py3   "/opt/nvidia/nvidia_..."   9 minutes ago    Up 9 minutes    6006/tcp, 6064/tcp, 8888/tcp   jolly_liskov
  - Run the following command to save the modified TensorFlow image:

        # Replace CONTAINER_ID with the container ID returned by the docker ps command, such as f76a5a4347df.
        docker commit -m "commit docker" CONTAINER_ID nvcr.io/nvidia/tensorflow:22.05-tf1-py3

    Important: Make sure that the modified TensorFlow image is saved. Otherwise, changes may be lost the next time you log on to the instance.
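The lookup and commit in this step can be combined in a small script. The following is a sketch under stated assumptions: it calls the docker CLI through subprocess, selects running containers with the docker ps ancestor filter, and commits the first match. The target name my-tensorflow:custom is hypothetical and is used so that the pulled image is not overwritten.

```python
import subprocess
from typing import List


def running_container_ids(image: str) -> List[str]:
    """Return the IDs of running containers that were started from image."""
    result = subprocess.run(
        ["docker", "ps", "--quiet", "--filter", f"ancestor={image}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def commit_command(container_id: str, target_image: str, message: str) -> List[str]:
    """Build the docker commit command line for one container."""
    if not container_id:
        raise ValueError("container_id must not be empty")
    return ["docker", "commit", "-m", message, container_id, target_image]


def save_container(image: str, target_image: str) -> None:
    """Commit the first running container of image as target_image."""
    ids = running_container_ids(image)
    if not ids:
        raise RuntimeError(f"no running container found for {image}")
    subprocess.run(commit_command(ids[0], target_image, "commit docker"), check=True)


# Building the command does not require Docker to be installed.
print(commit_command("f76a5a4347df", "my-tensorflow:custom", "commit docker"))
```

To run the commit on an instance, call save_container("nvcr.io/nvidia/tensorflow:22.05-tf1-py3", "my-tensorflow:custom") while the container is still running. If several containers use the same image, as in the docker ps output above, check which ID holds your changes before you commit.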