Alibaba Cloud Linux 3 with pre-installed NVIDIA GPU driver


Deploy a containerized GPU environment for model training and inference with pre-installed NVIDIA drivers, CUDA, Docker, and NVIDIA Container Toolkit.
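
Once an instance is running from this image, one way to confirm that Docker and NVIDIA Container Toolkit can expose the GPU to a container is to run nvidia-smi inside one. The following is a minimal sketch and is not shipped with the image: the gpu_smoke_test function name and the default image tag are assumptions, so substitute any CUDA base image that your instance can pull.

```shell
# Assumed image tag for illustration; override with IMAGE=... if needed.
IMAGE="${IMAGE:-nvidia/cuda:12.8.0-base-ubuntu22.04}"

# Run nvidia-smi in a throwaway container with all GPUs attached.
# Skips (and succeeds) when docker is not installed on this machine.
gpu_smoke_test() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found; skipping GPU container test"
    return 0
  fi
  docker run --rm --gpus all "$IMAGE" nvidia-smi
}
```

Call gpu_smoke_test on the instance. If the container prints the same GPU list as nvidia-smi on the host, the container runtime is wired up correctly.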

Pre-installed software

The image includes the following drivers and software:

Component                                           Alibaba Cloud Linux 3 (NVIDIA proprietary driver)  Alibaba Cloud Linux 3 (NVIDIA open-source driver)
--------------------------------------------------------------------------------------------------------------------------------------------------------
Kernel version                                      5.10.134-19.2.al8.x86_64                           5.10.134-19.2.al8.x86_64
NVIDIA GPU driver                                   580.126.09                                         580.126.09 (open-source kernel module)
CUDA                                                12.8 (default), 13.0                               13.0 (default), 12.8
cuDNN                                               9.19.1.2                                           9.19.1.2
NCCL                                                v2.29.3-1                                          v2.29.3-1
nccl-test                                           v2.17.9                                            v2.17.9
OpenMPI                                             4.1.3                                              4.1.3
Docker                                              26.1.3                                             26.1.3
NVIDIA Container Toolkit                            1.17.8                                             1.17.8
OFED and eRDMA                                      Supported                                          Supported
keentune (performance tuning; disabled by default)  Supported                                          Supported
Python 3                                            3.6.8                                              3.6.8
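
To compare a running instance against the versions listed above, each tool can be asked for its own version. The sketch below is illustrative only: show_version is a hypothetical helper, and it simply prints the first line of each tool's version output or reports that the tool is not on PATH.

```shell
# Print "label: <first line of output>" for a command, or "label: not found"
# when the command is not installed.
show_version() {
  label=$1
  shift
  if command -v "$1" >/dev/null 2>&1; then
    echo "$label: $("$@" 2>&1 | head -n 1)"
  else
    echo "$label: not found"
  fi
}

show_version "kernel" uname -r
show_version "driver" nvidia-smi --query-gpu=driver_version --format=csv,noheader
show_version "docker" docker --version
show_version "container-toolkit" nvidia-ctk --version
show_version "openmpi" mpirun --version
show_version "python3" python3 --version
```

The CUDA version is left out here because nvcc --version does not report the release on its first line.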

Alibaba Cloud Linux 3 (NVIDIA proprietary driver)

Supported instance families

  • gn7e, gn7s, gn7i, gn6v, gn6i, gn6e, gn5, and gn5i

  • ebmgn7e, ebmgn7i, ebmgn6v, ebmgn6i, and ebmgn6e

  • ebmgn7ix, ebmgn7ex

  • gn8is, ebmgn8is, gn8v, and ebmgn8v

Environment variables

  • /etc/profile.d/nccl.sh

    export NCCL_HOME=/usr/local/nccl
    export LD_LIBRARY_PATH=${NCCL_HOME}/lib:$LD_LIBRARY_PATH
  • /etc/profile.d/openmpi.sh

    export MPI_HOME=/usr/local/openmpi
    export LD_LIBRARY_PATH=${MPI_HOME}/lib:$LD_LIBRARY_PATH
    export PATH=${MPI_HOME}/bin:$PATH
  • /etc/profile.d/cuda.sh

    export PATH=/usr/local/cuda/bin:$PATH
    export CUDA_HOME=/usr/local/cuda
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
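
These profile scripts take effect in login shells. To check whether the current shell has picked them up, you can test each directory against LD_LIBRARY_PATH. This is a minimal sketch; path_contains is a hypothetical helper, not something the image provides.

```shell
# Succeed if DIR ($2) is one of the colon-separated entries in LIST ($1).
path_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Library directories taken from the profile scripts listed above.
for dir in /usr/local/nccl/lib /usr/local/openmpi/lib /usr/local/cuda/lib64; do
  if path_contains "${LD_LIBRARY_PATH:-}" "$dir"; then
    echo "ok: $dir is in LD_LIBRARY_PATH"
  else
    echo "missing: $dir is not in LD_LIBRARY_PATH"
  fi
done
```

If a directory is reported as missing, the shell is probably a non-login shell. Sourcing the corresponding file under /etc/profile.d/ sets the variables for that session.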

Alibaba Cloud Linux 3 (NVIDIA open-source driver)

Supported instance families

  • ebmgn9g, ebmgn9gc, ebmgn9ge, ebmgn9t, gn9g, gn9t, and gn9ge

  • gn8t, gn8te, ebmgn8t, ebmgn8te, ebmgn8ts, and gn8ep

FAQ

Enable the keentune tool

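Run the following commands to enable keentune. They stop and disable tuned, start and enable the keentune services, and apply the ai_common.profile profile. Restart the operating system for the changes to take effect.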

systemctl stop tuned
systemctl disable tuned
systemctl start keentune-target
systemctl enable keentune-target
systemctl enable keentuned
systemctl start keentuned
keentune profile set ai_common.profile

To disable keentune, run keentune profile rollback and restart the operating system.
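
After the restart, the state of the services involved can be checked with systemctl is-active. The following is a sketch under the assumption that the unit names match those in the commands above; report_units is a hypothetical helper.

```shell
# Print "unit: state" for each unit name given as an argument.
# Falls back to a notice on systems without systemctl.
report_units() {
  for unit in "$@"; do
    if command -v systemctl >/dev/null 2>&1; then
      # is-active exits non-zero for inactive units, so ignore the status.
      state=$(systemctl is-active "$unit" 2>/dev/null || true)
      echo "$unit: ${state:-unknown}"
    else
      echo "$unit: systemctl unavailable"
    fi
  done
}

report_units tuned keentune-target keentuned
```

With keentune enabled as described, tuned should not be reported as active, while keentune-target and keentuned should be.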

What should I note when using the Alibaba Cloud Linux 3 image with a pre-installed NVIDIA GPU driver in an ACK cluster?

See "How to create a custom image based on an existing ECS instance and use the image to create nodes" and "Usage notes and instructions for risky operations" in the Container Service for Kubernetes documentation.

Switch CUDA versions

Run the nvcc command to check the current CUDA version.

nvcc --version

The image uses update-alternatives to manage CUDA versions, with /usr/local/cuda pointing to the selected version. To switch versions, for example from CUDA 13.0 to 12.8, use either of the following methods:

  • Run update-alternatives --config cuda to switch interactively.

    update-alternatives --config cuda
    There are 2 choices for the alternative cuda (providing /usr/local/cuda).
    
      Selection    Path                  Priority   Status
    ------------------------------------------------------------
      0            /usr/local/cuda-13.0   20        auto mode
      1            /usr/local/cuda-12.8   10        manual mode
    * 2            /usr/local/cuda-13.0   20        manual mode
    
    Press <enter> to keep the current choice[*], or type selection number: 1  --> Enter the number for the version you want to use.
    update-alternatives: using /usr/local/cuda-12.8 to provide /usr/local/cuda (cuda) in manual mode
  • Use the --set option to set the CUDA version directly.

    update-alternatives --set cuda /usr/local/cuda-12.8
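
To confirm that the switch took effect, check where /usr/local/cuda now points and which release nvcc reports. The following is a minimal sketch; parse_cuda_release is a hypothetical helper that assumes the usual "release X.Y," wording in the nvcc --version output.

```shell
# Read `nvcc --version` output on stdin and print only the release number.
parse_cuda_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

# Show the directory that the cuda alternative currently resolves to.
if [ -e /usr/local/cuda ]; then
  echo "cuda link target: $(readlink -f /usr/local/cuda)"
else
  echo "/usr/local/cuda does not exist"
fi

# Show the release reported by the nvcc found on PATH.
if command -v nvcc >/dev/null 2>&1; then
  echo "nvcc release: $(nvcc --version | parse_cuda_release)"
else
  echo "nvcc not found on PATH"
fi
```

Both lines should name the version you selected. update-alternatives --display cuda also lists the registered versions and the current selection.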