Alibaba Cloud Linux 3 image with a pre-installed NVIDIA GPU driver

Deploy a containerized GPU environment for model training and inference with pre-installed NVIDIA drivers, CUDA, Docker, and NVIDIA Container Toolkit.
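
As a quick check that Docker and the NVIDIA Container Toolkit work together, the following sketch runs nvidia-smi inside a container with all GPUs exposed. The function name gpu_smoke_test is hypothetical, and the image is left as an argument because this page does not name one. It assumes the image can be pulled and that the toolkit makes nvidia-smi available inside the container.

```shell
# gpu_smoke_test IMAGE: run nvidia-smi inside IMAGE with all GPUs exposed.
# Hypothetical helper; IMAGE is whatever container image you choose to test with.
gpu_smoke_test() {
  if [ $# -ne 1 ]; then
    echo "usage: gpu_smoke_test IMAGE" >&2
    return 2
  fi
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found in PATH" >&2
    return 127
  fi
  # --gpus all asks Docker to expose every GPU on the host to the container.
  docker run --rm --gpus all "$1" nvidia-smi
}
```

For example, gpu_smoke_test ubuntu should list the same GPUs that nvidia-smi reports on the host, assuming the ubuntu image is reachable from the instance.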

Pre-installed software

The image includes the following drivers and software:

Component	Alibaba Cloud Linux 3 (NVIDIA proprietary driver)	Alibaba Cloud Linux 3 (NVIDIA open-source driver)
Kernel version	5.10.134-19.2.al8.x86_64	5.10.134-19.2.al8.x86_64
NVIDIA GPU driver	580.126.09	580.126.09 (open-source kernel module)
CUDA	12.8 (default), 13.0	13.0 (default), 12.8
cuDNN	9.19.1.2	9.19.1.2
NCCL	v2.29.3-1	v2.29.3-1
nccl-test	v2.17.9	v2.17.9
OpenMPI	4.1.3	4.1.3
Docker	26.1.3	26.1.3
NVIDIA Container Toolkit	1.17.8	1.17.8
OFED and eRDMA	Supported	Supported
keentune (performance tuning; disabled by default)	Supported	Supported
Python 3	3.6.8	3.6.8
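
To compare a running instance against the table above, a small script can print what each component reports about itself. This is a sketch rather than an official check: show is a hypothetical helper, and cuDNN, NCCL, and nccl-test are left out because this page does not say how to query their versions.

```shell
# show LABEL COMMAND [ARGS...]: print the first line COMMAND outputs,
# or "not found" when COMMAND is not in PATH.
show() {
  local label="$1"
  shift
  if command -v "$1" >/dev/null 2>&1; then
    printf '%-26s %s\n' "$label" "$("$@" 2>&1 | head -n 1)"
  else
    printf '%-26s %s\n' "$label" "not found"
  fi
}

show "Kernel version" uname -r
show "NVIDIA GPU driver" nvidia-smi --query-gpu=driver_version --format=csv,noheader
show "OpenMPI" ompi_info --version
show "Docker" docker --version
show "NVIDIA Container Toolkit" nvidia-ctk --version
show "Python 3" python3 --version
```

If OpenMPI shows as not found, the variables from /etc/profile.d/openmpi.sh are probably not loaded in the current shell.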

Alibaba Cloud Linux 3 (NVIDIA proprietary driver)

Supported instance families

gn7e, gn7s, gn7i, gn6v, gn6i, gn6e, gn5, and gn5i
ebmgn7e, ebmgn7i, ebmgn6v, ebmgn6i, and ebmgn6e
ebmgn7ix and ebmgn7ex
gn8is, ebmgn8is, gn8v, and ebmgn8v

Environment variables

/etc/profile.d/nccl.sh

export NCCL_HOME=/usr/local/nccl
export LD_LIBRARY_PATH=${NCCL_HOME}/lib:$LD_LIBRARY_PATH

/etc/profile.d/openmpi.sh

export MPI_HOME=/usr/local/openmpi
export LD_LIBRARY_PATH=${MPI_HOME}/lib:$LD_LIBRARY_PATH
export PATH=${MPI_HOME}/bin:$PATH

/etc/profile.d/cuda.sh

export PATH=/usr/local/cuda/bin:$PATH
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
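
Non-login shells (for example, some cron jobs or remote commands) may not read /etc/profile.d, so it can help to confirm that the library directories are actually on LD_LIBRARY_PATH. The sketch below sources the three files if they exist and then checks each directory; path_contains is a hypothetical helper name.

```shell
# path_contains LIST DIR: succeed if DIR is one of the colon-separated entries in LIST.
path_contains() {
  case ":$1:" in
    *":$2:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Load the profile scripts when they are readable (login shells do this already).
for f in /etc/profile.d/nccl.sh /etc/profile.d/openmpi.sh /etc/profile.d/cuda.sh; do
  if [ -r "$f" ]; then
    . "$f"
  fi
done

# Report whether each library directory set by those scripts is on the search path.
for dir in /usr/local/nccl/lib /usr/local/openmpi/lib /usr/local/cuda/lib64; do
  if path_contains "${LD_LIBRARY_PATH:-}" "$dir"; then
    echo "ok:      $dir is in LD_LIBRARY_PATH"
  else
    echo "missing: $dir is not in LD_LIBRARY_PATH"
  fi
done
```

Wrapping the list in colons before matching avoids false positives on partial names, such as /usr/local/cuda/lib64-old matching /usr/local/cuda/lib64.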

Alibaba Cloud Linux 3 (NVIDIA open-source driver)

Supported instance families

ebmgn9g, ebmgn9gc, ebmgn9ge, ebmgn9t, gn9g, gn9t, and gn9ge
gn8t, gn8te, ebmgn8t, ebmgn8te, ebmgn8ts, and gn8ep

FAQ

Enable the keentune tool

Run the following commands to enable keentune, then restart the operating system for the changes to take effect.

systemctl stop tuned
systemctl disable tuned
systemctl start keentune-target
systemctl enable keentune-target
systemctl enable keentuned
systemctl start keentuned
keentune profile set ai_common.profile

To disable keentune, run keentune profile rollback and restart the operating system.
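
After the restart, one way to confirm the result is to ask systemd about each unit touched by the commands above. This is a minimal sketch: unit_status is a hypothetical helper name, and the expectation that tuned ends up inactive and disabled while keentune-target and keentuned end up active and enabled is inferred from those commands, not from documented output.

```shell
# unit_status UNIT...: print "UNIT: <active state> / <enabled state>" for each unit.
unit_status() {
  local unit
  for unit in "$@"; do
    if command -v systemctl >/dev/null 2>&1; then
      printf '%s: %s / %s\n' "$unit" \
        "$(systemctl is-active "$unit" 2>/dev/null)" \
        "$(systemctl is-enabled "$unit" 2>/dev/null)"
    else
      printf '%s: systemctl not available\n' "$unit"
    fi
  done
}

unit_status tuned keentune-target keentuned
```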

What should I note when using the Alibaba Cloud Linux 3 image with a pre-installed NVIDIA GPU driver in an ACK cluster?

See "How to create a custom image based on an existing ECS instance and use the image to create nodes" and "Usage notes and instructions for risky operations" in the Container Service for Kubernetes documentation.

Switch CUDA versions

Run the nvcc command to check the current CUDA version.

nvcc --version
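
In scripts, it is often more convenient to reduce that output to the bare release number. The sketch below assumes nvcc prints a line of the form "Cuda compilation tools, release 12.8, ..."; cuda_release is a hypothetical helper name.

```shell
# cuda_release: read `nvcc --version` output on stdin and print MAJOR.MINOR.
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

if command -v nvcc >/dev/null 2>&1; then
  echo "Active CUDA toolkit: $(nvcc --version | cuda_release)"
else
  echo "nvcc not found in PATH" >&2
fi
```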

The image uses update-alternatives to manage CUDA versions. To switch from CUDA 13.0 to 12.8, use either of the following methods:

Run update-alternatives --config cuda to switch interactively.

update-alternatives --config cuda
There are 2 choices for the alternative cuda (providing /usr/local/cuda).

  Selection    Path                  Priority   Status
------------------------------------------------------------
  0            /usr/local/cuda-13.0   20        auto mode
  1            /usr/local/cuda-12.8   10        manual mode
* 2            /usr/local/cuda-13.0   20        manual mode

Press <enter> to keep the current choice[*], or type selection number: 1  --> Enter the number for the version you want to use.
update-alternatives: using /usr/local/cuda-12.8 to provide /usr/local/cuda (cuda) in manual mode

Use the --set option to set the CUDA version directly.

update-alternatives --set cuda /usr/local/cuda-12.8
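
After switching with either method, you can confirm which toolkit /usr/local/cuda now points to by resolving the symbolic link. The following is a sketch under the assumption that the target directory is named cuda-<version>, as in the listing above; active_cuda is a hypothetical helper name.

```shell
# active_cuda [LINK]: print the version suffix of the directory LINK resolves to,
# for example "12.8" when LINK ends up at /usr/local/cuda-12.8.
active_cuda() {
  local link="${1:-/usr/local/cuda}"
  local target
  # readlink -f follows every link in the chain, including the alternatives link.
  target=$(readlink -f "$link") || return 1
  case "$target" in
    *cuda-*) echo "${target##*cuda-}" ;;
    *) return 1 ;;
  esac
}

if version=$(active_cuda); then
  echo "/usr/local/cuda currently resolves to CUDA $version"
else
  echo "/usr/local/cuda does not resolve to a cuda-<version> directory" >&2
fi
```

Running update-alternatives --display cuda shows the same information together with the registered alternatives and their priorities.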