Deploy a containerized GPU environment for model training and inference with pre-installed NVIDIA drivers, CUDA, Docker, and NVIDIA Container Toolkit.
Pre-installed software
The image includes the following drivers and software:
| Component | Alibaba Cloud Linux 3 (NVIDIA proprietary driver) | Alibaba Cloud Linux 3 (NVIDIA open-source driver) |
| --- | --- | --- |
| Kernel version | 5.10.134-19.2.al8.x86_64 | 5.10.134-19.2.al8.x86_64 |
| NVIDIA GPU driver | 580.126.09 | 580.126.09 (open-source kernel module) |
| CUDA | 12.8 (default), 13.0 | 13.0 (default), 12.8 |
| cuDNN | 9.19.1.2 | 9.19.1.2 |
| NCCL | v2.29.3-1 | v2.29.3-1 |
| nccl-test | v2.17.9 | v2.17.9 |
| OpenMPI | 4.1.3 | 4.1.3 |
| Docker | 26.1.3 | 26.1.3 |
| NVIDIA Container Toolkit | 1.17.8 | 1.17.8 |
| OFED and eRDMA | Supported | Supported |
| keentune (performance tuning; disabled by default) | Supported | Supported |
| Python 3 | 3.6.8 | 3.6.8 |
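To compare a running instance against the table above, the versions can be queried from the command line. The following is a minimal sketch, not an official verification procedure: `report` is a hypothetical helper introduced here, and it assumes the listed tools are on `PATH`. Any tool that is missing is reported as "not found" instead of causing an error.

```shell
#!/bin/bash
# report LABEL COMMAND [ARGS...]
# Prints the first output line of COMMAND next to LABEL,
# or "not found" when COMMAND is not installed.
report() {
    local label="$1"
    shift
    if command -v "$1" >/dev/null 2>&1; then
        printf '%-26s %s\n' "$label" "$("$@" 2>&1 | head -n 1)"
    else
        printf '%-26s %s\n' "$label" "not found"
    fi
}

report "Kernel version"           uname -r
report "NVIDIA GPU driver"        nvidia-smi --query-gpu=driver_version --format=csv,noheader
report "CUDA (active directory)"  readlink -f /usr/local/cuda
report "OpenMPI"                  mpirun --version
report "Docker"                   docker --version
report "NVIDIA Container Toolkit" nvidia-ctk --version
report "Python 3"                 python3 --version
```

The CUDA line resolves the `/usr/local/cuda` link rather than calling `nvcc`, so it shows which installed version is currently selected.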
Alibaba Cloud Linux 3 (NVIDIA proprietary driver)
Supported instance families
- gn7e, gn7s, gn7i, gn6v, gn6i, gn6e, gn5, and gn5i
- ebmgn7e, ebmgn7i, ebmgn6v, ebmgn6i, and ebmgn6e
- ebmgn7ix and ebmgn7ex
- gn8is, ebmgn8is, gn8v, and ebmgn8v
Environment variables
- /etc/profile.d/nccl.sh
  export NCCL_HOME=/usr/local/nccl
  export LD_LIBRARY_PATH=${NCCL_HOME}/lib:$LD_LIBRARY_PATH
- /etc/profile.d/openmpi.sh
  export MPI_HOME=/usr/local/openmpi
  export LD_LIBRARY_PATH=${MPI_HOME}/lib:$LD_LIBRARY_PATH
  export PATH=${MPI_HOME}/bin:$PATH
- /etc/profile.d/cuda.sh
  export PATH=/usr/local/cuda/bin:$PATH
  export CUDA_HOME=/usr/local/cuda
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
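Scripts in /etc/profile.d are normally read by login shells, so a non-login session (for example, some cron jobs or service units) may not see these variables. The sketch below is one way to check whether the library directories defined above are present in the current shell. `path_contains` is a hypothetical helper, not something shipped with the image.

```shell
#!/bin/bash
# path_contains DIR LIST
# Returns 0 if DIR is an exact entry in the colon-separated LIST.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Library directories taken from the profile scripts listed above.
for dir in /usr/local/nccl/lib /usr/local/openmpi/lib /usr/local/cuda/lib64; do
    if path_contains "$dir" "${LD_LIBRARY_PATH:-}"; then
        echo "ok:      $dir is on LD_LIBRARY_PATH"
    else
        echo "missing: $dir is not on LD_LIBRARY_PATH"
    fi
done
```

If a directory is reported as missing, sourcing the corresponding file (for example, `. /etc/profile.d/cuda.sh`) in that session should add it.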
Alibaba Cloud Linux 3 (NVIDIA open-source driver)
Supported instance families
- ebmgn9g, ebmgn9gc, ebmgn9ge, ebmgn9t, gn9g, gn9t, and gn9ge
- gn8t, gn8te, ebmgn8t, ebmgn8te, ebmgn8ts, and gn8ep
FAQ
Enable the keentune tool
Enable keentune with the following commands. Restart the operating system for changes to take effect.
systemctl stop tuned
systemctl disable tuned
systemctl start keentune-target
systemctl enable keentune-target
systemctl enable keentuned
systemctl start keentuned
keentune profile set ai_common.profile
To disable keentune, run keentune profile rollback and restart the operating system.
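After the restart, the unit states give a quick indication of whether the change took effect. This is a minimal sketch that assumes the unit names used in the commands above (tuned, keentune-target, keentuned). `unit_state` is a hypothetical helper, and it prints "unknown" when systemd cannot be queried.

```shell
#!/bin/bash
# unit_state UNIT
# Prints the state reported by "systemctl is-active" (for example
# active, inactive, or failed), or "unknown" if nothing is reported.
unit_state() {
    local state
    state="$(systemctl is-active "$1" 2>/dev/null)" || true
    printf '%s\n' "${state:-unknown}"
}

for unit in tuned keentune-target keentuned; do
    printf '%-18s %s\n' "$unit" "$(unit_state "$unit")"
done
```

With keentune enabled as described above, one would expect tuned to be inactive and the two keentune units to be active.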
What should I note when using the Alibaba Cloud Linux 3 image with a pre-installed NVIDIA GPU driver in an ACK cluster?
See "How to create a custom image based on an existing ECS instance and use the image to create nodes" and "Usage notes and instructions for risky operations" in the Container Service for Kubernetes documentation.
Switch CUDA versions
Run the nvcc command to check the current CUDA version.
nvcc --version
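In scripts, it can be convenient to reduce the nvcc output to just the release number. The sketch below assumes the output contains a line of the form "release X.Y", which is how nvcc reports its version. `cuda_release` is a hypothetical helper name.

```shell
#!/bin/bash
# cuda_release
# Reads "nvcc --version" output on stdin and prints only the
# release number (for example 12.8) from the "release X.Y" line.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Only query nvcc when it is actually installed.
if command -v nvcc >/dev/null 2>&1; then
    echo "Active CUDA release: $(nvcc --version | cuda_release)"
fi
```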
The image uses update-alternatives to manage CUDA versions. To switch from CUDA 13.0 to 12.8, use either of the following methods:
- Run update-alternatives --config cuda to switch interactively.
  update-alternatives --config cuda
  There are 2 choices for the alternative cuda (providing /usr/local/cuda).

    Selection    Path                  Priority   Status
  ------------------------------------------------------------
    0            /usr/local/cuda-13.0   20        auto mode
    1            /usr/local/cuda-12.8   10        manual mode
  * 2            /usr/local/cuda-13.0   20        manual mode

  Press <enter> to keep the current choice[*], or type selection number: 1    --> Enter the number for the version you want to use.
  update-alternatives: using /usr/local/cuda-12.8 to provide /usr/local/cuda (cuda) in manual mode
- Use the --set option to set the CUDA version directly.
  update-alternatives --set cuda /usr/local/cuda-12.8
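For provisioning scripts, the --set method can be wrapped with a check that the target directory exists and a confirmation of the result. This is a sketch under stated assumptions: `switch_cuda` and the `CUDA_ROOT` override are hypothetical names introduced here, the install directories follow the /usr/local/cuda-<version> layout shown above, and the function must be run as root.

```shell
#!/bin/bash
# switch_cuda VERSION
# Points the "cuda" alternative at <root>/cuda-VERSION and prints the
# directory that <root>/cuda resolves to afterwards.
# CUDA_ROOT is a hypothetical override; it defaults to /usr/local.
switch_cuda() {
    local version="$1"
    local root="${CUDA_ROOT:-/usr/local}"
    local target="$root/cuda-$version"

    # Refuse to switch to a version that is not installed.
    if [ ! -d "$target" ]; then
        echo "error: $target does not exist" >&2
        return 1
    fi

    update-alternatives --set cuda "$target" || return 1
    echo "cuda now resolves to: $(readlink -f "$root/cuda")"
}

# Example (requires root):
#   switch_cuda 12.8
```

Running nvcc --version afterwards confirms that the toolchain on PATH matches the selected version.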