In large-model AI parallel computing, you can optimize performance by reducing communication overhead, overlapping computation with communication, and improving communication efficiency. Configure high-performance networks on Lingjun resources to achieve these goals.
Limitations
These configurations apply only to training jobs that use Lingjun resources.
Configure high-performance network variables
PAI on Lingjun resources has RDMA enabled and optimal NCCL environment variables pre-configured. You can use these defaults or adjust them based on your training framework, communication framework, and model.
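One way to adjust individual variables while keeping the pre-configured defaults is to set a variable only when it is not already present. The following is a minimal sketch, assuming a Python training entry point. The helper name `apply_nccl_overrides` and the values in the usage note are hypothetical placeholders, not recommended settings. The overrides should be applied before the communication backend is initialized.

```python
import os


def apply_nccl_overrides(overrides, env=None, force=False):
    """Apply NCCL settings, keeping values that are already set unless force=True.

    `env` defaults to the process environment; pass a plain dict to experiment
    without touching it. Returns the variables that were actually changed.
    """
    env = os.environ if env is None else env
    changed = {}
    for name, value in overrides.items():
        # Respect a pre-configured value unless the caller explicitly forces it.
        if force or name not in env:
            env[name] = str(value)
            changed[name] = str(value)
    return changed
```

For example, `apply_nccl_overrides({"NCCL_DEBUG": "INFO"})` enables detailed logs only if the platform has not already set a log level, while `force=True` replaces the existing value.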
Default variables
The following table lists the pre-configured default variables for different Lingjun specifications.
| Lingjun specification | NCCL variable |
| --- | --- |
For details about each NCCL environment variable, see the Environment variables section.
Environment variables
The following table describes key NCCL environment variables. For other variables, see the NCCL documentation.
| Variable | Description |
| --- | --- |
| NCCL_IB_TC | The InfiniBand traffic class. The value must match the Alibaba Cloud network mapping rules. Incorrect or missing values degrade performance. |
| NCCL_IB_GID_INDEX | The Global ID (GID) index used in RoCE mode. Missing or incorrect values cause an NCCL error. |
| NCCL_SOCKET_IFNAME | The network interface used to establish connections, which varies by Lingjun specification. Missing or incorrect values can cause NCCL connection failures. |
| NCCL_DEBUG | The log level. Set this to INFO for detailed NCCL logs that simplify troubleshooting. |
| NCCL_IB_HCA | The network interface controller (NIC) for RDMA communication. The required value varies by compute node. Incorrect or missing values degrade performance. |
| NCCL_IB_TIMEOUT | The RDMA connection timeout. Increasing it improves fault tolerance. Incorrect or missing values can interrupt training jobs. |
| NCCL_IB_QPS_PER_CONNECTION | The number of Queue Pairs (QPs) per connection. Increasing it can improve throughput. |
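Because several of these variables cause errors or degraded performance when they are missing, it can help to log which ones are set at the start of a job. The following is a minimal sketch; the helper name `report_nccl_env` is hypothetical, and the check only detects missing or empty values, not incorrect ones.

```python
import os

# The variables described in the table above.
KEY_NCCL_VARS = (
    "NCCL_IB_TC",
    "NCCL_IB_GID_INDEX",
    "NCCL_SOCKET_IFNAME",
    "NCCL_DEBUG",
    "NCCL_IB_HCA",
    "NCCL_IB_TIMEOUT",
    "NCCL_IB_QPS_PER_CONNECTION",
)


def report_nccl_env(env=None):
    """Split the key NCCL variables into those that are set and those that are not.

    Returns (present, missing): a dict of name -> value, and a list of names
    that are unset or empty.
    """
    env = os.environ if env is None else env
    present = {name: env[name] for name in KEY_NCCL_VARS if env.get(name)}
    missing = [name for name in KEY_NCCL_VARS if not env.get(name)]
    return present, missing
```

Calling `report_nccl_env()` with no arguments inspects the current process environment. Printing the result alongside `NCCL_DEBUG=INFO` logs gives a quick first check when a job fails to connect.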
Configure an image
For training jobs on Lingjun resources, you can use an official DLC image or a custom image.
Official images
Three official GPU training images are available: deepspeed-training:23.06-gpu-py310-cu121-ubuntu22.04, megatron-training:23.06-gpu-py310-cu121-ubuntu22.04, and nemo-training:23.06-gpu-py310-cu121-ubuntu22.04. All are hosted in the China (Ulanqab) region and use Ubuntu 22.04, Python 3.10, and CUDA 12.1. They include PyTorch 2.1, Megatron-LM 23.06, DeepSpeed 0.9.5, Transformers 4.29.2, and NeMo 1.19.0.
Custom image
Environment requirements
- CUDA >= 11.2
- NCCL >= 2.12.10
- Python 3
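The requirements above can be checked programmatically inside a custom image. The following is a minimal sketch that compares version strings numerically rather than as text, so that 2.9.9 is correctly treated as older than 2.12.10. The helper names are hypothetical. How the version strings are obtained depends on the image; in a PyTorch image, `torch.version.cuda` is one possible source for the CUDA version, which is an assumption about your environment.

```python
import sys


def parse_version(text):
    """Turn '2.12.10' into (2, 12, 10); a non-numeric suffix such as '+cu121' is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Return True if version string `found` is at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


def check_requirements(cuda_version, nccl_version, python_major=sys.version_info[0]):
    """Evaluate each documented requirement and return a dict of name -> bool."""
    return {
        "CUDA >= 11.2": meets_minimum(cuda_version, "11.2"),
        "NCCL >= 2.12.10": meets_minimum(nccl_version, "2.12.10"),
        "Python 3": python_major == 3,
    }
```

A result in which any value is `False` indicates that the image does not meet the corresponding requirement.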
Install RDMA library
For custom images, you must manually install the RDMA library. Add the following commands to your Dockerfile:
RUN apt-get update && \
    apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends libnl-3-dev libnl-route-3-dev libnl-3-200 libnl-route-3-200 iproute2 udev dmidecode ethtool && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
RUN cd /tmp/ && \
    wget http://pythonrun.oss-cn-zhangjiakou.aliyuncs.com/rdma/nic-libs-mellanox-rdma-5.2-2/nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    tar xzvf nic-lib-rdma-core-installer-ubuntu.tar.gz && \
    cd nic-lib-rdma-core-installer-ubuntu && \
    echo Y | /bin/bash install.sh && \
    cd .. && \
    rm -rf nic-lib-rdma-core-installer-ubuntu && \
    rm -f nic-lib-rdma-core-installer-ubuntu.tar.gz
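After a container built from the custom image starts on a compute node, one way to confirm that RDMA devices are visible is to list the entries that the Linux RDMA subsystem exposes under `/sys/class/infiniband`. The following is a minimal sketch; the helper name `list_rdma_devices` is hypothetical, and the check is meaningful only at run time on a node with RDMA NICs, not during the image build.

```python
import os


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return the sorted RDMA device names under `sysfs_root`, or [] if it does not exist."""
    if not os.path.isdir(sysfs_root):
        return []
    return sorted(os.listdir(sysfs_root))
```

An empty list suggests that no RDMA device is visible inside the container, in which case the value of NCCL_IB_HCA cannot match any device.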
Related topics
To learn how to submit training jobs on Lingjun resources, see Create a training job.