PCCL: NCCL API and Environment Variable Support (v2.0)


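1. Overview

This document summarizes the differences between PCCL and NCCL 2.22.3 in APIs and environment variables, and notes the PCCL version in which each item became supported. It also gives recommended environment variable settings for model training and inference.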

2. API Summary

2.1 NCCL API Compatibility List

API name

NCCL

PCCL

PCCL supported version

Notes

ncclGetVersion

ncclGetUniqueId

ncclCommInitRank

ncclCommInitAll

ncclCommDestroy

ncclCommAbort

ncclGetErrorString

ncclGetLastError

ncclCommGetAsyncError

ncclCommCount

ncclCommCuDevice

ncclCommUserRank

ncclReduce

ncclBroadcast

ncclAllReduce

ncclReduceScatter

ncclAllGather

ncclSend

ncclRecv

ncclGroupStart

ncclGroupEnd

ncclCommInitRankConfig

Since PCCL 1.4.0

ncclCommFinalize

Since PCCL 1.4.0

ncclCommSplit

Since PCCL 1.4.0

PCCL 1.4.0 supports only the 'splitShare=0' and 'blocking' configs from pcclConfig_t; fully supported since PCCL 1.5.0

ncclCommRegister

Since PCCL 1.4.0

API-compatible only; has no effect due to a hardware limitation

ncclCommDeregister

Since PCCL 1.4.0

API-compatible only; has no effect due to a hardware limitation

ncclMemAlloc

Since PCCL 1.4.0

Falls back to cudaMalloc() due to a hardware limitation

ncclMemFree

Since PCCL 1.4.0

Falls back to cudaFree() due to a hardware limitation

ncclCommWindowRegister

Since PCCL 1.7.0

ncclCommWindowDeregister

Since PCCL 1.7.0

ncclCommInitRankScalable

Since PCCL 1.7.0

ncclRedOpCreatePreMulSum

ncclRedOpDestroy

ncclGroupSimulateEnd

ncclCommRevoke

ncclCommShrink

ncclAlltoAll

ncclGather

ncclScatter

ncclResetDebugInit

Deprecated and will be removed in the future

2.2 PCCL Extension API List

N/A

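3. Environment Variable Summary

3.1 NCCL Environment Variable Compatibility List

Note: This list is compiled from the official NCCL documentation. A few environment variables that appear in the NCCL source code but not in the official documentation are not included for now.

Environment variable name

NCCL

PCCL

PCCL supported version

Notes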

NCCL_SOCKET_IFNAME

NCCL_SOCKET_FAMILY

NCCL_SOCKET_NTHREADS

NCCL_NSOCKS_PERTHREAD

NCCL_CROSS_NIC

NCCL_IB_HCA

NCCL_IB_TIMEOUT

NCCL_IB_RETRY_CNT

NCCL_IB_GID_INDEX

NCCL_IB_ADDR_FAMILY

NCCL_IB_ADDR_RANGE

NCCL_IB_ROCE_VERSION_NUM

NCCL_IB_PKEY

NCCL_IB_USE_INLINE

NCCL_IB_SL

NCCL_IB_TC

NCCL_TUNER_PLUGIN

NCCL_IGNORE_CPU_AFFINITY

NCCL_DEBUG

NCCL_DEBUG_SUBSYS

NCCL_COLLNET_ENABLE

NCCL_TOPO_FILE

NCCL_TOPO_DUMP_FILE

NCCL_P2P_DISABLE

NCCL_P2P_LEVEL

NCCL_SHM_DISABLE

NCCL_BUFFSIZE

NCCL_NTHREADS

NCCL_MAX_NCHANNELS

NCCL_MIN_NCHANNELS

NCCL_CHECK_POINTERS

NCCL_LAUNCH_MODE

NCCL_IB_DISABLE

NCCL_IB_AR_THRESHOLD

NCCL_IB_QPS_PER_CONNECTION

NCCL_NET_GDR_LEVEL (formerly NCCL_IB_GDR_LEVEL)

NCCL_NET_SHARED_BUFFERS

NCCL_NET_SHARED_COMMS

NCCL_ALGO

NCCL_PROTO

NCCL_NVB_DISABLE

NCCL_PXN_DISABLE

NCCL_P2P_PXN_LEVEL

NCCL_NET_GDR_READ

NCCL_MIN_P2P_NCHANNELS

NCCL_MAX_P2P_NCHANNELS

NCCL_TOPO_DUMP_FILE_RANK

NCCL_GRAPH_DUMP_FILE_RANK

NCCL_PROXY_DUMP_SIGNAL

NCCL_GDRCOPY_SYNC_ENABLE

NCCL_GDRCOPY_FLUSH_ENABLE

NCCL_GDR_FLUSH_DISABLE

NCCL_LL128_NTHREADS

NCCL_P2P_READ_ENABLE

NCCL_P2P_DIRECT_DISABLE

Since PCCL 1.4.0

NCCL_NET_PLUGIN

Since PCCL 1.4.0

NCCL_COMM_BLOCKING

Since PCCL 1.4.0

NCCL_LOCAL_REGISTER

Since PCCL 1.4.0

NCCL_REPORT_CONNECT_PROGRESS

Since PCCL 1.4.0

NCCL_SET_STACK_SIZE

Since PCCL 1.4.3

NCCL_NET

Since PCCL 1.5.0

NCCL_U2MM_LOG_LEVEL

Since PCCL 1.5.0

Support for EIC NIC platforms

NCCL_P2P_USE_CUDA_MEMCPY

Since PCCL 1.5.0

NCCL_COMM_SPLIT_SHARE_RESOURCES

Since PCCL 1.5.0

NCCL_IB_SPLIT_DATA_ON_QPS

Since PCCL 1.6.0

NCCL_GDRCOPY_ENABLE

Since PCCL 1.6.1

NCCL_GDRCOPY_FIFO_ENABLE

Since PCCL 1.6.1

NCCL_RAS_ENABLE

Since PCCL 1.7.0

NCCL_RAS_ADDR

Since PCCL 1.7.0

NCCL_RAS_TIMEOUT_FACTOR

Since PCCL 1.7.0

NCCL_SOCKET_RETRY_SLEEP_MSEC

Since PCCL 1.7.0

NCCL_GRAPH_REGISTER

Since PCCL 1.7.0

NCCL_WIN_ENABLE

Since PCCL 1.7.0

NCCL_WIN_STRIDE

Since PCCL 1.7.0

NCCL_SOCKET_RETRY_CNT

Since PCCL 1.7.0

NCCL_UID_STAGGER_THRESHOLD

Since PCCL 1.7.0

NCCL_UID_STAGGER_RATE

Since PCCL 1.7.0

NCCL_CUMEM_ENABLE

Since PCCL 1.7.0

NCCL_LEGACY_CUDA_REGISTER

Since PCCL 1.7.0

NCCL_CREATE_THREAD_CONTEXT

Since PCCL 1.7.0

NCCL_CTA_POLICY

Since PCCL 1.7.0

NCCL_CONF_FILE

Since PCCL 1.7.0

NCCL_SYM_CTAS

Since PCCL 1.7.0

NCCL_SYM_KERNEL

Since PCCL 1.7.0

NCCL_COLLNET_NODE_THRESHOLD

NCCL_NVB_PRECONNECT

NCCL_GRAPH_HELPER_DISABLE

NCCL_SET_THREAD_NAME

NCCL_IB_PCI_RELAXED_ORDERING

NCCL_IGNORE_DISABLED_P2P

NCCL_NCHANNELS_PER_NET_PEER

NCCL_GRAPH_MIXING_SUPPORT

NCCL_DMABUF_ENABLE

NCCL_SHM_USE_CUDA_MEMCPY

NCCL_SHM_MEMCPY_MODE

NCCL_SHM_LOCALITY

NCCL_P2P_NET_CHUNKSIZE

NCCL_NET_DISABLE_INTRA

NCCL_P2P_LL_THRESHOLD

NCCL_ALLOC_P2P_NET_LL_BUFFERS

NCCL_NET_FORCE_FLUSH

NCCL_IB_ADAPTIVE_ROUTING

NCCL_MEM_SYNC_DOMAIN

NCCL_CGA_CLUSTER_SIZE

NCCL_P2P_PCI_CHUNKSIZE

NCCL_P2P_NVL_CHUNKSIZE

NCCL_PROXY_APPEND_BATCH_SIZE

NCCL_PROGRESS_APPENDOP_FREQ

NCCL_L1_SHARED_MEMORY_CARVEOUT

NCCL_MAX_CTAS

NCCL_MIN_CTAS

NCCL_NVLS_ENABLE

NCCL_NVLS_NCHANNELS

NCCL_NET_OVERHEAD

NCCL_CUMEM_HOST_ENABLE

Since NCCL 2.23

NCCL_IB_MERGE_VFS

NCCL_CONNECT_ROUND_MAX_PEERS

NCCL_IB_MERGE_NICS

Since NCCL 2.20

NCCL_MNNVL_ENABLE

Since NCCL 2.21

NCCL_NVLS_CHUNKSIZE

NCCL_CHUNK_SIZE

NCCL_NVLSTREE_MAX_CHUNKSIZE

NCCL_UNPACK_DOUBLE_NCHANNELS

NCCL_RUNTIME_CONNECT

Since NCCL 2.22

NCCL_WORK_FIFO_BYTES

NCCL_WORK_ARGS_BYTES

NCCL_IB_ROUTABLE_FLID_GID_INDEX

NCCL_IB_FIFO_TC

Since NCCL 2.22

NCCL_IB_RETURN_ASYNC_EVENTS

Since NCCL 2.23

NCCL_OOB_NET_ENABLE

Since NCCL 2.23

NCCL_OOB_NET_IFNAME

Since NCCL 2.23

NCCL_IB_ECE_ENABLE

Since NCCL 2.23

NCCL_PROFILER_PLUGIN

NCCL_ENV_PLUGIN

NCCL_DEBUG_TIMESTAMP_FORMAT

NCCL_DEBUG_TIMESTAMP_LEVELS

NCCL_NETDEVS_POLICY

NCCL_NET_GDR_C2C

NCCL_PXN_C2C

NCCL_MNNVL_UUID

NCCL_MNNVL_CLIQUE_ID

NCCL_LAUNCH_ORDER_IMPLICIT

NCCL_LAUNCH_RACE_FATAL
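
The "Since PCCL x.y.z" entries above can be checked in a launch script before a variable is exported. The following is a minimal sketch, not part of PCCL: the helper names are hypothetical, the lookup covers only four entries copied from the list above, and the installed PCCL version is assumed to be supplied by the caller, since this document does not describe how to query it. It relies on `sort -V` (GNU coreutils) for dotted-version comparison.

```shell
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum PCCL version for a few variables, copied from the list above.
# Returns non-zero for variables that are not in this small lookup.
min_pccl_version_for() {
  case "$1" in
    NCCL_SET_STACK_SIZE) echo "1.4.3" ;;
    NCCL_NET)            echo "1.5.0" ;;
    NCCL_GDRCOPY_ENABLE) echo "1.6.1" ;;
    NCCL_RAS_ENABLE)     echo "1.7.0" ;;
    *) return 1 ;;
  esac
}

# var_supported NAME INSTALLED_VERSION
# Exit status: 0 = supported, 1 = PCCL version too old, 2 = not in the lookup.
var_supported() {
  _min="$(min_pccl_version_for "$1")" || return 2
  version_ge "$2" "$_min"
}
```

For example, `var_supported NCCL_RAS_ENABLE 1.6.1` fails with status 1, because the list gives 1.7.0 as the first PCCL version supporting that variable.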

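3.2 PCCL Extension Environment Variable List

PCCL-specific environment variables are named with the PCCL_ prefix so that they are easy to tell apart from NCCL ones.

Environment variable name

PCCL supported version

Notes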

PCCL_ICN_SCALE_OUT_DISABLE

PCCL_NET_BYPASS

PCCL_FORCE_ICN_SCALE_OUT

PCCL_DISABLE_ADJUST_SYSTEM

PCCL_DUMP_CONN_INFO

PCCL_CU_MASK

PCCL_BLOCKS_PER_FC_LINK

PCCL_P2P_NCHANNELS_FACTOR_NPATHS

PCCL_DISABLE_ABORT

PCCL_CHANNEL_SHUFFLE

PCCL_DUMP_WORK_ELEMS

Dump each active ncclWork's ncclWorkElem information within the channel

PCCL_REVERSE_RING_INTERLEAVE

PCCL_ONEBATCH_PER_CE

PCCL_IB_DISABLE_ACC_BONDING_PORT

PCCL_BOOTSTRAP_DEBUG

PCCL_USE_STATIC_LAUNCH_GRID

PCCL_RING_DUMP_FILE

PCCL_DISABLE_TORUS_GRAPH

PCCL_CE_MASK

PCCL_GRAPH_TREE_PATTERN

PCCL_PROXY_DUMP_FILE

PCCL_DISP_MODE

PCCL_QUADRUPLE_CHANNELS

PCCL_PER_COMM_HASH_DEBUG_FILE

PCCL_COMM_DUMP_SIGNAL

PCCL_GLOBAL_TOPO_FILE

PCCL_PERFMODEL_SPEEDUP

PCCL_IGNORE_TOPO_DETECTION

PCCL_SINGLE_HOST_CROSS_NODE_TEST

PCCL_BLOCKING_BOOTSTRAP

PCCL_NO_MEMSET

PCCL_GET_MIN_PATH_NUM_BY_DRIVER

PCCL_MULTI_HOSTS_BULK_MODE

PCCL_GROUP_HGGC_STREAM

PCCL_DISABLE_FC_GRAPH

PCCL_ADJUST_SYSTEM_MIN_NET_NUM

PCCL_TB_PER_CU

PCCL_TRAVERSE_Y

PCCL_IGNORE_INTEL_P2P_OVERHEAD

PCCL_GRAPH_SEARCH_START_INTER_TYPE

Configures the starting inter-node type for the cross-node search (for example, PXN) to speed up graph search

PCCL_INJECTION64_PATH

PCCL_DEBUG_DEV

PCCL_ERROR_YIELD

PCCL_GRAPH_SPEEDS

PCCL_COMM_DUMP_FILE

PCCL_ENABLE_ABORT_DUMP

PCCL_ENABLE_EXT_KERNEL

Since PCCL 1.4.0

PCCL_EXT_KERNEL_PLUGIN

Since PCCL 1.4.0

PCCL_COMM_DUMP_LEVEL

Since PCCL 1.4.0

PCCL_STATE_MONITOR_DISABLE

Since PCCL 1.4.0

PCCL feature to monitor kernel execution state

PCCL_STATE_MONITOR_TRIGGER_PIPE

Since PCCL 1.4.0

PCCL_STATE_MONITOR_SLEEP_US

Since PCCL 1.4.0

PCCL_STATE_MONITOR_QUIT_WHEN_EXCEPTION

Since PCCL 1.4.0

PCCL_STATE_MONITOR_DUMP_WHEN_EXCEPTION_DISABLE

Since PCCL 1.4.0

PCCL_STATE_SYNC_POLL_INTERVAL_MS

Since PCCL 1.4.0

PCCL_STATE_MONITOR_LOG_EVERY_MS

Since PCCL 1.4.0

PCCL_GDR_USE_DEV_MEM_FOR_RX_TAIL

Since PCCL 1.4.0

Use device memory for tail pointer during GDR path communication

PCCL_GDR_CPU_FLUSH

Since PCCL 1.4.0

Use CPU memory PCI-e read for GDR flush operation

PCCL_DEBUG_DUMP_DIR

Since PCCL 1.4.3

Configures the debug dump folder path

PCCL_IB_NIC_SPEED_SCALING_FACTOR

Since PCCL 1.4.3

Configures an IB NIC bandwidth scaling factor for cases where bonding information is missing and the real speed does not match the speed reported by the system

PCCL_NET_AFFINITY

Since PCCL 1.5.0

Configure PPU affinity NET id, format: "hgml_dev_id:net_id,hgml_dev_id:net_id"

PCCL_CHECK_ITEMS_ENABLE

Since PCCL 1.5.0

Configures whether to enable internal checks for potential topology or environment problems

PCCL_DOUBLE_EIC_PCI_WIDTH_DISABLE

Since PCCL 1.5.0

Set to 0 when running on an EIC platform; the default is 0

PCCL_STATE_MONITOR_LEVEL

Since PCCL 1.5.0

Configure the monitor level

PCCL_D2D_DISP_PING_PONG_EN

Since PCCL 1.6.0

PCCL_JUMP_DIM

Since PCCL 1.6.0

PCCL_LAST_ROUND_EN

Since PCCL 1.6.0

PCCL_ONE_BATCH_PER_CE

Since PCCL 1.6.0

PCCL_SKEW_EN

Since PCCL 1.6.0

PCCL_TRAVERSE_Y_STEP

Since PCCL 1.6.0

PCCL_FB_STALL_ST_TYPE

Since PCCL 1.6.0

PCCL_TRAVERSE_X_STEP

Since PCCL 1.6.0

PCCL_LAST_ROUND_TB_PER_CE

Since PCCL 1.6.0

PCCL_TUNING_FALL_BACK

Since PCCL 1.6.0

PCCL_FB_STALL_ST_EN

Since PCCL 1.6.0

PCCL_FORCE_DEFAULT_DISP

Since PCCL 1.6.0

PCCL_START_CE_ID

Since PCCL 1.6.0

PCCL_LAST_ROUND_ONETB_PER_CE

Since PCCL 1.6.0

PCCL_WAIT_EN

Since PCCL 1.6.0

PCCL_LAST_ROUND_BEGIN_NUMBER

Since PCCL 1.6.0

PCCL_PRE_ALLOC_EN

Since PCCL 1.6.0

PCCL_BLOCK_AGE_EN

Since PCCL 1.6.0

PCCL_NDIES

Since PCCL 1.6.0

PCCL_START_CE_EN

Since PCCL 1.6.0

PCCL_FORCE_DEFAULT_CE_MASK

Since PCCL 1.6.0
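
Because the two families of variables differ only by prefix, a launch script can print what is actually set before a run starts. The helper below is a hypothetical sketch (the name `list_comm_env` is not part of PCCL); it only reads the current process environment and does not validate any values.

```shell
# Print the NCCL_* and PCCL_* variables set in the current environment,
# grouped by prefix. A group header is printed even when the group is empty.
list_comm_env() {
  for _prefix in NCCL_ PCCL_; do
    echo "== ${_prefix}* =="
    env | grep -E "^${_prefix}" | sort || true
  done
}
```

Calling `list_comm_env` at the top of a job script makes it easier to spot a stale setting inherited from the shell or the scheduler.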

4. Net API Summary

4.1 Net APIs

NCCL 2.22.3 + Net V8

PCCL 1.5 + Net V6

Notes

const char* name

const char* name

Name of the network

ncclResult_t (*init)(ncclDebugLogger_t logFunction);

pcclResult_t (*init)(pcclDebugLogger_t logFunction);

Initialize the network.

ncclResult_t (*devices)(int* ndev);

pcclResult_t (*devices)(int* ndev);

Return the number of adapters.

ncclResult_t (*getProperties)(int dev, ncclNetProperties_v8_t* props);

pcclResult_t (*getProperties)(int dev, pcclNetProperties_v6_t* props);

Get various device properties.

ncclResult_t (*listen)(int dev, void* handle, void** listenComm);

pcclResult_t (*listen)(int dev, void* handle, void** listenComm);

Create a receiving object and provide a handle to connect to it.

ncclResult_t (*connect)(int dev, void* handle, void** sendComm, ncclNetDeviceHandle_v8_t** sendDevComm);

pcclResult_t (*connect)(int dev, void* handle, void** sendComm);

Connect to a handle and return a sending comm object for that peer.

ncclResult_t (*accept)(void* listenComm, void** recvComm, ncclNetDeviceHandle_v8_t** recvDevComm);

pcclResult_t (*accept)(void* listenComm, void** recvComm);

Finalize connection establishment after remote peer has called connect.

ncclResult_t (*regMr)(void* comm, void* data, size_t size, int type, void** mhandle);

pcclResult_t (*regMr)(void* comm, void* data, int size, int type, void** mhandle);

Register/Deregister memory.

ncclResult_t (*regMrDmaBuf)(void* comm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);

pcclResult_t (*regMrDmaBuf)(void* comm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);

DMA-BUF support

ncclResult_t (*deregMr)(void* comm, void* mhandle);

pcclResult_t (*deregMr)(void* comm, void* mhandle);

Register/Deregister memory.

ncclResult_t (*isend)(void* sendComm, void* data, int size, int tag, void* mhandle, void** request);

pcclResult_t (*isend)(void* sendComm, void* data, int size, int tag, void* mhandle, void** request);

Asynchronous send to a peer.

ncclResult_t (*irecv)(void* recvComm, int n, void** data, int* sizes, int* tags, void** mhandles, void** request);

pcclResult_t (*irecv)(void* recvComm, int n, void** data, int* sizes, int* tags, void** mhandles, void** request);

Asynchronous recv from a peer.

ncclResult_t (*iflush)(void* recvComm, int n, void** data, int* sizes, void** mhandles, void** request);

pcclResult_t (*iflush)(void* recvComm, int n, void** data, int* sizes, void** mhandles, void** request);

Perform a flush/fence to make sure all received data is visible to the GPU.

ncclResult_t (*test)(void* request, int* done, int* sizes);

pcclResult_t (*test)(void* request, int* done, int* size);

Test whether a request is complete.

ncclResult_t (*closeSend)(void* sendComm);

pcclResult_t (*closeSend)(void* sendComm);

Close and free send/recv comm objects

ncclResult_t (*closeRecv)(void* recvComm);

pcclResult_t (*closeRecv)(void* recvComm);

ncclResult_t (*closeListen)(void* listenComm);

pcclResult_t (*closeListen)(void* listenComm);

ncclResult_t (*getDeviceMr)(void* comm, void* mhandle, void** dptr_mhandle);

Copy the given mhandle to a dptr

ncclResult_t (*irecvConsumed)(void* recvComm, int n, void* request);

Notify the plugin that a recv has completed by the device

4.2 Coll Net APIs

NCCL 2.22.3 + Coll Net V8

PCCL 1.5 + Coll Net V6

Notes

const char* name

const char* name

Name of the network

ncclResult_t (*init)(ncclDebugLogger_t logFunction);

pcclResult_t (*init)(pcclDebugLogger_t logFunction);

Initialize the network.

ncclResult_t (*devices)(int* ndev);

pcclResult_t (*devices)(int* ndev);

Return the number of adapters.

ncclResult_t (*getProperties)(int dev, ncclNetProperties_v8_t* props);

pcclResult_t (*getProperties)(int dev, pcclNetProperties_v4_t* props);

Get various device properties.

ncclResult_t (*listen)(int dev, void* handle, void** listenComm);

pcclResult_t (*listen)(int dev, void* handle, void** listenComm);

Create a receiving object and provide a handle to connect to it.

ncclResult_t (*connect)(void* handles[], int nranks, int rank, void* listenComm, void** collComm);

pcclResult_t (*connect)(void* handles[], int nranks, int rank, void* listenComm, void** collComm);

Create a group for collective operations.

ncclResult_t (*reduceSupport)(ncclDataType_t dataType, ncclRedOp_t redOp, int* supported);

pcclResult_t (*reduceSupport)(pcclDataType_t dataType, pcclRedOp_t redOp, int* supported);

Returns whether a reduction operation on a data type is supported.

ncclResult_t (*regMr)(void* collComm, void* data, size_t size, int type, void** mhandle);

pcclResult_t (*regMr)(void* collComm, void* data, int size, int type, void** mhandle);

Register/Deregister memory.

ncclResult_t (*regMrDmaBuf)(void* collComm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);

pcclResult_t (*regMrDmaBuf)(void* collComm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle);

DMA-BUF support

ncclResult_t (*deregMr)(void* collComm, void* mhandle);

pcclResult_t (*deregMr)(void* collComm, void* mhandle);

Register/Deregister memory.

ncclResult_t (*iallreduce)(void* collComm, void* sendData, void* recvData, int count, ncclDataType_t dataType, ncclRedOp_t redOp, void* sendMhandle, void* recvMhandle, void** request);

pcclResult_t (*iallreduce)(void* collComm, void* sendData, void* recvData, int count, pcclDataType_t dataType, pcclRedOp_t redOp, void* sendMhandle, void* recvMhandle, void** request);

Performs an asynchronous allreduce operation on the collective group.

ncclResult_t (*iallgather)(void* collComm, void* sendData, int nRecvParts, ncclNetSGE_v8_t* recvParts, size_t bytesPerRank, size_t windowOffset, size_t windowBytes, void* sendMhandle, void** request);

Performs an asynchronous allgather operation on the collective group.

ncclResult_t (*ireducescatter)(void* collComm, int nSendParts, ncclNetSGE_v8_t* sendParts, void* recvData, size_t bytesPerRank, size_t windowOffset, size_t windowBytes, ncclDataType_t dataType, ncclRedOp_t redOp, void* recvMhandle, void** request);

Performs an asynchronous reduce_scatter operation on the collective group.

ncclResult_t (*iflush)(void* collComm, void* data, int size, void* mhandle, void** request);

pcclResult_t (*iflush)(void* collComm, void* data, int size, void* mhandle, void** request);

Perform a flush/fence to make sure all received data is visible to the GPU.

ncclResult_t (*test)(void* request, int* done, int* size);

pcclResult_t (*test)(void* request, int* done, int* size);

Test whether a request is complete.

ncclResult_t (*closeColl)(void* collComm);

pcclResult_t (*closeColl)(void* collComm);

Close and free collective comm objects

ncclResult_t (*closeListen)(void* listenComm);

pcclResult_t (*closeListen)(void* listenComm);

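5. Recommended Environment Variable Settings for Model Training and Inference

Based on accumulated analysis of performance issues in model training, the following PCCL-related environment variable settings can improve performance in day-to-day model training.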

5.1 CUDA_VISIBLE_DEVICES

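For the various parallel training configurations based on Megatron core, refer to the file below to run a scan in advance, then set CUDA_VISIBLE_DEVICES according to the scan result. Given the parallelism parameters, the scan script searches for the best CUDA_VISIBLE_DEVICES setting, which can substantially improve communication efficiency over the default and in turn improve model training performance.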

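5.2 Multi-node training configuration for 真武810E (not needed in v1.5.2 and later)

Note: With PCCL 1.5.2 and later, this configuration no longer needs to be set manually; PCCL automatically detects and uses the affinity NIC.

If model training involves multi-node communication, the environment variables below can be set to assign the best NIC to each 真武810E device. This ensures that the best-performing GDR path is used and excludes the host-side (head) NIC, which improves end-to-end training performance. In particular, multi-node allToAll currently depends on these settings to reach its best performance.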


# PCCL_NET_AFFINITY format: "hgml_dev_id:net_id,hgml_dev_id:net_id"
# The host-side (head) NIC is usually filtered out through NCCL_IB_HCA, so NET ids are offset by -1
export PCCL_NET_AFFINITY=0:3,1:3,4:1,5:1,8:2,9:2,12:0,13:0
export NCCL_NET_GDR_LEVEL=PHBH
export NCCL_IB_HCA=mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4
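
A malformed PCCL_NET_AFFINITY string is easy to produce by hand, so a launch script may want to check it before exporting. The sketch below assumes that both ids are non-negative decimal integers, which matches the example above but is not stated elsewhere in this document. The helper names are hypothetical and only the string format is checked, not whether the ids exist on the machine.

```shell
# Succeeds when the argument looks like "dev:net,dev:net,..." with
# non-negative integer ids (an assumption based on the example above).
validate_net_affinity() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+:[0-9]+(,[0-9]+:[0-9]+)*$'
}

# Print one "hgml_dev_id=<id> net_id=<id>" line per pair for review.
print_net_affinity() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS=: read -r _dev _net; do
    echo "hgml_dev_id=${_dev} net_id=${_net}"
  done
}

# Export only when the format check passes.
affinity="0:3,1:3,4:1,5:1,8:2,9:2,12:0,13:0"
if validate_net_affinity "$affinity"; then
  export PCCL_NET_AFFINITY="$affinity"
fi
```

Running `print_net_affinity "$PCCL_NET_AFFINITY"` afterwards lists the device-to-NIC mapping one pair per line, which is easier to compare against the machine topology than the raw string.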

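5.3 PPU Resource Control

If computation and communication overlap during model training, try the environment variables below to reduce the compute resources used by the communication library and improve end-to-end training performance. (This is experimental; whether it helps depends on the actual workload.)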

# Variant 1
export PCCL_CU_MASK=8
export NCCL_NTHREADS=256
export NCCL_LL128_NTHREADS=320

# Variant 2: same thread settings with a different PCCL_CU_MASK value
export PCCL_CU_MASK=12
export NCCL_NTHREADS=256
export NCCL_LL128_NTHREADS=320
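
Since these settings are experimental, it can help to apply them to a single command instead of exporting them into the whole shell session, so that different PCCL_CU_MASK values can be compared run by run. The wrapper below is a hypothetical sketch: the function name is made up, the thread values are copied from the settings above, and the command to run is whatever launcher you already use.

```shell
# run_with_overlap_tuning MASK COMMAND [ARGS...]
# Runs COMMAND with the tuning variables set for that process only;
# the calling shell's environment is left unchanged.
run_with_overlap_tuning() {
  _mask="$1"
  shift
  env PCCL_CU_MASK="$_mask" \
      NCCL_NTHREADS=256 \
      NCCL_LL128_NTHREADS=320 \
      "$@"
}
```

For example, `run_with_overlap_tuning 8 <launch command>` and `run_with_overlap_tuning 12 <launch command>` run the same job with the two mask values shown above.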