Train Models with PPU on PAI
This topic describes how to train models with PPU in PAI-DSW.
Prerequisites
Before you develop and train models on PAI, make sure that you have activated PAI and purchased PPU resources.
PAI 真武810E Image
To help customers quickly enable the ml.gp7vf.16.40xlarge resource type (真武810E) on PAI Serverless, an official PAI Serverless image is provided. It integrates the PPU stack, high-performance networking, and PAI platform capabilities, delivering an out-of-the-box experience and optimal performance.
This image is currently in invitational preview. You are obligated to keep it confidential during use.
This image can be used only within the PAI Serverless platform (including the DSW, DLC, and EAS modules).
For how to obtain the training image and detailed instructions, see Training images.
Train a Model in PAI-DSW
PAI-DSW is PAI's cloud-based machine learning development IDE. It integrates Notebook, VSCode, and Terminal environments, so you do not need to manually purchase, install, or start ECS instances; with DSW you can immediately start writing, debugging, and running AI model code. The following sections describe the basic steps to create a PAI-DSW instance. For more information, see PAI-DSW overview.
Create a DSW instance
Log on to the PAI console and, in the upper-left corner, select a region that supports PPU resources. This topic uses Ulanqab as an example.
Note: Regions that currently support PPU resources include Ulanqab, Beijing, Shanghai, and Hangzhou.
In the left-side navigation pane, click Workspaces and enter a workspace that has a PPU resource quota.
In the left-side navigation pane, choose Interactive Modeling (DSW) > Create Instance.

Configure the following key parameters; other parameters can be set as needed. For a full parameter reference, see Create a DSW instance.
Resource Type: select Resource Quota.
Resource Quota: select the PPU resource quota that you created.
Resource Specification: configure GPU, CPU, memory, and other specifications as needed. An example is shown below:

Image: select Official Image, then search for and select
ppu-training:1.7.0-pytorch2.8-ppu-py312-cu129-ubuntu24.04.
After completing the configuration, click OK to create the DSW instance.
Single-Node PAI-Megatron-Patch Training Task
1. Log on to the DSW instance
Find the target instance and click Open to enter the DSW instance.

After logging on to the DSW instance, click the Terminal tab.

2. Prepare the code, dataset, and model
Run the following commands to prepare the PAI-Megatron-Patch code.

```shell
mkdir /mnt/workspace/megatron-patch
cd /mnt/workspace/megatron-patch
git clone -b v0.9.2 --single-branch --recurse-submodules https://github.com/alibaba/Pai-Megatron-Patch.git
```

Run the following commands to prepare the dataset required by the model.

```shell
cd /mnt/workspace/megatron-patch
mkdir qwen-datasets
cd qwen-datasets
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu-internal.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_text_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu-internal.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_text_document.idx
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu-internal.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/alpaca_zh-qwen-train.json
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu-internal.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/alpaca_zh-qwen-valid.json
```

Prepare the model checkpoints, using Qwen2.5-7B as an example.

```shell
mkdir -p /mnt/workspace/megatron-patch/qwen-ckpts/Qwen2.5-7B
cd /mnt/workspace/megatron-patch/qwen-ckpts/Qwen2.5-7B
pip install modelscope
modelscope download --model Qwen/Qwen2.5-7B --local_dir ./
```

Add the code files needed to run the training task.
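Before continuing, it can help to confirm that all four dataset files actually landed in place. The following is an optional sanity-check sketch, not part of the original steps:

```shell
# Optional check (not in the original doc): verify the downloaded dataset files.
DATA_DIR=/mnt/workspace/megatron-patch/qwen-datasets
missing=0
for f in wudao_qwenbpe_text_document.bin wudao_qwenbpe_text_document.idx \
         alpaca_zh-qwen-train.json alpaca_zh-qwen-valid.json; do
  if [ -f "${DATA_DIR}/${f}" ]; then
    echo "ok: ${f}"
  else
    echo "missing: ${f}"
    missing=$((missing + 1))
  fi
done
echo "${missing} file(s) missing"
```

If any file is reported missing, re-run the corresponding wget command before starting training.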
Two modifications are required:
In the /mnt/workspace/megatron-patch/Pai-Megatron-Patch/examples/qwen2_5 directory, add a file named run_mcore_qwen-no-overlap.sh. Note that the environment setup (the `if [ $ENV = dsw ]` branch) has been adapted for 真武810E: both the dsw and dlc branches default to 16 GPUs per node.

```shell
#!/bin/bash
set -e
ENV=$1
CURRENT_DIR="$( cd "$( dirname "$0" )" && pwd )"
MEGATRON_PATH=$( dirname $( dirname ${CURRENT_DIR}))
export PYTHONPATH=${MEGATRON_PATH}:${MEGATRON_PATH}/PAI-Megatron-LM-240718:$PYTHONPATH
export CUDA_DEVICE_MAX_CONNECTIONS=1
#export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Here are some configs controlled by env
if [ -z ${MP_DATASET_TYPE} ];then
    MP_DATASET_TYPE="idxmap"
fi
if [ -z ${MP_AC_LAYERS} ];then
    MP_AC_LAYERS=1
fi

if [ $ENV = dsw ]; then
    # Because it is 810E, we use 16 gpus and GPUS_PER_NODE=16
    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
    MASTER_ADDR=localhost
    MASTER_PORT=$(shuf -n 1 -i 10000-65535)
    NNODES=1
    NODE_RANK=0
    GPUS_PER_NODE=16
elif [ $ENV = dlc ]; then
    NNODES=${WORLD_SIZE}
    NODE_RANK=${RANK}
    GPUS_PER_NODE=16
fi

if [ -z ${MP_VP} ]; then
    vp_options=""
else
    vp_options=" \
        --num-layers-per-virtual-pipeline-stage ${MP_VP}"
fi

if [ -z ${MP_SFT_PACKING} ]; then
    MP_SFT_PACKING=false
fi

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

### BASE CONFIG ###
MODEL_SIZE=$2
BATCH_SIZE=$3
GLOBAL_BATCH_SIZE=$4
LR=$5
MIN_LR=$6
SEQ_LEN=$7
PAD_LEN=$8
PR=$9
### BASE CONFIG ###

### PARALLEL / BOOL OPTION ###
TP=${10}
PP=${11}
CP=${12}
SP=${13}
DO=${14}
FL=${15}
SFT=${16}
### PARALLEL / BOOL OPTION ###

### OTHERS ###
AC=${17}
OPTIMIZER_OFFLOAD=${18}
SAVE_INTERVAL=${19}
DATASET_PATH=${20}
VALID_DATASET_PATH=${21}
PRETRAIN_CHECKPOINT_PATH=${22}
# the following two values will not be used when SFT is true
TRAIN_TOKENS=${23}
WARMUP_TOKENS=${24}
###############################
OUTPUT_BASEPATH=${25}
### OTHERS ###

if [ $FL = true ]; then
    export NVTE_FLASH_ATTN=1 NVTE_FUSED_ATTN=0
elif [ $FL = false ]; then
    export NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN=1
fi

if [ $MODEL_SIZE = 0.5B ]; then
    NUM_LAYERS=24
    HIDDEN_SIZE=896
    NUM_ATTN_HEADS=14
    INTERMEDIATE_SIZE=4864
    NUM_KEY_VALUE_HEADS=2
    MAX_POSITION_EMBEDDINGS=32768
    EXTRA_VOCAB_SIZE=293
    RMS_NORM_EPS=1e-6
    gqa_options=" \
        --group-query-attention \
        --num-query-groups ${NUM_KEY_VALUE_HEADS}"
    tie_option=""
elif [ $MODEL_SIZE = 1.5B ]; then
    NUM_LAYERS=28
    HIDDEN_SIZE=1536
    NUM_ATTN_HEADS=12
    INTERMEDIATE_SIZE=8960
    NUM_KEY_VALUE_HEADS=2
    MAX_POSITION_EMBEDDINGS=32768
    EXTRA_VOCAB_SIZE=293
    RMS_NORM_EPS=1e-6
    gqa_options=" \
        --group-query-attention \
        --num-query-groups ${NUM_KEY_VALUE_HEADS}"
    tie_option=""
elif [ $MODEL_SIZE = 3B ]; then
    NUM_LAYERS=36
    HIDDEN_SIZE=2048
    NUM_ATTN_HEADS=16
    INTERMEDIATE_SIZE=11008
    NUM_KEY_VALUE_HEADS=2
    MAX_POSITION_EMBEDDINGS=32768
    EXTRA_VOCAB_SIZE=293
    RMS_NORM_EPS=1e-6
    gqa_options=" \
        --group-query-attention \
        --num-query-groups ${NUM_KEY_VALUE_HEADS}"
    tie_option=""
elif [ $MODEL_SIZE = 7B ]; then
    NUM_LAYERS=28
    HIDDEN_SIZE=3584
    NUM_ATTN_HEADS=28
    INTERMEDIATE_SIZE=18944
    NUM_KEY_VALUE_HEADS=4
    MAX_POSITION_EMBEDDINGS=131072
    EXTRA_VOCAB_SIZE=421
    RMS_NORM_EPS=1e-6
    gqa_options=" \
        --group-query-attention \
        --num-query-groups ${NUM_KEY_VALUE_HEADS}"
    tie_option=" \
        --untie-embeddings-and-output-weights \
        "
elif [ $MODEL_SIZE = 14B ]; then
    NUM_LAYERS=48
    HIDDEN_SIZE=5120
    NUM_ATTN_HEADS=40
    INTERMEDIATE_SIZE=13824
    NUM_KEY_VALUE_HEADS=8
    MAX_POSITION_EMBEDDINGS=131072
    EXTRA_VOCAB_SIZE=421
    RMS_NORM_EPS=1e-5
    gqa_options=" \
        --group-query-attention \
        --num-query-groups ${NUM_KEY_VALUE_HEADS}"
    tie_option=" \
        --untie-embeddings-and-output-weights \
        "
elif [ $MODEL_SIZE = 32B ]; then
    NUM_LAYERS=64
    HIDDEN_SIZE=5120
    NUM_ATTN_HEADS=40
    INTERMEDIATE_SIZE=27648
    NUM_KEY_VALUE_HEADS=8
    MAX_POSITION_EMBEDDINGS=131072
    EXTRA_VOCAB_SIZE=421
    RMS_NORM_EPS=1e-5
    gqa_options=" \
        --group-query-attention \
        --num-query-groups ${NUM_KEY_VALUE_HEADS}"
    tie_option=" \
        --untie-embeddings-and-output-weights \
        "
elif [ $MODEL_SIZE = 72B ]; then
    NUM_LAYERS=80
    HIDDEN_SIZE=8192
    NUM_ATTN_HEADS=64
    INTERMEDIATE_SIZE=29568
    NUM_KEY_VALUE_HEADS=8
    MAX_POSITION_EMBEDDINGS=131072
    EXTRA_VOCAB_SIZE=421
    RMS_NORM_EPS=1e-5
    gqa_options=" \
        --group-query-attention \
        --num-query-groups ${NUM_KEY_VALUE_HEADS}"
    tie_option=" \
        --untie-embeddings-and-output-weights \
        "
fi

TP_COMM_OVERLAP=$(( ($TP > 1) ? 1 : 0 ))
comm_overlap_option="\
    --overlap-grad-reduce \
    --overlap-param-gather"
if [ $TP_COMM_OVERLAP -eq 1 ]; then
    comm_overlap_option="\
        --overlap-grad-reduce \
        --overlap-param-gather"
fi

if [ $AC = full ]; then
    _check=$(( ($NUM_LAYERS / $PP) % ${MP_AC_LAYERS} ))
    if [ $_check != 0 ]; then
        echo "the num layers per pp rank must be a multiple of the recompute layers."
        exit -1
    fi
    activation_checkpoint_options=" \
        --recompute-method uniform \
        --recompute-num-layers ${MP_AC_LAYERS} \
        --recompute-granularity full"
elif [ $AC = sel ]; then
    activation_checkpoint_options=" \
        --recompute-activations"
elif [ $AC = none ]; then
    activation_checkpoint_options=" \
        "
elif [ $AC = offload ]; then
    activation_checkpoint_options=" \
        --cpu-offloading \
        --cpu-offloading-num-layers ${MP_AC_LAYERS}"
    if [ $TP_COMM_OVERLAP -eq 1 ]; then
        echo "Disable --overlap-grad-reduce and --overlap-param-gather when cpu offloading is on..."
        comm_overlap_option="\
            --tp-comm-overlap"
    else
        echo "Disable --overlap-grad-reduce and --overlap-param-gather when cpu offloading is on..."
        comm_overlap_option=""
    fi
fi

if [ $PR = fp16 ]; then
    pr_options=" \
        --fp16 \
        --apply-query-key-layer-scaling"
    export NVTE_APPLY_QK_LAYER_SCALING=1
elif [ $PR = bf16 ]; then
    pr_options=" \
        --bf16"
elif [ $PR = fp8 ]; then
    pr_options=" \
        --bf16 \
        --fp8-format hybrid \
        --fp8-amax-compute-algo max \
        --fp8-amax-history-len 1024"
fi

if [ $OPTIMIZER_OFFLOAD != false ] && [ $DO = false ]; then
    echo "Offload optimizer is valid only if \$DO=true"
    DO=true
fi

if [ $DO = true ]; then
    do_options=" \
        --use-distributed-optimizer"
elif [ $DO = false ]; then
    do_options=" \
        "
fi

te_options=" \
    --transformer-impl transformer_engine"

if [ $SP = true ] && [ $TP -gt 1 ]; then
    sp_options=" \
        --sequence-parallel"
elif [ $SP = false ]; then
    sp_options=" \
        "
fi

if [ $PRETRAIN_CHECKPOINT_PATH != none ]; then
    load_options=" \
        --load $PRETRAIN_CHECKPOINT_PATH"
fi

if [ $OPTIMIZER_OFFLOAD = 'static' ]; then
    offload_option=" \
        --optimizer hybridadam \
        --optimizer-offload-policy static \
        --optimizer-offload-fraction 1.0"
elif [ $OPTIMIZER_OFFLOAD = 'auto' ]; then
    offload_option=" \
        --optimizer hybridadam \
        --optimizer-offload-policy auto"
else
    offload_option=""
fi

if [ $SFT = true ]; then
    TRAIN_ITERS=${23}
    LR_WARMUP_ITERS=${24}
    LR_DECAY_ITERS=$(( ${TRAIN_ITERS} - ${LR_WARMUP_ITERS}))
    PREFIX="finetune-mcore-qwen2.5-${MODEL_SIZE}-lr-${LR}-minlr-${MIN_LR}-bs-${BATCH_SIZE}-gbs-${GLOBAL_BATCH_SIZE}-seqlen-${SEQ_LEN}"
    sft_option=" \
        --eod-mask-loss \
        --train-mode finetune"
else
    TRAIN_ITERS=$(( ${TRAIN_TOKENS} / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
    LR_WARMUP_ITERS=$(( ${WARMUP_TOKENS} / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
    LR_DECAY_ITERS=$(( ${TRAIN_TOKENS} / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
    PREFIX="pretrain-mcore-qwen2.5-${MODEL_SIZE}-lr-${LR}-minlr-${MIN_LR}-bs-${BATCH_SIZE}-gbs-${GLOBAL_BATCH_SIZE}-seqlen-${SEQ_LEN}"
    sft_option=" \
        --train-mode pretrain"
fi

if [ ${MP_DATASET_TYPE} = "raw" ]; then
    dataset_option=" \
        --train-data-path ${DATASET_PATH} \
        --valid-data-path ${VALID_DATASET_PATH} \
        --dataloader-type cyclic \
        --dataset LLama-SFT-Raw"
else
    dataset_option=" \
        --data-path ${DATASET_PATH} \
        --split 99,1,0 \
        --dataset LLama-Pretrain-Idxmap"
fi

if [ ${MP_SFT_PACKING} = true ]; then
    packing_options=" \
        --reset-position-ids \
        --no-create-attention-mask-in-dataloader "
else
    packing_options=""
fi

##### Prepare logdirs #######
NAME="${PREFIX}-pr-${PR}-tp-${TP}-pp-${PP}-cp-${CP}-ac-${AC}-do-${DO}-sp-${SP}-ti-${TRAIN_ITERS}-wi-${LR_WARMUP_ITERS}"
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
mkdir -p "${OUTPUT_BASEPATH}/log/"
current_time=$(date "+%Y.%m.%d-%H.%M.%S")
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${current_time}"
mkdir -p ${TENSORBOARD_DIR}
SAVED_PRETRAIN_CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
mkdir -p ${SAVED_PRETRAIN_CHECKPOINT_PATH}
find -L ${PRETRAIN_CHECKPOINT_PATH} -maxdepth 1 -type f -name "*.json" -print0 | xargs -0 cp -t ${SAVED_PRETRAIN_CHECKPOINT_PATH}
find -L ${PRETRAIN_CHECKPOINT_PATH} -maxdepth 1 -type f -name "merges.txt" -print0 | xargs -0 cp -t ${SAVED_PRETRAIN_CHECKPOINT_PATH}

megatron_options=" \
    --save ${SAVED_PRETRAIN_CHECKPOINT_PATH} \
    --lr ${LR} \
    --min-lr ${MIN_LR} \
    --lr-decay-style cosine \
    --weight-decay 0.1 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --clip-grad 1.0 \
    --init-method-std 0.008 \
    --attention-dropout 0.0 \
    --hidden-dropout 0.0 \
    --lr-decay-iters ${LR_DECAY_ITERS} \
    --lr-warmup-iters ${LR_WARMUP_ITERS} \
    --train-iters ${TRAIN_ITERS} \
    --micro-batch-size ${BATCH_SIZE} \
    --global-batch-size ${GLOBAL_BATCH_SIZE} \
    --num-layers ${NUM_LAYERS} \
    --hidden-size ${HIDDEN_SIZE} \
    --num-attention-heads ${NUM_ATTN_HEADS} \
    --ffn-hidden-size ${INTERMEDIATE_SIZE} \
    --seq-length ${SEQ_LEN} \
    --max-position-embeddings ${MAX_POSITION_EMBEDDINGS} \
    --max-padding-length ${PAD_LEN} \
    --log-interval 1 \
    --log-throughput \
    --eval-interval 10000 \
    --eval-iters 10 \
    --save-interval ${SAVE_INTERVAL} \
    --tensorboard-queue-size 1 \
    --tensorboard-dir ${TENSORBOARD_DIR} \
    --log-timers-to-tensorboard \
    --log-batch-size-to-tensorboard \
    --log-validation-ppl-to-tensorboard \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --context-parallel-size ${CP} \
    --no-load-optim \
    --no-load-rng \
    --num-workers 8 \
    --extra-vocab-size ${EXTRA_VOCAB_SIZE} \
    --patch-tokenizer-type Qwen2Tokenizer \
    --swiglu \
    --normalization RMSNorm \
    --norm-epsilon ${RMS_NORM_EPS} \
    --use-rotary-position-embeddings \
    --position-embedding-type rope \
    --disable-bias-linear \
    --add-qkv-bias \
    --rotary-percent 1.0 \
    --rotary-base 1000000 \
    --rotary-seq-len-interpolation-factor 1 \
    --no-save-optim \
    --timing-log-level 2 \
    --timing-log-option minmax \
    "

run_cmd="torchrun $DISTRIBUTED_ARGS ../qwen2/pretrain_qwen.py ${megatron_options} ${dataset_option} ${pr_options} ${load_options} ${te_options} ${activation_checkpoint_options} \
    ${do_options} ${sp_options} ${gqa_options} ${offload_option} ${sft_option} ${tie_option} ${vp_options} ${packing_options}"

echo ${run_cmd}
eval ${run_cmd}
set +x
```

In the /mnt/workspace/megatron-patch/Pai-Megatron-Patch/examples/qwen2_5 directory, add a file named run_dsw_7b.sh with the following content.

```shell
unset NCCL_QUADRUPLE_CHANNELS
#export CUDA_VISIBLE_DEVICES=4,5,1,0,9,8,12,13,7,6,2,3,10,11,15,14
#export CUDA_VISIBLE_DEVICES=4,5,7,6,0,1,3,2,9,8,10,11,13,12,14,15
#export CUDA_VISIBLE_DEVICES=4,7,5,6,1,2,0,3,12,15,13,14,9,10,8,11
#export NCCL_SOCKET_IFNAME=eth0
#export MP_VP=2
sh run_mcore_qwen-no-overlap.sh \
    dsw \
    7B \
    1 \
    256 \
    1e-5 \
    1e-6 \
    4096 \
    4096 \
    bf16 \
    1 \
    1 \
    1 \
    true \
    true \
    true \
    false \
    false \
    false \
    100000 \
    /mnt/workspace/megatron-patch/qwen-datasets/wudao_qwenbpe_text_document \
    /mnt/workspace/megatron-patch/qwen-datasets/wudao_qwenbpe_text_document \
    /mnt/workspace/megatron-patch/qwen-ckpts/Qwen2.5-7B \
    10000000000 \
    1000000 \
    /mnt/workspace/megatron-patch/test_qwen2_5
```

The added files are shown in the following figure.
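As a sanity check, the iteration count that run_mcore_qwen-no-overlap.sh derives from these positional arguments (TRAIN_ITERS = TRAIN_TOKENS / GLOBAL_BATCH_SIZE / SEQ_LEN, integer division) can be reproduced by hand. A minimal sketch using the values passed by run_dsw_7b.sh:

```shell
# Reproduce the TRAIN_ITERS arithmetic from run_mcore_qwen-no-overlap.sh
# with the values that run_dsw_7b.sh passes in.
TRAIN_TOKENS=10000000000     # positional arg 23
GLOBAL_BATCH_SIZE=256        # positional arg 4
SEQ_LEN=4096                 # positional arg 7
TRAIN_ITERS=$(( TRAIN_TOKENS / GLOBAL_BATCH_SIZE / SEQ_LEN ))
echo "train iters: ${TRAIN_ITERS}"   # prints "train iters: 9536"
```

So with this configuration the pretraining run performs 9536 optimizer steps before reaching the 10-billion-token budget.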

3. Run the training task
Run the following commands to start the training task.
```shell
cd /mnt/workspace/megatron-patch/Pai-Megatron-Patch/examples/qwen2_5
sh run_dsw_7b.sh
```

4. View the results
View the logs. If the expected log output appears, the task is executing normally.

View the monitoring data. If metrics are displayed, the task is running.

Performance Metrics
LLaMA-3.1-8B/70B, Qwen2.5-72B
Training guides: examples/llama3_1/README.md, examples/qwen2_5/README.md
Note: the --tp-comm-overlap option must be disabled in the training script.
The following are performance test results for the LLaMA3.1 and Qwen2.5 series models.
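One way to disable the option is to delete the line that carries the flag. The sketch below demonstrates this on a stand-in file (demo.sh and its contents are hypothetical; point sed at the actual training script):

```shell
# Create a stand-in script fragment containing the flag (for demonstration only).
printf '%s\n' 'options=" \' '    --bf16 \' '    --tp-comm-overlap \' '    --seq-length 4096"' > demo.sh

# Delete the line containing --tp-comm-overlap in place (GNU sed).
sed -i '/--tp-comm-overlap/d' demo.sh

cat demo.sh   # the remaining lines still form a valid continuation
```

Because the flag occupies its own backslash-continued line in these scripts, deleting the whole line keeps the surrounding option string well-formed.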
Qwen-VL
Training configuration file: finetune/finetune_lora_ds.sh
Parameter configuration: mbs 2 / ga 4 / seqlen 2048 / zero 2 / bf16 / lora
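Under these settings, the effective batch per optimizer step follows from micro-batch size × gradient-accumulation steps × GPU count. A quick back-of-the-envelope sketch for a single 16-GPU node (assuming all 16 GPUs participate in data parallelism, which is the usual ZeRO-2 setup):

```shell
# Effective samples consumed per optimizer step on one node (illustrative).
MBS=2            # micro-batch size per GPU (mbs)
GA=4             # gradient accumulation steps (ga)
GPUS_PER_NODE=16
echo $(( MBS * GA * GPUS_PER_NODE ))   # prints 128
```

Multiply by the node count to get the effective global batch for the multi-node rows in the table below.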
| Nodes | GPUs | Throughput (samples/sec) | MFU (TFLOPS / rated value 123) |
| --- | --- | --- | --- |
| 1 | 16 | 9.64 | 58.40% |
| 2 | 32 | 19.38 | 55.40% |
| 4 | 64 | 36.05 | 57.60% |
| 8 | 128 | 77.46 | 55.30% |
| 16 | 256 | 152.39 | 54.40% |
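From the table, linear-scaling efficiency can be estimated as measured throughput divided by (single-node throughput × node count). An illustrative calculation for the 16-node row (not a figure from the original document):

```shell
# Scaling efficiency of the 16-node run vs. the 1-node baseline:
# 152.39 samples/sec measured, vs. an ideal 9.64 * 16 = 154.24.
awk 'BEGIN { printf "%.1f%%\n", 152.39 / (9.64 * 16) * 100 }'
```

Throughput thus stays close to linear through 16 nodes, even though MFU dips slightly as the job scales out.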
Bunny-LLaMA-3-V
Training configuration file: https://github.com/BAAI-DCAI/Bunny/blob/main/script/train/tutorials/Bunny-Llama-3-8B-V.md
Dataset: https://huggingface.co/datasets/BoyaWu10/Bunny-v1_1-data/tree/main/pretrain
| Nodes | GPUs | Throughput (samples/sec) | MFU (TFLOPS / rated value 123) |
| --- | --- | --- | --- |
| 1 | 16 | 37.01 | 78.54% |
| 2 | 32 | 68.71 | 73.80% |
| 4 | 64 | 130.86 | 71.80% |
| 8 | 128 | 260.32 | 69.70% |
| 16 | 256 | 522.48 | 73.20% |
| 32 | 512 | 1043.8 | 70.40% |




