Develop and Train Models with PPU on PAI


This topic describes how to train models with PPU in PAI-DSW.

Prerequisites

Before you use PAI to develop and train models, make sure that you have activated PAI and purchased PPU resources.

PAI 真武810E image

Note
  • To help customers quickly use the ml.gp7vf.16.40xlarge resource type (真武810E) on PAI Serverless, an official PAI Serverless image is provided. It integrates PPU, high-performance networking, and PAI capabilities across the stack, delivering an out-of-the-box experience and optimal performance.

  • This image is currently in invitational preview. Please observe your confidentiality obligations when using it.

  • This image can be used only within the PAI Serverless platform (including the DSW, DLC, and EAS modules).

For how to obtain the training image and for detailed descriptions, see the training image documentation.

Train models in PAI-DSW

PAI-DSW is PAI's cloud-based machine learning development IDE. It integrates Notebook, VSCode, and Terminal environments, so you do not need to manually purchase, install, and start an ECS instance; with DSW you can quickly start writing, debugging, and running AI model code. The following describes the basic steps to create a PAI-DSW instance. For more information, see the PAI-DSW overview.

Create a DSW instance

  1. Log on to the PAI console. In the upper-left corner, select a region that supports PPU resources. This topic uses the Ulanqab region as an example.

    Note

    Regions that currently support PPU resources: Ulanqab, Beijing, Shanghai, and Hangzhou.

  2. In the left-side navigation pane, click Workspaces, and enter a workspace that has a PPU resource quota.

  3. In the left-side navigation pane, choose Interactive Modeling (DSW) > Create Instance.

    image

  4. Configure the following key parameters and set the other parameters as needed. For descriptions of all parameters, see Create a DSW instance.

    • Resource Type: select Resource Quota.

    • Resource Quota: select the PPU resource quota that you created.

    • Resource Specifications: configure the GPU, CPU, memory, and other specifications as needed. Example:

      image

    • Image: select Official Image, then search for and select ppu-training:1.7.0-pytorch2.8-ppu-py312-cu129-ubuntu24.04.

    After you complete the configuration, click OK to create the DSW instance.

Single-node training with PAI-Megatron-Patch

1. Log on to the DSW instance

Find the target instance and click Open to enter the DSW instance.

image

After you log on to the DSW instance, click the Terminal tab.

PG1_DSW登录
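
Before preparing the code, you can optionally run a quick check in the Terminal. This is a minimal sketch; it assumes that the ppu-training image exposes the PPU devices through PyTorch's standard CUDA interface, as the image tag (pytorch2.8, cu129) suggests.

python -c "import torch; print(torch.__version__, torch.cuda.device_count())"   # expect 16 devices on an ml.gp7vf.16.40xlarge (真武810E) node
df -h /mnt/workspace    # confirm there is enough free space for the datasets and checkpoints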

2. Prepare the code, dataset, and model

  1. Run the following commands to prepare the PAI-Megatron-Patch code.

    mkdir /mnt/workspace/megatron-patch
    cd /mnt/workspace/megatron-patch
    git clone -b v0.9.2 --single-branch --recurse-submodules https://github.com/alibaba/Pai-Megatron-Patch.git
  2. Run the following commands to prepare the dataset required by the model.

    cd /mnt/workspace/megatron-patch
    mkdir qwen-datasets
    cd qwen-datasets
    wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu-internal.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_text_document.bin
    wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu-internal.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_text_document.idx
    
    wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu-internal.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/alpaca_zh-qwen-train.json
    wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu-internal.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/alpaca_zh-qwen-valid.json
  3. Prepare the model checkpoint, using Qwen2.5-7B as an example.

    mkdir -p /mnt/workspace/megatron-patch/qwen-ckpts/Qwen2.5-7B
    cd /mnt/workspace/megatron-patch/qwen-ckpts/Qwen2.5-7B
    pip install modelscope
    modelscope download --model Qwen/Qwen2.5-7B --local_dir ./
  4. Add the code files required to run the training task.

    Two changes are required:

    • Add a run_mcore_qwen-no-overlap.sh file in the /mnt/workspace/megatron-patch/Pai-Megatron-Patch/examples/qwen2_5 directory. Note that the block around line 20 has been modified for 真武810E: 16 GPUs per node are used by default.

      #!/bin/bash
      set -e
      ENV=$1
      CURRENT_DIR="$( cd "$( dirname "$0" )" && pwd )"
      MEGATRON_PATH=$( dirname $( dirname ${CURRENT_DIR}))
      export PYTHONPATH=${MEGATRON_PATH}:${MEGATRON_PATH}/PAI-Megatron-LM-240718:$PYTHONPATH
      export CUDA_DEVICE_MAX_CONNECTIONS=1
      #export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      
      # Here are some configs controlled by env
      if [ -z ${MP_DATASET_TYPE} ];then
          MP_DATASET_TYPE="idxmap"
      fi
      
      if [ -z ${MP_AC_LAYERS} ];then
          MP_AC_LAYERS=1
      fi
      
      if [ $ENV = dsw ]; then
      # Because it is 810E, we use 16 gpus and GPUS_PER_NODE=16
          export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
          MASTER_ADDR=localhost
          MASTER_PORT=$(shuf -n 1 -i 10000-65535)
          NNODES=1
          NODE_RANK=0
          GPUS_PER_NODE=16
      elif [ $ENV = dlc ]; then
          NNODES=${WORLD_SIZE}
          NODE_RANK=${RANK}
          GPUS_PER_NODE=16
      fi
      
      if [ -z ${MP_VP} ]; then
          vp_options=""
      else
          vp_options=" \
              --num-layers-per-virtual-pipeline-stage ${MP_VP}"
      fi
      
      if [ -z ${MP_SFT_PACKING} ]; then
          MP_SFT_PACKING=false
      fi
      
      
      DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
      
      ### BASE CONFIG ###
      MODEL_SIZE=$2
      BATCH_SIZE=$3
      GLOBAL_BATCH_SIZE=$4
      LR=$5
      MIN_LR=$6
      SEQ_LEN=$7
      PAD_LEN=$8
      PR=$9
      ### BASE CONFIG ###
      
      ### PARALLEL / BOOL OPTION ###
      TP=${10}
      PP=${11}
      CP=${12}
      SP=${13}
      DO=${14}
      FL=${15}
      SFT=${16}
      ### PARALLEL / BOOL OPTION ###
      
      ### OTHERS ###
      AC=${17}
      OPTIMIZER_OFFLOAD=${18}
      SAVE_INTERVAL=${19}
      DATASET_PATH=${20}
      VALID_DATASET_PATH=${21}
      PRETRAIN_CHECKPOINT_PATH=${22}
      
      # the following two values will not be used when SFT is true
      TRAIN_TOKENS=${23}
      WARMUP_TOKENS=${24}
      ###############################
      
      OUTPUT_BASEPATH=${25}
      ### OTHERS ###
      
      if [ $FL = true ]; then
          export NVTE_FLASH_ATTN=1 NVTE_FUSED_ATTN=0
      elif [ $FL = false ]; then
          export NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN=1
      fi
      
      if [ $MODEL_SIZE = 0.5B ]; then
      
      NUM_LAYERS=24
      HIDDEN_SIZE=896
      NUM_ATTN_HEADS=14
      INTERMEDIATE_SIZE=4864
      NUM_KEY_VALUE_HEADS=2
      MAX_POSITION_EMBEDDINGS=32768
      EXTRA_VOCAB_SIZE=293
      RMS_NORM_EPS=1e-6
      gqa_options=" \
      		    --group-query-attention \
      		    --num-query-groups ${NUM_KEY_VALUE_HEADS}"
      
      
      tie_option=""
      
      elif [ $MODEL_SIZE = 1.5B ]; then
      
      NUM_LAYERS=28
      HIDDEN_SIZE=1536
      NUM_ATTN_HEADS=12
      INTERMEDIATE_SIZE=8960
      NUM_KEY_VALUE_HEADS=2
      MAX_POSITION_EMBEDDINGS=32768
      EXTRA_VOCAB_SIZE=293
      RMS_NORM_EPS=1e-6
      gqa_options=" \
      		    --group-query-attention \
      		    --num-query-groups ${NUM_KEY_VALUE_HEADS}"
      
      tie_option=""
      
      elif [ $MODEL_SIZE = 3B ]; then
      
      NUM_LAYERS=36
      HIDDEN_SIZE=2048
      NUM_ATTN_HEADS=16
      INTERMEDIATE_SIZE=11008
      NUM_KEY_VALUE_HEADS=2
      MAX_POSITION_EMBEDDINGS=32768
      EXTRA_VOCAB_SIZE=293
      RMS_NORM_EPS=1e-6
      gqa_options=" \
      		    --group-query-attention \
      		    --num-query-groups ${NUM_KEY_VALUE_HEADS}"
      
      tie_option=""
      
      elif [ $MODEL_SIZE = 7B ]; then
      
      NUM_LAYERS=28
      HIDDEN_SIZE=3584
      NUM_ATTN_HEADS=28
      INTERMEDIATE_SIZE=18944
      NUM_KEY_VALUE_HEADS=4
      MAX_POSITION_EMBEDDINGS=131072
      EXTRA_VOCAB_SIZE=421
      RMS_NORM_EPS=1e-6
      gqa_options=" \
      		    --group-query-attention \
      		    --num-query-groups ${NUM_KEY_VALUE_HEADS}"
      
      tie_option=" \
              --untie-embeddings-and-output-weights \
              "
      
      elif [ $MODEL_SIZE = 14B ]; then
      
      NUM_LAYERS=48
      HIDDEN_SIZE=5120
      NUM_ATTN_HEADS=40
      INTERMEDIATE_SIZE=13824
      NUM_KEY_VALUE_HEADS=8
      MAX_POSITION_EMBEDDINGS=131072
      EXTRA_VOCAB_SIZE=421
      RMS_NORM_EPS=1e-5
      gqa_options=" \
      		    --group-query-attention \
      		    --num-query-groups ${NUM_KEY_VALUE_HEADS}"
      
      tie_option=" \
              --untie-embeddings-and-output-weights \
              "
      elif [ $MODEL_SIZE = 32B ]; then
      
      NUM_LAYERS=64
      HIDDEN_SIZE=5120
      NUM_ATTN_HEADS=40
      INTERMEDIATE_SIZE=27648
      NUM_KEY_VALUE_HEADS=8
      MAX_POSITION_EMBEDDINGS=131072
      EXTRA_VOCAB_SIZE=421
      RMS_NORM_EPS=1e-5
      gqa_options=" \
      		    --group-query-attention \
      		    --num-query-groups ${NUM_KEY_VALUE_HEADS}"
      
      tie_option=" \
              --untie-embeddings-and-output-weights \
              "
      elif [ $MODEL_SIZE = 72B ]; then
      
      NUM_LAYERS=80
      HIDDEN_SIZE=8192
      NUM_ATTN_HEADS=64
      INTERMEDIATE_SIZE=29568
      NUM_KEY_VALUE_HEADS=8
      MAX_POSITION_EMBEDDINGS=131072
      EXTRA_VOCAB_SIZE=421
      RMS_NORM_EPS=1e-5
      gqa_options=" \
      		    --group-query-attention \
      		    --num-query-groups ${NUM_KEY_VALUE_HEADS}"
      
      tie_option=" \
              --untie-embeddings-and-output-weights \
              "
      
      fi
      
      TP_COMM_OVERLAP=$(( ($TP > 1) ? 1 : 0 ))
      comm_overlap_option="\
          --overlap-grad-reduce \
          --overlap-param-gather"
       
      
      if [ $TP_COMM_OVERLAP -eq 1 ]; then
          comm_overlap_option="\
              --overlap-grad-reduce \
              --overlap-param-gather"
      fi
      
      if [ $AC = full ]; then
          _check=$(( ($NUM_LAYERS / $PP) % ${MP_AC_LAYERS} ))
          if [ $_check != 0 ]; then
              echo "the num layers per pp rank must be a multiple of the recompute layers."
              exit -1
          fi
          activation_checkpoint_options=" \
      		    --recompute-method uniform \
                  --recompute-num-layers ${MP_AC_LAYERS} \
      		    --recompute-granularity full"
      elif [ $AC = sel ]; then
          activation_checkpoint_options=" \
              --recompute-activations"
      elif [ $AC = none ]; then
          activation_checkpoint_options=" \
          "
      elif [ $AC = offload ]; then
          activation_checkpoint_options=" \
      		    --cpu-offloading \
      		    --cpu-offloading-num-layers ${MP_AC_LAYERS}"
          if [ $TP_COMM_OVERLAP -eq 1 ]; then
              echo "Disable --overlap-grad-reduce and --overlap-param-gather when cpu offloading is on..."
              comm_overlap_option="\
                  --tp-comm-overlap"
          else
              echo "Disable --overlap-grad-reduce and --overlap-param-gather when cpu offloading is on..."
              comm_overlap_option=""
          fi
      fi
      
      if [ $PR = fp16 ]; then
          pr_options=" \
      		    --fp16 \
                  --apply-query-key-layer-scaling"
          export NVTE_APPLY_QK_LAYER_SCALING=1
      elif [ $PR = bf16 ]; then
          pr_options=" \
              --bf16"
      elif [ $PR = fp8 ]; then
          pr_options=" \
              --bf16 \
              --fp8-format hybrid \
              --fp8-amax-compute-algo max \
              --fp8-amax-history-len 1024"
      fi
      
      if [ $OPTIMIZER_OFFLOAD != false ] && [ $DO = false ]; then
          echo "Offload optimizer is valid only if \$DO=true"
          DO=true
      fi
      
      if [ $DO = true ]; then
          do_options=" \
      		    --use-distributed-optimizer"
      
      elif [ $DO = false ]; then
          do_options=" \
                          "
      fi
      
      te_options=" \
              --transformer-impl transformer_engine"
      
      if [ $SP = true ] && [ $TP -gt 1 ]; then
          sp_options=" \
      		    --sequence-parallel"
      
      elif [ $SP = false ]; then
          sp_options=" \
                          "
      fi
      
      if [ $PRETRAIN_CHECKPOINT_PATH != none ]; then
          load_options=" \
                  --load $PRETRAIN_CHECKPOINT_PATH"
      fi
      
      if [ $OPTIMIZER_OFFLOAD = 'static' ]; then
          offload_option=" \
              --optimizer hybridadam \
              --optimizer-offload-policy static \
              --optimizer-offload-fraction 1.0"
      elif [ $OPTIMIZER_OFFLOAD = 'auto' ]; then
          offload_option=" \
              --optimizer hybridadam \
              --optimizer-offload-policy auto"
      else
          offload_option=""
      fi
      
      if [ $SFT = true ]; then
          TRAIN_ITERS=${23}
          LR_WARMUP_ITERS=${24}
          LR_DECAY_ITERS=$(( ${TRAIN_ITERS} - ${LR_WARMUP_ITERS}))
          PREFIX="finetune-mcore-qwen2.5-${MODEL_SIZE}-lr-${LR}-minlr-${MIN_LR}-bs-${BATCH_SIZE}-gbs-${GLOBAL_BATCH_SIZE}-seqlen-${SEQ_LEN}"
          sft_option=" \
               --eod-mask-loss \
               --train-mode finetune"
      else
          TRAIN_ITERS=$(( ${TRAIN_TOKENS} / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
          LR_WARMUP_ITERS=$(( ${WARMUP_TOKENS}  / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
          LR_DECAY_ITERS=$(( ${TRAIN_TOKENS} /  ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
          PREFIX="pretrain-mcore-qwen2.5-${MODEL_SIZE}-lr-${LR}-minlr-${MIN_LR}-bs-${BATCH_SIZE}-gbs-${GLOBAL_BATCH_SIZE}-seqlen-${SEQ_LEN}"
          sft_option=" \
              --train-mode pretrain"
      fi
      
      if [ ${MP_DATASET_TYPE} = "raw" ]; then
          dataset_option=" \
              --train-data-path ${DATASET_PATH} \
              --valid-data-path ${VALID_DATASET_PATH} \
              --dataloader-type cyclic \
              --dataset LLama-SFT-Raw"
      else 
          dataset_option=" \
              --data-path ${DATASET_PATH} \
              --split 99,1,0 \
              --dataset LLama-Pretrain-Idxmap"
      fi
      
      if [ ${MP_SFT_PACKING} = true ]; then
          packing_options=" \
              --reset-position-ids \
              --no-create-attention-mask-in-dataloader
          "
      else
          packing_options=""
      fi
      
      ##### Prepare logdirs #######
      NAME="${PREFIX}-pr-${PR}-tp-${TP}-pp-${PP}-cp-${CP}-ac-${AC}-do-${DO}-sp-${SP}-ti-${TRAIN_ITERS}-wi-${LR_WARMUP_ITERS}"
      mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
      mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
      mkdir -p "${OUTPUT_BASEPATH}/log/"
      current_time=$(date "+%Y.%m.%d-%H.%M.%S")
      TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/${NAME}_${current_time}"
      mkdir -p ${TENSORBOARD_DIR}
      SAVED_PRETRAIN_CHECKPOINT_PATH="${OUTPUT_BASEPATH}/checkpoint/${NAME}"
      
      mkdir -p ${SAVED_PRETRAIN_CHECKPOINT_PATH}
      find -L ${PRETRAIN_CHECKPOINT_PATH} -maxdepth 1 -type f -name "*.json" -print0 | xargs -0 cp -t ${SAVED_PRETRAIN_CHECKPOINT_PATH}
      find -L ${PRETRAIN_CHECKPOINT_PATH} -maxdepth 1 -type f -name "merges.txt" -print0 | xargs -0 cp -t ${SAVED_PRETRAIN_CHECKPOINT_PATH}
      
      megatron_options="  \
              --save ${SAVED_PRETRAIN_CHECKPOINT_PATH} \
              --lr ${LR} \
              --min-lr ${MIN_LR} \
              --lr-decay-style cosine \
              --weight-decay 0.1 \
              --adam-beta1 0.9 \
              --adam-beta2 0.95 \
              --clip-grad 1.0 \
              --init-method-std 0.008 \
              --attention-dropout 0.0 \
              --hidden-dropout 0.0 \
              --lr-decay-iters ${LR_DECAY_ITERS} \
              --lr-warmup-iters ${LR_WARMUP_ITERS} \
              --train-iters ${TRAIN_ITERS} \
              --micro-batch-size ${BATCH_SIZE} \
              --global-batch-size ${GLOBAL_BATCH_SIZE} \
              --num-layers ${NUM_LAYERS} \
              --hidden-size ${HIDDEN_SIZE} \
              --num-attention-heads ${NUM_ATTN_HEADS} \
              --ffn-hidden-size ${INTERMEDIATE_SIZE} \
              --seq-length ${SEQ_LEN} \
              --max-position-embeddings ${MAX_POSITION_EMBEDDINGS} \
              --max-padding-length ${PAD_LEN} \
              --log-interval 1 \
              --log-throughput \
              --eval-interval 10000 \
              --eval-iters 10 \
              --save-interval ${SAVE_INTERVAL} \
              --tensorboard-queue-size 1 \
              --tensorboard-dir ${TENSORBOARD_DIR} \
              --log-timers-to-tensorboard \
              --log-batch-size-to-tensorboard \
              --log-validation-ppl-to-tensorboard \
              --tensor-model-parallel-size ${TP} \
              --pipeline-model-parallel-size ${PP} \
              --context-parallel-size ${CP} \
              --no-load-optim \
              --no-load-rng \
              --num-workers 8 \
              --extra-vocab-size ${EXTRA_VOCAB_SIZE} \
              --patch-tokenizer-type Qwen2Tokenizer \
              --swiglu \
              --normalization RMSNorm \
              --norm-epsilon ${RMS_NORM_EPS} \
              --use-rotary-position-embeddings \
              --position-embedding-type rope \
              --disable-bias-linear \
              --add-qkv-bias \
              --rotary-percent 1.0 \
              --rotary-base 1000000 \
              --rotary-seq-len-interpolation-factor 1 \
              --no-save-optim \
              --timing-log-level 2 \
              --timing-log-option minmax \
              "
      
      run_cmd="torchrun $DISTRIBUTED_ARGS ../qwen2/pretrain_qwen.py
       ${megatron_options} ${dataset_option} ${pr_options} ${load_options} ${te_options} ${activation_checkpoint_options} \
       ${do_options} ${sp_options} ${gqa_options} ${offload_option} ${sft_option}  ${tie_option} ${vp_options} ${packing_options}"
      
      echo ${run_cmd}
      eval ${run_cmd}
      set +x
      
    • Add a run_dsw_7b.sh file in the /mnt/workspace/megatron-patch/Pai-Megatron-Patch/examples/qwen2_5 directory with the following content.

      unset NCCL_QUADRUPLE_CHANNELS
      #export CUDA_VISIBLE_DEVICES=4,5,1,0,9,8,12,13,7,6,2,3,10,11,15,14
      #export CUDA_VISIBLE_DEVICES=4,5,7,6,0,1,3,2,9,8,10,11,13,12,14,15
      # export CUDA_VISIBLE_DEVICES=4,7,5,6,1,2,0,3,12,15,13,14,9,10,8,11
      # export NCCL_SOCKET_IFNAME=eth0
      #export MP_VP=2
      sh run_mcore_qwen-no-overlap.sh \
      dsw  \
      7B   \
      1    \
      256 \
      1e-5   \
      1e-6   \
      4096  \
      4096  \
      bf16  \
      1  \
      1 \
      1 \
      true \
      true   \
      true \
      false \
      false   \
      false \
      100000  \
      /mnt/workspace/megatron-patch/qwen-datasets/wudao_qwenbpe_text_document  \
      /mnt/workspace/megatron-patch/qwen-datasets/wudao_qwenbpe_text_document  \
      /mnt/workspace/megatron-patch/qwen-ckpts/Qwen2.5-7B  \
      10000000000  \
      1000000   \
      /mnt/workspace/megatron-patch/test_qwen2_5

      Parameter descriptions (see the worked example below for how the token counts map to iteration counts):

      ENV=$1                          # Runtime environment: dsw for single-node training, dlc for multi-node training
      MODEL_SIZE=$2                   # Model size: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
      BATCH_SIZE=$3                   # Number of samples per iteration within one data-parallel rank (micro batch size)
      GLOBAL_BATCH_SIZE=$4            # Total number of samples per iteration across all data-parallel ranks
      LR=$5                           # Learning rate
      MIN_LR=$6                       # Minimum learning rate
      SEQ_LEN=$7                      # Sequence length
      PAD_LEN=$8                      # Padding length
      PR=${9}                         # Training precision: fp16, bf16, fp8
      TP=${10}                        # Tensor (model) parallelism degree
      PP=${11}                        # Pipeline parallelism degree
      CP=${12}                        # Context parallelism degree
      SP=${13}                        # Whether to use sequence parallelism: true, false
      DO=${14}                        # Whether to use the Megatron ZeRO-1 distributed optimizer to reduce GPU memory usage: true, false
      FL=${15}                        # Whether to prefer Flash Attention: true, false
      SFT=${16}                       # Whether to run fine-tuning (SFT): true, false
      AC=${17}                        # Activation checkpointing mode: sel, full, offload, none/false
      OPTIMIZER_OFFLOAD=${18}         # Whether to enable optimizer offload: false, static, auto
      SAVE_INTERVAL=${19}             # Checkpoint save interval
      DATASET_PATH=${20}              # Training dataset path
      VALID_DATASET_PATH=${21}        # Validation dataset path
      PRETRAIN_CHECKPOINT_PATH=${22}  # Pretrained model path
      TRAIN_TOKENS_OR_ITERS=${23}     # Number of training tokens (pretraining) or iterations (SFT)
      WARMUP_TOKENS_OR_ITERS=${24}    # Number of warmup tokens (pretraining) or iterations (SFT)
      OUTPUT_BASEPATH=${25}           # Base path for training outputs (logs, TensorBoard files, checkpoints)

      The added files are shown in the following figure.

      PG1_DSW_增加日志
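
      With the values used in run_dsw_7b.sh (GLOBAL_BATCH_SIZE=256, SEQ_LEN=4096, TRAIN_TOKENS=10000000000, WARMUP_TOKENS=1000000, SFT=false), run_mcore_qwen-no-overlap.sh derives the training schedule as TOKENS / GLOBAL_BATCH_SIZE / SEQ_LEN. You can preview the resulting iteration counts with the same integer arithmetic the script uses:

      # Same integer arithmetic as the TRAIN_ITERS / LR_WARMUP_ITERS lines in run_mcore_qwen-no-overlap.sh
      echo $(( 10000000000 / 256 / 4096 ))   # TRAIN_ITERS (and LR_DECAY_ITERS) -> 9536
      echo $(( 1000000 / 256 / 4096 ))       # LR_WARMUP_ITERS -> 0, because 1M warmup tokens is less than one global batch (256 * 4096 tokens)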

3. Run the training task

Run the following commands to start the training task.

cd /mnt/workspace/megatron-patch/Pai-Megatron-Patch/examples/qwen2_5
sh run_dsw_7b.sh
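
Training runs for a long time. If you want the job to keep running after the Terminal tab is closed, one simple option is to launch it in the background and follow the log file (train_7b.log below is just an example file name):

nohup sh run_dsw_7b.sh > train_7b.log 2>&1 &   # run in the background and capture stdout/stderr
tail -f train_7b.log                           # follow the training log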

4. View the results

  1. Check the logs. If the expected log output is displayed, the task is running normally.

    PG1_DSW_日志

  2. Check the monitoring data. If metrics are displayed, the task is running.

    PG1——DSW_任务监控
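
The training script also writes TensorBoard event files under ${OUTPUT_BASEPATH}/tensorboard/, which in this example is /mnt/workspace/megatron-patch/test_qwen2_5/tensorboard. A minimal sketch for inspecting them, assuming TensorBoard is available in the image (install it with pip otherwise):

pip install tensorboard    # only needed if it is not already installed
tensorboard --logdir /mnt/workspace/megatron-patch/test_qwen2_5/tensorboard --port 6006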

Performance metrics

LLaMA-3.1-8B/70B and Qwen2.5-72B

Note: the --tp-comm-overlap option in the training script must be disabled.
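
For reference, a minimal way to do this when starting from the upstream run_mcore_qwen.sh (which, unlike the run_mcore_qwen-no-overlap.sh above, typically adds --tp-comm-overlap to the communication-overlap options when TP > 1). This sketch assumes the flag appears verbatim in that script:

sed -i 's/--tp-comm-overlap//g' run_mcore_qwen.sh
grep -n 'tp-comm-overlap' run_mcore_qwen.sh   # verify that no occurrences remain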

The following are the performance test results for the LLaMA 3.1 and Qwen2.5 model series.

image.png

image.png

image.png

Qwen-VL

  • Open-source repository: https://github.com/QwenLM/Qwen-VL

  • Training configuration file: finetune/finetune_lora_ds.sh

  • Parameter configuration: mbs 2 / ga 4 / seqlen 2048 / zero 2 / bf16 / lora

Nodes | GPUs | Reference throughput (samples/sec) | MFU (TFLOPS / rated value 123)
1     | 16   | 9.64                               | 58.40%
2     | 32   | 19.38                              | 55.40%
4     | 64   | 36.05                              | 57.60%
8     | 128  | 77.46                              | 55.30%
16    | 256  | 152.39                             | 54.40%

Bunny-LLaMA-3-V

  • Open-source repository: https://github.com/BAAI-DCAI/Bunny

  • Training configuration file: https://github.com/BAAI-DCAI/Bunny/blob/main/script/train/tutorials/Bunny-Llama-3-8B-V.md

  • Dataset: https://huggingface.co/datasets/BoyaWu10/Bunny-v1_1-data/tree/main/pretrain

Nodes | GPUs | Reference throughput (samples/sec) | MFU (TFLOPS / rated value 123)
1     | 16   | 37.01                              | 78.54%
2     | 32   | 68.71                              | 73.80%
4     | 64   | 130.86                             | 71.80%
8     | 128  | 260.32                             | 69.70%
16    | 256  | 522.48                             | 73.20%
32    | 512  | 1043.8                             | 70.40%