
Best practice: Accelerate Transformer model training


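PAI-Rapidformer provides a rich set of methods for accelerating model training. After you install the dedicated Rapidformer image, you can optimize model training in either a black-box or a white-box manner. This topic describes how to use Rapidformer to optimize the training of PyTorch Transformer models.

Prerequisites

Background information

Rapidformer can accelerate model training in a black-box or a white-box manner:

Black-box acceleration: Accelerate the fine-tuning of a Hugging Face model

  1. Register your dataset with Hugging Face, or find an existing dataset to use. You later pass it to Rapidformer with the --dataset-name switch.

  2. Register your model with Hugging Face, or use an existing model. You later pass it to Rapidformer with the --pretrained-model-name-or-path switch.

  3. Configure the Rapidformer CLI command that starts training, as shown in the following example.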

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
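    # The flags are collected in a Bash array so that each one can carry a comment;
    # a comment placed after a trailing backslash would break the line continuation.
    ARGS=(
      --task sequence_classification                      # task name
      --pretrained-model-name-or-path 'bert-base-cased'   # name of the registered model
      --data-path glue                                    # path name of the registered dataset
      --data-name mrpc                                    # file name of the registered dataset
      --epochs 3                                          # number of training epochs
      --micro-batch-size 16                               # batch size on each GPU
      --global-batch-size 64                              # total batch size for distributed training
      --lr 2e-5                                           # learning rate
      --lr-decay-style linear                             # learning rate decay policy
      --lr-warmup-iters 100                               # number of learning rate warmup steps
      --weight-decay 1e-2                                 # weight decay coefficient
      --clip-grad 1.0                                     # gradient clipping threshold
      --seed 42                                           # random seed
      --mixed-precision                                   # enable mixed-precision training
      --onnx-runtime-training                             # enable computation graph optimization
      --zero-1-memory-optimization                        # enable optimizer state partitioning
    )
    rapidformer "${ARGS[@]}"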

    For details about each parameter, see the parameter configuration guide.
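
The --lr, --lr-warmup-iters, and --lr-decay-style linear switches together describe a learning rate curve. The following is a minimal sketch of one common reading of such a schedule (linear warmup to the peak value, then linear decay to zero). It is an illustration only: the function name and the total step count are assumptions, and the schedule that Rapidformer actually applies may differ in detail.

```python
def linear_warmup_linear_decay(step, peak_lr, warmup_iters, total_iters):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to zero."""
    if total_iters <= warmup_iters:
        raise ValueError("total_iters must be larger than warmup_iters")
    if step < warmup_iters:
        # Warmup phase: ramp up from 0 to peak_lr.
        return peak_lr * step / warmup_iters
    # Decay phase: go from peak_lr at the end of warmup down to 0 at total_iters.
    remaining = max(total_iters - step, 0)
    return peak_lr * remaining / (total_iters - warmup_iters)


# peak_lr and warmup_iters come from the script above; total_iters=300 is a
# made-up value chosen only to make the curve easy to read.
for step in (0, 50, 100, 200, 300):
    print(step, linear_warmup_linear_decay(step, 2e-5, 100, 300))
```

Printing a few points like this is a quick way to check that the warmup length is sensible relative to the total number of training steps.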

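Black-box acceleration: Accelerate the pre-training of a Hugging Face model

  1. Create a pre-training dataset of the mmap type.

    For details, see the Megatron data processing script. The following command is an example of building an mmap dataset.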

    python preprocess_data.py \
      --input book_wiki_owtv2_small.json  \
      --output-prefix gpt_small \
      --vocab gpt2-vocab.json \
      --dataset-impl mmap \
      --tokenizer-type GPT2BPETokenizer \
      --merge-file gpt2-merges.txt \
      --append-eod
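
  2. Register your model with Hugging Face, or use an existing model. You later pass it to Rapidformer with the --pretrained-model-name-or-path switch.

  3. Configure the Rapidformer CLI command that starts training, as shown in the following example.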

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    # The flags are collected in a Bash array so that they can carry comments;
    # a comment placed after a trailing backslash would break the line continuation.
    ARGS=(
      --task pretraining
      --pretrained-model-name-or-path 'bert-base-uncased'
      --num-layers 12
      --hidden-size 768
      --num-attention-heads 12
      --micro-batch-size 16
      --global-batch-size 128                 # enables gradient accumulation
      --seq-length 512
      --tokenizer-type BertWordPieceLowerCase
      --max-position-embeddings 512
      --train-iters 100
      --data-path book_wiki_owtv2_small_text_sentence
      --vocab-file bert-en-uncased-vocab.txt
      --data-impl mmap
      --split 980,20
      --lr 1e-3
      --lr-decay-style linear
      --min-lr 0.0
      --lr-decay-iters 2000
      --weight-decay 1e-2
      --clip-grad 1.0
      --lr-warmup-fraction .01
      --mixed-precision                       # enable mixed-precision training
      --onnx-runtime-training                 # enable computation graph optimization
      --fsdp-memory-optimization              # enable model state partitioning
    )
    rapidformer "${ARGS[@]}"

    For details about each parameter, see the parameter configuration guide.
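
The comment on --global-batch-size in the script above says that the setting turns on gradient accumulation. A short sketch of the arithmetic behind that remark follows. It assumes the Megatron-style convention that the global batch equals the micro batch times the number of data-parallel workers times the number of accumulation steps, and that all four visible GPUs are used for data parallelism. The function name is hypothetical.

```python
def gradient_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Number of micro batches each worker accumulates before one optimizer step."""
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global batch size must be a multiple of micro batch size x data-parallel size"
        )
    return global_batch_size // samples_per_pass


# Fine-tuning example earlier in this topic: 64 = 16 x 4 GPUs, so no accumulation.
print(gradient_accumulation_steps(64, 16, 4))   # 1
# Pre-training example above: 128 = 16 x 4 GPUs x 2, so two accumulation steps.
print(gradient_accumulation_steps(128, 16, 4))  # 2
```

If the division is not exact, the combination of batch sizes and GPU count is inconsistent under this convention, which is why the sketch raises an error instead of rounding.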

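White-box acceleration: Fine-tune a Hugging Face model based on the Finetuner code template

This section describes how to use the Finetuner code template provided by Rapidformer to quickly build a Hugging Face fine-tuning task. The template contains four functions that you need to pay attention to:

  • train_valid_test_datasets_provider, which builds the datasets

  • model_optimizer_lr_scheduler_provider, which constructs the model, the optimizer, and the learning rate scheduler

  • run_forward_step, which holds the forward computation logic

  • run_compute_metrics, which evaluates accuracy while training is in progress

For details about these four functions, see the Rapidformer API. The following template briefly describes their inputs and outputs.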

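from rapidformer import Finetuner

class MyFintuner(Finetuner):

    def __init__(self, engine):
        super().__init__(engine=engine)

    # Build the training, validation, and test datasets.
    # Input: none
    # Output: three dataset objects and one collate function
    def train_valid_test_datasets_provider(self):
        # Create the datasets and the collate function here.
        return train_dataset, valid_dataset, test_dataset, collate_fn

    # Create the model, the optimizer, and the learning rate scheduler.
    # Input: none
    # Output: three objects
    def model_optimizer_lr_scheduler_provider(self):
        # Create the model (and optionally the optimizer and scheduler) here.
        return model, optimizer, lr_scheduler

    # Implement the forward logic.
    # Input: a batch or an iterator, and the model
    # Output: the loss
    def run_forward_step(self, batch_or_iterator, model):
        return loss

    # Implement the evaluation logic on the validation set (fine-tuning only).
    # Input: the model and the validation data loader
    # Output: a metric object
    def run_compute_metrics(self, model, eval_dataloader):
        return metric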
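
To make the division of labor among the four hooks concrete, here is a toy, pure-Python sketch of how a trainer base class could call them. It is not Rapidformer's implementation: SketchFinetuner and ToyFinetuner are hypothetical names, the "model" is a plain function, and the backward pass and optimizer step are only indicated by a comment.

```python
class SketchFinetuner:
    """Hypothetical stand-in for a trainer base class (not the Rapidformer Finetuner)."""

    def train(self, epochs=2):
        train_ds, valid_ds, _test_ds, collate_fn = self.train_valid_test_datasets_provider()
        model, _optimizer, _lr_scheduler = self.model_optimizer_lr_scheduler_provider()
        history = []
        for epoch in range(epochs):
            for batch in self._batches(train_ds, collate_fn):
                loss = self.run_forward_step(batch, model)
                # A real engine would run the backward pass, the optimizer step,
                # and the learning rate scheduler step at this point.
            # Evaluate on the validation set after every epoch.
            metric = self.run_compute_metrics(model, self._batches(valid_ds, collate_fn))
            history.append((epoch, loss, metric))
        return history

    @staticmethod
    def _batches(dataset, collate_fn, batch_size=2):
        for start in range(0, len(dataset), batch_size):
            yield collate_fn(dataset[start:start + batch_size])


class ToyFinetuner(SketchFinetuner):
    def train_valid_test_datasets_provider(self):
        data = [(x, 2 * x) for x in range(8)]  # (input, label) pairs
        return data[:4], data[4:6], data[6:], list

    def model_optimizer_lr_scheduler_provider(self):
        # Returning None leaves the optimizer and scheduler to the engine.
        return (lambda x: 2 * x), None, None

    def run_forward_step(self, batch, model):
        # Mean squared error over the batch.
        return sum((model(x) - y) ** 2 for x, y in batch) / len(batch)

    def run_compute_metrics(self, model, eval_dataloader):
        pairs = [pair for batch in eval_dataloader for pair in batch]
        return {"accuracy": sum(model(x) == y for x, y in pairs) / len(pairs)}


print(ToyFinetuner().train())
```

The point of the pattern is that your subclass only describes data, model, forward pass, and metrics, while the base class owns the loop, which is where an engine can apply its acceleration options.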
                

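After you are familiar with the preceding code template, prepare the dataset and the model as described in Black-box acceleration: Accelerate the fine-tuning of a Hugging Face model, and then perform the following steps.

  1. Import the Rapidformer and Hugging Face interfaces.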

    import torch
    from transformers import AutoConfig, AutoTokenizer, BertForSequenceClassification
    from datasets import load_dataset, load_metric
    from rapidformer import RapidformerEngine
    from rapidformer import get_args
    from rapidformer import get_logger
    from rapidformer import get_timers
    from rapidformer import Finetuner
    from rapidformer import Pretrainer
    from rapidformer import build_train_valid_test_datasets_for_huggingface

  2. Complete the four functions in the code template, as shown below.

    class MyFintuner(Finetuner):
        def __init__(self,engine):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self):
            tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    
            def tokenize_function(examples):
                # max_length=None => use the model max length (it's actually the default)
                outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
                return outputs
    
            args = get_args()
            datasets = load_dataset(args.dataset_path, args.dataset_name)
            # Apply the method we just defined to all the examples in all the splits of the dataset
            tokenized_datasets = datasets.map(
                tokenize_function,
                batched=True,
                remove_columns=["idx", "sentence1", "sentence2"],
            )
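            # rename_column returns a new dataset; the in-place rename_column_ was removed from recent datasets releases
            tokenized_datasets = tokenized_datasets.rename_column("label", "labels")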
    
            train_dataset = tokenized_datasets["train"]
            valid_dataset = tokenized_datasets['validation']
            test_dataset = tokenized_datasets['test']
    
            def collate_fn(examples):
                return tokenizer.pad(examples, padding="longest", return_tensors="pt")
    
            return train_dataset, valid_dataset, test_dataset, collate_fn
    
        def model_optimizer_lr_scheduler_provider(self):
            args = get_args()
            model = BertForSequenceClassification.from_pretrained(args.load)
            return model, None, None
    
        def run_forward_step(self, batch, model):
            output_tensor = model(**batch)
            return output_tensor.loss
    
        # after each epoch run metric on eval dataset
        def run_compute_metrics(self, model, eval_dataloader):
            args = get_args()
            model = model[0]
            metric = load_metric(args.dataset_path, args.dataset_name)
            for step, batch in enumerate(eval_dataloader):
                with torch.no_grad():
                    outputs = model(**batch)
                predictions = outputs.logits.argmax(dim=-1)
    
                metric.add_batch(
                    predictions=self.gather(predictions),
                    references=self.gather(batch["labels"]),
                )
    
            eval_metric = metric.compute()
            return eval_metric
                            

  3. Initialize the Rapidformer engine, create the trainer object, and call its train() method. Then save the code to a file named rapidformer_finetune_huggingface_bert_trainer.py.

    engine = RapidformerEngine()
    trainer = MyFintuner(engine=engine)
    trainer.train()

  4. Prepare the startup script based on the CLI. Set --user-script to rapidformer_finetune_huggingface_bert_trainer.py and turn on the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    # The flags are collected in a Bash array so that they can carry comments;
    # a comment placed after a trailing backslash would break the line continuation.
    ARGS=(
      --user-script rapidformer_finetune_huggingface_bert_trainer.py
      --task sequence_classification
      --pretrained-model-name-or-path 'bert-base-cased'
      --data-path glue
      --data-name mrpc
      --epochs 3
      --micro-batch-size 16
      --global-batch-size 16
      --lr 2e-5
      --lr-decay-style linear
      --lr-warmup-iters 100
      --weight-decay 1e-2
      --clip-grad 1.0
      --mixed-precision                     # enable mixed-precision training
      --zero-3-memory-optimization          # enable model state partitioning
      --onnx-runtime-training               # enable computation graph optimization
    )
    rapidformer "${ARGS[@]}"
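
The collate_fn in step 2 pads each batch with padding="longest", so sequences are padded only to the longest example in the current batch and not to the model maximum. The following pure-Python sketch shows that idea on plain lists. It is a simplified stand-in for tokenizer.pad: the function name is hypothetical, it handles only input_ids and attention_mask, and it assumes right padding with pad token ID 0.

```python
def pad_to_longest(examples, pad_token_id=0):
    """Pad every input_ids list to the longest one in this batch and build the mask."""
    longest = max(len(example["input_ids"]) for example in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for example in examples:
        ids = example["input_ids"]
        padding = longest - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * padding)
        # 1 marks a real token, 0 marks padding that attention should ignore.
        batch["attention_mask"].append([1] * len(ids) + [0] * padding)
    return batch


batch = pad_to_longest([
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 102]},
])
print(batch["input_ids"])       # [[101, 7592, 102], [101, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

Padding per batch keeps short batches short, which reduces wasted computation compared with padding every example to a fixed maximum length.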

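White-box acceleration: Pre-train a Hugging Face model based on the Pretrainer code template

When you use the Pretrainer code template provided by Rapidformer to quickly build a Hugging Face pre-training task, pay attention to the following functions in the template:

  • train_valid_test_datasets_provider, which builds the datasets

  • model_optimizer_lr_scheduler_provider, which constructs the model, the optimizer, and the learning rate scheduler

  • run_forward_step, which holds the forward computation logic

For details about these functions, see the Rapidformer API. For a brief description of their inputs and outputs, see White-box acceleration: Fine-tune a Hugging Face model based on the Finetuner code template.

After you are familiar with the preceding code template, prepare the dataset and the model as described in Black-box acceleration: Accelerate the fine-tuning of a Hugging Face model, and then perform the following steps.

  1. Import the Rapidformer and Hugging Face interfaces.

    Note

    Pre-training reads data through an iterator, so mpu must be imported here to handle data parallelism.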

    import torch
    from megatron import mpu
    from transformers import AutoModelForPreTraining, BertConfig, BertForPreTraining
    from rapidformer import RapidformerEngine, get_args, PreTrainer
    from rapidformer import build_train_valid_test_datasets_for_huggingface

  2. Inherit from PreTrainer and complete the pre-training code, as shown below.

    class MyBertPreTrainer(PreTrainer):
    
        def __init__(self,engine):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self, train_val_test_num_samples):
            args = get_args()
    
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets_for_huggingface(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                max_seq_length=args.seq_length,
                masked_lm_prob=args.mask_prob,
                short_seq_prob=args.short_seq_prob,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup),
                binary_head=True)
    
            return train_ds, valid_ds, test_ds
    
        def model_optimizer_lr_scheduler_provider(self):
            args = get_args()
            model = AutoModelForPreTraining.from_pretrained(args.pretrained_model_name_or_path)
            return model, None, None
    
        def run_forward_step(self, data_iterator, model):
            # Items and their type.
            keys = ['input_ids', 'attention_mask', 'token_type_ids', 'labels', 'next_sentence_label']
            datatype = torch.int64
    
            # Broadcast data.
            if data_iterator is not None:
                data = next(data_iterator)
            else:
                data = None
            data_b = mpu.broadcast_data(keys, data, datatype)
            input_ids = data_b['input_ids'].long()
            attention_mask = data_b['attention_mask'].long()
            token_type_ids = data_b['token_type_ids'].long()
            labels = data_b['labels'].long()
            next_sentence_label = data_b['next_sentence_label'].long()
            output_tensor = model(input_ids=input_ids, attention_mask=attention_mask,
                                  token_type_ids=token_type_ids, labels=labels, next_sentence_label=next_sentence_label)
    
            return output_tensor['loss']

  3. Initialize the Rapidformer engine, create the trainer object, and call its train() method. Then save the code to a file named rapidformer_pretrain_huggingface_bert_trainer.py.

    engine = RapidformerEngine()
    trainer = MyBertPreTrainer(engine=engine)
    trainer.train()

  4. Prepare the startup script based on the CLI and turn on the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    DATA_PATH=book_wiki_owtv2_small_text_sentence
    
    # The flags are collected in a Bash array so that they can carry comments;
    # a comment placed after a trailing backslash would break the line continuation.
    ARGS=(
      --user-script rapidformer_pretrain_huggingface_bert_trainer.py
      --pretrained-model-name-or-path 'bert-base-uncased'
      --num-layers 12
      --hidden-size 768
      --num-attention-heads 12
      --micro-batch-size 16
      --global-batch-size 64
      --seq-length 512
      --tokenizer-type BertWordPieceLowerCase
      --max-position-embeddings 512
      --train-iters 100
      --data-path "$DATA_PATH"
      --vocab-file bert-en-uncased-vocab.txt
      --data-impl mmap                      # enable data loading acceleration
      --split 980,20
      --lr 1e-3
      --lr-decay-style linear
      --weight-decay 1e-2
      --clip-grad 1.0
      --lr-warmup-fraction .01
      --zero-3-memory-optimization          # enable model state partitioning
      --onnx-runtime-training               # enable computation graph optimization
      --mixed-precision                     # enable mixed-precision training
    )
    rapidformer "${ARGS[@]}"
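
The --split 980,20 switch above gives relative weights for the train, validation, and test portions of the data. The sketch below shows how such a string can be turned into document index boundaries, assuming the convention used by Megatron-LM style loaders (missing entries count as zero and the weights are normalized by their sum). The function name is hypothetical, and the behavior inside Rapidformer is not confirmed by this topic.

```python
def split_boundaries(splits_string, num_documents):
    """Turn a weight string such as '980,20' into [0, train_end, valid_end, test_end]."""
    weights = [float(w) for w in splits_string.split(",")]
    # Missing entries (for example, no test weight) are treated as zero.
    weights = (weights + [0.0] * 3)[:3]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one split weight must be positive")
    boundaries = [0]
    for weight in weights:
        boundaries.append(boundaries[-1] + int(round(weight / total * num_documents)))
    # Absorb rounding drift so that the last boundary equals num_documents.
    drift = boundaries[-1] - num_documents
    return [boundaries[0]] + [b - drift for b in boundaries[1:]]


print(split_boundaries("980,20", 1000))  # [0, 980, 1000, 1000]
```

Under this reading, 980,20 means 98% of the documents for training, 2% for validation, and an empty test split.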

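White-box acceleration: Fine-tune a Hugging Face model with a user-defined Trainer

For programs that use a user-defined trainer, Rapidformer offers only limited acceleration capabilities, such as the Apex optimizer, model state partitioning, and computation graph optimization. Because mixed-precision training requires many changes to your training procedure, we recommend that you use the code-template-based approaches described above to accelerate your training program. The following steps apply intrusive acceleration to a typical Hugging Face fine-tuning script.

The following code is an example of a Hugging Face fine-tuning script.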

import torch
from datasets import load_dataset, load_metric
from torch.utils.data import DataLoader
from transformers import (
    AdamW,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    BertForSequenceClassification,

)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")

def tokenize_function(examples):
    # max_length=None => use the model max length (it's actually the default)
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
    return outputs

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["idx", "sentence1", "sentence2"],
)

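model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)

# Plain PyTorch data loaders: no data parallelism and no automatic device placement.
train_dataloader = DataLoader(tokenized_datasets["train"])
eval_dataloader = DataLoader(tokenized_datasets["validation"])

# `args` is assumed to hold the parsed command-line arguments (lr, epochs, and so on).
optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)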

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_iters,
    num_training_steps=args.train_iters
)

device = torch.device("cuda", args.local_rank)
model.to(device)  # the model must be on the same device as the batches

for epoch in range(args.epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    for step, batch in enumerate(eval_dataloader):
        batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
            metric.add_batch(
                    predictions=engine.gather(predictions),
                    references=engine.gather(batch["labels"]))

    eval_metric = metric.compute()
    print("epoch {}: {}".format(epoch, eval_metric))

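This code has several problems: it does not support data-parallel training, its optimizer is relatively slow, and it does not support mixed-precision training. The following steps use the APIs provided by Rapidformer to adapt this sample code.

  1. Support data parallelism.

    First create a finetuner object, and then call finetuner.build_data_loader to obtain the data loaders. These loaders support data parallelism and automatically send the data to the GPU, so you can remove batch.to(device) from the original code.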

    + from rapidformer import RapidformerEngine, Finetuner
    + engine = RapidformerEngine()
    + finetuner = Finetuner(engine=engine)
    
    - train_dataloader = DataLoader(tokenized_datasets["train"])
    - eval_dataloader = DataLoader(tokenized_datasets["validation"])
    
    + train_dataloader = finetuner.build_data_loader(tokenized_datasets["train"])
    + eval_dataloader = finetuner.build_data_loader(tokenized_datasets["validation"])
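
  2. On top of data parallelism, use the Apex optimizer.

    Replace the original optimizer with the faster Apex fused Adam provided by Rapidformer. To do so, call engine.compose to wrap the model, the optimizer, and the learning rate scheduler.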

    + from functools import partial
    + from rapidformer import RapidformerEngine, Finetuner
    + engine = RapidformerEngine()
    + finetuner = Finetuner(engine=engine)
    
    - optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)
    - lr_scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,
    -     num_warmup_steps=args.lr_warmup_iters,
    -     num_training_steps=args.train_iters
    - )
    
    + lr_scheduler = partial(
    +     get_linear_schedule_with_warmup,
    +     num_warmup_steps=args.lr_warmup_iters,
    +     num_training_steps=args.train_iters
    + )
    
    + model, optimizer, lr_scheduler = engine.compose(model_obj=model,
    +       lr_scheduler_fn=lr_scheduler)
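
    Note

    Combining the Apex optimizer with mixed precision on top of data parallelism requires changes to the training flow, such as switching the model to fp16 and applying loss scaling. Retrofitting a front-end program that has no trainer is costly, so we recommend the Trainer-based solutions instead. With the Rapidformer Finetuner, many more acceleration options become available: in addition to the data parallelism, Apex, and PyTorch mixed-precision training described above, it also provides Megatron optimizer mixed-precision training and the memory optimizations of FairScale and DeepSpeed.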
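
The note above mentions loss scaling as one of the changes that mixed-precision training brings to a training loop. The following pure-Python sketch shows the general idea of dynamic loss scaling: the scale is halved when gradients overflow and doubled after a run of clean steps. It is a generic illustration with hypothetical names and small numbers for readability. It is not the scaler used by Rapidformer, Apex, or Megatron.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler: back off on overflow, grow after clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000,
                 growth_factor=2.0, backoff_factor=0.5):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.good_steps = 0

    def unscale(self, scaled_grads):
        # Gradients of (loss * scale) are scale times too large; undo that.
        return [g / self.scale for g in scaled_grads]

    def update(self, scaled_grads):
        """Return True if the optimizer step should be applied for these gradients."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # Overflow: skip this step and retry later with a smaller scale.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            self.scale *= self.growth_factor
        return True


scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
print(scaler.update([1.0, 2.0]), scaler.scale)       # True 8.0
print(scaler.update([0.5, 4.0]), scaler.scale)       # True 16.0
print(scaler.update([float("inf")]), scaler.scale)   # False 8.0
```

A loop that uses such a scaler has to multiply the loss before the backward pass, check the gradients, and skip some optimizer steps, which is the kind of change that is easier to hide inside a trainer than to add to a hand-written loop.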

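White-box acceleration: Pre-train a Megatron model based on the Pretrainer code template

After you are familiar with the preceding practice, White-box acceleration: Fine-tune a Hugging Face model with a user-defined Trainer, you can go a step further and bypass the Data Hub and the Model Hub. Write your own dataset creation logic in train_valid_test_datasets_provider, your own model creation logic in model_optimizer_lr_scheduler_provider, and your own forward logic in run_forward_step.

  1. Create a pre-training dataset of the mmap type.

    For details, see the Megatron data processing script. The following command is an example of building an mmap dataset.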

    python preprocess_data.py \
      --input /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/book_wiki_owtv2_small.json  \
      --output-prefix /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/gpt_small \
      --vocab gpt2-vocab.json \
      --dataset-impl mmap \
      --tokenizer-type GPT2BPETokenizer \
      --merge-file gpt2-merges.txt \
      --append-eod
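
  2. Inherit from PreTrainer and complete the custom data function train_valid_test_datasets_provider in the pre-training code.

    You can write your own logic, without relying on any third-party library, to build the train, valid, and test datasets. Your datasets should inherit from torch.utils.data.Dataset.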

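    from rapidformer import RapidformerEngine, get_args, PreTrainer
    # Assumption: build_train_valid_test_datasets is the GPT dataset builder from Megatron-LM.
    from megatron.data.gpt_dataset import build_train_valid_test_datasets
    
    class MyGPTPreTrainer(PreTrainer):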
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self, train_val_test_num_samples):
            args = get_args()
    
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                seq_length=args.seq_length,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup))
    
            return train_ds, valid_ds, test_ds
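
  3. Inherit from PreTrainer and complete the custom model function model_optimizer_lr_scheduler_provider in the pre-training code.

    You can write your own logic, without relying on any third-party library, to build a custom model object. Your model should inherit from torch.nn.Module.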

    from rapidformer import RapidformerEngine, get_args, PreTrainer
    from yourmodel import GPTModel
    
    class MyGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
        def model_optimizer_lr_scheduler_provider(self):
            model = GPTModel()
            return model, None, None
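
  4. Inherit from PreTrainer and complete the custom forward function run_forward_step in the pre-training code.

    import torch
    # Assumption: the following helpers come from Megatron-LM.
    from megatron import get_tokenizer, mpu
    from megatron.utils import get_ltor_masks_and_position_ids
    from rapidformer import RapidformerEngine, get_args, PreTrainer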
    
    class MyGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
    
        def run_forward_step(self, data_iterator, model):
            """Forward step."""
            args = get_args()
    
            tokenizer = get_tokenizer()
    
            # Items and their type.
            keys = ['text']
            datatype = torch.int64
    
            # Broadcast data.
            if data_iterator is not None:
                data = next(data_iterator)
            else:
                data = None
            data_b = mpu.broadcast_data(keys, data, datatype)
    
            # Unpack.
            tokens_ = data_b['text'].long()
            labels = tokens_[:, 1:].contiguous()
            tokens = tokens_[:, :-1].contiguous()
    
            # Get the masks and position ids.
            attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
                tokens,
                tokenizer.eod,
                args.reset_position_ids,
                args.reset_attention_mask,
                args.eod_mask_loss)
    
            output_tensor = model(tokens, position_ids, attention_mask,
                                  labels=labels)
    
            losses = output_tensor.float()
            loss_mask = loss_mask.view(-1).float()
            loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()
    
            return loss
    
    
                            

  5. Initialize the Rapidformer engine, create the trainer object, and call its train() method. Then save the code to a file named rapidformer_pretrain_megatron_gpt_trainer.py.

    engine = RapidformerEngine()
    trainer = MyGPTPreTrainer(engine=engine)
    trainer.train()

  6. Prepare the startup script and turn on the acceleration switches.

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    DATA_PATH=book_wiki_owtv2_small_text_sentence
    PRETRAINED_CHECKPOINT=
    
    # The flags are collected in a Bash array so that they can carry comments;
    # a comment placed after a trailing backslash would break the line continuation.
    ARGS=(
      --user-script rapidformer_pretrain_megatron_gpt_trainer.py
      --tensor-model-parallel-size 2        # enable operator splitting (tensor parallelism)
      --pipeline-model-parallel-size 2      # enable pipeline parallelism
      --num-layers 12
      --hidden-size 768
      --num-attention-heads 12
      --micro-batch-size 16
      --global-batch-size 128               # enable gradient accumulation
      --seq-length 512
      --tokenizer-type GPT2BPETokenizer
      --max-position-embeddings 512
      --train-iters 100
      --data-path "$DATA_PATH"
      --vocab-file gpt2-vocab.json
      --merge-file gpt2-merges.txt
      --data-impl mmap                      # enable data loading acceleration
      --split 980,20
      --lr 1e-3
      --lr-decay-style linear
      --weight-decay 1e-2
      --clip-grad 1.0
      --lr-warmup-fraction .01
      --log-interval 1
      --zero-2-memory-optimization          # enable model state partitioning
      --checkpoint-activations              # enable gradient checkpointing
      --mixed-precision                     # enable mixed-precision training
    )
    rapidformer "${ARGS[@]}"
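
The run_forward_step function in step 4 relies on two small ideas: the token sequence is shifted by one position so that each input token is paired with the next token as its label, and the per-token losses are averaged only over positions where the loss mask is 1. The following pure-Python sketch reproduces both on plain lists. The function names are hypothetical and the numbers are made up for illustration.

```python
def shift_for_next_token(tokens):
    """Split one token sequence into model inputs and next-token labels."""
    return tokens[:-1], tokens[1:]


def masked_mean(losses, loss_mask):
    """Average the per-token losses over the positions whose mask value is 1."""
    kept = sum(loss_mask)
    if kept == 0:
        raise ValueError("loss_mask must keep at least one position")
    return sum(loss * mask for loss, mask in zip(losses, loss_mask)) / kept


inputs, labels = shift_for_next_token([5, 6, 7, 8])
print(inputs, labels)                             # [5, 6, 7] [6, 7, 8]
# The third position is masked out (for example, an end-of-document token).
print(masked_mean([0.5, 1.5, 4.0], [1, 1, 0]))    # 1.0
```

Dividing by the number of kept positions, not by the sequence length, keeps the loss comparable across batches that contain different amounts of masked tokens.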