ResNet50 Training with KSpeed


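This article uses ResNet50 image classification training as an example to show how KSpeed accelerates image data loading in the computer vision (CV) domain. The ResNet50 model is the implementation from NVIDIA's open-source DeepLearningExamples repository. Using KSpeed requires a few changes to the original code, which can be applied to the ResNet50 model as a git patch. The changes are summarized in the section "Key modules for integrating KSpeed" at the end of this article.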

Code preparation

Runtime environment configuration

Start the training container with the following command:

docker run -it --gpus all --name=resnet50_kspeed_test --net=host --ipc host --device=/dev/infiniband/ --ulimit memlock=-1:-1 -v /{path-to-imagenet}:/{path-to-imagenet-in-docker} -v /{path-to-DeepLearningExamples}:/{path-to-DeepLearningExamples-in-docker} eflo-registry.cn-beijing.cr.aliyuncs.com/eflo/ngc-pytorch-kspeed-22.05-py38:v2.2.0
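Note

In the command above:

  • {path-to-imagenet} is the path of the imagenet dataset on the host machine.

  • {path-to-imagenet-in-docker} is the path where the dataset is mounted inside the container.

  • {path-to-DeepLearningExamples} is the path of the model training code on the host machine.

  • {path-to-DeepLearningExamples-in-docker} is the path where the model training code is mounted inside the container.

Set these paths to match your environment.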

The imagenet dataset has the following directory structure:

imagenet
├── train
│   ├── n01440764
│   │  ├── n01440764_10026.JPEG
│   │  ├── n01440764_10027.JPEG
│   │  └── ......
│   ├── n01443537
│   └── ......         
└── val                
    ├── n01440764
    │  ├── ILSVRC2012_val_00000293.JPEG
    │  ├── ILSVRC2012_val_00002138.JPEG
    │  └── ......
    ├── n01443537
    └── ......       

For how to obtain the dataset, see: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/ConvNets/resnet50v1.5
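Before starting a long training run, it can help to confirm that the mounted dataset really has the train/val layout shown above. The following is a minimal sketch using only the Python standard library. The function names `summarize_imagenet` and `make_demo_tree` are hypothetical helpers written for this illustration, and the demo tree contains empty stand-in files (one file name is made up), not real images.

```python
import os
import tempfile


def summarize_imagenet(root):
    """Return {split: (num_class_dirs, num_jpeg_files)} for train and val."""
    summary = {}
    for split in ("train", "val"):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        # Each class is a sub-directory named after its WordNet id.
        classes = sorted(
            d for d in os.listdir(split_dir)
            if os.path.isdir(os.path.join(split_dir, d))
        )
        images = sum(
            1
            for c in classes
            for f in os.listdir(os.path.join(split_dir, c))
            if f.lower().endswith((".jpeg", ".jpg"))
        )
        summary[split] = (len(classes), images)
    return summary


def make_demo_tree(root):
    """Create a tiny stand-in tree (empty files) with the expected shape."""
    files = {
        "train/n01440764": ["n01440764_10026.JPEG", "n01440764_10027.JPEG"],
        "train/n01443537": ["n01443537_1.JPEG"],  # made-up file name
        "val/n01440764": ["ILSVRC2012_val_00000293.JPEG"],
        "val/n01443537": [],
    }
    for rel_dir, names in files.items():
        class_dir = os.path.join(root, *rel_dir.split("/"))
        os.makedirs(class_dir)
        for name in names:
            open(os.path.join(class_dir, name), "wb").close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_demo_tree(tmp)
        print(summarize_imagenet(tmp))
```

To check a real dataset, call `summarize_imagenet` with the in-container dataset path. The train and val splits should report the same number of class directories.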

Run model training

# Stay in the DeepLearningExamples directory
cd ./PyTorch/Classification/ConvNets

# Single node, 8 GPUs: baseline
bash ./resnet50v1.5/training/AMP/DGXA100_resnet50_AMP_multi.sh pytorch {path-to-imagenet-in-docker}

# Single node, 8 GPUs: kspeed
bash ./resnet50v1.5/training/AMP/DGXA100_resnet50_AMP_multi.sh kspeed {path-to-imagenet-in-docker}

# Single node, 8 GPUs: dali+kspeed
bash ./resnet50v1.5/training/AMP/DGXA100_resnet50_AMP_multi.sh dali-kspeed {path-to-imagenet-in-docker}
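Note
  • In the commands above, {path-to-imagenet-in-docker} is the path of the imagenet dataset inside the container. It must match the path set when the container was started.

  • Before running the KSpeed tests, make sure the kspeed service is deployed.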

Key modules for integrating KSpeed

Add the kspeeddataloader module file

Add the new file DeepLearningExamples/PyTorch/Classification/ConvNets/image_classification/kspeeddataloader.py. It implements two dataloaders: a KSpeed-based PyTorch Dataloader and a KSpeed-based DALI Dataloader.

KSpeed-based PyTorch Dataloader

To implement the KSpeed-based PyTorch Dataloader, you only need to change the Dataset and then combine it with PyTorch's native Sampler and Dataloader. The core code is as follows:

  • Import the kspeeddataset module

    import kspeed.utils.data.kspeeddataset as KSpeedDataset
  • Replace torchvision.datasets.ImageFolder with KSpeedDataset.KSpeedImageFolder to use KSpeed's accelerated data loading

    train_dataset = KSpeedDataset.KSpeedImageFolder(
            traindir, None, workers, kspeed_iplist,
            "admin", "admin", transforms.Compose(transforms_list),
        )
    
    val_dataset = KSpeedDataset.KSpeedImageFolder(
            valdir, None, workers, kspeed_iplist,
            "admin", "admin",
            transforms.Compose(
                [
                    transforms.Resize(
                        image_size + crop_padding, interpolation=interpolation
                    ),
                    transforms.CenterCrop(image_size),
                ]
            ),
        )
  • Implement the get_kspeed_train_loader and get_kspeed_val_loader methods; see lines 16~72 and 74~128 of kspeeddataloader.py
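The reason only the Dataset has to change is that a map-style dataset exposes just `__len__` and `__getitem__`, so the sampler and dataloader never need to know where the bytes come from. The sketch below illustrates that protocol in plain Python. `ByteSourceImageFolder`, `fetch_bytes`, and the in-memory `store` are hypothetical stand-ins invented for this example. They are not the internals of KSpeedImageFolder, which this article does not show.

```python
class ByteSourceImageFolder:
    """Map-style dataset: index -> (transformed sample, class index).

    Hypothetical stand-in for illustration only.
    """

    def __init__(self, samples, fetch_bytes, transform=None):
        # samples: list of (path, class_name) pairs.
        classes = sorted({cls for _, cls in samples})
        self.class_to_idx = {cls: i for i, cls in enumerate(classes)}
        self.samples = [(path, self.class_to_idx[cls]) for path, cls in samples]
        self.fetch_bytes = fetch_bytes
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, target = self.samples[index]
        data = self.fetch_bytes(path)  # only this step differs from a local read
        if self.transform is not None:
            data = self.transform(data)
        return data, target


# Demo with an in-memory "store" instead of a real data service.
store = {
    "train/n01440764/a.JPEG": b"\xff\xd8aaa",
    "train/n01443537/b.JPEG": b"\xff\xd8bb",
}
dataset = ByteSourceImageFolder(
    samples=[(p, p.split("/")[1]) for p in store],
    fetch_bytes=store.__getitem__,
    transform=len,  # placeholder transform: report the payload size
)
for i in range(len(dataset)):
    print(dataset[i])
```

In a real training script, such a class would subclass torch.utils.data.Dataset and be passed to torch.utils.data.DataLoader together with a sampler, which is the combination the text above describes.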

KSpeed-based DALI Dataloader

To implement the KSpeed-based DALI Dataloader, you only need to change the input of the DALI pipeline to an external data source, KSpeedCallable. The core code is as follows:

  • KSpeedCallable

    The KSpeedCallable object inherits from KSpeedDataset.KSpeedFolder. In lines 164~179 of kspeeddataloader.py, line 176 reads imagenet samples through self.dataset.getBIN(path).

    def __call__(self, sample_info):
            
        if self.dataset is None:
            self.load()
        if sample_info.iteration >= self.full_iters:
            raise StopIteration()
        if self.last_seen_epoch != sample_info.epoch_idx:
            self.last_seen_epoch = sample_info.epoch_idx
            self.perm = np.random.default_rng(seed=42 + sample_info.epoch_idx).permutation(len(self.files))
        idx = self.perm[sample_info.idx_in_epoch + self.shard_offset]
            
        path = os.path.join(self.root, self.files[idx])
        dout = self.dataset.getBIN(path)
        sample = np.frombuffer(dout, dtype=np.uint8)
        label = np.int32([self.labels[idx]])
        return sample, label
  • DALI Pipeline based on KSpeedCallable

    In lines 223~229 of kspeeddataloader.py, KSpeedCallable serves as the external data source from which the DALI Pipeline obtains dataset samples.

    if kspeed:
        images, labels = fn.external_source(source=kscallable,
                            num_outputs=2,
                            batch=False, 
                            parallel=True, 
                            dtype=[types.UINT8, types.INT32], 
                            device='cpu')
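The index arithmetic in the `__call__` method of KSpeedCallable shown above (a per-epoch permutation plus a shard offset) is easier to see in isolation. The sketch below reproduces the idea with the standard library only: it swaps NumPy's `default_rng(...).permutation` for `random.Random(...).shuffle`, so the permutations differ from the real code. How `shard_offset` and `full_iters` are computed is not shown in this article, so `shard_bounds` is an assumption (equal contiguous slices, incomplete last batch dropped).

```python
import random


def shard_bounds(num_samples, shard_id, num_shards, batch_size):
    """Assumed sharding: equal contiguous slices of the permuted index list."""
    shard_size = num_samples // num_shards
    shard_offset = shard_size * shard_id
    full_iters = shard_size // batch_size  # drop the incomplete last batch
    return shard_offset, shard_size, full_iters


def epoch_permutation(num_samples, epoch_idx, base_seed=42):
    """Same seed on every rank gives the same permutation on every rank."""
    perm = list(range(num_samples))
    random.Random(base_seed + epoch_idx).shuffle(perm)
    return perm


def sample_index(perm, idx_in_epoch, shard_offset):
    """Map a shard-local position to a global sample index."""
    return perm[idx_in_epoch + shard_offset]


num_samples, num_shards, batch_size = 10, 2, 2
seen = []
for shard_id in range(num_shards):
    offset, size, iters = shard_bounds(num_samples, shard_id, num_shards, batch_size)
    perm = epoch_permutation(num_samples, epoch_idx=0)
    picked = [sample_index(perm, i, offset) for i in range(iters * batch_size)]
    seen.append(picked)
    print(shard_id, picked)
```

Because every shard seeds the generator with the same value for a given epoch, all shards agree on the permutation, and the offset alone keeps their samples disjoint. Changing the epoch index reseeds the shuffle, which matches the `last_seen_epoch` check in the original method.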

Add DATA_BACKEND_CHOICES options

At line 40 of DeepLearningExamples/PyTorch/Classification/ConvNets/image_classification/dataloaders.py, change the original DATA_BACKEND_CHOICES = ["pytorch", "syntetic"] to the following:

DATA_BACKEND_CHOICES = ["pytorch", "syntetic", "kspeed", "dali-kspeed",  "dali"]
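Extending this list matters because a list like this typically acts as the whitelist for a command-line option. The sketch below assumes the list is passed as argparse `choices` for a `--data-backend` flag. The flag name, the default value, and the helper names `build_parser` and `parse_backend` are assumptions made for this illustration, not a copy of main.py.

```python
import argparse

# Spelling of "syntetic" kept exactly as in the list quoted above.
DATA_BACKEND_CHOICES = ["pytorch", "syntetic", "kspeed", "dali-kspeed", "dali"]


def build_parser():
    parser = argparse.ArgumentParser(description="backend selection sketch")
    parser.add_argument(
        "--data-backend",
        default="pytorch",  # assumed default for this sketch
        choices=DATA_BACKEND_CHOICES,  # values outside the list are rejected
    )
    return parser


def parse_backend(argv):
    """Return the chosen backend, or None if argparse rejects the value."""
    try:
        return build_parser().parse_args(argv).data_backend
    except SystemExit:
        # argparse reports the invalid choice on stderr and exits.
        return None


print(parse_backend(["--data-backend", "kspeed"]))
print(parse_backend(["--data-backend", "unknown"]))
```

Under this assumption, a backend name that is missing from the list would be rejected during argument parsing, before any dataloader code runs.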

Add args.data_backend options

In DeepLearningExamples/PyTorch/Classification/ConvNets/main.py, at lines 512~520, add the following code to the args.data_backend branches:

elif args.data_backend == "kspeed":
    get_train_loader = get_kspeed_train_loader
    get_val_loader = get_kspeed_val_loader
elif args.data_backend == "dali":
    get_train_loader = get_dali_kspeed_train_loader(dali_cpu=True, kspeed=False)
    get_val_loader = get_dali_kspeed_val_loader(dali_cpu=True, kspeed=False)
elif args.data_backend == "dali-kspeed":
    get_train_loader = get_dali_kspeed_train_loader(dali_cpu=True)
    get_val_loader = get_dali_kspeed_val_loader(dali_cpu=True)
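Note that the branches above use two different styles: the kspeed branch assigns a function directly, while the dali branches call a factory that returns the loader function. The sketch below shows that pattern with stub functions that return strings instead of building real loaders. The `(data_path, batch_size)` signature, the `select_train_loader` helper, and the default `kspeed=True` are assumptions. The default is inferred only from the dali-kspeed branch omitting that argument.

```python
def get_kspeed_train_loader(data_path, batch_size):
    """Stub loader function: assigned directly and called later by the trainer."""
    return f"kspeed loader({data_path}, bs={batch_size})"


def get_dali_kspeed_train_loader(dali_cpu=False, kspeed=True):
    """Stub factory: captures the options and returns the loader function."""
    def loader(data_path, batch_size):
        source = "kspeed" if kspeed else "files"
        device = "cpu" if dali_cpu else "gpu"
        return (
            f"dali loader({data_path}, bs={batch_size}, "
            f"source={source}, device={device})"
        )
    return loader


def select_train_loader(data_backend):
    """Mirror of the if/elif chain above, restricted to the new backends."""
    if data_backend == "kspeed":
        return get_kspeed_train_loader
    elif data_backend == "dali":
        return get_dali_kspeed_train_loader(dali_cpu=True, kspeed=False)
    elif data_backend == "dali-kspeed":
        return get_dali_kspeed_train_loader(dali_cpu=True)
    raise ValueError(f"unsupported backend: {data_backend}")


for backend in ("kspeed", "dali", "dali-kspeed"):
    get_train_loader = select_train_loader(backend)
    print(backend, "->", get_train_loader("/data/imagenet", 256))
```

Either way, `get_train_loader` ends up bound to a callable with the same signature, so the rest of the training script can use it without knowing which backend was selected.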