What is Deepytorch Inference (inference acceleration)?

Deepytorch Inference is an AI inference accelerator from Alibaba Cloud that provides high-performance inference for PyTorch models. It significantly improves inference performance by partitioning the model's computation graph, fusing execution layers, and implementing high-performance operators (OPs). This topic describes the concepts, benefits, and supported models of Deepytorch Inference.

Introduction to Deepytorch Inference

Deepytorch Inference accelerates deep learning models in the PyTorch framework by optimizing them with just-in-time (JIT) compilation. You do not need to specify the precision or input sizes in advance.
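As a rough illustration of the TorchScript form that this kind of JIT-based optimization operates on, the following sketch traces and freezes a small model using only standard PyTorch calls (`torch.jit.trace` and `torch.jit.freeze`). `TinyNet` and `to_frozen_torchscript` are hypothetical names made up for this example. The Deepytorch Inference API itself is not shown because this topic does not describe it.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in model; any traceable nn.Module could be used."""

    def __init__(self) -> None:
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.bn(self.conv(x)))


def to_frozen_torchscript(model: nn.Module, example: torch.Tensor) -> torch.jit.ScriptModule:
    """Trace the model to TorchScript, then freeze it (inline weights as constants)."""
    model = model.eval()  # torch.jit.freeze only accepts modules in eval mode
    with torch.no_grad():
        traced = torch.jit.trace(model, example)
    return torch.jit.freeze(traced)


example = torch.randn(1, 3, 224, 224)  # only needed by torch.jit.trace
model = TinyNet()
frozen = to_frozen_torchscript(model, example)

# Assumption: an accelerator such as Deepytorch Inference would take over from
# a TorchScript module like `frozen`; the actual call is not covered here.
with torch.no_grad():
    out = frozen(example)
    ref = model(example)
```

The frozen module should produce the same output as the original eager model, up to small numerical differences introduced by graph-level optimizations such as folding batch normalization into the convolution.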

The following figure shows the architecture of Deepytorch Inference.

[Figure: Deepytorch Inference architecture]

| Architecture layer | Component | Description |
| --- | --- | --- |
| Framework layer | PyTorch Framework | The PyTorch framework component that connects to your models. |
| Framework layer | PyTorch Custom Ops | Third-party PyTorch operators. These operators are not optimized and remain in the framework layer. |
| Deepytorch Inference acceleration: Deepytorch Inference component | TorchScript Graph Optimization Pipelines | Graph optimization and operator fusion passes that run on TorchScript. |
| Deepytorch Inference acceleration: Deepytorch Inference component | Environment Manager | Controls the execution features and optimization levels of Deepytorch Inference. |
| Deepytorch Inference acceleration: Deepytorch Inference component | Deepytorch Engine | The core execution engine. It includes key components such as Build Helper Ops, Operation Parser, Shape Tracker, Accuracy Checker, and Engine Rebuilder. |
| Deepytorch Inference acceleration: Operator layer | High Performance Kernel Libs | Operator libraries that provide high-performance kernels. |
| Deepytorch Inference acceleration: Operator layer | Custom Plugins | Implementations of other functional operators. |

Benefits

  • Significantly improves inference performance

    Deepytorch Inference uses compilation to reduce model inference latency, which improves the model's real-time performance and response speed.

    The following table compares the inference performance of different models.

    Note

    The following data was measured on a single A10 GPU. The baseline is the default inference configuration of each model, and the optimized result uses Deepytorch Inference.

    | Model | Input size | Deepytorch Inference (ms) | PyTorch float (ms) | Latency reduction | Source |
    | --- | --- | --- | --- | --- | --- |
    | Resnet50 | 1 × 3 × 224 × 224 | 0.47 | 2.92 | 84% | torchvision |
    | Mobilenet-v2-100 | 1 × 3 × 224 × 224 | 0.24 | 2.01 | 88% | torchvision |
    | SRGAN-X4 | 1 × 3 × 272 × 480 | 23.07 | 132.00 | 83% | SRGAN |
    | YOLO-V3 | 1 × 3 × 640 × 640 | 3.87 | 15.70 | 75% | yolov3 |
    | Bert-base-uncased | 1 × 128, 1 × 128 | 0.94 | 3.76 | 75% | transformers |
    | Bert-large-uncased | 1 × 128, 1 × 128 | 1.33 | 7.11 | 81% | transformers |
    | GPT2 | 1 × 128 | 1.49 | 3.82 | 71% | transformers |

  • Ease of use

    Deepytorch Inference does not require you to specify precision or input sizes. Because it relies on JIT compilation, it requires only minimal changes to your code, which reduces code complexity and maintenance costs.
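The percentage column in the performance table above appears to be a relative latency reduction, that is, 1 minus the ratio of optimized latency to baseline latency. The following minimal sketch reproduces that calculation from the two latency columns and also derives the equivalent speedup factor. The function names are illustrative and are not part of any Deepytorch Inference API.

```python
def latency_reduction(optimized_ms: float, baseline_ms: float) -> float:
    """Percentage by which latency drops relative to the baseline."""
    return (1.0 - optimized_ms / baseline_ms) * 100.0


def speedup(optimized_ms: float, baseline_ms: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


# (Deepytorch Inference ms, PyTorch float ms), copied from the table above
rows = {
    "Resnet50": (0.47, 2.92),
    "Mobilenet-v2-100": (0.24, 2.01),
    "SRGAN-X4": (23.07, 132.00),
    "YOLO-V3": (3.87, 15.70),
    "Bert-base-uncased": (0.94, 3.76),
    "Bert-large-uncased": (1.33, 7.11),
    "GPT2": (1.49, 3.82),
}

for name, (opt, base) in rows.items():
    print(
        f"{name}: {latency_reduction(opt, base):.0f}% lower latency, "
        f"{speedup(opt, base):.1f}x speedup"
    )
```

For Resnet50, for example, an 84% latency reduction corresponds to a speedup of roughly 6.2 times. With the listed latencies, the GPT2 row works out to about 61% rather than the 71% shown, so one of those three figures may contain a transcription error.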

Model support

Deepytorch Inference currently supports optimization for the models listed below.

Models that support inference acceleration

| Scenario | Supported model name |
| --- | --- |
| Vision scenarios | alexnet, dcgan, mnasnet1_0, mobilenet_v2, mobilenet_v3_large, pytorch_stargan, resnet18, resnet50, resnext50_32x4d, shufflenet_v2_x1_0, squeezenet1_1, timm_efficientnet, timm_nfnet, timm_regnet, timm_resnest, timm_vision_transformer, timm_vovnet, vgg16, SRGAN-X4, YOLO-V3 |
| NLP scenarios | BERT_pytorch, attention_is_all_you_need_pytorch, GPT2, bert-base-uncased, bert-large-uncased |

Models that do not support inference acceleration

  • Operations such as weight demodulation in the StyleGAN2 model are not supported, because weight demodulation generates the weights dynamically from the input.

  • Models that dynamically set attributes are not supported. For example, the fasterrcnn_resnet50_fpn model in torchvision.models.detection causes an error when you execute torch.jit.freeze.

  • The conv1d operator is not supported in models such as demucs.
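The weight demodulation limitation in the list above can be illustrated with the formulas from the StyleGAN2 paper. The notation here comes from that paper, not from this topic: $w$ is the stored convolution weight, $s_i$ is the scale for input channel $i$ derived from the style input, and $\epsilon$ is a small constant for numerical stability.

```latex
w'_{ijk} = s_i \cdot w_{ijk}, \qquad
w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} {w'_{ijk}}^{2} + \epsilon}}
```

Because $s_i$ changes with each input, the effective weights $w''$ must be recomputed for every sample, which is presumably why they cannot be treated as constants during compilation.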

References