Deepytorch Inference is an AI inference accelerator from Alibaba Cloud that provides high-performance inference for PyTorch models. It significantly improves inference performance by partitioning the model's computation graph, fusing execution layers, and implementing high-performance operators (OPs). This topic describes the concepts, benefits, and supported models of Deepytorch Inference.
Introduction to Deepytorch Inference
Deepytorch Inference provides inference acceleration using just-in-time (JIT) compilation to optimize deep learning models in the PyTorch framework. This process enables fast and efficient inference without requiring you to specify precision or input sizes.
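To make the idea of fusing execution layers concrete, the following is a minimal, purely illustrative sketch of pattern-based operator fusion on a toy operator list. It is not the Deepytorch Inference implementation: the operator names, the `FUSION_PATTERNS` table, and the `fuse_ops` function are all hypothetical, and a real compiler works on a full computation graph rather than a flat list.

```python
from typing import List, Sequence, Tuple

# Hypothetical fusion patterns, for illustration only. The patterns that
# Deepytorch Inference actually fuses are not documented in this topic.
# Longer patterns come first so that they win over their shorter prefixes.
FUSION_PATTERNS: Tuple[Tuple[str, ...], ...] = (
    ("conv", "batch_norm", "relu"),
    ("conv", "batch_norm"),
    ("linear", "gelu"),
)


def fuse_ops(ops: Sequence[str]) -> List[str]:
    """Greedily replace known operator sequences with a single fused operator."""
    fused: List[str] = []
    i = 0
    while i < len(ops):
        for pattern in FUSION_PATTERNS:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                fused.append("fused_" + "_".join(pattern))
                i += len(pattern)
                break
        else:
            # No pattern starts at this position, so keep the operator as-is.
            fused.append(ops[i])
            i += 1
    return fused


graph = ["conv", "batch_norm", "relu", "max_pool", "linear", "gelu", "softmax"]
fused_graph = fuse_ops(graph)
print(fused_graph)
# ['fused_conv_batch_norm_relu', 'max_pool', 'fused_linear_gelu', 'softmax']
```

In this toy example, seven operators collapse into four. Fewer operators generally means fewer kernel launches and fewer intermediate tensors written to memory, which is the usual motivation for layer fusion.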
The following figure shows the architecture of Deepytorch Inference.
Architecture layer | Description |
Framework layer | |
Deepytorch Inference acceleration | Deepytorch Inference component |
Operator layer | |
Benefits
Significantly improves inference performance
Deepytorch Inference uses compilation to reduce model inference latency, which improves the model's real-time performance and response speed.
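Latency claims like this are usually checked with a small timing harness. The sketch below shows one common approach, assuming a warm-up phase followed by repeated timed runs and a median. The `measure_latency_ms` helper and the `fake_inference` workload are hypothetical stand-ins, not part of Deepytorch Inference, and this is not necessarily how the figures in this topic were measured.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(fn: Callable[[], object], warmup: int = 10, runs: int = 100) -> float:
    """Return the median wall-clock latency of fn() in milliseconds."""
    # Warm-up calls absorb one-time costs such as JIT compilation and caching.
    for _ in range(warmup):
        fn()

    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_inference() -> int:
    # Hypothetical CPU workload standing in for a model forward pass.
    return sum(i * i for i in range(10_000))


latency = measure_latency_ms(fake_inference)
print(f"median latency: {latency:.3f} ms")
```

The warm-up phase matters for JIT-compiled models because the first calls include compilation time. When timing a model on a GPU, the device must also be synchronized before each timestamp is read (for example, with `torch.cuda.synchronize()`), because CUDA kernels are launched asynchronously.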
The following table compares the inference performance of different models.
Note: The following data shows the inference performance on a single A10 card. Compared to the default inference configuration of the model, using Deepytorch Inference for optimization significantly improves inference performance.
Model | Input size | Deepytorch Inference (ms) | PyTorch float (ms) | Inference speed increase |
Resnet50 | 1 × 3 × 224 × 224 | 0.47 | 2.92 | 84% |
Mobilenet-v2-100 | 1 × 3 × 224 × 224 | 0.24 | 2.01 | 88% |
SRGAN-X4 | 1 × 3 × 272 × 480 | 23.07 | 132.00 | 83% |
YOLO-V3 | 1 × 3 × 640 × 640 | 3.87 | 15.70 | 75% |
Bert-base-uncased | 1 × 128, 1 × 128 | 0.94 | 3.76 | 75% |
Bert-large-uncased | 1 × 128, 1 × 128 | 1.33 | 7.11 | 81% |
GPT2 | 1 × 128 | 1.49 | 3.82 | 71% |
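The percentages in the table above are consistent with reading "inference speed increase" as the relative reduction in latency, that is, 1 − (optimized latency ÷ baseline latency). The sketch below works through that arithmetic for three rows and also prints the equivalent speedup factor. The function names are illustrative, and the formula is an inference from the published numbers, not a definition stated in this topic.

```python
def latency_reduction(baseline_ms: float, optimized_ms: float) -> float:
    """Fraction of the baseline latency that the optimization removes."""
    return 1.0 - optimized_ms / baseline_ms


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


# (PyTorch float latency in ms, Deepytorch Inference latency in ms),
# copied from the table above.
rows = {
    "Resnet50": (2.92, 0.47),
    "Mobilenet-v2-100": (2.01, 0.24),
    "YOLO-V3": (15.70, 3.87),
}

for name, (baseline, optimized) in rows.items():
    reduction = latency_reduction(baseline, optimized)
    factor = speedup(baseline, optimized)
    print(f"{name}: {reduction:.0%} lower latency, {factor:.1f}x faster")
```

Under this reading, Resnet50 comes out at 84%, which corresponds to a speedup of roughly 6.2 times. The GPT2 row works out to about 61% with the same formula, so the listed 71% may come from a different measurement or may be a typo.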
Ease of use
Deepytorch Inference does not require you to specify precision or input sizes. It is easy to use because it leverages JIT compilation with minimal code intrusion. This approach reduces code complexity and maintenance costs.
Model support
Deepytorch Inference currently supports optimization for the models listed below.
Models that support inference acceleration
Scenario | Supported model name |
Vision scenarios | |
NLP scenarios | |
Models that do not support inference acceleration
Operations such as weight demodulation in the StyleGan2 model are not supported because weight demodulation dynamically generates weights based on the input.
Models that dynamically set attributes are not supported. For example, the fasterrcnn_resnet50_fpn model in torchvision.models.detection causes an error when you execute torch.jit.freeze.
The conv1d operator is not supported in models such as demucs.
References
Using the Deepytorch Inference tool can significantly improve model inference performance compared to the default configuration. For instructions, see Install and use Deepytorch Inference.
You can use the Deepytorch Training tool to optimize model training and significantly improve end-to-end training performance. For more information, see What is Deepytorch Training? and Install and use Deepytorch Training.