Deepytorch Inference is an AI inference accelerator developed by Alibaba Cloud. It significantly accelerates inference for Torch models. This document shows how to install and use Deepytorch Inference and benchmarks its performance.
Background
To accelerate inference with Deepytorch Inference, call the deepytorch_inference.compile(model) API. Before using Deepytorch Inference, you must convert your PyTorch model to a TorchScript model using either the torch.jit.script or torch.jit.trace interface. For more information, see the official PyTorch documentation.
This document provides examples using both torch.jit.script and torch.jit.trace to accelerate inference. For more information, see Performance showcase.
Install Deepytorch Inference
Before installing Deepytorch Inference, ensure you have a GPU instance with a supported NVIDIA GPU card, such as an A10, V100, or T4.
After connecting to your GPU instance, use pip to install a specific version of torch (for example, version 2.0.1) and the Deepytorch Inference package. The Deepytorch Inference package is distributed on PyPI, allowing for simple installation with command-line tools.
To select a specific version of the Deepytorch Inference package, you must choose the corresponding whl package from deepytorch inference. For example, if you require the package for Python 3.8, PyTorch 1.13.1, and CUDA 11.7, download deepytorch_inference (for Python 3.8, pt 1.13.1, and cuda 117).
pip install torch==2.0.1 deepytorch-inference -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-torch/stable-diffusion/aiacctorch_stable-diffusion.html
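Choosing a matching wheel requires knowing the Python, PyTorch, and CUDA versions of your environment. The following is a minimal sketch for collecting them; the helper name describe_environment is hypothetical and not part of Deepytorch Inference. It assumes only that torch.__version__ and torch.version.cuda are available once PyTorch is installed, and it reports None for both if PyTorch is missing.

```python
import platform


def describe_environment():
    """Collect the three versions that determine which wheel matches."""
    info = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        import torch  # optional: only importable after PyTorch is installed
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only PyTorch builds
    return info


if __name__ == "__main__":
    for name, version in describe_environment().items():
        print(f"{name}: {version}")
```

Compare the printed values with the Python, pt, and cuda fields in the wheel description before downloading.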
Use Deepytorch Inference
To optimize inference with Deepytorch Inference, add the following lines to your model's script:
import deepytorch_inference # Import the deepytorch_inference package
mod_jit = deepytorch_inference.compile(mod_jit) # Compile the model
Performance showcase
This section demonstrates the inference performance of Deepytorch Inference on different models. The actual acceleration effect depends on factors such as the model and GPU instance type. The following tests use an A10 GPU instance type (for example, gn7i, ebmgn7i, or ebmgn7ix). For information about supported models, see Supported models.
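The demos in this section share one measurement pattern: run a few warm-up calls, then time a fixed number of calls and report the mean latency and throughput. The sketch below wraps that pattern in a reusable helper. The names benchmark and sync are hypothetical, and the helper uses only the Python standard library so that it can be tried without a GPU.

```python
import time


def benchmark(fn, warmup=10, iterations=1000, sync=None):
    """Return (mean latency in ms per call, calls per second) for fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    if sync is not None:
        sync()  # wait for pending asynchronous work before starting the clock
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    if sync is not None:
        sync()  # wait for the timed work to finish before stopping the clock
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1000, iterations / elapsed


if __name__ == "__main__":
    # Stand-in workload; on a GPU instance this would be lambda: mod_jit(in_t).
    latency_ms, throughput = benchmark(lambda: sum(range(1000)), iterations=200)
    print(f"use {latency_ms} ms each inference")
    print(f"{throughput} step/s")
```

Because CUDA kernels are launched asynchronously, passing sync=torch.cuda.synchronize is one way to make the timer wait for the GPU. The demos below do not synchronize, and the figures quoted in this document come from the demo code as written.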
Inference on ResNet50
The following example performs inference on a ResNet50 model by using the torch.jit.script interface. With Deepytorch Inference, the average latency over 1,000 runs is reduced from 3.686 ms to 0.396 ms.
Baseline
The baseline code is as follows:
import time
import torch
import torchvision.models as models
mod = models.resnet50(pretrained=True).eval()
mod_jit = torch.jit.script(mod)
mod_jit = mod_jit.cuda()
in_t = torch.randn([1, 3, 224, 224]).float().cuda()
# Warming up
for _ in range(10):
    mod_jit(in_t)
inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")
The baseline execution result is as follows. It shows an average latency of about 3.69 ms per inference and a throughput of about 271 steps/s.
/workspace/miniconda/envs/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return forward_call(*input, **kwargs)
use 3.6863913536071777 ms each inference
271.26799973659031 step/s
Accelerated version
To accelerate inference, add the following lines to the baseline script:
- import deepytorch_inference
- mod_jit = deepytorch_inference.compile(mod_jit)
The updated code is as follows:
import time
import deepytorch_inference # Import the deepytorch_inference package
import torch
import torchvision.models as models
mod = models.resnet50(pretrained=True).eval()
mod_jit = torch.jit.script(mod)
mod_jit = mod_jit.cuda()
mod_jit = deepytorch_inference.compile(mod_jit) # Compile the model
in_t = torch.randn([1, 3, 224, 224]).float().cuda()
# Warming up
for _ in range(10):
    mod_jit(in_t)
inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")
The result below shows an inference latency of 0.396 ms, down from the 3.686 ms baseline.
/workspace/miniconda/envs/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return forward_call(*input, **kwargs)
use 0.39614391326904297 ms each inference
2524.335138076059 step/s
Inference on Bert-Base
The following example performs inference on a Bert-Base model by using the torch.jit.trace interface. Deepytorch Inference reduces the inference latency from 4.955 ms to 0.418 ms.
Run the following command to install the transformers package:
pip install transformers
Run the baseline and accelerated demos and compare the outputs.
Baseline
The baseline code is as follows:
from transformers import BertModel, BertTokenizer, BertConfig
import torch
import time
enc = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)
# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ]
# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens]).cuda()
segments_tensors = torch.tensor([segments_ids]).cuda()
dummy_input = [tokens_tensor, segments_tensors]
# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
    num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)
# Instantiating the model
model = BertModel(config)
# The model needs to be in evaluation mode
model.eval()
# If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
model = model.eval().cuda()
# Creating the trace
traced_model = torch.jit.trace(model, dummy_input)
# Warming up
for _ in range(10):
    all_encoder_layers, pooled_output = traced_model(*dummy_input)
inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    traced_model(*dummy_input)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")
The baseline result shows an average inference latency of about 4.955 ms.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
use 4.95526909828186 ms each inference
201.8053873899058 step/s
Accelerated version
To accelerate inference, add the following lines to the baseline script:
- import deepytorch_inference
- traced_model = deepytorch_inference.compile(traced_model)
The updated code is as follows:
from transformers import BertModel, BertTokenizer, BertConfig
import torch
import deepytorch_inference # Import the deepytorch_inference package
import time
enc = BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)
# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ]
# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens]).cuda()
segments_tensors = torch.tensor([segments_ids]).cuda()
dummy_input = [tokens_tensor, segments_tensors]
# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
    num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)
# Instantiating the model
model = BertModel(config)
# The model needs to be in evaluation mode
model.eval()
# If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)
model = model.eval().cuda()
# Creating the trace
traced_model = torch.jit.trace(model, dummy_input)
traced_model = deepytorch_inference.compile(traced_model) # Compile the model
# Warming up
for _ in range(10):
    all_encoder_layers, pooled_output = traced_model(*dummy_input)
inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    traced_model(*dummy_input)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")
The output is as follows. It shows an inference latency of 0.418 ms, down from the 4.955 ms baseline.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
use 0.4180655479431523 ms each inference
2391.9694050849334 step/s
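The latencies reported above for ResNet50 and Bert-Base can be turned into speedup factors with a short calculation. The sketch below copies the per-inference latencies from the printed outputs; the helper name speedup is hypothetical.

```python
def speedup(baseline_ms, accelerated_ms):
    """Return how many times faster the accelerated run is than the baseline."""
    return baseline_ms / accelerated_ms


# Latencies in ms per inference, copied from the outputs above.
results = {
    "ResNet50": (3.6863913536071777, 0.39614391326904297),
    "Bert-Base": (4.95526909828186, 0.4180655479431523),
}

for name, (baseline_ms, accelerated_ms) in results.items():
    print(f"{name}: {speedup(baseline_ms, accelerated_ms):.1f}x faster")
```

With these inputs, the factors come out at roughly 9.3x for ResNet50 and 11.9x for Bert-Base on the A10 test instance. Other models and instance types may see different results.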
Dynamic-shape inference on ResNet50
Deepytorch Inference automatically handles dynamic shapes, eliminating the need to manage different input shapes. The following example demonstrates this process on a ResNet50 model using three different input shapes.
import time
import torch
import deepytorch_inference # Import the deepytorch-inference package
import torchvision.models as models
mod = models.resnet50(pretrained=True).eval()
mod_jit = torch.jit.script(mod)
mod_jit = mod_jit.cuda()
mod_jit = deepytorch_inference.compile(mod_jit) # Compile the model
in_t = torch.randn([1, 3, 224, 224]).float().cuda()
in_2t = torch.randn([1, 3, 448, 448]).float().cuda()
in_3t = torch.randn([16, 3, 640, 640]).float().cuda()
# Warming up
for _ in range(10):
    mod_jit(in_t)
    mod_jit(in_3t)
inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
    mod_jit(in_2t)
    mod_jit(in_3t)
end = time.time()
print(f"use {(end-start)/(inference_count*3)*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")
The execution result shows that the average inference latency across the three input shapes is about 9.85 ms. Each step in the reported throughput runs all three shapes once.
/workspace/miniconda/envs/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return forward_call(*input, **kwargs)
use 9.846995433171589 ms each inference
33.85127327371685 step/s
To reduce model compilation time, warm up the model with tensors representing the smallest and largest expected input shapes. This practice prevents repeated compilation during execution. For example, if input shapes will vary between 1x3x224x224 and 16x3x640x640, you should warm up the model with tensors of both sizes.
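As an illustration of this warm-up advice, the sketch below picks the smallest and largest shapes from a list of expected input shapes. Ranking shapes by total element count is an assumption made for this example, and the helper name warmup_shapes is hypothetical; neither comes from Deepytorch Inference.

```python
from math import prod


def warmup_shapes(expected_shapes):
    """Return the smallest and largest shape, ranked by element count (assumed metric)."""
    smallest = min(expected_shapes, key=prod)
    largest = max(expected_shapes, key=prod)
    # Avoid warming up twice when only one shape is expected.
    return [smallest] if smallest == largest else [smallest, largest]


if __name__ == "__main__":
    shapes = [(1, 3, 448, 448), (16, 3, 640, 640), (1, 3, 224, 224)]
    for shape in warmup_shapes(shapes):
        print("warm up with input shape", shape)
```

On a GPU instance, each returned shape would be used to build a tensor such as torch.randn(shape).float().cuda() and passed through the compiled model a few times before timing starts.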