Optimize inference with Deepytorch Inference - Elastic GPU Service (EGS)

Deepytorch Inference is an AI inference accelerator developed by Alibaba Cloud. It significantly accelerates inference for Torch models. This document shows how to install and use Deepytorch Inference and benchmarks its performance.

Background

You accelerate inference with Deepytorch Inference by calling the deepytorch_inference.compile(model) API. Before using Deepytorch Inference, you must convert your PyTorch model to a TorchScript model by using either the torch.jit.script or the torch.jit.trace interface. For more information, see the official PyTorch documentation.

This document provides examples that use both torch.jit.script and torch.jit.trace to accelerate inference. For more information, see Performance showcase.
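
The difference between the two conversion interfaces can be illustrated with a toy module. The sketch below is not taken from the Deepytorch Inference documentation: the `Gate` module is a hypothetical example, and it runs on the CPU with plain PyTorch only. It shows that torch.jit.script keeps data-dependent control flow, whereas torch.jit.trace records only the operations executed for the example input.

```python
import warnings

import torch


class Gate(torch.nn.Module):
    """Hypothetical toy module with a data-dependent branch."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.sum() > 0:
            return x * 2
        return x + 10


mod = Gate().eval()

# torch.jit.script compiles the Python source, so both branches are kept.
scripted = torch.jit.script(mod)

# torch.jit.trace records the operations run for the example input only.
# PyTorch emits a TracerWarning about the branch; it is silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(mod, torch.ones(3))

negative = -torch.ones(3)
scripted_out = scripted(negative)  # follows the `x + 10` branch
traced_out = traced(negative)      # replays the recorded `x * 2` branch

print("scripted:", scripted_out)
print("traced:  ", traced_out)
```

For models without data-dependent control flow, such as the ResNet50 and Bert-Base examples in this document, both interfaces produce an equivalent TorchScript module, and either result can then be passed to deepytorch_inference.compile.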

Install Deepytorch Inference

Important

Before installing Deepytorch Inference, ensure you have a GPU instance with a supported NVIDIA GPU card, such as an A10, V100, or T4.

After connecting to your GPU instance, use pip to install a specific version of torch (for example, version 2.0.1) and the Deepytorch Inference package. The Deepytorch Inference package is distributed on PyPI, allowing for simple installation with command-line tools.

Note

To select a specific version of the Deepytorch Inference package, you must choose the corresponding whl package from deepytorch inference. For example, if you require the package for Python 3.8, PyTorch 1.13.1, and CUDA 11.7, download deepytorch_inference (for Python 3.8, pt 1.13.1, and cuda 117).

pip install torch==2.0.1 deepytorch-inference -f https://aiacc-inference-public-v2.oss-cn-hangzhou.aliyuncs.com/aiacc-inference-torch/stable-diffusion/aiacctorch_stable-diffusion.html
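
After installation, a quick check can confirm that the packages are importable and that PyTorch sees a GPU. The following is a minimal sketch, not an official verification procedure; the helper name `check_environment` is an assumption made for illustration, and it relies only on the Python standard library plus `torch.cuda.is_available()`.

```python
import importlib.util


def check_environment() -> dict:
    """Report whether the packages used in this document are importable."""
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "deepytorch_inference_installed":
            importlib.util.find_spec("deepytorch_inference") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without torch

        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {ok}")
```

On a correctly prepared GPU instance, all three entries would be expected to print True.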

Use Deepytorch Inference

To optimize inference with Deepytorch Inference, add the following lines to your model's script:

import deepytorch_inference # Import the deepytorch_inference package

mod_jit = deepytorch_inference.compile(mod_jit) # Compile the TorchScript model and keep the returned module
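
If the same script must also run on machines where Deepytorch Inference is not installed, the compile step can be wrapped in a guard. This is a sketch under assumptions: the helper name `compile_if_available` and the fallback policy (return the TorchScript module unchanged) are illustrative choices, not part of the Deepytorch Inference API. The only Deepytorch call used is deepytorch_inference.compile, as described above.

```python
import importlib


def compile_if_available(mod_jit):
    """Return the Deepytorch-compiled module, or the input unchanged."""
    try:
        deepytorch_inference = importlib.import_module("deepytorch_inference")
    except ImportError:
        # Assumed fallback policy: keep using the plain TorchScript module.
        return mod_jit
    return deepytorch_inference.compile(mod_jit)
```

A call such as `mod_jit = compile_if_available(mod_jit)` would then replace the direct compile call in a script.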

Performance showcase

This section demonstrates the inference performance of Deepytorch Inference on different models. The actual acceleration effect depends on factors such as the model and GPU instance type. The following tests use an A10 GPU instance type (for example, gn7i, ebmgn7i, or ebmgn7ix). For information about supported models, see Supported models.

Inference on ResNet50

The following example performs inference on a ResNet50 model by using the torch.jit.script interface. With Deepytorch Inference, the average latency over 1,000 runs is reduced from 3.686 ms to 0.396 ms.

Baseline

The baseline code is as follows:

import time
import torch
import torchvision.models as models
mod = models.resnet50(pretrained=True).eval()
mod_jit = torch.jit.script(mod)
mod_jit = mod_jit.cuda()

in_t = torch.randn([1, 3, 224, 224]).float().cuda()

# Warming up
for _ in range(10):
    mod_jit(in_t)

inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")

The baseline output shows an average latency of about 3.69 ms per inference and a throughput of about 271 steps/s:

/workspace/miniconda/envs/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
    return forward_call(*input, **kwargs)
use 3.6863913536071777 ms each inference
271.26799973659031 step/s
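
The timing loops in this document read the clock without synchronizing the GPU. Because CUDA kernels are launched asynchronously, a stricter measurement would wait for queued work before starting and stopping the timer. The helper below is a sketch of that idea, not the script that produced the numbers in this document; the function name `benchmark` and its parameters are assumptions. It takes the synchronization function as an argument so that it also runs on a machine without a GPU.

```python
import time
from typing import Callable, Optional, Tuple


def benchmark(fn: Callable[[], object],
              warmup: int = 10,
              iters: int = 1000,
              sync: Optional[Callable[[], None]] = None) -> Tuple[float, float]:
    """Return (average latency in ms, calls per second) for `fn`."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # drain warm-up work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1000, iters / elapsed


# CPU-only demonstration so the sketch runs anywhere.
latency_ms, steps_per_s = benchmark(lambda: sum(range(1000)), warmup=2, iters=50)
print(f"use {latency_ms} ms each inference")
print(f"{steps_per_s} step/s")
```

On a GPU instance, the call would look like `benchmark(lambda: mod_jit(in_t), sync=torch.cuda.synchronize)`, optionally inside a `torch.no_grad()` block.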

Accelerated version

To accelerate inference, add the following lines to the baseline script:

import deepytorch_inference
mod_jit = deepytorch_inference.compile(mod_jit)

The updated code is as follows:

import time
import deepytorch_inference  # Import the deepytorch_inference package
import torch
import torchvision.models as models
mod = models.resnet50(pretrained=True).eval()
mod_jit = torch.jit.script(mod)
mod_jit = mod_jit.cuda()
mod_jit = deepytorch_inference.compile(mod_jit)  # Compile the model

in_t = torch.randn([1, 3, 224, 224]).float().cuda()

# Warming up
for _ in range(10):
    mod_jit(in_t)

inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")

The output below shows an average inference latency of about 0.396 ms, down from the 3.686 ms baseline (about 9.3 times faster).

/workspace/miniconda/envs/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return forward_call(*input, **kwargs)
use 0.39614391326904297 ms each inference
2524.335138076059 step/s

Inference on Bert-Base

The following example performs inference on a Bert-Base model by using the torch.jit.trace interface. Deepytorch Inference reduces the inference latency from 4.955 ms to 0.418 ms.

Run the following command to install the transformers package.
```
pip install transformers
```

Run the baseline and accelerated demos and compare the outputs.

Baseline

The baseline code is as follows:

from transformers import BertModel, BertTokenizer, BertConfig
import torch
import time

enc = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)

# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ]

# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens]).cuda()
segments_tensors = torch.tensor([segments_ids]).cuda()
dummy_input = [tokens_tensor, segments_tensors]

# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
    num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)

# Instantiating the model
model = BertModel(config)

# The model needs to be in evaluation mode
model.eval()

# If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)

model = model.eval().cuda()

# Creating the trace
traced_model = torch.jit.trace(model, dummy_input)

# Warming up
for _ in range(10):
    all_encoder_layers, pooled_output = traced_model(*dummy_input)

inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    traced_model(*dummy_input)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")

The baseline result shows an average inference latency of about 4.955 ms.

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
use 4.95526909828186 ms each inference
201.8053873899058 step/s

Accelerated version

To accelerate inference, add the following lines to the baseline script:

import deepytorch_inference
traced_model = deepytorch_inference.compile(traced_model)

The updated code is as follows:

from transformers import BertModel, BertTokenizer, BertConfig
import torch
import deepytorch_inference  # Import the deepytorch-inference package
import time

enc = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenizing input text
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = enc.tokenize(text)

# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
indexed_tokens = enc.convert_tokens_to_ids(tokenized_text)
segments_ids = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ]

# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens]).cuda()
segments_tensors = torch.tensor([segments_ids]).cuda()
dummy_input = [tokens_tensor, segments_tensors]

# Initializing the model with the torchscript flag
# Flag set to True even though it is not necessary as this model does not have an LM Head.
config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
   num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, torchscript=True)

# Instantiating the model
model = BertModel(config)

# The model needs to be in evaluation mode
model.eval()

# If you are instantiating the model with `from_pretrained` you can also easily set the TorchScript flag
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)

model = model.eval().cuda()

# Creating the trace
traced_model = torch.jit.trace(model, dummy_input)
traced_model = deepytorch_inference.compile(traced_model)  # Compile the model

# Warming up
for _ in range(10):
    all_encoder_layers, pooled_output = traced_model(*dummy_input)

inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    traced_model(*dummy_input)
end = time.time()
print(f"use {(end-start)/inference_count*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")

The output below shows an average inference latency of about 0.418 ms, down from the 4.955 ms baseline (about 11.9 times faster).

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
use 0.4180655479431523 ms each inference
2391.9694050849334 step/s
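
The speedup for each model follows directly from the latencies printed in the outputs above. The short calculation below uses only those reported values; the helper name `speedup` is an illustrative assumption.

```python
def speedup(baseline_ms: float, accelerated_ms: float) -> float:
    """Ratio of baseline latency to accelerated latency."""
    return baseline_ms / accelerated_ms


# (baseline ms, accelerated ms) as printed by the scripts in this document
results = {
    "ResNet50": (3.6863913536071777, 0.39614391326904297),
    "Bert-Base": (4.95526909828186, 0.4180655479431523),
}

for name, (baseline_ms, accelerated_ms) in results.items():
    print(f"{name}: {speedup(baseline_ms, accelerated_ms):.1f}x faster")
```

These ratios apply to the A10 test setup described earlier; other models and instance types can behave differently.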

Dynamic-shape inference on ResNet50

Deepytorch Inference automatically handles dynamic shapes, eliminating the need to manage different input shapes. The following example demonstrates this process on a ResNet50 model using three different input shapes.

import time
import torch
import deepytorch_inference  # Import the deepytorch-inference package
import torchvision.models as models
mod = models.resnet50(pretrained=True).eval()
mod_jit = torch.jit.script(mod)
mod_jit = mod_jit.cuda()
mod_jit = deepytorch_inference.compile(mod_jit)  # Compile the model

in_t = torch.randn([1, 3, 224, 224]).float().cuda()
in_2t = torch.randn([1, 3, 448, 448]).float().cuda()
in_3t = torch.randn([16, 3, 640, 640]).float().cuda()

# Warming up
for _ in range(10):
    mod_jit(in_t)
    mod_jit(in_3t)

inference_count = 1000
# inference test
start = time.time()
for _ in range(inference_count):
    mod_jit(in_t)
    mod_jit(in_2t)
    mod_jit(in_3t)
end = time.time()
print(f"use {(end-start)/(inference_count*3)*1000} ms each inference")
print(f"{inference_count/(end-start)} step/s")

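The output shows an average latency of about 9.85 ms per inference across the three input shapes. In this script, one step is a pass over all three inputs, so about 33.85 steps/s corresponds to about 101.6 inferences per second.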

/workspace/miniconda/envs/3.8/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return forward_call(*input, **kwargs)
use 9.846995433171589 ms each inference
33.85127327371685 step/s

Note

To reduce model compilation time, warm up the model with tensors representing the smallest and largest expected input shapes. This practice prevents repeated compilation during execution. For example, if input shapes will vary between 1x3x224x224 and 16x3x640x640, you should warm up the model with tensors of both sizes.
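
The warm-up practice described in this note can be sketched as a small helper. The function name `warm_up` and its parameters are assumptions made for illustration. The helper takes the model and an input factory as arguments, so the demonstration below runs with stand-ins and without a GPU.

```python
from typing import Callable, Iterable, Sequence


def warm_up(model: Callable[[object], object],
            shapes: Iterable[Sequence[int]],
            make_input: Callable[[Sequence[int]], object],
            rounds: int = 10) -> None:
    """Call `model` on one input per shape, `rounds` times each."""
    inputs = [make_input(shape) for shape in shapes]
    for _ in range(rounds):
        for tensor in inputs:
            model(tensor)


# Stand-ins so the sketch runs anywhere: the "model" records what it receives,
# and the "input" is just the shape tuple itself.
seen = []
warm_up(model=seen.append,
        shapes=[(1, 3, 224, 224), (16, 3, 640, 640)],
        make_input=lambda shape: tuple(shape),
        rounds=2)
print(seen)
```

With the dynamic-shape example above, the call would pass `model=mod_jit` and `make_input=lambda shape: torch.randn(shape).float().cuda()`, using the smallest and largest shapes expected in production.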