Optimizing inference performance with TensorRT - Container Service for Kubernetes (ACK) - Alibaba Cloud Help Center

To optimize a model with TensorRT, you compile a model trained using a framework like PyTorch or TensorFlow into the TensorRT format, then execute it with the TensorRT inference engine. This process accelerates model execution on NVIDIA GPUs, making it ideal for applications that require high real-time performance. This article explains the model training and compilation process and offers best practices for optimizing model inference performance with TensorRT.

Before you begin

To better understand how to optimize model inference performance with TensorRT:

- Review the Introduction to TensorRT and the TensorRT Cookbook source code to understand the basics of TensorRT's architecture and usage.
- Verify that your CUDA version is compatible with TensorRT. As TensorRT is designed for NVIDIA GPUs, it requires NVIDIA hardware. For more information, see the official TensorRT documentation.
- This article uses Nsight Systems to profile system-wide activity, including kernel read/write operations, scheduling between kernels, SM occupancy, and asynchronous execution between the CPU and GPU.

Actual performance gains depend on the model's type and size, and the specific graphics card.

Model compilation example

This example demonstrates a simple training workflow using an existing ResNet-18 model. You can follow this example to learn the concepts and techniques for model performance analysis and optimization.
- TensorRT version: v8.6.1. For other versions, see the TensorRT Download page.
- PyTorch version: 2.2.0.
- GPU: V100-SXM2-32GB.
- Container image: This tutorial uses the official NVIDIA PyTorch image. Pull it with: docker pull nvcr.io/nvidia/pytorch:24.01-py3. Note: When starting the Docker container, mount shared memory (docker run --shm-size=) and share the host machine's IPC namespace (--ipc=host).

Train the model and generate an ONNX model file.

The following code pulls a pre-trained ResNet-18 model, performs simple training, and saves the model in ONNX format.


import torch
import torch.nn
import torch.optim
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T
'''
Defines a sequence of transformations with three steps:
  (1) T.Resize(224): Resizes the shorter edge of the input image to 224 pixels. CIFAR-10 images are square (32x32), so the result is 224x224.
  (2) T.ToTensor(): Converts the image data to a PyTorch tensor and linearly normalizes pixel values from the [0, 255] range to [0, 1].
  (3) T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)): Applies standardization to the image tensor using the specified mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). This centers and scales the data.
'''
transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
# Loads the CIFAR-10 training dataset, downloads it to the local ./data directory, and applies the image transformations defined above.
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
# Creates a DataLoader to load the training data in batches. Each batch has a size of 32, and the data is shuffled at the start of each epoch.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
# Defines the device variable, specifying the first CUDA-compatible GPU for training.
device = torch.device("cuda:0")
# Loads a pre-trained ResNet-18 model and moves its parameters to the specified GPU.
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
# Creates a cross-entropy loss function object for the classification task and moves it to the GPU.
criterion = torch.nn.CrossEntropyLoss().cuda(device)
# Creates a Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.001 and momentum of 0.9 to optimize the model's parameters.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Sets the model to training mode. This enables training-specific behavior in layers such as batch normalization (and dropout, where present).
model.train()
# Defines a training function `train` that takes a batch of data as an argument.
def train(data):
    # Extracts inputs and labels from the data and moves them to the GPU.
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    # Performs a forward propagation pass through the model to get the outputs.
    outputs = model(inputs)
    # Uses the loss function to calculate the loss between the model's outputs and the true labels.
    loss = criterion(outputs, labels)
    # Clears the gradients of all parameters before calculating new ones.
    optimizer.zero_grad()
    # Performs backpropagation to compute the gradients of the loss with respect to the model parameters.
    loss.backward()
    # Updates the model's parameters based on the computed gradients.
    optimizer.step()
# Calls the train function for each batch to train the model.
for step, batch_data in enumerate(train_loader):
    train(batch_data)
# Exports the trained PyTorch model to the ONNX (Open Neural Network Exchange) format. This format allows model exchange between deep learning frameworks.
# Creates a randomly generated tensor `dummy_input` with the shape (1, 3, 224, 224). This represents a single image (batch size of 1) with 3 color channels (C) and 224x224 pixels.
# This tensor is used as a sample input during model export, helping ONNX determine the input shape and layout. The tensor is then moved to the previously defined device (in this case, the GPU).
dummy_input = torch.randn(1, 3, 224, 224).to(device)
# Defines `input_names` for the input tensor in the exported ONNX model.
input_names = [ "input0" ]
# Defines `output_names` for the output tensor in the ONNX model.
output_names = [ "output0" ]
'''
This calls PyTorch's `torch.onnx.export` function to export the PyTorch model to the ONNX format. The parameters are as follows:
  (1) model: The PyTorch model to export.
  (2) dummy_input: Sample input data for the model.
  (3) 'resnet18.ONNX': The filename for the exported ONNX model.
  (4) verbose=True: Prints detailed conversion logs.
  (5) input_names=input_names: Specifies the names for the inputs in the ONNX model.
  (6) output_names=output_names: Specifies the names for the outputs in the ONNX model.
  (7) dynamic_axes={'input0': {0: "nBatchSize"}}: Specifies that the 0-th dimension of the model's input tensor 'input0' is a dynamic axis. This allows the model to process variable batch sizes. "nBatchSize" is the name given to this dynamic axis.
'''
torch.onnx.export(model, dummy_input, 'resnet18.ONNX', verbose=True, input_names=input_names, output_names=output_names,dynamic_axes={'input0': {0: "nBatchSize"}})

Save the TensorRT compilation script.

The following script compiles the model. Save it as 0_build.py.


import argparse  # Import the library for parsing command-line arguments.
import os  # Import the library for file and directory operations.
import tensorrt as trt
# Builds a TensorRT engine.
def build(logger, ONNX_file, min_shape, optim_shape, max_shape, num_aux_stream, share_profile, fp16):
    errors = []
    builder = trt.Builder(logger)
    # Create a network definition and explicitly specify the batch dimension.
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    profile = builder.create_optimization_profile()
    config = builder.create_builder_config()
    # Enable profile sharing if specified.
    if share_profile:
        print("enable share profile")
        config.set_preview_feature(trt.PreviewFeature.PROFILE_SHARING_0806, True)
    # If the number of auxiliary streams is set (for stream-parallel execution), configure it.
    if num_aux_stream > 0:
        print("set aux stream " + str(num_aux_stream))
        config.max_aux_streams = num_aux_stream
    # If FP16 mode is enabled, set the FP16 flag in the builder configuration.
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)
    # Create an ONNX parser and associate it with the network definition and logger.
    parser = trt.OnnxParser(network, logger)
    if not os.path.exists(ONNX_file):
        errors.append("Failed to find onnx File!")
        return None, errors
    # Open the ONNX file and parse its contents using the parser.
    with open(ONNX_file, "rb") as model:
        if not parser.parse(model.read()):
            errors.append("failed to parse .onnx file: ")
            for error in range(parser.num_errors):
                errors.append(parser.get_error(error))
            return None, errors
    # Get the network's input tensor and set its shapes in the optimization profile.
    inputTensor = network.get_input(0)
    profile.set_shape(inputTensor.name, min_shape, optim_shape, max_shape)
    config.add_optimization_profile(profile)  # Add the optimization profile to the builder configuration.
    # Set the profiling verbosity level to detailed.
    config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
    # Build and serialize the TensorRT engine.
    engine_string = builder.build_serialized_network(network, config)
    if engine_string is None:
        errors.append("Failed to build engine")
        return None, errors
    return engine_string, errors
# Saves the engine string to a file.
def save_engine(engine_string, planFile):
    with open(planFile, "wb") as f:
        f.write(engine_string)
    return 0
def main():
    parser = argparse.ArgumentParser(description='ResNet18 TensorRT Builder')
    parser.add_argument('--aux-stream', type=int, default=0, metavar='N',
                        help='specify the aux stream (default: 0)')
    parser.add_argument('--share-profile', action='store_true', default=False,
                        help='enable share profile')
    parser.add_argument('--fp16', action='store_true', default=False,
                        help='enable fp16 mode')
    parser.add_argument('--output', type=str, default='resnet18.plan', metavar='N',
                        help='specify the plan file')
    parser.add_argument('--ONNX-file', type=str, default='resnet18.ONNX', metavar='N',
                        help='specify the onnx file')
    args = parser.parse_args()
    logger = trt.Logger(trt.Logger.ERROR)
    # Call the build function to build the TensorRT engine.
    # Specify the minimum profile shape: [1, 3, 224, 224]
    # Specify the optimal profile shape: [128, 3, 224, 224]
    # Specify the maximum profile shape: [256, 3, 224, 224]
    engine_string, errors = build(logger, args.ONNX_file, [1, 3, 224, 224], [128, 3, 224, 224], [256, 3, 224, 224], args.aux_stream, args.share_profile, args.fp16)
    if len(errors) != 0:
        print(errors)
        return 1
    save_engine(engine_string, args.output)
    return 0
if __name__ == "__main__":
    raise SystemExit(main())  # Use main()'s return value as the process exit code.

Important

During ONNX model export, only the batch size was specified as dynamic. The image channel count (3 in this example), width, and height (224x224) are fixed. Therefore, when specifying the minimum [1, 3, 224, 224], optimal [128, 3, 224, 224], and maximum [256, 3, 224, 224] shapes for the profile, only the batch size value changes.
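
As a rough illustration of this constraint, the sketch below checks whether a runtime input shape falls inside the optimization profile. It is plain Python with no TensorRT dependency. The helper name fits_profile is hypothetical, and the shapes are the ones passed to build in 0_build.py.

```python
# Hypothetical helper (not part of TensorRT): checks a shape against a profile range.
MIN_SHAPE = (1, 3, 224, 224)
OPT_SHAPE = (128, 3, 224, 224)
MAX_SHAPE = (256, 3, 224, 224)


def fits_profile(shape, min_shape=MIN_SHAPE, max_shape=MAX_SHAPE):
    """Return True if every dimension lies within [min, max] of the profile."""
    if len(shape) != len(min_shape):
        return False
    return all(lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape))


if __name__ == "__main__":
    for candidate in [(128, 3, 224, 224), (512, 3, 224, 224), (8, 3, 256, 256)]:
        print(candidate, "fits" if fits_profile(candidate) else "outside profile")
```

With this profile, a batch of 512 images would have to be split into chunks of at most 256, or the engine rebuilt with a larger maximum shape. A different image size would require re-exporting the ONNX model with additional dynamic axes.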

Save the baseline model script.

Save the following baseline model script as 1_baseline.py.


import nvtx  # Import the NVIDIA Tools Extension library (NVTX) for GPU performance analysis.
import numpy as np
import tensorrt as trt
from cuda import cudart  # From the cuda module, import cudart, which is the CUDA Runtime API.
np.random.seed(10088)  # Set NumPy's random seed to ensure reproducible results.
# Performs the softmax operation.
def softmax(x, axis=1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))   # For numerical stability, subtract the maximum value of each sample.
    return e_x / np.sum(e_x, axis=axis, keepdims=True)  # Calculate and return the softmax result.
# Generates random input data.
def data_generation(shape, batches):
    data = []
    for i in range(batches):
        # Generate random data with the given shape and convert it to float32 type.
        data.append(np.random.randn(*shape).astype(np.float32))
    return data
# Loads a TensorRT engine from a file.
def load_engine(logger, plan_file):
    with open(plan_file, "rb") as plan:
        # Use the TensorRT Runtime to deserialize the engine.
        engine = trt.Runtime(logger).deserialize_cuda_engine(plan.read())
    return engine
# Gets information about the engine's input and output tensors.
def get_io_tensors(engine):
    num_io_tensors = engine.num_io_tensors
    # Get the names of all input and output tensors.
    io_tensor_names = [engine.get_tensor_name(i) for i in range(num_io_tensors)]
    # Calculate the number of input tensors.
    num_input_io_tensors = [engine.get_tensor_mode(io_tensor_names[i]) for i in range(num_io_tensors)].count(trt.TensorIOMode.INPUT)
    return num_io_tensors, io_tensor_names, num_input_io_tensors
# Performs the inference operation.
def infer(engine, data):
    context = engine.create_execution_context()
    tet = None
    for i in range(len(data)):
        if i == 7:  # Start profiling after the warm-up batches.
            # Start recording an NVTX range for profiling total elapsed time.
            tet = nvtx.start_range(message="Total Elapsed Time(3 batchs)", color="orange")
        nvtx.push_range(message="infer", color="purple")  # Start an NVTX range for the inference operation.
        infer_once(engine, context, data[i])  # Call the function to execute a single inference.
        nvtx.pop_range()  # End the NVTX range.
    nvtx.end_range(tet)
# Executes a single inference operation.
def infer_once(engine, context, data):
    # Get the engine's input and output tensor information.
    num_io_tensors, io_tensor_names, num_input_io_tensors = get_io_tensors(engine)
    # Set the shape of the input tensor.
    context.set_input_shape(io_tensor_names[0], data.shape)
    bufferH, bufferD = [], []  # Initialize host and device buffers.
    bufferH.append(data)
    # Allocate space for the output tensors and add them to the host buffer list.
    for i in range(num_input_io_tensors, num_io_tensors):
        bufferH.append(np.empty(context.get_tensor_shape(io_tensor_names[i]), dtype=trt.nptype(engine.get_tensor_dtype(io_tensor_names[i]))))
    # Allocate memory on the device for each I/O tensor.
    for i in range(num_io_tensors):
        bufferD.append(cudart.cudaMalloc(bufferH[i].nbytes)[1])
    # Copy the input data to the device buffers.
    for i in range(num_input_io_tensors):
        cudart.cudaMemcpy(bufferD[i], bufferH[i].ctypes.data, bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)
    # Set the tensor addresses for the execution context.
    for i in range(num_io_tensors):
        context.set_tensor_address(io_tensor_names[i], int(bufferD[i]))
    context.execute_async_v3(0)  # Perform asynchronous execution of the inference.
    # Copy the inference results from the device buffers back to the host buffers.
    for i in range(num_input_io_tensors, num_io_tensors):
        cudart.cudaMemcpy(bufferH[i].ctypes.data, bufferD[i], bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost)
    cudart.cudaStreamSynchronize(0)  # Synchronize the CUDA stream to ensure all operations are complete.
    # Process the output with softmax and print the results.
    for i in range(num_input_io_tensors, num_io_tensors):
        softmax_scores = softmax(bufferH[i])
        # Use the np.argmax function to get the class with the highest probability.
        predicted_classes = np.argmax(softmax_scores, axis=1)
        max_probs_np = np.max(softmax_scores, axis=1)
        print("Output Tensor Name: ", io_tensor_names[i])
        print("Maximum probability for each image in the batch:\n", max_probs_np)
        print("Index of predicted class for each image in the batch:\n", predicted_classes)
    # Free the memory in the device buffers.
    for b in bufferD:
        cudart.cudaFree(b)
    print("Succeeded running model in TensorRT!")
# Generates data, loads the engine, and performs inference.
def main():
    data = data_generation([128, 3, 224, 224], 10)
    logger = trt.Logger(trt.Logger.ERROR)
    engine = load_engine(logger, "resnet18.plan")
    infer(engine, data)
if __name__ == "__main__":
    main()

Note

The script uses data_generation to generate 10 batches, each containing 128 images of size 224x224.
Only the last 3 batches are measured; the preceding batches serve as a warm-up.
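
The warm-up pattern can be sketched independently of TensorRT. The following example is a simplified stand-in that uses host wall-clock time rather than the NVTX ranges used in 1_baseline.py. The function name time_after_warmup and the fake workload are assumptions made for illustration.

```python
import time


def time_after_warmup(run_batch, batches, warmup=7):
    """Run every batch, but accumulate wall-clock time only after `warmup` batches."""
    measured = 0.0
    measured_count = 0
    for index, batch in enumerate(batches):
        start = time.perf_counter()
        run_batch(batch)
        elapsed = time.perf_counter() - start
        if index >= warmup:  # The first `warmup` batches are discarded.
            measured += elapsed
            measured_count += 1
    return measured, measured_count


if __name__ == "__main__":
    # Stand-in workload (an assumption for illustration): sum a list of numbers.
    fake_batches = [list(range(10_000)) for _ in range(10)]
    total, count = time_after_warmup(sum, fake_batches)
    print(f"measured {count} batches in {total * 1e3:.3f} ms")
```

Discarding the first batches keeps one-time costs, such as lazy initialization and cold caches, out of the measurement.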

Run inference optimization and view the process.

After completing the setup, run the following shell commands.
```
python 0_build.py 
mkdir -pv reports
nsys profile -w true \
	-t cuda,nvtx,osrt,cudnn,cublas \
	--cuda-memory-usage=true \
	--cudabacktrace=all \
	--cuda-graph-trace=node \
	--gpu-metrics-device=all \
	-f true \
	-o reports/1_baseline \
	python 1_baseline.py
```
When the run completes, it generates a 1_baseline.nsys-rep file in the ./reports directory. Import this file into Nsight Systems to view the timeline, as shown below.

The timeline shows the following:
- The total time for the last three batches is approximately 133.577 ms.
- Between batches, the GPU is briefly idle (labeled 4 in the figure), caused by batch data transfer and the printing of results on the host.

Model optimization

Approach 1: Reuse allocated GPU memory

Problem analysis.

In the baseline code, GPU memory is allocated for each batch and then deallocated after the batch is processed. Frequent memory allocation and deallocation is time-consuming and can create a performance bottleneck.
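
To get a sense of how much device memory the baseline allocates and frees for every batch, the buffer sizes can be estimated from the tensor shapes. This is a back-of-the-envelope sketch. The input shape comes from 1_baseline.py, while the output shape of (128, 1000) is an assumption based on ResNet-18 keeping its 1000-class ImageNet head.

```python
from math import prod

FLOAT32_BYTES = 4


def tensor_bytes(shape, itemsize=FLOAT32_BYTES):
    """Size in bytes of a dense tensor with the given shape."""
    return prod(shape) * itemsize


# Input shape comes from 1_baseline.py. The output shape is an assumption:
# ResNet-18 with its ImageNet head emits 1000 logits per image.
input_bytes = tensor_bytes((128, 3, 224, 224))
output_bytes = tensor_bytes((128, 1000))

if __name__ == "__main__":
    print(f"input buffer:  {input_bytes / 1e6:.2f} MB")
    print(f"output buffer: {output_bytes / 1e6:.2f} MB")
```

Under these assumptions, each call to infer_once requests and releases roughly 77 MB of device memory for the input alone.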

Solution design.

Reusing allocated GPU memory can reduce batch processing time. Modify the baseline code to allocate GPU memory when processing the first batch, and then reuse it for subsequent batches.

The complete code is shown below. Note: Only the infer and infer_once functions have been modified, and result printing has been moved into a new print_result function; the rest of the code is identical to the baseline. Save the code as 2_reuse-buffers.py.

Full code

import nvtx  # Import the NVIDIA Tools Extension (nvtx) library for GPU performance profiling
import numpy as np  # Import the NumPy library for array and matrix operations
import tensorrt as trt  # Import the TensorRT library
from cuda import cudart  # Import cudart from the cuda module for the CUDA Runtime API
np.random.seed(10088)  # Set the NumPy random seed for reproducible results
# Implements the softmax function.
def softmax(x, axis=1):
    # Subtract the max value from each sample for numerical stability
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)  # Calculate and return the softmax result
# Generates random input data for a specified number of batches
def data_generation(shape, batches):
    data = []  # Initialize the data list
    for i in range(batches):  # Loop to generate the specified number of batches
        # Generate random data with the given shape and cast it to float32
        data.append(np.random.randn(*shape).astype(np.float32))
    return data  # Return the list of generated data
# Loads a pre-built TensorRT inference engine from a file
def load_engine(logger, plan_file):
    with open(plan_file, "rb") as plan:  # Open the engine file in binary read mode
        # Deserializes the engine using the TensorRT Runtime.
        engine = trt.Runtime(logger).deserialize_cuda_engine(plan.read())
    return engine  # Return the loaded engine
# Gets the engine's input and output tensor information.
def get_io_tensors(engine):
    num_io_tensors = engine.num_io_tensors  # Get the number of I/O tensors in the engine
    # Get the names of all I/O tensors
    io_tensor_names = [engine.get_tensor_name(i) for i in range(num_io_tensors)]
    # Count the number of input tensors
    num_input_io_tensors = [engine.get_tensor_mode(io_tensor_names[i]) for i in range(num_io_tensors)].count(trt.TensorIOMode.INPUT)
    return num_io_tensors, io_tensor_names, num_input_io_tensors  # Return tensor information
# Performs inference on a series of batches, reusing GPU memory
def infer(engine, data):
    context = engine.create_execution_context()  # Create a TensorRT execution context
    # Gets the engine's I/O tensor information.
    num_io_tensors, io_tensor_names, num_input_io_tensors = get_io_tensors(engine)
    # Set the shape of the first input tensor based on the shape of the first batch
    context.set_input_shape(io_tensor_names[0], data[0].shape)
    bufferH, bufferD = [], []  # Initialize host and device buffer lists
    bufferH.append(data[0])  # Add the first batch of data to the host buffer list
    # Allocate space on the host for the engine's output tensors
    for i in range(num_input_io_tensors, num_io_tensors):
        bufferH.append(np.empty(context.get_tensor_shape(io_tensor_names[i]), dtype=trt.nptype(engine.get_tensor_dtype(io_tensor_names[i]))))
    # Allocate memory on the device for all tensors
    for i in range(num_io_tensors):
        bufferD.append(cudart.cudaMalloc(bufferH[i].nbytes)[1])
    # Set the address for each tensor in the execution context
    for i in range(num_io_tensors):
        context.set_tensor_address(io_tensor_names[i], int(bufferD[i]))
    tet = None  # Initialize a variable to track the NVTX range
    for i in range(len(data)):  # Iterate through all batches of data
        if i == 7:  # Index 7 is the eighth batch; the first seven batches serve as warm-up
            # Start an NVTX range to measure the total time for the next 3 batches
            tet = nvtx.start_range(message="Total Elapsed Time(3 batchs)", color="orange")
        nvtx.push_range(message="infer", color="purple")  # Start an NVTX range to measure a single inference operation
        infer_once(engine, context, bufferH, bufferD, data[i])  # Call infer_once to perform a single inference pass
        nvtx.pop_range()  # End the NVTX range
    nvtx.end_range(tet)  # End the NVTX range measuring the total time for the 3 batches
    for b in bufferD:  # Iterate through the device buffer list
        cudart.cudaFree(b)  # Free the device memory
# Performs a single inference pass for one batch
def infer_once(engine, context, bufferH, bufferD, data):
    # Gets the engine's I/O tensor information.
    num_io_tensors, io_tensor_names, num_input_io_tensors = get_io_tensors(engine)
    bufferH[0] = data  # Update the host input buffer with the current batch's data
    # Copy the input data from host memory to device memory
    for i in range(num_input_io_tensors):
        cudart.cudaMemcpy(bufferD[i], bufferH[i].ctypes.data, bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice)
    context.execute_async_v3(0)  # Asynchronously execute inference on the default CUDA stream
    # Copy the output data from device memory back to host memory
    for i in range(num_input_io_tensors, num_io_tensors):
        cudart.cudaMemcpy(bufferH[i].ctypes.data, bufferD[i], bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost)
    cudart.cudaStreamSynchronize(0)  # Synchronize the CUDA stream to ensure all previous operations are complete
    nvtx.push_range(message="Print Result", color="green")  # Start an NVTX range to measure result printing time
    print_result(io_tensor_names, num_input_io_tensors, num_io_tensors, bufferH)  # Call print_result to output the results
    nvtx.pop_range()  # End the NVTX range
    print("Succeeded running model in TensorRT!")  # Print a success message
# Prints the inference results
def print_result(io_tensor_names, num_input_io_tensors, num_io_tensors, bufferH):
    for i in range(num_input_io_tensors, num_io_tensors):  # Iterate through the output tensors
        softmax_scores = softmax(bufferH[i])  # Apply the softmax function to the output data
        predicted_classes = np.argmax(softmax_scores, axis=1)  # Get the predicted class index for each sample
        max_probs_np = np.max(softmax_scores, axis=1)  # Get the highest probability for each sample
        print("Output Tensor Name: ", io_tensor_names[i])  # Print the name of the output tensor
        print("Maximum probability for each image in the batch:\n", max_probs_np)  # Print the highest probability for each sample
        print("Index of predicted class for each image in the batch:\n", predicted_classes)  # Print the predicted class index for each sample
# The main function, serving as the program's entry point
def main():
    # Generate 10 batches of data, each with 128 samples of size 3x224x224
    data = data_generation([128, 3, 224, 224], 10)
    logger = trt.Logger(trt.Logger.ERROR)  # Create a TensorRT logger that only records errors
    engine = load_engine(logger, "resnet18.plan")  # Load the serialized TensorRT engine
    infer(engine, data)  # Call the infer function to perform inference
if __name__ == "__main__":
    main()

In the code above, the print_result function handles result printing.

Run the following shell command to profile the script.

nsys profile -w true \
	-t cuda,nvtx,osrt,cudnn,cublas \
	--cuda-memory-usage=true \
	--cudabacktrace=all \
	--cuda-graph-trace=node \
	--gpu-metrics-device=all \
	-f true \
	-o reports/2_reuse-buffers \
	python 2_reuse-buffers.py

Running the command creates the file 2_reuse-buffers.nsys-rep in the ./reports directory. Import this file into Nsight Systems. As shown in the timeline below, the total time for the last three batches is approximately 128.196 ms, which is a 5.381 ms reduction from the baseline (133.577 ms - 128.196 ms).

Approach 2: Use pinned memory

Problem analysis

Building on Approach 1, we continue to look for optimization opportunities. The timeline from the previous step shows that during the host-to-device data transfer, which takes approximately 13 ms, the GPU is idle. Shortening this data transfer time is therefore key to reducing the overall batch processing time.

Solution design

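To accelerate data transfers, we will use pinned (page-locked) host memory. When generating random data, store it in pinned memory using the data_generation_with_pin_memory function. This requires some modifications to the main function, but the rest of the code remains largely unchanged. The listing below shows the new data_generation_with_pin_memory function and the updated main function; softmax, load_engine, get_io_tensors, infer, infer_once, and print_result are carried over from 2_reuse-buffers.py. Save the combined code as 3_use-pin-memory.py.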

Modified code

import os  # Import the library for operating system interfaces
import nvtx  # Import the NVIDIA Tools Extension (NVTX) library for GPU performance profiling
import ctypes  # Import the ctypes module to call C library functions
import numpy as np  # Import the NumPy library for numerical computation
import tensorrt as trt  # Import the TensorRT library
from cuda import cudart  # Import the CUDA Runtime API library
np.random.seed(10088)  # Set the NumPy random seed for reproducible results
# Generates random data in pinned memory on the host to accelerate host-to-device data transfers.
def data_generation_with_pin_memory(shape, batches):
    data = []  # Create a data list
    pbuffers = []  # Create a list for pinned buffer pointers
    for i in range(batches):  # For each batch
        d = np.random.randn(*shape).astype(np.float32)  # Generate a batch of random data
        nElement = d.size  # Get the number of elements in the array
        nByteSize = d.nbytes  # Get the byte size of the data
        _, pBuffer = cudart.cudaHostAlloc(nByteSize, cudart.cudaHostAllocDefault)  # Allocate pinned memory using the CUDA API
        pBufferCtype = ctypes.cast(pBuffer, ctypes.POINTER(ctypes.c_float * nElement))  # Create a ctypes pointer for interoperability with the CUDA API
        nd = np.ndarray(shape=d.shape, dtype=d.dtype, buffer=pBufferCtype.contents)  # Create a NumPy array as a view over the pinned memory
        nd[:] = d  # Copy the generated data into the pinned memory
        data.append(nd)  # Add the pinned memory array to the data list
        pbuffers.append(pBuffer)  # Add the pinned memory pointer to the list
    return data, pbuffers  # Return the data list and the list of pinned buffer pointers
# Define the main function, which serves as the program's entry point
def main():
    data, pBuffers = data_generation_with_pin_memory([128, 3, 224, 224], 10)  # Generate 10 batches of data using pinned memory
    logger = trt.Logger(trt.Logger.ERROR)  # Create a TensorRT logger object
    engine = load_engine(logger, "resnet18.plan")  # Load the TensorRT inference engine
    infer(engine, data)  # Perform inference
    for p in pBuffers:  # Iterate through and free all pinned memory
        cudart.cudaFreeHost(p)
if __name__ == "__main__":
    main()

Run the following shell command.

nsys profile -w true \
	-t cuda,nvtx,osrt,cudnn,cublas \
	--cuda-memory-usage=true \
	--cudabacktrace=all \
	--cuda-graph-trace=node \
	--gpu-metrics-device=all \
	-f true \
	-o reports/3_use-pin-memory \
	python 3_use-pin-memory.py

After the command finishes running, it generates a 3_use-pin-memory.nsys-rep file in the ./reports directory. Import this file into Nsight Systems. The timeline is shown below. The figure shows the following:
- The total time for the last three batches is now approximately 108.348 ms, a reduction of 19.848 ms (down from 128.196 ms).
- The batch transfer time is reduced from 13.429 ms to 6.912 ms.
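
The transfer times above can be converted into an effective host-to-device bandwidth. The sketch below does the arithmetic for one batch of float32 data with the shape used in this article. The function name effective_bandwidth_gb_s is hypothetical, and the result only reflects the two timings reported here, not the peak capability of the hardware.

```python
from math import prod

BATCH_SHAPE = (128, 3, 224, 224)  # Shape passed to data_generation_with_pin_memory
BYTES_PER_FLOAT32 = 4


def effective_bandwidth_gb_s(num_bytes, milliseconds):
    """Bytes moved divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (milliseconds / 1e3) / 1e9


batch_bytes = prod(BATCH_SHAPE) * BYTES_PER_FLOAT32
pageable = effective_bandwidth_gb_s(batch_bytes, 13.429)  # Before pinning
pinned = effective_bandwidth_gb_s(batch_bytes, 6.912)     # After pinning

if __name__ == "__main__":
    print(f"batch size: {batch_bytes / 1e6:.2f} MB")
    print(f"pageable:   {pageable:.2f} GB/s")
    print(f"pinned:     {pinned:.2f} GB/s")
```

Each batch is about 77 MB, so the measured times correspond to roughly 5.7 GB/s from pageable memory and roughly 11.2 GB/s from pinned memory.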

Approach 3: Use FP16 (or INT8) precision

Problem analysis

The timeline from Approach 2 shows a computation time of approximately 27.230 ms per batch and GPU memory consumption of 4.7 GB. Computation now accounts for most of the per-batch time.

Solution design

You can reduce batch computation time by enabling a lower precision, such as FP16 or INT8, when you compile the model. To enable FP16, set the following flag on the builder configuration (the config object in 0_build.py).
```
config.set_flag(trt.BuilderFlag.FP16)
```
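
Lower precision trades accuracy for speed. As a small illustration of what FP16 storage does to individual values, the sketch below rounds Python floats to half precision using only the standard library. It is not a measure of model accuracy: TensorRT selects precision per layer and may keep some layers in FP32. The helper name to_fp16 is hypothetical.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


if __name__ == "__main__":
    for x in (0.1, 1.0, 1000.1):
        h = to_fp16(x)
        print(f"{x!r} -> {h!r} (abs error {abs(h - x):.3e})")
    try:
        to_fp16(70000.0)  # Larger than the largest finite FP16 value, 65504
    except OverflowError as exc:
        print("70000.0 does not fit in FP16:", exc)
```

Values that are exactly representable, such as 1.0, survive unchanged, while others pick up a small rounding error. This is one reason to compare the outputs of the FP16 engine against the FP32 engine on representative data.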

The 0_build.py script includes an --fp16 option to enable FP16 mode during compilation. Copy 3_use-pin-memory.py to 4_use-fp16.py, then run the following shell commands to rebuild the engine with FP16 enabled and profile the script.

python 0_build.py --fp16  # Enable FP16 mode
nsys profile -w true \
	-t cuda,nvtx,osrt,cudnn,cublas \
	--cuda-memory-usage=true \
	--cudabacktrace=all \
	--cuda-graph-trace=node \
	--gpu-metrics-device=all \
	-f true \
	-o reports/4_use-fp16 \
	python 4_use-fp16.py

After the run completes, a file named 4_use-fp16.nsys-rep is generated in the ./reports directory. Import this file into Nsight Systems to view the timeline. The results are as follows:
- The total processing time for the three batches is reduced from 108.348 ms to 49.309 ms.
- The per-batch computation time is reduced from 27.230 ms to 7.957 ms.
- GPU memory usage is reduced from 4.7 GB to 2.39 GB.
  
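  Important
  In production, validate the accuracy of the reduced-precision model before you deploy it. If you use INT8, you must also run a calibration process so that TensorRT can determine the quantization ranges. For more information, see the official TensorRT documentation.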
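
The measurements reported so far can be summarized with a few lines of arithmetic. The sketch below uses only the totals quoted in this article for the last three batches. The function name summarize is hypothetical.

```python
# Total time for the three measured batches, in milliseconds, as reported above.
STAGES = [
    ("baseline", 133.577),
    ("reuse GPU buffers", 128.196),
    ("pinned memory", 108.348),
    ("FP16", 49.309),
]


def summarize(stages):
    """Return (name, ms, saved vs previous stage, speedup vs first stage) rows."""
    baseline_ms = stages[0][1]
    rows = []
    previous_ms = baseline_ms
    for name, ms in stages:
        rows.append((name, ms, previous_ms - ms, baseline_ms / ms))
        previous_ms = ms
    return rows


if __name__ == "__main__":
    for name, ms, saved, speedup in summarize(STAGES):
        print(f"{name:<18} {ms:8.3f} ms  saved {saved:7.3f} ms  speedup {speedup:4.2f}x")
```

On these numbers, FP16 contributes the largest single saving, and the three approaches together bring the total to roughly 2.7 times faster than the baseline.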

Approach 4: Overlap data transfer and computation

Problem analysis

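With FP16 enabled, the per-batch computation time (about 7.957 ms) is now close to the host-to-device transfer time (about 6.912 ms, as measured in Approach 2), so data transfer accounts for a large share of each batch. Consequently, simply reducing transfer time is no longer a sufficient optimization; the transfer must be hidden behind computation.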

Solution design

To overlap data transfer with computation, use CUDA streams: operations issued to different streams can run concurrently on the GPU.

Modify the code as follows:

- Create two CUDA streams: one to transfer data from the host to the device, and another for computation and returning results.
- Create three CUDA events for synchronization between the CUDA streams, and between the GPU and the host.
- Pre-transfer the data for the first batch. Then, while the GPU computes the first batch, transfer the data for the second batch. This pipelined approach continues for subsequent batches.
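
The benefit of this pipeline can be estimated with an idealized timing model. The sketch below compares a serial schedule with a two-stream schedule, using the per-batch transfer and computation times reported earlier. It ignores device-to-host copies, result printing, and synchronization overhead, so it is a rough upper bound on the gain rather than a prediction of the measured result. The function names are hypothetical.

```python
def serial_time(copy_ms, compute_ms, batches):
    """Each batch is copied, then computed, with no overlap."""
    return batches * (copy_ms + compute_ms)


def pipelined_time(copy_ms, compute_ms, batches):
    """Copy of batch i+1 runs on one stream while batch i computes on another."""
    copy_done = 0.0
    compute_done = 0.0
    for _ in range(batches):
        copy_done += copy_ms                  # Copies run back to back
        start = max(copy_done, compute_done)  # Wait for the data and for the GPU
        compute_done = start + compute_ms
    return compute_done


if __name__ == "__main__":
    copy_ms, compute_ms = 6.912, 7.957  # Per-batch figures measured earlier
    print(f"serial:    {serial_time(copy_ms, compute_ms, 3):.3f} ms")
    print(f"pipelined: {pipelined_time(copy_ms, compute_ms, 3):.3f} ms")
```

In this model only the first transfer is exposed. Every later transfer is hidden behind the computation of the previous batch, as long as computation takes at least as long as the transfer.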

Save the following code as 5_multi-streams.py.

Full code

import os  # Provides operating system-related functions
import nvtx  # NVIDIA Tools Extension for GPU profiling
import time  # Provides time-related functions
import ctypes  # For interoperability with C libraries
import numpy as np  # For high-performance numerical computing
import tensorrt as trt  # For deep learning inference optimization
from cuda import cudart  # Provides Python bindings for the CUDA Runtime API
np.random.seed(10088)  # Set the NumPy random seed to make random operations reproducible
# Computes the softmax of an array.
def softmax(x, axis=1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))  # Subtract max value per sample for numeric stability
    return e_x / np.sum(e_x, axis=axis, keepdims=True)
# Generates input data using pinned memory to optimize data transfers
def data_generation_with_pin_memory(shape, batches):
    data = []  # Create an empty list to store data
    pbuffers = []  # Create an empty list to store pinned-memory pointers
    for i in range(batches):  # Generate the specified number of batches
        d = np.random.randn(*shape).astype(np.float32)  # Generate a random array
        nElement = d.size  # Compute the total number of elements
        nByteSize = d.nbytes  # Compute the size in bytes
        _, pBuffer = cudart.cudaHostAlloc(nByteSize, cudart.cudaHostAllocDefault)  # Allocate pinned memory using the CUDA Runtime API
        pBufferCtype = ctypes.cast(pBuffer, ctypes.POINTER(ctypes.c_float * nElement))  # Convert the pinned-memory pointer to a C type
        # Create a NumPy array that uses pinned memory as its buffer
        nd = np.ndarray(shape=d.shape, dtype=d.dtype, buffer=pBufferCtype.contents)
        nd[:] = d  # Copy the generated data into pinned memory
        data.append(nd)  # Add the pinned-memory array to the data list
        pbuffers.append(pBuffer)  # Add the pinned-memory pointer to the list
    return data, pbuffers
# Loads a TensorRT inference engine from a file.
def load_engine(logger, plan_file):
    with open(plan_file, "rb") as plan:  # Open the serialized inference engine file in binary read mode
        # Use the TensorRT runtime and the provided logger to deserialize the inference engine
        engine = trt.Runtime(logger).deserialize_cuda_engine(plan.read())
    return engine  # Return the deserialized inference engine
# Retrieves I/O tensor information from the engine.
def get_io_tensors(engine):
    num_io_tensors = engine.num_io_tensors  # Get the total number of I/O tensors
    io_tensor_names = [engine.get_tensor_name(i) for i in range(num_io_tensors)]  # Get all tensor names
    # Compute the number of input tensors
    num_input_io_tensors = [engine.get_tensor_mode(io_tensor_names[i]) for i in range(num_io_tensors)].count(trt.TensorIOMode.INPUT)
    return num_io_tensors, io_tensor_names, num_input_io_tensors  # Return tensor count, names, and input tensor count
# Runs TensorRT inference with pipelined data transfers.
def infer(engine, data):
    context = engine.create_execution_context()  # Create an execution context
    # Create two CUDA streams: one for copies and one for computation
    _, computeStream = cudart.cudaStreamCreate()
    _, copyStream = cudart.cudaStreamCreate()
    # Create three CUDA events for synchronization
    _, inputConsumedEvent = cudart.cudaEventCreate()
    _, inputCopyedEvent = cudart.cudaEventCreate()
    _, outputReadyEvent = cudart.cudaEventCreate()
    num_io_tensors, io_tensor_names, num_input_io_tensors = get_io_tensors(engine)  # Get I/O tensor information
    context.set_input_shape(io_tensor_names[0], data[0].shape)  # Set the input tensor shape
    bufferH, bufferD = [], []  # Initialize host and device buffer lists
    bufferH.append(data[0])  # Add the first batch to the host buffer list
    for i in range(num_input_io_tensors, num_io_tensors):  # Allocate host-side space for output tensors
        bufferH.append(np.empty(context.get_tensor_shape(io_tensor_names[i]), dtype=trt.nptype(engine.get_tensor_dtype(io_tensor_names[i]))))
    for i in range(num_io_tensors):  # Allocate device-side space for each tensor
        bufferD.append(cudart.cudaMalloc(bufferH[i].nbytes)[1])
    for i in range(num_io_tensors):  # Set tensor addresses in the execution context
        context.set_tensor_address(io_tensor_names[i], int(bufferD[i]))
    tet = None  # Initialize the variable for the NVTX range
    # Pre-transfer the first batch of input data
    for i in range(num_input_io_tensors):
        cudart.cudaMemcpyAsync(bufferD[i], bufferH[i].ctypes.data, bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, copyStream)
    cudart.cudaEventRecord(inputCopyedEvent, copyStream)  # Record an event after the input copy.
    for i in range(len(data)):  # Run inference for each batch
        if i == 7:  # Start timing at index 7 so that the range covers the last three batches
            tet = nvtx.start_range(message="Total Elapsed Time(3 batchs)", color="orange")  # Mark the start of the performance measurement using NVTX
        d = None  # Initialize a pointer to the next batch's data
        if i + 1 < len(data):  # If there is a next batch
            d = data[i + 1]
        nvtx.push_range(message="infer", color="purple")  # Mark the start of a single inference using NVTX
        # Run a single inference
        infer_once(engine, context, bufferH, bufferD, d, copyStream, computeStream, inputConsumedEvent, inputCopyedEvent, outputReadyEvent)
        nvtx.pop_range()  # Mark the end of a single inference using NVTX
    nvtx.end_range(tet)  # Mark the end of the performance measurement using NVTX
    for b in bufferD:  # Free device-side memory
        cudart.cudaFree(b)
    # Destroy CUDA streams and events
    cudart.cudaStreamDestroy(copyStream)
    cudart.cudaStreamDestroy(computeStream)
    cudart.cudaEventDestroy(inputConsumedEvent)
    cudart.cudaEventDestroy(inputCopyedEvent)
    cudart.cudaEventDestroy(outputReadyEvent)
# Executes a single inference step in the pipeline.
def infer_once(engine, context, bufferH, bufferD, data, copyStream, computeStream, inputConsumedEvent, inputCopyedEvent, outputReadyEvent):
    num_io_tensors, io_tensor_names, num_input_io_tensors = get_io_tensors(engine)  # Get I/O tensor information
    cudart.cudaEventSynchronize(inputCopyedEvent)  # Wait for the previous input copy to complete
    if data is not None:  # If new data is provided
        bufferH[0] = data  # Update the host-side input buffer
    # Register the input-consumed event before enqueueing so that this call records it.
    context.set_input_consumed_event(inputConsumedEvent)
    # The compute stream waits for the input copy to finish.
    cudart.cudaStreamWaitEvent(computeStream, inputCopyedEvent, cudart.cudaEventWaitDefault)
    context.execute_async_v3(computeStream)  # Run inference asynchronously in computeStream
    # The copy stream waits until the input is consumed before starting the next transfer.
    cudart.cudaStreamWaitEvent(copyStream, inputConsumedEvent, cudart.cudaEventWaitDefault)
    if data is not None:  # If there is a next batch to transfer
        # Copy the next input data to the device asynchronously
        for i in range(num_input_io_tensors):
            cudart.cudaMemcpyAsync(bufferD[i], bufferH[i].ctypes.data, bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, copyStream)
        cudart.cudaEventRecord(inputCopyedEvent, copyStream)  # Mark the completion of the next input copy.
    # Copy inference outputs back to the host asynchronously in computeStream
    for i in range(num_input_io_tensors, num_io_tensors):
        cudart.cudaMemcpyAsync(bufferH[i].ctypes.data, bufferD[i], bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, computeStream)
    cudart.cudaEventRecord(outputReadyEvent, computeStream)  # Mark the completion of the output copy.
    cudart.cudaEventSynchronize(outputReadyEvent)  # Wait for output copy to complete
    nvtx.push_range(message="Print Result")  # Mark the start of result printing using NVTX
    print_result(io_tensor_names, num_input_io_tensors, num_io_tensors, bufferH)  # Print inference results
    nvtx.pop_range()  # Mark the end of result printing using NVTX
    print("Succeeded running model in TensorRT!")  # Print a success message
# Defines print_result to print inference results
def print_result(io_tensor_names, num_input_io_tensors, num_io_tensors, bufferH):
    for i in range(num_input_io_tensors, num_io_tensors):  # Iterate over output tensors
        softmax_scores = softmax(bufferH[i])  # Apply softmax to the output tensor
        predicted_classes = np.argmax(softmax_scores, axis=1)  # Get the predicted class
        max_probs_np = np.max(softmax_scores, axis=1)  # Get the maximum probability
        print("Output Tensor Name: ", io_tensor_names[i])  # Print the output tensor name
        print("Maximum probability for each image in the batch:\n", max_probs_np)  # Print the maximum probability per image
        print("Index of predicted class for each image in the batch:\n", predicted_classes)  # Print the predicted class index per image
# Program entry point
def main():
    data, pBuffers = data_generation_with_pin_memory([128, 3, 224, 224], 10)  # Generate 10 batches using pinned memory
    logger = trt.Logger(trt.Logger.ERROR)  # Create a TensorRT logger
    engine = load_engine(logger, "resnet18.plan")  # Load the TensorRT inference engine
    infer(engine, data)  # Run inference using the inference engine
    for p in pBuffers:  # Iterate over pinned-memory pointers
        cudart.cudaFreeHost(p)  # Free each pinned memory block
if __name__ == "__main__":
    main()  # Run main if the script is executed as the main program

Run the following shell command.

python 0_build.py --fp16
nsys profile -w true \
	-t cuda,nvtx,osrt,cudnn,cublas \
	--cuda-memory-usage=true \
	--cudabacktrace=all \
	--cuda-graph-trace=node \
	--gpu-metrics-device=all \
	-f true \
	-o reports/5_multi-streams \
	python 5_multi-streams.py

The results are shown in the timeline below.

The timeline shows the following:
- The total time for three batches is reduced from 49.309 ms to 29.650 ms.
- In the timeline for Approach 3, data transfer (from the host to the device, labeled Memcpy HtoD) and computation (the blue sections) run serially. This optimization enables them to run in parallel, as indicated by label 3.
- A noticeable gap remains between batches, which the timeline reveals is caused by the print_result function.

Approach 5: Multi-threading for output processing

Problem analysis

In the timeline for Approach 4, there are still gaps between batches. These gaps are caused by the print_result function, which leaves the GPU idle while printing the output.

Solution design

You can use Python multi-threading to offload output processing to a separate thread, freeing the main thread to continue execution.
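
The hand-off pattern can be sketched on its own, without any GPU code. In the following example, consume_results, run_pipeline, and the sink list are hypothetical names, and appending a string stands in for the real print_result call. The main thread publishes one item per finished batch, and a None sentinel tells the worker to stop.

```python
import queue
import threading


def consume_results(q, sink):
    """Worker loop: process items until the None sentinel arrives."""
    while True:
        item = q.get()    # Blocks until the main thread publishes an item
        if item is None:  # Sentinel: no more results will arrive
            break
        sink.append(f"processed batch {item}")  # Stand-in for print_result


def run_pipeline(num_batches):
    """Main loop: hand each finished batch to the worker and keep going."""
    q = queue.Queue(maxsize=1)  # put() blocks if the worker is still one item behind
    sink = []
    worker = threading.Thread(target=consume_results, args=(q, sink))
    worker.start()
    for batch_id in range(num_batches):
        # Stand-in for the GPU work of one batch, followed by the hand-off
        q.put(batch_id)
    q.put(None)    # Ask the worker to terminate
    worker.join()  # Wait until all pending output has been processed
    return sink


if __name__ == "__main__":
    for line in run_pipeline(3):
        print(line)
```

A queue size of 1 bounds how far the main thread can run ahead of the worker. This matters when both sides share the same output buffer.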

Modify the infer function to create a thread that executes the print_result function. Save the complete code below as 6_multi-threads.py.

Full code

import nvtx  # Import the NVIDIA Tools Extension (NVTX) library for GPU performance analysis
import queue  # Import Python's standard queue module
import threading  # Import Python's standard threading module
import ctypes  # Import the ctypes module to call C library functions
import numpy as np  # Import the NumPy library for numerical computation
import tensorrt as trt  # Import the TensorRT library
from cuda import cudart  # Import cudart from the cuda module for the CUDA Runtime API
np.random.seed(10088)  # Set the NumPy random seed
# Define the softmax function to perform normalization
def softmax(x, axis=1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))  # Subtract the maximum value for numerical stability
    return e_x / np.sum(e_x, axis=axis, keepdims=True)  # Normalize e_x using broadcasting
# Define data_generation_with_pin_memory to generate random data using pinned memory
def data_generation_with_pin_memory(shape, batches):
    data = []  # Create a data list
    pbuffers = []  # Create a list for pinned memory buffers
    for i in range(batches):  # For each batch
        d = np.random.randn(*shape).astype(np.float32)  # Generate a batch of random data
        nElement = d.size  # Get the total number of elements in the data
        nByteSize = d.nbytes  # Get the total byte size of the data
        _, pBuffer = cudart.cudaHostAlloc(nByteSize, cudart.cudaHostAllocDefault)  # Allocate pinned memory using the CUDA API
        pBufferCtype = ctypes.cast(pBuffer, ctypes.POINTER(ctypes.c_float * nElement))  # Cast the pinned memory pointer to a ctypes type
        nd = np.ndarray(shape=d.shape, dtype=d.dtype, buffer=pBufferCtype.contents)  # Wrap the memory with a NumPy ndarray
        nd[:] = d  # Copy the data into pinned memory
        data.append(nd)  # Add the NumPy array wrapping the pinned memory to the data list
        pbuffers.append(pBuffer)  # Add the pinned memory pointer to the pbuffers list
    return data, pbuffers  # Return the data list and the list of pinned memory pointers
# Define load_engine to load a TensorRT inference engine
def load_engine(logger, plan_file):
    with open(plan_file, "rb") as plan:  # Open the inference engine file in binary read mode
        engine = trt.Runtime(logger).deserialize_cuda_engine(plan.read())  # Deserialize the inference engine
    return engine  # Return the deserialized inference engine
# Define get_io_tensors to get I/O tensor information from the inference engine
def get_io_tensors(engine):
    num_io_tensors = engine.num_io_tensors  # Get the total number of I/O tensors in the engine
    io_tensor_names = [engine.get_tensor_name(i) for i in range(num_io_tensors)]  # Get the names of all tensors
    num_input_io_tensors = [engine.get_tensor_mode(io_tensor_names[i]) for i in range(num_io_tensors)].count(trt.TensorIOMode.INPUT)  # Count the number of input tensors
    return num_io_tensors, io_tensor_names, num_input_io_tensors  # Return the tensor count, names, and input tensor count
# Define infer to run inference and create a thread to print results
def infer(engine, data):
    context = engine.create_execution_context()  # Create a TensorRT execution context
    _, computeStream = cudart.cudaStreamCreate()  # Create a CUDA compute stream
    _, copyStream = cudart.cudaStreamCreate()  # Create a CUDA copy stream
    # Create CUDA events
    _, inputConsumedEvent = cudart.cudaEventCreate()
    _, inputCopyedEvent = cudart.cudaEventCreate()
    _, outputReadyEvent = cudart.cudaEventCreate()
    num_io_tensors, io_tensor_names, num_input_io_tensors = get_io_tensors(engine)  # Get I/O tensor information
    context.set_input_shape(io_tensor_names[0], data[0].shape)  # Set the input tensor shape
    bufferH, bufferD = [], []  # Create host and device buffer lists
    bufferH.append(data[0])  # Add the first batch of data to the host buffer list
    for i in range(num_input_io_tensors, num_io_tensors):  # Allocate host-side memory for output tensors
        bufferH.append(np.empty(context.get_tensor_shape(io_tensor_names[i]), dtype=trt.nptype(engine.get_tensor_dtype(io_tensor_names[i]))))
    for i in range(num_io_tensors):  # Allocate device-side memory for each tensor
        bufferD.append(cudart.cudaMalloc(bufferH[i].nbytes)[1])
    for i in range(num_io_tensors):  # Set tensor addresses in the execution context
        context.set_tensor_address(io_tensor_names[i], int(bufferD[i]))
    q = queue.Queue(maxsize=1)  # Create a queue for inter-thread communication
    p = threading.Thread(target=print_result_in_thread, args=(q, io_tensor_names, num_input_io_tensors, num_io_tensors, bufferH))  # Create a thread for printing results
    p.start()  # Start the thread
    tet = None  # Initialize an NVTX range handle
    # Transfer the input data for the first batch
    for i in range(num_input_io_tensors):
        cudart.cudaMemcpyAsync(bufferD[i], bufferH[i].ctypes.data, bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, copyStream)
    cudart.cudaEventRecord(inputCopyedEvent, copyStream)  # Record the input copy completion event
    for i in range(len(data)):  # Perform inference on each batch of data
        if i == 7:  # Start timing at index 7 so that the range covers the last three batches
            tet = nvtx.start_range(message="Total Elapsed Time(3 batchs)", color="orange")  # Use NVTX to start recording a time range
        d = None  # Initialize a variable to store the next batch of data
        if i + 1 < len(data):  # If a next batch exists
            d = data[i + 1]
        nvtx.push_range(message="infer", color="purple")  # Use NVTX to mark the inference process
        infer_once(engine, context, bufferH, bufferD, d, copyStream, computeStream, inputConsumedEvent, inputCopyedEvent, outputReadyEvent, q)  # Call infer_once to perform a single inference pass
        nvtx.pop_range()  # End the NVTX marker for the inference process
    nvtx.end_range(tet)  # End the NVTX time range recording
    q.put(None)  # Put None into the queue to signal termination
    for b in bufferD:  # Free the device memory
        cudart.cudaFree(b)
    # Destroy the CUDA streams and events
    cudart.cudaStreamDestroy(copyStream)
    cudart.cudaStreamDestroy(computeStream)
    cudart.cudaEventDestroy(inputConsumedEvent)
    cudart.cudaEventDestroy(inputCopyedEvent)
    cudart.cudaEventDestroy(outputReadyEvent)
# Define infer_once to perform a single inference pass
def infer_once(engine, context, bufferH, bufferD, data, copyStream, computeStream, inputConsumedEvent, inputCopyedEvent, outputReadyEvent, q):
    num_io_tensors, io_tensor_names, num_input_io_tensors = get_io_tensors(engine)  # Get I/O tensor information
    cudart.cudaEventSynchronize(inputCopyedEvent)  # Synchronize the copy event to ensure the input data has been copied
    if data is not None:  # If new data is provided
        bufferH[0] = data  # Update the input buffer on the host
    context.set_input_consumed_event(inputConsumedEvent)  # Register the input-consumed event before enqueueing so that this call records it
    cudart.cudaStreamWaitEvent(computeStream, inputCopyedEvent, cudart.cudaEventWaitDefault)  # Make the compute stream wait for the input data copy to complete
    context.execute_async_v3(computeStream)  # Asynchronously execute inference in the compute stream
    cudart.cudaStreamWaitEvent(copyStream, inputConsumedEvent, cudart.cudaEventWaitDefault)  # Make the copy stream wait for the input data to be consumed
    if data is not None:  # If there is new data
        bufferH[0] = data  # Update the input buffer on the host
        for i in range(num_input_io_tensors):  # Copy the new input data to the device
            cudart.cudaMemcpyAsync(bufferD[i], bufferH[i].ctypes.data, bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, copyStream)
        cudart.cudaEventRecord(inputCopyedEvent, copyStream)  # Record the input copy completion event
    for i in range(num_input_io_tensors, num_io_tensors):  # Copy the output data to the host
        cudart.cudaMemcpyAsync(bufferH[i].ctypes.data, bufferD[i], bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, computeStream)
    cudart.cudaEventRecord(outputReadyEvent, computeStream)  # Record the output-ready event
    cudart.cudaEventSynchronize(outputReadyEvent)  # Synchronize the output-ready event to ensure the copy operation is complete
    q.put("OutputReady")  # Send a signal to the queue that the output data is ready
# Define print_result_in_thread, which runs in a separate thread to read signals from the queue and print inference results
def print_result_in_thread(queue, io_tensor_names, num_input_io_tensors, num_io_tensors, bufferH):
    while True:  # Loop indefinitely
        get_result = queue.get()  # Read an item from the queue, blocking if empty
        if get_result is None:  # If a None signal is received, terminate
            break
        nvtx.push_range(message="Print Result")  # Start an NVTX range for performance analysis
        print_result(io_tensor_names, num_input_io_tensors, num_io_tensors, bufferH)  # Call print_result to print the inference results
        nvtx.pop_range()  # End the NVTX range
# Define print_result to print the softmax results of the inference
def print_result(io_tensor_names, num_input_io_tensors, num_io_tensors, bufferH):
    for i in range(num_input_io_tensors, num_io_tensors):  # Iterate through all output tensors
        softmax_scores = softmax(bufferH[i])  # Apply the softmax function to calculate probabilities
        predicted_classes = np.argmax(softmax_scores, axis=1)  # Use np.argmax to find the class with the highest probability
        max_probs_np = np.max(softmax_scores, axis=1)  # Find the highest probability value
        print("Output Tensor Name: ", io_tensor_names[i])  # Print the name of the output tensor
        print("Maximum probability for each image in the batch:\n", max_probs_np)  # Print the highest probability for each image
        print("Index of predicted class for each image in the batch:\n", predicted_classes)  # Print the predicted class index
# Define the main function as the program's entry point
def main():
    data, pBuffers = data_generation_with_pin_memory([128, 3, 224, 224], 10)  # Generate input data using pinned memory
    logger = trt.Logger(trt.Logger.ERROR)  # Create a TensorRT logger object
    engine = load_engine(logger, "resnet18.plan")  # Load the TensorRT inference engine
    infer(engine, data)  # Perform the inference process
    for p in pBuffers:  # Iterate through the list of pinned memory pointers
        cudart.cudaFreeHost(p)  # Free the pinned memory
if __name__ == "__main__":
    main()  # If the script is run as the main program, execute the main function

Run the following shell commands.

python 0_build.py --fp16
nsys profile -w true \
	-t cuda,nvtx,osrt,cudnn,cublas \
	--cuda-memory-usage=true \
	--cudabacktrace=all \
	--cuda-graph-trace=node \
	--gpu-metrics-device=all \
	-f true \
	-o reports/6_multi-threads \
	python 6_multi-threads.py

The timeline shows that the print_result function now overlaps with GPU computation and data transfer instead of executing serially:
- In the GPU hardware row, Stream 16 (kernel computation) and Stream 17 (HtoD memory copy) run in parallel, with 52.4% and 43.5% utilization, respectively.
- The Threads area shows multiple Python threads working simultaneously, and the NVTX row shows multiple Print Result markers. This confirms that output processing completes asynchronously in a separate thread and no longer blocks the main inference pipeline.
- Each TensorRT ExecutionContext::enqueue call takes about 8.5–8.7 ms, the NVTX infer range takes about 8.6–8.75 ms, and the NVTX Total Elapsed Time(3 batchs) ranges are 26.787 ms and 27.173 ms.
- The total time for the last three batches decreases by 2.86 ms, from 29.650 ms to 26.787 ms.

The parallel execution of the compute and copy streams, visible at the CUDA HW level, validates that this multi-threading approach is effective.
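
To put the reported totals in perspective, the following sketch computes the speedup of each step and the cumulative speedup. The four timings are the three-batch totals quoted in this article. The labels and the speedups helper are illustrative names, not part of the tutorial scripts.

```python
# Three-batch totals (ms) as reported in this article
totals_ms = [
    ("Before Approach 3", 108.348),
    ("Approach 3 (FP16)", 49.309),
    ("Approach 4 (CUDA streams)", 29.650),
    ("Approach 5 (output thread)", 26.787),
]


def speedups(totals):
    """Return (label, step_speedup, cumulative_speedup) for every entry after the first."""
    base = totals[0][1]
    rows = []
    for (_, prev), (label, cur) in zip(totals, totals[1:]):
        rows.append((label, prev / cur, base / cur))
    return rows


if __name__ == "__main__":
    for label, step, total in speedups(totals_ms):
        print(f"{label}: {step:.2f}x over the previous step, {total:.2f}x overall")
```

The output makes it clear that the later approaches yield smaller gains than the switch to FP16, which is expected once the largest bottleneck has been removed.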

Approach 6: Using CUDA Graph

Problem analysis

In the timeline for Approach 5, each batch computation issues 25 separate kernel launches, and each launch incurs overhead. The Nsight Systems profile, taken on a Tesla V100-SXM2-32GB GPU, shows that a single inference takes about 9.107 ms and that the three measured batches take 27.173 ms in total. Searching the timeline for the launch keyword lists the 25 kernel launches. For one of these kernels, volta_first_layer_filter7x7_fwd_execute_flt_k_64_kernel_trt, the execution time is only 6.576 μs, but its launch latency is 133.405 μs. When the launch overhead of small kernels exceeds their execution time by this much, the workload is a natural candidate for optimization with CUDA Graph.
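
As a quick check of the figures above, the following snippet computes how the launch latency of that kernel compares with its execution time. The two values come from the profile described in this section. The helper name launch_to_exec_ratio is illustrative.

```python
def launch_to_exec_ratio(launch_latency_us, exec_us):
    """Return how many times longer the launch latency is than the kernel execution."""
    return launch_latency_us / exec_us


if __name__ == "__main__":
    # Values reported for volta_first_layer_filter7x7_fwd_execute_flt_k_64_kernel_trt
    ratio = launch_to_exec_ratio(133.405, 6.576)
    print(f"Launch latency is {ratio:.1f}x the execution time")
```

For this kernel, the launch latency is roughly 20 times its execution time.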

Solution design

CUDA Graph, a feature introduced in CUDA 10.0, allows developers to capture a series of CUDA operations, such as memory transfers and kernel executions, and organize them into a directed acyclic graph. This graph acts as a snapshot of a sequence of operations on CUDA streams. Once captured, the graph can be replayed multiple times with minimal CPU overhead, which reduces CPU-GPU interaction and improves overall efficiency. You can use CUDA Graph to capture the operations a TensorRT engine requires for inference, including memory copies and kernel executions. This reduces CPU scheduling overhead, which can significantly improve performance, especially for repetitive inference tasks.

Important

Using CUDA Graph can be complex. It typically requires a solid understanding of CUDA programming and how to configure and execute inference in TensorRT. Furthermore, the effectiveness of CUDA Graph depends on the use case, as not all CUDA operations support graph capture. The model in this article is an example of a workload that cannot use CUDA Graph because it contains unsupported operations.

The following example shows how to use CUDA Graph in TensorRT by modifying the infer_once function.

Full code

# Define a function named infer_once that executes a single inference using the TensorRT engine, CUDA streams, and events.
def infer_once(engine, context, bufferH, bufferD, data, copyStream, computeStream, inputConsumedEvent, inputCopyedEvent, outputReadyEvent, q):
    # Call get_io_tensors to get the number of I/O tensors, their names, and the number of input tensors.
    num_io_tensors, io_tensor_names, num_input_io_tensors = get_io_tensors(engine)
    # Synchronize on inputCopyedEvent to ensure the previous memory copy completes.
    cudart.cudaEventSynchronize(inputCopyedEvent)
    # Assign new input data to the first element in the host buffer bufferH.
    bufferH[0] = data
    # Wait in computeStream until the input transfer completes.
    cudart.cudaStreamWaitEvent(computeStream, inputCopyedEvent, cudart.cudaEventWaitDefault)
    # Begin capture on computeStream to create a CUDA Graph. Use global capture mode.
    cudart.cudaStreamBeginCapture(computeStream, cudart.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal)
    # Run inference asynchronously on the compute stream. This becomes part of the graph capture.
    context.execute_async_v3(computeStream)
    # End the stream capture and create a graph object.
    status, graph = cudart.cudaStreamEndCapture(computeStream)
    # Print the capture status.
    print(status)
    # Instantiate the graph for execution.
    status, graphExec = cudart.cudaGraphInstantiate(graph, 0)
    # Print the instantiation status.
    print(status)
    # Launch the CUDA Graph on the compute stream. This replays all captured CUDA operations.
    cudart.cudaGraphLaunch(graphExec, computeStream)
    # Mark the input as consumed.
    context.set_input_consumed_event(inputConsumedEvent)
    # Copy outputs back after the computation finishes.
    for i in range(num_input_io_tensors, num_io_tensors):
        cudart.cudaMemcpyAsync(bufferH[i].ctypes.data, bufferD[i], bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, computeStream)
    cudart.cudaEventRecord(outputReadyEvent, computeStream)
    # Wait in copyStream until the input is consumed.
    cudart.cudaStreamWaitEvent(copyStream, inputConsumedEvent, cudart.cudaEventWaitDefault)
    # Copy the next input from the host to the device.
    if data is not None:
        bufferH[0] = data
        for i in range(num_input_io_tensors):
            cudart.cudaMemcpyAsync(bufferD[i], bufferH[i].ctypes.data, bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, copyStream)
        cudart.cudaEventRecord(inputCopyedEvent, copyStream)
    cudart.cudaEventSynchronize(outputReadyEvent)
    q.put("OutputReady")

After running the code, the following error occurs. This likely means the model contains CUDA operations that do not support graph capture.

[03/04/2024-12:34:39] [TRT] [E] 1: [caskUtils.cpp::createCaskHardwareInfo::852] Error Code 1: Cuda Runtime (operation not permitted when stream is capturing)
cudaError_t.cudaErrorStreamCaptureInvalidated
cudaError_t.cudaErrorInvalidValue

For a CUDA Graph example, see CUDA Graph.

Approach 7: Multiple CPU threads and contexts

Problem analysis

Previous approaches processed data using a single execution context. A single context cannot process multiple batches in parallel because each context maintains intermediate caches for inference, and these caches can only serve one batch at a time.

Solution design

Multiple CPU threads on the host can submit batch processing requests for TensorRT to process in parallel. TensorRT supports this by allowing the creation of multiple, independent contexts.

The following example creates two contexts that use the same optimization profile. This requires enabling the shared profile option when building the engine. The 10 generated batches are split into two groups of five, and each context processes one group.
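
The division of work can be sketched without any GPU code. The following example is a simplified illustration under stated assumptions: split_batches, run_group, and run_all are hypothetical helpers, and building a list of strings stands in for creating an execution context and running inference on one group. Batches are dealt out alternately, so that each group receives every second batch.

```python
import threading


def split_batches(batches, num_groups=2):
    """Deal batches out alternately into num_groups lists."""
    return [batches[g::num_groups] for g in range(num_groups)]


def run_group(group_id, group, results):
    # Stand-in for: create one execution context, then run inference on this group
    results[group_id] = [f"context {group_id} ran batch {b}" for b in group]


def run_all(batches, num_groups=2):
    """Process each group in its own CPU thread and collect the results."""
    groups = split_batches(batches, num_groups)
    results = [None] * num_groups  # One slot per thread, so no locking is needed
    threads = [
        threading.Thread(target=run_group, args=(g, groups[g], results))
        for g in range(num_groups)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for group_result in run_all(list(range(10))):
        print(group_result)
```

Each thread owns its group and its result slot, which mirrors the requirement that each execution context serves only one batch at a time.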

The complete code is shown below, saved as 8_multi_cpu_threads.py.

Full code

import nvtx  # Import the NVIDIA Tools Extension (NVTX) library for profiling
import queue  # Import Python's queue module for inter-thread communication
import threading  # Import Python's threading module for multithreaded programming
import ctypes  # Import the ctypes module to call C libraries in Python
import numpy as np  # Import the NumPy library for numerical computing
import tensorrt as trt  # Import the TensorRT library for deep learning inference optimization
from cuda import cudart  # Import cudart from the cuda module, which provides Python bindings for the CUDA Runtime API
np.random.seed(10088)
def softmax(x, axis=1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True)) # For numerical stability
    return e_x / np.sum(e_x, axis=axis, keepdims=True)
def data_generation_with_pin_memory(shape, batches):
    data1, data2 = [], []  # Create two lists to store data for two different datasets
    pbuffers = []  # Create a list to store pointers to pinned memory
    for i in range(batches):  # Generate and store data for each batch
        d = np.random.randn(*shape).astype(np.float32)  # Generate random data
        nElement = d.size  # Get the total number of elements in the array
        nByteSize = d.nbytes  # Get the size of the array in bytes
        _, pBuffer = cudart.cudaHostAlloc(nByteSize, cudart.cudaHostAllocDefault)  # Allocate pinned memory using CUDA
        pBufferCtype = ctypes.cast(pBuffer, ctypes.POINTER(ctypes.c_float * nElement))  # Cast the pinned memory to a ctypes type
        nd = np.ndarray(shape=d.shape, dtype=d.dtype, buffer=pBufferCtype.contents)  # Create a NumPy array that wraps the pinned memory
        nd[:] = d  # Copy the generated data into the pinned memory
        if i % 2 == 0:  # Alternately add data to the two data lists
            data1.append(nd)
        else:
            data2.append(nd)
        pbuffers.append(pBuffer)  # Add the pinned memory pointer to the list
    return data1, data2, pbuffers  # Return the two data lists and the list of pinned memory pointers
def load_engine(logger,plan_file):
	with open(plan_file, "rb") as plan:
		engine = trt.Runtime(logger).deserialize_cuda_engine(plan.read())
	return engine
def get_io_tensors(engine):
	num_io_tensors = engine.num_io_tensors
	io_tensor_names = [engine.get_tensor_name(i) for i in range(num_io_tensors)]
	num_input_io_tensors = [engine.get_tensor_mode(io_tensor_names[i]) for i in range(num_io_tensors)].count(trt.TensorIOMode.INPUT)
	return num_io_tensors,io_tensor_names,num_input_io_tensors
def infer(engine,data):
	context = engine.create_execution_context()
	# Create two CUDA streams
	_, computeStream = cudart.cudaStreamCreate()
	_, copyStream = cudart.cudaStreamCreate()
	# Create three CUDA events
	_, inputConsumedEvent = cudart.cudaEventCreate()
	_, inputCopyedEvent = cudart.cudaEventCreate()
	_, outputReadyEvent = cudart.cudaEventCreate()
	num_io_tensors,io_tensor_names,num_input_io_tensors = get_io_tensors(engine)
	context.set_input_shape(io_tensor_names[0], data[0].shape)
	bufferH,bufferD = [],[]
	bufferH.append(data[0])
	for i in range(num_input_io_tensors, num_io_tensors):
		bufferH.append(np.empty(context.get_tensor_shape(io_tensor_names[i]), dtype=trt.nptype(engine.get_tensor_dtype(io_tensor_names[i]))))
	for i in range(num_io_tensors):
		bufferD.append(cudart.cudaMalloc(bufferH[i].nbytes)[1])
	for i in range(num_io_tensors):
		context.set_tensor_address(io_tensor_names[i], int(bufferD[i]))
	q = queue.Queue(maxsize=1)
	p = threading.Thread(target=print_result_in_thread,args=(q,io_tensor_names,num_input_io_tensors,num_io_tensors,bufferH))
	p.start()
	tet = None
	# Transfer the first batch of data
	for i in range(num_input_io_tensors):
		cudart.cudaMemcpyAsync(bufferD[i], bufferH[i].ctypes.data, bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice,copyStream)
	cudart.cudaEventRecord(inputCopyedEvent, copyStream)
	for i in range(len(data)):
		if i == 2:
			tet = nvtx.start_range(message="Total Elapsed Time (3 batches)", color="orange")
		d = None
		if i+1 <= len(data) -1:
			d = data[i+1]
		nvtx.push_range(message="infer",color="purple")
		infer_once(engine,context,bufferH,bufferD,d,copyStream,computeStream,inputConsumedEvent,inputCopyedEvent,outputReadyEvent,q)
		nvtx.pop_range()
	nvtx.end_range(tet)
	q.put(None)
	for b in bufferD:
		cudart.cudaFree(b)
	cudart.cudaStreamDestroy(copyStream)
	cudart.cudaStreamDestroy(computeStream)
	cudart.cudaEventDestroy(inputConsumedEvent)
	cudart.cudaEventDestroy(inputCopyedEvent)
	cudart.cudaEventDestroy(outputReadyEvent)
def infer_once(engine,context,bufferH,bufferD,data,copyStream,computeStream,inputConsumedEvent,inputCopyedEvent,outputReadyEvent,q):
	num_io_tensors,io_tensor_names,num_input_io_tensors = get_io_tensors(engine)
	cudart.cudaEventSynchronize(inputCopyedEvent)
	bufferH[0] = data
	# Wait in the compute stream for the data transfer to complete
	cudart.cudaStreamWaitEvent(computeStream, inputCopyedEvent, cudart.cudaEventWaitDefault)
	# Register the event that TensorRT signals once the input has been consumed; set it before the enqueue so that this execution records it
	context.set_input_consumed_event(inputConsumedEvent)
	context.execute_async_v3(computeStream)
	# Wait in the copy stream until the input is consumed
	cudart.cudaStreamWaitEvent(copyStream, inputConsumedEvent, cudart.cudaEventWaitDefault)
	# Transfer the next batch of data
	if data is not None:
		bufferH[0] = data
		for i in range(num_input_io_tensors):
			cudart.cudaMemcpyAsync(bufferD[i], bufferH[i].ctypes.data, bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice,copyStream)
		cudart.cudaEventRecord(inputCopyedEvent, copyStream)
	# After computation, copy the data back
	for i in range(num_input_io_tensors, num_io_tensors):
		cudart.cudaMemcpyAsync(bufferH[i].ctypes.data, bufferD[i], bufferH[i].nbytes, cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost,computeStream)
	cudart.cudaEventRecord(outputReadyEvent, computeStream)
	cudart.cudaEventSynchronize(outputReadyEvent)
	q.put("OutputReady")
def print_result_in_thread(queue,io_tensor_names,num_input_io_tensors,num_io_tensors,bufferH):
	while True:
		get_result = queue.get()
		if get_result is None:
			break
		nvtx.push_range(message="Print Result")
		print_result(io_tensor_names,num_input_io_tensors,num_io_tensors,bufferH)
		nvtx.pop_range()
def print_result(io_tensor_names,num_input_io_tensors,num_io_tensors,bufferH):
	for i in range(num_input_io_tensors,num_io_tensors):
		softmax_scores = softmax(bufferH[i])
		# Use np.argmax to get the class with the highest probability
		predicted_classes = np.argmax(softmax_scores, axis=1)
		max_probs_np = np.max(softmax_scores, axis=1)
		print("Output Tensor Name: ",io_tensor_names[i])
		print("Maximum probability for each image in the batch:\n", max_probs_np)
		print("Index of predicted class for each image in the batch:\n", predicted_classes)
def main():
    data1, data2, pBuffers = data_generation_with_pin_memory([128, 3, 224, 224], 10)  # Generate two datasets using pinned memory
    logger = trt.Logger(trt.Logger.ERROR)  # Create a TensorRT logger object
    engine = load_engine(logger, "resnet18.plan")  # Load the TensorRT inference engine
    threads = []  # Create a thread list
    p = threading.Thread(target=infer, args=(engine, data1))  # Create a new thread to run inference on the first dataset
    threads.append(p)  # Add the thread to the list
    p = threading.Thread(target=infer, args=(engine, data2))  # Create a new thread to run inference on the second dataset
    threads.append(p)  # Add the thread to the list
    for p in threads:  # Iterate through the thread list
        p.start()  # Start each thread
    for p in threads:  # Iterate through the thread list
        p.join()  # Wait for each thread to complete
    for p in pBuffers:  # Iterate through the pinned memory pointer list
        cudart.cudaFreeHost(p)  # Free the pinned memory
if __name__ == "__main__":
	main()

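The preceding code starts two Python threads that share one engine. Each thread processes five batches, creates its own execution context, and runs inference independently. Next, run the following shell commands to rebuild the engine with profile sharing enabled and to profile the script.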

python 0_build.py --fp16 --share-profile # Enable profile sharing
nsys profile -w true \
	-t cuda,nvtx,osrt,cudnn,cublas \
	--cuda-memory-usage=true \
	--cudabacktrace=all \
	--cuda-graph-trace=node \
	--gpu-metrics-device=all \
	-f true \
	-o reports/8_multi-context \
	python 8_multi-context.py

The timeline shows the following results.
- GPU compute runs in parallel across two streams (Stream 19 and Stream 16).
- On each stream, processing the last three batches takes about 50 ms. Because the two streams run in parallel, six batches complete in that window, so the effective processing time for three batches is approximately 25 ms (50 ms / 2), or about 8.3 ms per batch. Parallel execution uses the GPU more fully, but the contention between the two contexts also slows each stream, which shows the risk of overloading the GPU.
- GPU memory usage is approximately double that of a single-context inference run.
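
The per-batch estimate above can be reproduced with a few lines of arithmetic. The following is only a sketch: the function name `effective_batch_time_ms` is illustrative and not part of the article's scripts, and it assumes the two streams overlap perfectly, which the timeline only approximates.

```python
def effective_batch_time_ms(window_ms, batches_per_stream, num_streams):
    """Average wall-clock time per batch when several streams run concurrently.

    Assumes perfect overlap: every stream finishes its batches within the
    same wall-clock window.
    """
    total_batches = batches_per_stream * num_streams
    return window_ms / total_batches


# Values read from the timeline: each stream needs about 50 ms for 3 batches.
per_batch = effective_batch_time_ms(window_ms=50.0, batches_per_stream=3, num_streams=2)
per_three = 3 * per_batch
print(f"{per_batch:.1f} ms per batch, {per_three:.1f} ms per three batches")
```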

Approach 8: Multiple profiles with Dynamic Shape

Problem analysis

The Dynamic Shape feature lets a single engine process input data of various sizes. However, the kernels that TensorRT selects for an optimization profile must work across the profile's whole min-opt-max range and are tuned for the opt shape, so one profile cannot be optimal for every input size. In some scenarios, the performance of Dynamic Shape degrades significantly when the min-opt-max range is too wide. This is where multiple profiles become useful: you add several optimization profiles when you build the engine, and TensorRT can then choose kernels separately for each range, for example kernels with different thread block sizes for different input sizes.

Solution design

The solution is to use multiple optimization profiles, each corresponding to a narrower min-opt-max range.
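
As a rough illustration of how a runtime shape maps to a profile, the sketch below picks the first profile whose min-max range contains a given input shape. The helper `select_profile` is hypothetical and is not a TensorRT API; the ranges are the ones defined in 9_multi-profiles-build.py.

```python
def select_profile(profiles, input_shape):
    """Return the index of the first profile whose [min, max] range contains input_shape.

    `profiles` is a list of [min_shape, opt_shape, max_shape] triples.
    Returns None when no profile covers the shape.
    """
    for index, (min_shape, _opt_shape, max_shape) in enumerate(profiles):
        if len(input_shape) != len(min_shape):
            continue
        if all(lo <= dim <= hi for lo, dim, hi in zip(min_shape, input_shape, max_shape)):
            return index
    return None


profiles = [
    [[1, 3, 224, 224], [16, 3, 224, 224], [32, 3, 224, 224]],
    [[64, 3, 224, 224], [128, 3, 224, 224], [192, 3, 224, 224]],
]

print(select_profile(profiles, [16, 3, 224, 224]))   # 0
print(select_profile(profiles, [128, 3, 224, 224]))  # 1
print(select_profile(profiles, [48, 3, 224, 224]))   # None
```

Batch sizes 33 to 63 fall between the two ranges, so neither profile covers them. When you choose the ranges, make sure that together they include every shape you expect at run time.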

To demonstrate how to use multiple profiles, the following code simplifies and modifies the baseline example.

First, modify the 0_build.py script so that the build phase creates two optimization profiles and sets the min-opt-max shape for each. Save the modified script as 9_multi-profiles-build.py.

Full code

import argparse
import os
import tensorrt as trt
def build(logger,ONNX_file,shapes,num_aux_stream,share_profile,fp16):
    errors = []  # Initialize an empty list to collect errors.
    builder = trt.Builder(logger)  # Create a TensorRT Builder instance.
    # Create a network definition that supports explicit batch sizes.
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    # Create an optimization profile for each input shape.
    profiles = [builder.create_optimization_profile() for _ in range(len(shapes))]
    config = builder.create_builder_config()  # Create a builder configuration object.
    if share_profile:  # If profile sharing is enabled.
        print("enable share profile")
        # Enable the preview feature for sharing optimization profiles.
        config.set_preview_feature(trt.PreviewFeature.PROFILE_SHARING_0806, True)
    if num_aux_stream > 0:  # If there are auxiliary streams.
        print("set aux stream " + str(num_aux_stream))
        config.max_aux_streams = num_aux_stream  # Set the number of auxiliary streams.
    if fp16:  # If FP16 precision is used.
        config.set_flag(trt.BuilderFlag.FP16)  # Enable FP16 precision mode.
    parser = trt.OnnxParser(network, logger)  # Create an ONNX parser.
    if not os.path.exists(ONNX_file):  # Check if the ONNX file exists.
        errors.append("Failed to find onnx File!")  # If it does not exist, add an error message.
        return None, errors  # Return None and the list of errors.
    with open(ONNX_file, "rb") as model:  # Open the ONNX file in binary read mode.
        if not parser.parse(model.read()):  # Read and parse the ONNX model.
            errors.append("failed to parse .onnx file: ")  # If parsing fails, add an error message.
            for error in range(parser.num_errors):  # Iterate through all parsing errors.
                errors.append(parser.get_error(error))  # Add each parsing error message.
            return None, errors  # Return None and the list of errors.
    inputTensor = network.get_input(0)  # Get the first input tensor of the network.
    for i in range(len(shapes)):  # Iterate through all input shapes.
        # Set the input shape for the optimization profile for each input tensor.
        profiles[i].set_shape(inputTensor.name, *shapes[i])
        config.add_optimization_profile(profiles[i])  # Add the optimization profile to the builder configuration.
    config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED  # Set the profiling verbosity to detailed.
    # Build and serialize the network into an engine string.
    engine_string = builder.build_serialized_network(network, config)
    if engine_string is None:  # Check if the engine was built successfully.
        errors.append("Failed to build engine")  # If the build fails, add an error message.
        return None, errors  # Return None and the list of errors.
    return engine_string, errors  # If the build is successful, return the engine string and the list of errors.
def save_engine(engine_string,planFile):
	with open(planFile, "wb") as f:
		f.write(engine_string)
	return 0
def main():
    # Create a command-line argument parser using the argparse library.
    parser = argparse.ArgumentParser(description='ResNet18 TensorRT Builder')
    # Add an argument to specify the number of auxiliary streams (default: 0).
    parser.add_argument('--aux-stream', type=int, default=0, metavar='N',
                        help='specify the aux stream (default: 0)')
    # Add an argument to enable shared profiles.
    parser.add_argument('--share-profile', action='store_true', default=False,
                        help='enable share profile')
    # Add the --fp16 command-line argument to enable FP16 mode, which is a switch option and is off by default.
    parser.add_argument('--fp16', action='store_true', default=False,
                        help='enable fp16 mode')
    # Add the --output command-line argument to specify the output plan file, with a default value of 'resnet18.plan'.
    parser.add_argument('--output', type=str, default='resnet18.plan', metavar='N',
                        help='specify the plan file')
    # Add the --ONNX-file command-line argument to specify the input ONNX file, with a default value of 'resnet18.ONNX'.
    parser.add_argument('--ONNX-file', type=str, default='resnet18.ONNX', metavar='N',
                        help='specify the onnx file')
    # Parse the command-line arguments.
    args = parser.parse_args()
    # Create a TensorRT logger object and set the log level to ERROR.
    logger = trt.Logger(trt.Logger.ERROR)
    # Define two shape ranges for the optimization profiles.
    shapes = [
        [[1,3,224,224],[16,3,224,224],[32,3,224,224]],
        [[64,3,224,224],[128,3,224,224],[192,3,224,224]],
    ]
    # Build the TensorRT engine.
    engine_string, errors = build(logger, args.ONNX_file, shapes, args.aux_stream, args.share_profile, args.fp16)
    # Check for build errors.
    if len(errors) != 0:
        print(errors)  # If there are errors, print them.
        return 1  # Return error code 1 to indicate an abnormal program exit.
    # Save the engine to the specified file.
    save_engine(engine_string, args.output)
    return 0  # Exit the program normally with a return code of 0.
if __name__ == "__main__":
    raise SystemExit(main())  # Propagate main()'s return value as the process exit code.

Next, prepare the inference script and save it as 9_multi-profiles.py. The script prepares two datasets, data0 and data1, with shapes of [16,3,224,224] and [128,3,224,224], respectively. Without multiple profiles, you would need to allocate new GPU memory each time the data shape changes.
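
Switching between profiles at run time could look roughly like the following. This is a minimal sketch under stated assumptions, not the article's 9_multi-profiles.py: the helper name `activate_profile` is hypothetical, `context` is assumed to be a TensorRT 8.6 `IExecutionContext`, `stream_handle` is assumed to be a CUDA stream created elsewhere, and the chosen profile is assumed not to be in use by another context.

```python
def activate_profile(context, profile_index, input_name, input_shape, stream_handle):
    """Select an optimization profile, then declare the runtime input shape.

    Raises RuntimeError if either call reports failure, for example when
    input_shape lies outside the selected profile's min-max range.
    """
    # Select the profile first: the valid shape range depends on the active profile.
    if not context.set_optimization_profile_async(profile_index, stream_handle):
        raise RuntimeError(f"could not switch to profile {profile_index}")
    # Declare the actual input shape for the next inference.
    if not context.set_input_shape(input_name, tuple(input_shape)):
        raise RuntimeError(f"shape {list(input_shape)} rejected by profile {profile_index}")
```

With the two datasets described above, you would call the helper with profile 0 and shape [16,3,224,224] before running data0, and with profile 1 and shape [128,3,224,224] before running data1.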

After the setup is complete, run the following shell script.

python 9_multi-profiles-build.py --output resnet18-multi-profiles.plan
nsys profile -w true \
	-t cuda,nvtx,osrt,cudnn,cublas \
	--cuda-memory-usage=true \
	--cudabacktrace=all \
	--cuda-graph-trace=node \
	--gpu-metrics-device=all \
	-f true \
	-o reports/9_multi-profiles \
	python 9_multi-profiles.py

Import the report into Nsight Systems. In the timeline, select the /conv1/Conv+/relu/Relu layer. The Description panel on the right shows that its input dimension is [16,3,224,224], which confirms that inference for the first 10 batches used the shape range of the first optimization profile.

For the next 10 batches, the input dimension is [128,3,224,224], which falls within the range of the second optimization profile.

For more information about Dynamic Shape, see the TensorRT Cookbook.

Summary

Together, these optimization techniques reduced the processing time for three batches from 133.577 ms to about 25 ms, a speedup of more than five times. For more on TensorRT's advanced features and optimization techniques, see the TensorRT Cookbook.