Bidirectional Encoder Representations from Transformers (BERT) models deliver strong accuracy on natural language processing (NLP) tasks, but their high parameter count makes GPU inference expensive. PAI-Blade reduces inference latency without changing your model architecture or rewriting your inference code.
This tutorial walks you through optimizing a TensorFlow BERT news-classification model with PAI-Blade, from downloading the model to deploying the optimized version.
Prerequisites
Before you begin, ensure that you have:
Linux with Python 3.6 or later and CUDA 10.0
TensorFlow 1.15
PAI-Blade 3.16.0 or later
How it works
PAI-Blade analyzes your SavedModel and applies a pipeline of GPU-specific optimizations — including graph cleanup, mixed-precision conversion, and AI compiler fusion — then saves the result as a new SavedModel. Integration requires a single additional import; no other inference code changes are needed.
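To make the mixed-precision step more concrete: it runs selected operations in 16-bit floating point, which trades a small rounding error for less memory traffic and faster math on supporting GPUs. The sketch below is an illustration only, written with the Python standard library rather than PAI-Blade or TensorFlow. The logits are made-up numbers, and `to_float16` and `argmax` are hypothetical helper names.

```python
import struct

def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def argmax(values):
    """Return the index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits for one input (made-up numbers, not model output).
logits = [2.3127, -0.4519, 7.8861, 0.0032]
half = [to_float16(v) for v in logits]

max_abs_error = max(abs(a - b) for a, b in zip(logits, half))
print("half-precision values:", half)
print("max abs rounding error:", max_abs_error)
print("argmax unchanged:", argmax(logits) == argmax(half))
```

When the winning logit is well separated from the rest, as in this made-up example, rounding of this size does not change the predicted class. Step 3 checks the same property on the real model.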
Step 1: Prepare the model and test data
Install the tokenizers library.

    pip3 install tokenizers

Download the example BERT model and extract it.

    wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/tutorials/bert_example/nlu_general_news_classification_base.tar.gz
    mkdir nlu_general_news_classification_base
    tar zxvf nlu_general_news_classification_base.tar.gz -C nlu_general_news_classification_base

Inspect the model's input and output signature.

    saved_model_cli show --dir nlu_general_news_classification_base --all

The output shows three input tensors (input_ids:0, input_mask:0, segment_ids:0) and three output tensors (logits, predictions, probabilities). The key inference result is ArgMax:0 under predictions, which holds the final classification category.

    MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

    signature_def['serving_default']:
      The given SavedModel SignatureDef contains the following input(s):
        inputs['input_ids'] tensor_info:
            dtype: DT_INT32
            shape: (-1, -1)
            name: input_ids:0
        inputs['input_mask'] tensor_info:
            dtype: DT_INT32
            shape: (-1, -1)
            name: input_mask:0
        inputs['segment_ids'] tensor_info:
            dtype: DT_INT32
            shape: (-1, -1)
            name: segment_ids:0
      The given SavedModel SignatureDef contains the following output(s):
        outputs['logits'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 28)
            name: app/ez_dense/BiasAdd:0
        outputs['predictions'] tensor_info:
            dtype: DT_INT32
            shape: (-1)
            name: ArgMax:0
        outputs['probabilities'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 28)
            name: Softmax:0
      Method name is: tensorflow/serving/predict

Prepare test data using the
tokenizers library. This example encodes a batch of four news articles padded to sequence length 128, then formats the result as a TensorFlow FeedDict.

    from tokenizers import BertWordPieceTokenizer

    # Initialize the tokenizer from the vocab.txt file in the model directory.
    tokenizer = BertWordPieceTokenizer('./nlu_general_news_classification_base/vocab.txt')

    # Group four news articles into a batch for encoding.
    news = [
        'Mexico declares a health emergency with over 1,000 confirmed cases. Chinanews.com, March 31 - According to comprehensive reports, Mexico\'s COVID-19 cases have surpassed 1,000. On March 30, the Mexican government declared a health emergency and strengthened measures to curb the spread of the pandemic.',
        'Data from the National Bureau of Statistics shows that in August, China\'s Manufacturing Purchasing Managers\' Index (PMI) was 50.1%, remaining above the threshold but 0.3 percentage points lower than the previous month.',
        'UTC+8, August 31 - In the final round of the men\'s blind football group stage at the just-concluded Tokyo Paralympics, the Chinese team defeated host Japan 2-0 with a brace from Zhu Ruiming, advancing to the semifinals with a record of two wins and one loss.',
        'As of August 30, the "Zhurong" Mars rover has been on the surface of Mars for 100 days. In these 100 days, "Zhurong" has traveled a cumulative 1064 meters south of its landing site, carrying 6 scientific payloads and acquiring about 10 GB of raw scientific data.',
    ]
    tokenized = tokenizer.encode_batch(news)

    # Pad the sequence length to 128.
    def pad(seq, seq_len, padding_val):
        return seq + [padding_val] * (seq_len - len(seq))

    input_ids = [pad(tok.ids, 128, 0) for tok in tokenized]
    segment_ids = [pad(tok.type_ids, 128, 0) for tok in tokenized]
    input_mask = [pad([1] * len(tok.ids), 128, 0) for tok in tokenized]

    # The final test data is in the TensorFlow FeedDict format.
    test_data = {
        "input_ids:0": input_ids,
        "segment_ids:0": segment_ids,
        "input_mask:0": input_mask,
    }

Load the original model and run a baseline inference to confirm the expected output.

    import tensorflow.compat.v1 as tf
    import json

    # Load the label mapping file to get the category names that correspond to the output integers.
    with open('./nlu_general_news_classification_base/label_mapping.json') as f:
        MAPPING = {v: k for k, v in json.load(f).items()}

    # Load and run the model.
    cfg = tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    with tf.Session(config=cfg) as sess:
        tf.saved_model.loader.load(sess, ['serve'], './nlu_general_news_classification_base')
        result = sess.run('ArgMax:0', test_data)
        print([MAPPING[r] for r in result])

Expected output:

    ['International', 'Finance', 'Sports', 'Science']
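The pad helper in this step assumes that every article encodes to at most 128 tokens. A longer input would come back longer than 128 and make the batch ragged. As a minimal sketch, a variant that also truncates could look like the following. `build_features` is a hypothetical helper name, and the token IDs are placeholders standing in for tok.ids and tok.type_ids.

```python
def build_features(token_ids, type_ids, seq_len, pad_id=0):
    """Pad or truncate one encoded example to exactly seq_len positions.

    Returns (input_ids, segment_ids, input_mask) as plain Python lists.
    """
    ids = list(token_ids)[:seq_len]
    types = list(type_ids)[:seq_len]
    mask = [1] * len(ids)          # 1 marks a real token, 0 marks padding
    n_pad = seq_len - len(ids)
    return (
        ids + [pad_id] * n_pad,
        types + [0] * n_pad,
        mask + [0] * n_pad,
    )

# Placeholder token ids; real values would come from the tokenizer.
short = build_features([101, 2023, 102], [0, 0, 0], seq_len=5)
long_ = build_features(list(range(1, 9)), [0] * 8, seq_len=5)
print(short)
print(long_)
```

Note that this naive truncation drops whatever sits at the end of the sequence, including a trailing separator token. The tokenizers library also exposes enable_truncation and enable_padding on its tokenizer classes, which may be a better fit; check the documentation for your installed version.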
Step 2: Optimize the model with PAI-Blade
Call blade.optimize with the SavedModel path and your test data. Use optimization level o1 (lossless) targeting GPU devices.

Do not pass the inputs or outputs parameters. PAI-Blade infers input and output nodes automatically from the SavedModel signature. The first return value is the path to the optimized SavedModel. For information about other optimization levels and when to use them, see the Python interface documentation.

    import blade

    saved_model_dir = 'nlu_general_news_classification_base'
    optimized_model, _, report = blade.optimize(
        saved_model_dir,        # The model path.
        'o1',                   # O1 lossless optimization.
        device_type='gpu',      # Optimize for GPU devices.
        test_data=[test_data]   # The test data.
    )

After optimization completes, print the optimization report.

    print("Report: {}".format(report))

The report lists each optimization pass applied, its individual speedup, and the overall before/after latency. In this example, two passes produced measurable speedups, TfAutoMixedPrecisionGpu (1.46x) and TfAicompilerGpu (2.43x), for a combined 3.54x speedup that reduces inference time from about 35 ms to 9.9 ms. The other two passes, TfStripUnusedNodes and TfStripDebugOps, are marked effective but report no timing.

    {
      "software_context": [
        {"software": "tensorflow", "version": "1.15.0"},
        {"software": "cuda", "version": "10.0.0"}
      ],
      "hardware_context": {"device_type": "gpu", "microarchitecture": "T4"},
      "user_config": "",
      "diagnosis": {
        "model": "nlu_general_news_classification_base",
        "test_data_source": "user provided",
        "shape_variation": "dynamic",
        "message": "",
        "test_data_info": "input_ids:0 shape: (4, 128) data type: int64\nsegment_ids:0 shape: (4, 128) data type: int64\ninput_mask:0 shape: (4, 128) data type: int64"
      },
      "optimizations": [
        {"name": "TfStripUnusedNodes", "status": "effective", "speedup": "na", "pre_run": "na", "post_run": "na"},
        {"name": "TfStripDebugOps", "status": "effective", "speedup": "na", "pre_run": "na", "post_run": "na"},
        {"name": "TfAutoMixedPrecisionGpu", "status": "effective", "speedup": "1.46", "pre_run": "35.04 ms", "post_run": "24.02 ms"},
        {"name": "TfAicompilerGpu", "status": "effective", "speedup": "2.43", "pre_run": "23.99 ms", "post_run": "9.87 ms"}
      ],
      "overall": {"baseline": "35.01 ms", "optimized": "9.90 ms", "speedup": "3.54"},
      "model_info": {"input_format": "saved_model"},
      "compatibility_list": [{"device_type": "gpu", "microarchitecture": "T4"}],
      "model_sdk": {}
    }

The speedup in this example is for reference only. Actual results vary by model architecture, batch size, and GPU type. For a description of all report fields, see Optimization report.

Print the path of the optimized model.

    print("Optimized model: {}".format(optimized_model))

Output:

    Optimized model: /root/nlu_general_news_classification_base_blade_opt_20210901141823/nlu_general_news_classification_base

The optimized model is saved to a new directory, leaving the original model unchanged.
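If you want to check the report programmatically rather than by eye, the sketch below summarizes it with the standard json module. It assumes the report is available as a JSON string, which the printed output in this step suggests; if your version returns a dictionary, skip the json.loads call. The embedded text is an excerpt of the report shown in this step, and `to_ms` and `summarize` are hypothetical helper names.

```python
import json

# Excerpt of the report from this step, trimmed to the fields used here.
report_json = """
{
  "optimizations": [
    {"name": "TfStripUnusedNodes", "status": "effective", "speedup": "na", "pre_run": "na", "post_run": "na"},
    {"name": "TfStripDebugOps", "status": "effective", "speedup": "na", "pre_run": "na", "post_run": "na"},
    {"name": "TfAutoMixedPrecisionGpu", "status": "effective", "speedup": "1.46", "pre_run": "35.04 ms", "post_run": "24.02 ms"},
    {"name": "TfAicompilerGpu", "status": "effective", "speedup": "2.43", "pre_run": "23.99 ms", "post_run": "9.87 ms"}
  ],
  "overall": {"baseline": "35.01 ms", "optimized": "9.90 ms", "speedup": "3.54"}
}
"""

def to_ms(text):
    """Convert a string such as '35.01 ms' to a float number of milliseconds."""
    return float(text.split()[0])

def summarize(report_text):
    """Return the passes that report a speedup and the recomputed overall ratio."""
    data = json.loads(report_text)
    timed = [
        (opt["name"], float(opt["speedup"]))
        for opt in data["optimizations"]
        if opt["status"] == "effective" and opt["speedup"] != "na"
    ]
    overall = data["overall"]
    recomputed = to_ms(overall["baseline"]) / to_ms(overall["optimized"])
    return timed, recomputed

timed_passes, recomputed_speedup = summarize(report_json)
for name, speedup in timed_passes:
    print("{}: {:.2f}x".format(name, speedup))
print("baseline / optimized = {:.2f}x".format(recomputed_speedup))
```

Dividing the baseline by the optimized latency reproduces the reported 3.54x. The two per-pass speedups multiply to roughly 3.55x, which agrees with the overall figure up to rounding.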
Step 3: Verify performance and correctness
Run a benchmark to confirm that the optimization report numbers match measured latency and that prediction results remain correct.
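The measurement in this step follows a common pattern: a few untimed warm-up calls, then many timed calls. As a framework-independent sketch of that pattern, the helper below also records percentiles, since an average alone can hide occasional slow calls. `measure_latency` is a hypothetical helper, and the workload is a stand-in. In this tutorial it would be a call such as `lambda: sess.run('ArgMax:0', test_data)` inside an open session.

```python
import math
import time

def measure_latency(fn, warmup=10, runs=1000):
    """Call fn repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()                      # untimed warm-up calls
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()

    def percentile(p):
        # Nearest-rank percentile on the sorted samples.
        rank = math.ceil(p * len(samples) / 100.0)
        return samples[max(0, rank - 1)]

    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": samples[-1],
    }

# Stand-in workload so the sketch runs without TensorFlow or a GPU.
stats = measure_latency(lambda: sum(range(1000)), warmup=5, runs=200)
print({name: round(value, 4) for name, value in stats.items()})
```

Timing each call separately adds a small overhead per sample, so the mean from this helper can be slightly higher than a single timer wrapped around the whole loop.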
The benchmark function below warms up the model 10 times, then runs 1,000 timed inference calls and prints the average latency and final prediction.

    import time

    def benchmark(model, test_data):
        tf.reset_default_graph()
        with tf.Session() as sess:
            sess.graph.as_default()
            tf.saved_model.loader.load(sess, ['serve'], model)
            # Warm up.
            for i in range(0, 10):
                result = sess.run('ArgMax:0', test_data)
            # Benchmark.
            num_runs = 1000
            start = time.time()
            for i in range(0, num_runs):
                result = sess.run('ArgMax:0', test_data)
            elapsed = time.time() - start
            rt_ms = elapsed / num_runs * 1000.0
            # Show the result.
            print("Latency of model: {:.2f} ms.".format(rt_ms))
            print("Predict result: {}".format([MAPPING[r] for r in result]))

Benchmark the original model.

    benchmark('nlu_general_news_classification_base', test_data)

Output:

    Latency of model: 36.20 ms.
    Predict result: ['International', 'Finance', 'Sports', 'Science']

The 36.20 ms latency is consistent with the "baseline": "35.01 ms" value in the overall section of the optimization report.

Benchmark the optimized model.
The TfAicompilerGpu optimization uses AICompiler, which compiles asynchronously by default, meaning the unoptimized model handles requests during compilation. Set TAO_COMPILATION_MODE_ASYNC=0 before benchmarking to force synchronous compilation and get accurate latency numbers.

    import os
    os.environ['TAO_COMPILATION_MODE_ASYNC'] = '0'
    benchmark(optimized_model, test_data)

Output:

    Latency of model: 9.87 ms.
    Predict result: ['International', 'Finance', 'Sports', 'Science']

The 9.87 ms latency is consistent with the "optimized": "9.90 ms" value in the optimization report. The prediction results are identical to those of the original model, confirming correctness.
Step 4: Deploy the optimized model
This section covers the Python SDK. For the C++ SDK, see Use an SDK to deploy a TensorFlow model for inference.
(Optional, trial period only) Set the following environment variable to prevent the program from quitting unexpectedly due to an authentication failure.

    export BLADE_AUTH_USE_COUNTING=1

Authenticate to use PAI-Blade by setting your region and authentication token.

    export BLADE_REGION=<region>
    export BLADE_TOKEN=<token>

Replace the following placeholders:

Placeholder    Description
<region>       The region where you use PAI-Blade. Get the available regions by joining the PAI-Blade DingTalk user group. For the group QR code, see Install PAI-Blade.
<token>        The authentication token for PAI-Blade. Get the token by joining the PAI-Blade DingTalk user group. For the group QR code, see Install PAI-Blade.

Load and run the optimized model. The only code change needed is adding import blade.runtime.tensorflow; the rest of your inference code stays the same.

    import tensorflow.compat.v1 as tf
    import blade.runtime.tensorflow

    # Replace <your_optimized_model_path> with the path of the optimized model.
    savedmodel_dir = <your_optimized_model_path>
    # Replace <your_infer_data> with the data for inference.
    infer_data = <your_infer_data>

    with tf.Session() as sess:
        sess.graph.as_default()
        tf.saved_model.loader.load(sess, ['serve'], savedmodel_dir)
        result = sess.run('ArgMax:0', infer_data)
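Because authentication in this step is configured through the BLADE_REGION and BLADE_TOKEN environment variables, a deployment script may want to fail early when either one is missing. The following is a minimal sketch that checks only presence, not validity. `missing_blade_env` is a hypothetical helper, and the values in the demonstration are placeholders.

```python
import os

REQUIRED_VARS = ("BLADE_REGION", "BLADE_TOKEN")

def missing_blade_env(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Demonstration with placeholder values; call with no argument to check
# the real process environment.
only_region = missing_blade_env({"BLADE_REGION": "example-region"})
both_set = missing_blade_env(
    {"BLADE_REGION": "example-region", "BLADE_TOKEN": "example-token"}
)
print(only_region)
print(both_set)
```

A service could call `missing_blade_env()` before loading the model and raise an error that names the missing variables, so a misconfigured host fails at startup instead of at the first request.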
What's next
Python interface documentation: full reference for blade.optimize parameters and return values
Optimization report: description of all fields in the optimization report
Use an SDK to deploy a TensorFlow model for inference: C++ SDK integration guide