Bidirectional Encoder Representations from Transformers (BERT) models deliver strong accuracy on natural language processing (NLP) tasks, but their high parameter count makes GPU inference expensive. PAI-Blade reduces inference latency without changing your model architecture or rewriting your inference code.

This tutorial walks you through optimizing a TensorFlow BERT news-classification model with PAI-Blade, from downloading the model to deploying the optimized version.

Prerequisites

Before you begin, ensure that you have:

  • Linux with Python 3.6 or later and CUDA 10.0

  • TensorFlow 1.15

  • PAI-Blade 3.16.0 or later

How it works

PAI-Blade analyzes your SavedModel and applies a pipeline of GPU-specific optimizations, including graph cleanup, mixed-precision conversion, and AI compiler fusion, then saves the result as a new SavedModel. Integration requires a single additional import; no other inference code changes are needed.

Step 1: Prepare the model and test data

  1. Install the tokenizers library.

    pip3 install tokenizers
  2. Download the example BERT model and extract it.

    wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/tutorials/bert_example/nlu_general_news_classification_base.tar.gz
    mkdir nlu_general_news_classification_base
    tar zxvf nlu_general_news_classification_base.tar.gz -C nlu_general_news_classification_base
  3. Inspect the model's input and output signature.

    saved_model_cli show --dir nlu_general_news_classification_base --all

    The output shows three input tensors (input_ids:0, input_mask:0, segment_ids:0) and three output tensors (logits, predictions, probabilities). The key inference result is ArgMax:0 under predictions, which holds the final classification category.

    MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
    
    signature_def['serving_default']:
      The given SavedModel SignatureDef contains the following input(s):
        inputs['input_ids'] tensor_info:
            dtype: DT_INT32
            shape: (-1, -1)
            name: input_ids:0
        inputs['input_mask'] tensor_info:
            dtype: DT_INT32
            shape: (-1, -1)
            name: input_mask:0
        inputs['segment_ids'] tensor_info:
            dtype: DT_INT32
            shape: (-1, -1)
            name: segment_ids:0
      The given SavedModel SignatureDef contains the following output(s):
        outputs['logits'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 28)
            name: app/ez_dense/BiasAdd:0
        outputs['predictions'] tensor_info:
            dtype: DT_INT32
            shape: (-1)
            name: ArgMax:0
        outputs['probabilities'] tensor_info:
            dtype: DT_FLOAT
            shape: (-1, 28)
            name: Softmax:0
      Method name is: tensorflow/serving/predict
  4. Prepare test data using the tokenizers library. This example encodes a batch of four news articles padded to sequence length 128, then formats the result as a TensorFlow FeedDict.

    from tokenizers import BertWordPieceTokenizer
    
    # Initialize the tokenizer from the vocab.txt file in the model directory.
    tokenizer = BertWordPieceTokenizer('./nlu_general_news_classification_base/vocab.txt')
    
    # Group four news articles into a batch for encoding.
    news = [
        'Mexico declares a health emergency with over 1,000 confirmed cases. Chinanews.com, March 31 - According to comprehensive reports, Mexico\'s COVID-19 cases have surpassed 1,000. On March 30, the Mexican government declared a health emergency and strengthened measures to curb the spread of the pandemic.',
        'Data from the National Bureau of Statistics shows that in August, China\'s Manufacturing Purchasing Managers\' Index (PMI) was 50.1%, remaining above the threshold but 0.3 percentage points lower than the previous month.',
        'UTC+8, August 31 - In the final round of the men\'s blind football group stage at the just-concluded Tokyo Paralympics, the Chinese team defeated host Japan 2-0 with a brace from Zhu Ruiming, advancing to the semifinals with a record of two wins and one loss.',
        'As of August 30, the "Zhurong" Mars rover has been on the surface of Mars for 100 days. In these 100 days, "Zhurong" has traveled a cumulative 1064 meters south of its landing site, carrying 6 scientific payloads and acquiring about 10 GB of raw scientific data.',
    ]
    tokenized = tokenizer.encode_batch(news)
    
    # Pad the sequence length to 128.
    def pad(seq, seq_len, padding_val):
        return seq + [padding_val] * (seq_len - len(seq))
    
    input_ids = [pad(tok.ids, 128, 0) for tok in tokenized]
    segment_ids = [pad(tok.type_ids, 128, 0) for tok in tokenized]
    input_mask = [ pad([1] * len(tok.ids), 128, 0) for tok in tokenized ]
    
    # The final test data is in the TensorFlow FeedDict format.
    test_data = {
        "input_ids:0": input_ids,
        "segment_ids:0": segment_ids,
        "input_mask:0": input_mask,
    }
  5. Load the original model and run a baseline inference to confirm the expected output.

    import tensorflow.compat.v1 as tf
    import json
    
    # Load the label mapping file to get the category names that correspond to the output integers.
    with open('./nlu_general_news_classification_base/label_mapping.json') as f:
        MAPPING = {v: k for k, v in json.load(f).items()}
    
    # Load and run the model.
    cfg = tf.ConfigProto()
    cfg.gpu_options.allow_growth = True
    with tf.Session(config=cfg) as sess:
        tf.saved_model.loader.load(sess, ['serve'], './nlu_general_news_classification_base')
        result = sess.run('ArgMax:0', test_data)
        print([MAPPING[r] for r in result])

    Expected output:

    ['International', 'Finance', 'Sports', 'Science']
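
The pad helper in step 4 only extends short sequences; an article that tokenizes to more than 128 ids would keep its full length and make the batch ragged. The sketch below is one possible guard, not part of the original tutorial: pad_or_truncate and build_feed_dict are hypothetical helper names, and the toy ids stand in for real tokenizer output.

```python
def pad_or_truncate(seq, seq_len, padding_val=0):
    """Return a copy of seq that is exactly seq_len items long."""
    clipped = list(seq[:seq_len])
    return clipped + [padding_val] * (seq_len - len(clipped))


def build_feed_dict(ids_batch, type_ids_batch, seq_len=128):
    """Build the three-tensor feed dict from per-example token ids and type ids."""
    feed = {
        "input_ids:0": [pad_or_truncate(ids, seq_len) for ids in ids_batch],
        "segment_ids:0": [pad_or_truncate(t, seq_len) for t in type_ids_batch],
        "input_mask:0": [pad_or_truncate([1] * len(ids), seq_len) for ids in ids_batch],
    }
    # A ragged batch cannot be fed as a (batch, seq_len) tensor, so check every row.
    for name, rows in feed.items():
        assert all(len(row) == seq_len for row in rows), name
    return feed


# Toy ids stand in for tokenizer output; the second example is longer than seq_len.
feed = build_feed_dict(
    [[101, 7, 102], [101, 5, 6, 7, 8, 102]],
    [[0, 0, 0], [0] * 6],
    seq_len=4,
)
print(feed["input_ids:0"])   # [[101, 7, 102, 0], [101, 5, 6, 7]]
print(feed["input_mask:0"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

Plain slicing drops the final token of an over-long example. If that matters for your data, the tokenizers library's own enable_truncation(max_length=128) is likely the better tool.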

Step 2: Optimize the model with PAI-Blade

  1. Call blade.optimize with the SavedModel path and your test data. Use optimization level o1 (lossless) targeting GPU devices.

    Do not pass the inputs or outputs parameters. PAI-Blade infers input and output nodes automatically from the SavedModel signature. The first return value is the path to the optimized SavedModel. For information about other optimization levels and when to use them, see the Python interface documentation.

    import blade
    
    saved_model_dir = 'nlu_general_news_classification_base'
    optimized_model, _, report = blade.optimize(
        saved_model_dir,       # The model path.
        'o1',                  # O1 lossless optimization.
        device_type='gpu',     # Optimize for GPU devices.
        test_data=[test_data]  # The test data.
    )
  2. After optimization completes, print the optimization report.

    The speedup in this example is for reference only. Actual results vary by model architecture, batch size, and GPU type. For a description of all report fields, see Optimization report.

    print("Report: {}".format(report))

    The report lists each optimization pass applied, its individual speedup, and the overall before/after latency. In this example, four passes took effect. The two graph-cleanup passes (TfStripUnusedNodes and TfStripDebugOps) report no separate timing, while TfAutoMixedPrecisionGpu (1.46x) and TfAicompilerGpu (2.43x) account for the overall 3.54x speedup, reducing inference time from 35.01 ms to 9.90 ms.

    {
      "software_context": [
        {"software": "tensorflow", "version": "1.15.0"},
        {"software": "cuda", "version": "10.0.0"}
      ],
      "hardware_context": {"device_type": "gpu", "microarchitecture": "T4"},
      "user_config": "",
      "diagnosis": {
        "model": "nlu_general_news_classification_base",
        "test_data_source": "user provided",
        "shape_variation": "dynamic",
        "message": "",
        "test_data_info": "input_ids:0 shape: (4, 128) data type: int64\nsegment_ids:0 shape: (4, 128) data type: int64\ninput_mask:0 shape: (4, 128) data type: int64"
      },
      "optimizations": [
        {"name": "TfStripUnusedNodes", "status": "effective", "speedup": "na", "pre_run": "na", "post_run": "na"},
        {"name": "TfStripDebugOps", "status": "effective", "speedup": "na", "pre_run": "na", "post_run": "na"},
        {"name": "TfAutoMixedPrecisionGpu", "status": "effective", "speedup": "1.46", "pre_run": "35.04 ms", "post_run": "24.02 ms"},
        {"name": "TfAicompilerGpu", "status": "effective", "speedup": "2.43", "pre_run": "23.99 ms", "post_run": "9.87 ms"}
      ],
      "overall": {"baseline": "35.01 ms", "optimized": "9.90 ms", "speedup": "3.54"},
      "model_info": {"input_format": "saved_model"},
      "compatibility_list": [{"device_type": "gpu", "microarchitecture": "T4"}],
      "model_sdk": {}
    }
  3. Print the path of the optimized model.

    print("Optimized model: {}".format(optimized_model))

    Output:

    Optimized model: /root/nlu_general_news_classification_base_blade_opt_20210901141823/nlu_general_news_classification_base

    The optimized model is saved to a new directory, leaving the original model unchanged.
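
If you want to check the report numbers in a script rather than by eye, a minimal sketch follows. It assumes the report is available as JSON text, as the printed output above suggests; if your PAI-Blade version returns a different object, adapt the loading step. The helper names parse_ms and summarize are hypothetical, and the embedded report is a trimmed copy of the one shown above.

```python
import json

# Trimmed copy of the report shown above; only the fields used here are kept.
report_json = """
{
  "optimizations": [
    {"name": "TfStripUnusedNodes", "status": "effective", "speedup": "na"},
    {"name": "TfAutoMixedPrecisionGpu", "status": "effective", "speedup": "1.46"},
    {"name": "TfAicompilerGpu", "status": "effective", "speedup": "2.43"}
  ],
  "overall": {"baseline": "35.01 ms", "optimized": "9.90 ms", "speedup": "3.54"}
}
"""


def parse_ms(value):
    """Convert a string such as '35.01 ms' to a float number of milliseconds."""
    return float(value.split()[0])


def summarize(report_text):
    """Return the passes with a measured speedup and the recomputed overall speedup."""
    report = json.loads(report_text)
    measured = {
        opt["name"]: float(opt["speedup"])
        for opt in report["optimizations"]
        if opt["status"] == "effective" and opt["speedup"] != "na"
    }
    overall = report["overall"]
    recomputed = parse_ms(overall["baseline"]) / parse_ms(overall["optimized"])
    return measured, round(recomputed, 2)


passes, speedup = summarize(report_json)
print(passes)   # {'TfAutoMixedPrecisionGpu': 1.46, 'TfAicompilerGpu': 2.43}
print(speedup)  # 3.54
```

Recomputing the ratio from the baseline and optimized latencies gives the same 3.54 that the report states in its overall section.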

Step 3: Verify performance and correctness

Run a benchmark to confirm that the optimization report numbers match measured latency and that prediction results remain correct.

The benchmark function below warms up the model 10 times, then runs 1,000 timed inference calls and prints the average latency and final prediction.

import time

def benchmark(model, test_data):
    tf.reset_default_graph()
    with tf.Session() as sess:
        sess.graph.as_default()
        tf.saved_model.loader.load(sess, ['serve'], model)
        # Warm up.
        for i in range(0, 10):
            result = sess.run('ArgMax:0', test_data)
        # Benchmark.
        num_runs = 1000
        start = time.time()
        for i in range(0, num_runs):
            result = sess.run('ArgMax:0', test_data)
        elapsed = time.time() - start
        rt_ms = elapsed / num_runs * 1000.0
        # Show the result.
        print("Latency of model: {:.2f} ms.".format(rt_ms))
        print("Predict result: {}".format([MAPPING[r] for r in result]))
  1. Benchmark the original model.

    benchmark('nlu_general_news_classification_base', test_data)

    Output:

    Latency of model: 36.20 ms.
    Predict result: ['International', 'Finance', 'Sports', 'Science']

    The 36.20 ms latency is consistent with the "baseline": "35.01 ms" value in the optimization report's overall section.

  2. Benchmark the optimized model.

    The TfAicompilerGpu optimization uses AICompiler, which compiles asynchronously by default, so the unoptimized model handles requests while compilation is still running. Set TAO_COMPILATION_MODE_ASYNC=0 before benchmarking to force synchronous compilation and get accurate latency numbers.

    import os
    os.environ['TAO_COMPILATION_MODE_ASYNC'] = '0'
    
    benchmark(optimized_model, test_data)

    Output:

    Latency of model: 9.87 ms.
    Predict result: ['International', 'Finance', 'Sports', 'Science']

    The 9.87 ms latency is consistent with the "optimized": "9.90 ms" value in the optimization report. The prediction results are identical to the original model, confirming correctness.
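
The benchmark function above reports only the average latency. For serving workloads, tail latency is often just as interesting, so the sketch below times each call separately and reports the median and 99th percentile as well. It is an illustration and not part of PAI-Blade: latency_stats is a hypothetical helper, and the lambda is a stand-in workload where the tutorial would wrap sess.run('ArgMax:0', test_data).

```python
import statistics
import time


def latency_stats(fn, warmup=10, num_runs=1000):
    """Time fn() num_runs times after warmup calls; return latency stats in ms."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style index for the 99th percentile, clamped to the last sample.
    p99_index = min(len(samples) - 1, int(len(samples) * 0.99))
    return {
        "mean": statistics.mean(samples),
        "p50": statistics.median(samples),
        "p99": samples[p99_index],
    }


# Stand-in workload so the sketch runs without TensorFlow or a GPU.
stats = latency_stats(lambda: sum(range(1000)), warmup=5, num_runs=200)
print({name: round(value, 4) for name, value in stats.items()})
```

Timing each call individually adds a small measurement overhead compared with timing the whole loop, so expect the mean to differ slightly from the figure printed by benchmark.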

Step 4: Deploy the optimized model

This section covers the Python SDK. For the C++ SDK, see Use an SDK to deploy a TensorFlow model for inference.

  1. (Optional, trial period only) Set the following environment variable to prevent the program from quitting unexpectedly due to an authentication failure.

    export BLADE_AUTH_USE_COUNTING=1
  2. Authenticate to use PAI-Blade by setting your region and authentication token.

    export BLADE_REGION=<region>
    export BLADE_TOKEN=<token>

    Replace the following placeholders:

      • <region>: The region where you use PAI-Blade. Get the available regions by joining the PAI-Blade DingTalk user group. For the group QR code, see Install PAI-Blade.

      • <token>: The authentication token for PAI-Blade. Get the token by joining the PAI-Blade DingTalk user group. For the group QR code, see Install PAI-Blade.

  3. Load and run the optimized model. The only code change needed is adding import blade.runtime.tensorflow; the rest of your inference code stays the same.

    import tensorflow.compat.v1 as tf
    import blade.runtime.tensorflow
    # Replace <your_optimized_model_path> with the path of the optimized model.
    savedmodel_dir = <your_optimized_model_path>
    # Replace <your_infer_data> with the data for inference.
    infer_data = <your_infer_data>
    
    with tf.Session() as sess:
        sess.graph.as_default()
        tf.saved_model.loader.load(sess, ['serve'], savedmodel_dir)
        result = sess.run('ArgMax:0', infer_data)
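
Because authentication depends on BLADE_REGION and BLADE_TOKEN being exported, it can help to check for them before loading the model. The sketch below is one way to do that with the standard library only. The helper name missing_blade_vars and the sample values are hypothetical, and the check says nothing about whether the token itself is valid.

```python
import os

# Variable names taken from the authentication step above.
REQUIRED_VARS = ("BLADE_REGION", "BLADE_TOKEN")


def missing_blade_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Explicit dicts are used here so the check runs without touching the real environment.
# "cn-example" and "placeholder" are made-up values, not real region or token strings.
print(missing_blade_vars({"BLADE_REGION": "cn-example"}))
print(missing_blade_vars({"BLADE_REGION": "cn-example", "BLADE_TOKEN": "placeholder"}))
```

In a deployment script you might call missing_blade_vars() with no argument and stop with a clear message if the returned list is not empty.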

What's next