Benchmarking LLM Inference Performance with PAI-BladeLLM

This topic describes how to run performance benchmarks against an LLM inference service that uses PAI-BladeLLM (BladeLLM for short).

Deploy the model service

You can follow the "Deploy a model service with PPU on PAI-EAS" documentation to set up the basic EAS environment and deploy the inference service. The example below uses the qwen2-72b-instruct model: the checkpoint is first split for tensor parallelism (tp=4), then quantized to a16w4 (16-bit activations, 4-bit weights), and finally deployed. Reference commands:

# Split the checkpoint for tensor parallelism (tp=4)
blade_llm_split --tensor_parallel_size 4 --model /path/to/your/models/qwen-ckpts/qwen2-72b-instruct --output_dir /path/to/your/models/qwen-ckpts/qwen2-72b-instruct/tp4pp1

# Quantize the split checkpoint with GPTQ to weight-only int4 (a16w4)
blade_llm_quantize --model /path/to/your/models/qwen-ckpts/qwen2-72b-instruct/tp4pp1 --output_dir /path/to/your/models/qwen-ckpts/qwen2-72b-instruct/tp4pp1_a16w4 --quant_mode weight_only_quant --tensor_parallel_size 4 --quant_dtype int --bit 4 --quant_algo gptq --experimental_pad

# Serve the quantized model on port 8888
blade_llm_server --port 8888 --model /path/to/your/models/qwen-ckpts/qwen2-72b-instruct/tp4pp1_a16w4 --tensor_parallel_size 4
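
Before running the benchmark, it can help to confirm that the deployed service answers requests at all. The snippet below is a minimal smoke test, assuming the service exposes the OpenAI-style /v1/completions endpoint that the benchmark script later in this topic targets, and that the EAS access token is passed via the Authorization header; the URL, token, and request body are placeholders to replace with your own values.

# Hypothetical endpoint and token -- substitute your EAS service URL and token.
ServerUrl=http://xxxxx.cn-wulanchabu.pai-eas.aliyuncs.com/api/predict/your_service_name
AccessToken=xxxxxx

# Send one short completion request; a JSON response means the service is reachable.
curl -s "${ServerUrl}/v1/completions" \
  -H "Authorization: ${AccessToken}" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 16}'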

Benchmarking procedure

  1. Prepare the benchmarking environment.

    Start a DSW instance, download the benchmarking tool, and install its dependencies; a quick sanity check of the environment is sketched after the commands below.

    cd /path/to/your/models              # directory containing the downloaded archive
    tar -xzvf archive.tar.gz             # extract the benchmarking tool
    cd /mnt/workspace/z**k/eas-model-deployment-main   # change into the extracted tool directory
    python -m venv venv-ben              # create an isolated virtual environment
    source venv-ben/bin/activate
    pip install -r requirements.txt      # install the benchmark dependencies

    (Figure: preparing the benchmarking environment)
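
    Optionally, verify that the dependencies resolve before launching a long run. The import list below is only illustrative; the authoritative list is requirements.txt:

    # Run inside the activated venv; adjust the module names to match requirements.txt.
    python -c "import aiohttp, transformers; print('benchmark dependencies OK')"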

  2. Modify the relevant parameters in the benchmarking tool's run.sh script, then run the benchmark.

    For reference: the script below runs 5 test groups, one per request_rate value in 2, 4, 6, 8, and 10. Each group uses an input length of 2000 tokens, an output length of 300 tokens, and 300 requests in total. After each run, the program aggregates the statistics (reporting tail latencies at the p99 percentile) and persists the results.

    #!/bin/bash
    #
    
    PYTHON=python3
    Now=$(date +"%Y%m%d_%H%M%S")
    echo ${Now}
    # Define input length, output length, and concurrency/request-rate combinations, plus other test parameters
    InputLengthList=(2000 2000 2000 2000 2000)
    OutPutLengthList=(300 300 300 300 300)
    # ConcurrencyList=(1 2 3 4 5)
    RequestRateList=(2 4 6 8 10)
    Percentile=99
    # RequestRate defaults to inf; to test QPS performance, set it to a number and disable the concurrency settings.
    # RequestRate=inf
    NumPrompts=300
    IgnoreEos=false
    
    
    input_length=${#InputLengthList[@]}
    output_length=${#OutPutLengthList[@]}
    # concurrency_length=${#ConcurrencyList[@]}  # only needed when ConcurrencyList above is enabled
    
    # if [[ $input_length -ne $output_length || $input_length -ne $concurrency_length ]]; then
    #   echo "parameters length should have the same length."
    #   exit 1
    # fi
    
    Dir=$(realpath $(dirname -- "${BASH_SOURCE[0]}"))
    BenchmarkDir=$(dirname ${Dir})
    ResultDir=${BenchmarkDir}/result/${Now}
    echo ${ResultDir}
    
    # Define the service url and other parameters for the serverside
    ServerUrl=http://xxxxx.cn-wulanchabu.pai-eas.aliyuncs.com/api/predict/test_mixtral_8x7b_ppu
    
    Endpoint=/v1/completions
    tokenizer=/path/to/your/models/eas-model-deployment-main/tokenizer/qwen
    ClientName=testing_client
    ModelPath=/path/to/your/models/qwen-ckpts/xxxxx
    LogInCSV=true
    
    AccessToken=xxxxxx
    if [ ! -d ${ResultDir} ]; then
        mkdir -p ${ResultDir}
    fi
    
    
    export PYTHONPATH=./
    echo ${AccessToken}
    LogDir=${ResultDir}/log/${ClientName}
    CSVFile=${ResultDir}/log/${ClientName}/summary.csv
    if [ ! -d ${LogDir} ]; then
        mkdir -p ${LogDir}
    fi
    
    if [[ $LogInCSV == true ]]; then
        touch ${CSVFile}
        saveFields=("num_prompts" "request_rate" "max_concurrency" "duration" "total_input_tokens" "total_output_tokens" "request_throughput" "output_throughput" "total_token_throughput" "mean_ttft_ms" "median_ttft_ms" "std_ttft_ms" "p${Percentile}_ttft_ms" "mean_tpot_ms" "median_tpot_ms" "std_tpot_ms" "p${Percentile}_tpot_ms" "mean_itl_ms" "median_itl_ms" "std_itl_ms" "p${Percentile}_itl_ms" "mean_e2el_ms" "median_e2el_ms" "std_e2el_ms" "p${Percentile}_e2el_ms" "avg_input_len" "avg_output_len")
    fi
    
    
    # Iterate over all combinations of batch size and sequence length
    
    for ((i=0;i<${#InputLengthList[*]};i++)) do
        InputLength=${InputLengthList[$i]}
        OutputLength=${OutPutLengthList[$i]}
        # Concurrency=${ConcurrencyList[$i]}
        RequestRate=${RequestRateList[$i]}
        echo ${InputLength} ${OutputLength} ${RequestRate}
        # for OutputLength in "${OutPutLengthList[@]}"; do
        #     for Concurrency in "${ConcurrencyList[@]}"; do
            echo "Running benchmark with input len :${InputLength},output len: ${OutputLength}, prompts:${NumPrompts}, concurrency:${Concurrency}, request_rate:${RequestRate}"
            result_file=${LogDir}/result_input_${InputLength}_output_${OutputLength}_prompts_${NumPrompts}_concurrency_${Concurrency}_request_rate_${RequestRate}.json
    
            AccessToken=${AccessToken} \
            python3 benchmark_serving.py \
            --backend openai \
            --base-url ${ServerUrl} \
            --endpoint ${Endpoint} \
            --tokenizer ${tokenizer} \
            --percentile-metrics ttft,tpot,itl,e2el \
            --save-result \
            --result-filename ${result_file} \
            --ignore-eos \
            --trust-remote-code \
            --model ${ModelPath} \
            --dataset-name random \
            --random-output-len ${OutputLength} \
            --random-input-len ${InputLength} \
            --num-prompts ${NumPrompts} \
            --request-rate ${RequestRate} \
            --metric-percentiles ${Percentile}
            # To cap concurrency instead, restore the trailing backslash above and uncomment:
            # --max-concurrency ${Concurrency}
            if [[ $LogInCSV == true ]]; then
                python3 ${BenchmarkDir}/eas-model-deployment-main/scripts/write_csv.py \
                --json_file ${result_file} \
                --csv_file ${CSVFile} \
                --fields ${saveFields[@]}
            fi
        #     done
        # done
    done

    The script supports multiple combinations of input length, output length, concurrency, and request rate (requests sent per second), along with a configurable total number of requests. To benchmark at fixed concurrency levels, enable ConcurrencyList and the related Concurrency lines and set the desired values; to benchmark QPS, leave the concurrency settings disabled and vary request_rate. Choose the parameters that match your test scenario. Each run's results are recorded as JSON and automatically aggregated into a CSV summary; a quick way to inspect them is sketched below.
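
    The field names written by write_csv.py (see saveFields in run.sh) also describe each JSON result, so a single run can be spot-checked without opening the CSV. A minimal sketch, assuming jq is installed and the metrics are stored as top-level JSON keys, as the write_csv.py invocation suggests:

    # Print headline p99 metrics for each matching run (file pattern comes from run.sh).
    jq '{request_rate, output_throughput, p99_ttft_ms, p99_tpot_ms, p99_e2el_ms}' \
        result_input_2000_output_300_prompts_300_*.json

    # Render the aggregated CSV summary as an aligned table in the terminal.
    column -s, -t < summary.csv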