Convert a large language model into an inference service

A large language model (LLM) is a neural network language model with hundreds of millions of parameters or more, such as GPT-3, GPT-4, PaLM, and PaLM 2. When you need to process large amounts of natural-language data or want to build a sophisticated language-understanding system, you can turn an LLM into an inference service and integrate advanced NLP capabilities (such as text classification, sentiment analysis, and machine translation) into your applications through an API. By serving an LLM this way, you avoid expensive infrastructure costs and can respond to market changes quickly; because the model runs in the cloud, you can also scale the service at any time to handle peaks in user requests, improving operational efficiency.

Prerequisites

Step 1: Build a custom runtime

Build a custom runtime that serves a HuggingFace LLM with a prompt-tuning configuration. The default values in this example point to a pre-built custom runtime image and a pre-built prompt-tuning configuration.

  1. Implement a class that inherits from MLServer's MLModel.

    The peft_model_server.py file contains all of the code needed to serve a HuggingFace LLM with a prompt-tuning configuration. The _load_model function in this file selects a pre-trained LLM together with the trained PEFT prompt-tuning configuration. It also defines the tokenizer, which encodes and decodes the raw string inputs of inference requests so that users do not have to preprocess their input into tensor bytes.

    peft_model_server.py:

    from typing import List
    
    from mlserver import MLModel, types
    from mlserver.codecs import decode_args
    
    from peft import PeftModel, PeftConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
    import os
    
    class PeftModelServer(MLModel):
        async def load(self) -> bool:
            self._load_model()
            self.ready = True
            return self.ready
    
        @decode_args
        async def predict(self, content: List[str]) -> List[str]:
            return self._predict_outputs(content)
    
        def _load_model(self):
            # Base model and PEFT prompt-tuning configuration, both overridable
            # via environment variables.
            model_name_or_path = os.environ.get("PRETRAINED_MODEL_PATH", "bigscience/bloomz-560m")
            peft_model_id = os.environ.get("PEFT_MODEL_ID", "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM")
            # The tokenizer encodes/decodes raw strings so that callers do not
            # have to send tensor bytes.
            self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, local_files_only=True)
            config = PeftConfig.from_pretrained(peft_model_id)
            self.model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
            # Wrap the base model with the trained prompt-tuning weights.
            self.model = PeftModel.from_pretrained(self.model, peft_model_id)
            self.text_column = os.environ.get("DATASET_TEXT_COLUMN_NAME", "Tweet text")
            return
    
        def _predict_outputs(self, content: List[str]) -> List[str]:
            output_list = []
            for text in content:
                # Build the prompt as "<text column> : <text> Label : ".
                inputs = self.tokenizer(
                    f'{self.text_column} : {text} Label : ',
                    return_tensors="pt",
                )
                with torch.no_grad():
                    inputs = {k: v for k, v in inputs.items()}
                    outputs = self.model.generate(
                        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=10, eos_token_id=3
                    )
                    # Decode the generated token IDs back into strings.
                    outputs = self.tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)
                output_list.append(outputs[0])
            return output_list
    
  2. Build the Docker image.

    After implementing the model class, you need to package its dependencies, including MLServer, into an image that is supported as a ServingRuntime resource. You can refer to the following Dockerfile to build the image; example build-and-push commands are sketched after the Dockerfile.

    Dockerfile:

    # TODO: choose appropriate base image, install Python, MLServer, and
    # dependencies of your MLModel implementation
    FROM python:3.8-slim-buster
    RUN pip install mlserver peft transformers datasets
    # ...
    
    # The custom `MLModel` implementation should be on the Python search path
    # instead of relying on the working directory of the image. If using a
    # single-file module, this can be accomplished with:
    COPY --chown=${USER} ./peft_model_server.py /opt/peft_model_server.py
    ENV PYTHONPATH=/opt/
    
    # environment variables to be compatible with ModelMesh Serving
    # these can also be set in the ServingRuntime, but this is recommended for
    # consistency when building and testing
    ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
        MLSERVER_GRPC_PORT=8001 \
        MLSERVER_HTTP_PORT=8002 \
        MLSERVER_LOAD_MODELS_AT_STARTUP=false \
        MLSERVER_MODEL_NAME=peft-model
    
    # With this setting, the implementation field is not required in the model
    # settings which eases integration by allowing the built-in adapter to generate
    # a basic model settings file
    ENV MLSERVER_MODEL_IMPLEMENTATION=peft_model_server.PeftModelServer
    
    CMD mlserver start ${MLSERVER_MODELS_DIR}
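
    The image can then be built and pushed with standard Docker commands, for example as follows. The registry and tag are placeholders; use your own values and keep them consistent with the image referenced in the ServingRuntime in the next step.

    # Build the custom runtime image from the directory containing the Dockerfile.
    docker build -t <your-registry>/peft-model-server:latest .
    # Push the image to a registry that your cluster can pull from.
    docker push <your-registry>/peft-model-server:latest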
    
  3. Create a new ServingRuntime resource.

    1. Save the following content as sample-runtime.yaml to create a new ServingRuntime resource, and point it to the image you just built.

      sample-runtime.yaml:

      apiVersion: serving.kserve.io/v1alpha1
      kind: ServingRuntime
      metadata:
        name: peft-model-server
        namespace: modelmesh-serving
      spec:
        supportedModelFormats:
          - name: peft-model
            version: "1"
            autoSelect: true
        multiModel: true
        grpcDataEndpoint: port:8001
        grpcEndpoint: port:8085
        containers:
          - name: mlserver
            image: registry.cn-beijing.aliyuncs.com/test/peft-model-server:latest
            env:
              - name: MLSERVER_MODELS_DIR
                value: "/models/_mlserver_models/"
              - name: MLSERVER_GRPC_PORT
                value: "8001"
              - name: MLSERVER_HTTP_PORT
                value: "8002"
              - name: MLSERVER_LOAD_MODELS_AT_STARTUP
                value: "true"
              - name: MLSERVER_MODEL_NAME
                value: peft-model
              - name: MLSERVER_HOST
                value: "127.0.0.1"
              - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
                value: "-1"
              - name: PRETRAINED_MODEL_PATH
                value: "bigscience/bloomz-560m"
              - name: PEFT_MODEL_ID
                value: "aipipeline/bloomz-560m_PROMPT_TUNING_CAUSAL_LM"
              # - name: "TRANSFORMERS_OFFLINE"
              #   value: "1"  
              # - name: "HF_DATASETS_OFFLINE"
              #   value: "1"    
            resources:
              requests:
                cpu: 500m
                memory: 4Gi
              limits:
                cpu: "5"
                memory: 5Gi
        builtInAdapter:
          serverType: mlserver
          runtimeManagementPort: 8001
          memBufferBytes: 134217728
          modelLoadingTimeoutMillis: 90000
      
    2. Run the following command to deploy the ServingRuntime resource.

      kubectl apply -f sample-runtime.yaml

      After it is created, you can see the new custom runtime in the ModelMesh deployment.
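
      For example, you can list the ServingRuntime resources in the namespace used above to confirm that the runtime has been registered:

      kubectl get servingruntimes -n modelmesh-serving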

Step 2: Deploy the LLM service

To deploy a model with the newly created runtime, you need to create an InferenceService resource to serve it. This resource is the main interface that KServe and ModelMesh use to manage models, and it represents the model's logical endpoint for inference.

  1. Use the following content to create an InferenceService resource that serves the model.

    YAML:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: peft-demo
      namespace: modelmesh-serving
      annotations:
        serving.kserve.io/deploymentMode: ModelMesh
    spec:
      predictor:
        model:
          modelFormat:
            name: peft-model
          runtime: peft-model-server
          storage:
            key: localMinIO
            path: sklearn/mnist-svm.joblib
    

    In this YAML, the InferenceService is named peft-demo and declares the model format peft-model, the same format used by the example custom runtime created earlier. The optional runtime field is also passed to tell ModelMesh explicitly to use the peft-model-server runtime to deploy this model.

  2. Run the following command to deploy the InferenceService resource.

    kubectl apply -f ${YAML file name}.yaml
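
    To confirm that the model has been loaded, you can check the status of the InferenceService; the name and namespace below match the YAML above:

    kubectl get inferenceservice peft-demo -n modelmesh-serving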

Step 3: Run the inference service

Use the curl command to send an inference request to the LLM model service deployed above.

MODEL_NAME="peft-demo"
ASM_GW_IP="<ASM gateway IP address>"
curl -X POST -k http://${ASM_GW_IP}:8008/v2/models/${MODEL_NAME}/infer -d @./input.json

In the curl command, input.json contains the request data:

{
    "inputs": [
        {
          "name": "content",
          "shape": [1],
          "datatype": "BYTES",
          "contents": {"bytes_contents": ["RXZlcnkgZGF5IGlzIGEgbmV3IGJpbm5pbmcsIGZpbGxlZCB3aXRoIG9wdGlvbnBpZW5pbmcgYW5kIGhvcGU="]}
        }
    ]
}

bytes_contents is the Base64 encoding of the string "Every day is a new beginning, filled with opportunities and hope".
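
To produce the bytes_contents value for your own input text, you can Base64-encode the string yourself, for example with the base64 CLI (the sentence below is only a placeholder):

echo -n "your input sentence" | base64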

The JSON response is as follows:

{
 "modelName": "peft-demo__isvc-5c5315c302",
 "outputs": [
  {
   "name": "output-0",
   "datatype": "BYTES",
   "shape": [
    "1",
    "1"
   ],
   "parameters": {
    "content_type": {
     "stringParam": "str"
    }
   },
   "contents": {
    "bytesContents": [
     "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50"
    ]
   }
  }
 ]
}

The Base64-decoded content of bytesContents is shown below, indicating that the request to the LLM model service above behaved as expected.

Tweet text : Every day is a new binning, filled with optionpiening and hope Label : no complaint
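
For reference, the decoding can be reproduced with the base64 CLI:

echo "VHdlZXQgdGV4dCA6IEV2ZXJ5IGRheSBpcyBhIG5ldyBiaW5uaW5nLCBmaWxsZWQgd2l0aCBvcHRpb25waWVuaW5nIGFuZCBob3BlIExhYmVsIDogbm8gY29tcGxhaW50" | base64 --decode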