Build a knowledge base application using a large language model and vector retrieval on ACK


This topic shows you how to quickly build a Retrieval-Augmented Generation (RAG) knowledge base system. You will use the open-source Langchain-Chatchat framework to integrate the DeepGPU-LLM inference engine with AnalyticDB for PostgreSQL on Alibaba Cloud Container Service for Kubernetes (ACK).

Background information

  • DeepGPU-LLM is a high-performance inference engine from Alibaba Cloud for running large language model (LLM) inference on GPU cloud servers. For more information, see What is the DeepGPU-LLM inference engine?.

  • AnalyticDB for PostgreSQL is an Alibaba Cloud-optimized service based on the open-source Greenplum project. It is compatible with ANSI SQL 2003 and the PostgreSQL/Oracle ecosystem, and supports both row and column storage modes. It delivers high performance for offline data processing and high-concurrency online analytical queries. For more information, see AnalyticDB for PostgreSQL overview.

  • For more information about the open-source local knowledge base Q&A project Langchain-Chatchat, see Langchain-Chatchat.

Important
  • Alibaba Cloud does not guarantee the legality, security, or accuracy of third-party models and is not liable for any damages that arise from their use.

  • You are solely responsible for the legal and compliant use of these models.

Prerequisites

Create an ACK Pro cluster or an ACK Serverless cluster that meets the following conditions:

  • The cluster contains GPU nodes. The GPU instance type must be from the V100 or A10 series; vGPUs are not supported. To prevent out-of-memory errors, use a GPU instance type with at least 24 GiB of GPU memory.

  • A NAT Gateway is configured for the cluster to provide public network access to the cluster's nodes and applications.
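Before you deploy, you may want to confirm that the cluster actually exposes GPU capacity. The following is a minimal sketch, not an official procedure. It assumes that kubectl is already configured for the cluster and that GPU nodes advertise the nvidia.com/gpu resource; the function names are hypothetical helpers.

```shell
# List each node with its instance type and allocatable GPU count.
# Assumption: GPU nodes expose the nvidia.com/gpu resource.
list_gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Count the nodes that report at least one allocatable GPU.
# Nodes without GPUs show "<none>", which awk treats as 0 in "$3 + 0".
count_gpu_nodes() {
  list_gpu_nodes | awk 'NR > 1 && $3 + 0 > 0 { n++ } END { print n + 0 }'
}

# Usage:
#   list_gpu_nodes
#   count_gpu_nodes
```

A count of 0 suggests that no GPU node has joined the cluster yet. The check does not verify GPU memory, so compare the listed instance types against the 24 GiB recommendation yourself.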

Billing

Using an ACK Pro cluster or an ACK Serverless cluster incurs charges for any Alibaba Cloud resources you create, such as GPU nodes and NAT Gateways. For more information, see Billing.

Step 1: Deploy the Langchain-Chatchat application

  1. Log on to the ACK console. In the left navigation pane, click Marketplace > Marketplace.

  2. On the App Marketplace page, search for langchain-chatchat and click the application.

  3. On the application details page, click One-Click Deploy in the upper-right corner. In the User Notice dialog box, read the notice carefully and click I Understand.

  4. On the Basic Information page, specify the target cluster, namespace, and release name, and then click Next.

  5. On the Parameter Settings page, configure the parameters and click OK.

    The following table describes the parameters. This topic uses the Qwen model as an example.

    | Parameter | Description | Default |
    | --- | --- | --- |
    | llm.model | The name of the LLM. | qwen-7b-chat-aiacc |
    | llm.load8bit | Enable INT8 quantization for the LLM. | true |
    | llm.modelPVC | The PersistentVolumeClaim for model storage, mounted to the /llm-model directory in the container. | true |
    | llm.pod.replicas | The number of replicas for the model inference service. | 1 |
    | llm.pod.instanceType | The deployment method for the model inference service. Valid values: ecs (deploy to an ECS node) and eci (deploy to an Elastic Container Instance (ECI)). For an ACK Serverless cluster, set this to eci. | ecs |
    | chat.pod.replicas | The number of replicas for the application service. | 1 |
    | chat.pod.instanceType | The deployment method for the application. Valid values: ecs (deploy to an ECS node) and eci (deploy to an Elastic Container Instance (ECI)). For an ACK Serverless cluster, set this to eci. For information about how to use GPU instances in an ACK Serverless cluster, see Create a Pod by specifying a GPU specification. | ecs |
    | chat.kbPVC | The PersistentVolumeClaim for storing knowledge base files, mounted to the /root/Langchain-Chatchat/knowledge_base directory in the container. | None |
    | db.dbType | The type of vector database. Valid values: Faiss, ADB. | faiss |
    | db.embeddingModel | The embedding model. | text2vec-bge-large-chinese |

  6. In the left navigation pane of the target cluster, choose Workloads > Deployments. Select the namespace where Langchain-Chatchat is deployed and wait until each Deployment shows 1/1 Pods ready.

    Note

    Pulling the image can take 10 to 20 minutes.
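If you prefer the command line to the console, the readiness check in the last step can be sketched as follows. This is an illustration only. It assumes the release was installed into a namespace such as aigc (the namespace used later in this topic) and that this namespace contains only the Langchain-Chatchat Deployments; wait_for_chatchat is a hypothetical helper name.

```shell
# Block until every Deployment in the namespace reports the Available condition.
# $1: namespace of the release; $2: optional timeout (default 30m, which
# leaves room for the 10 to 20 minute image pull).
wait_for_chatchat() {
  namespace="$1"
  timeout="${2:-30m}"
  kubectl wait deployment --all \
    --namespace "$namespace" \
    --for=condition=Available \
    --timeout="$timeout"
}

# Usage:
#   wait_for_chatchat aigc
```

The command exits with a non-zero status if the timeout elapses first, which makes it usable in scripts.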

Step 2: Access the service

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Network > Services.

  3. On the Services page, find the Service for the Langchain-Chatchat application. Its name is in the format chat-{releaseName}.

  4. Access the Langchain-Chatchat application from your browser.

    1. Run the following command to forward the chat Service in the cluster to your local port 8501.

      # Replace chat-chatchat with the actual Service name and aigc with the actual namespace where the Service is located.
      kubectl port-forward service/chat-chatchat 8501:8501 -n aigc

      Expected output:

      Forwarding from 127.0.0.1:8501 -> 8501
      Forwarding from [::1]:8501 -> 8501
      Handling connection for 8501
    2. Enter http://localhost:8501 in your browser to access the service.

      The expected output is as follows.

      The Langchain-Chatchat interface (version v0.2.6) loads successfully. The left sidebar contains the Chat and Knowledge Base Management navigation options. The chat mode is set to LLM Chat, and the LLM model shows as qwen-7b-chat-aiacc (Running). On the right, a chat area allows you to interact with the model.
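Because the Service name follows the chat-{releaseName} format, the port-forward step can be parameterized by release name. The sketch below shows one way to do this. The helper names are hypothetical, and the release name chatchat and the namespace aigc are the example values used above.

```shell
# Derive the Service name from the Helm release name (format: chat-{releaseName}).
chat_service_name() {
  printf 'chat-%s\n' "$1"
}

# Forward local port 8501 to the chat Service of the given release.
# $1: release name; $2: namespace.
forward_chat() {
  kubectl port-forward "service/$(chat_service_name "$1")" 8501:8501 --namespace "$2"
}

# Usage:
#   forward_chat chatchat aigc
```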

FAQ

How do I change the model?

  1. Download the model.

    To deploy other open-source models, download them from Hugging Face or ModelScope and store them in Object Storage Service (OSS) or Network Attached Storage (NAS). The following table lists some example models. For a complete list of compatible types, see Langchain-Chatchat.

    | Model type | Model name | Container model path |
    | --- | --- | --- |
    | DeepGPU-LLM converted model | qwen-7b-chat-aiacc | /llm-model/qwen-7b-chat-aiacc |
    | DeepGPU-LLM converted model | qwen-14b-chat-aiacc | /llm-model/qwen-14b-chat-aiacc |
    | DeepGPU-LLM converted model | chatglm2-6b-aiacc | /llm-model/chatglm2-6b-aiacc |
    | DeepGPU-LLM converted model | baichuan2-7b-chat-aiacc | /llm-model/baichuan2-7b-chat-aiacc |
    | DeepGPU-LLM converted model | baichuan2-13b-chat-aiacc | /llm-model/baichuan2-13b-chat-aiacc |
    | DeepGPU-LLM converted model | llama-2-7b-hf-aiacc | /llm-model/llama-2-7b-hf-aiacc |
    | DeepGPU-LLM converted model | llama-2-13b-hf-aiacc | /llm-model/llama-2-13b-hf-aiacc |
    | Open-source model | qwen-7b-chat | /llm-model/Qwen-7B-Chat |
    | Open-source model | qwen-14b-chat | /llm-model/Qwen-14B-Chat |
    | Open-source model | chatglm2-6b | /llm-model/chatglm2-6b |
    | Open-source model | chatglm2-6b-32k | /llm-model/chatglm2-6b-32k |
    | Open-source model | baichuan2-7b-chat | /llm-model/Baichuan2-7B-Chat |
    | Open-source model | baichuan2-13b-chat | /llm-model/Baichuan2-13B-Chat |
    | Open-source model | llama-2-7b-hf | /llm-model/Llama-2-7b-hf |
    | Open-source model | llama-2-13b-hf | /llm-model/Llama-2-13b-hf |

  2. (Optional) Convert the model.

    The Langchain-Chatchat project in this topic includes the DeepGPU-LLM model converter and uses the DeepGPU-LLM accelerated model qwen-7b-chat-aiacc by default.

    To use DeepGPU-LLM for inference optimization on other open-source LLMs, you must first convert the Hugging Face models to the format that DeepGPU-LLM supports.

    For example, to convert qwen-7b-chat, run the following command in the container to convert the model format. Alternatively, you can install the DeepGPU-LLM inference engine and use it to convert the model format. For more information, see Install and use DeepGPU-LLM.

    # Convert the Hugging Face Qwen-7B-Chat weights to the DeepGPU-LLM format (FP16, single GPU).
    huggingface_qwen_convert \
        -in_file /llm-model/Qwen-7B-Chat \
        -saved_dir /llm-model/qwen-7b-chat-aiacc \
        -infer_gpu_num 1 \
        -weight_data_type fp16 \
        -model_name qwen-7b
  3. Create a static PersistentVolume and PersistentVolumeClaim.

    1. This example uses OSS. After you download the model and store it in an OSS bucket, run the following command to create a Secret.

      kubectl create -f oss-secret.yaml

      The following is an example oss-secret.yaml file for creating a Secret. Replace <your AccessKeyID> and <your AccessKeySecret> with your credentials.

      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
        namespace: default
      stringData:
        akId: <your AccessKeyID>
        akSecret: <your AccessKeySecret>
    2. Run the following command to create a static PersistentVolume.

      kubectl create -f model-oss.yaml

      The following is an example model-oss.yaml file for creating a static PersistentVolume. Replace "<Your_Bucket_Name>" and "<Your_OSS_Endpoint>" with your bucket name and endpoint URL.

      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: model-oss
        labels:
          alicloud-pvname: model-oss
      spec:
        capacity:
          storage: 30Gi 
        accessModes:
          - ReadOnlyMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: model-oss
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: "<Your_Bucket_Name>"
            url: "<Your_OSS_Endpoint>" # The URL in this topic is oss-cn-hangzhou.aliyuncs.com.
            otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
            path: "/"
    3. Run the following command to create a static PersistentVolumeClaim.

      kubectl create -f model-pvc.yaml

      The following is an example model-pvc.yaml file for creating a static PersistentVolumeClaim.

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-pvc
      spec:
        accessModes:
          - ReadOnlyMany
        resources:
          requests:
            storage: 30Gi
        selector:
          matchLabels:
            alicloud-pvname: model-oss

    For specific parameter configurations, see Use a static ossfs 1.0 volume.

  4. Update the Helm values.

    1. Log on to the ACK console. In the left navigation pane, click Clusters.

    2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Helm.

    3. On the Helm page, find the deployed langchain-chatchat service and click Update in the Actions column. Change llm.model to the new model name and llm.modelPVC to the name of the PVC that stores the new model. For the model name and model mount path, see List of supported models.
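Before you update the Helm values, it can help to confirm that the PersistentVolumeClaim created in step 3 has bound to the static PersistentVolume. The following sketch assumes the names model-pvc and default from the example manifests above; pvc_phase and pvc_is_bound are hypothetical helper names.

```shell
# Print the phase of a PersistentVolumeClaim (for example, Pending or Bound).
# $1: PVC name; $2: namespace.
pvc_phase() {
  kubectl get pvc "$1" --namespace "$2" -o jsonpath='{.status.phase}'
}

# Succeed only when the PVC is bound to a PersistentVolume.
pvc_is_bound() {
  [ "$(pvc_phase "$1" "$2")" = "Bound" ]
}

# Usage:
#   pvc_is_bound model-pvc default && echo "model-pvc is ready"
```

A PVC can be mounted only by Pods in its own namespace, so create model-pvc in the namespace where the Langchain-Chatchat release runs before you set llm.modelPVC.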

How do I deploy using ECI Pods?

Set the llm.pod.instanceType and chat.pod.instanceType parameters to eci. The default Annotation and Label configurations for the ECI type are as follows. For more information about Annotations, see ECI Pod Annotation.

annotations:
  k8s.aliyun.com/eci-use-specs: ecs.gn6i-c8g1.2xlarge,ecs.gn6i-c16g1.4xlarge,ecs.gn6v-c8g1.8xlarge,ecs.gn7i-c8g1.2xlarge,ecs.gn7i-c16g1.4xlarge
  k8s.aliyun.com/eci-extra-ephemeral-storage: "50Gi"
labels:
  alibabacloud.com/eci: "true"

If you change the image or model, you must modify the k8s.aliyun.com/eci-use-specs and k8s.aliyun.com/eci-extra-ephemeral-storage annotations. Otherwise, the application will fail to start due to insufficient resources. For more information about ECI billing, see Billing overview.
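To see which Pods were scheduled to ECI and which instance types they were allowed to use, you can query the label and annotation shown above. This is a sketch under the assumption that the chart applies them to the Pods themselves; the helper names are hypothetical.

```shell
# List the Pods in a namespace that carry the ECI label.
# $1: namespace.
list_eci_pods() {
  kubectl get pods --namespace "$1" -l 'alibabacloud.com/eci=true' -o wide
}

# Print the candidate instance types recorded on a Pod.
# $1: Pod name; $2: namespace.
eci_specs() {
  kubectl get pod "$1" --namespace "$2" \
    -o jsonpath='{.metadata.annotations.k8s\.aliyun\.com/eci-use-specs}'
}

# Usage:
#   list_eci_pods aigc
#   eci_specs <pod-name> aigc
```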

How do I accelerate text vectorization?

The application's built-in embedding model is text2vec-bge-large-chinese. For more information, see text2vec-bge-large-chinese.

By default, the chat application runs the embedding model on the CPU, but you can accelerate text vectorization by requesting GPU resources in chat.pod.resources.

...
chat:
  pod:
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: "1"
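After you update the release, one way to confirm that the GPU request reached the workload is to read the limit back from the Deployment. The sketch below is illustrative. The Deployment name is an assumption (this topic documents only the Service name, chat-{releaseName}), so look up the actual name with kubectl get deployments first; gpu_limit is a hypothetical helper.

```shell
# Print the nvidia.com/gpu limit of the first container in a Deployment.
# Prints nothing when no GPU limit is set.
# $1: Deployment name (assumption: look it up first); $2: namespace.
gpu_limit() {
  kubectl get deployment "$1" --namespace "$2" \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.nvidia\.com/gpu}'
}

# Usage:
#   kubectl get deployments --namespace aigc
#   gpu_limit <chat-deployment-name> aigc
```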

How do I specify the vector database type?

Supported vector databases include Faiss and Alibaba Cloud AnalyticDB for PostgreSQL.

  • Faiss is an open-source, in-memory vector library developed by Facebook. Faiss is deployed in the chat Pod and is therefore constrained by the Pod's resources. If you use Faiss, we recommend increasing the chat Pod's memory.

  • AnalyticDB for PostgreSQL is a massively parallel processing (MPP) data warehousing service that provides online analysis of massive datasets.

To use AnalyticDB for PostgreSQL with this project, your instance must meet the following conditions:

  • The vector engine optimization feature must be enabled.

  • The compute node specification must be 4-core 16 GiB or higher.

After you confirm that the instance meets these conditions, set db.dbType to adb and enter the database connection information in db.adb:

db:
  dbType: adb
  embeddingModel: text2vec-bge-large-chinese
  adb:
    pgHost: "pg.host.demo.com"
    pgPort: "5432"
    pgDataBase: "langchain"
    pgUser: "langchain"
    pgPassword: "pg_password"
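Before you apply these values, you may want to verify that the database is reachable with the same connection settings. The following sketch assumes that the psql client is installed on a machine with network access to the instance and that the password is supplied through the PGPASSWORD environment variable; the helper names are hypothetical, and the host and credentials are the placeholder values from the example above.

```shell
# Build a libpq connection string from host, port, database, and user.
adb_conninfo() {
  printf 'host=%s port=%s dbname=%s user=%s' "$1" "$2" "$3" "$4"
}

# Run a trivial query to confirm connectivity and credentials.
# The password is read from the PGPASSWORD environment variable.
check_adb() {
  psql "$(adb_conninfo "$@")" -c 'SELECT version();'
}

# Usage:
#   export PGPASSWORD='pg_password'
#   check_adb pg.host.demo.com 5432 langchain langchain
```

This confirms connectivity and credentials only. It does not verify that the vector engine optimization feature is enabled.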

How do I change the number of service replicas?

Set llm.pod.replicas to the desired number of inference service replicas.

llm:
  model: qwen-7b-chat-aiacc
  modelPVC: "" # PVC name.
  load8bit: true
  pod:
    replicas: 1 # The number of replicas for the model inference service.

Set chat.pod.replicas to the desired number of application service replicas.

chat:
  kbPVC: ""
  pod:
    replicas: 1 # The number of replicas for the application service.

Clean up resources

The fees for using an ACK Pro cluster or ACK Serverless cluster consist of two parts:

  • Cluster management fees: Billed directly by Container Service for Kubernetes (ACK).

  • Cloud resource fees: Resources such as GPU nodes and storage are billed by their respective cloud services according to their billing rules.

After you finish the deployment, choose one of the following options:

  • Delete the cluster: If you no longer need the cluster, log on to the ACK console to delete it. For more information about deleting an ACK cluster, see Delete a cluster.

  • Continue using the cluster: Ensure your Alibaba Cloud account has a balance of at least 100 CNY. For the billing details of other Alibaba Cloud resources used with your ACK cluster, see Cloud resource fees.

Contact us

If you have questions or suggestions while completing the ACK AIGC tutorial, join the DingTalk group (DingTalk group ID: 31850017754) to discuss them.