Build a knowledge base application using a large language model and vector retrieval on ACK

Background information

DeepGPU-LLM is a high-performance inference engine from Alibaba Cloud for running large language model (LLM) inference on GPU cloud servers. For more information, see "What is the DeepGPU-LLM inference engine?"
AnalyticDB for PostgreSQL is an Alibaba Cloud-optimized service based on the open-source Greenplum project. It is compatible with ANSI SQL 2003 and the PostgreSQL/Oracle ecosystem, and supports both row and column storage modes. It delivers high performance for offline data processing and high-concurrency online analytical queries. For more information, see AnalyticDB for PostgreSQL overview.
For more information about the open-source local knowledge base Q&A project Langchain-Chatchat, see Langchain-Chatchat.

Prerequisites

Create an ACK Pro cluster or an ACK Serverless cluster. The cluster must meet the following conditions.

Add GPU nodes to the cluster. The GPU instance type must be from the V100 or A10 series. vGPUs are not supported. To prevent out-of-memory errors, use a GPU instance type with at least 24 GiB of GPU memory.
Configure a NAT Gateway for your cluster to provide public network access to the cluster's nodes and applications.
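
For an ACK Pro cluster, a check along the following lines may help confirm that GPU nodes are visible before you deploy. This is a minimal sketch, assuming that kubectl is configured for the cluster and that GPU nodes advertise the standard nvidia.com/gpu resource. The list_gpu_nodes function and the KUBECTL override variable are illustrative names, not part of ACK.

```shell
# Illustrative helper, not part of ACK. KUBECTL is a hypothetical override
# (for example KUBECTL=echo) that prints the command instead of running it.
list_gpu_nodes() {
  # Print every node with its allocatable GPU count; nodes without GPUs show <none>.
  "${KUBECTL:-kubectl}" get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

Call list_gpu_nodes after the GPU nodes are added. At least one node should report a GPU count of 1 or more.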

Billing

Using an ACK Pro cluster or an ACK Serverless cluster incurs charges for any Alibaba Cloud resources you create, such as GPU nodes and NAT Gateways. For more information, see Billing.

Step 1: Deploy the Langchain-Chatchat application

Log on to the ACK console. In the left navigation pane, click Marketplace > Marketplace.
On the App Marketplace page, search for langchain-chatchat and click the application.
On the application details page, click One-Click Deploy in the upper-right corner. In the User Notice dialog box, read the notice carefully and click I Understand.
On the Basic Information page, specify the target cluster, namespace, and release name, and then click Next.

On the Parameter Settings page, configure the parameters and click OK.

The following table describes the parameters. This topic uses the Qwen model as an example.

Parameter	Description	Default
llm.model	The name of the LLM.	qwen-7b-chat-aiacc
llm.load8bit	Enable INT8 quantization for the LLM.	true
llm.modelPVC	The PersistentVolumeClaim for model storage, mounted to the /llm-model directory in the container.	None
llm.pod.replicas	The number of replicas for the model inference service.	1
llm.pod.instanceType	The deployment method for the model inference service. Valid values: ecs: Deploy to an ECS node. eci: Deploy to an Elastic Container Instance (ECI). For an ACK Serverless cluster, set this to `eci`.	ecs
chat.pod.replicas	The number of replicas for the application service.	1
chat.pod.instanceType	The deployment method for the application. Valid values: ecs: Deploy to an ECS node. eci: Deploy to an Elastic Container Instance (ECI). For an ACK Serverless cluster, set this to `eci`. For information about how to use GPU instances in an ACK Serverless cluster, see Create a Pod by specifying a GPU specification.	ecs
chat.kbPVC	The PersistentVolumeClaim for storing knowledge base files, mounted to the /root/Langchain-Chatchat/knowledge_base directory in the container.	None
db.dbType	The type of vector database. Valid values: faiss, adb.	faiss
db.embeddingModel	The embedding model.	text2vec-bge-large-chinese

In the left-side navigation pane of the target cluster, choose Workloads > Deployments. Select the namespace where Langchain-Chatchat is deployed and wait for the Pods to become ready (the number of Pods changes to 1/1).

Note
Pulling the image can take 10 to 20 minutes.
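
If you prefer to wait from a terminal instead of refreshing the console, the following sketch blocks until the Pods are ready. It assumes that kubectl is configured for the cluster and that the namespace contains only the Langchain-Chatchat Pods. The wait_for_pods function, the KUBECTL override, and the 30m default timeout are illustrative choices; aigc is the example namespace used in this topic.

```shell
# Illustrative helper. Arguments: namespace (default: aigc) and timeout
# (default: 30m, an assumed margin over the 10 to 20 minute image pull).
# KUBECTL is a hypothetical override, for example KUBECTL=echo for a dry run.
wait_for_pods() {
  # Returns non-zero if any Pod in the namespace is not Ready before the timeout.
  "${KUBECTL:-kubectl}" wait pods --all --for=condition=Ready \
    --namespace "${1:-aigc}" --timeout="${2:-30m}"
}
```

For example, wait_for_pods aigc 30m returns once every Pod in the aigc namespace reports Ready.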

Step 2: Access the service

Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Network > Services.
On the Services page, find the Service where Langchain-Chatchat is deployed. Its name is in the format chat-{releaseName}.
Access the Langchain-Chatchat application from your browser.
1. Run the following command to forward the chat Service in the cluster to your local port 8501.
```
# Replace chat-chatchat with the actual Service name and aigc with the actual namespace where the Service is located.
kubectl port-forward service/chat-chatchat 8501:8501 -n aigc
```
  Expected output:
```
Forwarding from 127.0.0.1:8501 -> 8501
Forwarding from [::1]:8501 -> 8501
Handling connection for 8501
```
2. Enter http://localhost:8501 in your browser to access the service.
  
  The Langchain-Chatchat interface (v0.2.6) loads. The left sidebar contains the Chat and Knowledge Base Management navigation options, the chat mode is set to LLM Chat, and the LLM model is shown as qwen-7b-chat-aiacc (Running). Use the chat area on the right to interact with the model.
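
The access steps above can also be scripted as a quick smoke test. The following is a sketch under stated assumptions: kubectl and curl are installed locally, the Service listens on port 8501 as in the command above, and the web UI answers plain HTTP requests on its root path. The function names and the KUBECTL override are illustrative.

```shell
# The Service name follows the chat-{releaseName} format described above.
chat_service_name() {
  printf 'chat-%s\n' "$1"
}

# Illustrative helper. Arguments: release name, namespace (default: aigc),
# local port (default: 8501). KUBECTL is a hypothetical override.
probe_chatchat() {
  release="$1"; namespace="${2:-aigc}"; port="${3:-8501}"
  # Forward the Service to the local port in the background.
  "${KUBECTL:-kubectl}" port-forward "service/$(chat_service_name "$release")" \
    "${port}:8501" -n "$namespace" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 3   # give port-forward a moment to bind the local port
  # Print only the HTTP status code returned by the web UI.
  curl -sS -o /dev/null -w '%{http_code}\n' "http://localhost:${port}"
  status=$?
  kill "$pf_pid" 2>/dev/null   # stop forwarding
  return "$status"
}
```

For a release named chatchat in the aigc namespace, probe_chatchat chatchat aigc forwards service/chat-chatchat, prints the status code, and stops forwarding.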

FAQ

How do I change the model?

Download the model.

To deploy other open-source models, download them from Hugging Face or ModelScope and store them in Object Storage Service (OSS) or Network Attached Storage (NAS). The following table lists some example models. For a complete list of compatible types, see Langchain-Chatchat.

Model type	Model name	Container model path
DeepGPU-LLM converted model	qwen-7b-chat-aiacc	/llm-model/qwen-7b-chat-aiacc
DeepGPU-LLM converted model	qwen-14b-chat-aiacc	/llm-model/qwen-14b-chat-aiacc
DeepGPU-LLM converted model	chatglm2-6b-aiacc	/llm-model/chatglm2-6b-aiacc
DeepGPU-LLM converted model	baichuan2-7b-chat-aiacc	/llm-model/baichuan2-7b-chat-aiacc
DeepGPU-LLM converted model	baichuan2-13b-chat-aiacc	/llm-model/baichuan2-13b-chat-aiacc
DeepGPU-LLM converted model	llama-2-7b-hf-aiacc	/llm-model/llama-2-7b-hf-aiacc
DeepGPU-LLM converted model	llama-2-13b-hf-aiacc	/llm-model/llama-2-13b-hf-aiacc
Open-source model	qwen-7b-chat	/llm-model/Qwen-7B-Chat
Open-source model	qwen-14b-chat	/llm-model/Qwen-14B-Chat
Open-source model	chatglm2-6b	/llm-model/chatglm2-6b
Open-source model	chatglm2-6b-32k	/llm-model/chatglm2-6b-32k
Open-source model	baichuan2-7b-chat	/llm-model/Baichuan2-7B-Chat
Open-source model	baichuan2-13b-chat	/llm-model/Baichuan2-13B-Chat
Open-source model	llama-2-7b-hf	/llm-model/Llama-2-7b-hf
Open-source model	llama-2-13b-hf	/llm-model/Llama-2-13b-hf
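
As one possible way to stage an open-source model, the sketch below downloads Qwen-7B-Chat from Hugging Face and uploads it to OSS. It assumes that the huggingface-cli tool (from the huggingface_hub package) and ossutil are installed and configured with valid credentials, and that the model is placed under the bucket root so that a volume mounted from path "/" exposes it as /llm-model/Qwen-7B-Chat. The function names are illustrative, and your ossutil binary may have a different name depending on the version.

```shell
# Build the OSS destination for a model directory placed under the bucket root.
oss_model_uri() {
  printf 'oss://%s/%s/\n' "$1" "$2"
}

# Illustrative helper. Arguments: Hugging Face repository ID and OSS bucket name.
fetch_and_upload_model() {
  repo_id="$1"; bucket="$2"
  model_dir="${repo_id##*/}"   # Qwen/Qwen-7B-Chat -> Qwen-7B-Chat
  # Download the model files, then copy the directory recursively to OSS.
  huggingface-cli download "$repo_id" --local-dir "./$model_dir" &&
    ossutil cp -r "./$model_dir" "$(oss_model_uri "$bucket" "$model_dir")"
}
```

For example, fetch_and_upload_model Qwen/Qwen-7B-Chat "<Your_Bucket_Name>" uploads the files to oss://<Your_Bucket_Name>/Qwen-7B-Chat/.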

(Optional) Convert the model.
The Langchain-Chatchat project in this topic includes the DeepGPU-LLM model converter and uses the DeepGPU-LLM accelerated model qwen-7b-chat-aiacc by default.

To use DeepGPU-LLM for inference optimization on other open-source LLMs, you must first convert the Hugging Face models to the format that DeepGPU-LLM supports.

For example, to convert qwen-7b-chat, run the following command in the container to convert the model format. Alternatively, you can install the DeepGPU-LLM inference engine and use it to convert the model format. For more information, see Install and use DeepGPU-LLM.
```
# Convert the Qwen-7B-Chat weights to the DeepGPU-LLM format.
huggingface_qwen_convert \
    -in_file /llm-model/Qwen-7B-Chat \
    -saved_dir /llm-model/qwen-7b-chat-aiacc \
    -infer_gpu_num 1 \
    -weight_data_type fp16 \
    -model_name qwen-7b
```

Create a static PersistentVolume and PersistentVolumeClaim.

This example uses OSS. After the model is downloaded, run the following command to create a Secret.
```
kubectl create -f oss-secret.yaml
```
The following is an example oss-secret.yaml file for creating a Secret. Replace <your AccessKeyID> and <your AccessKeySecret> with your credentials.
```
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: <your AccessKeyID>
  akSecret: <your AccessKeySecret>
```

Run the following command to create a static PersistentVolume.

```
kubectl create -f model-oss.yaml
```

The following is an example model-oss.yaml file for creating a static PersistentVolume. Replace "<Your_Bucket_Name>" and "<Your_OSS_Endpoint>" with your bucket name and endpoint URL.

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-oss
  labels:
    alicloud-pvname: model-oss
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: "<Your_Bucket_Name>"
      url: "<Your_OSS_Endpoint>" # The URL in this topic is oss-cn-hangzhou.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/"
```

Run the following command to create a static PersistentVolumeClaim.

```
kubectl create -f model-pvc.yaml
```

The following is an example model-pvc.yaml file for creating a static PersistentVolumeClaim.

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: model-oss
```

For specific parameter configurations, see Use a static ossfs 1.0 volume.
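
Before you update the Helm values, it can be useful to confirm that the claim is bound. The following is a minimal sketch, assuming kubectl is configured for the cluster. Because a Pod can only mount a PersistentVolumeClaim from its own namespace, create and check the claim in the namespace where Langchain-Chatchat is deployed. The pvc_phase function and the KUBECTL override are illustrative names.

```shell
# Illustrative helper. Arguments: PVC name and namespace (default: default).
# KUBECTL is a hypothetical override, for example KUBECTL=echo for a dry run.
pvc_phase() {
  # Print the claim's phase, for example Pending or Bound.
  "${KUBECTL:-kubectl}" get pvc "$1" -n "${2:-default}" -o 'jsonpath={.status.phase}'
}
```

For example, pvc_phase model-pvc aigc should print Bound once the claim has matched the model-oss volume.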

Update the Helm values.
1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Applications > Helm.
3. On the Helm page, find the deployed langchain-chatchat release and click Update in the Actions column. Change llm.model to the new model name and llm.modelPVC to the name of the PVC that stores the new model. For the model names and container model paths, see the table of example models in the Download the model step.
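
The same change can be expressed as a small values override. The sketch below only prints the YAML fragment and does not touch the cluster. The llm_values_override function is an illustrative helper, and qwen-14b-chat-aiacc and model-pvc are the example names used earlier in this topic.

```shell
# Illustrative helper. Arguments: model name and PVC name.
# Prints the llm keys that need to change when you switch models.
llm_values_override() {
  cat <<EOF
llm:
  model: $1
  modelPVC: $2
EOF
}

llm_values_override qwen-14b-chat-aiacc model-pvc
```

Merge the printed keys into the existing llm section of the release values and leave the other keys unchanged.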

How do I deploy using ECI Pods?

Set the llm.pod.instanceType and chat.pod.instanceType parameters to eci. The default annotations and labels for ECI Pods are as follows. For more information about the annotations, see ECI Pod Annotation.

```
annotations:
  k8s.aliyun.com/eci-use-specs: ecs.gn6i-c8g1.2xlarge,ecs.gn6i-c16g1.4xlarge,ecs.gn6v-c8g1.8xlarge,ecs.gn7i-c8g1.2xlarge,ecs.gn7i-c16g1.4xlarge
  k8s.aliyun.com/eci-extra-ephemeral-storage: "50Gi"
labels:
  alibabacloud.com/eci: "true"
```

If you change the image or model, you must modify the k8s.aliyun.com/eci-use-specs and k8s.aliyun.com/eci-extra-ephemeral-storage annotations. Otherwise, the application will fail to start due to insufficient resources. For more information about ECI billing, see Billing overview.

How do I accelerate text vectorization?

The application's built-in embedding model is text2vec-bge-large-chinese. For more information, see text2vec-bge-large-chinese.

By default, the chat application runs the embedding model on the CPU, but you can accelerate text vectorization by requesting GPU resources in chat.pod.resources.

```
...
chat:
  pod:
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: "1"
```

How do I specify the vector database type?

Supported vector databases include Faiss and Alibaba Cloud AnalyticDB for PostgreSQL.

Faiss is an open-source, in-memory vector library developed by Facebook. Faiss is deployed in the chat Pod and is therefore constrained by the Pod's resources. If you use Faiss, we recommend increasing the chat Pod's memory.
AnalyticDB for PostgreSQL is a massively parallel processing (MPP) data warehousing service that provides online analysis of massive datasets.

To use AnalyticDB for PostgreSQL with this project, your instance must meet the following conditions:

The vector engine optimization feature must be enabled.
The compute node specification must be 4-core 16 GiB or higher.

To use the instance, set the db.dbType parameter to adb and enter the database connection information in db.adb.

```
db:
  dbType: adb
  embeddingModel: text2vec-bge-large-chinese
  adb:
    pgHost: "pg.host.demo.com"
    pgPort: "5432"
    pgDataBase: "langchain"
    pgUser: "langchain"
    pgPassword: "pg_password"
```
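
Before you apply these values, you may want to confirm that the database accepts connections with the same settings. The following is a sketch under stated assumptions: the psql client is installed on a host that is allowed to reach the instance, and the host, port, database, user, and password are the placeholder values from the example above. The adb_conninfo and check_adb function names are illustrative.

```shell
# Build a libpq connection string from host, port, database, and user.
adb_conninfo() {
  printf 'host=%s port=%s dbname=%s user=%s\n' "$1" "$2" "$3" "$4"
}

# Illustrative helper. Arguments: host, port, database, user, password.
check_adb() {
  # The password is passed through the PGPASSWORD environment variable.
  PGPASSWORD="$5" psql "$(adb_conninfo "$1" "$2" "$3" "$4")" -c 'SELECT 1;'
}
```

For example, check_adb pg.host.demo.com 5432 langchain langchain pg_password returns a single row if the connection succeeds.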

How do I change the number of service replicas?

Set llm.pod.replicas to the desired number of inference service replicas.

```
llm:
  model: qwen-7b-chat-aiacc
  modelPVC: "" # PVC name.
  load8bit: true
  pod:
    replicas: 1 # The number of replicas for the model inference service.
```

Set chat.pod.replicas to the desired number of application service replicas.

```
chat:
  kbPVC: ""
  pod:
    replicas: 1 # The number of replicas for the application service.
```

Clean up resources

The fees for using an ACK Pro cluster or ACK Serverless cluster consist of two parts:

Cluster management fees: Billed directly by Container Service for Kubernetes (ACK).
Cloud resource fees: Resources such as GPU nodes and storage are billed by their respective cloud services according to their billing rules.

After you finish the deployment, choose one of the following options:

Delete the cluster: If you no longer need the cluster, log on to the ACK console to delete it. For more information about deleting an ACK cluster, see Delete a cluster.
Continue using the cluster: Ensure your Alibaba Cloud account has a balance of at least 100 CNY. For the billing details of other Alibaba Cloud resources used with your ACK cluster, see Cloud resource fees.

Contact us

If you have questions or suggestions while completing the ACK AIGC tutorial, join the DingTalk group (DingTalk group ID: 31850017754) to discuss them.