This topic shows you how to quickly build a Retrieval-Augmented Generation (RAG) knowledge base system. You will use the open-source Langchain-Chatchat framework to integrate the DeepGPU-LLM inference engine with AnalyticDB for PostgreSQL on Alibaba Cloud Container Service for Kubernetes (ACK).
Background information
-
DeepGPU-LLM is a high-performance inference engine from Alibaba Cloud for running large language model (LLM) inference on GPU cloud servers. For more information, see What is the DeepGPU-LLM inference engine?.
-
AnalyticDB for PostgreSQL is an Alibaba Cloud-optimized service based on the open-source Greenplum project. It is compatible with ANSI SQL 2003 and the PostgreSQL/Oracle ecosystem, and supports both row and column storage modes. It delivers high performance for offline data processing and high-concurrency online analytical queries. For more information, see AnalyticDB for PostgreSQL overview.
-
For more information about the open-source local knowledge base Q&A project Langchain-Chatchat, see Langchain-Chatchat.
-
Alibaba Cloud does not guarantee the legality, security, or accuracy of third-party models and is not liable for any damages that arise from their use.
-
You are solely responsible for the legal and compliant use of these models.
Prerequisites
Create an ACK Pro cluster or create an ACK Serverless cluster. The cluster must meet the following conditions.
-
Add GPU nodes to the cluster. The GPU instance type must be from the V100 or A10 series. vGPUs are not supported. To prevent out-of-memory errors, use a GPU instance type with at least 24 GiB of GPU memory.
-
Configure a NAT Gateway for your cluster to provide public network access to the cluster's nodes and applications.
Billing
Using an ACK Pro cluster or an ACK Serverless cluster incurs charges for any Alibaba Cloud resources you create, such as GPU nodes and NAT Gateways. For more information, see Billing.
Step 1: Deploy the Langchain-Chatchat application
Log on to the ACK console. In the left navigation pane, click .
-
On the App Marketplace page, search for langchain-chatchat and click the application.
-
On the application details page, click One-Click Deploy in the upper-right corner. In the User Notice dialog box, read the notice carefully and click I Understand.
-
On the Basic Information page, specify the target cluster, namespace, and release name, and then click Next.
-
On the Parameter Settings page, configure the parameters and click OK.
The following table describes the parameters. This topic uses the Qwen model as an example.
| Parameter | Description | Default |
| --- | --- | --- |
| llm.model | The name of the LLM. | qwen-7b-chat-aiacc |
| llm.load8bit | Enable INT8 quantization for the LLM. | true |
| llm.modelPVC | The PersistentVolumeClaim for model storage, mounted to the /llm-model directory in the container. | true |
| llm.pod.replicas | The number of replicas for the model inference service. | 1 |
| llm.pod.instanceType | The deployment method for the model inference service. Valid values: ecs (deploy to an ECS node); eci (deploy to an Elastic Container Instance (ECI)). For an ACK Serverless cluster, set this to eci. | ecs |
| chat.pod.replicas | The number of replicas for the application service. | 1 |
| chat.pod.instanceType | The deployment method for the application. Valid values: ecs (deploy to an ECS node); eci (deploy to an Elastic Container Instance (ECI)). For an ACK Serverless cluster, set this to eci. For information about how to use GPU instances in an ACK Serverless cluster, see Create a Pod by specifying a GPU specification. | ecs |
| chat.kbPVC | The PersistentVolumeClaim for storing knowledge base files, mounted to the /root/Langchain-Chatchat/knowledge_base directory in the container. | None |
| db.dbType | The type of vector database. Valid values: Faiss, ADB. | faiss |
| db.embeddingModel | The embedding model. | text2vec-bge-large-chinese |
-
In the left-side navigation pane of the target cluster, choose Workloads > Deployments. Select the namespace where Langchain-Chatchat is deployed and wait for the Pods to become ready (the number of Pods changes to 1/1).
Note: Pulling the image can take 10 to 20 minutes.
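The readiness check in the last step can also be scripted. The following is a minimal sketch, not part of the original procedure. The helper name all_ready and the sample Deployment names are hypothetical, and the sketch assumes the default column layout of kubectl get deployments --no-headers, in which the second column is READY.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of
#   kubectl get deployments -n <namespace> --no-headers
# on stdin and succeeds only if every READY value has the form N/N with N > 0.
all_ready() {
  awk '
    { split($2, r, "/"); if (r[1] != r[2] || r[2] == 0) bad = 1 }
    END { exit (bad || NR == 0) }
  '
}

# Sample input with made-up Deployment names, so the sketch runs without a cluster.
if printf 'chat-demo 1/1 1 1 12m\nllm-demo 0/1 1 0 12m\n' | all_ready; then
  echo "all Deployments ready"
else
  echo "still waiting"
fi
```

Against a real cluster, the function would instead be fed from kubectl get deployments -n aigc --no-headers, where aigc is the example namespace used in Step 2.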
Step 2: Access the service
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click .
-
On the Services page, find the Service where Langchain-Chatchat is deployed. Its name is in the format chat-{releaseName}.
-
Access the Langchain-Chatchat application from your browser.
-
Run the following command to forward the chat Service in the cluster to your local port 8501.
# Replace chat-chatchat with the actual Service name and aigc with the actual namespace where the Service is located.
kubectl port-forward service/chat-chatchat 8501:8501 -n aigc
Expected output:
Forwarding from 127.0.0.1:8501 -> 8501
Forwarding from [::1]:8501 -> 8501
Handling connection for 8501
-
Enter http://localhost:8501 in your browser to access the service. The expected result is as follows.
The Langchain-Chatchat interface (version v0.2.6) loads successfully. The left sidebar contains the Chat and Knowledge Base Management navigation options. The chat mode is set to LLM Chat, and the LLM model shows as qwen-7b-chat-aiacc (Running). On the right, a chat area allows you to interact with the model.
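Because the Service name follows the chat-{releaseName} format, the port-forward command can be derived from the release name. The following is a minimal sketch under stated assumptions: the helper name chat_service_name is hypothetical, and the release name chatchat and namespace aigc are simply the example values from the command above.

```shell
#!/bin/sh
# Hypothetical helper: builds the Service name from the Helm release name,
# following the chat-{releaseName} format described above.
chat_service_name() {
  printf 'chat-%s\n' "$1"
}

release_name="chatchat"   # assumption: the release name chosen in Step 1
namespace="aigc"          # assumption: the namespace chosen in Step 1

# Print the command instead of running it, so the sketch works without a cluster.
echo "kubectl port-forward service/$(chat_service_name "$release_name") 8501:8501 -n ${namespace}"
```

Printing the command rather than executing it keeps the sketch safe to run anywhere. Remove the echo wrapper to run the port-forward directly.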
FAQ
How do I change the model?
-
Download the model.
To deploy other open-source models, download them from Hugging Face or ModelScope and store them in Object Storage Service (OSS) or Network Attached Storage (NAS). The following table lists some example models. For a complete list of compatible types, see Langchain-Chatchat.
| Model type | Model name | Container model path |
| --- | --- | --- |
| DeepGPU-LLM converted model | qwen-7b-chat-aiacc | /llm-model/qwen-7b-chat-aiacc |
| DeepGPU-LLM converted model | qwen-14b-chat-aiacc | /llm-model/qwen-14b-chat-aiacc |
| DeepGPU-LLM converted model | chatglm2-6b-aiacc | /llm-model/chatglm2-6b-aiacc |
| DeepGPU-LLM converted model | baichuan2-7b-chat-aiacc | /llm-model/baichuan2-7b-chat-aiacc |
| DeepGPU-LLM converted model | baichuan2-13b-chat-aiacc | /llm-model/baichuan2-13b-chat-aiacc |
| DeepGPU-LLM converted model | llama-2-7b-hf-aiacc | /llm-model/llama-2-7b-hf-aiacc |
| DeepGPU-LLM converted model | llama-2-13b-hf-aiacc | /llm-model/llama-2-13b-hf-aiacc |
| Open-source model | qwen-7b-chat | /llm-model/Qwen-7B-Chat |
| Open-source model | qwen-14b-chat | /llm-model/Qwen-14B-Chat |
| Open-source model | chatglm2-6b | /llm-model/chatglm2-6b |
| Open-source model | chatglm2-6b-32k | /llm-model/chatglm2-6b-32k |
| Open-source model | baichuan2-7b-chat | /llm-model/Baichuan2-7B-Chat |
| Open-source model | baichuan2-13b-chat | /llm-model/Baichuan2-13B-Chat |
| Open-source model | llama-2-7b-hf | /llm-model/Llama-2-7b-hf |
| Open-source model | llama-2-13b-hf | /llm-model/Llama-2-13b-hf |
-
(Optional) Convert the model.
The Langchain-Chatchat project in this topic includes the DeepGPU-LLM model converter and uses the DeepGPU-LLM accelerated model qwen-7b-chat-aiacc by default. To use DeepGPU-LLM for inference optimization on other open-source LLMs, you must first convert the Hugging Face models to the format that DeepGPU-LLM supports.
For example, to convert qwen-7b-chat, run the following command in the container to convert the model format. Alternatively, you can install the DeepGPU-LLM inference engine and use it to convert the model format. For more information, see Install and use DeepGPU-LLM.
# qwen-7b weight convert
huggingface_qwen_convert \
    -in_file /llm-model/Qwen-7B-Chat \
    -saved_dir /llm-model/qwen-7b-chat-aiacc \
    -infer_gpu_num 1 \
    -weight_data_type fp16 \
    -model_name qwen-7b
-
Create a static PersistentVolume and PersistentVolumeClaim.
-
This example uses OSS. After the model is downloaded, run the following command to create a Secret.
kubectl create -f oss-secret.yaml
The following is an example oss-secret.yaml file for creating a Secret. Replace <your AccessKeyID> and <your AccessKeySecret> with your credentials.
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: <your AccessKeyID>
  akSecret: <your AccessKeySecret>
-
Run the following command to create a static PersistentVolume.
kubectl create -f model-oss.yaml
The following is an example model-oss.yaml file for creating a static PersistentVolume. Replace "<Your_Bucket_Name>" and "<Your_OSS_Endpoint>" with your bucket name and endpoint URL.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-oss
  labels:
    alicloud-pvname: model-oss
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: "<Your_Bucket_Name>"
      url: "<Your_OSS_Endpoint>" # The URL in this topic is oss-cn-hangzhou.aliyuncs.com.
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/"
-
Run the following command to create a static PersistentVolumeClaim.
kubectl create -f model-pvc.yaml
The following is an example model-pvc.yaml file for creating a static PersistentVolumeClaim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: model-oss
For specific parameter configurations, see Use a static ossfs 1.0 volume.
-
Update the Helm values.
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click .
-
On the Helm page, find the deployed langchain-chatchat service and click Update in the Actions column. Change llm.model to the new model name and llm.modelPVC to the name of the PVC that stores the new model. For the model name and model mount path, see List of supported models.
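As an illustration of the two values changed in the last step, the following sketch writes a values fragment that switches to qwen-14b-chat-aiacc. It assumes that the dotted parameter names map onto nested YAML keys (as the other YAML fragments in this topic suggest) and that the model is stored behind the model-pvc claim from the example above. The file name model-values.yaml is arbitrary.

```shell
#!/bin/sh
# Write a values fragment that switches the LLM to qwen-14b-chat-aiacc.
# Assumptions: the PVC is the model-pvc from the example above, and the
# file name model-values.yaml is arbitrary.
cat > model-values.yaml <<'EOF'
llm:
  model: qwen-14b-chat-aiacc   # model name from the table of example models
  modelPVC: "model-pvc"        # PVC that stores the new model
EOF

# Show the fragment so it can be copied into the Update dialog box.
cat model-values.yaml
```

Only the two changed keys are listed. The remaining parameters keep whatever values the release already has.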
How do I deploy using ECI Pods?
Set the llm.pod.instanceType and chat.pod.instanceType parameters to eci. The default Annotation and Label configurations for the ECI type are as follows. For more information about Annotations, see ECI Pod Annotation.
annotations:
  k8s.aliyun.com/eci-use-specs: ecs.gn6i-c8g1.2xlarge,ecs.gn6i-c16g1.4xlarge,ecs.gn6v-c8g1.8xlarge,ecs.gn7i-c8g1.2xlarge,ecs.gn7i-c16g1.4xlarge
  k8s.aliyun.com/eci-extra-ephemeral-storage: "50Gi"
labels:
  alibabacloud.com/eci: "true"
If you change the image or model, you must modify the k8s.aliyun.com/eci-use-specs and k8s.aliyun.com/eci-extra-ephemeral-storage annotations. Otherwise, the application will fail to start due to insufficient resources. For more information about ECI billing, see Billing overview.
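The two instanceType settings can be sketched as a values fragment. This is a hedged example, not the chart's documented layout: it assumes the dotted parameter names map onto nested YAML keys, and the file name eci-values.yaml is arbitrary. The annotations and labels are left out because this topic does not state where they sit in the values file.

```shell
#!/bin/sh
# Write a values fragment that moves both services to ECI Pods.
# Assumption: llm.pod.instanceType and chat.pod.instanceType are nested
# YAML keys; the file name eci-values.yaml is arbitrary.
cat > eci-values.yaml <<'EOF'
llm:
  pod:
    instanceType: eci   # model inference service
chat:
  pod:
    instanceType: eci   # application service
EOF

cat eci-values.yaml
```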
How do I accelerate text vectorization?
The application's built-in embedding model is text2vec-bge-large-chinese. For more information, see text2vec-bge-large-chinese.
By default, the chat application runs the embedding model on the CPU, but you can accelerate text vectorization by requesting GPU resources in chat.pod.resources.
...
chat:
  pod:
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 8Gi
        nvidia.com/gpu: "1"
How do I specify the vector database type?
Supported vector databases include Faiss and Alibaba Cloud AnalyticDB for PostgreSQL.
-
Faiss is an open-source, in-memory vector library developed by Facebook. Faiss is deployed in the chat Pod and is therefore constrained by the Pod's resources. If you use Faiss, we recommend increasing the chat Pod's memory.
-
AnalyticDB for PostgreSQL is a massively parallel processing (MPP) data warehousing service that provides online analysis of massive datasets.
To use AnalyticDB for PostgreSQL with this project, your instance must meet the following conditions:
-
The vector engine optimization feature must be enabled.
-
The compute node specification must be 4-core 16 GiB or higher.
-
Set the db.dbType parameter to adb, and enter the database information in db.adb.
db:
  dbType: adb
  embeddingModel: text2vec-bge-large-chinese
  adb:
    pgHost: "pg.host.demo.com"
    pgPort: "5432"
    pgDataBase: "langchain"
    pgUser: "langchain"
    pgPassword: "pg_password"
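Before you save these values, it can help to confirm that the database is reachable with the same settings. The following is a minimal sketch that assembles a PostgreSQL connection URI from the db.adb fields. The helper name pg_uri is hypothetical, and the host, port, database, and user are the placeholder values from the example above.

```shell
#!/bin/sh
# Hypothetical helper: builds a PostgreSQL connection URI from the db.adb fields.
pg_uri() {
  # $1 = pgUser, $2 = pgHost, $3 = pgPort, $4 = pgDataBase
  printf 'postgresql://%s@%s:%s/%s\n' "$1" "$2" "$3" "$4"
}

# Placeholder values copied from the example configuration above.
uri="$(pg_uri langchain pg.host.demo.com 5432 langchain)"
echo "$uri"
```

If the psql client is installed on a machine that can reach the instance, the URI can be passed to it as the connection string, with the password supplied through the PGPASSWORD environment variable, for example to run SELECT 1 as a connectivity test.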
How do I change the number of service replicas?
Set llm.pod.replicas to the desired number of inference service replicas.
llm:
  model: qwen-7b-chat-aiacc
  modelPVC: "" # PVC name.
  load8bit: true
  pod:
    replicas: 1 # The number of replicas for the model inference service.
Set chat.pod.replicas to the desired number of application service replicas.
chat:
  kbPVC: ""
  pod:
    replicas: 1 # The number of replicas for the model application service.
Clean up resources
The fees for using an ACK Pro cluster or ACK Serverless cluster consist of two parts:
-
Cluster management fees: Billed directly by Container Service for Kubernetes (ACK).
-
Cloud resource fees: Resources such as GPU nodes and storage are billed by their respective cloud services according to their billing rules.
After you finish the deployment, choose one of the following options:
-
Delete the cluster: If you no longer need the cluster, log on to the ACK console to delete it. For more information about deleting an ACK cluster, see Delete a cluster.
-
Continue using the cluster: Ensure your Alibaba Cloud account has a balance of at least 100 CNY. For the billing details of other Alibaba Cloud resources used with your ACK cluster, see Cloud resource fees.
Contact us
If you have questions or suggestions while completing the ACK AIGC tutorial, join the DingTalk group (DingTalk group ID: 31850017754) to discuss them.