Traditional retrieval-augmented generation (RAG) pipelines process only text. Images in documents such as PDF and Word files are ignored, so the information they carry is lost. The multimodal feature of PAI-RAG integrates a multimodal large language model (LLM) that understands both images and text, which lets the service provide more complete answers. This topic describes how to enable multimodal inference in a RAG service.
Prerequisites
A RAG service is deployed. The key configurations are described below. For more information, see RAG chatbot (v0.3.x).
Version Selection: Select Decoupled LLM Deployment.
RAG Version: Select pai-rag:0.3.4.
VPC: Select a virtual private cloud (VPC) with public network access. This allows the service to access image URLs on the public network. For more information, see Access public or internal resources from EAS.
Configure a multimodal LLM
Configure a multimodal LLM to generate answers based on image and text content.
Prepare a multimodal LLM service.
(Recommended) Use a multimodal model service that is deployed on Elastic Algorithm Service (EAS) and is compatible with the OpenAI protocol. Examples include the open source Qwen2.5-VL-72B-Instruct-AWQ and Qwen2.5-VL-72B-Instruct models from the Qwen2.5-VL series. For more information about how to deploy a multimodal model service on EAS, see Deploy a Large Language Model (LLM).
Use a DashScope multimodal model. For more information, see Call Qwen API for the first time.
Obtain the endpoint and token of the multimodal LLM service.
The acquisition method depends on the service type:
EAS multimodal model service: On the Elastic Algorithm Service (EAS) page, click the multimodal model service name, and then in the Basic Information section, click View Endpoint Information to obtain the endpoint and token.
Note: If you use the Internet endpoint, the RAG service must be attached to a VPC with public network access. If you use the VPC endpoint, the RAG service and the multimodal model service must be in the same VPC.
DashScope multimodal model: The endpoint is https://dashscope.aliyuncs.com/compatible-mode/v1. The service token is the API key. For more information about how to obtain it, see Call Qwen API for the first time.
On the WebUI page of the RAG service, configure the request parameters for the multimodal LLM service.
On the Elastic Algorithm Service (EAS) page, click the name of the target RAG service, and then click View Web Application in the upper-right corner of the page.
On the System Settings tab, on the Model And Storage Configuration tab, configure the following parameters, and then click Save Model Configuration.
URL: Set this to the service endpoint.
Key: Set this to the service token.
Model Name: Set this to the name of the multimodal LLM.
Enable Multimodal Support: Select the check box.
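Before saving, it can help to confirm that the URL, key, and model name work together. The following is a minimal sketch that builds, but does not send, an OpenAI-style chat request with one text part and one image part. It assumes the service exposes `/chat/completions` under the configured URL and accepts a `Bearer` token, as the DashScope compatible-mode endpoint does. An EAS-deployed service may expect a different path prefix or the bare token. The function name and the placeholder key are hypothetical.

```python
import json
import urllib.request


def build_model_check_request(base_url, api_key, model, image_url):
    """Build a minimal OpenAI-style vision request for a connectivity check.

    Assumptions: the chat route is `<base_url>/chat/completions` and the
    service accepts `Authorization: Bearer <key>`.
    """
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_model_check_request(
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
        api_key="<YOUR_API_KEY>",  # placeholder, replace with the service token
        model="qwen-vl-max",
        image_url="https://pai-rag.oss-cn-hangzhou.aliyuncs.com/data/demo/shirts/10.jpg",
    )
    print(req.get_method(), req.full_url)
    # To actually send the request: urllib.request.urlopen(req, timeout=30)
```

If the request succeeds when sent, the same three values should be usable in the URL, Key, and Model Name fields.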
Configure Alibaba Cloud OSS storage
You can add an Object Storage Service (OSS) data source to store image file information. This allows the system to display images as links in the inference results.
On the Elastic Algorithm Service (EAS) page, click the name of the target RAG service, and then in the upper-right corner of the page, click View Web Application.
On the System Settings tab, on the Model And Storage Configuration tab, select Use OSS For Image Storage, and configure the following parameters.
Each parameter is listed with its description and an example value.
OSS bucket: The name of the OSS bucket. You can go to the Bucket List page to view it. If you have not created a bucket, see Quick Start to create one. Example: examplebucket
OSS domain name: Enter the public endpoint that corresponds to the region of the OSS bucket. For a list of regions and endpoints, see Regions and endpoints. Example: oss-<Region ID>.aliyuncs.com
Note: By default, EAS cannot access the public network. If you use a public endpoint, such as oss-cn-hangzhou.aliyuncs.com, you must configure a VPC with public network access for the RAG service. For more information, see Access public or internal resources from EAS.
AccessKey ID: The AccessKey ID of your Alibaba Cloud account. Example: yourAccessKeyID
AccessKey secret: The AccessKey secret of your Alibaba Cloud account. Example: yourAccessKeySecret
After you configure the parameters, click Save OSS Configuration.
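The OSS domain name follows the pattern oss-<Region ID>.aliyuncs.com, and the sample image URL used later in this topic has the form https://<bucket>.<endpoint>/<object key>. The following sketch composes both strings so that the values entered in the form can be double-checked. The function names and the object key are hypothetical, and how PAI-RAG names the image objects it stores is not covered here.

```python
from urllib.parse import quote


def oss_public_endpoint(region_id):
    """Return the public OSS endpoint for a region ID such as cn-hangzhou."""
    return f"oss-{region_id}.aliyuncs.com"


def oss_object_url(bucket, region_id, object_key):
    """Compose a virtual-hosted-style URL for an object in a bucket.

    The URL only resolves for callers that are allowed to read the object.
    """
    # Percent-encode the key but keep "/" so the folder structure is preserved.
    key = quote(object_key.lstrip("/"), safe="/")
    return f"https://{bucket}.{oss_public_endpoint(region_id)}/{key}"


if __name__ == "__main__":
    print(oss_public_endpoint("cn-hangzhou"))
    # "images/page 1.png" is a made-up object key for illustration.
    print(oss_object_url("examplebucket", "cn-hangzhou", "images/page 1.png"))
```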
Upload multimodal files
This section describes how to upload multimodal files on the WebUI page of the RAG service. You can also upload multimodal files using an API. For more information, see Knowledgebase API.
Upload knowledge base files
The service supports multiple multimodal file formats, such as PDF, Markdown, Word, PPT, PNG, and JPG. If you have already configured a multimodal LLM and Alibaba Cloud OSS storage, the system automatically processes the images in a file and generates image captions during the upload.
On the WebUI page of the RAG service, go to the Knowledge Base > File Management tab. Click My files and navigate to the default/docs folder. To upload knowledge base files, you can drag local files into the window or click the upload icon in the upper-right corner.
View upload status
Switch to the Upload History tab and click the Refresh button. The upload is successful when the Upload Status for the file changes to done.
Q&A demo
Call from the WebUI
On the WebUI page of the RAG service, switch to the Chat tab to perform service inference.
Use the WebUI to perform a Q&A task on the knowledge base with a multimodal LLM
Follow the instructions in the following figure to retrieve results from the knowledge base and perform inference using the multimodal LLM. The following figure shows an example that uses a car manual.
You can also go to the Knowledge Base tab and click the Knowledge Base Q&A Prompt Template Configuration tab to modify the prompt for the Q&A task. This prompts the model to display an image in the answer if the image is referenced from the knowledge base.
For example, add the following text to the task description: "If the answer mentions an image from the materials, you should provide the corresponding image link in Markdown format in the answer."
After you modify the prompt, the answer displays the image even if the question does not include a prompt to display it.
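When answers are consumed by a client program, the Markdown image links that this prompt requests can be pulled out of the answer text. The following is a minimal sketch that uses a regular expression for the `![alt](url)` form. The function name is hypothetical, and the pattern is a simplification that does not handle URLs containing parentheses.

```python
import re

# Matches ![alt text](url) and ![alt text](url "title").
# The URL is captured up to the first whitespace or closing parenthesis.
_MD_IMAGE = re.compile(r"!\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)")


def extract_markdown_images(answer):
    """Return a list of (alt_text, url) pairs for each Markdown image."""
    return _MD_IMAGE.findall(answer)


if __name__ == "__main__":
    sample = (
        "Fasten the belt as shown: "
        "![seat belt](https://examplebucket.oss-cn-hangzhou.aliyuncs.com/a.png)"
    )
    for alt, url in extract_markdown_images(sample):
        print(alt, url)
```

Plain links such as `[text](url)` are ignored because the pattern requires the leading `!`.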
Use the WebUI to perform a Q&A task with a multimodal LLM
Follow the instructions in the following figure to perform inference using the multimodal LLM.
API call
Obtain the endpoint and token of the RAG service.
On the Elastic Algorithm Service (EAS) page, click the name of the target RAG service.
On the Overview tab, in the Basic Information section, click View Endpoint Information to obtain the service endpoint and token.
Note: If you use the Internet endpoint, the client that calls the service must have public network access. If you use the VPC endpoint, the client that calls the service must be in the same VPC as the RAG service.
In a terminal, run the following code to perform inference.
Perform a Q&A task with a multimodal LLM
curl -X 'POST' <EAS_SERVICE_URL>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H 'Authorization: <EAS_TOKEN>' \
  -d '{
    "model": "qwen-vl-max",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is in the picture?" },
          { "type": "image_url", "image_url": { "url": "https://pai-rag.oss-cn-hangzhou.aliyuncs.com/data/demo/shirts/10.jpg" } }
        ]
      }
    ],
    "stream": true
  }'
The following list describes the key configurations.
<EAS_SERVICE_URL>: The RAG service endpoint. Remove the trailing /.
<EAS_TOKEN>: Replace this with the token of the RAG service.
model: Set this to the name of the multimodal LLM.
url: Set this to the URL of the image.
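The same call can be made from Python. The following sketch builds the request from the curl example and includes a small parser for the streamed reply. The request path, headers, and body fields come from the curl command. The stream format is an assumption: the parser expects OpenAI-style server-sent events (`data: {json}` lines ending with `data: [DONE]`, with text in `choices[0].delta.content`), which the RAG service may or may not follow exactly. Function names and the example host are hypothetical.

```python
import json
import urllib.request


def build_rag_chat_request(service_url, token, model, question, image_url, stream=True):
    """Build the POST request that the curl example sends."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "stream": stream,
    }
    return urllib.request.Request(
        # Remove the trailing "/" from the endpoint before appending the path.
        url=service_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )


def iter_stream_text(lines):
    """Yield text fragments from streamed lines (assumed OpenAI-style SSE)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        try:
            chunk = json.loads(data)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(chunk, dict):
            continue
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


if __name__ == "__main__":
    req = build_rag_chat_request(
        service_url="https://rag-service.example.com/",  # hypothetical endpoint
        token="<EAS_TOKEN>",
        model="qwen-vl-max",
        question="What is in the picture?",
        image_url="https://pai-rag.oss-cn-hangzhou.aliyuncs.com/data/demo/shirts/10.jpg",
    )
    print(req.full_url)
    # To send the request and print the streamed answer:
    # with urllib.request.urlopen(req, timeout=60) as resp:
    #     for piece in iter_stream_text(resp):
    #         print(piece, end="", flush=True)
```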
Perform a Q&A task on the knowledge base with a multimodal LLM
curl -X 'POST' <EAS_SERVICE_URL>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H 'Authorization: <EAS_TOKEN>' \
  -d '{
    "model": "qwen-vl-max",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is in the picture?" },
          { "type": "image_url", "image_url": { "url": "https://pai-rag.oss-cn-hangzhou.aliyuncs.com/data/demo/shirts/10.jpg" } }
        ]
      }
    ],
    "stream": true,
    "chat_knowledgebase": true,
    "index_name": "default"
  }'
The following list describes the key configurations.
<EAS_SERVICE_URL>: The endpoint of the RAG service. Remove the trailing /.
<EAS_TOKEN>: Replace this with the token of the RAG service.
model: Set this to the name of the multimodal LLM.
url: Set this to the URL of the image.
chat_knowledgebase: Set this to true to query the local knowledge base. Before you use this feature, you must upload knowledge base files. For more information, see Upload multimodal files or Knowledgebase API.
index_name: Enter the name of the knowledge base that contains the document. For more information about how to view the name, see Query the knowledge base list.
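The only difference from the plain multimodal request is the two extra body fields. As a small sketch, a helper like the following (the name is hypothetical) can turn an existing chat payload into a knowledge base query without modifying the original dictionary:

```python
import copy


def with_knowledgebase(payload, index_name="default"):
    """Return a copy of a chat payload that also queries a knowledge base."""
    if not index_name:
        raise ValueError("index_name must be a non-empty knowledge base name")
    extended = copy.deepcopy(payload)
    extended["chat_knowledgebase"] = True
    extended["index_name"] = index_name
    return extended


if __name__ == "__main__":
    base = {
        "model": "qwen-vl-max",
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": "What is in the picture?"}],
            }
        ],
        "stream": True,
    }
    print(with_knowledgebase(base, index_name="default"))
```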