Build a multimodal RAG

Traditional retrieval-augmented generation (RAG) pipelines process only text. They ignore the images in documents such as PDF and Word files, so information is lost. The multimodal feature of PAI-RAG integrates a multimodal large language model (LLM) that understands both images and text, which lets the service give more complete answers. This topic describes how to enable multimodal inference in a RAG service.

Prerequisites

A RAG service is deployed. The key configurations are described below. For more information, see RAG chatbot (v0.3.x).

  • Version Selection: Select Decoupled LLM Deployment.

  • RAG Version: Select pai-rag:0.3.4.

  • VPC: Select a virtual private cloud (VPC) with public network access. This allows the service to access image URLs on the public network. For more information, see Access public or internal resources from EAS.

Configure a multimodal LLM

Configure a multimodal LLM to generate answers based on image and text content.

  1. Prepare a multimodal LLM service.

    • (Recommended) Use a multimodal model service that is deployed on Elastic Algorithm Service (EAS) and is compatible with the OpenAI protocol. Examples include the open source Qwen2.5-VL-72B-Instruct-AWQ and Qwen2.5-VL-72B-Instruct models from the Qwen2.5-VL series. For more information about how to deploy a multimodal model service on EAS, see Deploy a Large Language Model (LLM).

    • Use a DashScope multimodal model. For more information, see Call Qwen API for the first time.

  2. Obtain the endpoint and token of the multimodal LLM service.

    Service type: EAS multimodal model service

    Acquisition method: On the Elastic Algorithm Service (EAS) page, click the multimodal model service name, and then in the Basic Information section, click View Endpoint Information to obtain the endpoint and token.

    Note
    • Use the Internet Endpoint: The RAG service must be attached to a VPC with public network access.

    • Use the VPC endpoint: The RAG service and the multimodal model service must be in the same VPC.

    Service type: DashScope multimodal model

    Acquisition method:

    • Endpoint: https://dashscope.aliyuncs.com/compatible-mode/v1

    • Service Token: This is the API key. For more information about how to obtain it, see Call Qwen API for the first time.

  3. On the WebUI page of the RAG service, configure the request parameters for the multimodal LLM service.

    1. On the Elastic Algorithm Service (EAS) page, click the name of the target RAG service, and then click View Web Application in the upper-right corner of the page.

    2. On the System Settings tab, open the Model And Storage Configuration tab, configure the following parameters, and then click Save Model Configuration.

      • URL: Set this to the service endpoint.

      • Key: Set this to the service token.

      • Model Name: Set this to the name of the multimodal LLM.

      • Enable Multimodal Support: Select the check box.
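Before you enter the URL, key, and model name in the WebUI, it can help to confirm that the multimodal LLM service accepts an image-plus-text request. The following is a minimal sketch, not an official client. It assumes that `base_url` already ends with the version prefix (as the DashScope endpoint above does) and that the caller passes the complete `Authorization` header value. `build_probe_request` is a hypothetical helper name, and the prompt text is only an example.

```python
import json
import urllib.request


def build_probe_request(base_url, auth_value, model, image_url):
    """Build (but do not send) a one-image chat request.

    Assumption: base_url ends with the version prefix, for example
    https://dashscope.aliyuncs.com/compatible-mode/v1, and the service
    follows the OpenAI chat completions protocol.
    """
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        # auth_value is passed through unchanged; the caller decides its format.
        headers={"Content-Type": "application/json", "Authorization": auth_value},
        method="POST",
    )
```

To send the probe, pass the returned object to `urllib.request.urlopen(request, timeout=60)`. For DashScope the header value takes the form `Bearer <API key>`. For an EAS service, use the token shown in View Endpoint Information.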

Configure Alibaba Cloud OSS storage

You can add an Object Storage Service (OSS) data source to store image file information. This allows the system to display images as links in the inference results.

  1. On the Elastic Algorithm Service (EAS) page, click the name of the target RAG service, and then in the upper-right corner of the page, click View Web Application.

  2. On the System Settings tab, open the Model And Storage Configuration tab, select Use OSS For Image Storage, and configure the following parameters.

    • OSS bucket: The name of the OSS bucket. You can go to the Bucket List page to view it. If you have not created a bucket, see Quick Start to create one. Example: examplebucket

    • OSS domain name: Enter the public endpoint that corresponds to the region of the OSS bucket. For a list of regions and endpoints, see Regions and endpoints. Example: oss-<Region ID>.aliyuncs.com

      Note

      By default, EAS cannot access the public network. If you use a public endpoint, such as oss-cn-hangzhou.aliyuncs.com, you must configure a VPC with public network access for the RAG service. For more information, see Access public or internal resources from EAS.

    • AccessKey ID: The AccessKey ID of your Alibaba Cloud account. Example: yourAccessKeyID

    • AccessKey secret: The AccessKey secret of your Alibaba Cloud account. Example: yourAccessKeySecret

  3. After you configure the parameters, click Save OSS Configuration.
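The OSS domain name and the image links in answers follow a predictable pattern, which the sketch below illustrates. It derives the public endpoint from a region ID using the oss-<Region ID>.aliyuncs.com form shown above, and it builds an object link in the same bucket.endpoint/key form as the demo image URL used later in this topic. The function names are hypothetical, and whether a given link opens in a browser depends on the bucket's access settings.

```python
from urllib.parse import quote


def oss_public_endpoint(region_id: str) -> str:
    """Return the public OSS endpoint for a region ID such as 'cn-hangzhou'."""
    region_id = region_id.strip()
    if not region_id:
        raise ValueError("region_id must not be empty")
    return f"oss-{region_id}.aliyuncs.com"


def oss_object_url(bucket: str, endpoint: str, key: str) -> str:
    """Build an HTTPS link to an object: https://<bucket>.<endpoint>/<key>."""
    # quote() percent-encodes unsafe characters but keeps '/' separators intact.
    return f"https://{bucket}.{endpoint}/{quote(key.lstrip('/'))}"


if __name__ == "__main__":
    endpoint = oss_public_endpoint("cn-hangzhou")
    print(oss_object_url("pai-rag", endpoint, "data/demo/shirts/10.jpg"))
```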

Upload multimodal files

This section describes how to upload multimodal files on the WebUI page of the RAG service. You can also upload multimodal files using an API. For more information, see Knowledgebase API.

Upload knowledge base files

The service supports multiple multimodal file formats, such as PDF, Markdown, Word, PPT, PNG, and JPG. If you have already configured a multimodal LLM and Alibaba Cloud OSS storage, the system automatically processes the images in a file and generates image captions during the upload.

On the WebUI page of the RAG service, go to the Knowledge Base > File Management tab. Click My files and open the default/docs folder. To upload knowledge base files, drag local files into the window or click the upload icon in the upper-right corner.

View upload status

Switch to the Upload History tab and click the Refresh button. The upload is successful when the Upload Status for the file changes to done.
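If you upload files through the API instead of the WebUI, the same "refresh until done" check can be automated. The sketch below is a generic polling loop under stated assumptions: `fetch_status` is a hypothetical callable that you supply and that returns the current upload status string, and only the "done" value comes from this topic. Any other status value, including a failure status, depends on the Knowledgebase API and is not modeled here.

```python
import time


def wait_until_done(fetch_status, timeout_s=300.0, interval_s=5.0, sleep=time.sleep):
    """Poll fetch_status() until it returns "done" or the timeout expires.

    fetch_status is a caller-supplied (hypothetical) function that returns
    the current upload status as a string. Returns True when the status is
    "done", and False if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch_status() == "done":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the loop easy to exercise in tests without real delays.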

Q&A demo

Call from the WebUI

On the WebUI page of the RAG service, switch to the Chat tab to perform service inference.

Use the WebUI to perform a Q&A task on the knowledge base with a multimodal LLM

Follow the instructions in the following figure to retrieve results from the knowledge base and perform inference using the multimodal LLM. The following figure shows an example that uses a car manual.

You can also go to the Knowledge Base tab and click the Knowledge Base Q&A Prompt Template Configuration tab to modify the prompt for the Q&A task. This prompts the model to display an image in the answer if the image is referenced from the knowledge base.

For example, add the following text to the task description: "If the answer mentions an image from the materials, you should provide the corresponding image link in Markdown format in the answer."

After you modify the prompt, the answer displays the image even if the question does not include a prompt to display it.

Use the WebUI to perform a Q&A task with a multimodal LLM

Follow the instructions in the following figure to perform inference using the multimodal LLM.

API call

  1. Obtain the endpoint and token of the RAG service.

    1. On the Elastic Algorithm Service (EAS) page, click the name of the target RAG service.

    2. On the Overview tab, in the Basic Information section, click View Endpoint Information to obtain the service endpoint and token.

      Note
      • Internet Endpoint: The client that calls the service must have public network access.

      • VPC endpoint: The client that calls the service must be in the same VPC as the RAG service.

  2. In a terminal, run the following code to perform inference.

    Perform a Q&A task with a multimodal LLM

    curl -X 'POST' <EAS_SERVICE_URL>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H 'Authorization: <EAS_TOKEN>' \
    -d '{
     "model": "qwen-vl-max",
     "messages": [
       {
         "role": "user",
         "content": [
           {
             "type": "text",
             "text": "What is in the picture?"
           },
           {
             "type": "image_url",
                 "image_url": {
                   "url": "https://pai-rag.oss-cn-hangzhou.aliyuncs.com/data/demo/shirts/10.jpg"
                 }
           }
         ]
       }
     ],
     "stream": true
    }'

    The following list describes the key configurations.

    • <EAS_SERVICE_URL>: The RAG service endpoint. Remove the trailing /.

    • <EAS_TOKEN>: Replace this with the token of the RAG service.

    • model: Set this to the name of the multimodal LLM.

    • url: Set this to the URL of the image.
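The curl request above can also be issued from Python. The following sketch uses only the standard library and mirrors the same endpoint path, headers, and body. Reading the streamed reply is the less certain part: `iter_stream_text` assumes that the service returns OpenAI-style server-sent events (lines that start with `data:`, a `choices[].delta.content` field, and a final `[DONE]` marker). That format is an assumption based on the /v1/chat/completions path, not something this topic states. Both function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(service_url, token, model, question, image_url, stream=True):
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "stream": stream,
    }
    return urllib.request.Request(
        # Strip the trailing "/" from the endpoint before appending the path.
        url=service_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": token},
        method="POST",
    )


def iter_stream_text(lines):
    """Yield text fragments from streamed lines.

    Assumption: OpenAI-style server-sent events.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

To run it against a live service, open the request with `urllib.request.urlopen(request)` and pass the response object, which iterates line by line, to `iter_stream_text`.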

    Perform a Q&A task on the knowledge base with a multimodal LLM

    curl -X 'POST' <EAS_SERVICE_URL>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H 'Authorization: <EAS_TOKEN>' \
    -d '{
      "model": "qwen-vl-max",
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "text",
              "text": "What is in the picture?"
            },
            {
              "type": "image_url",
              "image_url": {
                "url": "https://pai-rag.oss-cn-hangzhou.aliyuncs.com/data/demo/shirts/10.jpg"
              }
            }
          ]
        }
      ],
      "stream": true,
      "chat_knowledgebase": true,
      "index_name": "default"
    }'

    The following list describes the key configurations.

    • <EAS_SERVICE_URL>: The endpoint of the RAG service. Remove the trailing /.

    • <EAS_TOKEN>: Replace this with the token of the RAG service.

    • model: Set this to the name of the multimodal LLM.

    • url: Set this to the URL of the image.

    • chat_knowledgebase: Set this to true to query the local knowledge base. Before you use this feature, you must upload knowledge base files. For more information, see Upload multimodal files or Knowledgebase API.

    • index_name: Enter the name of the knowledge base that contains the document. For more information about how to view the name, see Query the knowledge base list.
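The two request bodies in this section differ only in the chat_knowledgebase and index_name fields. The sketch below makes that relationship explicit by deriving the knowledge base request from a plain multimodal request. `with_knowledge_base` is a hypothetical helper name, and "default" is the index name used in the example above.

```python
import copy


def with_knowledge_base(payload, index_name="default"):
    """Return a copy of a chat payload that also queries the knowledge base."""
    kb_payload = copy.deepcopy(payload)  # leave the caller's payload untouched
    kb_payload["chat_knowledgebase"] = True
    kb_payload["index_name"] = index_name
    return kb_payload


if __name__ == "__main__":
    plain = {
        "model": "qwen-vl-max",
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": "What is in the picture?"}]}
        ],
        "stream": True,
    }
    print(with_knowledge_base(plain))
```

Copying the payload rather than changing it in place lets the same base request be reused for both the plain call and the knowledge base call.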