Connect to the photo-based Q&A agent using the HTTP protocol

Terminology

Direct path: A request path that bypasses nodes such as automatic speech recognition (ASR), intent recognition, and text-to-speech (TTS). Instead, it sends the request directly to the agent and returns the agent's response.

Scenarios

You can send images directly to the agent for analysis without intent recognition or voice processing. This approach is ideal for camera-equipped products that require image understanding, such as photo-based learning devices or educational tablets with visual question-answering features.

Prerequisites

Activate Alibaba Cloud Model Studio and obtain an API key

For more information, see Obtain an API key. The API key is your authentication credential for Alibaba Cloud Model Studio.

Enable the photo-based Q&A direct path in the console

  1. In the Multimodal Interactive Development Suite console, create a multimodal interactive application. Select the Full-Featured template (do not select the Vision-only template) and disable interactive voice response.

⚠️ Note: Disable only interactive voice response. Keep the intent recognition and text model configurations enabled.

  2. Disable conversation carryover, knowledge base, web search, and long-term memory configurations.

  3. Clear all skill configurations and keep only the photo-based Q&A agent in the agent configuration.

  4. Configure the photo-based Q&A agent. Leave the launch instruction empty. We recommend that you select the “Balanced Visual Understanding” model. You can customize the prompt as needed.

For testing image description scenarios, you can use the following prompt. This prompt is for demonstration purposes only and should be customized for your scenario.

You are a professional image analysis and description assistant. Generate a concise, accurate, and informative textual description based on the image content. Ensure your description covers the following points:
Main objects and scene: Clearly identify core elements in the image, including their position, color, shape, quantity, and scene type (e.g., indoor, outdoor, urban, natural).
Human characteristics and actions (if any): Describe clothing, facial expressions, posture, actions, interactions, and possible roles or identities.
Text information (if any): Describe any visible text, its font style, and its meaning or purpose (e.g., signs, advertisements, titles).
Environment and atmosphere: Describe background details (e.g., weather, lighting, season), overall color tone, and the emotion or story conveyed by the image.
Perspective and composition: Optionally note the shooting angle (top-down, eye-level, low-angle), depth of field, and focal point.
Do not include subjective opinions. Provide only objective, observable facts.
Description requirements:
Use clear and concise Chinese. Avoid wordiness and vague terms.
Highlight key information while covering relevant details.
If clear relationships exist among elements (e.g., cause-effect, action sequences), describe them explicitly.
Avoid personal speculation or emotional judgments (unless the atmosphere is unmistakable).

  5. After you complete the configuration, click Publish in the upper-right corner. You must publish the application before you can test it.

Connect using the HTTP protocol

Request parameters

| Top-level parameter | Secondary parameter | Type | Required | Description |
| --- | --- | --- | --- | --- |
| model | | string | Yes | Alibaba Cloud Model Studio model name. Always use "multimodal-dialog". Copy this value directly. |
| input | directive | string | Yes | Directive name: Request |
| input | app_id | string | Yes | Your application ID (see Obtain an app ID). You can find it on the My Applications page in the Multimodal Interactive Development Suite console. |
| input | dialog_id | string | No | Dialog ID. If left blank, a new dialog starts. The server generates a dialog ID automatically and returns it in the response. Example format: "12345678-1234-1234-1234-1234567890ab" (36 characters total). To continue a previous dialog, pass the dialog_id previously returned by the server. |
| input | text | string | Yes | Text to process. In this scenario, leave it as an empty string (""). |
| parameters | client_info | object | Yes | See the parameters.client_info table below. |
| parameters | images | list[] | No | Image data for analysis. Only multimodal applications support image-based Q&A. See the parameters.images table below. |
| parameters | biz_params | object | No | Configure as needed. See the parameters.biz_params table below. |
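
The request parameters above can be assembled programmatically. The following is a minimal sketch, not an official client: the function name `build_request` and its arguments are hypothetical, the fixed `biz_params` values are the ones this document gives for the direct path, and the image URL in the usage note is a placeholder.

```python
from typing import Optional


def build_request(app_id: str, image_url: str, user_id: str,
                  dialog_id: Optional[str] = None) -> dict:
    """Assemble a direct-path request body from the documented parameters."""
    input_block = {
        "directive": "Request",
        "app_id": app_id,
        "text": "",  # left empty in this scenario, as the table requires
    }
    if dialog_id:
        # Only sent when continuing a dialog; omitting it starts a new one.
        input_block["dialog_id"] = dialog_id
    return {
        "model": "multimodal-dialog",
        "input": input_block,
        "parameters": {
            "client_info": {"user_id": user_id},
            "images": [{"type": "url", "value": image_url}],
            "biz_params": {
                "commands": [
                    {
                        "name": "agent_command",
                        "exec_params": {
                            "app_id": "visual_qa",
                            "intent": "open_visual_qa",
                        },
                    }
                ]
            },
        },
    }
```

For example, `build_request("xxxxxx", "https://example.com/photo.jpg", "test-251222")` returns a body with the same shape as the sample request in this document, minus the optional `device` object.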

parameters.client_info parameters

| Top-level parameter | Secondary parameter | Type | Required | Description |
| --- | --- | --- | --- | --- |
| user_id | | string | Yes | End user ID. Generate this ID according to your business rules to enable customized features per end user. Maximum length: 36 characters. |
| device | uuid | string | No | Globally unique client device ID. Generate and pass this ID into the SDK yourself. Maximum length: 40 characters. One end user can have multiple devices, each with a different uuid but the same user_id. |
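
Because both IDs have length limits, it can help to check them on the client before sending. The sketch below is illustrative only: `make_client_info` is a hypothetical helper, and it assumes the documented limits are counted in characters, as Python's `len()` counts them.

```python
from typing import Optional

MAX_USER_ID_LEN = 36      # limit documented for user_id
MAX_DEVICE_UUID_LEN = 40  # limit documented for device.uuid


def make_client_info(user_id: str, device_uuid: Optional[str] = None) -> dict:
    """Build parameters.client_info, rejecting IDs that exceed the limits."""
    if not user_id:
        raise ValueError("user_id is required")
    if len(user_id) > MAX_USER_ID_LEN:
        raise ValueError(f"user_id exceeds {MAX_USER_ID_LEN} characters")
    info = {"user_id": user_id}
    if device_uuid is not None:
        if len(device_uuid) > MAX_DEVICE_UUID_LEN:
            raise ValueError(f"device uuid exceeds {MAX_DEVICE_UUID_LEN} characters")
        # device is optional; include it only when a uuid is supplied.
        info["device"] = {"uuid": device_uuid}
    return info
```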

parameters.images parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Image type. Supported values: base64 or url. |
| value | string | Yes | Image content. If type is base64, this field contains the image’s base64-encoded string. If type is url, this field contains the image’s URL. |
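
The two image forms can be produced with small helpers. This is a sketch under one stated assumption: the `base64` value is taken to be the plain standard base64 encoding of the raw image bytes, with no `data:` URI prefix. This document does not specify the exact form, so verify it against your application. The helper names are hypothetical.

```python
import base64


def image_from_url(url: str) -> dict:
    """Entry for parameters.images that references a hosted image."""
    return {"type": "url", "value": url}


def image_from_bytes(data: bytes) -> dict:
    """Entry for parameters.images that embeds the image bytes.

    ASSUMPTION: plain standard base64, without a data: URI prefix.
    """
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "base64", "value": encoded}
```

To embed a local file, read it in binary mode and pass the bytes, for example `image_from_bytes(open("photo.jpg", "rb").read())`.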

parameters.biz_params parameters

| Top-level parameter | Secondary parameter | Type | Required | Description |
| --- | --- | --- | --- | --- |
| commands[i] | name | string | Yes | Indicates direct routing to the agent. In this scenario, always use “agent_command”. Copy this value directly. |
| commands[i] | exec_params | object | Yes | Specifies the target agent. In this scenario, this value is fixed. See the commands.exec_params table below for details. |

commands.exec_params parameters

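| Top-level parameter | Secondary parameter | Type | Required | Description |
| --- | --- | --- | --- | --- |
| exec_params | app_id | string | Yes | Specifies the target agent. In this scenario, always use "visual_qa". Copy this value directly. |
| exec_params | intent | string | Yes | Specifies the agent action. In this scenario, always use "open_visual_qa". Copy this value directly. |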
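
Since the direct path depends on several fixed values being present together, a pre-flight check can catch a misconfigured body before it is sent. The following is a hedged sketch: `is_direct_path_request` is a hypothetical helper that only checks the fixed values this document lists (empty `text`, `agent_command`, `visual_qa`, `open_visual_qa`) and does not reproduce any server-side validation.

```python
def is_direct_path_request(body: dict) -> bool:
    """Return True if the body carries the fixed direct-path values."""
    # A non-empty text field would not match the documented direct-path setup.
    if body.get("input", {}).get("text") != "":
        return False
    commands = (
        body.get("parameters", {}).get("biz_params", {}).get("commands", [])
    )
    for command in commands:
        params = command.get("exec_params", {})
        if (
            command.get("name") == "agent_command"
            and params.get("app_id") == "visual_qa"
            and params.get("intent") == "open_visual_qa"
        ):
            return True
    return False
```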

Sample request

{
    "model": "multimodal-dialog",
    "input": {
        "directive": "Request",
        "app_id": "xxxxxx",
        "text": ""
    },
    "parameters": {
        "images": [
            {
                "type": "url",
                "value": "img_url"
            }
        ],
        "client_info": {
            "user_id": "test-251222",
            "device": {
                "uuid": "test-251222-123"
            }
        },
        "biz_params": {
            "commands": [
                {
                    "name": "agent_command",
                    "exec_params": {
                        "app_id": "visual_qa",
                        "intent": "open_visual_qa"
                    }
                }
            ]
        }
    }
}

curl request example

curl --location "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer your_api_key' \
  --header 'X-DashScope-SSE: enable' \
  --data '{
    "model": "multimodal-dialog",
    "input": {
      "directive": "Request",
      "app_id": "xxxxxx",
      "text": ""
    },
    "parameters": {
      "images": [
        {
          "type": "url",
          "value": "img_url"
        }
      ],
      "client_info": {
        "user_id": "test-251222",
        "device": {
          "uuid": "test-251222-123"
        }
      },
      "biz_params": {
        "commands": [
          {
            "name": "agent_command",
            "exec_params": {
              "app_id": "visual_qa",
              "intent": "open_visual_qa"
            }
          }
        ]
      }
    }
  }'

Important

  1. Replace your_api_key (your API key), the app_id value (xxxxxx), and img_url in the request with your own values.

  2. Set the text field to an empty string (""). Otherwise, the system triggers intent recognition.
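
The same call can be made from Python with only the standard library. This is a sketch, not an official client: it reuses the endpoint and headers from the curl example, the function names are hypothetical, and it assumes the streamed response can be read line by line as text.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
ENDPOINT = (
    "https://dashscope.aliyuncs.com/api/v1/services/aigc/"
    "multimodal-generation/generation"
)


def make_http_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
            "X-DashScope-SSE": "enable",  # ask for a streamed response
        },
        method="POST",
    )


def stream_lines(request: urllib.request.Request):
    """Send the request and yield response lines as they arrive."""
    with urllib.request.urlopen(request) as response:
        for raw in response:
            yield raw.decode("utf-8").rstrip("\r\n")
```

Separating request construction from sending keeps the first function testable offline; only `stream_lines` touches the network.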

Text response event fields

| Top-level parameter | Secondary parameter | Type | Required | Description |
| --- | --- | --- | --- | --- |
| output | event | string | Yes | Event name: RespondingContent |
| output | dialog_id | string | Yes | Dialog ID |
| output | round_id | string | Yes | ID of the current conversation round |
| output | llm_request_id | string | Yes | Request ID for the LLM call |
| output | text | string | Yes | Text output from the system. Streamed as incremental updates. |
| output | spoken | string | Yes | Text used for speech synthesis. Streamed as incremental updates. |
| output | finished | bool | Yes | Indicates whether output has finished |
| output | finish_reason | string | No | Reason for completion. Currently supports only one value: stop, which indicates normal completion. |
| output | extra_info | object | No | Additional extended information. Currently supports agent_info (agent information). See below. |
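
One way to work with these fields is to map each event onto a small record type. The sketch below is an assumption-laden illustration: the `RespondingContent` class and `parse_event` function are hypothetical names, and the input is assumed to be an already-decoded JSON message with the fields listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RespondingContent:
    dialog_id: str
    round_id: str
    llm_request_id: str
    text: str
    spoken: str
    finished: bool
    finish_reason: Optional[str] = None  # optional per the field table


def parse_event(message: dict) -> Optional[RespondingContent]:
    """Convert a decoded message to a record, or None for other events."""
    output = message.get("output", {})
    if output.get("event") != "RespondingContent":
        return None
    return RespondingContent(
        dialog_id=output["dialog_id"],
        round_id=output["round_id"],
        llm_request_id=output["llm_request_id"],
        text=output["text"],
        spoken=output["spoken"],
        finished=output["finished"],
        finish_reason=output.get("finish_reason"),
    )
```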

extra_info.agent_info parameters

| Top-level parameter | Secondary parameter | Type | Required | Description |
| --- | --- | --- | --- | --- |
| round | | string | Yes | Conversation round |
| device | device_id | string | Yes | The device.uuid used in the request |
| intent_infos | intent | string | Yes | Agent used. In this scenario, always "visual_qa". |
| intent_infos | domain | string | Yes | Agent used. In this scenario, always "visual_qa". |

Sample response

{
    "output": {
        "round_id": "xxx",
        "llm_request_id": "xxx",
        "extra_info": {
            "agent_info": {
                "round": 1,
                "device": {
                    "device_id": "test-251222-123"
                },
                "intent_infos": [
                    {
                        "intent": "visual_qa",
                        "domain": "visual_qa"
                    }
                ]
            },
            "query": ""
        },
        "dialog_id": "6c4340eb-xxx-4055-8d66-0794df7986e0",
        "spoken": "Hello",
        "finished": false,
        "text": "Hello",
        "event": "RespondingContent"
    },
    "request_id": "xxx"
}
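
A client typically needs the complete reply text and the dialog_id for follow-up requests. The sketch below makes two assumptions that this document does not confirm: that each server-sent event delivers its JSON on a line starting with `data:`, and that each event's `text` holds only the new fragment, so fragments are concatenated. If your responses carry the full text so far in every event, keep only the last value instead. The function name `collect_reply` is hypothetical.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_reply(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Return (full_text, dialog_id) from an iterable of SSE lines."""
    pieces = []
    dialog_id = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip id:, event:, comment, and blank separator lines
        message = json.loads(line[len("data:"):].strip())
        output = message.get("output", {})
        if output.get("event") != "RespondingContent":
            continue
        dialog_id = output.get("dialog_id", dialog_id)
        # ASSUMPTION: text is a delta, so fragments are appended in order.
        pieces.append(output.get("text", ""))
        if output.get("finished"):
            break
    return "".join(pieces), dialog_id
```

Passing the returned dialog_id as `input.dialog_id` in the next request continues the same dialog, as described in the request parameters.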