Connect to the photo-based Q&A agent using the HTTP protocol

Terminology

Direct path: A request path that bypasses nodes such as automatic speech recognition (ASR), intent recognition, and text-to-speech (TTS). Instead, it sends the request directly to the agent and returns the agent's response.

Scenarios

You can send images directly to the agent for analysis without intent recognition or voice processing. This approach is ideal for camera-equipped products that require image understanding, such as photo-based learning devices or educational tablets with visual question-answering features.

Prerequisites

Activate Alibaba Cloud Model Studio and obtain an API key

For more information, see Obtain an API key. The API key is your authentication credential for Alibaba Cloud Model Studio.

Enable the photo-based Q&A direct path in the console

In the Multimodal Development Suite, create a multimodal interactive application. Select the Full-Featured template (do not select the Vision-only template) and disable interactive voice response.

⚠️ Note: Only disable interactive voice response. Keep intent recognition and text model configurations enabled.

Disable conversation carryover, knowledge base, web search, and long-term memory configurations.

Clear all skill configurations and keep only the photo-based Q&A agent in the agent configuration.

Configure the photo-based Q&A agent. Leave the launch instruction empty. We recommend that you select the “Balanced Visual Understanding” model. You can customize the prompt as needed.

For testing image description scenarios, you can use the following prompt. This prompt is for demonstration purposes only and should be customized for your scenario.

You are a professional image analysis and description assistant. Generate a concise, accurate, and informative textual description based on the image content. Ensure your description covers the following points:
Main objects and scene: Clearly identify core elements in the image, including their position, color, shape, quantity, and scene type (e.g., indoor, outdoor, urban, natural).
Human characteristics and actions (if any): Describe clothing, facial expressions, posture, actions, interactions, and possible roles or identities.
Text information (if any): Describe any visible text, its font style, and its meaning or purpose (e.g., signs, advertisements, titles).
Environment and atmosphere: Describe background details (e.g., weather, lighting, season), overall color tone, and the emotion or story conveyed by the image.
Perspective and composition: Optionally note the shooting angle (top-down, eye-level, low-angle), depth of field, and focal point.
Do not include subjective opinions. Provide only objective, observable facts.
Description requirements:
Use clear and concise Chinese. Avoid wordiness and vague terms.
Highlight key information while covering relevant details.
If clear relationships exist among elements (e.g., cause-effect, action sequences), describe them explicitly.
Avoid personal speculation or emotional judgments (unless the atmosphere is unmistakable).

After you complete the configuration, click Publish in the upper-right corner. You must publish the application before you can test it.

Connect using the HTTP protocol

Request parameters

Top-level parameter	Secondary parameters	Type	Required	Description
model		string	Yes	Alibaba Cloud Model Studio model name. Always use "multimodal-dialog". Copy this value directly.
input	directive	string	Yes	Directive name: Request
	app_id	string	Yes	Your application ID (see Obtain an app ID). You can find it on the My Applications page in the Multimodal Interactive Development Suite console.
	dialog_id	string	No	Dialog ID. If left blank, a new dialog starts. The server generates a dialog ID automatically and returns it in the response. Example format: "12345678-1234-1234-1234-1234567890ab" (36 characters total). To continue a previous dialog, pass the dialog_id previously returned by the server.
	text	string	Yes	Text to process. In this scenario, leave it as an empty string ("").
parameters	client_info	object	Yes	See the table below for parameters under parameters.client_info.
	images	list[]	No	Image data for analysis. Only multimodal applications support image-based Q&A. See the table below for parameters under parameters.images.
	biz_params	object	No	Configure as needed. See the table below for parameters under parameters.biz_params.

parameters.client_info parameters

Top-level parameter	Secondary parameters	Type	Required	Description
user_id		string	Yes	End user ID. Generate this ID according to your business rules to enable customized features per end user. Maximum length: 36 characters.
device	uuid	string	No	Globally unique client device ID. Generate and pass this ID into the SDK yourself. Maximum length: 40 characters. One end user can have multiple devices, each with a different uuid but the same user_id.

parameters.images parameters

Top-level parameter	Type	Required	Description
type	string	Yes	Image type. Supported values: base64 or url.
value	string	Yes	Image content. If type is base64, this field contains the image’s base64-encoded string. If type is url, this field contains the image’s URL.
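As an illustration of the two image types, the following Python sketch builds one element of parameters.images from either a URL or a local file. The helper name image_entry is hypothetical. The sketch also assumes that the base64 value is the plain encoded string with no data-URI prefix, which this page does not specify.

```python
import base64
from pathlib import Path


def image_entry(source: str) -> dict:
    """Return one element for the parameters.images list.

    Strings starting with http:// or https:// are sent as type "url".
    Anything else is treated as a local file path and base64-encoded.
    Assumption: the service accepts the bare base64 string (no data-URI prefix).
    """
    if source.startswith(("http://", "https://")):
        return {"type": "url", "value": source}
    raw = Path(source).read_bytes()
    return {"type": "base64", "value": base64.b64encode(raw).decode("ascii")}
```

For example, image_entry("https://example.com/photo.jpg") yields a url entry, while image_entry("photo.jpg") reads the file and yields a base64 entry.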

parameters.biz_params parameters

Top-level parameter	Secondary parameters	Type	Required	Description
commands[i]	name	string	Yes	Indicates direct routing to the agent. In this scenario, always use "agent_command". Copy this value directly.
	exec_params	object	Yes	Specifies the target agent. In this scenario, this value is fixed. See the commands.exec_params parameter table below for details.

commands.exec_params parameters

Top-level parameter	Secondary parameters	Type	Required	Description
exec_params	app_id	string	Yes	Specifies the target agent. In this scenario, always use "visual_qa". Copy this value directly.
	intent	string	Yes	Specifies the agent action. In this scenario, always use "open_visual_qa". Copy this value directly.
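The parameter tables above can be assembled into a request body programmatically. The following Python sketch is one way to do so. The function name build_payload and its argument names are hypothetical. The fixed values ("multimodal-dialog", "Request", "agent_command", "visual_qa", "open_visual_qa") and the length limits are taken from the tables.

```python
from typing import Optional


def build_payload(app_id: str, images: list, user_id: str,
                  device_uuid: Optional[str] = None,
                  dialog_id: Optional[str] = None) -> dict:
    """Assemble the JSON body for a direct-path photo Q&A request."""
    # Length limits documented for parameters.client_info.
    if len(user_id) > 36:
        raise ValueError("user_id must be at most 36 characters")
    if device_uuid is not None and len(device_uuid) > 40:
        raise ValueError("device uuid must be at most 40 characters")

    client_info = {"user_id": user_id}
    if device_uuid is not None:
        client_info["device"] = {"uuid": device_uuid}

    # text stays empty so that the request is not sent to intent recognition.
    request_input = {"directive": "Request", "app_id": app_id, "text": ""}
    if dialog_id:
        # Only sent when continuing a dialog the server created earlier.
        request_input["dialog_id"] = dialog_id

    return {
        "model": "multimodal-dialog",
        "input": request_input,
        "parameters": {
            "images": images,
            "client_info": client_info,
            "biz_params": {
                "commands": [{
                    "name": "agent_command",
                    "exec_params": {
                        "app_id": "visual_qa",
                        "intent": "open_visual_qa",
                    },
                }]
            },
        },
    }
```

Omitting dialog_id starts a new dialog. Passing the dialog_id returned by an earlier response continues that dialog.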

Sample request

{
    "model": "multimodal-dialog",
    "input": {
        "directive": "Request",
        "app_id": "xxxxxx",
        "text": ""
    },
    "parameters": {
        "images": [
            {
                "type": "url",
                "value": "img_url"
            }
        ],
        "client_info": {
            "user_id": "test-251222",
            "device": {
                "uuid": "test-251222-123"
            }
        },
        "biz_params": {
            "commands": [
                {
                    "name": "agent_command",
                    "exec_params": {
                        "app_id": "visual_qa",
                        "intent": "open_visual_qa"
                    }
                }
            ]
        }
    }
}

curl request example

curl --location "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer your_api_key' \
  --header 'X-DashScope-SSE: enable' \
  --data '{
    "model": "multimodal-dialog",
    "input": {
      "directive": "Request",
      "app_id": "xxxxxx",
      "text": ""
    },
    "parameters": {
      "images": [
        {
          "type": "url",
          "value": "img_url"
        }
      ],
      "client_info": {
        "user_id": "test-251222",
        "device": {
          "uuid": "test-251222-123"
        }
      },
      "biz_params": {
        "commands": [
          {
            "name": "agent_command",
            "exec_params": {
              "app_id": "visual_qa",
              "intent": "open_visual_qa"
            }
          }
        ]
      }
    }
  }'

Important

Replace your_api_key with your API key, the app_id placeholder (xxxxxx) with your application ID, and img_url with the URL of your image.
Set the text field to an empty string (""). Otherwise, the system will trigger intent recognition.
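For clients that do not use curl, the same call can be sketched with the Python standard library. The endpoint and headers below are copied from the curl example. The function names make_request and iter_response_lines are hypothetical. The sketch also assumes that the streamed response can be read line by line as UTF-8 text.

```python
import json
import urllib.request

ENDPOINT = ("https://dashscope.aliyuncs.com/api/v1/services/aigc/"
            "multimodal-generation/generation")


def make_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build the POST request without sending it."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
            "X-DashScope-SSE": "enable",  # request a streamed response
        },
        method="POST",
    )


def iter_response_lines(request: urllib.request.Request, timeout: float = 60.0):
    """Send the request and yield the response body one text line at a time."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        for raw in response:
            yield raw.decode("utf-8").rstrip("\r\n")
```

A caller would pass the JSON body from the sample request as payload, then iterate over iter_response_lines(make_request(api_key, payload)).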

Text response event fields

Top-level parameter	Secondary parameters	Type	Required	Description
output	event	string	Yes	Event name: RespondingContent
	dialog_id	string	Yes	Dialog ID
	round_id	string	Yes	ID of the current conversation round
	llm_request_id	string	Yes	Request ID for the LLM call
	text	string	Yes	Text output from the system. Streamed as incremental updates.
	spoken	string	Yes	Text used for speech synthesis. Streamed as incremental updates.
	finished	bool	Yes	Indicates whether output has finished
	finish_reason	string	No	Reason for completion. Currently, the only supported value is "stop", which indicates normal completion.
	extra_info	object	No	Additional extended information. Currently, the only supported field is agent_info (agent information). See below.

extra_info.agent_info parameters

Top-level parameter	Secondary parameters	Type	Required	Description
round		string	Yes	Conversation round
device	device_id	string	Yes	The device.uuid used in the request
intent_infos	intent	string	Yes	Agent used. In this scenario, always "visual_qa".
	domain	string	Yes	Domain of the agent used. In this scenario, always "visual_qa".

Sample response

{
    "output": {
        "round_id": "xxx",
        "llm_request_id": "xxx",
        "extra_info": {
            "agent_info": {
                "round": 1,
                "device": {
                    "device_id": "test-251222-123"
                },
                "intent_infos": [
                    {
                        "intent": "visual_qa",
                        "domain": "visual_qa"
                    }
                ]
            },
            "query": ""
        },
        "dialog_id": "6c4340eb-xxx-4055-8d66-0794df7986e0",
        "spoken": "Hello",
        "finished": false,
        "text": "Hello",
        "event": "RespondingContent"
    },
    "request_id": "xxx"
}
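Because text is streamed incrementally, a client typically concatenates the fragments until finished is true. The following Python sketch shows one way to do that. It rests on two assumptions that this page does not confirm: that the stream uses standard SSE framing, where each event's JSON arrives on a line starting with "data:", and that each event carries only the new text fragment rather than the full text so far. The function name collect_answer is hypothetical.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_answer(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Join the text fragments of RespondingContent events.

    Returns (full_text, dialog_id). The dialog_id can be sent in a later
    request to continue the same dialog.
    """
    parts = []
    dialog_id = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip other SSE fields and blank separator lines
        event = json.loads(line[len("data:"):].strip())
        output = event.get("output", {})
        if output.get("event") != "RespondingContent":
            continue
        dialog_id = output.get("dialog_id", dialog_id)
        parts.append(output.get("text", ""))
        if output.get("finished"):
            break  # the final event of this round has arrived
    return "".join(parts), dialog_id
```

If the service turns out to send the full text so far in each event, replace the concatenation with the last text value received.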