GUI-Plus 可基于屏幕截图和自然语言指令来解析用户意图,并转换为标准化的图像用户界面(GUI)操作(如点击、输入、滚动等),供外部系统决策或执行。相较于千问VL系列模型,提升了GUI操作的准确性。
支持的模型
模型名称 | 模式 | 上下文长度 | 最大输入 | 最大思维链长度 | 最大回复长度 | 输入成本 | 输出成本 | 免费额度 |
(Token数) | (每百万Token) | |||||||
gui-plus | 非思考模式 | 256,000 | 254,976 单图最大16384 | - | 32,768 | 1.5元 | 4.5元 | 各100万Token 有效期:百炼开通后90天内 |
gui-plus-2026-02-26 | 思考模式 | 262,144 | 258,048 单图最大16384 | 81,920 | 32,768 | |||
非思考模式 | 260,096 单图最大16384 | - | 32,768 | |||||
gui-plus-2026-02-26模型能力全面升级,支持思考与非思考模式,相较于gui-plus模型,gui-plus-2026-02-26模型在处理跨平台、多 APP 任务的效果得到大幅提升。推荐优先使用该模型。
快速开始
本节将演示如何快速发起 GUI-Plus 模型调用,获取执行 GUI 任务的指令。关于如何将指令转换为实际的 GUI 操作并执行,请参阅后文的如何使用章节。
前提条件
如果通过 SDK 进行调用,需安装最新版SDK。
推荐 System Prompt
System Prompt 可定义模型角色、能力和输出规范等,推荐gui-plus-2026-02-26模型使用以下系统提示词,否则会影响模型输出结果。
gui-plus和gui-plus-2026-02-26的系统提示词不可共用,gui-plus的系统提示词请参见GUI-Plus模型推荐提示词。
电脑端 System Prompt
"""# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""手机端 System Prompt
'''# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name_for_human": "mobile_use", "name": "mobile_use", "description": "Use a touchscreen to interact with a mobile device, and take screenshots.
* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.
* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.
* The screen's resolution is 1000x1000.
* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:
* `key`: Perform a key event on the mobile device.
- This supports adb's `keyevent` syntax.
- Examples: "volume_up", "volume_down", "power", "camera", "clear".
* `click`: Click the point on the screen with coordinate (x, y).
* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.
* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).
* `type`: Input the specified text into the activated input box.
* `system_button`: Press the system button.
* `open`: Open an app on the device.
* `wait`: Wait specified seconds for the change to happen.
* `answer`: Terminate the current task and output the answer.
* `interact`: Resolve the blocking window by interacting with the user.
* `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "click", "long_press", "swipe", "type", "system_button", "open", "wait", "answer", "interact", "terminate"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.", "type": "array"}, "coordinate2": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.", "type": "array"}, "text": {"description": "Required only by `action=key`, `action=type`, `action=open`, `action=answer`,and `action=interact`.", "type": "string"}, "time": {"description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.", "type": "number"}, "button": {"description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`", "enum": ["Back", "Home", "Menu", "Enter"], "type": "string"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call.'''OpenAI兼容
Python
import os
from openai import OpenAI
system_prompt = """# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""
messages = [
{
"role": "system",
"content": system_prompt
},
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"}},
{"type": "text", "text": "帮我打开浏览器"}
]
}
]
client = OpenAI(
# 若没有配置环境变量,请用阿里云百炼API Key将下行替换为:api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="gui-plus-2026-02-26",
messages=messages,
extra_body={"vl_high_resolution_images": True}
)
print(completion.choices[0].message.content)Node.js
import OpenAI from "openai";
const systemPrompt = `# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* \`key\`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* \`type\`: Type a string of text on the keyboard.\\n* \`mouse_move\`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* \`left_click\`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`left_click_drag\`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* \`right_click\`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`middle_click\`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`double_click\`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`triple_click\`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* \`scroll\`: Performs a scroll of the mouse scroll wheel.\\n* \`hscroll\`: Performs a horizontal scroll (mapped to regular scroll).\\n* \`wait\`: Wait specified seconds for the change to happen.\\n* \`terminate\`: Terminate the current task and report its completion status.\\n* \`answer\`: Answer a question.\\n* \`interact\`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by \`action=key\`.", "type": "array"}, "text": {"description": "Required only by \`action=type\`, \`action=answer\` and \`action=interact\`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by \`action=mouse_move\` and \`action=left_click_drag\`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by \`action=scroll\` and \`action=hscroll\`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by \`action=wait\`.", "type": "number"}, "status": {"description": "The status of the task. Required only by \`action=terminate\`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call.`;
const client = new OpenAI({
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});
const messages = [
{
role: "system",
content: systemPrompt,
},
{
role: "user",
content: [
{
type: "image_url",
image_url: {
url: "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png",
},
},
{ type: "text", text: "帮我打开浏览器" },
],
},
];
const completion = await client.chat.completions.create({
model: "gui-plus-2026-02-26",
messages: messages,
extra_body: { vl_high_resolution_images: true },
});
console.log(completion.choices[0].message.content);curl
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gui-plus-2026-02-26",
"messages": [
{
"role": "system",
"content": "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"computer_use\", \"description\": \"Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn'\''t open, try wait and taking another screenshot.\\n* The screen'\''s resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don'\''t click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it'\''s the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.\", \"enum\": [\"key\", \"type\", \"mouse_move\", \"left_click\", \"left_click_drag\", \"right_click\", \"middle_click\", \"double_click\", \"triple_click\", \"scroll\", \"hscroll\", \"wait\", \"terminate\", \"answer\", \"interact\"], \"type\": \"string\"}, \"keys\": {\"description\": \"Required only by `action=key`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=type`, `action=answer` and `action=interact`.\", \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.\", \"type\": \"array\"}, \"pixels\": {\"description\": \"The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.\", \"type\": \"number\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=wait`.\", \"type\": \"number\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Action: a short imperative describing what to do in the UI.\n2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Action, <tool_call>.\n- Be brief: one for Action.\n- Do not output anything else outside those two parts.\n- If finishing, use action=terminate in the tool call."
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"
}
},
{
"type": "text",
"text": "帮我打开浏览器"
}
]
}
],
"vl_high_resolution_images": true
}'DashScope
Python
import os
import dashscope
system_prompt = """# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""
messages = [
{
"role": "system",
"content": system_prompt
},
{
"role": "user",
"content": [
{"image": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"},
{"text": "帮我打开浏览器。"}]
}]
response = dashscope.MultiModalConversation.call(
# 若没有配置环境变量, 请用百炼API Key将下行替换为: api_key = "sk-xxx"
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='gui-plus-2026-02-26',
messages=messages,
vl_high_resolution_images=True
)
print(response.output.choices[0].message.content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
String systemPrompt = "# Tools\n\n" +
"You may call one or more functions to assist with the user query.\n\n" +
"You are provided with function signatures within <tools></tools> XML tags:\n" +
"<tools>\n" +
"{\"type\": \"function\", \"function\": {\"name\": \"computer_use\", \"description\": \"Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.\", \"enum\": [\"key\", \"type\", \"mouse_move\", \"left_click\", \"left_click_drag\", \"right_click\", \"middle_click\", \"double_click\", \"triple_click\", \"scroll\", \"hscroll\", \"wait\", \"terminate\", \"answer\", \"interact\"], \"type\": \"string\"}, \"keys\": {\"description\": \"Required only by `action=key`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=type`, `action=answer` and `action=interact`.\", \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.\", \"type\": \"array\"}, \"pixels\": {\"description\": \"The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.\", \"type\": \"number\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=wait`.\", \"type\": \"number\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n" +
"</tools>\n\n" +
"For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n" +
"<tool_call>\n" +
"{\"name\": <function-name>, \"arguments\": <args-json-object>}\n" +
"</tool_call>\n\n" +
"# Response format\n\n" +
"Response format for every step:\n" +
"1) Action: a short imperative describing what to do in the UI.\n" +
"2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\n" +
"Rules:\n" +
"- Output exactly in the order: Action, <tool_call>.\n" +
"- Be brief: one for Action.\n" +
"- Do not output anything else outside those two parts.\n" +
"- If finishing, use action=terminate in the tool call.";
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage systemMsg = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
.content(Arrays.asList(
Collections.singletonMap("text",systemPrompt))).build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("image", "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"),
Collections.singletonMap("text", "帮我打开浏览器。"))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// 若没有配置环境变量,请用百炼API Key将下行替换为:.apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("gui-plus-2026-02-26")
.messages(Arrays.asList(systemMsg,userMessage))
.vlHighResolutionImages(true)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}curl
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gui-plus-2026-02-26",
"input": {
"messages": [
{
"role": "system",
"content": [
{
"text": "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"computer_use\", \"description\": \"Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn'\''t open, try wait and taking another screenshot.\\n* The screen'\''s resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don'\''t click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it'\''s the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.\", \"enum\": [\"key\", \"type\", \"mouse_move\", \"left_click\", \"left_click_drag\", \"right_click\", \"middle_click\", \"double_click\", \"triple_click\", \"scroll\", \"hscroll\", \"wait\", \"terminate\", \"answer\", \"interact\"], \"type\": \"string\"}, \"keys\": {\"description\": \"Required only by `action=key`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=type`, `action=answer` and `action=interact`.\", \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.\", \"type\": \"array\"}, \"pixels\": {\"description\": \"The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.\", \"type\": \"number\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=wait`.\", \"type\": \"number\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Action: a short imperative describing what to do in the UI.\n2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Action, <tool_call>.\n- Be brief: one for Action.\n- Do not output anything else outside those two parts.\n- If finishing, use action=terminate in the tool call."
}
]
},
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"
},
{
"text": "帮我打开浏览器"
}
]
}
]
},
"parameters": {
"vl_high_resolution_images": true
}
}'如何使用
电脑 GUI 任务
本示例适用于Windows操作系统,若在Mac/Linux 环境下系统,需修改ComputerTools类中的系统命令。如返回桌面操作,Windows 系统使用Win+D,Mac 系统使用Command+F3。
步骤1. 构造 System Prompt
system_prompt = """# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""以上系统提示词要求模型:
假设屏幕分辨率为 1000×1000(归一化坐标系)
输出格式严格:先输出动作(
Action)的描述,然后输出<tool_call>块支持的操作类型:点击、拖拽、输入、滚动、按键等
步骤2. 构造多轮对话消息
在 GUI 自动化任务中,模型需要基于历史操作上下文做出决策。为了让模型理解当前任务进度并生成合理的下一步操作,模型采用以下策略构造多轮对话消息:
仅保留最近 N 轮(默认 4 轮)的完整对话(截图 + 模型输出),避免模型上下文过长导致的性能下降
对更早的历史操作,仅保留文本摘要(模型输出的动作(
Action)部分),不包含截图,节省token消耗
def get_messages(image, instruction, history_output, model_name, system_prompt):
"""
构造多轮对话消息
参数:
image: 当前截图路径
instruction: 用户指令
history_output: 历史对话记录 [{"output": "...", "image": "..."}]
model_name: 模型名称
"""
history_n = 4 # 保留最近4轮历史
current_step = len(history_output)
# 构造历史操作摘要
history_start_idx = max(0, current_step - history_n)
previous_actions = []
for i in range(history_start_idx):
if i < len(history_output):
history_output_str = history_output[i]['output']
if 'Action:' in history_output_str and '<tool_call>':
history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
previous_actions.append(f"Step {i + 1}: {history_output_str}")
previous_actions_str = "\\n".join(previous_actions) if previous_actions else "None"
instruction_prompt = f"""
Please generate the next move according to the UI screenshot, instruction and previous actions.
Instruction: {instruction}
Previous actions:
{previous_actions_str}"""
# 构造 messages 数组
messages = [
{
"role": "system",
"content": [{"text": system_prompt}],
}
]
history_len = min(history_n, len(history_output))
if history_len > 0:
# 添加历史对话
for history_id, history_item in enumerate(history_output[-history_n:], 0):
if history_id == 0:
messages.append({
"role": "user",
"content": [
{"text": instruction_prompt},
{"image": "file://" + history_item['image']}
]
})
else:
messages.append({
"role": "user",
"content": [{"image": "file://" + history_item['image']}]
})
messages.append({
"role": "assistant",
"content": [{"text": history_item['output']}],
})
# 添加当前截图
messages.append({
"role": "user",
"content": [{"image": "file://" + image}]
})
else:
# 首轮对话
messages.append({
"role": "user",
"content": [
{"text": instruction_prompt},
{"image": "file://" + image}
]
})
return messagesGUI模型的多轮对话的message数组示例如下(以7轮对话为例)
model_input
[{
"role": "system",
"content": [{
"text": "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name_for_human\": \"mobile_use\", \"name\": \"mobile_use\", \"description\": \"Use a touchscreen to interact with a mobile device, and take screenshots.\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\n* `key`: Perform a key event on the mobile device.\n - This supports adb's `keyevent` syntax.\n - Examples: \"volume_up\", \"volume_down\", \"power\", \"camera\", \"clear\".\n* `click`: Click the point on the screen with coordinate (x, y).\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).\n* `type`: Input the specified text into the activated input box.\n* `system_button`: Press the system button.\n* `open`: Open an app on the device.\n* `wait`: Wait specified seconds for the change to happen.\n* `answer`: Terminate the current task and output the answer.\n* `interact`: Resolve the blocking window by interacting with the user.\n* `terminate`: Terminate the current task and report its completion status.\", \"enum\": [\"key\", \"click\", \"long_press\", \"swipe\", \"type\", \"system_button\", \"open\", \"wait\", \"answer\", \"interact\", \"terminate\"], \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.\", \"type\": \"array\"}, \"coordinate2\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=key`, `action=type`, `action=open`, `action=answer`,and `action=interact`.\", \"type\": \"string\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=long_press` and `action=wait`.\", \"type\": \"number\"}, \"button\": {\"description\": \"Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`\", \"enum\": [\"Back\", \"Home\", \"Menu\", \"Enter\"], \"type\": \"string\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}, \"args_format\": \"Format the arguments as a JSON object.\"}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Action: a short imperative describing what to do in the UI.\n2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Action, <tool_call>.\n- Be brief: one for Action.\n- Do not output anything else outside those two parts.\n- If finishing, use action=terminate in the tool call."
}]
}, {
"role": "user",
"content": [{
"text": "\nPlease generate the next move according to the UI screenshot, instruction and previous actions.\n\nInstruction: 帮我在携程搜一下今天济南喜来登酒店的价格\n\nPrevious actions:\nStep 1: 点击携程旅行应用图标以启动携程旅行预订应用程序。\nStep 2: 等待促销启动画面自动过渡到携程主应用界面。"
}, {
"image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_a84122ac_853a630315784b64988492c9c07b5534.png"
}]
}, {
"role": "assistant",
"content": [{
"text": "Action: 点击应用更新通知弹窗右上角的关闭按钮(X图标)以将其关闭。\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [789, 280]}}\n</tool_call>"
}]
}, {
"role": "user",
"content": [{
"image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_6010a769_089b9b35b1904913bd5df492563b02b9.png"
}]
}, {
"role": "assistant",
"content": [{
"text": "Action: 点击搜索栏中的“济南的酒店”文本区域,以激活搜索输入框并准备修改搜索词。\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [112, 134]}}\n</tool_call>"
}]
}, {
"role": "user",
"content": [{
"image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_47446db4_fd4a5022002c4db99f110d5c7261fea2.png"
}]
}, {
"role": "assistant",
"content": [{
"text": "Action: 点击显示“厦门”的位置字段,将搜索位置从厦门更改为济南。\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [156, 347]}}\n</tool_call>"
}]
}, {
"role": "user",
"content": [{
"image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_3832132c_8c55861c1716467e802a3554402f3580.png"
}]
}, {
"role": "assistant",
"content": [{
"text": "Action: 在搜索输入框中键入“济南”,以指定酒店搜索的城市位置。\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"type\", \"text\": \"济南\"}}\n</tool_call>"
}]
}, {
"role": "user",
"content": [{
"image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_ff247bac_39c3e20be32c4baf8677a2b6b61bc021.png"
}]
}] 步骤3. 解析模型输出
由于模型在处理图像时会进行内部缩放,其返回的坐标是基于缩放后图像的归一化坐标。为在原图上准确执行GUI操作,需要进行坐标映射。
提取 Tool Call 字段
首先从模型返回的字符串中提取Tool Call:
import re import json def extract_tool_calls(text): """ 从模型输出中提取所有 <tool_call> 块 参数: text: 模型返回的文本 返回: actions: 解析后的操作列表 """ pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE) blocks = pattern.findall(text) actions = [] for blk in blocks: blk = blk.strip() try: actions.append(json.loads(blk)) except json.JSONDecodeError as e: print(f'解析失败: {e} | 片段: {blk[:80]}...') return actions坐标映射函数
模型处理图像时会进行内部缩放,以下函数用于计算缩放后的尺寸:
import math from PIL import Image def smart_resize(height, width, factor=32, min_pixels=32*32*4, max_pixels=32*32*1280, max_long_side=8192): """ 计算模型内部缩放后的图像尺寸 参数: height: 原始图像高度 width: 原始图像宽度 factor: 分辨率因子(固定为 16) min_pixels: 最小像素值 max_pixels: 最大像素值 max_long_side: 最长边限制 返回: (h_bar, w_bar): 缩放后的高度和宽度 """ def round_by_factor(number, factor): return round(number / factor) * factor def ceil_by_factor(number, factor): return math.ceil(number / factor) * factor def floor_by_factor(number, factor): return math.floor(number / factor) * factor if height < 2 or width < 2: raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}") elif max(height, width) / min(height, width) > 200: raise ValueError(f"absolute aspect ratio must be smaller than 200, got {height} / {width}") # 限制最长边 if max(height, width) > max_long_side: beta = max(height, width) / max_long_side height, width = int(height / beta), int(width / beta) # 计算缩放后的尺寸 h_bar = round_by_factor(height, factor) w_bar = round_by_factor(width, factor) if h_bar * w_bar > max_pixels: beta = math.sqrt((height * width) / max_pixels) h_bar = floor_by_factor(height / beta, factor) w_bar = floor_by_factor(width / beta, factor) elif h_bar * w_bar < min_pixels: beta = math.sqrt(min_pixels / (height * width)) h_bar = ceil_by_factor(height * beta, factor) w_bar = ceil_by_factor(width * beta, factor) return h_bar, w_bar
步骤4. 执行GUI操作
解析动作指令后,接下来演示如何使用pyautogui库模拟用户的鼠标点击、键盘输入、滚动等物理 GUI 操作。
import pyautogui
import pyperclip
import time
from PIL import Image
import os
class ComputerTools:
"""电脑端 GUI 操作工具类"""
def __init__(self):
self.image_info = None
def load_image_info(self, path):
"""加载图像尺寸信息"""
width, height = Image.open(path).size
self.image_info = (width, height)
def get_screenshot(self, image_path, retry_times=3):
"""获取桌面截图"""
if os.path.exists(image_path):
os.remove(image_path)
for i in range(retry_times):
screenshot = pyautogui.screenshot()
screenshot.save(image_path)
if os.path.exists(image_path):
self.load_image_info(image_path)
return True
else:
time.sleep(0.1)
return False
def reset(self):
"""显示桌面"""
pyautogui.hotkey('win', 'd')
def press_key(self, keys):
"""按键操作"""
if isinstance(keys, list):
cleaned_keys = []
for key in keys:
if isinstance(key, str):
# 处理键名格式
if key.startswith("keys=["):
key = key[6:]
if key.endswith("]"):
key = key[:-1]
if key.startswith("['") or key.startswith('["'):
key = key[2:] if len(key) > 2 else key
if key.endswith("']") or key.endswith('"]'):
key = key[:-2] if len(key) > 2 else key
key = key.strip()
# 转换键名
key_map = {
"arrowleft": "left",
"arrowright": "right",
"arrowup": "up",
"arrowdown": "down"
}
key = key_map.get(key, key)
cleaned_keys.append(key)
else:
cleaned_keys.append(key)
keys = cleaned_keys
else:
keys = [keys]
if len(keys) > 1:
pyautogui.hotkey(*keys)
else:
pyautogui.press(keys[0])
def type(self, text):
"""输入文本(使用剪贴板方式支持中文)"""
pyperclip.copy(text)
pyautogui.keyDown('ctrl')
pyautogui.keyDown('v')
pyautogui.keyUp('v')
pyautogui.keyUp('ctrl')
def mouse_move(self, x, y):
"""移动鼠标到指定坐标"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.moveTo(x, y)
def left_click(self, x, y):
"""左键点击"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.click()
def left_click_drag(self, x, y):
"""从当前位置拖拽到指定坐标"""
pyautogui.dragTo(x, y, duration=0.5)
pyautogui.moveTo(x, y)
def right_click(self, x, y):
"""右键点击"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.rightClick()
def middle_click(self, x, y):
"""中键点击"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.middleClick()
def double_click(self, x, y):
"""双击"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.doubleClick()
def triple_click(self, x, y):
"""三击"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.tripleClick()
def scroll(self, pixels):
"""滚轮滚动"""
pyautogui.scroll(pixels)步骤5:完整自动化流程
将以上所有步骤整合到一个完整的自动化流程中,循环执行截图 → 模型推理 → 执行GUI操作,直到任务完成。
import os
import dashscope
import time
def run_gui_automation(instruction, max_step=30):
"""
运行完整的 GUI 自动化流程
参数:
instruction: 用户指令
max_step: 最大执行步骤数
"""# 配置 API
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
model_name = 'gui-plus-2026-02-26'# 初始化工具
computer_tools = ComputerTools()
computer_tools.reset() # 显示桌面# 创建输出目录
output_dir = os.path.join(os.path.expanduser("~"), "Desktop", "gui_automation")
os.makedirs(output_dir, exist_ok=True)
# 对话历史
history = []
stop_flag = Falseprint(f"[任务] {instruction}")
print("=" * 60)
for step_id in range(max_step):
if stop_flag:
breakprint(f"\n[步骤 {step_id + 1}]")
# 1. 截图
screen_shot = os.path.join(output_dir, f'screenshot_{step_id}.png')
computer_tools.get_screenshot(screen_shot)
# 2. 构造消息
messages = get_messages(screen_shot, instruction, history, model_name)
# 3. 调用模型
response = dashscope.MultiModalConversation.call(
model=model_name,
messages=messages,
vl_high_resolution_images=True,
stream=False
)
output_text = response.output.choices[0].message.content[0]['text']
print(f"[模型输出]\n{output_text}\n")
# 4. 解析操作
action_list = extract_tool_calls(output_text)
if not action_list:
print("未提取到有效操作")
break# 5. 执行操作for action_id, action in enumerate(action_list):
action_parameter = action['arguments']
action_type = action_parameter['action']
# 获取图像尺寸用于坐标映射
dummy_image = Image.open(screen_shot)
resized_height, resized_width = smart_resize(
dummy_image.height,
dummy_image.width,
factor=16,
min_pixels=3136,
max_pixels=1003520 * 200
)
# 映射坐标(从归一化坐标 1000x1000 映射到实际尺寸)for key in ['coordinate', 'coordinate1', 'coordinate2']:
if key in action_parameter:
action_parameter[key][0] = int(action_parameter[key][0] / 1000 * resized_width)
action_parameter[key][1] = int(action_parameter[key][1] / 1000 * resized_height)
# 执行对应操作if action_type in ['click', 'left_click']:
computer_tools.left_click(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ 左键点击 ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")
elif action_type == 'mouse_move':
computer_tools.mouse_move(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ 移动鼠标到 ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")
elif action_type == 'middle_click':
computer_tools.middle_click(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ 中键点击")
elif action_type in ['right click', 'right_click']:
computer_tools.right_click(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ 右键点击")
elif action_type in ['key', 'hotkey']:
computer_tools.press_key(action_parameter['keys'])
print(f"✓ 按键 {action_parameter['keys']}")
elif action_type == 'type':
text = action_parameter['text']
computer_tools.type(text)
print(f"✓ 输入文本: {text}")
elif action_type == 'drag':
computer_tools.left_click_drag(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ 拖拽到 ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")
elif action_type == 'scroll':
if 'coordinate' in action_parameter:
computer_tools.mouse_move(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
computer_tools.scroll(action_parameter.get("pixels", 1))
print(f"✓ 滚动 {action_parameter.get('pixels', 1)} 像素")
elif action_type in ['computer_double_click', 'double_click']:
computer_tools.double_click(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ 双击")
elif action_type == 'wait':
time.sleep(action_parameter.get('time', 2))
print(f"✓ 等待 {action_parameter.get('time', 2)} 秒")
elif action_type == 'answer':
print(f"✓ 任务完成: {action_parameter.get('text', '')}")
stop_flag = Truebreakelif action_type in ['stop', 'terminate', 'done']:
print(f"✓ 任务终止: {action_parameter.get('status', 'success')}")
stop_flag = Truebreakelse:
print(f"未知操作类型: {action_type}")
# 6. 保存历史
history.append({
'output': output_text,
'image': screen_shot
})
time.sleep(2) # 操作间隔print("\n" + "=" * 60)
print(f"[完成] 共执行 {len(history)} 步")
# 使用示例
if __name__ == '__main__':
run_gui_automation(
instruction='帮我打开chrome,在百度中搜索阿里巴巴',
max_step=30
)手机端 GUI 任务
手机端通过 ADB(Android Debug Bridge)工具实现自动化操作。
环境准备:
下载适合系统的 Android Debug Bridge,保存到指定路径
在手机上开启“USB调试”或“ADB调试”(通常需要先开启开发者选项)
通过数据线连接手机和电脑,选择“传输文件”模式
下载ADB键盘的安装包,并将安装包传输到手机上打开,选择无视风险安装
在系统设置中将默认输入法切换为
ADB Keyboard在电脑终端上测试连接:
/path/to/adb devices(设备列表不为空说明连接成功)电脑系统为
macOS/Linux时, 需要开启权限:sudo chmod +x /path/to/adb进入手机的某个App,然后执行命令:
/path/to/adb shell am start -a android.intent.action.MAIN -c android.intent.category.HOME,如果手机设备退回到桌面,则说明一切就绪
手机端GUI示例与电脑端大致相同,完整示例代码如下:
更多用法
使用说明
图像限制
gui-plus模型对输入图像有以下具体要求:
支持的图像格式:
图像格式
常见扩展名
MIME Type
BMP
.bmp
image/bmp
JPEG
.jpe, .jpeg, .jpg
image/jpeg
PNG
.png
image/png
TIFF
.tif, .tiff
image/tiff
WEBP
.webp
image/webp
HEIC
.heic
image/heic
图像大小:单个图像的大小不超过10 MB。如果传入 Base64编码的图像,需保证编码后的字符串小于10MB,详情请参见传入本地文件。如需压缩文件体积请参见图像或视频压缩方法。
尺寸与比例:图像的宽度和高度均需大于 10 像素,图像的宽高比(长边与短边的比值)不得超过 200。
像素总量:模型接受任意像素总量的图像输入,但会在内部将其缩放至特定处理上限,超过此上限的图像会损失细节。
图像输入方式
公网URL:提供一个公网可访问的图像地址,支持 HTTP 或 HTTPS 协议。可将本地图像上传至OSS或上传文件获取临时URL,获取公网 URL。
Base64编码传入:将图像转换为 Base64 编码字符串。
本地文件路径传入:直接传入本地图像的路径。
计费与限流
API参考
关于千问GUI-Plus模型的输入输出参数,请参见GUI-Plus API参考。
错误码
如果模型调用失败并返回报错信息,请参见错误信息进行解决。