GUI-Plus parses user intent from screenshots and natural language instructions. It converts the intent into standard graphical user interface (GUI) operations, such as clicking, typing, and scrolling. External systems can then use these operations for decision-making or execution. Compared to the Qwen-VL series models, GUI-Plus offers improved accuracy for GUI operations.
The GUI-Plus model is available only in the China (Beijing) region with the Chinese mainland deployment scope. If you need to use the model, please use the API key of the Beijing region.
Supported models
Model name | Mode | Context length | Max input | Max chain-of-thought length | Max response length | Input cost | Output cost | Free quota |
(Tokens) | (Per million tokens) | |||||||
gui-plus | Non-thinking mode | 256,000 | 254,976 Max 16,384 per image | - | 32,768 | CNY 1.5 | CNY 4.5 | 1 million tokens for input and output each Validity: 90 days after activating Model Studio |
gui-plus-2026-02-26 | Thinking mode | 262,144 | 258,048 Max 16,384 per image | 81,920 | 32,768 | |||
Non-thinking mode | 260,096 Max 16,384 per image | - | 32,768 | |||||
The gui-plus-2026-02-26 model is fully upgraded to support both thinking and non-thinking modes. Compared to the gui-plus model, gui-plus-2026-02-26 significantly improves performance on cross-platform and multi-app tasks. Use this model for the best results.
Quickstart
This section shows how to quickly call the GUI-Plus model to obtain instructions for GUI tasks. For more information about how to convert these instructions into actual GUI operations, see the How to use section. To quickly test the model, try the online demo.
Prerequisites
If you use an SDK, install the latest version.
Recommended system prompt
A System Prompt defines the model's role, capabilities, and output format. Use the following system prompt for the gui-plus-2026-02-26 model to ensure correct output.
gui-plusandgui-plus-2026-02-26are not interchangeable. For thegui-plussystem prompt, see Recommended prompts for the GUI-Plus model.
Computer System Prompt
"""# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""Mobile System Prompt
'''# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name_for_human": "mobile_use", "name": "mobile_use", "description": "Use a touchscreen to interact with a mobile device, and take screenshots.
* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.
* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.
* The screen's resolution is 1000x1000.
* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:
* `key`: Perform a key event on the mobile device.
- This supports adb's `keyevent` syntax.
- Examples: "volume_up", "volume_down", "power", "camera", "clear".
* `click`: Click the point on the screen with coordinate (x, y).
* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.
* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).
* `type`: Input the specified text into the activated input box.
* `system_button`: Press the system button.
* `open`: Open an app on the device.
* `wait`: Wait specified seconds for the change to happen.
* `answer`: Terminate the current task and output the answer.
* `interact`: Resolve the blocking window by interacting with the user.
* `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "click", "long_press", "swipe", "type", "system_button", "open", "wait", "answer", "interact", "terminate"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.", "type": "array"}, "coordinate2": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.", "type": "array"}, "text": {"description": "Required only by `action=key`, `action=type`, `action=open`, `action=answer`,and `action=interact`.", "type": "string"}, "time": {"description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.", "type": "number"}, "button": {"description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`", "enum": ["Back", "Home", "Menu", "Enter"], "type": "string"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call.'''OpenAI compatible
Python
import os
from openai import OpenAI
system_prompt = """# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""
messages = [
{
"role": "system",
"content": system_prompt
},
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"}},
{"type": "text", "text": "Open the browser for me"}
]
}
]
client = OpenAI(
# If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
api_key=os.getenv("DASHSCOPE_API_KEY"),
base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
model="gui-plus-2026-02-26",
messages=messages,
extra_body={"vl_high_resolution_images": True}
)
print(completion.choices[0].message.content)Node.js
import OpenAI from "openai";
const systemPrompt = `# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* \`key\`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* \`type\`: Type a string of text on the keyboard.\\n* \`mouse_move\`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* \`left_click\`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`left_click_drag\`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* \`right_click\`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`middle_click\`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`double_click\`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`triple_click\`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* \`scroll\`: Performs a scroll of the mouse scroll wheel.\\n* \`hscroll\`: Performs a horizontal scroll (mapped to regular scroll).\\n* \`wait\`: Wait specified seconds for the change to happen.\\n* \`terminate\`: Terminate the current task and report its completion status.\\n* \`answer\`: Answer a question.\\n* \`interact\`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by \`action=key\`.", "type": "array"}, "text": {"description": "Required only by \`action=type\`, \`action=answer\` and \`action=interact\`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by \`action=mouse_move\` and \`action=left_click_drag\`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by \`action=scroll\` and \`action=hscroll\`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by \`action=wait\`.", "type": "number"}, "status": {"description": "The status of the task. Required only by \`action=terminate\`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call.`;
const client = new OpenAI({
apiKey: process.env.DASHSCOPE_API_KEY,
baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});
const messages = [
{
role: "system",
content: systemPrompt,
},
{
role: "user",
content: [
{
type: "image_url",
image_url: {
url: "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png",
},
},
{ type: "text", text: "Open the browser for me" },
],
},
];
const completion = await client.chat.completions.create({
model: "gui-plus-2026-02-26",
messages: messages,
extra_body: { vl_high_resolution_images: true },
});
console.log(completion.choices[0].message.content);curl
curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gui-plus-2026-02-26",
"messages": [
{
"role": "system",
"content": "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"computer_use\", \"description\": \"Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn'\''t open, try wait and taking another screenshot.\\n* The screen'\''s resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don'\''t click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it'\''s the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.\", \"enum\": [\"key\", \"type\", \"mouse_move\", \"left_click\", \"left_click_drag\", \"right_click\", \"middle_click\", \"double_click\", \"triple_click\", \"scroll\", \"hscroll\", \"wait\", \"terminate\", \"answer\", \"interact\"], \"type\": \"string\"}, \"keys\": {\"description\": \"Required only by `action=key`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=type`, `action=answer` and `action=interact`.\", \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.\", \"type\": \"array\"}, \"pixels\": {\"description\": \"The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.\", \"type\": \"number\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=wait`.\", \"type\": \"number\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Action: a short imperative describing what to do in the UI.\n2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Action, <tool_call>.\n- Be brief: one for Action.\n- Do not output anything else outside those two parts.\n- If finishing, use action=terminate in the tool call."
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"
}
},
{
"type": "text",
"text": "Open the browser for me"
}
]
}
],
"vl_high_resolution_images": true
}'DashScope
Python
import os
import dashscope
system_prompt = """# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""
messages = [
{
"role": "system",
"content": system_prompt
},
{
"role": "user",
"content": [
{"image": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"},
{"text": "Open the browser for me."}]
}]
response = dashscope.MultiModalConversation.call(
# If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
api_key=os.getenv('DASHSCOPE_API_KEY'),
model='gui-plus-2026-02-26',
messages=messages,
vl_high_resolution_images=True
)
print(response.output.choices[0].message.content[0]["text"])
Java
import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;
public class Main {
public static void simpleMultiModalConversationCall()
throws ApiException, NoApiKeyException, UploadFileException {
String systemPrompt = "# Tools\n\n" +
"You may call one or more functions to assist with the user query.\n\n" +
"You are provided with function signatures within <tools></tools> XML tags:\n" +
"<tools>\n" +
"{\"type\": \"function\", \"function\": {\"name\": \"computer_use\", \"description\": \"Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.\", \"enum\": [\"key\", \"type\", \"mouse_move\", \"left_click\", \"left_click_drag\", \"right_click\", \"middle_click\", \"double_click\", \"triple_click\", \"scroll\", \"hscroll\", \"wait\", \"terminate\", \"answer\", \"interact\"], \"type\": \"string\"}, \"keys\": {\"description\": \"Required only by `action=key`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=type`, `action=answer` and `action=interact`.\", \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.\", \"type\": \"array\"}, \"pixels\": {\"description\": \"The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.\", \"type\": \"number\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=wait`.\", \"type\": \"number\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n" +
"</tools>\n\n" +
"For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n" +
"<tool_call>\n" +
"{\"name\": <function-name>, \"arguments\": <args-json-object>}\n" +
"</tool_call>\n\n" +
"# Response format\n\n" +
"Response format for every step:\n" +
"1) Action: a short imperative describing what to do in the UI.\n" +
"2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\n" +
"Rules:\n" +
"- Output exactly in the order: Action, <tool_call>.\n" +
"- Be brief: one for Action.\n" +
"- Do not output anything else outside those two parts.\n" +
"- If finishing, use action=terminate in the tool call.";
MultiModalConversation conv = new MultiModalConversation();
MultiModalMessage systemMsg = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
.content(Arrays.asList(
Collections.singletonMap("text",systemPrompt))).build();
MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
.content(Arrays.asList(
Collections.singletonMap("image", "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"),
Collections.singletonMap("text", "Open the browser for me."))).build();
MultiModalConversationParam param = MultiModalConversationParam.builder()
// If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
.apiKey(System.getenv("DASHSCOPE_API_KEY"))
.model("gui-plus-2026-02-26")
.messages(Arrays.asList(systemMsg,userMessage))
.vlHighResolutionImages(true)
.build();
MultiModalConversationResult result = conv.call(param);
System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
}
public static void main(String[] args) {
try {
simpleMultiModalConversationCall();
} catch (ApiException | NoApiKeyException | UploadFileException e) {
System.out.println(e.getMessage());
}
System.exit(0);
}
}curl
curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
-H "Authorization: Bearer $DASHSCOPE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gui-plus-2026-02-26",
"input": {
"messages": [
{
"role": "system",
"content": [
{
"text": "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"computer_use\", \"description\": \"Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn'\''t open, try wait and taking another screenshot.\\n* The screen'\''s resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don'\''t click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it'\''s the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.\", \"enum\": [\"key\", \"type\", \"mouse_move\", \"left_click\", \"left_click_drag\", \"right_click\", \"middle_click\", \"double_click\", \"triple_click\", \"scroll\", \"hscroll\", \"wait\", \"terminate\", \"answer\", \"interact\"], \"type\": \"string\"}, \"keys\": {\"description\": \"Required only by `action=key`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=type`, `action=answer` and `action=interact`.\", \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.\", \"type\": \"array\"}, \"pixels\": {\"description\": \"The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.\", \"type\": \"number\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=wait`.\", \"type\": \"number\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Action: a short imperative describing what to do in the UI.\n2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Action, <tool_call>.\n- Be brief: one for Action.\n- Do not output anything else outside those two parts.\n- If finishing, use action=terminate in the tool call."
}
]
},
{
"role": "user",
"content": [
{
"image": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"
},
{
"text": "Open the browser for me"
}
]
}
]
},
"parameters": {
"vl_high_resolution_images": true
}
}'How to use
Computer GUI tasks
This example is for the Windows operating system. If you are using macOS or Linux, you need to modify the system commands in the ComputerTools class. For example, to show the desktop, Windows uses Win+D, while macOS uses Command+F3.
Step 1. Construct the system prompt
system_prompt = """# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""The system prompt above instructs the model to:
Assume a screen resolution of 1000×1000 (normalized coordinate system).
Follow a strict output format: first output the action description (
Action), then the<tool_call>block.Support operation types such as click, drag, type, scroll, and key press.
Step 2. Construct multi-turn conversation messages
In GUI automation tasks, the model uses the context of historical operations to make decisions. To help the model understand the current task progress and generate the next logical operation, construct multi-turn conversation messages with the following strategy:
To prevent performance degradation from an excessively long context, keep only the complete conversation (screenshot and model output) for the last N turns (default is 4).
For older historical operations, keep only the text summary (the
Actionpart of the model output) without the screenshot to reducetokenconsumption.
def get_messages(image, instruction, history_output, model_name, system_prompt):
"""
Construct multi-turn conversation messages
Parameters:
image: Path to the current screenshot
instruction: User instruction
history_output: Historical conversation records [{"output": "...", "image": "..."}]
model_name: Model name
"""
history_n = 4 # Keep the last 4 turns of history
current_step = len(history_output)
# Construct a summary of historical operations
history_start_idx = max(0, current_step - history_n)
previous_actions = []
for i in range(history_start_idx):
if i < len(history_output):
history_output_str = history_output[i]['output']
if 'Action:' in history_output_str and '<tool_call>':
history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
previous_actions.append(f"Step {i + 1}: {history_output_str}")
previous_actions_str = "\\n".join(previous_actions) if previous_actions else "None"
instruction_prompt = f"""
Please generate the next move according to the UI screenshot, instruction and previous actions.
Instruction: {instruction}
Previous actions:
{previous_actions_str}"""
# Construct the messages array
messages = [
{
"role": "system",
"content": [{"text": system_prompt}],
}
]
history_len = min(history_n, len(history_output))
if history_len > 0:
# Add historical conversation
for history_id, history_item in enumerate(history_output[-history_n:], 0):
if history_id == 0:
messages.append({
"role": "user",
"content": [
{"text": instruction_prompt},
{"image": "file://" + history_item['image']}
]
})
else:
messages.append({
"role": "user",
"content": [{"image": "file://" + history_item['image']}]
})
messages.append({
"role": "assistant",
"content": [{"text": history_item['output']}],
})
# Add the current screenshot
messages.append({
"role": "user",
"content": [{"image": "file://" + image}]
})
else:
# First turn of the conversation
messages.append({
"role": "user",
"content": [
{"text": instruction_prompt},
{"image": "file://" + image}
]
})
return messagesThe following is an example of a message array for a 7-turn conversation with the GUI model:
model_input
[{
"role": "system",
"content": [{
"text": "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name_for_human\": \"mobile_use\", \"name\": \"mobile_use\", \"description\": \"Use a touchscreen to interact with a mobile device, and take screenshots.\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\n* `key`: Perform a key event on the mobile device.\n - This supports adb's `keyevent` syntax.\n - Examples: \"volume_up\", \"volume_down\", \"power\", \"camera\", \"clear\".\n* `click`: Click the point on the screen with coordinate (x, y).\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).\n* `type`: Input the specified text into the activated input box.\n* `system_button`: Press the system button.\n* `open`: Open an app on the device.\n* `wait`: Wait specified seconds for the change to happen.\n* `answer`: Terminate the current task and output the answer.\n* `interact`: Resolve the blocking window by interacting with the user.\n* `terminate`: Terminate the current task and report its completion status.\", \"enum\": [\"key\", \"click\", \"long_press\", \"swipe\", \"type\", \"system_button\", \"open\", \"wait\", \"answer\", \"interact\", \"terminate\"], \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.\", \"type\": \"array\"}, \"coordinate2\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=key`, `action=type`, `action=open`, `action=answer`,and `action=interact`.\", \"type\": \"string\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=long_press` and `action=wait`.\", \"type\": \"number\"}, \"button\": {\"description\": \"Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`\", \"enum\": [\"Back\", \"Home\", \"Menu\", \"Enter\"], \"type\": \"string\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, "arguments": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Action: a short imperative describing what to do in the UI.\n2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, "arguments": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Action, <tool_call>.\n- Be brief: one for Action.\n- Do not output anything else outside those two parts.\n- If finishing, use action=terminate in the tool call."
}]
}, {
"role": "user",
"content": [{
"text": "\nPlease generate the next move according to the UI screenshot, instruction and previous actions.\n\nInstruction: Help me search for the price of Jinan Sheraton Hotel today on Ctrip\n\nPrevious actions:\nStep 1: Click the Ctrip Travel app icon to launch the Ctrip travel booking application.\nStep 2: Wait for the promotional splash screen to automatically transition to the main Ctrip app interface."
}, {
"image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_a84122ac_853a630315784b64988492c9c07b5534.png"
}]
}, {
"role": "assistant",
"content": [{
"text": "Action: Click the close button (X icon) in the upper right corner of the app update notification pop-up to close it.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [789, 280]}}\n</tool_call>"
}]
}, {
"role": "user",
"content": [{
"image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_6010a769_089b9b35b1904913bd5df492563b02b9.png"
}]
}, {
"role": "assistant",
"content": [{
"text": "Action: Click the \"Hotels in Jinan\" text area in the search bar to activate the search input box and prepare to modify the search term.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [112, 134]}}\n</tool_call>"
}]
}, {
"role": "user",
"content": [{
"image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_47446db4_fd4a5022002c4db99f110d5c7261fea2.png"
}]
}, {
"role": "assistant",
"content": [{
"text": "Action: Click the location field showing \"Xiamen\" to change the search location from Xiamen to Jinan.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [156, 347]}}\n</tool_call>"
}]
}, {
"role": "user",
"content": [{
"image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_3832132c_8c55861c1716467e802a3554402f3580.png"
}]
}, {
"role": "assistant",
"content": [{
"text": "Action: Type \"Jinan\" in the search input box to specify the city location for the hotel search.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"type\", \"text\": \"Jinan\"}}\n</tool_call>"
}]
}, {
"role": "user",
"content": [{
"image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_ff247bac_39c3e20be32c4baf8677a2b6b61bc021.png"
}]
}] Step 3. Parse the model output
The model internally scales images during processing. The coordinates it returns are normalized based on the scaled image. To accurately perform GUI operations on the original image, you must map the coordinates.
Extract the Tool Call field
First, extract the Tool Call from the string returned by the model:
import re import json def extract_tool_calls(text): """ Extract all <tool_call> blocks from the model output Parameters: text: The text returned by the model Returns: actions: A list of parsed operations """ pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE) blocks = pattern.findall(text) actions = [] for blk in blocks: blk = blk.strip() try: actions.append(json.loads(blk)) except json.JSONDecodeError as e: print(f'Parsing failed: {e} | Snippet: {blk[:80]}...') return actionsCoordinate mapping function
The following function calculates the scaled dimensions:
import math from PIL import Image def smart_resize(height, width, factor=32, min_pixels=32*32*4, max_pixels=32*32*1280, max_long_side=8192): """ Calculate the image dimensions after internal scaling by the model Parameters: height: Original image height width: Original image width factor: Resolution factor (default is 32) min_pixels: Minimum pixel value max_pixels: Maximum pixel value max_long_side: Maximum long side limit Returns: (h_bar, w_bar): Scaled height and width """ def round_by_factor(number, factor): return round(number / factor) * factor def ceil_by_factor(number, factor): return math.ceil(number / factor) * factor def floor_by_factor(number, factor): return math.floor(number / factor) * factor if height < 2 or width < 2: raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}") elif max(height, width) / min(height, width) > 200: raise ValueError(f"absolute aspect ratio must be smaller than 200, got {height} / {width}") # Limit the longest side if max(height, width) > max_long_side: beta = max(height, width) / max_long_side height, width = int(height / beta), int(width / beta) # Calculate the scaled dimensions h_bar = round_by_factor(height, factor) w_bar = round_by_factor(width, factor) if h_bar * w_bar > max_pixels: beta = math.sqrt((height * width) / max_pixels) h_bar = floor_by_factor(height / beta, factor) w_bar = floor_by_factor(width / beta, factor) elif h_bar * w_bar < min_pixels: beta = math.sqrt(min_pixels / (height * width)) h_bar = ceil_by_factor(height * beta, factor) w_bar = ceil_by_factor(width * beta, factor) return h_bar, w_bar
Step 4. Execute the GUI operation
After parsing the action instruction, this section shows how to use the pyautogui library to simulate physical GUI operations such as mouse clicks, keyboard input, and scrolling.
import pyautogui
import pyperclip
import time
from PIL import Image
import os
class ComputerTools:
"""Computer GUI operation utility class"""
def __init__(self):
self.image_info = None
def load_image_info(self, path):
"""Load image dimension information"""
width, height = Image.open(path).size
self.image_info = (width, height)
def get_screenshot(self, image_path, retry_times=3):
"""Get a desktop screenshot"""
if os.path.exists(image_path):
os.remove(image_path)
for i in range(retry_times):
screenshot = pyautogui.screenshot()
screenshot.save(image_path)
if os.path.exists(image_path):
self.load_image_info(image_path)
return True
else:
time.sleep(0.1)
return False
def reset(self):
"""Show desktop"""
pyautogui.hotkey('win', 'd')
def press_key(self, keys):
"""Key press operation"""
if isinstance(keys, list):
cleaned_keys = []
for key in keys:
if isinstance(key, str):
# Process key name format
if key.startswith("keys=["):
key = key[6:]
if key.endswith("]"):
key = key[:-1]
if key.startswith("['") or key.startswith('["'):
key = key[2:] if len(key) > 2 else key
if key.endswith("']") or key.endswith('"]'):
key = key[:-2] if len(key) > 2 else key
key = key.strip()
# Convert key name
key_map = {
"arrowleft": "left",
"arrowright": "right",
"arrowup": "up",
"arrowdown": "down"
}
key = key_map.get(key, key)
cleaned_keys.append(key)
else:
cleaned_keys.append(key)
keys = cleaned_keys
else:
keys = [keys]
if len(keys) > 1:
pyautogui.hotkey(*keys)
else:
pyautogui.press(keys[0])
def type(self, text):
"""Type text (uses clipboard to support Chinese characters)"""
pyperclip.copy(text)
pyautogui.keyDown('ctrl')
pyautogui.keyDown('v')
pyautogui.keyUp('v')
pyautogui.keyUp('ctrl')
def mouse_move(self, x, y):
"""Move the mouse to the specified coordinates"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.moveTo(x, y)
def left_click(self, x, y):
"""Left-click"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.click()
def left_click_drag(self, x, y):
"""Drag from the current position to the specified coordinates"""
pyautogui.dragTo(x, y, duration=0.5)
pyautogui.moveTo(x, y)
def right_click(self, x, y):
"""Right-click"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.rightClick()
def middle_click(self, x, y):
"""Middle-click"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.middleClick()
def double_click(self, x, y):
"""Double-click"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.doubleClick()
def triple_click(self, x, y):
"""Triple-click"""
pyautogui.moveTo(x, y)
time.sleep(0.1)
pyautogui.tripleClick()
def scroll(self, pixels):
"""Scroll wheel"""
pyautogui.scroll(pixels)Step 5. Complete automation flow
Integrate the preceding steps into a complete automation flow. This flow loops through the sequence of taking a screenshot, performing model inference, and executing the GUI operation until the task is complete.
import os
import uuid
import dashscope
import time
def run_gui_automation(instruction, max_step=30):
"""
Run the complete GUI automation flow
Parameters:
instruction: User instruction
max_step: Maximum number of execution steps
"""
# Configure the API
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
model_name = 'gui-plus-2026-02-26'
# Initialize tools
computer_tools = ComputerTools()
computer_tools.reset() # Show desktop
# Create an output directory
output_dir = os.path.join(os.path.expanduser("~"), "Desktop", "gui_automation")
os.makedirs(output_dir, exist_ok=True)
# Conversation history
history = []
stop_flag = False
session_id = str(uuid.uuid4())
print('session_id ', session_id)
print(f"[Task] {instruction}")
print("=" * 60)
for step_id in range(max_step):
if stop_flag:
break
print(f"\n[Step {step_id + 1}]")
# 1. Screenshot
screen_shot = os.path.join(output_dir, f'screenshot_{step_id}.png')
computer_tools.get_screenshot(screen_shot)
# 2. Construct messages
messages = get_messages(screen_shot, instruction, history, model_name)
# 3. Call the model
retry_time = 3
for _ in range(retry_time):
response = dashscope.MultiModalConversation.call(
model=model_name,
messages=messages,
vl_high_resolution_images=True,
headers={"x-dashscope-gui-session-id": session_id},
stream=False
)
print(response['request_id'])
try:
output_text = response.output.choices[0].message.content[0]['text']
break
except Exception as e:
print(response)
print(e)
else:
raise Exception('retry_time out')
print(f"[Model output]\n{output_text}\n")
# 4. Parse operations
action_list = extract_tool_calls(output_text)
if not action_list:
print("No valid operation extracted")
break
# 5. Execute operations
for action_id, action in enumerate(action_list):
action_parameter = action['arguments']
action_type = action_parameter['action']
# Get image dimensions for coordinate mapping
dummy_image = Image.open(screen_shot)
resized_height, resized_width = smart_resize(
dummy_image.height,
dummy_image.width,
factor=16,
min_pixels=3136,
max_pixels=1003520 * 200
)
# Map coordinates (from 1000x1000 normalized coordinates to actual dimensions)
for key in ['coordinate', 'coordinate1', 'coordinate2']:
if key in action_parameter:
action_parameter[key][0] = int(action_parameter[key][0] / 1000 * resized_width)
action_parameter[key][1] = int(action_parameter[key][1] / 1000 * resized_height)
# Execute the corresponding operation
if action_type in ['click', 'left_click']:
computer_tools.left_click(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ Left-click ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")
elif action_type == 'mouse_move':
computer_tools.mouse_move(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ Move mouse to ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")
elif action_type == 'middle_click':
computer_tools.middle_click(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ Middle-click")
elif action_type in ['right click', 'right_click']:
computer_tools.right_click(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ Right-click")
elif action_type in ['key', 'hotkey']:
computer_tools.press_key(action_parameter['keys'])
print(f"✓ Key press {action_parameter['keys']}")
elif action_type == 'type':
text = action_parameter['text']
computer_tools.type(text)
print(f"✓ Type text: {text}")
elif action_type == 'drag':
computer_tools.left_click_drag(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ Drag to ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")
elif action_type == 'scroll':
if 'coordinate' in action_parameter:
computer_tools.mouse_move(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
computer_tools.scroll(action_parameter.get("pixels", 1))
print(f"✓ Scroll {action_parameter.get('pixels', 1)} pixels")
elif action_type in ['computer_double_click', 'double_click']:
computer_tools.double_click(
action_parameter['coordinate'][0],
action_parameter['coordinate'][1]
)
print(f"✓ Double-click")
elif action_type == 'wait':
time.sleep(action_parameter.get('time', 2))
print(f"✓ Wait {action_parameter.get('time', 2)} seconds")
elif action_type == 'answer':
print(f"✓ Task complete: {action_parameter.get('text', '')}")
stop_flag = True
break
elif action_type in ['stop', 'terminate', 'done']:
print(f"✓ Task terminated: {action_parameter.get('status', 'success')}")
stop_flag = True
break
else:
print(f"Unknown action type: {action_type}")
# 6. Save history
history.append({
'output': output_text,
'image': screen_shot
})
time.sleep(2) # Operation interval
print("\n" + "=" * 60)
print(f"[Complete] Total executed {len(history)} steps")
# Example usage
if __name__ == '__main__':
run_gui_automation(
instruction='Open Chrome for me and search for Alibaba on Baidu',
max_step=30
)Mobile GUI tasks
Mobile automation is achieved using the Android Debug Bridge (ADB) tool.
Environment setup:
Download the Android Debug Bridge for your system and save it to a specified path.
Enable "USB debugging" or "ADB debugging" on your phone. You usually need to enable Developer options first.
Connect your phone to your computer with a data cable and select "File Transfer" mode.
Download the ADB Keyboard APK, transfer it to your phone, open it, and choose to install it despite the risk warning.
In the system settings, switch the default input method to
ADB Keyboard.Test the connection in your computer's terminal:
/path/to/adb devices. If the device list is not empty, the connection is successful.If you are using
macOS or Linux, you need to grant execute permission:sudo chmod +x /path/to/adbOpen an app on your phone, then run the command:
/path/to/adb shell am start -a android.intent.action.MAIN -c android.intent.category.HOME. If your phone returns to the home screen, the setup is complete.
The mobile GUI example is similar to the computer example. The complete code is as follows:
Browser GUI tasks
For browsers, automation is achieved by controlling the browser with Playwright. It uses Set-of-Mark (SoM) technology to automatically add numeric labels to page elements. The model then uses these labels to precisely control web elements.
Environment setup:
Install dependencies:
pip install playwright pillow dashscope playwright-stealth termcolorInstall the Playwright browser:
playwright install chromium
The complete code example includes constructing multi-turn conversation messages, parsing tool calls from the model output, and executing browser GUI operations. The complete code is as follows:
Tool calling
Step 1. Construct the system prompt
Computer GUI tasks
# Define a list of custom tools (optional, pass as needed)
tool_list = [
{
"name": "save_to_file",
"description": "Save text content to a file on the local filesystem.",
"parameters": {
"type": "object",
"properties": {
"file_path": {
"type": "string",
"description": "The file path to save to"
},
"content": {
"type": "string",
"description": "The text content to save"
}
},
"required": ["file_path", "content"]
}
}
]
# Wrap custom tools in the standard structure
tools_def = {"type": "function", "function": tool_list}
# Define the GUI tool
gui_tool = {
"type": "function",
"function": {
"name": "computer_use",
"description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\n* The screen's resolution is 1000x1000.\n* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.\n* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.",
"parameters": {
"properties": {
"action": {
"description": "The action to perform. The available actions are:\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\n* `type`: Type a string of text on the keyboard.\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\n* `scroll`: Performs a scroll of the mouse scroll wheel.\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\n* `wait`: Wait specified seconds for the change to happen.\n* `terminate`: Terminate the current task and report its completion status.\n* `answer`: Answer a question.",
"enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer"],
"type": "string"
},
"keys": {"description": "Required only by `action=key`.", "type": "array"},
"text": {"description": "Required only by `action=type` and `action=answer`.", "type": "string"},
"coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"},
"pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"},
"time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"},
"status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}
},
"required": ["action"],
"type": "object"
}
}
}
# Combine the GUI tool and custom tools to build the System Prompt
system_prompt = """# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
""" + json.dumps(gui_tool) + json.dumps(tools_def) + """
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one sentence for Action.
- Do not output anything else outside those two parts.
- If finishing, use 'terminate' function in the tool call."""Mobile GUI tasks
# Define a list of custom tools (optional, pass as needed)
tool_list = [
{
"name": "save_to_file",
"description": "Save text content to a file on the local filesystem.",
"parameters": {
"type": "object",
"properties": {
"file_path": {
"type": "string",
"description": "The file path to save to"
},
"content": {
"type": "string",
"description": "The text content to save"
}
},
"required": ["file_path", "content"]
}
}
]
# Wrap custom tools in the standard structure
tools_def = {"type": "function", "function": tool_list}
# Define the GUI tool
gui_tool = {
"type": "function",
"function": {
"name_for_human": "mobile_use",
"name": "mobile_use",
"description": "Use a touchscreen to interact with a mobile device, and take screenshots.\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.",
"parameters": {
"properties": {
"action": {
"description": 'The action to perform. The available actions are:\n* `key`: Perform a key event on the mobile device.\n - This supports adb\'s `keyevent` syntax.\n - Examples: "volume_up", "volume_down", "power", "camera", "clear".\n* `click`: Click the point on the screen with coordinate (x, y).\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).\n* `type`: Input the specified text into the activated input box.\n* `system_button`: Press the system button.\n* `open`: Open an app on the device.\n* `wait`: Wait specified seconds for the change to happen.\n* `terminate`: Terminate the current task and report its completion status.',
"enum": [
"key",
"click",
"long_press",
"swipe",
"type",
"system_button",
"open",
"wait",
"terminate",
],
"type": "string",
},
"coordinate": {
"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.",
"type": "array",
},
"coordinate2": {
"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.",
"type": "array",
},
"text": {
"description": "Required only by `action=key`, `action=type`, and `action=open`.",
"type": "string",
},
"time": {
"description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.",
"type": "number",
},
"button": {
"description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`",
"enum": ["Back", "Home", "Menu", "Enter"],
"type": "string",
},
"status": {
"description": "The status of the task. Required only by `action=terminate`.",
"type": "string",
"enum": ["success", "failure"],
},
},
"required": ["action"],
"type": "object",
},
"args_format": "Format the arguments as a JSON object.",
},
}
# Combine the GUI tool and custom tools to build the System Prompt
system_prompt = """# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
""" + json.dumps(gui_tool) + json.dumps(tools_def) + """
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one sentence for Action.
- Do not output anything else outside those two parts.
- If finishing, use 'terminate' function in the tool call."""Step 2. Construct multi-turn conversation messages
In tool calling scenarios, you must correctly construct multi-turn conversation messages. This includes managing history and passing tool call results. The last user message must wrap the tool execution result with the <tool_response> tag.
Computer GUI tasks
def get_messages(image, instruction, history_output, system_prompt):
"""
Construct multi-turn conversation messages
Parameters:
image: Path to the current screenshot
instruction: User instruction
history_output: Historical conversation records [{"output": "...", "tool_response": "...", "image": "..."}]
system_prompt: System prompt
"""
history_n = 4 # Keep the last 4 turns of history
current_step = len(history_output)
history_start_idx = max(0, current_step - history_n)
# Construct a summary of historical operations
previous_actions = []
for i in range(history_start_idx):
if i < len(history_output):
history_output_str = history_output[i]['output']
if 'Action:' in history_output_str and '<tool_call>':
history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
previous_actions.append(f"Step {i + 1}: {history_output_str}")
previous_actions_str = "\\n".join(previous_actions) if previous_actions else "None"
instruction_prompt = f"""Please generate the next move according to the UI screenshot, instruction and previous actions.
Instruction: {instruction}
Previous actions:
{previous_actions_str}"""
# Construct the messages array
messages = [{"role": "system", "content": [{"text": system_prompt}]}]
history_len = min(history_n, len(history_output))
if history_len > 0:
# Add historical conversation
for history_id, history_item in enumerate(history_output[-history_n:], 0):
if history_id == 0:
messages.append({
"role": "user",
"content": [
{"text": instruction_prompt},
{"image": "file://" + history_item['image']}
]
})
else:
messages.append({
"role": "user",
"content": [{"image": "file://" + history_item['image']}]
})
messages.append({
"role": "assistant",
"content": [{"text": history_item['output']}]
})
# Add the current screenshot, including the tool execution result from the previous turn
messages.append({
"role": "user",
"content": [
{"text": "<tool_response>\n"},
{"text": history_output[-1]['tool_response']},
{"image": "file://" + image},
{"text": "\n</tool_response>"}
]
})
else:
# First turn of the conversation
messages.append({
"role": "user",
"content": [
{"text": instruction_prompt},
{"image": "file://" + image}
]
})
return messagesMobile GUI tasks
def get_messages(image, instruction, history_output, system_prompt):
"""
Construct multi-turn conversation messages (mobile)
Parameters:
image: Path to the current screenshot
instruction: User instruction
history_output: Historical conversation records [{"output": "...", "tool_response": "...", "image": "..."}]
system_prompt: System prompt
"""
from datetime import datetime
history_n = 4 # Keep the last 4 turns of history
current_step = len(history_output)
history_start_idx = max(0, current_step - history_n)
# Construct a summary of historical operations
previous_actions = []
for i in range(history_start_idx):
if i < len(history_output):
history_output_str = history_output[i]['output']
if 'Action:' in history_output_str and '<tool_call>':
history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
previous_actions.append(f"Step {i + 1}: {history_output_str}")
previous_actions_str = "\\n".join(previous_actions) if previous_actions else "None"
# Add background information (date)
today = datetime.today()
weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
weekday = weekday_names[today.weekday()]
formatted_date = today.strftime("%Y-%m-%d") + " " + weekday
ground_info = f'''Today's date is: {formatted_date}.'''
instruction_prompt = f"""Please generate the next move according to the UI screenshot, instruction and previous actions.
Instruction: {ground_info}{instruction}
Previous actions:
{previous_actions_str}"""
# Construct the messages array
messages = [{"role": "system", "content": [{"text": system_prompt}]}]
history_len = min(history_n, len(history_output))
if history_len > 0:
for history_id, history_item in enumerate(history_output[-history_n:], 0):
if history_id == 0:
messages.append({
"role": "user",
"content": [
{"text": instruction_prompt},
{"image": "file://" + history_item['image']}
]
})
else:
messages.append({
"role": "user",
"content": [{"image": "file://" + history_item['image']}]
})
messages.append({
"role": "assistant",
"content": [{"text": history_item['output']}]
})
messages.append({
"role": "user",
"content": [
{"text": "<tool_response>\n"},
{"text": history_output[-1]['tool_response']},
{"image": "file://" + image},
{"text": "\n</tool_response>"}
]
})
else:
messages.append({
"role": "user",
"content": [
{"text": instruction_prompt},
{"image": "file://" + image}
]
})
return messagesStep 3. Parse the model output
Extract the <tool_call> block from the model output, parse it as JSON, and then map the coordinates based on the image scaling ratio. For more information, see Parse model output and map coordinates.
Step 4. Execute the GUI operation
Implement a GUI operation utility class to perform actual interface operations such as clicking, typing, and scrolling. For more information, see Computer GUI operation utility class and Mobile GUI operation utility class.
Step 5. Complete automation flow
Integrate the preceding steps into a complete automation flow. This flow loops through the sequence of taking a screenshot, performing model inference, and executing the GUI operation until the task is complete. For more information, see the complete code example below.
More usage
Usage notes
Image restrictions
The gui-plus model has the following specific requirements for input images:
Supported image formats:
Image format
Common extensions
Multipurpose Internet Mail Extensions (MIME) Type
BMP
.bmp
image/bmp
JPEG
.jpe, .jpeg, .jpg
image/jpeg
PNG
.png
image/png
TIFF
.tif, .tiff
image/tiff
WEBP
.webp
image/webp
HEIC
.heic
image/heic
Image size: The size of a single image cannot exceed 10 MB. If you provide a Base64-encoded image, ensure that the encoded string is less than 10 MB. For more information, see Pass local files. To compress the file size, see Image or video compression methods.
Dimensions and aspect ratio: The width and height of the image must both be greater than 10 pixels. The aspect ratio (the ratio of the long side to the short side) of the image must not exceed 200.
Total pixels: The model accepts images with any total number of pixels but will internally scale them down to a specific processing limit. Images exceeding this limit will lose detail.
Image input methods
Public URL: Provide a publicly accessible image address that supports the HTTP or HTTPS protocol. You can upload a local image to OSS or upload a file to get a temporary URL to obtain a public URL.
Base64 encoding: Convert the image to a Base64-encoded string.
Local file path: Provide the path to the local image directly.
Billing and rate limiting
Rate limiting: For the rate limiting conditions of the Qwen GUI-Plus model, see Rate limiting.
Free quota: The model provides a free quota of 1 million tokens, valid for 90 days from the date you activate Model Studio or your model application is approved.
Billing: Total cost = Input tokens × Model input price + Model output tokens × Model output price. For input/output prices, see the Model list.
View bills: You can view your bills or top up your account on the Expenses and Costs page in the Alibaba Cloud Management Console.
API reference
For the input and output parameters of the Qwen GUI-Plus model, see GUI-Plus API reference.
Error codes
If the model call fails and returns an error message, see Error codes for resolution.