GUI-Plus: A model for user interface interaction

GUI-Plus parses user intent from screenshots and natural language instructions. It converts the intent into standard graphical user interface (GUI) operations, such as clicking, typing, and scrolling. External systems can then use these operations for decision-making or execution. Compared to the Qwen-VL series models, GUI-Plus offers improved accuracy for GUI operations.

Important

The GUI-Plus model is available only in the China (Beijing) region, under the Chinese mainland deployment scope. To call the model, use an API key created in the Beijing region.

Supported models

| Model name | Mode | Context length (tokens) | Max input (tokens) | Max chain-of-thought length (tokens) | Max response length (tokens) | Input cost (per million tokens) | Output cost (per million tokens) | Free quota |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gui-plus | Non-thinking mode | 256,000 | 254,976 (max 16,384 per image) | - | 32,768 | CNY 1.5 | CNY 4.5 | 1 million tokens each for input and output. Validity: 90 days after activating Model Studio |
| gui-plus-2026-02-26 | Thinking mode | 262,144 | 258,048 (max 16,384 per image) | 81,920 | 32,768 | | | |
| gui-plus-2026-02-26 | Non-thinking mode | 262,144 | 260,096 (max 16,384 per image) | - | 32,768 | | | |

Note

The gui-plus-2026-02-26 model is fully upgraded to support both thinking and non-thinking modes. Compared to the gui-plus model, gui-plus-2026-02-26 significantly improves performance on cross-platform and multi-app tasks. Use this model for the best results.
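
The limits in the Supported models list can be turned into a quick pre-flight check. The following is a minimal sketch, not an official utility: it assumes the worst case in which every screenshot consumes the full 16,384-token per-image budget, and the helper names `worst_case_input_tokens` and `fits` are hypothetical.

```python
# Limits copied from the Supported models list (tokens).
MAX_INPUT_TOKENS = {
    ("gui-plus", "non-thinking"): 254_976,
    ("gui-plus-2026-02-26", "thinking"): 258_048,
    ("gui-plus-2026-02-26", "non-thinking"): 260_096,
}
MAX_TOKENS_PER_IMAGE = 16_384


def worst_case_input_tokens(num_images: int, text_tokens: int) -> int:
    """Upper bound on input size if every image uses its full per-image budget."""
    return num_images * MAX_TOKENS_PER_IMAGE + text_tokens


def fits(model: str, mode: str, num_images: int, text_tokens: int) -> bool:
    """Return True if the worst-case input stays within the model's max input."""
    limit = MAX_INPUT_TOKENS[(model, mode)]
    return worst_case_input_tokens(num_images, text_tokens) <= limit


print(fits("gui-plus", "non-thinking", num_images=15, text_tokens=1_000))  # True
print(fits("gui-plus", "non-thinking", num_images=16, text_tokens=1_000))  # False
```

Under this worst-case assumption, 15 full-size screenshots plus a short text prompt fit within the gui-plus input limit, while 16 do not. Real images often use fewer tokens, so the check is conservative.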

Quickstart

This section shows how to quickly call the GUI-Plus model to obtain instructions for GUI tasks. For more information about how to convert these instructions into actual GUI operations, see the How to use section. To quickly test the model, try the online demo.

Prerequisites

Recommended system prompt

A system prompt defines the model's role, capabilities, and output format. Use the following system prompt with the gui-plus-2026-02-26 model to ensure correctly formatted output.

gui-plus and gui-plus-2026-02-26 are not interchangeable. For the gui-plus system prompt, see Recommended prompts for the GUI-Plus model.

Computer System Prompt

"""# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* 
`terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""
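
Before executing a returned call, it can help to check it against the schema in the prompt above. The sketch below is one possible client-side check, not part of the API: the action list and the per-action arguments are copied from the `enum` and the "Required only by" notes in the prompt, while the function name `validate_computer_use` is hypothetical. Click actions are not checked for `coordinate` because the schema does not list it as required for them.

```python
COMPUTER_ACTIONS = {
    "key", "type", "mouse_move", "left_click", "left_click_drag",
    "right_click", "middle_click", "double_click", "triple_click",
    "scroll", "hscroll", "wait", "terminate", "answer", "interact",
}

# Argument each action requires, per the "Required only by" notes in the schema.
REQUIRED_ARGUMENT = {
    "key": "keys",
    "type": "text",
    "answer": "text",
    "interact": "text",
    "mouse_move": "coordinate",
    "left_click_drag": "coordinate",
    "scroll": "pixels",
    "hscroll": "pixels",
    "wait": "time",
    "terminate": "status",
}


def validate_computer_use(arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    action = arguments.get("action")
    if action not in COMPUTER_ACTIONS:
        return [f"unknown action: {action!r}"]
    problems = []
    needed = REQUIRED_ARGUMENT.get(action)
    if needed is not None and needed not in arguments:
        problems.append(f"action {action!r} requires {needed!r}")
    # The schema restricts status to two values.
    if "status" in arguments and arguments["status"] not in ("success", "failure"):
        problems.append("status must be 'success' or 'failure'")
    return problems


print(validate_computer_use({"action": "left_click", "coordinate": [500, 300]}))  # []
print(validate_computer_use({"action": "type"}))
```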

Mobile System Prompt

'''# Tools
You may call one or more functions to assist with the user query.
        
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name_for_human": "mobile_use", "name": "mobile_use", "description": "Use a touchscreen to interact with a mobile device, and take screenshots.
* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.
* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.
* The screen's resolution is 1000x1000.
* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:
* `key`: Perform a key event on the mobile device.
    - This supports adb's `keyevent` syntax.
    - Examples: "volume_up", "volume_down", "power", "camera", "clear".
* `click`: Click the point on the screen with coordinate (x, y).
* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.
* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).
* `type`: Input the specified text into the activated input box.
* `system_button`: Press the system button.
* `open`: Open an app on the device.
* `wait`: Wait specified seconds for the change to happen.
* `answer`: Terminate the current task and output the answer.
* `interact`: Resolve the blocking window by interacting with the user.
* `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "click", "long_press", "swipe", "type", "system_button", "open", "wait", "answer", "interact", "terminate"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.", "type": "array"}, "coordinate2": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.", "type": "array"}, "text": {"description": "Required only by `action=key`, `action=type`, `action=open`, `action=answer`,and `action=interact`.", "type": "string"}, "time": {"description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.", "type": "number"}, "button": {"description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`", "enum": ["Back", "Home", "Menu", "Enter"], "type": "string"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call.'''
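
As an illustration of how a few `mobile_use` actions could be turned into device operations, the sketch below builds `adb shell input` command lines without running them. It rests on several assumptions that the prompt does not confirm: coordinates are treated as points on the 1000x1000 grid stated in the prompt and scaled linearly to the device resolution, the `Menu` button is mapped to `KEYCODE_APP_SWITCH`, and a long press is emulated as a swipe that starts and ends at the same point. The names `scale` and `to_adb_command` are hypothetical, and only four actions are covered.

```python
# Assumed mapping from the prompt's system buttons to Android key codes.
SYSTEM_KEYCODES = {
    "Back": "KEYCODE_BACK",
    "Home": "KEYCODE_HOME",
    "Menu": "KEYCODE_APP_SWITCH",  # assumption: "application background menu"
    "Enter": "KEYCODE_ENTER",
}


def scale(point, width: int, height: int):
    """Map an (x, y) point on the 1000x1000 grid to device pixels (assumed scaling)."""
    x, y = point
    return round(x * width / 1000), round(y * height / 1000)


def to_adb_command(arguments: dict, width: int, height: int) -> list[str]:
    """Build (but do not execute) an adb command for a subset of mobile_use actions."""
    action = arguments["action"]
    base = ["adb", "shell", "input"]
    if action == "click":
        x, y = scale(arguments["coordinate"], width, height)
        return base + ["tap", str(x), str(y)]
    if action == "long_press":
        x, y = scale(arguments["coordinate"], width, height)
        duration_ms = str(int(arguments["time"] * 1000))
        # A swipe with identical start and end points acts as a long press.
        return base + ["swipe", str(x), str(y), str(x), str(y), duration_ms]
    if action == "swipe":
        x1, y1 = scale(arguments["coordinate"], width, height)
        x2, y2 = scale(arguments["coordinate2"], width, height)
        return base + ["swipe", str(x1), str(y1), str(x2), str(y2)]
    if action == "system_button":
        return base + ["keyevent", SYSTEM_KEYCODES[arguments["button"]]]
    raise ValueError(f"no adb mapping sketched for action {action!r}")


print(to_adb_command({"action": "click", "coordinate": [500, 250]}, 1080, 2400))
```

The resulting list can be passed to `subprocess.run` once the mapping has been verified against a real device.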

OpenAI compatible

Python

import os
from openai import OpenAI

system_prompt = """# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* 
`terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""

messages = [
    {
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"}},
            {"type": "text", "text": "Open the browser for me"}
        ]
    }
]

client = OpenAI(
    # If the environment variable is not set, replace the following line with your Model Studio API key: api_key="sk-xxx",
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="gui-plus-2026-02-26",
    messages=messages,
    extra_body={"vl_high_resolution_images": True}
)

print(completion.choices[0].message.content)

Response

<tool_call>
{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [2530, 314]}}
</tool_call>
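
The reply is plain text, so the caller has to extract the JSON from the `<tool_call>` block before acting on it. The following is a minimal sketch of that step. It assumes the reply contains a single `<tool_call>` block, optionally preceded by an `Action:` line as described in the system prompt; the function name `parse_step` is hypothetical.

```python
import json
import re

# Capture the JSON object between the <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_step(content: str):
    """Split a model reply into (action_text, tool_call_dict)."""
    match = TOOL_CALL_RE.search(content)
    if match is None:
        raise ValueError("no <tool_call> block found")
    tool_call = json.loads(match.group(1))
    # Anything before the block is treated as the optional "Action:" line.
    action_text = content[:match.start()].strip().removeprefix("Action:").strip()
    return action_text, tool_call


reply = (
    "<tool_call>\n"
    '{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [2530, 314]}}\n'
    "</tool_call>"
)
action_text, tool_call = parse_step(reply)
print(tool_call["arguments"]["action"])      # left_click
print(tool_call["arguments"]["coordinate"])  # [2530, 314]
```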

Node.js

import OpenAI from "openai";

const systemPrompt = `# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* \`key\`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* \`type\`: Type a string of text on the keyboard.\\n* \`mouse_move\`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* \`left_click\`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`left_click_drag\`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* \`right_click\`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`middle_click\`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`double_click\`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* \`triple_click\`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* \`scroll\`: Performs a scroll of the mouse scroll wheel.\\n* \`hscroll\`: Performs a horizontal scroll (mapped to regular scroll).\\n* \`wait\`: Wait specified seconds for the 
change to happen.\\n* \`terminate\`: Terminate the current task and report its completion status.\\n* \`answer\`: Answer a question.\\n* \`interact\`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by \`action=key\`.", "type": "array"}, "text": {"description": "Required only by \`action=type\`, \`action=answer\` and \`action=interact\`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by \`action=mouse_move\` and \`action=left_click_drag\`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by \`action=scroll\` and \`action=hscroll\`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by \`action=wait\`.", "type": "number"}, "status": {"description": "The status of the task. Required only by \`action=terminate\`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call.`;

const client = new OpenAI({
  apiKey: process.env.DASHSCOPE_API_KEY,
  baseURL: "https://dashscope.aliyuncs.com/compatible-mode/v1",
});

const messages = [
  {
    role: "system",
    content: systemPrompt,
  },
  {
    role: "user",
    content: [
      {
        type: "image_url",
        image_url: {
          url: "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png",
        },
      },
      { type: "text", text: "Open the browser for me" },
    ],
  },
];

const completion = await client.chat.completions.create({
  model: "gui-plus-2026-02-26",
  messages: messages,
  // The Node.js SDK has no extra_body option; non-standard parameters are passed at the top level.
  vl_high_resolution_images: true,
});

console.log(completion.choices[0].message.content);

Response

<tool_call>
{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [2530, 314]}}
</tool_call>

curl

curl -X POST https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gui-plus-2026-02-26",
    "messages": [
      {
        "role": "system",
        "content": "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"computer_use\", \"description\": \"Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn'\''t open, try wait and taking another screenshot.\\n* The screen'\''s resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don'\''t click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it'\''s the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.\", \"enum\": [\"key\", \"type\", \"mouse_move\", \"left_click\", \"left_click_drag\", \"right_click\", \"middle_click\", \"double_click\", \"triple_click\", \"scroll\", \"hscroll\", \"wait\", \"terminate\", \"answer\", \"interact\"], \"type\": \"string\"}, \"keys\": {\"description\": \"Required only by `action=key`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=type`, `action=answer` and `action=interact`.\", \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.\", \"type\": \"array\"}, \"pixels\": {\"description\": \"The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.\", \"type\": \"number\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=wait`.\", \"type\": \"number\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Action: a short imperative describing what to do in the UI.\n2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Action, <tool_call>.\n- Be brief: one for Action.\n- Do not output anything else outside those two parts.\n- If finishing, use action=terminate in the tool call."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"
            }
          },
          {
            "type": "text",
            "text": "Open the browser for me"
          }
        ]
      }
    ],
    "vl_high_resolution_images": true
  }'

Response

{
  "choices": [
    {
      "message": {
        "content": "<tool_call>\n{\"name\": \"computer_use\", \"arguments\": {\"action\": \"left_click\", \"coordinate\": [2530, 314]}}\n</tool_call>",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 7750,
    "completion_tokens": 36,
    "total_tokens": 7786,
    "prompt_tokens_details": {
      "image_tokens": 6743,
      "text_tokens": 1007
    },
    "completion_tokens_details": {
      "text_tokens": 36
    }
  },
  "created": 1773133741,
  "system_fingerprint": null,
  "model": "gui-plus",
  "id": "chatcmpl-8b375016-abb8-9791-856c-74b2825c22d5"
}
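
The `usage` object in the response makes it possible to estimate what a call costs. The sketch below applies the per-million-token rates listed for gui-plus in the Supported models section (CNY 1.5 for input, CNY 4.5 for output) to the token counts from the response above. It ignores the free quota, and treating these rates as the ones billed for this particular call is an assumption; the function name `estimate_cost_cny` is hypothetical.

```python
# Rates listed for gui-plus in the Supported models section (CNY per million tokens).
INPUT_CNY_PER_MILLION = 1.5
OUTPUT_CNY_PER_MILLION = 4.5


def estimate_cost_cny(usage: dict) -> float:
    """Estimate the cost of one call from its usage block, ignoring any free quota."""
    input_cost = usage["prompt_tokens"] * INPUT_CNY_PER_MILLION / 1_000_000
    output_cost = usage["completion_tokens"] * OUTPUT_CNY_PER_MILLION / 1_000_000
    return input_cost + output_cost


# Token counts taken from the sample response above.
usage = {"prompt_tokens": 7750, "completion_tokens": 36, "total_tokens": 7786}
print(f"{estimate_cost_cny(usage):.6f}")  # 0.011787
```

Most of the cost in this example comes from the 6,743 image tokens in the prompt, so screenshot resolution is the main lever on per-step cost.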

DashScope

Python

import os
import dashscope

system_prompt = """# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* 
`terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""

messages = [
    {
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": [
            {"image": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"},
            {"text": "Open the browser for me."}]
    }]

response = dashscope.MultiModalConversation.call(
    # If the environment variable is not set, replace the following line with your Model Studio API key: api_key = "sk-xxx"
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='gui-plus-2026-02-26',
    messages=messages,
    vl_high_resolution_images=True
)

print(response.output.choices[0].message.content[0]["text"])

Response

<tool_call>
{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [2530, 314]}}
</tool_call>

Java

import java.util.Arrays;
import java.util.Collections;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
import com.alibaba.dashscope.common.MultiModalMessage;
import com.alibaba.dashscope.common.Role;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

public class Main {
    public static void simpleMultiModalConversationCall()
            throws ApiException, NoApiKeyException, UploadFileException {
        String systemPrompt = "# Tools\n\n" +
                "You may call one or more functions to assist with the user query.\n\n" +
                "You are provided with function signatures within <tools></tools> XML tags:\n" +
                "<tools>\n" +
                "{\"type\": \"function\", \"function\": {\"name\": \"computer_use\", \"description\": \"Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified " +
                "seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.\", \"enum\": [\"key\", \"type\", \"mouse_move\", \"left_click\", \"left_click_drag\", \"right_click\", \"middle_click\", \"double_click\", \"triple_click\", \"scroll\", \"hscroll\", \"wait\", \"terminate\", \"answer\", \"interact\"], \"type\": \"string\"}, \"keys\": {\"description\": \"Required only by `action=key`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=type`, `action=answer` and `action=interact`.\", \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.\", \"type\": \"array\"}, \"pixels\": {\"description\": \"The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.\", \"type\": \"number\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=wait`.\", \"type\": \"number\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n" +
                "</tools>\n\n" +
                "For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n" +
                "<tool_call>\n" +
                "{\"name\": <function-name>, \"arguments\": <args-json-object>}\n" +
                "</tool_call>\n\n" +
                "# Response format\n\n" +
                "Response format for every step:\n" +
                "1) Action: a short imperative describing what to do in the UI.\n" +
                "2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\n" +
                "Rules:\n" +
                "- Output exactly in the order: Action, <tool_call>.\n" +
                "- Be brief: one for Action.\n" +
                "- Do not output anything else outside those two parts.\n" +
                "- If finishing, use action=terminate in the tool call.";    
        MultiModalConversation conv = new MultiModalConversation();
        MultiModalMessage systemMsg = MultiModalMessage.builder().role(Role.SYSTEM.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("text",systemPrompt))).build();
        MultiModalMessage userMessage = MultiModalMessage.builder().role(Role.USER.getValue())
                .content(Arrays.asList(
                        Collections.singletonMap("image", "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"),
                        Collections.singletonMap("text", "Open the browser for me."))).build();
        MultiModalConversationParam param = MultiModalConversationParam.builder()
                // If the environment variable is not set, replace the following line with your Model Studio API key: .apiKey("sk-xxx")
                .apiKey(System.getenv("DASHSCOPE_API_KEY"))
                .model("gui-plus-2026-02-26")
                .messages(Arrays.asList(systemMsg,userMessage))
                .vlHighResolutionImages(true)
                .build();
        MultiModalConversationResult result = conv.call(param);
        System.out.println(result.getOutput().getChoices().get(0).getMessage().getContent().get(0).get("text"));
    }
    public static void main(String[] args) {
        try {
            simpleMultiModalConversationCall();
        } catch (ApiException | NoApiKeyException | UploadFileException e) {
            System.out.println(e.getMessage());
        }
        System.exit(0);
    }
}

Response

<tool_call>
{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [2530, 314]}}
</tool_call>

curl

curl -X POST https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation \
  -H "Authorization: Bearer $DASHSCOPE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gui-plus-2026-02-26",
    "input": {
      "messages": [
        {
          "role": "system",
          "content": [
            {
              "text": "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"computer_use\", \"description\": \"Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn'\''t open, try wait and taking another screenshot.\\n* The screen'\''s resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don'\''t click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it'\''s the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.\", \"enum\": [\"key\", \"type\", \"mouse_move\", \"left_click\", \"left_click_drag\", \"right_click\", \"middle_click\", \"double_click\", \"triple_click\", \"scroll\", \"hscroll\", \"wait\", \"terminate\", \"answer\", \"interact\"], \"type\": \"string\"}, \"keys\": {\"description\": \"Required only by `action=key`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=type`, `action=answer` and `action=interact`.\", \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.\", \"type\": \"array\"}, \"pixels\": {\"description\": \"The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.\", \"type\": \"number\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=wait`.\", \"type\": \"number\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Action: a short imperative describing what to do in the UI.\n2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Action, <tool_call>.\n- Be brief: one for Action.\n- Do not output anything else outside those two parts.\n- If finishing, use action=terminate in the tool call."
            }
          ]
        },
        {
          "role": "user",
          "content": [
            {
              "image": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"
            },
            {
              "text": "Open the browser for me."
            }
          ]
        }
      ]
    },
    "parameters": {
      "vl_high_resolution_images": true
    }
  }'

Response

{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "content": [
            {
              "text": "<tool_call>\n{\"name\": \"computer_use\", \"arguments\": {\"action\": \"left_click\", \"coordinate\": [2530, 314]}}\n</tool_call>"
            }
          ],
          "role": "assistant"
        }
      }
    ]
  },
  "usage": {
    "image_tokens": 6743,
    "input_tokens": 7750,
    "input_tokens_details": {
      "image_tokens": 6743,
      "text_tokens": 1007
    },
    "output_tokens": 36,
    "output_tokens_details": {
      "text_tokens": 36
    },
    "total_tokens": 7786
  },
  "request_id": "6821285d-e40f-4bca-903f-69f220e3c948"
}

How to use

Computer GUI tasks

Note

This example targets Windows. On macOS or Linux, modify the system commands in the ComputerTools class. For example, showing the desktop uses Win+D on Windows but Command+F3 on macOS, and the paste shortcut used by the type method is Ctrl+V on Windows but Command+V on macOS.
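
The platform differences mentioned in the note can be isolated in a small lookup table, sketched below. The helper `shortcut_for` and the `SHORTCUTS` table are hypothetical and not part of ComputerTools. The Windows and macOS show-desktop keys come from the note, the paste shortcuts are the standard ones, and the Linux show-desktop entry is an assumption that depends on the desktop environment.

```python
import platform

# Key names follow pyautogui's conventions ('win', 'command', 'ctrl').
# The Linux "show_desktop" entry is an assumption; it varies by desktop environment.
SHORTCUTS = {
    "Windows": {"show_desktop": ("win", "d"), "paste": ("ctrl", "v")},
    "Darwin": {"show_desktop": ("command", "f3"), "paste": ("command", "v")},
    "Linux": {"show_desktop": ("win", "d"), "paste": ("ctrl", "v")},
}

def shortcut_for(name, system=None):
    """Return the key combination for `name` on the given (or current) OS."""
    system = system or platform.system()
    if system not in SHORTCUTS:
        raise ValueError(f"Unsupported operating system: {system}")
    return SHORTCUTS[system][name]

print(shortcut_for("show_desktop", "Darwin"))
```

With pyautogui, the result could be passed on as `pyautogui.hotkey(*shortcut_for("paste"))`.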

Step 1. Construct the system prompt

system_prompt = """# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""

The system prompt above instructs the model to:

  • Assume a screen resolution of 1000×1000 (normalized coordinate system).

  • Follow a strict output format: first output the action description (Action), then the <tool_call> block.

  • Support operation types such as click, drag, type, scroll, and key press.
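
Because the prompt fixes the reply layout, a caller can check each reply before acting on it. The following is a minimal sketch of such a check; `split_step_output` and `STEP_PATTERN` are hypothetical names, and the sample reply is illustrative rather than real model output. The Action line is treated as optional, since the Quickstart responses shown earlier contain only the `<tool_call>` block.

```python
import json
import re

# Optional "Action: ..." text followed by exactly one <tool_call> block
STEP_PATTERN = re.compile(
    r"^\s*(?:Action:\s*(?P<action>.*?)\s*)?"
    r"<tool_call>\s*(?P<call>\{.*\})\s*</tool_call>\s*$",
    re.DOTALL,
)

def split_step_output(text):
    """Split one model reply into (action_text, tool_call_dict).

    Raises ValueError if the reply does not match the expected layout
    or if the tool call is not valid JSON.
    """
    match = STEP_PATTERN.match(text)
    if match is None:
        raise ValueError("Reply does not follow the 'Action, <tool_call>' layout")
    call = json.loads(match.group("call"))  # JSONDecodeError is a ValueError
    if "name" not in call or "arguments" not in call:
        raise ValueError("Tool call must contain 'name' and 'arguments'")
    return match.group("action"), call

# Illustrative reply, not captured from the model
reply = (
    "Action: Click the browser icon.\n"
    "<tool_call>\n"
    '{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [500, 300]}}\n'
    "</tool_call>"
)
action_text, tool_call = split_step_output(reply)
print(action_text)
print(tool_call["arguments"]["action"])
```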

Step 2. Construct multi-turn conversation messages

In GUI automation tasks, the model uses the context of historical operations to make decisions. To help the model understand the current task progress and generate the next logical operation, construct multi-turn conversation messages with the following strategy:

  • To prevent performance degradation from an excessively long context, keep only the complete conversation (screenshot and model output) for the last N turns (default is 4).

  • For older historical operations, keep only the text summary (the Action part of the model output) without the screenshot to reduce token consumption.

def get_messages(image, instruction, history_output, model_name, system_prompt):
    """
    Construct multi-turn conversation messages

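    Parameters:
        image: Path to the current screenshot
        instruction: User instruction
        history_output: Historical conversation records [{"output": "...", "image": "..."}]
        model_name: Model name (not used when constructing the messages)
        system_prompt: System prompt text, such as the one from Step 1

    Returns:
        messages: The list of messages to send to the model
    """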
    history_n = 4  # Keep the last 4 turns of history
    current_step = len(history_output)
    
    # Construct a summary of historical operations
    history_start_idx = max(0, current_step - history_n)
    previous_actions = []
    for i in range(history_start_idx):
        if i < len(history_output):
            history_output_str = history_output[i]['output']
            if 'Action:' in history_output_str and '<tool_call>' in history_output_str:
                history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
            previous_actions.append(f"Step {i + 1}: {history_output_str}")

    previous_actions_str = "\n".join(previous_actions) if previous_actions else "None"

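    # Build the prompt from explicit pieces so that source indentation does not leak into it
    instruction_prompt = (
        "\nPlease generate the next move according to the UI screenshot, "
        "instruction and previous actions.\n\n"
        f"Instruction: {instruction}\n\n"
        f"Previous actions:\n{previous_actions_str}"
    )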

    # Construct the messages array
    messages = [
        {
            "role": "system",
            "content": [{"text": system_prompt}],
        }
    ]

    history_len = min(history_n, len(history_output))
    if history_len > 0:
        # Add historical conversation
        for history_id, history_item in enumerate(history_output[-history_n:], 0):
            if history_id == 0:
                messages.append({
                    "role": "user",
                    "content": [
                        {"text": instruction_prompt},
                        {"image": "file://" + history_item['image']}
                    ]
                })
            else:
                messages.append({
                    "role": "user",
                    "content": [{"image": "file://" + history_item['image']}]
                })

            messages.append({
                "role": "assistant",
                "content": [{"text": history_item['output']}],
            })

        # Add the current screenshot
        messages.append({
            "role": "user",
            "content": [{"image": "file://" + image}]
        })
    else:
        # First turn of the conversation
        messages.append({
            "role": "user",
            "content": [
                {"text": instruction_prompt},
                {"image": "file://" + image}
            ]
        })

    return messages

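The following is an example of the message array at step 7 of a task: steps 1 and 2 are summarized as text, steps 3 to 6 are kept in full (screenshot and model output), and the final user message carries the current screenshot. This particular example uses the mobile_use system prompt: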

model_input
  [{
    "role": "system",
    "content": [{
      "text": "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name_for_human\": \"mobile_use\", \"name\": \"mobile_use\", \"description\": \"Use a touchscreen to interact with a mobile device, and take screenshots.\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\n* `key`: Perform a key event on the mobile device.\n    - This supports adb's `keyevent` syntax.\n    - Examples: \"volume_up\", \"volume_down\", \"power\", \"camera\", \"clear\".\n* `click`: Click the point on the screen with coordinate (x, y).\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).\n* `type`: Input the specified text into the activated input box.\n* `system_button`: Press the system button.\n* `open`: Open an app on the device.\n* `wait`: Wait specified seconds for the change to happen.\n* `answer`: Terminate the current task and output the answer.\n* `interact`: Resolve the blocking window by interacting with the user.\n* `terminate`: Terminate the current task and report its completion status.\", \"enum\": [\"key\", \"click\", \"long_press\", \"swipe\", \"type\", \"system_button\", \"open\", \"wait\", \"answer\", \"interact\", \"terminate\"], \"type\": 
\"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.\", \"type\": \"array\"}, \"coordinate2\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=key`, `action=type`, `action=open`, `action=answer`,and `action=interact`.\", \"type\": \"string\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=long_press` and `action=wait`.\", \"type\": \"number\"}, \"button\": {\"description\": \"Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`\", \"enum\": [\"Back\", \"Home\", \"Menu\", \"Enter\"], \"type\": \"string\"}, \"status\": {\"description\": \"The status of the task. 
Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, "arguments": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Action: a short imperative describing what to do in the UI.\n2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, "arguments": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Action, <tool_call>.\n- Be brief: one for Action.\n- Do not output anything else outside those two parts.\n- If finishing, use action=terminate in the tool call."
    }]
  }, {
    "role": "user",
    "content": [{
      "text": "\nPlease generate the next move according to the UI screenshot, instruction and previous actions.\n\nInstruction: Help me search for the price of Jinan Sheraton Hotel today on Ctrip\n\nPrevious actions:\nStep 1: Click the Ctrip Travel app icon to launch the Ctrip travel booking application.\nStep 2: Wait for the promotional splash screen to automatically transition to the main Ctrip app interface."
    }, {
      "image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_a84122ac_853a630315784b64988492c9c07b5534.png"
    }]
  }, {
    "role": "assistant",
    "content": [{
      "text": "Action: Click the close button (X icon) in the upper right corner of the app update notification pop-up to close it.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [789, 280]}}\n</tool_call>"
    }]
  }, {
    "role": "user",
    "content": [{
      "image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_6010a769_089b9b35b1904913bd5df492563b02b9.png"
    }]
  }, {
    "role": "assistant",
    "content": [{
      "text": "Action: Click the \"Hotels in Jinan\" text area in the search bar to activate the search input box and prepare to modify the search term.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [112, 134]}}\n</tool_call>"
    }]
  }, {
    "role": "user",
    "content": [{
      "image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_47446db4_fd4a5022002c4db99f110d5c7261fea2.png"
    }]
  }, {
    "role": "assistant",
    "content": [{
      "text": "Action: Click the location field showing \"Xiamen\" to change the search location from Xiamen to Jinan.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [156, 347]}}\n</tool_call>"
    }]
  }, {
    "role": "user",
    "content": [{
      "image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_3832132c_8c55861c1716467e802a3554402f3580.png"
    }]
  }, {
    "role": "assistant",
    "content": [{
      "text": "Action: Type \"Jinan\" in the search input box to specify the city location for the hotel search.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"type\", \"text\": \"Jinan\"}}\n</tool_call>"
    }]
  }, {
    "role": "user",
    "content": [{
      "image": "http://nlp-mobile-agent.oss-cn-zhangjiakou.aliyuncs.com/computer-use%2Fscreenshot%2Fscreenshot_ff247bac_39c3e20be32c4baf8677a2b6b61bc021.png"
    }]
  }]                                              
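
The split between summarized steps and fully retained turns seen in this example can be sketched in isolation. `split_history` is a hypothetical helper, not part of the original code, and the history entries below are placeholder data.

```python
def split_history(history_output, history_n=4):
    """Split history into (summarized_steps, full_turns).

    summarized_steps: older steps that are reduced to their Action text
    full_turns: the last `history_n` steps kept with screenshot and output
    """
    cut = max(0, len(history_output) - history_n)
    return history_output[:cut], history_output[cut:]

# Six completed steps (placeholder data), so the next request is step 7
history = [
    {"output": f"Action: step {i}", "image": f"shot_{i}.png"}
    for i in range(1, 7)
]
summarized, full = split_history(history)
print(len(summarized), len(full))  # 2 4
```

With six completed steps and N = 4, two steps are summarized and four are sent in full, which matches the message array above.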

Step 3. Parse the model output

The model scales images internally before processing them, and the coordinates it returns are normalized with respect to the scaled image rather than the original screenshot. To perform GUI operations at the correct position, map the returned coordinates back to the original image.

  1. Extract the Tool Call field

    First, extract the Tool Call from the string returned by the model:

    import re
    import json
    
    def extract_tool_calls(text):
        """
        Extract all <tool_call> blocks from the model output
    
        Parameters:
            text: The text returned by the model
    
        Returns:
            actions: A list of parsed operations
        """
        pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE)
        blocks = pattern.findall(text)
    
        actions = []
        for blk in blocks:
            blk = blk.strip()
            try:
                actions.append(json.loads(blk))
            except json.JSONDecodeError as e:
                print(f'Parsing failed: {e} | Snippet: {blk[:80]}...')
    
        return actions

  2. Coordinate mapping function

    The following function calculates the image dimensions after the model's internal scaling:

    import math
    from PIL import Image
    
    def smart_resize(height, width, factor=32, min_pixels=32*32*4, max_pixels=32*32*1280, max_long_side=8192):
        """
        Calculate the image dimensions after internal scaling by the model
    
        Parameters:
            height: Original image height
            width: Original image width
            factor: Resolution factor (default is 32)
            min_pixels: Minimum pixel value
            max_pixels: Maximum pixel value
            max_long_side: Maximum long side limit
    
        Returns:
            (h_bar, w_bar): Scaled height and width
        """
        def round_by_factor(number, factor):
            return round(number / factor) * factor
    
        def ceil_by_factor(number, factor):
            return math.ceil(number / factor) * factor
    
        def floor_by_factor(number, factor):
            return math.floor(number / factor) * factor
    
        if height < 2 or width < 2:
            raise ValueError(f"height:{height} and width:{width} must both be at least 2")
        elif max(height, width) / min(height, width) > 200:
            raise ValueError(f"absolute aspect ratio must be smaller than 200, got {height} / {width}")
    
        # Limit the longest side
        if max(height, width) > max_long_side:
            beta = max(height, width) / max_long_side
            height, width = int(height / beta), int(width / beta)
    
        # Calculate the scaled dimensions
        h_bar = round_by_factor(height, factor)
        w_bar = round_by_factor(width, factor)
    
        if h_bar * w_bar > max_pixels:
            beta = math.sqrt((height * width) / max_pixels)
            h_bar = floor_by_factor(height / beta, factor)
            w_bar = floor_by_factor(width / beta, factor)
        elif h_bar * w_bar < min_pixels:
            beta = math.sqrt(min_pixels / (height * width))
            h_bar = ceil_by_factor(height * beta, factor)
            w_bar = ceil_by_factor(width * beta, factor)
    
        return h_bar, w_bar
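
Once the coordinate space of the model output is known, mapping a point back to the original screenshot is a linear rescale. The sketch below uses a hypothetical helper, `map_to_original`, that is not part of the original code. It assumes the model reports points in a `source_size` space: 1000×1000 if the output is normalized as the system prompt states, or the `(w_bar, h_bar)` derived from `smart_resize` if the output is in scaled-image pixels. Verify which case applies to your model version before relying on it.

```python
def map_to_original(x, y, source_size, original_size):
    """Linearly rescale (x, y) from the model's coordinate space to original pixels.

    source_size: (width, height) of the space the model reports coordinates in
    original_size: (width, height) of the original screenshot
    """
    src_w, src_h = source_size
    orig_w, orig_h = original_size
    mapped_x = round(x / src_w * orig_w)
    mapped_y = round(y / src_h * orig_h)
    # Clamp so that rounding never places the point outside the screenshot
    mapped_x = min(max(mapped_x, 0), orig_w - 1)
    mapped_y = min(max(mapped_y, 0), orig_h - 1)
    return mapped_x, mapped_y

# Assumed case: normalized 1000x1000 output, 1920x1080 screenshot
print(map_to_original(500, 500, (1000, 1000), (1920, 1080)))  # (960, 540)
```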

Step 4. Execute the GUI operation

After parsing the action instruction, use the pyautogui library to simulate GUI operations such as mouse clicks, keyboard input, and scrolling.

import pyautogui
import pyperclip
import time
from PIL import Image
import os

class ComputerTools:
    """Computer GUI operation utility class"""

    def __init__(self):
        self.image_info = None

    def load_image_info(self, path):
        """Load image dimension information as (width, height)"""
        # Use a context manager so the image file handle is closed promptly
        with Image.open(path) as img:
            self.image_info = img.size

    def get_screenshot(self, image_path, retry_times=3):
        """Get a desktop screenshot"""
        if os.path.exists(image_path):
            os.remove(image_path)

        for i in range(retry_times):
            screenshot = pyautogui.screenshot()
            screenshot.save(image_path)
            if os.path.exists(image_path):
                self.load_image_info(image_path)
                return True
            else:
                time.sleep(0.1)
        return False

    def reset(self):
        """Show desktop"""
        pyautogui.hotkey('win', 'd')

    def press_key(self, keys):
        """Key press operation"""
        if isinstance(keys, list):
            cleaned_keys = []
            for key in keys:
                if isinstance(key, str):
                    # Process key name format
                    if key.startswith("keys=["):
                        key = key[6:]
                    if key.endswith("]"):
                        key = key[:-1]
                    if key.startswith("['") or key.startswith('["'):
                        key = key[2:] if len(key) > 2 else key
                    if key.endswith("']") or key.endswith('"]'):
                        key = key[:-2] if len(key) > 2 else key
                    key = key.strip()

                    # Convert key name
                    key_map = {
                        "arrowleft": "left",
                        "arrowright": "right",
                        "arrowup": "up",
                        "arrowdown": "down"
                    }
                    key = key_map.get(key, key)
                    cleaned_keys.append(key)
                else:
                    cleaned_keys.append(key)
            keys = cleaned_keys
        else:
            keys = [keys]

        if len(keys) > 1:
            pyautogui.hotkey(*keys)
        else:
            pyautogui.press(keys[0])

    def type(self, text):
        """Type text by pasting it from the clipboard, which supports Chinese characters (sends Ctrl+V)"""
        pyperclip.copy(text)
        pyautogui.keyDown('ctrl')
        pyautogui.keyDown('v')
        pyautogui.keyUp('v')
        pyautogui.keyUp('ctrl')

    def mouse_move(self, x, y):
        """Move the mouse to the specified coordinates"""
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.moveTo(x, y)

    def left_click(self, x, y):
        """Left-click"""
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.click()

    def left_click_drag(self, x, y):
        """Drag from the current position to the specified coordinates"""
        pyautogui.dragTo(x, y, duration=0.5)
        pyautogui.moveTo(x, y)

    def right_click(self, x, y):
        """Right-click"""
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.rightClick()

    def middle_click(self, x, y):
        """Middle-click"""
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.middleClick()

    def double_click(self, x, y):
        """Double-click"""
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.doubleClick()

    def triple_click(self, x, y):
        """Triple-click"""
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.tripleClick()

    def scroll(self, pixels):
        """Scroll wheel"""
        pyautogui.scroll(pixels)
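
The key-name cleanup inside `press_key` can be checked without touching the keyboard. The sketch below pulls that idea into a standalone function. `normalize_keys` is a hypothetical helper name, not part of the original class, and it is a slightly more tolerant variant: it strips any surrounding brackets and quotes, so a literal bracket or quote key cannot be passed through it.

```python
KEY_MAP = {
    "arrowleft": "left",
    "arrowright": "right",
    "arrowup": "up",
    "arrowdown": "down",
}


def normalize_keys(keys):
    """Return a clean list of key names (hypothetical helper, see lead-in)."""
    if not isinstance(keys, list):
        keys = [keys]
    cleaned = []
    for key in keys:
        if isinstance(key, str):
            # Drop a leading "keys=[" wrapper if the model echoed it
            if key.startswith("keys=["):
                key = key[6:]
            # Strip whitespace, then any surrounding brackets and quotes
            key = key.strip().strip("[]'\"").strip()
            # Translate browser-style arrow names to pyautogui names
            key = KEY_MAP.get(key, key)
        cleaned.append(key)
    return cleaned


print(normalize_keys(["ctrl", "c"]))            # ['ctrl', 'c']
print(normalize_keys(["keys=['arrowleft']"]))   # ['left']
print(normalize_keys("enter"))                  # ['enter']
```

Keeping the cleanup separate from the pyautogui call makes it possible to unit-test the mapping on a machine without a display.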

Step 5. Complete automation flow

Integrate the preceding steps into a complete automation flow. This flow loops through the sequence of taking a screenshot, performing model inference, and executing the GUI operation until the task is complete.

import os
import uuid
import dashscope
import time
from PIL import Image
def run_gui_automation(instruction, max_step=30):
    """
    Run the complete GUI automation flow

    Parameters:
        instruction: User instruction
        max_step: Maximum number of execution steps
    """
    # Configure the API
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
    dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
    model_name = 'gui-plus-2026-02-26'

    # Initialize tools
    computer_tools = ComputerTools()
    computer_tools.reset()  # Show desktop

    # Create an output directory
    output_dir = os.path.join(os.path.expanduser("~"), "Desktop", "gui_automation")
    os.makedirs(output_dir, exist_ok=True)

    # Conversation history
    history = []
    stop_flag = False
    session_id = str(uuid.uuid4())
    print('session_id ', session_id)

    print(f"[Task] {instruction}")
    print("=" * 60)

    for step_id in range(max_step):
        if stop_flag:
            break

        print(f"\n[Step {step_id + 1}]")

        # 1. Screenshot
        screen_shot = os.path.join(output_dir, f'screenshot_{step_id}.png')
        computer_tools.get_screenshot(screen_shot)

        # 2. Construct messages
        # get_messages expects the system prompt as its fourth argument
        messages = get_messages(screen_shot, instruction, history, system_prompt)

        # 3. Call the model
        retry_time = 3
        for _ in range(retry_time):
            response = dashscope.MultiModalConversation.call(
                model=model_name,
                messages=messages,
                vl_high_resolution_images=True,
                headers={"x-dashscope-gui-session-id": session_id},
                stream=False
            )
            print(response['request_id'])
            try:
                output_text = response.output.choices[0].message.content[0]['text']
                break
            except Exception as e:
                # The response did not contain text output; log it and retry
                print(response)
                print(e)
        else:
            raise RuntimeError(f'Model call failed after {retry_time} attempts')
        print(f"[Model output]\n{output_text}\n")

        # 4. Parse operations
        action_list = extract_tool_calls(output_text)
        if not action_list:
            print("No valid operation extracted")
            break

        # 5. Execute operations
        for action_id, action in enumerate(action_list):
            action_parameter = action['arguments']
            action_type = action_parameter['action']

            # Get image dimensions for coordinate mapping
            dummy_image = Image.open(screen_shot)
            resized_height, resized_width = smart_resize(
                dummy_image.height,
                dummy_image.width,
                factor=16,
                min_pixels=3136,
                max_pixels=1003520 * 200
            )

            # Map coordinates (from 1000x1000 normalized coordinates to actual dimensions)
            for key in ['coordinate', 'coordinate1', 'coordinate2']:
                if key in action_parameter:
                    action_parameter[key][0] = int(action_parameter[key][0] / 1000 * resized_width)
                    action_parameter[key][1] = int(action_parameter[key][1] / 1000 * resized_height)

            # Execute the corresponding operation
            if action_type in ['click', 'left_click']:
                computer_tools.left_click(
                    action_parameter['coordinate'][0],
                    action_parameter['coordinate'][1]
                )
                print(f"✓ Left-click ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")

            elif action_type == 'mouse_move':
                computer_tools.mouse_move(
                    action_parameter['coordinate'][0],
                    action_parameter['coordinate'][1]
                )
                print(f"✓ Move mouse to ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")

            elif action_type == 'middle_click':
                computer_tools.middle_click(
                    action_parameter['coordinate'][0],
                    action_parameter['coordinate'][1]
                )
                print(f"✓ Middle-click")

            elif action_type in ['right click', 'right_click']:
                computer_tools.right_click(
                    action_parameter['coordinate'][0],
                    action_parameter['coordinate'][1]
                )
                print(f"✓ Right-click")

            elif action_type in ['key', 'hotkey']:
                computer_tools.press_key(action_parameter['keys'])
                print(f"✓ Key press {action_parameter['keys']}")

            elif action_type == 'type':
                text = action_parameter['text']
                computer_tools.type(text)
                print(f"✓ Type text: {text}")

            elif action_type in ['drag', 'left_click_drag']:
                computer_tools.left_click_drag(
                    action_parameter['coordinate'][0],
                    action_parameter['coordinate'][1]
                )
                print(f"✓ Drag to ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")

            elif action_type == 'scroll':
                if 'coordinate' in action_parameter:
                    computer_tools.mouse_move(
                        action_parameter['coordinate'][0],
                        action_parameter['coordinate'][1]
                    )
                computer_tools.scroll(action_parameter.get("pixels", 1))
                print(f"✓ Scroll {action_parameter.get('pixels', 1)} pixels")

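            elif action_type in ['computer_double_click', 'double_click']:
                computer_tools.double_click(
                    action_parameter['coordinate'][0],
                    action_parameter['coordinate'][1]
                )
                print(f"✓ Double-click")

            elif action_type == 'triple_click':
                computer_tools.triple_click(
                    action_parameter['coordinate'][0],
                    action_parameter['coordinate'][1]
                )
                print(f"✓ Triple-click")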

            elif action_type == 'wait':
                time.sleep(action_parameter.get('time', 2))
                print(f"✓ Wait {action_parameter.get('time', 2)} seconds")

            elif action_type == 'answer':
                print(f"✓ Task complete: {action_parameter.get('text', '')}")
                stop_flag = True
                break

            elif action_type in ['stop', 'terminate', 'done']:
                print(f"✓ Task terminated: {action_parameter.get('status', 'success')}")
                stop_flag = True
                break

            else:
                print(f"Unknown action type: {action_type}")

        # 6. Save history
        history.append({
            'output': output_text,
            'image': screen_shot
        })

        time.sleep(2)  # Operation interval

    print("\n" + "=" * 60)
    print(f"[Complete] Total executed {len(history)} steps")

# Example usage
if __name__ == '__main__':
    run_gui_automation(
        instruction='Open Chrome for me and search for Alibaba on Baidu',
        max_step=30
    )
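
The coordinate-mapping step in the loop is easy to get wrong, so here is a minimal standalone sketch of it. `map_coordinate` is a hypothetical helper name. It assumes the model returns points on a 1000x1000 grid, as the system prompt states, and that the target width and height come from `smart_resize`.

```python
def map_coordinate(point, target_width, target_height, grid=1000):
    """Scale an (x, y) point from the normalized grid to pixel coordinates."""
    x, y = point
    return int(x / grid * target_width), int(y / grid * target_height)


# Centre of the grid on a 1920x1088 target (the smart_resize result for a
# 1920x1080 screenshot with factor=16)
print(map_coordinate([500, 500], 1920, 1088))   # (960, 544)

# Bottom-left corner of the grid
print(map_coordinate([0, 1000], 1920, 1088))    # (0, 1088)
```

Because the mapped values are relative to the resized dimensions, they can differ by a few pixels from the physical screen size. On scaled (HiDPI) displays the screenshot size may also differ from the coordinate space pyautogui uses, so it is worth verifying a few clicks manually.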

Complete code example for computer

import os
import re
import json
import math
import time
import uuid
import pyautogui
import pyperclip
import dashscope
from PIL import Image

# ===================== Step 1: System Prompt =====================

system_prompt = """# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* 
`terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""


# ===================== Step 2: Construct multi-turn conversation messages =====================

def get_messages(image, instruction, history_output, system_prompt):
    history_n = 4
    current_step = len(history_output)

    history_start_idx = max(0, current_step - history_n)
    previous_actions = []
    for i in range(history_start_idx):
        if i < len(history_output):
            history_output_str = history_output[i]['output']
            if 'Action:' in history_output_str and '<tool_call>' in history_output_str:
                history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
            previous_actions.append(f"Step {i + 1}: {history_output_str}")

    previous_actions_str = "\n".join(previous_actions) if previous_actions else "None"

    instruction_prompt = f"""
      Please generate the next move according to the UI screenshot, instruction and previous actions.

      Instruction: {instruction}

      Previous actions:
      {previous_actions_str}"""

    messages = [{"role": "system", "content": [{"text": system_prompt}]}]

    history_len = min(history_n, len(history_output))
    if history_len > 0:
        for history_id, history_item in enumerate(history_output[-history_n:], 0):
            if history_id == 0:
                messages.append({
                    "role": "user",
                    "content": [
                        {"text": instruction_prompt},
                        {"image": "file://" + history_item['image']}
                    ]
                })
            else:
                messages.append({
                    "role": "user",
                    "content": [{"image": "file://" + history_item['image']}]
                })
            messages.append({
                "role": "assistant",
                "content": [{"text": history_item['output']}],
            })
        messages.append({
            "role": "user",
            "content": [{"image": "file://" + image}]
        })
    else:
        messages.append({
            "role": "user",
            "content": [
                {"text": instruction_prompt},
                {"image": "file://" + image}
            ]
        })
    return messages


# ===================== Step 3: Parse model output and map coordinates =====================

def extract_tool_calls(text):
    pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE)
    blocks = pattern.findall(text)
    actions = []
    for blk in blocks:
        blk = blk.strip()
        try:
            actions.append(json.loads(blk))
        except json.JSONDecodeError as e:
            print(f'Parsing failed: {e} | Snippet: {blk[:80]}...')
    return actions

def smart_resize(height, width, factor=32, min_pixels=32*32*4, max_pixels=32*32*1280, max_long_side=8192):
    def round_by_factor(number, factor):
        return round(number / factor) * factor
    def ceil_by_factor(number, factor):
        return math.ceil(number / factor) * factor
    def floor_by_factor(number, factor):
        return math.floor(number / factor) * factor

    if height < 2 or width < 2:
        raise ValueError(f"height:{height} and width:{width} must both be at least 2")
    elif max(height, width) / min(height, width) > 200:
        raise ValueError(f"absolute aspect ratio must be smaller than 200, got {height} / {width}")

    if max(height, width) > max_long_side:
        beta = max(height, width) / max_long_side
        height, width = int(height / beta), int(width / beta)

    h_bar = round_by_factor(height, factor)
    w_bar = round_by_factor(width, factor)

    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar


# ===================== Step 4: GUI operation utility class =====================

class ComputerTools:
    def __init__(self):
        self.image_info = None

    def load_image_info(self, path):
        width, height = Image.open(path).size
        self.image_info = (width, height)

    def get_screenshot(self, image_path, retry_times=3):
        if os.path.exists(image_path):
            os.remove(image_path)
        for i in range(retry_times):
            screenshot = pyautogui.screenshot()
            screenshot.save(image_path)
            if os.path.exists(image_path):
                self.load_image_info(image_path)
                return True
            else:
                time.sleep(0.1)
        return False

    def reset(self):
        pyautogui.hotkey('win', 'd')

    def press_key(self, keys):
        if isinstance(keys, list):
            cleaned_keys = []
            for key in keys:
                if isinstance(key, str):
                    if key.startswith("keys=["): key = key[6:]
                    if key.endswith("]"): key = key[:-1]
                    if key.startswith("['") or key.startswith('["'): key = key[2:] if len(key) > 2 else key
                    if key.endswith("']") or key.endswith('"]'): key = key[:-2] if len(key) > 2 else key
                    key = key.strip()
                    key_map = {"arrowleft": "left", "arrowright": "right", "arrowup": "up", "arrowdown": "down"}
                    key = key_map.get(key, key)
                    cleaned_keys.append(key)
                else:
                    cleaned_keys.append(key)
            keys = cleaned_keys
        else:
            keys = [keys]
        if len(keys) > 1:
            pyautogui.hotkey(*keys)
        else:
            pyautogui.press(keys[0])

    def type(self, text):
        pyperclip.copy(text)
        pyautogui.keyDown('ctrl')
        pyautogui.keyDown('v')
        pyautogui.keyUp('v')
        pyautogui.keyUp('ctrl')

    def mouse_move(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.moveTo(x, y)

    def left_click(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.click()

    def left_click_drag(self, x, y):
        pyautogui.dragTo(x, y, duration=0.5)
        pyautogui.moveTo(x, y)

    def right_click(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.rightClick()

    def middle_click(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.middleClick()

    def double_click(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.doubleClick()

    def triple_click(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.tripleClick()

    def scroll(self, pixels):
        pyautogui.scroll(pixels)


# ===================== Step 5: Complete automation flow =====================

def run_gui_automation(instruction, max_step=30):
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
    dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
    model_name = 'gui-plus-2026-02-26'

    computer_tools = ComputerTools()
    computer_tools.reset()

    output_dir = os.path.join(os.path.expanduser("~"), "Desktop", "gui_automation")
    os.makedirs(output_dir, exist_ok=True)

    history = []
    stop_flag = False
    session_id = str(uuid.uuid4())
    print('session_id ', session_id)

    print(f"[Task] {instruction}")
    print("=" * 60)

    for step_id in range(max_step):
        if stop_flag:
            break

        print(f"\n[Step {step_id + 1}]")

        screen_shot = os.path.join(output_dir, f'screenshot_{step_id}.png')
        computer_tools.get_screenshot(screen_shot)

        messages = get_messages(screen_shot, instruction, history, system_prompt)

        retry_time = 3
        for _ in range(retry_time):
            response = dashscope.MultiModalConversation.call(
                model=model_name,
                messages=messages,
                vl_high_resolution_images=True,
                headers={"x-dashscope-gui-session-id": session_id},
                stream=False
            )
            print(response['request_id'])
            try:
                output_text = response.output.choices[0].message.content[0]['text']
                break
            except Exception as e:
                # The response did not contain text output; log it and retry
                print(response)
                print(e)
        else:
            raise RuntimeError(f'Model call failed after {retry_time} attempts')
        print(f"[Model output]\n{output_text}\n")

        action_list = extract_tool_calls(output_text)
        if not action_list:
            print("No valid operation extracted")
            break

        for action_id, action in enumerate(action_list):
            action_parameter = action['arguments']
            action_type = action_parameter['action']

            dummy_image = Image.open(screen_shot)
            resized_height, resized_width = smart_resize(
                dummy_image.height, dummy_image.width,
                factor=16, min_pixels=3136, max_pixels=1003520 * 200
            )

            for key in ['coordinate', 'coordinate1', 'coordinate2']:
                if key in action_parameter:
                    action_parameter[key][0] = int(action_parameter[key][0] / 1000 * resized_width)
                    action_parameter[key][1] = int(action_parameter[key][1] / 1000 * resized_height)

            if action_type in ['click', 'left_click']:
                computer_tools.left_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Left-click ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")
            elif action_type == 'mouse_move':
                computer_tools.mouse_move(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Move mouse")
            elif action_type == 'middle_click':
                computer_tools.middle_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Middle-click")
            elif action_type in ['right click', 'right_click']:
                computer_tools.right_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Right-click")
            elif action_type in ['key', 'hotkey']:
                computer_tools.press_key(action_parameter['keys'])
                print(f"✓ Key press {action_parameter['keys']}")
            elif action_type == 'type':
                computer_tools.type(action_parameter['text'])
                print(f"✓ Type text: {action_parameter['text']}")
            elif action_type in ['drag', 'left_click_drag']:
                computer_tools.left_click_drag(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Drag")
            elif action_type == 'scroll':
                if 'coordinate' in action_parameter:
                    computer_tools.mouse_move(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                computer_tools.scroll(action_parameter.get("pixels", 1))
                print(f"✓ Scroll {action_parameter.get('pixels', 1)} pixels")
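            elif action_type in ['computer_double_click', 'double_click']:
                computer_tools.double_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Double-click")
            elif action_type == 'triple_click':
                computer_tools.triple_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Triple-click")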
            elif action_type == 'wait':
                time.sleep(action_parameter.get('time', 2))
                print(f"✓ Wait {action_parameter.get('time', 2)} seconds")
            elif action_type == 'answer':
                print(f"✓ Task complete: {action_parameter.get('text', '')}")
                stop_flag = True
                break
            elif action_type in ['stop', 'terminate', 'done']:
                print(f"✓ Task terminated: {action_parameter.get('status', 'success')}")
                stop_flag = True
                break
            else:
                print(f"Unknown action type: {action_type}")

        history.append({'output': output_text, 'image': screen_shot})
        time.sleep(2)

    print("\n" + "=" * 60)
    print(f"[Complete] Total executed {len(history)} steps")


if __name__ == '__main__':
    run_gui_automation(
        instruction='Open Chrome for me and search for Alibaba on Baidu',
        max_step=30
    )
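
The parsing step in the listing above can be exercised offline before wiring it to the model. The sketch below repeats the `<tool_call>` extraction on a hand-written sample response. The sample text is an assumption about what a typical reply looks like, based on the response format in the system prompt.

```python
import json
import re


def extract_tool_calls(text):
    """Return the JSON objects found inside <tool_call>...</tool_call> blocks."""
    pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE)
    actions = []
    for block in pattern.findall(text):
        try:
            actions.append(json.loads(block.strip()))
        except json.JSONDecodeError as e:
            # Skip malformed blocks instead of aborting the whole step
            print(f'Parsing failed: {e}')
    return actions


# Hand-written sample reply (illustrative, not real model output)
sample = '''Action: Click the Chrome icon on the desktop.
<tool_call>
{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [120, 340]}}
</tool_call>'''

calls = extract_tool_calls(sample)
print(calls[0]["arguments"]["action"])      # left_click
print(calls[0]["arguments"]["coordinate"])  # [120, 340]
```

A reply with no `<tool_call>` block yields an empty list, which is the case the automation loop treats as "No valid operation extracted".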

Mobile GUI tasks

Mobile automation is achieved using the Android Debug Bridge (ADB) tool.

Environment setup:

  1. Download the Android Debug Bridge for your system and save it to a specified path.

  2. Enable "USB debugging" or "ADB debugging" on your phone. You usually need to enable Developer options first.

  3. Connect your phone to your computer with a data cable and select "File Transfer" mode.

  4. Download the ADB Keyboard APK, transfer it to your phone, open it, and choose to install it despite the risk warning.

  5. In the system settings, switch the default input method to ADB Keyboard.

  6. Test the connection in your computer's terminal: /path/to/adb devices. If the device list is not empty, the connection is successful.

  7. On macOS or Linux, grant execute permission to the adb binary: sudo chmod +x /path/to/adb

  8. Open an app on your phone, then run the command: /path/to/adb shell am start -a android.intent.action.MAIN -c android.intent.category.HOME. If your phone returns to the home screen, the setup is complete.
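
The connection check in step 6 can also be scripted. The sketch below parses the text that `adb devices` prints and returns the serial numbers of devices that are ready. `parse_adb_devices`, `list_connected_devices`, and the sample output are illustrative assumptions, and `/path/to/adb` remains a placeholder for wherever you saved the tool.

```python
import subprocess


def parse_adb_devices(output):
    """Return serials of devices whose state is 'device' (authorized and online)."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Device rows look like "<serial>\tdevice"; header and blank lines are skipped
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def list_connected_devices(adb_path="/path/to/adb"):
    """Run `adb devices` and parse its output (adb_path is a placeholder)."""
    result = subprocess.run([adb_path, "devices"], capture_output=True, text=True, check=True)
    return parse_adb_devices(result.stdout)


# Offline check with sample output in the usual `adb devices` layout
sample = "List of devices attached\nemulator-5554\tdevice\nR58M123ABC\tunauthorized\n\n"
print(parse_adb_devices(sample))   # ['emulator-5554']
```

An `unauthorized` entry usually means the USB debugging prompt on the phone has not been accepted yet.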

The mobile GUI example is similar to the computer example. The complete code is as follows:

Complete code example for mobile

  1. Construct the mobile system prompt

    import json, os, subprocess
    import dashscope, time, math
    from PIL import Image, ImageDraw
    import shutil, requests
    from datetime import datetime
    
    mobile_system_prompt = '''# Tools
            You may call one or more functions to assist with the user query.
            
            You are provided with function signatures within <tools></tools> XML tags:
            <tools>
            {"type": "function", "function": {"name_for_human": "mobile_use", "name": "mobile_use", "description": "Use a touchscreen to interact with a mobile device, and take screenshots.
            * This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.
            * Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.
            * The screen's resolution is 1000x1000.
            * Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:
            * `key`: Perform a key event on the mobile device.
                - This supports adb's `keyevent` syntax.
                - Examples: "volume_up", "volume_down", "power", "camera", "clear".
            * `click`: Click the point on the screen with coordinate (x, y).
            * `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.
            * `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).
            * `type`: Input the specified text into the activated input box.
            * `system_button`: Press the system button.
            * `open`: Open an app on the device.
            * `wait`: Wait specified seconds for the change to happen.
            * `answer`: Terminate the current task and output the answer.
            * `interact`: Resolve the blocking window by interacting with the user.
            * `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "click", "long_press", "swipe", "type", "system_button", "open", "wait", "answer", "interact", "terminate"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.", "type": "array"}, "coordinate2": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.", "type": "array"}, "text": {"description": "Required only by `action=key`, `action=type`, `action=open`, `action=answer`,and `action=interact`.", "type": "string"}, "time": {"description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.", "type": "number"}, "button": {"description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`", "enum": ["Back", "Home", "Menu", "Enter"], "type": "string"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}
            </tools>
            
            For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
            <tool_call>
            {"name": <function-name>, "arguments": <args-json-object>}
            </tool_call>
            
            # Response format
            
            Response format for every step:
            1) Action: a short imperative describing what to do in the UI.
            2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
            
            Rules:
            - Output exactly in the order: Action, <tool_call>.
            - Be brief: one for Action.
            - Do not output anything else outside those two parts.
            - If finishing, use action=terminate in the tool call.'''
  2. Construct multi-turn conversation messages

    from datetime import datetime
    
    def get_messages(image, instruction, history_output, system_prompt):
        history_n = 4
        current_step = len(history_output)
    
        history_start_idx = max(0, current_step - history_n)
    
        previous_actions = []
        for i in range(history_start_idx):
            if i < len(history_output):
                history_output_str = history_output[i]['output']
                if 'Action:' in history_output_str and '<tool_call>' in history_output_str:
                    history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
                previous_actions.append(f"Step {i + 1}: {history_output_str}")
    
        previous_actions_str = (
            "\n".join(previous_actions) if previous_actions else "None"
        )
        # Add background information
        today = datetime.today()
        weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
        weekday = weekday_names[today.weekday()]
        formatted_date = today.strftime("%Y-%m-%d") + " " + weekday
        ground_info = f'''Today's date is: {formatted_date}.'''
    
    
        instruction_prompt = f"""
            Please generate the next move according to the UI screenshot, instruction and previous actions.
            
            Instruction: {ground_info}{instruction}
            
            Previous actions:
            {previous_actions_str}"""
    
        ## Call the model
        messages = [
            {
                "role": "system",
                "content": [
                    {"text": system_prompt}
                ],
            }
        ]
        history_len = min(history_n, len(history_output))
        if history_len > 0:
            for history_id, history_item in enumerate(history_output[-history_n:], 0):
                if history_id == 0:
                    messages.append({
                        "role": "user",
                        "content": [
                            {"text": instruction_prompt},
                            {"image": "file://" +history_item['image']}
                        ]
                    })
                else:
                    messages.append({
                        "role": "user",
                        "content": [
                            {"image": "file://" +history_item['image']}
                        ]
                    })
                messages.append({
                    "role": "assistant",
                    "content": [
                        {"text": history_item['output']},
                    ]
                })
            messages.append({
                "role": "user",
                "content": [
                    {"image": "file://" +image},
                ]
            })
        else:
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {
                            "text": instruction_prompt
                        },
                        {
                            "image": "file://" +image,
                        },
                    ],
                }
            )
    
        return messages
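To make the history handling in get_messages easier to follow, the sketch below reproduces only its index arithmetic. Steps older than the last history_n are condensed into the "Previous actions" text, and the most recent history_n steps are replayed as full user/assistant turns with their screenshots. The helper name split_history is hypothetical and is not part of the original code.

```python
def split_history(num_steps, history_n=4):
    """Return (summarized, replayed) step indices, mirroring get_messages."""
    # Steps before this index are condensed into "Step k: <action>" text.
    history_start_idx = max(0, num_steps - history_n)
    summarized = list(range(history_start_idx))
    # The remaining steps are sent as full turns (screenshot + model output).
    replayed = list(range(history_start_idx, num_steps))
    return summarized, replayed


if __name__ == "__main__":
    for n in (0, 3, 6):
        print(n, split_history(n))
```

With six completed steps, for example, steps 0 and 1 appear only as text, while steps 2 to 5 are sent together with their screenshots.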
  3. Calculate the size of the scaled image

    The mobile and computer versions share the same smart_resize function. For more information, see Coordinate mapping function.
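As a rough illustration of how the scaled size is used, the sketch below converts a model coordinate into pixels. It assumes the model returns coordinates on the 1000x1000 scale stated in the system prompt and that the target width and height come from smart_resize. The function name map_to_pixels is hypothetical.

```python
def map_to_pixels(coordinate, resized_width, resized_height, scale=1000):
    """Convert an (x, y) pair on a 0..scale grid into pixel coordinates."""
    x, y = coordinate
    # Same arithmetic as the automated flow: value / 1000 * resized dimension.
    return int(x / scale * resized_width), int(y / scale * resized_height)


if __name__ == "__main__":
    # A point at the horizontal center, one quarter down a 1080x2400 screenshot.
    print(map_to_pixels([500, 250], 1080, 2400))
```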

  4. Perform operations in the GUI

    Use ADB commands to perform the actual phone operations.

    import subprocess
    import os
    import time
    from PIL import Image
    
    class AdbTools:
        def __init__(self, adb_path, device=None):
            self.adb_path = adb_path
            self.device = device
            self.__device_str__ = f" -s {device} " if device is not None else ' '
            self.image_info = None
    
        def adb_shell(self, command):
            command = self.adb_path + self.__device_str__ + command
            subprocess.run(command, capture_output=True, text=True, shell=True)
    
        ## Load phone size
        def load_image_info(self, path):
            width, height = Image.open(path).size
            self.image_info = (width, height)
    
        ## Get screenshot
        def get_screenshot(self, image_path, retry_times=3):
            # Quote the output path so that directories containing spaces survive shell parsing
            command = self.adb_path + (f" -s {self.device}" if self.device is not None else '') + f' exec-out screencap -p > "{image_path}"'
    
            for i in range(retry_times):
                subprocess.run(command, capture_output=True, text=True, shell=True)
                if os.path.exists(image_path):
                    self.load_image_info(image_path)
                    return True
                else:
                    time.sleep(0.1)
            else:
                return False
    
        ## Click (x,y)
        ## coordinate_size: The dimensions of the input image. Default is None, which uses the current phone's dimensions. Pass as {'x': int, 'y': int}.
        def click(self, x, y, coordinate_size=None):
            command = self.adb_path + self.__device_str__ + f" shell input tap {x} {y}"
            subprocess.run(command, capture_output=True, text=True, shell=True)
    
        def long_press(self, x, y, time=800):
            command = self.adb_path + self.__device_str__ + f" shell input swipe {x} {y} {x} {y} {time}"
            subprocess.run(command, capture_output=True, text=True, shell=True)
    
        ## Swipe from (x1,y1) to (x2,y2)
        ## coordinate_size: The dimensions of the input image. Default is None, which uses the current phone's dimensions. Pass as {'x': int, 'y': int}.
        def slide(self, x1, y1, x2, y2, coordinate_size=None, slide_time=800):
            command = self.adb_path + self.__device_str__ + f" shell input swipe {x1} {y1} {x2} {y2} {slide_time}"
            subprocess.run(command, capture_output=True, text=True, shell=True)
    
        ## Back
        def back(self):
            command = self.adb_path + self.__device_str__ + f"  shell input keyevent 4"
            subprocess.run(command, capture_output=True, text=True, shell=True)
    
        # Press the Home key
        def home(self):
            command = self.adb_path + self.__device_str__ + f" shell am start -a android.intent.action.MAIN -c android.intent.category.HOME"
            subprocess.run(command, capture_output=True, text=True, shell=True)
    
        ## Type (supports both Chinese and English, other languages are not guaranteed). Note: ADB Keyboard must be installed on the phone first.
        def type(self, text):
            escaped_text = text.replace('"', '\\"').replace("'", "\\'")
            command_list = [
                f"shell ime enable com.android.adbkeyboard/.AdbIME ",
                f"shell ime set com.android.adbkeyboard/.AdbIME ",
                0.1,
                f'shell am broadcast -a ADB_INPUT_TEXT --es msg "{escaped_text}" ',
                0.1,
                f"shell ime disable com.android.adbkeyboard/.AdbIME"
            ]
    
            for command in command_list:
                if isinstance(command, float):
                    time.sleep(command)
                elif isinstance(command, str):
                    subprocess.run(self.adb_path + self.__device_str__ + command.strip(), capture_output=True, text=True, shell=True)
    
        def get_package_name(self, all_packages=False):
            try:
                if all_packages:
                    command = self.adb_path + self.__device_str__ + " shell pm list packages"
                else:
                    command = self.adb_path + self.__device_str__ + " shell pm list packages -3"
                res = subprocess.run(command, capture_output=True, text=True, shell=True)
                pkgs = []
                for line in res.stdout.splitlines():
                    s = line.strip()
                    if not s:
                        continue
                    # Remove the "package:" prefix
                    if s.startswith("package:"):
                        s = s[len("package:"):]
                    # If it contains "=", the package name is on the right
                    if "=" in s:
                        _, s = s.split("=", 1)
                    if s:
                        pkgs.append(s)
                return sorted(set(pkgs))
            except Exception as e:
                print(e)
                return []
    
        def open_app(self, package_name):
            command = self.adb_path + self.__device_str__ + f" shell monkey -p {package_name} -c android.intent.category.LAUNCHER 1"
            subprocess.run(command, capture_output=True, text=True, shell=True)                             
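The class above builds command strings and runs them with shell=True, so any path or argument containing spaces depends on shell quoting. One possible alternative, sketched here under the assumption that adb is reachable at adb_path, is to pass an argument list to subprocess.run and write the screenshot bytes from Python. The helper names build_adb_args and capture_screenshot are hypothetical and are not part of the original class.

```python
import subprocess


def build_adb_args(adb_path, device, *args):
    """Build an argv list such as [adb, -s, <device>, shell, input, tap, x, y]."""
    argv = [adb_path]
    if device is not None:
        argv += ["-s", device]
    argv += [str(a) for a in args]
    return argv


def capture_screenshot(adb_path, device, image_path):
    """Write the PNG bytes from `adb exec-out screencap -p` to image_path."""
    argv = build_adb_args(adb_path, device, "exec-out", "screencap", "-p")
    result = subprocess.run(argv, capture_output=True)
    # Treat a non-zero exit code or empty output as a failed capture.
    if result.returncode != 0 or not result.stdout:
        return False
    with open(image_path, "wb") as f:
        f.write(result.stdout)
    return True


if __name__ == "__main__":
    print(build_adb_args("adb", None, "shell", "input", "tap", 540, 600))
```

Because no shell is involved, the screenshot path needs no quoting, and a failed capture does not leave an empty file behind.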
  5. Application package name mapping

    The package name is the unique identifier for an Android application (format: com.companyname.productname). For example, com.tencent.mm is for WeChat by Tencent (mm stands for Mobile Messenger).

    To support opening applications by name (action=open), you need to maintain a mapping from application names to package names.

    # Common application package name mapping (example, can be extended as needed)
    package_str_list = '''com.tencent.mm	WeChat	wechat			
    com.tencent.mobileqq	qq	Tencent QQ			
    com.sina.weibo	Weibo				
    com.taobao.taobao	Taobao				
    com.jingdong.app.mall	JD.com	JD Now			
    com.xunmeng.pinduoduo	Pinduoduo				
    com.xingin.xhs	Xiaohongshu				
    com.douban.frodo	Douban				
    com.zhihu.android	Zhihu				
    com.autonavi.minimap	AMAP	Gaode			
    com.baidu.BaiduMap	Baidu Maps				
    com.sankuai.meituan.takeoutnew	Meituan Waimai				
    com.sankuai.meituan	Meituan	Meituan Waimai			
    com.dianping.v1	Dazhong Dianping	Dianping			
    me.ele	Ele.me	Taobao Flash Purchase			
    com.yek.android.kfc.activitys	KFC				
    ctrip.android.view	Ctrip	Ctrip Travel			
    com.MobileTicket	China Railway 12306	12306			
    com.Qunar	Qunar Travel	Qunar.com	Qunar		
    com.sdu.didi.psnger	DiDi Chuxing	DiDi			
    tv.danmaku.bili	bilibili	B Site	bili		
    com.ss.android.ugc.aweme	Douyin				
    com.smile.gifmaker	Kuaishou				
    com.tencent.qqlive	Tencent Video				
    com.qiyi.video	iQIYI				
    com.youku.phone	Youku	Youku Video			
    com.hunantv.imgo.activity	Mango TV	Mango			
    com.phoenix.read	Hongguo Short Video	Hongguo			
    com.netease.cloudmusic	NetEase Cloud Music	NetEase Cloud			
    com.tencent.qqmusic	QQ Music				
    com.luna.music	Qishui Music				
    com.ximalaya.ting.android	Ximalaya				
    com.dragon.read	Fanqie Free Novels	Fanqie Novels			
    com.kmxs.reader	Qimao Free Novels				
    com.ss.android.lark	Lark				
    com.tencent.androidqqmail	QQ Mailbox				
    com.larus.nova	Doubao				
    com.gotokeep.keep	Keep				
    com.lingan.seeyou	Meiyou				
    com.tencent.news	Tencent News				
    com.ss.android.article.news	Toutiao				
    com.lianjia.beike	Beike Zhaofang				
    com.anjuke.android.app	Anjuke				
    com.hexin.plat.android	Tonghuashun				
    com.miHoYo.hkrpg	Honkai: Star Rail	Honkai			
    com.papegames.lysk.cn	Love and Deepspace				
    com.android.settings	settings	androidsystemsettings			
    com.android.soundrecorder	audiorecorder				
    com.rammigsoftware.bluecoins	bluecoins				
    com.flauschcode.broccoli	broccoli				
    com.booking	booking				
    com.android.chrome	Google Chrome	googlechrome	chrome		
    com.android.deskclock	Clock	Alarm Clock	clock		
    com.android.contacts	contacts				
    com.duolingo	duolingo	Duolingo			
    com.expedia.bookings	expedia				
    com.android.fileexplorer	files	filemanager			
    com.google.android.gm	gmail	googlemail			
    com.google.android.apps.nbu.files	googlefiles	filesbygoogle			
    com.google.android.calendar	googlecalendar				
    com.google.android.apps.dynamite	googlechat				
    com.google.android.deskclock	googleclock				
    com.google.android.contacts	googlecontacts				
    com.google.android.apps.docs.editors.docs	googledocs				
    com.google.android.apps.docs	googledrive				
    com.google.android.apps.fitness	googlefit				
    com.google.android.keep	googlekeep				
    com.google.android.apps.maps	googlemaps				
    com.google.android.apps.books	googleplaybooks				
    com.android.vending	googleplaystore				
    com.google.android.apps.docs.editors.slides	googleslides				
    com.google.android.apps.tasks	googletasks				
    net.cozic.joplin	joplin				
    com.mcdonalds.app	McDonald's	mcdonald			
    net.osmand	osmand				
    com.Project100Pi.themusicplayer	pimusicplayer				
    com.quora.android	quora				
    com.reddit.frontpage	reddit				
    code.name.monkey.retromusic	retromusic				
    com.scientificcalculatorplus.simplecalculator.basiccalculator.mathcalc	simplecalendarpro				
    com.simplemobiletools.smsmessenger	simplesmsmessenger				
    org.telegram.messenger	telegram				
    com.einnovation.temu	temu				
    com.zhiliaoapp.musically	tiktok				
    com.twitter.android	twitter	x			
    org.videolan.vlc	vlc				
    com.whatsapp	whatsapp				
    com.taobao.movie.android	Taopiaopiao				
    com.tongcheng.android	Tongcheng Travel	Tongcheng			
    com.sankuai.movie	Maoyan				
    com.wuba.zhuanzhuan	Zhuanzhuan				
    com.tencent.weread	WeChat Read				
    com.taobao.idlefish	Xianyu				
    com.wudaokou.hippo	Hema				
    com.eg.android.AlipayGphone	Alipay				
    com.jd.jrapp	JD Finance				
    com.achievo.vipshop	Vipshop				
    com.smzdm.client.android	SMZDM				
    cn.kuwo.player	Kuwo Music				
    com.taobao.trip	Fliggy	Fliggy Travel			
    com.jingdong.pdj	JD Daojia				
    com.tencent.map	Tencent Maps				
    com.shizhuang.duapp	Dewu				
    cn.damai	Damai	Damai.cn			
    com.ss.android.auto	Dongchedi				
    com.cubic.autohome	Autohome				
    com.wuba	58.com	58 Tongcheng			
    com.android.calendar	Calendar				
    com.alibaba.android.rimet	DingTalk				
    com.meituan.retail.v.android	Xiaoxiang Supermarket				
    com.aliyun.tongyi	Tongyi	Qwen	Tongyi Qwen		
    com.hupu.games	Hupu	Hupu Sports			
    com.quark.browser	Quark	Quark Browser			
    com.yuantiku.tutor	Yuanfudao				
    com.tencent.mtt	QQ Browser				
    com.umetrip.android.msky.app	Umetrip				
    com.UCMobile	UC Browser				
    com.ss.android.ugc.aweme.lite	Douyin Lite	Douyin			
    air.tv.douyu.android	Douyu				
    com.tencent.hunyuan.app.chat	Yuanbao				
    com.baidu.searchbox	Baidu				
    com.lemon.lv	Jianying				
    cn.soulapp.android	Soul				
    com.baidu.netdisk	Baidu Netdisk				
    com.tmri.app.main	Jiaoguan 12123	12123			
    com.kugou.android	Kugou	Kugou Music			
    com.tencent.android.qqdownloader	Yingyongbao				
    com.mt.mtxx.mtxx	Meitu	Meitu XiuXiu			
    com.tencent.karaoke	WeSing				
    com.intsig.camscanner	CamScanner				
    com.android.bankabc	Agricultural Bank of China	ABC			
    cmb.pb	China Merchants Bank	CMB			
    com.ganji.android.haoche_c	Guazi				
    com.sf.activity	SF Express				
    com.ziroom.ziroomcustomer	Ziroom				
    com.yumc.phsuperapp	Pizza Hut				
    cn.dominos.pizza	Domino's Pizza	Domino's			
    cn.wps.moffice_eng	WPS Office	WPS			
    com.mfw.roadbook	Mafengwo				
    com.moonshot.kimichat	Kimi				
    com.tencent.wemeet.app	Tencent Meeting				
    com.deepseek.chat	DeepSeek				
    com.spdbccc.app	SPD Bank				
    cn.samsclub.app	Sam's Club	Sam's			
    com.tencent.qqsports	Tencent Sports				
    com.hanweb.android.zhejiang.activity	Zheli Ban				
    com.ss.android.article.video	Xigua Video				
    com.taou.maimai	Maimai	'''
    
    PACKAGES_NAME_DICT = {}
    NAME_PACKAGE_DICT = {}
    
    def normalize_package_name(name):
        name = name.lower().strip().replace(" ", "").replace("-", "")
        return name
    
    for package_str in package_str_list.split("\n"):
        package_name = package_str.strip().split("\t")
        PACKAGES_NAME_DICT[package_name[0]] = [normalize_package_name(i) for i in package_name[1:]]
        for name in package_name[1:]:
            name = normalize_package_name(name)
            if name not in NAME_PACKAGE_DICT:
                NAME_PACKAGE_DICT[name] = [package_name[0]]
            else:
                NAME_PACKAGE_DICT[name].append(package_name[0])
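The keys of NAME_PACKAGE_DICT are normalized (lowercased, with spaces and hyphens removed), while the model may return an app name in any casing. A lookup would therefore typically normalize the name first and then pick the first candidate that is actually installed. The sketch below illustrates this with a small stand-in dictionary. The names resolve_package and demo_mapping are hypothetical.

```python
def normalize_name(name):
    """Apply the same normalization used when building the mapping."""
    return name.lower().strip().replace(" ", "").replace("-", "")


def resolve_package(app_name, name_to_packages, installed_packages):
    """Return the first installed package matching app_name, or None."""
    candidates = name_to_packages.get(normalize_name(app_name), [])
    for package in candidates:
        if package in installed_packages:
            return package
    return None


if __name__ == "__main__":
    # Stand-in for NAME_PACKAGE_DICT with two entries taken from the list above.
    demo_mapping = {
        "ctrip": ["ctrip.android.view"],
        "douyin": ["com.ss.android.ugc.aweme", "com.ss.android.ugc.aweme.lite"],
    }
    print(resolve_package("Ctrip", demo_mapping, ["ctrip.android.view"]))
```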
  6. Fully automated flow

    import os
    import uuid
    import dashscope
    import time
    import shutil
    import json
    from PIL import Image
    
    if __name__ == '__main__':
    
        add_info = ''
        ## Specify an app that is in the app list and installed locally
        # instruction = 'Book a high-speed train ticket from Shanghai to Beijing for the day after tomorrow on Ctrip'
        # instruction = "Comment on today's NBA game in Hupu"
        # instruction = "Check tomorrow's flights for me on Umetrip"
        # instruction = 'Look at past exam papers in Yuanfudao'
    
        ## Specify an app that is in the app list but not installed locally
        # instruction = "Book a ticket for Jay Chou's concert on Maoyan"
        # instruction = 'Buy a pomelo in Xiaoxiang Supermarket'
        # instruction = 'Watch a live stream on Huya Live'
        # instruction = 'Play a song by Vae Xu on QQ Music'
    
        ## Do not specify an app, but it is installed locally
        # instruction = 'Navigate to the village'
        # instruction = 'Order a coffee for delivery'
        # instruction = 'Play a song by Vae Xu'
        instruction = 'Help me book a train ticket'
    
        ## Do not specify an app, and it is not installed locally
        # instruction = "Check today's Shanghai Composite Index in a stock trading app"
        # instruction = "Help me send a message to my wife: I won't be home for dinner tomorrow night."
    
        history = []
        session_id = str(uuid.uuid4())
        print('session_id ', session_id)
        max_step = 50
    
        model_name = 'gui-plus-2026-02-26'
        dashscope.api_key = os.getenv("DASHSCOPE_API_KEY", None)
        # Avoid printing the key itself; only report whether it is configured
        print("DashScope API key configured:", bool(dashscope.api_key))
        dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
        dashscope.base_websocket_api_url = 'wss://dashscope.aliyuncs.com/api-ws/v1/inference'
    
        ## Note: You need to enter your own adb path
        adb_tools = AdbTools(adb_path="xxx/adb")
        # package_name_list = adb_tools.get_package_name()
        # adb_tools.home()
        # time.sleep(1)
        task_dir = instruction
        anno_dir = f"{instruction}_anno"
    
        if os.path.exists(task_dir):
            shutil.rmtree(task_dir)
        os.mkdir(task_dir)
    
        if os.path.exists(anno_dir):
            shutil.rmtree(anno_dir)
        os.mkdir(anno_dir)
    
        ## {"image": image, "output": model output}
        history = []
        open_app_retry = False
        # max_step = 1
        for step_id in range(max_step):
            print(f'\nSTEP {step_id}:\n------------------------------------')
            screen_shot = os.path.join(task_dir, f'screen_shot_{step_id}.png')
            adb_tools.get_screenshot(screen_shot)
    
            width, height = Image.open(screen_shot).size
            # The fourth argument is the system prompt, not the model name
            messages = get_messages(screen_shot, instruction, history, mobile_system_prompt)
    
            retry_time = 3
            for _ in range(retry_time):
                response = dashscope.MultiModalConversation.call(
                                                    model=model_name,
                                                    messages=messages,
                                                    vl_high_resolution_images=True,
                                                    headers={"x-dashscope-gui-session-id": session_id},
                                                    enable_thinking=False,
                                                    stream=False)
                print(response['request_id'])
                try:
                    output_text = response.output.choices[0].message.content[0]['text']
                    break
                except Exception as e:
                    print(response)
                    print(e)
            else:
                raise Exception('retry_time out')
    
            thought = response.output.choices[0].message.reasoning_content
            if thought:  # reasoning_content may be empty or None in non-thinking mode
                output_text = f"<think>\n{thought}\n</think>{output_text}"
            action = json.loads(output_text.split('<tool_call>\n')[1].split('}}\n')[0] + '}}\n')
            conclusion = output_text.split('<tool_call>')[0].strip()
    
            action_parameter = action['arguments']
            dummy_image = Image.open(screen_shot)
            resized_height, resized_width = smart_resize(dummy_image.height,
                                                         dummy_image.width,
                                                         factor=16,
                                                         min_pixels=3136,
                                                         max_pixels=1003520*200,
                                                         )
            for key in ['coordinate', 'coordinate1', 'coordinate2']:
                if key in action_parameter:
                    action_parameter[key][0] = int(action_parameter[key][0]/1000 * resized_width)
                    action_parameter[key][1] = int(action_parameter[key][1]/1000 * resized_height)
    
            print(output_text)
            action_type = action_parameter['action']
            if action_type == 'click':
                adb_tools.click(action_parameter['coordinate'][0],
                                action_parameter['coordinate'][1])
            elif action_type == 'long_press':
                adb_tools.long_press(action_parameter['coordinate'][0],
                                     action_parameter['coordinate'][1])
            elif action_type == 'type':
                adb_tools.type(action_parameter['text'])
            elif action_type in ['scroll', 'swipe']:
                adb_tools.slide(action_parameter['coordinate'][0],
                                action_parameter['coordinate'][1],
                                action_parameter['coordinate2'][0],
                                action_parameter['coordinate2'][1])
            elif action_type == 'system_button':
                system = action_parameter['button']
                if system == 'Back':
                    adb_tools.back()
                elif system == 'Home':
                    adb_tools.home()
            elif action_type == 'wait':
                time.sleep(2)
            elif action_type == 'terminate':
                print(f'Action completed')
                break
            elif action_type == 'open':
                app_name = action_parameter['text']
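                # Keys in NAME_PACKAGE_DICT are normalized, so normalize the model's app name before lookup
                package_name = NAME_PACKAGE_DICT.get(normalize_package_name(app_name), [])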
                package_name_list = adb_tools.get_package_name()
                output_app_name = ''
                if app_name != '':
                    output_app_name = app_name
                for sub_package_name in package_name:
                    if sub_package_name in package_name_list:
                        adb_tools.open_app(sub_package_name)
                        break
                else:
                    input(f"Please install the relevant APP {output_app_name}")
                    continue
    
            elif action_type == 'answer':
                print(f'Answer: {conclusion}\nAction completed')
                break
            elif action_type in ['call_user', 'calluser', 'interact']:
                text = action_parameter['text']
                input(f"Please complete the related action for {text}")
                print("Action completed, continuing execution")
                pass
            else:
                raise Exception(f"mobile-e2e action_type not supported {action}")
    
            history.append({'output': output_text,
                            'image': screen_shot})
            show_screenshot(screen_shot, action_parameter, f"{anno_dir}/screenshot_anno_{step_id}.png")
            time.sleep(2)
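The flow above dispatches only the Back and Home values of system_button, although the tool schema also lists Menu and Enter. One way to cover all four is a lookup table of Android key codes, sketched below. The mapping of Menu to KEYCODE_APP_SWITCH is an assumption based on the schema's description of Menu as "opening the application background menu". The names SYSTEM_BUTTON_KEYCODES and system_button_command are hypothetical.

```python
SYSTEM_BUTTON_KEYCODES = {
    "Back": 4,    # KEYCODE_BACK
    "Home": 3,    # KEYCODE_HOME
    "Menu": 187,  # KEYCODE_APP_SWITCH (assumption: "background menu" means recent apps)
    "Enter": 66,  # KEYCODE_ENTER
}


def system_button_command(button):
    """Return the adb sub-command string for a system_button action."""
    if button not in SYSTEM_BUTTON_KEYCODES:
        raise ValueError(f"Unsupported system button: {button}")
    return f"shell input keyevent {SYSTEM_BUTTON_KEYCODES[button]}"


if __name__ == "__main__":
    print(system_button_command("Enter"))
```

The returned string could then be passed to the existing adb_shell method, for example adb_tools.adb_shell(system_button_command(action_parameter['button'])).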

Complete code example for mobile

import os
import re
import json
import math
import time
import uuid
import shutil
import subprocess
import dashscope
from PIL import Image
from datetime import datetime


# ===================== Step 1: System Prompt =====================

mobile_system_prompt = '''# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name_for_human": "mobile_use", "name": "mobile_use", "description": "Use a touchscreen to interact with a mobile device, and take screenshots.\\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Perform a key event on the mobile device.\\n    - This supports adb's `keyevent` syntax.\\n    - Examples: \\"volume_up\\", \\"volume_down\\", \\"power\\", \\"camera\\", \\"clear\\".\\n* `click`: Click the point on the screen with coordinate (x, y).\\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).\\n* `type`: Input the specified text into the activated input box.\\n* `system_button`: Press the system button.\\n* `open`: Open an app on the device.\\n* `wait`: Wait specified seconds for the change to happen.\\n* `answer`: Terminate the current task and output the answer.\\n* `interact`: Resolve the blocking window by interacting with the user.\\n* `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "click", "long_press", "swipe", "type", "system_button", "open", "wait", "answer", "interact", "terminate"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.", "type": "array"}, "coordinate2": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.", "type": "array"}, "text": {"description": "Required only by `action=key`, `action=type`, `action=open`, `action=answer`,and `action=interact`.", "type": "string"}, "time": {"description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.", "type": "number"}, "button": {"description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`", "enum": ["Back", "Home", "Menu", "Enter"], "type": "string"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call.'''


# ===================== Step 2: Construct multi-turn conversation messages =====================

def get_messages(image, instruction, history_output, system_prompt):
    history_n = 4
    current_step = len(history_output)
    history_start_idx = max(0, current_step - history_n)

    previous_actions = []
    for i in range(history_start_idx):
        if i < len(history_output):
            history_output_str = history_output[i]['output']
            if 'Action:' in history_output_str and '<tool_call>' in history_output_str:
                history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
            previous_actions.append(f"Step {i + 1}: {history_output_str}")

    previous_actions_str = "\n".join(previous_actions) if previous_actions else "None"

    # Add background information
    today = datetime.today()
    weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    weekday = weekday_names[today.weekday()]
    formatted_date = today.strftime("%Y-%m-%d") + " " + weekday
    ground_info = f'''Today's date is: {formatted_date}.'''

    instruction_prompt = f"""
Please generate the next move according to the UI screenshot, instruction and previous actions.

Instruction: {ground_info}{instruction}

Previous actions:
{previous_actions_str}"""

    messages = [{"role": "system", "content": [{"text": system_prompt}]}]

    history_len = min(history_n, len(history_output))
    if history_len > 0:
        for history_id, history_item in enumerate(history_output[-history_n:], 0):
            if history_id == 0:
                messages.append({
                    "role": "user",
                    "content": [
                        {"text": instruction_prompt},
                        {"image": "file://" + history_item['image']}
                    ]
                })
            else:
                messages.append({
                    "role": "user",
                    "content": [{"image": "file://" + history_item['image']}]
                })
            messages.append({
                "role": "assistant",
                "content": [{"text": history_item['output']}],
            })
        messages.append({
            "role": "user",
            "content": [{"image": "file://" + image}]
        })
    else:
        messages.append({
            "role": "user",
            "content": [
                {"text": instruction_prompt},
                {"image": "file://" + image}
            ]
        })
    return messages


# ===================== Step 3: Parse model output and map coordinates =====================

def extract_tool_calls(text):
    pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE)
    blocks = pattern.findall(text)
    actions = []
    for blk in blocks:
        blk = blk.strip()
        try:
            actions.append(json.loads(blk))
        except json.JSONDecodeError as e:
            print(f'Parsing failed: {e} | Snippet: {blk[:80]}...')
    return actions


def smart_resize(height, width, factor=32, min_pixels=32*32*4, max_pixels=32*32*1280, max_long_side=8192):
    def round_by_factor(number, factor):
        return round(number / factor) * factor
    def ceil_by_factor(number, factor):
        return math.ceil(number / factor) * factor
    def floor_by_factor(number, factor):
        return math.floor(number / factor) * factor

    if height < 2 or width < 2:
        raise ValueError(f"height:{height} and width:{width} must both be at least 2 pixels")
    if max(height, width) > max_long_side:
        beta = max(height, width) / max_long_side
        height, width = int(height / beta), int(width / beta)

    h_bar = round_by_factor(height, factor)
    w_bar = round_by_factor(width, factor)

    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar


# ===================== Step 4: ADB operation utility class =====================

class AdbTools:
    def __init__(self, adb_path, device=None):
        self.adb_path = adb_path
        self.device = device
        self.__device_str__ = f" -s {device} " if device is not None else ' '
        self.image_info = None

    def adb_shell(self, command):
        command = self.adb_path + self.__device_str__ + command
        subprocess.run(command, capture_output=True, text=True, shell=True)

    def load_image_info(self, path):
        width, height = Image.open(path).size
        self.image_info = (width, height)

    def get_screenshot(self, image_path, retry_times=3):
        command = self.adb_path + (f" -s {self.device}" if self.device is not None else '') + f" exec-out screencap -p > {image_path}"
        for i in range(retry_times):
            subprocess.run(command, capture_output=True, text=True, shell=True)
            if os.path.exists(image_path):
                self.load_image_info(image_path)
                return True
            else:
                time.sleep(0.1)
        return False

    def click(self, x, y, coordinate_size=None):
        command = self.adb_path + self.__device_str__ + f" shell input tap {x} {y}"
        subprocess.run(command, capture_output=True, text=True, shell=True)

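    def long_press(self, x, y, duration=800):
        # A zero-distance swipe held for `duration` milliseconds acts as a long press.
        # The parameter is not named `time` so that it does not shadow the time module.
        command = self.adb_path + self.__device_str__ + f" shell input swipe {x} {y} {x} {y} {duration}"
        subprocess.run(command, capture_output=True, text=True, shell=True)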

    def slide(self, x1, y1, x2, y2, coordinate_size=None, slide_time=800):
        command = self.adb_path + self.__device_str__ + f" shell input swipe {x1} {y1} {x2} {y2} {slide_time}"
        subprocess.run(command, capture_output=True, text=True, shell=True)

    def back(self):
        command = self.adb_path + self.__device_str__ + " shell input keyevent 4"  # 4 = KEYCODE_BACK
        subprocess.run(command, capture_output=True, text=True, shell=True)

    def home(self):
        command = self.adb_path + self.__device_str__ + f" shell am start -a android.intent.action.MAIN -c android.intent.category.HOME"
        subprocess.run(command, capture_output=True, text=True, shell=True)

    def type(self, text):
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        command_list = [
            f"shell ime enable com.android.adbkeyboard/.AdbIME ",
            f"shell ime set com.android.adbkeyboard/.AdbIME ",
            0.1,
            f'shell am broadcast -a ADB_INPUT_TEXT --es msg "{escaped_text}" ',
            0.1,
            f"shell ime disable com.android.adbkeyboard/.AdbIME"
        ]
        for command in command_list:
            if isinstance(command, float):
                time.sleep(command)
            elif isinstance(command, str):
                subprocess.run(self.adb_path + self.__device_str__ + command.strip(), capture_output=True, text=True, shell=True)

    def get_package_name(self, all_packages=False):
        try:
            if all_packages:
                command = self.adb_path + self.__device_str__ + " shell pm list packages"
            else:
                command = self.adb_path + self.__device_str__ + " shell pm list packages -3"
            res = subprocess.run(command, capture_output=True, text=True, shell=True)
            pkgs = []
            for line in res.stdout.splitlines():
                s = line.strip()
                if not s:
                    continue
                if s.startswith("package:"):
                    s = s[len("package:"):]
                if "=" in s:
                    _, s = s.split("=", 1)
                if s:
                    pkgs.append(s)
            return sorted(set(pkgs))
        except Exception as e:
            print(e)
            return []

    def open_app(self, package_name):
        command = self.adb_path + self.__device_str__ + f" shell monkey -p {package_name} -c android.intent.category.LAUNCHER 1"
        subprocess.run(command, capture_output=True, text=True, shell=True)


# ===================== Step 5: Application package name mapping =====================

package_str_list = '''com.tencent.mm\tWeChat\twechat\t\t\t
com.tencent.mobileqq\tqq\tTencent QQ\t\t\t
com.sina.weibo\tWeibo\t\t\t\t
com.taobao.taobao\tTaobao\t\t\t\t
com.jingdong.app.mall\tJD.com\tJD Now\t\t\t
com.xunmeng.pinduoduo\tPinduoduo\t\t\t\t
com.xingin.xhs\tXiaohongshu\t\t\t\t
com.douban.frodo\tDouban\t\t\t\t
com.zhihu.android\tZhihu\t\t\t\t
com.autonavi.minimap\tAMAP\tGaode\t\t\t
com.baidu.BaiduMap\tBaidu Maps\t\t\t\t
com.sankuai.meituan.takeoutnew\tMeituan Waimai\t\t\t\t
com.sankuai.meituan\tMeituan\tMeituan Waimai\t\t\t
com.dianping.v1\tDazhong Dianping\tDianping\t\t\t
me.ele\tEle.me\tTaobao Flash Purchase\t\t\t
com.yek.android.kfc.activitys\tKFC\t\t\t\t
ctrip.android.view\tCtrip\tCtrip Travel\t\t\t
com.MobileTicket\tChina Railway 12306\t12306\t\t\t
com.Qunar\tQunar Travel\tQunar.com\tQunar\t\t
com.sdu.didi.psnger\tDiDi Chuxing\tDiDi\t\t\t
tv.danmaku.bili\tbilibili\tB Site\tBilibili\tB Site\tbili
com.ss.android.ugc.aweme\tDouyin\t\t\t\t
com.smile.gifmaker\tKuaishou\t\t\t\t
com.tencent.qqlive\tTencent Video\t\t\t\t
com.qiyi.video\tiQIYI\t\t\t\t
com.youku.phone\tYouku\tYouku Video\t\t\t
com.hunantv.imgo.activity\tMango TV\tMango\t\t\t
com.phoenix.read\tHongguo Short Video\tHongguo\t\t\t
com.netease.cloudmusic\tNetEase Cloud Music\tNetEase Cloud\t\t\t
com.tencent.qqmusic\tQQ Music\t\t\t\t
com.luna.music\tQishui Music\t\t\t\t
com.ximalaya.ting.android\tXimalaya\t\t\t\t
com.dragon.read\tFanqie Free Novels\tFanqie Novels\t\t\t
com.kmxs.reader\tQimao Free Novels\t\t\t\t
com.ss.android.lark\tLark\t\t\t\t
com.tencent.androidqqmail\tQQ Mailbox\t\t\t\t
com.larus.nova\tDoubao\tDoubao\t\t\t
com.gotokeep.keep\tKeep\t\t\t\t
com.lingan.seeyou\tMeiyou\t\t\t\t
com.tencent.news\tTencent News\t\t\t\t
com.ss.android.article.news\tToutiao\t\t\t\t
com.lianjia.beike\tBeike Zhaofang\t\t\t\t
com.anjuke.android.app\tAnjuke\t\t\t\t
com.hexin.plat.android\tTonghuashun\t\t\t\t
com.miHoYo.hkrpg\tHonkai: Star Rail\tHonkai\t\t\t
com.papegames.lysk.cn\tLove and Deepspace\t\t\t\t'''

PACKAGES_NAME_DICT = {}
NAME_PACKAGE_DICT = {}

def normalize_package_name(name):
    name = name.lower().strip().replace(" ", "").replace("-", "")
    return name

# package_str_list contains real newline and tab characters, so split on "\n" and "\t"
for package_str in package_str_list.split("\n"):
    package_name = package_str.strip().split("\t")
    if len(package_name) < 2:
        continue
    package, *app_names = package_name
    PACKAGES_NAME_DICT[package] = [normalize_package_name(n) for n in app_names if n.strip()]
    for app_name in app_names:
        if not app_name.strip():
            continue
        normalized_name = normalize_package_name(app_name)
        if normalized_name not in NAME_PACKAGE_DICT:
            NAME_PACKAGE_DICT[normalized_name] = []
        NAME_PACKAGE_DICT[normalized_name].append(package)


# ===================== Step 6: Complete automation flow =====================

if __name__ == '__main__':
    instruction = 'Help me book a train ticket'  # Can be changed to other tasks
    history = []
    session_id = str(uuid.uuid4())
    print('session_id ', session_id)
    max_step = 50

    model_name = 'gui-plus-2026-02-26'
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
    dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

    # Note: You need to enter your own adb path
    adb_tools = AdbTools(adb_path="/path/to/adb")  # Change to your actual path
    # adb_tools.home()
    # time.sleep(1)

    task_dir = instruction
    if os.path.exists(task_dir):
        shutil.rmtree(task_dir)
    os.mkdir(task_dir)

    print(f"[Task] {instruction}")
    print("=" * 60)

    for step_id in range(max_step):
        print(f'\n[Step {step_id + 1}]')

        screen_shot = os.path.join(task_dir, f'screen_shot_{step_id}.png')
        adb_tools.get_screenshot(screen_shot)

        messages = get_messages(screen_shot, instruction, history, mobile_system_prompt)

        retry_time = 3
        for _ in range(retry_time):
            response = dashscope.MultiModalConversation.call(
                model=model_name,
                messages=messages,
                vl_high_resolution_images=True,
                headers={"x-dashscope-gui-session-id": session_id},
                stream=False
            )
            print(response['request_id'])
            try:
                output_text = response.output.choices[0].message.content[0]['text']
                break
            except Exception as e:
                print(response)
                print(e)
        else:
            raise RuntimeError(f'Model call failed after {retry_time} attempts')
        print(f"[Model output]\n{output_text}\n")

        actions = extract_tool_calls(output_text)
        if not actions:
            print("No valid operation extracted")
            break

        action = actions[0]
        action_parameter = action['arguments']

        dummy_image = Image.open(screen_shot)
        resized_height, resized_width = smart_resize(
            dummy_image.height, dummy_image.width,
            factor=16, min_pixels=3136, max_pixels=1003520*200
        )

        for key in ['coordinate', 'coordinate1', 'coordinate2']:
            if key in action_parameter:
                action_parameter[key][0] = int(action_parameter[key][0] / 1000 * resized_width)
                action_parameter[key][1] = int(action_parameter[key][1] / 1000 * resized_height)

        action_type = action_parameter['action']
        if action_type == 'click':
            adb_tools.click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
            print(f"✓ Click ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")
        elif action_type == 'long_press':
            adb_tools.long_press(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
            print(f"✓ Long press")
        elif action_type == 'type':
            adb_tools.type(action_parameter['text'])
            print(f"✓ Type text: {action_parameter['text']}")
        elif action_type in ['scroll', 'swipe']:
            adb_tools.slide(action_parameter['coordinate'][0], action_parameter['coordinate'][1],
                           action_parameter['coordinate2'][0], action_parameter['coordinate2'][1])
            print(f"✓ Swipe")
        elif action_type == 'system_button':
            system = action_parameter['button']
            if system == 'Back':
                adb_tools.back()
                print(f"✓ Back")
            elif system == 'Home':
                adb_tools.home()
                print(f"✓ Home")
        elif action_type == 'wait':
            time.sleep(2)
            print(f"✓ Wait")
        elif action_type == 'terminate':
            print(f'✓ Task complete')
            break
        elif action_type == 'open':
            app_name = action_parameter['text']
            normalized_name = normalize_package_name(app_name)
            package_name = NAME_PACKAGE_DICT.get(normalized_name, [])
            package_name_list = adb_tools.get_package_name()
            for sub_package_name in package_name:
                if sub_package_name in package_name_list:
                    adb_tools.open_app(sub_package_name)
                    print(f"✓ Open app: {app_name}")
                    break
            else:
                print(f"⚠ App not found: {app_name}. Please install it manually.")
                continue
        elif action_type == 'answer':
            print(f'✓ Task complete: {action_parameter.get("text", "")}')
            break
        elif action_type in ['call_user', 'calluser', 'interact']:
            print(f"⚠ User interaction required: {action_parameter.get('text', '')}")
            input("Press Enter to continue after completing the action...")
        else:
            print(f"✗ Unknown action type: {action_type}")

        history.append({'output': output_text, 'image': screen_shot})
        time.sleep(2)

    print("\n" + "=" * 60)
    print(f"[Complete] Executed {len(history)} steps in total")
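
The automation loop above treats the model's coordinates as values on a 0 to 1000 scale and rescales them to pixels before calling ADB. The following is a minimal sketch of that arithmetic in isolation. The helper name `to_pixels` and the 1080 x 2400 size are assumptions chosen for illustration and are not part of the original example.

```python
def to_pixels(coordinate, width, height, scale=1000):
    """Map an [x, y] pair on a 0..scale grid to integer pixel positions."""
    x, y = coordinate
    # Same formula as the loop above: value / 1000 * dimension, truncated to int
    return int(x / scale * width), int(y / scale * height)


# Hypothetical 1080 x 2400 screenshot: the model points at the horizontal center,
# one quarter of the way down the screen.
print(to_pixels([500, 250], 1080, 2400))  # (540, 600)
```

Keeping the conversion in one small function makes it easier to verify the mapping on its own before sending taps to a real device.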

Browser GUI tasks

Browser automation uses Playwright to control the browser. The example applies Set-of-Mark (SoM) annotation, which adds a numeric label to each page element in the screenshot. The model then refers to these labels to target web elements precisely.
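
To illustrate how a numeric label can be turned into a click position, the following minimal sketch looks up a label in a list of annotated elements and returns the center of its bounding box. The entry layout (`label` plus a `bbox` of x, y, width, height) and the helper name `label_to_center` are assumptions made for this sketch. The SoM module in the complete example may store elements differently.

```python
from typing import Dict, List, Tuple, Union


def label_to_center(som_list: List[Dict], label: Union[int, str]) -> Tuple[int, int]:
    """Return the center point of the element whose SoM label matches `label`.

    Hypothetical entry layout: {"label": 3, "bbox": (x, y, width, height)}.
    """
    for element in som_list:
        # Compare as strings because the model may return the label as 3 or "3"
        if str(element["label"]) == str(label):
            x, y, width, height = element["bbox"]
            return x + width // 2, y + height // 2
    raise KeyError(f"No element with label {label!r}")


som_list = [
    {"label": 0, "bbox": (10, 10, 200, 40)},
    {"label": 3, "bbox": (100, 200, 80, 40)},
]
print(label_to_center(som_list, "3"))  # (140, 220)
```

Clicking the center of the box, rather than a corner, matches the system prompt's advice to avoid element edges.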

Environment setup:

  1. Install dependencies: pip install playwright pillow dashscope playwright-stealth termcolor

  2. Install the Playwright browser: playwright install chromium
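
Before running the example, it can help to confirm that the packages are importable. The following is a minimal sketch that uses only the standard library. The helper name `missing_modules` is hypothetical. The import names assume the usual mapping from pip package to module (`pillow` to `PIL`, `playwright-stealth` to `playwright_stealth`).

```python
import importlib.util

# Import names, which differ from some pip package names
REQUIRED_MODULES = ["playwright", "PIL", "dashscope", "playwright_stealth", "termcolor"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

This check only confirms that the Python packages are installed. It does not verify that the Chromium browser binary has been downloaded.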

The complete code example below constructs multi-turn conversation messages, parses tool calls from the model output, and executes browser GUI operations.

Complete code example for browser

  1. Construct the System Prompt

    import json, os, sys, base64  # sys is required by the sys.platform checks in PlaywrightComputer
    import dashscope, time, math
    import uuid
    from PIL import Image, ImageDraw, ImageTk
    import asyncio
    from playwright.async_api import async_playwright, Page, Browser, BrowserContext
    from datetime import datetime
    
    # Define the action display function
    def show_screenshot(image_path, action_parameter, save_path="screenshot_anno.png"):
        image = Image.open(image_path)
        draw = ImageDraw.Draw(image)
        if 'coordinate' in action_parameter:
            radius = 15
            center_x, center_y = action_parameter['coordinate'][0], action_parameter['coordinate'][1]
            draw.ellipse((center_x - radius, center_y - radius, center_x + radius, center_y + radius), 
                         fill="red", outline="red")
        elif 'coordinate1' in action_parameter and 'coordinate2' in action_parameter:
            x1, y1 = action_parameter['coordinate1'][0], action_parameter['coordinate1'][1]
            x2, y2 = action_parameter['coordinate2'][0], action_parameter['coordinate2'][1]
            arrow_size = 10
            color = 'red'
            draw.line((x1, y1, x2, y2), fill=color, width=2)
            angle = math.atan2(y2 - y1, x2 - x1)
            arrow_x1 = x2 - arrow_size * math.cos(angle - math.pi / 6)
            arrow_y1 = y2 - arrow_size * math.sin(angle - math.pi / 6)
            arrow_x2 = x2 - arrow_size * math.cos(angle + math.pi / 6)
            arrow_y2 = y2 - arrow_size * math.sin(angle + math.pi / 6)
            draw.polygon([(x2, y2), (arrow_x1, arrow_y1), (arrow_x2, arrow_y2)], fill=color)
        else:
            return None
        image.save(save_path)
        return save_path
    
    # Prompt assembly function
    def get_messages(image, env_state, instruction, history_output):
        history_n = 1
        current_step = len(history_output)
        history_start_idx = max(0, current_step - history_n)
        SoM_format_ele_text = env_state["SoM"]["format_ele_text"]
    
        system_prompt = '''# Tools
    
    You may call one or more functions to assist with the user query.
    
    You are provided with function signatures within <tools></tools> XML tags:
    <tools>
    {"type": "function", "function": {"name_for_human": "browser_use", "name": "browser_use", "description": "Use a browser to interact with web pages and take labeled screenshots.\\n* This is an interface to a web browser. You can click elements, type into inputs, scroll, wait for loading, go back, etc.\\n* Each Observation screenshot contains Numerical Labels placed at the TOP LEFT of each Web Element. Use these labels to target elements.\\n* Some pages may take time to load; you may need to wait and take successive screenshots.\\n* Avoid clicking near element edges; target the center of the element.\\n* Execute exactly ONE interaction action per step; do not chain multiple interactions in one call.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `click`: Click a web element by numerical label.\\n* `type`: Clear existing content in a textbox/input and type content. The system will automatically press ENTER after typing.\\n* `scroll`: Scroll within WINDOW or within a specific scrollable element/area (by label).\\n* `select`: Selects a specific option from a menu or dropdown. Use the option text provided in the textual information.\\n* `wait`: Wait for page processes to finish (default 5 seconds unless specified).\\n* `go_back`: Go back to the previous page.\\n* `wikipedia`: Directly jump to the Wikipedia homepage to search for information.\\n* `answer`: Terminate the current task and output the final answer.", "enum": ["click", "type", "scroll", "select", "wait", "go_back", "wikipedia", "answer"], "type": "string"}, "label": {"description": "Numerical label of the target web element. Required only by `action=click`, `action=type`, `action=scroll`, and `action=select` when scrolling within a specific area. Use string value `WINDOW` to scroll the whole page.", "type": ["integer", "string"]}, "direction": {"description": "Scroll direction. 
Required only by `action=scroll`.", "enum": ["up", "down"], "type": "string"}, "text": {"description": "Required only by `action=type` and `action=answer`.", "type": "string"}, "option": {"description": "The option to select. Required only by `action=select`", "type": "string"}, "time": {"description": "The seconds to wait. Required only by `action=wait` when overriding the default.", "type": "integer"}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}
    </tools>
    
    For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
    <tool_call>
    {"name": <function-name>, "arguments": <args-json-object>}
    </tool_call>
    
    # Response format
    
    Response format for every step:
    1) Action: a short imperative describing what to do in the UI.
    2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
    
    Rules:
    - Output exactly in the order: Action, <tool_call>.
    - Be brief: one line for Action.
    - Do not output anything else outside those two parts.
    - Execute ONLY ONE interaction per iteration (one tool call).
    - If finishing, use action=answer in the tool call.'''
    
        today = datetime.today()
        weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
        weekday = weekday_names[today.weekday()]
        formatted_date = today.strftime("%Y-%m-%d") + " " + weekday
        day_info = f'''Today's date is: {formatted_date}.'''
    
        previous_actions = []
        for i in range(history_start_idx):
            if i < len(history_output):
                history_output_str = history_output[i]['output']
                if 'Action:' in history_output_str and '<tool_call>' in history_output_str:
                    history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
                previous_actions.append(f"Step {i + 1}: {history_output_str}")
    
        previous_actions_str = (
            "\n".join(previous_actions) if previous_actions else "None"
        )
    
        instruction_prompt = f'''\nPlease generate the next move according to the UI screenshot, instruction and previous actions.
    
    Instruction: {day_info}{instruction}
    
    Previous actions:
    {previous_actions_str}'''
    
        messages = [
            {
                "role": "system",
                "content": [
                    {"text": system_prompt}
                ],
            }
        ]
        history_len = min(history_n, len(history_output))
        if history_len > 0:
            for history_id, history_item in enumerate(history_output[-history_n:], 0):
                if history_id == 0:
                    messages.append({
                        "role": "user",
                        "content": [
                            {"text": instruction_prompt},
                            {"text": "Current screenshot:"},
                            {"image": "file://" + history_item['image']},
                        ]
                    })
                else:
                    messages.append({
                        "role": "user",
                        "content": [
                            {"text": "Current screenshot:"},
                            {"image": "file://" + history_item['image']}
                        ]
                    })
                messages.append({
                    "role": "assistant",
                    "content": [
                        {"text": history_item['output']},
                    ]
                })
            messages.append({
                "role": "user",
                "content": [
                    {"image": "file://" +image},
                    {"text": SoM_format_ele_text}
                ]
            })
        else:
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {
                            "text": instruction_prompt
                        },
                        {
                            "image": "file://" +image,
                        },
                        {"text": SoM_format_ele_text}
                    ],
                }
            )
    
        return messages
    
    
    def extract_tool_calls(text: str):
        """Extract all JSON objects between <tool_call> ... </tool_call> from the text"""
        import re, ast
        pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE)
        blocks = pattern.findall(text)
        
        actions = []
        for blk in blocks:
            blk = blk.strip()
            try:
                # Parse as JSON first so that true/false/null are handled correctly
                actions.append(json.loads(blk))
            except json.JSONDecodeError:
                # Fall back to Python literal syntax (for example, single-quoted keys)
                try:
                    actions.append(ast.literal_eval(blk))
                except (ValueError, SyntaxError) as e:
                    print(f'⚠️ Parsing failed: {e} | Snippet: {blk[:80]}...')
        return actions
  2. Browser control

    PLAYWRIGHT_KEY_MAP = {
        "backspace": "Backspace", "tab": "Tab", "return": "Enter", "enter": "Enter",
        "shift": "Shift", "control": "ControlOrMeta", "alt": "Alt", "escape": "Escape",
        "space": "Space", "pageup": "PageUp", "pagedown": "PageDown", "end": "End",
        "home": "Home", "left": "ArrowLeft", "up": "ArrowUp", "right": "ArrowRight",
        "down": "ArrowDown", "insert": "Insert", "delete": "Delete", "semicolon": ";",
        "equals": "=", "multiply": "Multiply", "add": "Add", "separator": "Separator",
        "subtract": "Subtract", "decimal": "Decimal", "divide": "Divide",
        "f1": "F1", "f2": "F2", "f3": "F3", "f4": "F4", "f5": "F5", "f6": "F6",
        "f7": "F7", "f8": "F8", "f9": "F9", "f10": "F10", "f11": "F11", "f12": "F12",
        "command": "Meta",
    }
    
    class PlaywrightComputer:
        """Async Playwright wrapper for web agent"""
    
        def __init__(self, initial_url="https://baidu.com/", task_dir='data', highlight_mouse=False):
            self._initial_url = initial_url
            self._search_engine_url = "https://baike.baidu.com/"
            self.task_dir = task_dir
            self._highlight_mouse = highlight_mouse
            
            self._playwright = None
            self._browser: Browser | None = None
            self._context: BrowserContext | None = None
            self._page: Page | None = None
            self._storage_state_path = "storage_state.json"
    
        async def _handle_new_page(self, new_page: Page):
            """Only keep one tab: redirect new tab url into current page"""
            new_url = new_page.url
            await new_page.close()
            try:
                await self._page.goto(new_url)
            except Exception as e:
                if "interrupted by another navigation" in str(e):
                    pass
                else:
                    raise
    
        async def reset(self):
            print("Creating session...")
            self._playwright = await async_playwright().start()
            self._browser = await self._playwright.chromium.launch(
                args=["--start-maximized", "--disable-blink-features=AutomationControlled"],
                headless=False,
            )
    
            storage_state = None
            if os.path.exists(self._storage_state_path):
                print("Loading storage state")
                storage_state = self._storage_state_path
    
            if sys.platform == "darwin":
                user_agent = (
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/143.0.0.0 Safari/537.36 Edg/143.0.0.0"
                )
                import tkinter as tk
                root = tk.Tk()
                root.withdraw()
                width = root.winfo_screenwidth()
                height = root.winfo_screenheight()
                print(f"Screen size: {width}x{height}", flush=True)
                root.destroy()
                self._context = await self._browser.new_context(
                    no_viewport=True,
                    user_agent=user_agent,
                    locale="en-US",
                    timezone_id="America/New_York",
                    storage_state=storage_state,
                )
            else:
                user_agent = (
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/122.0.0.0 Safari/537.36"
                )
                self._context = await self._browser.new_context(
                    no_viewport=True,
                    user_agent=user_agent,
                    locale="en-US",
                    timezone_id="America/New_York",
                    storage_state=storage_state,
                )
    
            self._page = await self._context.new_page()
            await self._page.goto(self._initial_url, timeout=60000, wait_until="domcontentloaded")
            self._context.on("page", lambda p: asyncio.create_task(self._handle_new_page(p)))
            print("Started local playwright (async).")
            return self
    
        async def close(self):
            if self._context:
                try:
                    await self._context.storage_state(path=self._storage_state_path)
                except Exception as e:
                    print(f"Failed to save storage state: {e}")
    
            if self._context:
                await self._context.close()
            try:
                if self._browser:
                    await self._browser.close()
            except Exception as e:
                if "Browser.close: Connection closed while reading from the driver" in str(e):
                    pass
                else:
                    raise
            if self._playwright:
                await self._playwright.stop()
    
        async def click_at(self, x: int, y: int):
            # Move to the target, then press and release once with a short hold.
            # A separate mouse.click() before this sequence would click the element twice.
            await self._page.mouse.move(x, y)
            await self._page.mouse.down()
            await self._page.wait_for_timeout(100)
            await self._page.mouse.up()
            await self._page.wait_for_load_state()
    
        async def type_text_at(self, x: int, y: int, text: str, press_enter=True, clear_before_typing=True):
            await self.click_at(x, y)
            await asyncio.sleep(0.1)
            await self._page.wait_for_load_state()
    
            if clear_before_typing:
                if sys.platform == "darwin":
                    await self.key_combination(["Command", "A"])
                else:
                    await self.key_combination(["Control", "A"])
                await asyncio.sleep(0.1)
                await self.key_combination(["Delete"])
                await self._page.wait_for_load_state()
    
            await self.click_at(x, y)
            await asyncio.sleep(0.1)
            await self._page.keyboard.type(text)
            await self._page.wait_for_load_state()
    
            if press_enter:
                await self.key_combination(["Enter"])
            await self._page.wait_for_load_state()
    
        async def scroll_at(self, x: int, y: int, direction: str, magnitude: int = 400):
            await self._page.mouse.move(x, y)
            await asyncio.sleep(0.1)
    
            dx = dy = 0
            if direction == "up":
                dy = -magnitude
            elif direction == "down":
                dy = magnitude
            elif direction == "left":
                dx = -magnitude
            elif direction == "right":
                dx = magnitude
            else:
                raise ValueError(f"Unsupported direction: {direction}")
    
            await self._page.mouse.wheel(dx, dy)
            await self._page.wait_for_load_state()
    
        async def go_back(self):
            await self._page.go_back()
            await self._page.wait_for_load_state()
    
        async def navigate(self, url: str, normalize=True):
            normalized_url = url
            if normalize and not normalized_url.startswith(("http://", "https://")):
                normalized_url = "https://" + normalized_url
            await self._page.goto(normalized_url)
            await self._page.wait_for_load_state()
    
        async def key_combination(self, keys: list[str]):
            keys = [PLAYWRIGHT_KEY_MAP.get(k.lower(), k) for k in keys]
            for key in keys[:-1]:
                await self._page.keyboard.down(key)
            await self._page.keyboard.press(keys[-1])
            for key in reversed(keys[:-1]):
                await self._page.keyboard.up(key)
    
        async def current_state(self, it):
            try:
                await self._page.wait_for_load_state("networkidle", timeout=10000)
            except Exception:
                try:
                    await self._page.wait_for_load_state("load", timeout=5000)
                except Exception:
                    # Continue even if the page never reaches the "load" state
                    pass
    
            await asyncio.sleep(1)
    
            os.makedirs(os.path.join(self.task_dir, "trajectory_som"), exist_ok=True)
            os.makedirs(os.path.join(self.task_dir, "trajectory"), exist_ok=True)
            img_path = os.path.join(self.task_dir, f"trajectory_som/screenshot{it}.png")
            img_path_no_box = os.path.join(self.task_dir, f"trajectory/{it}_full_screenshot.png")
    
            SoM_list, format_ele_text = await get_som(self._page, img_path, img_path_no_box)
            width, height = Image.open(img_path).size
            return {
                "img_path": img_path,
                "img_path_no_box": img_path_no_box,
                "SoM": {
                    "SoM_list": SoM_list,
                    "format_ele_text": format_ele_text
                },
                "current_url": self._page.url,
                "width": width,
                "height": height
            }
    
        async def _select(self, x, y, text):
            await self._page.mouse.click(x, y)
            target_text = text
            handle = await self._page.evaluate_handle("""
            ([x,y]) => document.elementFromPoint(x,y)
            """, [x, y])
            
            tag = await handle.evaluate("el => el && el.tagName")
            if tag == "SELECT":
                await handle.evaluate("""(sel, label) => {
                    const opt = [...sel.options].find(o => o.label.trim() === label.trim());
                    if (!opt) throw new Error("Option not found: " + label);
                    sel.value = opt.value;
                    sel.dispatchEvent(new Event('input', {bubbles:true}));
                    sel.dispatchEvent(new Event('change', {bubbles:true}));
                }""", target_text)
            else:
                raise RuntimeError(f"The element at the coordinates is not a SELECT, but a {tag}")
  3. SoM Annotation Module

    import asyncio
    import base64
    import io
    import json
    from typing import Any, Dict, List
    from PIL import Image, ImageDraw, ImageFont
    import random
    
    # Some structural/decorative tags, usually not directly interactive
    STRUCTURAL_TAGS = {
        "html", "head", "meta", "link", "style", "script", "base",
        "title", "body",
        "header", "footer", "main", "nav", "section", "article",
        "aside", "summary", "details",
    }
    
    # Obviously non-interactive ARIA roles
    NON_INTERACTIVE_ROLES = {
        "presentation", "none",
        "img", "banner", "main", "contentinfo",
        "navigation", "region",
    }
    
    def looks_interactive(el: Dict[str, Any]) -> bool:
        tag = (el.get("tag") or "").lower()
        role = (el.get("role") or "").lower()
        typ = (el.get("type") or "").lower()
        text = (el.get("text") or "").strip()
        cls = (el.get("cls") or "").lower()
        id_ = (el.get("id") or "").lower()
        aria_label = (el.get("ariaLabel") or "").strip()
    
        # Native interactive controls
        if tag in {"a", "button", "select", "textarea"}:
            return True
        if tag == "input" and typ not in {"hidden"}:
            return True
    
        # Has a typical interactive role
        if role in {"button", "link", "tab", "menuitem", "option", "switch", "checkbox", "radio", "textbox"}:
            return True
    
        # Has common events or can receive focus
        if any(x in el for x in ("onclick", "onmousedown", "onmouseup")):
            return True
    
        # Has obvious interactive keywords in class / id
        if any(k in cls or k in id_ for k in ("btn", "button", "link", "click", "nav", "tab", "menu")):
            return True
    
        # Has text/aria-label and is a small area
        if (text or aria_label) and tag in {"div", "span", "li"}:
            return True
    
        return False
    
    def is_obviously_non_interactive(el: Dict[str, Any]) -> bool:
        tag = (el.get("tag") or "").lower()
        role = (el.get("role") or "").lower()
        typ = (el.get("type") or "").lower()
        bbox = el.get("bbox") or {}
        w = bbox.get("width") or 0
        h = bbox.get("height") or 0
    
        if tag in STRUCTURAL_TAGS:
            if not (tag == "body" and (el.get("isContentEditable") or "ke-content" in (el.get("cls") or ""))):
                return True
    
        if w * h < 9:
            return True
        # Structural tags are handled above; repeating the check here would undo the editable-body exception.
        if tag in {"video", "audio", "source"}:
            return True
        if tag == "input" and typ == "hidden":
            return True
        if role in NON_INTERACTIVE_ROLES:
            return True
        if not looks_interactive(el):
            return True
    
        return False
    
    JS_COLLECT_ALL_RECURSIVE = r"""(args) => {
      const selector = args[0];
      const maxDepth = args[1];
      const maxPerDoc = args[2];
      const minArea = args[3];
      const includeIframes = args[4];
    
      function oneLine(s){ return String(s||"").replace(/\s+/g," ").trim(); }
    
      function describe(el){
        const tagName = (el.tagName || "").toLowerCase();
        const hrefAttr = el.getAttribute("href") || "";
        const typeAttr = (el.getAttribute("type") || "").toLowerCase();
        const roleAttr = (el.getAttribute("role") || "").toLowerCase();
        const ariaAttr = el.getAttribute("aria-label") || "";
    
        let clickable = true;
        if (tagName == "div") clickable = false;
    
        if (tagName === "a" && hrefAttr) clickable = true;
        if (tagName === "button") clickable = true;
        if (tagName === "input" && typeAttr !== "hidden") clickable = true;
    
        if (["button","link","tab","menuitem","option","switch","checkbox","radio"].includes(roleAttr)) {
            clickable = true;
        }
    
        const isContentEditable = el.isContentEditable || el.getAttribute('contenteditable') === 'true';
        if (isContentEditable) clickable = true;
    
        // An element with an onclick handler is treated as clickable.
        try {
            if (typeof el.onclick === "function") {
                clickable = true;
            }
        } catch(e) {}
    
        const cls = (el.getAttribute("class") || "").toLowerCase();
        if (cls.includes("ke-content") || cls.includes("ke-edit-textarea")) {
            clickable = true;
        }
    
        return {
            tag: tagName,
            id: oneLine(el.getAttribute("id")),
            cls: oneLine(el.getAttribute("class")),
            role: roleAttr,
            name: oneLine(el.getAttribute("name")),
            type: typeAttr,
            href: oneLine(hrefAttr),
            src: oneLine(el.getAttribute("src")),
            ariaLabel: oneLine(ariaAttr),
            text: oneLine(el.textContent).slice(0, 40),
            clickable: clickable,
            isContentEditable: el.isContentEditable || el.getAttribute('contenteditable') === 'true',
        };
      }
    
      const out = [];
      let counter = 0;
    
      function walk(doc, depth, ox, oy, path){
        if (!doc || depth > maxDepth) return;
        const win = doc.defaultView;
        if (!win) return;
    
        const sx = 0;
        const sy = 0;
        const dpr = win.devicePixelRatio || 1;
    
        const nodes = Array.from(doc.querySelectorAll(selector)).slice(0, maxPerDoc);
        for (const el of nodes){
          try {
            const r = el.getBoundingClientRect();
            if (!r) continue;
            const area = r.width * r.height;
            if (area < minArea) continue;
    
            const vw = win.innerWidth || doc.documentElement.clientWidth || 0;
            const vh = win.innerHeight || doc.documentElement.clientHeight || 0;
            if (r.right <= 0 || r.bottom <= 0 || r.left >= vw || r.top >= vh) {
              continue;
            }
    
            const desc = describe(el);
            if (!desc.clickable) continue;
    
            if (!el.checkVisibility({ checkOpacity: true, checkVisibilityCSS: true })) continue;
    
            out.push({
              id: `e_${counter++}`,
              path,
              depth,
              dpr,
              ...describe(el),
              bbox: {
                x: (ox + r.x - sx) * dpr,
                y: (oy + r.y - sy) * dpr,
                width: r.width * dpr,
                height: r.height * dpr
              }
            });
          } catch(e) {}
        }
    
        if (!includeIframes) return;
    
        const iframes = Array.from(doc.querySelectorAll("iframe")).slice(0, maxPerDoc);
        for (let i = 0; i < iframes.length; i++){
          const fr = iframes[i];
          try {
            const rr = fr.getBoundingClientRect();
            const iframeOx = ox + rr.x - sx;
            const iframeOy = oy + rr.y - sy;
            const nextPath = path + `/iframe[${i}]`;
    
            try {
              const childDoc = fr.contentDocument;
              if (!childDoc) {
                out.push({
                  id: `iframe_${counter++}`,
                  path: nextPath,
                  depth,
                  dpr,
                  tag: "iframe",
                  src: oneLine(fr.getAttribute("src")),
                  error: "iframe not ready (contentDocument is null)",
                  bbox: {x: iframeOx * dpr, y: iframeOy * dpr, width: rr.width * dpr, height: rr.height * dpr},
                });
                continue;
              }
              walk(childDoc, depth + 1, iframeOx, iframeOy, nextPath);
            } catch(e) {
              out.push({
                id: `iframe_${counter++}`,
                path: nextPath,
                depth,
                dpr,
                tag: "iframe",
                src: oneLine(fr.getAttribute("src")),
                error: String(e),
                bbox: {x: iframeOx * dpr, y: iframeOy * dpr, width: rr.width * dpr, height: rr.height * dpr},
              });
            }
          } catch(e) {}
        }
      }
    
      walk(document, 0, 0, 0, "root");
      return JSON.stringify(out);
    }"""
    
    def mark_containing_items_for_removal(items):
        """Drop each item whose bbox fully contains another item with the same text, keeping the inner element."""
        def bbox_contains(a, b):
            ax1, ay1 = a["x"], a["y"]
            ax2, ay2 = ax1 + a["width"], ay1 + a["height"]
            bx1, by1 = b["x"], b["y"]
            bx2, by2 = bx1 + b["width"], by1 + b["height"]
            return (bx1 >= ax1 and by1 >= ay1 and bx2 <= ax2 and by2 <= ay2)
    
        for item in items:
            item["to_remove"] = False
    
        n = len(items)
        for i in range(n):
            a = items[i]
            for j in range(n):
                if i == j:
                    continue
                b = items[j]
                if (a.get("text") or "").strip() == (b.get("text") or "").strip():
                    if bbox_contains(a["bbox"], b["bbox"]):
                        a["to_remove"] = True
                        break
    
        return [item for item in items if not item["to_remove"]]
    
    def draw_dashed_line(draw, xy, dash_len=6, gap_len=4, fill=(255, 0, 0, 200), width=2):
        """Draw a dashed line"""
        x1, y1, x2, y2 = xy
        if y1 == y2:  # Horizontal line
            total_len = abs(x2 - x1)
            step = dash_len + gap_len
            n = max(1, int(total_len // step) + 1)
            direction = 1 if x2 >= x1 else -1
            for i in range(n):
                start = x1 + direction * i * step
                end = start + direction * dash_len
                if direction == 1:
                    if start > x2: break
                    end = min(end, x2)
                else:
                    if start < x2: break
                    end = max(end, x2)
                draw.line((start, y1, end, y2), fill=fill, width=width)
        elif x1 == x2:  # Vertical line
            total_len = abs(y2 - y1)
            step = dash_len + gap_len
            n = max(1, int(total_len // step) + 1)
            direction = 1 if y2 >= y1 else -1
            for i in range(n):
                start = y1 + direction * i * step
                end = start + direction * dash_len
                if direction == 1:
                    if start > y2: break
                    end = min(end, y2)
                else:
                    if start < y2: break
                    end = max(end, y2)
                draw.line((x1, start, x2, end), fill=fill, width=width)
        else:
            draw.line((x1, y1, x2, y2), fill=fill, width=width)
    
    def draw_dashed_rect(draw, x1, y1, x2, y2, dash_len=6, gap_len=4, fill=(255, 0, 0, 200), width=2):
        """Draw a dashed rectangle"""
        draw_dashed_line(draw, (x1, y1, x2, y1), dash_len, gap_len, fill, width)
        draw_dashed_line(draw, (x1, y2, x2, y2), dash_len, gap_len, fill, width)
        draw_dashed_line(draw, (x1, y1, x1, y2), dash_len, gap_len, fill, width)
        draw_dashed_line(draw, (x2, y1, x2, y2), dash_len, gap_len, fill, width)
    
    def screenshot_to_png_bytes(s: Any) -> bytes:
        if isinstance(s, (bytes, bytearray)):
            return bytes(s)
        if isinstance(s, str):
            ss = s.strip()
            if ss.startswith("data:image"):
                ss = ss.split(",", 1)[-1].strip()
            try:
                return base64.b64decode(ss, validate=True)
            except Exception:
                with open(s, "rb") as f:
                    return f.read()
        raise TypeError(f"Unsupported screenshot type: {type(s)}")
    
    def items_to_text(items_raw):
        format_ele_text = []
        for web_ele_id in range(len(items_raw)):
            item = items_raw[web_ele_id]
            is_menu = item.get('isMenu', False)
            menu_options = item.get('menuOptions', [])
            label_text = item.get('text', "")
            ele_tag_name = item.get("tag", "button")
            ele_type = item.get("type", "")
            ele_aria_label = item.get("ariaLabel", "")
            input_attr_types = ['text', 'search', 'password', 'email', 'tel']
    
            if is_menu and menu_options:
                trigger_text = label_text.split('\n')[0].strip()
                options_str = ', '.join([f'"{opt}"' for opt in menu_options])
                base_text = f"[{web_ele_id}]: <{ele_tag_name}>"
                if trigger_text:
                    base_text += f' "{trigger_text}"'
                elif ele_aria_label:
                    base_text += f' "{ele_aria_label}"'
                format_ele_text.append(f"{base_text} is a menu with options: [{options_str}];")
                continue
    
            if not label_text:
                if (ele_tag_name.lower() == 'input' and ele_type in input_attr_types) or \
                   ele_tag_name.lower() == 'textarea' or \
                   (ele_tag_name.lower() == 'button' and ele_type in ['submit', 'button']):
                    if ele_aria_label:
                        format_ele_text.append(f'[{web_ele_id}]: <{ele_tag_name}> "{ele_aria_label}";')
                    else:
                        format_ele_text.append(f'[{web_ele_id}]: <{ele_tag_name}> "{label_text}";')
            elif label_text and len(label_text) < 200:
                if not ("<img" in label_text and "src=" in label_text):
                    if ele_tag_name in ["button", "input", "textarea"]:
                        if ele_aria_label and (ele_aria_label != label_text):
                            format_ele_text.append(f'[{web_ele_id}]: <{ele_tag_name}> "{label_text}", "{ele_aria_label}";')
                        else:
                            format_ele_text.append(f'[{web_ele_id}]: <{ele_tag_name}> "{label_text}";')
                    else:
                        if ele_aria_label and (ele_aria_label != label_text):
                            format_ele_text.append(f'[{web_ele_id}]: "{label_text}", "{ele_aria_label}";')
                        else:
                            format_ele_text.append(f'[{web_ele_id}]: "{label_text}";')
    
        return '\t'.join(format_ele_text)
    
    def draw_som(items, overlay, max_draw):
        try:
            font = ImageFont.truetype(ImageFont.load_default().path, size=20)
        except Exception:
            font = ImageFont.load_default()
    
        placed_label_boxes = []
        draw = ImageDraw.Draw(overlay)
        for idx, it in enumerate(items[:max_draw]):
            b = it.get("bbox") or {}
            x, y, w, h = b.get("x"), b.get("y"), b.get("width"), b.get("height")
            if None in (x, y, w, h) or w <= 0 or h <= 0:
                continue
    
            r, g, b_color = random.randint(0, 255), random.randint(0, 255), random.randint(0, 255)
            color = (r, g, b_color, 255)
    
            x1, y1, x2, y2 = x, y, x + w, y + h
            draw_dashed_rect(draw, x1, y1, x2, y2, dash_len=6, gap_len=4, fill=color, width=2)
    
            if idx < max_draw:
                label = f'{idx}'
                try:
                    tb = draw.textbbox((0, 0), label, font=font)
                    tw, th = tb[2] - tb[0], tb[3] - tb[1]
                except Exception:
                    tw, th = len(label) * 6, 12
    
                padding_x, padding_y = 6, 4
                label_w, label_h = tw + padding_x * 2, th + padding_y * 2
    
                candidates = [
                    ("top_left", x1, y1 - label_h - 2),
                    ("top_right", x2 - label_w, y1 - label_h - 2),
                    ("bottom_left", x1, y2 + 2),
                    ("bottom_right", x2 - label_w, y2 + 2),
                ]
    
                img_w, img_h = overlay.size
                normalized_candidates = [
                    (name, max(0, min(lx, img_w - label_w)), max(0, min(ly, img_h - label_h)))
                    for name, lx, ly in candidates
                ]
    
                best_pos = None
                best_overlap = None
                for name, lx_c, ly_c in normalized_candidates:
                    candidate_rect = (lx_c, ly_c, lx_c + label_w, ly_c + label_h)
                    total_overlap = sum(
                        rect_intersection_area(candidate_rect,placed)
                        for placed in placed_label_boxes
                    )
                    if total_overlap == 0:
                        best_pos = (lx_c, ly_c)
                        best_overlap = 0
                        break
                    if best_overlap is None or total_overlap < best_overlap:
                        best_overlap = total_overlap
                        best_pos = (lx_c, ly_c)
    
                lx, ly = best_pos
                label_box = (lx, ly, lx + label_w, ly + label_h)
                placed_label_boxes.append(label_box)
    
                color = list(color)
                color[-1] = 150
                draw.rectangle([lx, ly, lx + label_w, ly + label_h], fill=tuple(color))
                draw.text((lx + padding_x, ly + padding_y), label, fill=(255, 255, 255, 255), font=font)
    
    def rect_intersection_area(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        if ix2 <= ix1 or iy2 <= iy1:
            return 0
        return (ix2 - ix1) * (iy2 - iy1)
    
    def is_inside_strict(box_inner: dict, box_outer: dict) -> bool:
        x1, y1, w1, h1 = box_inner.get("x"), box_inner.get("y"), box_inner.get("width"), box_inner.get("height")
        x2, y2, w2, h2 = box_outer.get("x"), box_outer.get("y"), box_outer.get("width"), box_outer.get("height")
        i_left, i_top = x1, y1
        i_right, i_bottom = x1 + w1, y1 + h1
        o_left, o_top = x2, y2
        o_right, o_bottom = x2 + w2, y2 + h2
        return (i_left > o_left and i_top > o_top and i_right < o_right and i_bottom < o_bottom)
    
    def remove_outer_boxes(A: list) -> list:
        n = len(A)
        remove_flags = [False] * n
        for i in range(n):
            if remove_flags[i]:
                continue
            box_i = A[i]["bbox"]
            for j in range(n):
                if i == j:
                    continue
                box_j = A[j]["bbox"]
                if is_inside_strict(box_j, box_i):
                    remove_flags[i] = True
                    break
        return [box for k, box in enumerate(A) if not remove_flags[k]]
    
    async def get_css_som(page, selector="*", max_depth=16, max_per_doc=3000, min_area=-1.0, max_retry=3):
        items = []
        for _ in range(max_retry):
            try:
                items_json = await page.evaluate(
                    JS_COLLECT_ALL_RECURSIVE,
                    [selector, max_depth, max_per_doc, float(min_area), True],
                )
                items = json.loads(items_json)
                break
            except Exception:
                try:
                    await page.wait_for_load_state()
                except Exception:
                    pass
                await asyncio.sleep(1)
                continue
    
        if items:
            items = mark_containing_items_for_removal(items)
        return items
    
    def remove_neg_boxes(A: list) -> list:
        n = len(A)
        remove_flags = [False] * n
        for i in range(n):
            box_i = A[i]["bbox"]
            x, y, w, h = box_i.get("x"), box_i.get("y"), box_i.get("width"), box_i.get("height")
            if None in (x, y, w, h) or w <= 0 or h <= 0:
                remove_flags[i] = True
            if box_i.get("x", 0) < 0 or box_i.get("y", 0) < 0:
                remove_flags[i] = True
        return [box for k, box in enumerate(A) if not remove_flags[k]]
    
    async def get_som(page, img_path, img_path_no_box, selector="*", max_depth=16, 
                      max_per_doc=3000, min_area=-1.0, max_draw=2000, max_retry=3):
        """Get SoM annotations for the page"""
        items = await get_css_som(page, selector=selector, max_depth=max_depth, 
                                   max_per_doc=max_per_doc, min_area=min_area, max_retry=max_retry)
        
        shot = await page.screenshot()
        with open(img_path_no_box, "wb") as f:
            f.write(shot)
        
        items = remove_neg_boxes(items)
        
        png_bytes = screenshot_to_png_bytes(shot)
        img = Image.open(io.BytesIO(png_bytes)).convert("RGBA")
        overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
        draw_som(items, overlay, max_draw)
        img = Image.alpha_composite(img, overlay)
        img.save(img_path)
        
        return items, items_to_text(items)
  4. Main Workflow

    async def main():
        instruction = "Search for the results of today's NBA Rockets game"
        history = []
        max_step = 30
        model_name = 'gui-plus-2026-02-26'
        task_dir = f'data_{time.strftime("%Y%m%d_%H%M%S")}'
        web_tools = PlaywrightComputer(initial_url='https://www.baidu.com/', task_dir=task_dir)
    
        await web_tools.reset()
        stop_flag = False
        dashscope.api_key = os.getenv("DASHSCOPE_API_KEY", None)
        session_id = str(uuid.uuid4())
        print('session_id ', session_id)
    
        for step_id in range(max_step):
            if stop_flag:
                break
            safe_instruction = "".join(c if c.isalnum() or c in (" ", "_", "-") else "_" for c in instruction).strip()
            env_state = await web_tools.current_state(it=step_id)
            screen_shot = env_state['img_path']
            messages = get_messages(screen_shot, env_state, instruction, history)
    
            retry_time = 3
            for _ in range(retry_time):
                response = dashscope.MultiModalConversation.call(
                    model=model_name,
                    messages=messages,
                    vl_high_resolution_images=True,
                    headers={"x-dashscope-gui-session-id": session_id},
                    stream=False
                )
                print(response['request_id'])
                try:
                    output_text = response.output.choices[0].message.content[0]['text']
                    break
                except Exception as e:
                    print(response)
                    print(e)
            else:
                raise RuntimeError(f'Model call failed after {retry_time} attempts')
            print(output_text)
    
            thought = response.output.choices[0].message.reasoning_content
            # Prepend the reasoning only when the model actually returned some.
            if thought:
                output_text = f"<think>\n{thought}\n</think>{output_text}"
    
            action_list = extract_tool_calls(output_text)
            conclusion = output_text.split('<tool_call>')[0].strip()
    
            for action_id, action in enumerate(action_list):
                action_parameter = action['arguments']
                action_type = action_parameter['action']
                label = action_parameter.get('label', None)
                wiki_url = "https://baike.baidu.com/"
                coordinate = []

                if label is not None:
                    if label == "WINDOW":
                        coordinate = [500, 500]
                    else:
                        # The label indexes into the SoM list; the box center is the target point.
                        ele = env_state["SoM"]["SoM_list"][int(label)]
                        box = ele["bbox"]
                        x, y, w, h = box.get("x"), box.get("y"), box.get("width"), box.get("height")
                        nx = x + w / 2
                        ny = y + h / 2
                        coordinate = [nx, ny]
                        action_parameter['coordinate'] = coordinate

                # Pointer-based actions need a target point; skip them if the model returned no label.
                if action_type in ('scroll', 'select', 'click', 'type') and not coordinate:
                    print(f'Skipping {action_type}: no label in the model output')
                    continue

                if action_type == 'wait':
                    await asyncio.sleep(action_parameter.get('time', 2))
                elif action_type == 'scroll':
                    direction = action_parameter['direction']
                    await web_tools.scroll_at(coordinate[0], coordinate[1], direction=direction, magnitude=300)
                elif action_type == 'select':
                    text = action_parameter['option']
                    await web_tools._select(coordinate[0], coordinate[1], text=text)
                elif action_type == 'goback':
                    await web_tools.go_back()
                elif action_type == 'goto':
                    url = action_parameter.get('url', '')
                    await web_tools.navigate(url, normalize=True)
                elif action_type == 'click':
                    await web_tools.click_at(coordinate[0], coordinate[1])
                elif action_type == 'type':
                    text = action_parameter['text']
                    await web_tools.type_text_at(coordinate[0], coordinate[1], text)
                elif action_type == 'wikipedia':
                    await web_tools.navigate(wiki_url, normalize=True)
                elif action_type == 'answer':
                    text = action_parameter['text']
                    print(f'Answer: {text}\nAction completed')
                    stop_flag = True
                    break
    
                anno_dir = os.path.join(task_dir, 'anno')
                os.makedirs(anno_dir, exist_ok=True)
                anno_path = show_screenshot(
                    env_state['img_path_no_box'],
                    action_parameter,
                    f'{anno_dir}/anno_{step_id}_{action_id}.png'
                )
    
            history.append({'output': output_text, 'image': screen_shot})
            await asyncio.sleep(2)
    
    if __name__ == '__main__':
        asyncio.run(main())
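
The main loop above turns a SoM `label` from the model output into the center of the matching bounding box before it dispatches an action. The following is a minimal sketch of that step as a standalone helper, with the validation that the loop does not perform. The name `resolve_click_point` and its `window_point` parameter are hypothetical and are not part of the sample code or of any SDK. The sketch assumes that `label` is either `"WINDOW"` or an index into `SoM_list`.

```python
from typing import Any, Dict, List, Optional, Tuple


def resolve_click_point(
    som_list: List[Dict[str, Any]],
    label: Any,
    window_point: Tuple[float, float] = (500, 500),
) -> Optional[Tuple[float, float]]:
    """Return the center of the labelled element, or None if the label is unusable."""
    if label is None:
        return None
    if label == "WINDOW":
        # Same fallback point that the main loop uses for whole-window actions.
        return window_point
    try:
        index = int(label)
    except (TypeError, ValueError):
        return None
    if not 0 <= index < len(som_list):
        return None
    box = som_list[index].get("bbox") or {}
    x, y, w, h = box.get("x"), box.get("y"), box.get("width"), box.get("height")
    if None in (x, y, w, h):
        return None
    return (x + w / 2, y + h / 2)


som = [{"bbox": {"x": 10, "y": 20, "width": 100, "height": 40}}]
print(resolve_click_point(som, 0))         # (60.0, 40.0)
print(resolve_click_point(som, "WINDOW"))  # (500, 500)
print(resolve_click_point(som, 7))         # None
```

Returning `None` for an unknown or out-of-range label lets the caller skip the action or retry the model call, rather than fail with an `IndexError` in the middle of a task.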

Complete browser example code

"""
Complete example code for browser GUI automation
Combines browser_local.py and som.py into a single file that can be run directly

Install dependencies:
pip install playwright dashscope pillow termcolor playwright-stealth
playwright install chromium

Set environment variables before running:
export DASHSCOPE_API_KEY=your_api_key
"""

import json, os, sys, base64
import dashscope, time, math
import uuid
from PIL import Image, ImageDraw
import asyncio
from playwright.async_api import async_playwright, Page, Browser, BrowserContext
from typing import Literal, Any, List, Dict
import random
from datetime import datetime
import io
import re

# ===================== SoM annotation module =====================

# Structural/decorative tags, usually not directly interactive
STRUCTURAL_TAGS = {
    "html", "head", "meta", "link", "style", "script", "base",
    "title", "body", "header", "footer", "main", "nav", "section", "article",
    "aside", "summary", "details",
}

# Obviously non-interactive ARIA roles
NON_INTERACTIVE_ROLES = {
    "presentation", "none", "img", "banner", "main", "contentinfo", "navigation", "region",
}

def looks_interactive(el: Dict[str, Any]) -> bool:
    """Determine if an element looks interactive"""
    tag = (el.get("tag") or "").lower()
    role = (el.get("role") or "").lower()
    typ = (el.get("type") or "").lower()
    text = (el.get("text") or "").strip()
    cls = (el.get("cls") or "").lower()
    id_ = (el.get("id") or "").lower()
    aria_label = (el.get("ariaLabel") or "").strip()

    if tag in {"a", "button", "select", "textarea"}:
        return True
    if tag == "input" and typ not in {"hidden"}:
        return True
    if role in {"button", "link", "tab", "menuitem", "option", "switch", "checkbox", "radio", "textbox"}:
        return True
    if any(x in el for x in ("onclick", "onmousedown", "onmouseup")):
        return True
    if any(k in cls or k in id_ for k in ("btn", "button", "link", "click", "nav", "tab", "menu")):
        return True
    if (text or aria_label) and tag in {"div", "span", "li"}:
        return True
    return False

def is_obviously_non_interactive(el: Dict[str, Any]) -> bool:
    """Determine if an element is obviously non-interactive"""
    tag = (el.get("tag") or "").lower()
    role = (el.get("role") or "").lower()
    typ = (el.get("type") or "").lower()
    bbox = el.get("bbox") or {}
    w = bbox.get("width") or 0
    h = bbox.get("height") or 0

    if tag in STRUCTURAL_TAGS:
        if not (tag == "body" and (el.get("isContentEditable") or "ke-content" in (el.get("cls") or ""))):
            return True
    if w * h < 9:
        return True
    # Structural tags are handled above; repeating the check here would undo the editable-body exception.
    if tag in {"video", "audio", "source"}:
        return True
    if tag == "input" and typ == "hidden":
        return True
    if role in NON_INTERACTIVE_ROLES:
        return True
    if not looks_interactive(el):
        return True
    return False

JS_COLLECT_ALL_RECURSIVE = r"""(args) => {
  const selector = args[0];
  const maxDepth = args[1];
  const maxPerDoc = args[2];
  const minArea = args[3];
  const includeIframes = args[4];

  function oneLine(s){ return String(s||"").replace(/\s+/g," ").trim(); }

  function describe(el){
    const tagName = (el.tagName || "").toLowerCase();
    const hrefAttr = el.getAttribute("href") || "";
    const typeAttr = (el.getAttribute("type") || "").toLowerCase();
    const roleAttr = (el.getAttribute("role") || "").toLowerCase();
    const ariaAttr = el.getAttribute("aria-label") || "";

    let clickable = true;
    if (tagName == "div") clickable = false;
    if (tagName === "a" && hrefAttr) clickable = true;
    if (tagName === "button") clickable = true;
    if (tagName === "input" && typeAttr !== "hidden") clickable = true;
    if (["button","link","tab","menuitem","option","switch","checkbox","radio"].includes(roleAttr)) {
        clickable = true;
    }
    const isContentEditable = el.isContentEditable || el.getAttribute('contenteditable') === 'true';
    if (isContentEditable) clickable = true;
    // An element with an onclick handler is treated as clickable.
    try {
        if (typeof el.onclick === "function") clickable = true;
    } catch(e) {}
    const cls = (el.getAttribute("class") || "").toLowerCase();
    if (cls.includes("ke-content") || cls.includes("ke-edit-textarea")) clickable = true;

    return {
        tag: tagName, id: oneLine(el.getAttribute("id")), cls: oneLine(el.getAttribute("class")),
        role: roleAttr, name: oneLine(el.getAttribute("name")), type: typeAttr,
        href: oneLine(hrefAttr), src: oneLine(el.getAttribute("src")), ariaLabel: oneLine(ariaAttr),
        text: oneLine(el.textContent).slice(0, 40), clickable: clickable,
        isContentEditable: el.isContentEditable || el.getAttribute('contenteditable') === 'true',
    };
  }

  const out = [];
  let counter = 0;

  function walk(doc, depth, ox, oy, path){
    if (!doc || depth > maxDepth) return;
    const win = doc.defaultView;
    if (!win) return;
    const sx = 0, sy = 0;
    const dpr = win.devicePixelRatio || 1;

    const nodes = Array.from(doc.querySelectorAll(selector)).slice(0, maxPerDoc);
    for (const el of nodes){
      try {
        const r = el.getBoundingClientRect();
        if (!r) continue;
        const area = r.width * r.height;
        if (area < minArea) continue;
        const vw = win.innerWidth || doc.documentElement.clientWidth || 0;
        const vh = win.innerHeight || doc.documentElement.clientHeight || 0;
        if (r.right <= 0 || r.bottom <= 0 || r.left >= vw || r.top >= vh) continue;
        const desc = describe(el);
        if (!desc.clickable) continue;
        if (!el.checkVisibility({ checkOpacity: true, checkVisibilityCSS: true })) continue;

        out.push({
          id: `e_${counter++}`, path, depth, dpr, ...desc,
          bbox: { x: (ox + r.x - sx) * dpr, y: (oy + r.y - sy) * dpr, width: r.width * dpr, height: r.height * dpr }
        });
      } catch(e) {}
    }
    if (!includeIframes) return;
    const iframes = Array.from(doc.querySelectorAll("iframe")).slice(0, maxPerDoc);
    for (let i = 0; i < iframes.length; i++){
      const fr = iframes[i];
      try {
        const rr = fr.getBoundingClientRect();
        const iframeOx = ox + rr.x - sx;
        const iframeOy = oy + rr.y - sy;
        const nextPath = path + `/iframe[${i}]`;
        try {
          const childDoc = fr.contentDocument;
          if (!childDoc) {
            out.push({ id: `iframe_${counter++}`, path: nextPath, depth, dpr, tag: "iframe",
              src: oneLine(fr.getAttribute("src")), error: "iframe not ready",
              bbox: {x: iframeOx * dpr, y: iframeOy * dpr, width: rr.width * dpr, height: rr.height * dpr} });
            continue;
          }
          walk(childDoc, depth + 1, iframeOx, iframeOy, nextPath);
        } catch(e) {
          out.push({ id: `iframe_${counter++}`, path: nextPath, depth, dpr, tag: "iframe",
            src: oneLine(fr.getAttribute("src")), error: String(e),
            bbox: {x: iframeOx * dpr, y: iframeOy * dpr, width: rr.width * dpr, height: rr.height * dpr} });
        }
      } catch(e) {}
    }
  }
  walk(document, 0, 0, 0, "root");
  return JSON.stringify(out);
}"""

def mark_containing_items_for_removal(items):
    """Drop elements whose bbox fully contains another element with the same text, keeping the innermost one"""
    def bbox_contains(a, b):
        ax1, ay1, ax2, ay2 = a["x"], a["y"], a["x"] + a["width"], a["y"] + a["height"]
        bx1, by1, bx2, by2 = b["x"], b["y"], b["x"] + b["width"], b["y"] + b["height"]
        return (bx1 >= ax1 and by1 >= ay1 and bx2 <= ax2 and by2 <= ay2)

    for item in items:
        item["to_remove"] = False
    n = len(items)
    for i in range(n):
        a = items[i]
        for j in range(n):
            if i == j: continue
            b = items[j]
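            # Skip candidates already marked for removal, so two elements with
            # identical text and identical boxes do not remove each other
            if not b["to_remove"] and (a.get("text") or "").strip() == (b.get("text") or "").strip():
                if bbox_contains(a["bbox"], b["bbox"]):
                    a["to_remove"] = True
                    break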
    return [item for item in items if not item["to_remove"]]

def screenshot_to_png_bytes(s: Any) -> bytes:
    if isinstance(s, (bytes, bytearray)):
        return bytes(s)
    if isinstance(s, str):
        ss = s.strip()
        if ss.startswith("data:image"):
            ss = ss.split(",", 1)[-1].strip()
        try:
            return base64.b64decode(ss, validate=True)
        except Exception:
            with open(s, "rb") as f:
                return f.read()
    raise TypeError(f"Unsupported screenshot type: {type(s)}")

def items_to_text(items_raw):
    """Convert a list of elements to a text description"""
    format_ele_text = []
    for web_ele_id in range(len(items_raw)):
        item = items_raw[web_ele_id]
        label_text = item.get('text', "")
        ele_tag_name = item.get("tag", "button")
        ele_type = item.get("type", "")
        ele_aria_label = item.get("ariaLabel", "")
        input_attr_types = ['text', 'search', 'password', 'email', 'tel']

        if not label_text:
            if (ele_tag_name.lower() == 'input' and ele_type in input_attr_types) or \
               ele_tag_name.lower() == 'textarea' or \
               (ele_tag_name.lower() == 'button' and ele_type in ['submit', 'button']):
                format_ele_text.append(f"[{web_ele_id}]: <{ele_tag_name}> \"{ele_aria_label or label_text}\";")
        elif label_text and len(label_text) < 200:
            if not ("<img" in label_text and "src=" in label_text):
                if ele_tag_name in ["button", "input", "textarea"]:
                    format_ele_text.append(f"[{web_ele_id}]: <{ele_tag_name}> \"{label_text}\";")
                else:
                    format_ele_text.append(f"[{web_ele_id}]: \"{label_text}\";")
    return '\t'.join(format_ele_text)

def draw_dashed_rect(draw, x1, y1, x2, y2, dash_len=6, gap_len=4, fill=(255, 0, 0, 200), width=2):
    """Draw a dashed rectangle"""
    def draw_dashed_line(xy):
        px1, py1, px2, py2 = xy
        if py1 == py2:  # Horizontal line
            total_len = abs(px2 - px1)
            step = dash_len + gap_len
            direction = 1 if px2 >= px1 else -1
            for i in range(max(1, int(total_len // step) + 1)):
                start = px1 + direction * i * step
                end = start + direction * dash_len
                if direction == 1:
                    if start > px2: break
                    end = min(end, px2)
                else:
                    if start < px2: break
                    end = max(end, px2)
                draw.line((start, py1, end, py2), fill=fill, width=width)
        elif px1 == px2:  # Vertical line
            total_len = abs(py2 - py1)
            step = dash_len + gap_len
            direction = 1 if py2 >= py1 else -1
            for i in range(max(1, int(total_len // step) + 1)):
                start = py1 + direction * i * step
                end = start + direction * dash_len
                if direction == 1:
                    if start > py2: break
                    end = min(end, py2)
                else:
                    if start < py2: break
                    end = max(end, py2)
                draw.line((px1, start, px2, end), fill=fill, width=width)
    draw_dashed_line((x1, y1, x2, y1))
    draw_dashed_line((x1, y2, x2, y2))
    draw_dashed_line((x1, y1, x1, y2))
    draw_dashed_line((x2, y1, x2, y2))

def draw_som(items, overlay, max_draw):
    """Draw SoM annotations on the screenshot"""
    try:
        from PIL import ImageFont  # ImageFont is its own PIL module, not an attribute of Image
        font = ImageFont.load_default()
    except Exception:
        font = None
    draw = ImageDraw.Draw(overlay)
    placed_label_boxes = []

    for idx, it in enumerate(items[:max_draw]):
        b = it.get("bbox") or {}
        x, y, w, h = b.get("x"), b.get("y"), b.get("width"), b.get("height")
        if None in (x, y, w, h) or w <= 0 or h <= 0:
            continue

        color = (random.randint(0, 255), random.randint(0, 255), random.randint(0, 255), 255)
        x1, y1, x2, y2 = x, y, x + w, y + h
        draw_dashed_rect(draw, x1, y1, x2, y2, fill=color, width=2)

        if idx < max_draw:
            label = f'{idx}'
            try:
                tb = draw.textbbox((0, 0), label, font=font) if font else (0, 0, len(label) * 6, 12)
                tw, th = tb[2] - tb[0], tb[3] - tb[1]
            except Exception:
                tw, th = len(label) * 6, 12

            padding_x, padding_y = 6, 4
            label_w, label_h = tw + padding_x * 2, th + padding_y * 2

            # Prioritize top-left corner
            lx = max(0, min(x1, overlay.size[0] - label_w))
            ly = max(0, y1 - label_h - 2)

            color_list = list(color)
            color_list[-1] = 150
            draw.rectangle([lx, ly, lx + label_w, ly + label_h], fill=tuple(color_list))
            draw.text((lx + padding_x, ly + padding_y), label, fill=(255, 255, 255, 255), font=font)

def remove_neg_boxes(A: list) -> list:
    """Remove boxes with negative coordinates"""
    return [box for box in A if not (box["bbox"].get("x", 0) < 0 or box["bbox"].get("y", 0) < 0)]

async def get_css_som(page, selector="*", max_depth=16, max_per_doc=3000, min_area=-1.0, max_retry=3):
    """Get CSS elements from the page"""
    items = []
    for _ in range(max_retry):
        try:
            items_json = await page.evaluate(JS_COLLECT_ALL_RECURSIVE, [selector, max_depth, max_per_doc, float(min_area), True])
            items = json.loads(items_json)
            break
        except Exception:
            try:
                await page.wait_for_load_state()
            except Exception:
                pass
            await asyncio.sleep(1)
    if items:
        items = mark_containing_items_for_removal(items)
    return items

async def get_som(page, img_path, img_path_no_box, selector="*", max_depth=16, max_per_doc=3000, min_area=-1.0, max_draw=2000, max_retry=3):
    """Get SoM annotations for the page"""
    items = await get_css_som(page, selector=selector, max_depth=max_depth, max_per_doc=max_per_doc, min_area=min_area, max_retry=max_retry)
    shot = await page.screenshot()
    with open(img_path_no_box, "wb") as f:
        f.write(shot)
    items = remove_neg_boxes(items)
    png_bytes = screenshot_to_png_bytes(shot)
    img = Image.open(io.BytesIO(png_bytes)).convert("RGBA")
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    draw_som(items, overlay, max_draw)
    img = Image.alpha_composite(img, overlay)
    img.save(img_path)
    return items, items_to_text(items)


# ===================== PlaywrightComputer class =====================

PLAYWRIGHT_KEY_MAP = {
    "backspace": "Backspace", "tab": "Tab", "return": "Enter", "enter": "Enter",
    "shift": "Shift", "control": "ControlOrMeta", "alt": "Alt", "escape": "Escape",
    "space": "Space", "pageup": "PageUp", "pagedown": "PageDown", "end": "End", "home": "Home",
    "left": "ArrowLeft", "up": "ArrowUp", "right": "ArrowRight", "down": "ArrowDown",
    "insert": "Insert", "delete": "Delete", "f1": "F1", "f2": "F2", "f3": "F3", "f4": "F4",
    "f5": "F5", "f6": "F6", "f7": "F7", "f8": "F8", "f9": "F9", "f10": "F10", "f11": "F11", "f12": "F12",
    "command": "Meta",
}

class PlaywrightComputer:
    """Playwright browser automation wrapper"""

    def __init__(self, initial_url="https://www.baidu.com/", task_dir='data', highlight_mouse: bool = False):
        self._initial_url = initial_url
        self.task_dir = task_dir
        self._highlight_mouse = highlight_mouse
        self._playwright = None
        self._browser: Browser | None = None
        self._context: BrowserContext | None = None
        self._page: Page | None = None

    async def reset(self):
        """Initialize the browser"""
        self._playwright = await async_playwright().start()
        self._browser = await self._playwright.chromium.launch(
            args=["--start-maximized", "--disable-blink-features=AutomationControlled"],
            headless=False,
        )
        user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36"
        self._context = await self._browser.new_context(no_viewport=True, user_agent=user_agent, locale="en-US")
        self._page = await self._context.new_page()
        await self._page.goto(self._initial_url, timeout=60000, wait_until="domcontentloaded")
        print("Browser started successfully")

    async def close(self):
        """Close the browser"""
        if self._context:
            await self._context.close()
        if self._browser:
            await self._browser.close()
        if self._playwright:
            await self._playwright.stop()

    async def click_at(self, x: int, y: int):
        await self._page.mouse.click(x, y)
        await self._page.wait_for_load_state()

    async def type_text_at(self, x: int, y: int, text: str, press_enter: bool = True, clear_before_typing: bool = True):
        await self._page.mouse.click(x, y)
        await asyncio.sleep(0.1)
        if clear_before_typing:
            await self.key_combination(["Control", "A"])
            await asyncio.sleep(0.1)
            await self.key_combination(["Delete"])
        await self._page.keyboard.type(text)
        if press_enter:
            await self.key_combination(["Enter"])
        await self._page.wait_for_load_state()

    async def scroll_at(self, x: int, y: int, direction: Literal["up", "down"], magnitude: int = 400):
        await self._page.mouse.move(x, y)
        await asyncio.sleep(0.1)
        dy = -magnitude if direction == "up" else magnitude
        await self._page.mouse.wheel(0, dy)
        await self._page.wait_for_load_state()

    async def go_back(self):
        await self._page.go_back()
        await self._page.wait_for_load_state()

    async def navigate(self, url: str, normalize=True):
        if normalize and not url.startswith(("http://", "https://")):
            url = "https://" + url
        await self._page.goto(url)
        await self._page.wait_for_load_state()

    async def key_combination(self, keys: list[str]):
        keys = [PLAYWRIGHT_KEY_MAP.get(k.lower(), k) for k in keys]
        for key in keys[:-1]:
            await self._page.keyboard.down(key)
        await self._page.keyboard.press(keys[-1])
        for key in reversed(keys[:-1]):
            await self._page.keyboard.up(key)

    async def current_state(self, it):
        """Get the current page state and SoM annotations"""
        try:
            await self._page.wait_for_load_state("networkidle", timeout=10000)
        except Exception:
            pass
        await asyncio.sleep(1)

        os.makedirs(os.path.join(self.task_dir, "trajectory_som"), exist_ok=True)
        os.makedirs(os.path.join(self.task_dir, "trajectory"), exist_ok=True)
        img_path = os.path.join(self.task_dir, f"trajectory_som/screenshot{it}.png")
        img_path_no_box = os.path.join(self.task_dir, f"trajectory/{it}_full_screenshot.png")

        SoM_list, format_ele_text = await get_som(self._page, img_path, img_path_no_box)
        width, height = Image.open(img_path).size
        return {
            "img_path": img_path, "img_path_no_box": img_path_no_box,
            "SoM": {"SoM_list": SoM_list, "format_ele_text": format_ele_text},
            "current_url": self._page.url, "width": width, "height": height
        }

    async def _select(self, x, y, text):
        await self._page.mouse.click(x, y)
        handle = await self._page.evaluate_handle("([x,y]) => document.elementFromPoint(x,y)", [x, y])
        tag = await handle.evaluate("el => el && el.tagName")
        if tag == "SELECT":
            await handle.evaluate("""(sel, label) => {
                const opt = [...sel.options].find(o => o.label.trim() === label.trim());
                if (!opt) throw new Error("Option not found: " + label);
                sel.value = opt.value;
                sel.dispatchEvent(new Event('input', {bubbles:true}));
                sel.dispatchEvent(new Event('change', {bubbles:true}));
            }""", text)
        else:
            raise RuntimeError(f"The element at the coordinates is not a SELECT, but a {tag}")


# ===================== Prompt construction =====================

def get_messages(image, env_state, instruction, history_output):
    """Construct multi-turn conversation messages"""
    history_n = 1
    current_step = len(history_output)
    history_start_idx = max(0, current_step - history_n)
    SoM_format_ele_text = env_state["SoM"]["format_ele_text"]

    system_prompt = '''# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name_for_human": "browser_use", "name": "browser_use", "description": "Use a browser to interact with web pages and take labeled screenshots.\\n* This is an interface to a web browser. You can click elements, type into inputs, scroll, wait for loading, go back, etc.\\n* Each Observation screenshot contains Numerical Labels placed at the TOP LEFT of each Web Element. Use these labels to target elements.\\n* Some pages may take time to load; you may need to wait and take successive screenshots.\\n* Avoid clicking near element edges; target the center of the element.\\n* Execute exactly ONE interaction action per step; do not chain multiple interactions in one call.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `click`: Click a web element by numerical label.\\n* `type`: Clear existing content in a textbox/input and type content. The system will automatically press ENTER after typing.\\n* `scroll`: Scroll within WINDOW or within a specific scrollable element/area (by label).\\n* `select`: Selects a specific option from a menu or dropdown. Use the option text provided in the textual information.\\n* `wait`: Wait for page processes to finish (default 5 seconds unless specified).\\n* `go_back`: Go back to the previous page.\\n* `wikipedia`: Directly jump to the Wikipedia homepage to search for information.\\n* `answer`: Terminate the current task and output the final answer.", "enum": ["click", "type", "scroll", "select", "wait", "go_back", "wikipedia", "answer"], "type": "string"}, "label": {"description": "Numerical label of the target web element. Required only by `action=click`, `action=type`, `action=scroll`, and `action=select` when scrolling within a specific area. Use string value `WINDOW` to scroll the whole page.", "type": ["integer", "string"]}, "direction": {"description": "Scroll direction. 
Required only by `action=scroll`.", "enum": ["up", "down"], "type": "string"}, "text": {"description": "Required only by `action=type` and `action=answer`.", "type": "string"}, "option": {"description": "The option to select. Required only by `action=select`", "type": "string"}, "time": {"description": "The seconds to wait. Required only by `action=wait` when overriding the default.", "type": "integer"}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one line for Action.
- Do not output anything else outside those two parts.
- Execute ONLY ONE interaction per iteration (one tool call).
- If finishing, use action=answer in the tool call.'''

    today = datetime.today()
    weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    weekday = weekday_names[today.weekday()]
    formatted_date = today.strftime("%Y-%m-%d") + " " + weekday
    day_info = f'''Today's date is: {formatted_date}.'''

    previous_actions = []
    for i in range(history_start_idx):
        if i < len(history_output):
            history_output_str = history_output[i]['output']
            if 'Action:' in history_output_str and '<tool_call>' in history_output_str:
                history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
            previous_actions.append(f"Step {i + 1}: {history_output_str}")
    previous_actions_str = "\n".join(previous_actions) if previous_actions else "None"

    instruction_prompt = f'''\nPlease generate the next move according to the UI screenshot, instruction and previous actions.

Instruction: {day_info}{instruction}

Previous actions:
{previous_actions_str}'''

    messages = [{"role": "system", "content": [{"text": system_prompt}]}]
    history_len = min(history_n, len(history_output))

    if history_len > 0:
        for history_id, history_item in enumerate(history_output[-history_n:], 0):
            if history_id == 0:
                messages.append({"role": "user", "content": [
                    {"text": instruction_prompt}, {"text": "Current screenshot:"},
                    {"image": "file://" + history_item['image']}
                ]})
            else:
                messages.append({"role": "user", "content": [
                    {"text": "Current screenshot:"}, {"image": "file://" + history_item['image']}
                ]})
            messages.append({"role": "assistant", "content": [{"text": history_item['output']}]})
        messages.append({"role": "user", "content": [
            {"image": "file://" + image}, {"text": SoM_format_ele_text}
        ]})
    else:
        messages.append({"role": "user", "content": [
            {"text": instruction_prompt}, {"image": "file://" + image}, {"text": SoM_format_ele_text}
        ]})
    return messages


def extract_tool_calls(text: str):
    """Extract tool_call from text"""
    pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE)
    blocks = pattern.findall(text)
    actions = []
    for blk in blocks:
        blk = blk.strip()
        try:
            actions.append(json.loads(blk))
        except json.JSONDecodeError as e:
            print(f'Parsing failed: {e}')
    return actions


# ===================== Main program =====================

def show_screenshot(image_path, action_parameter, save_path="screenshot_anno.png"):
    """Annotate the action point on the screenshot"""
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    if 'coordinate' in action_parameter:
        radius = 15
        center_x, center_y = action_parameter['coordinate'][0], action_parameter['coordinate'][1]
        draw.ellipse((center_x - radius, center_y - radius, center_x + radius, center_y + radius), fill="red", outline="red")
    else:
        return None
    image.save(save_path)
    return save_path

async def main():
    instruction = "Search for the results of today's NBA Rockets game"
    history = []
    max_step = 30
    model_name = 'gui-plus-2026-02-26'
    task_dir = f'data_{time.strftime("%Y%m%d_%H%M%S")}'
    web_tools = PlaywrightComputer(initial_url='https://www.baidu.com/', task_dir=task_dir)

    await web_tools.reset()
    stop_flag = False
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY", None)
    session_id = str(uuid.uuid4())
    print('session_id ', session_id)

    for step_id in range(max_step):
        if stop_flag:
            break
        env_state = await web_tools.current_state(it=step_id)
        screen_shot = env_state['img_path']
        messages = get_messages(screen_shot, env_state, instruction, history)

        retry_time = 3
        for _ in range(retry_time):
            response = dashscope.MultiModalConversation.call(
                model=model_name, messages=messages, vl_high_resolution_images=True,
                headers={"x-dashscope-gui-session-id": session_id},
                stream=False
            )
            print(response['request_id'])
            try:
                output_text = response.output.choices[0].message.content[0]['text']
                break
            except Exception as e:
                print(response)
                print(e)
        else:
            raise Exception(f'Model call failed after {retry_time} attempts')
        print(output_text)

        thought = response.output.choices[0].message.reasoning_content
        if thought:  # Skip both an empty string and None (no reasoning returned)
            output_text = f"<think>\n{thought}\n</think>{output_text}"

        action_list = extract_tool_calls(output_text)

        for action_id, action in enumerate(action_list):
            action_parameter = action['arguments']
            action_type = action_parameter['action']
            label = action_parameter.get('label', None)
            wiki_url = "https://baike.baidu.com/"
            coordinate = []

            if label is not None:
                if label == "WINDOW":
                    coordinate = [500, 500]
                else:
                    # The tool schema allows the label as an integer or a string, so normalize it before indexing
                    ele = env_state["SoM"]["SoM_list"][int(label)]
                    box = ele["bbox"]
                    x, y, w, h = box.get("x"), box.get("y"), box.get("width"), box.get("height")
                    # Target the center of the element's bounding box
                    nx = x + w / 2
                    ny = y + h / 2
                    coordinate = [nx, ny]
                    action_parameter['coordinate'] = coordinate

            if action_type == 'wait':
                # Default matches the 5 seconds stated in the system prompt
                await asyncio.sleep(action_parameter.get('time', 5))
            elif action_type == 'scroll':
                direction = action_parameter['direction']
                await web_tools.scroll_at(coordinate[0], coordinate[1], direction=direction, magnitude=300)
            elif action_type == 'select':
                text = action_parameter['option']
                await web_tools._select(coordinate[0], coordinate[1], text=text)
            elif action_type == 'go_back':  # Must match the `go_back` value in the tool's action enum
                await web_tools.go_back()
            elif action_type == 'goto':
                url = action_parameter.get('url', '')
                await web_tools.navigate(url, normalize=True)
            elif action_type == 'click':
                await web_tools.click_at(coordinate[0], coordinate[1])
            elif action_type == 'type':
                text = action_parameter['text']
                await web_tools.type_text_at(coordinate[0], coordinate[1], text)
            elif action_type == 'wikipedia':
                await web_tools.navigate(wiki_url, normalize=True)
            elif action_type == 'answer':
                text = action_parameter['text']
                print(f'Answer: {text}\nTask completed')
                stop_flag = True
                break

            anno_dir = os.path.join(task_dir, 'anno')
            os.makedirs(anno_dir, exist_ok=True)
            show_screenshot(env_state['img_path_no_box'], action_parameter, f'{anno_dir}/anno_{step_id}_{action_id}.png')

        history.append({'output': output_text, 'image': screen_shot})
        await asyncio.sleep(2)

if __name__ == '__main__':
    asyncio.run(main())
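
The step from model output to a click point can be tried without launching a browser. The following is a minimal sketch, not part of the sample program: `parse_tool_calls` and `label_to_point` are hypothetical helpers that mirror `extract_tool_calls` and the label lookup in `main()`, and the model output and element list are invented for illustration.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_tool_calls(text):
    """Return the JSON objects found inside <tool_call> blocks, skipping malformed ones."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(block.strip()))
        except json.JSONDecodeError:
            continue
    return calls

def label_to_point(label, som_list, window_point=(500, 500)):
    """Map a numerical label (int or numeric string) or "WINDOW" to an (x, y) point."""
    if label == "WINDOW":
        return window_point
    box = som_list[int(label)]["bbox"]
    # Aim at the center of the bounding box, as the tool description recommends
    return (box["x"] + box["width"] / 2, box["y"] + box["height"] / 2)

# Hypothetical model output and element list, for illustration only
sample_output = (
    "Action: Click the search box.\n"
    '<tool_call>\n{"name": "browser_use", "arguments": {"action": "click", "label": "1"}}\n</tool_call>'
)
sample_som = [
    {"text": "News", "bbox": {"x": 10, "y": 10, "width": 40, "height": 20}},
    {"text": "", "bbox": {"x": 100, "y": 200, "width": 300, "height": 40}},
]

call = parse_tool_calls(sample_output)[0]
point = label_to_point(call["arguments"]["label"], sample_som)
print(call["arguments"]["action"], point)  # click (250.0, 220.0)
```

Converting the label with `int()` matters because the tool schema declares `label` as either an integer or a string.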

Tool calling

Step 1. Construct the system prompt

Computer GUI tasks

import json  # Used below to serialize the tool definitions into the system prompt

# Define a list of custom tools (optional, pass as needed)
tool_list = [
    {
        "name": "save_to_file",
        "description": "Save text content to a file on the local filesystem.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_path": {
                    "type": "string",
                    "description": "The file path to save to"
                },
                "content": {
                    "type": "string",
                    "description": "The text content to save"
                }
            },
            "required": ["file_path", "content"]
        }
    }
]

# Wrap custom tools in the standard structure
tools_def = {"type": "function", "function": tool_list}

# Define the GUI tool
gui_tool = {
    "type": "function",
    "function": {
        "name": "computer_use",
        "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\n* The screen's resolution is 1000x1000.\n* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.\n* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.",
        "parameters": {
            "properties": {
                "action": {
                    "description": "The action to perform. The available actions are:\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\n* `type`: Type a string of text on the keyboard.\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\n* `scroll`: Performs a scroll of the mouse scroll wheel.\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\n* `wait`: Wait specified seconds for the change to happen.\n* `terminate`: Terminate the current task and report its completion status.\n* `answer`: Answer a question.",
                    "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer"],
                    "type": "string"
                },
                "keys": {"description": "Required only by `action=key`.", "type": "array"},
                "text": {"description": "Required only by `action=type` and `action=answer`.", "type": "string"},
                "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"},
                "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"},
                "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"},
                "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}
            },
            "required": ["action"],
            "type": "object"
        }
    }
}

# Combine the GUI tool and custom tools to build the System Prompt
system_prompt = """# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
""" + json.dumps(gui_tool) + json.dumps(tools_def) + """
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one sentence for Action.
- Do not output anything else outside those two parts.
- If finishing, use 'terminate' function in the tool call."""
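
The tool description above states that the screen's resolution is 1000x1000. Assuming the model returns `coordinate` values in that 1000x1000 space rather than in real pixels (an assumption to verify against actual responses), the caller would need to rescale each point before executing it. The sketch below shows one way to do this. `to_screen_pixels` is a hypothetical helper, and the sample tool call and the 1920x1080 screen size are invented for illustration.

```python
def to_screen_pixels(coordinate, screen_width, screen_height, grid=1000):
    """Scale an (x, y) point from the grid x grid space to real screen pixels."""
    x, y = coordinate
    px = round(x / grid * screen_width)
    py = round(y / grid * screen_height)
    # Clamp so that a point on the far edge still lands inside the screen
    return (min(max(px, 0), screen_width - 1), min(max(py, 0), screen_height - 1))

# Hypothetical parsed tool call, for illustration only
call = {"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [500, 250]}}
print(to_screen_pixels(call["arguments"]["coordinate"], 1920, 1080))  # (960, 270)
```

Clamping keeps a point such as (1000, 1000) inside the screen instead of one pixel past its edge.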

Mobile GUI tasks

# Define a list of custom tools (optional, pass as needed)
tool_list = [
    {
        "name": "save_to_file",
        "description": "Save text content to a file on the local filesystem.",
        "parameters": {
            "type": "object",
            "properties": {
                "file_path": {
                    "type": "string",
                    "description": "The file path to save to"
                },
                "content": {
                    "type": "string",
                    "description": "The text content to save"
                }
            },
            "required": ["file_path", "content"]
        }
    }
]

# Wrap custom tools in the standard structure
tools_def = {"type": "function", "function": tool_list}

# Define the GUI tool
gui_tool = {
    "type": "function",
    "function": {
        "name_for_human": "mobile_use",
        "name": "mobile_use",
        "description": "Use a touchscreen to interact with a mobile device, and take screenshots.\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.",
        "parameters": {
            "properties": {
                "action": {
                    "description": 'The action to perform. The available actions are:\n* `key`: Perform a key event on the mobile device.\n    - This supports adb\'s `keyevent` syntax.\n    - Examples: "volume_up", "volume_down", "power", "camera", "clear".\n* `click`: Click the point on the screen with coordinate (x, y).\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).\n* `type`: Input the specified text into the activated input box.\n* `system_button`: Press the system button.\n* `open`: Open an app on the device.\n* `wait`: Wait specified seconds for the change to happen.\n* `terminate`: Terminate the current task and report its completion status.',
                    "enum": [
                        "key",
                        "click",
                        "long_press",
                        "swipe",
                        "type",
                        "system_button",
                        "open",
                        "wait",
                        "terminate",
                    ],
                    "type": "string",
                },
                "coordinate": {
                    "description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.",
                    "type": "array",
                },
                "coordinate2": {
                    "description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.",
                    "type": "array",
                },
                "text": {
                    "description": "Required only by `action=key`, `action=type`, and `action=open`.",
                    "type": "string",
                },
                "time": {
                    "description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.",
                    "type": "number",
                },
                "button": {
                    "description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`",
                    "enum": ["Back", "Home", "Menu", "Enter"],
                    "type": "string",
                },
                "status": {
                    "description": "The status of the task. Required only by `action=terminate`.",
                    "type": "string",
                    "enum": ["success", "failure"],
                },
            },
            "required": ["action"],
            "type": "object",
        },
        "args_format": "Format the arguments as a JSON object.",
    },
}

# Combine the GUI tool and custom tools to build the System Prompt
system_prompt = """# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
""" + json.dumps(gui_tool) + json.dumps(tools_def) + """
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one sentence for Action.
- Do not output anything else outside those two parts.
- If finishing, use 'terminate' function in the tool call."""

Step 2. Construct multi-turn conversation messages

In tool-calling scenarios, the multi-turn conversation messages must be constructed correctly, which includes managing the history and passing back tool call results. The last user message must wrap the tool execution result in <tool_response>...</tool_response> tags.

Computer GUI tasks

def get_messages(image, instruction, history_output, system_prompt):
    """
    Construct multi-turn conversation messages

    Parameters:
        image: Path to the current screenshot
        instruction: User instruction
        history_output: Historical conversation records [{"output": "...", "tool_response": "...", "image": "..."}]
        system_prompt: System prompt
    """
    history_n = 4  # Keep the last 4 turns of history
    current_step = len(history_output)
    history_start_idx = max(0, current_step - history_n)

    # Construct a summary of historical operations
    previous_actions = []
    for i in range(history_start_idx):
        if i < len(history_output):
            history_output_str = history_output[i]['output']
            if 'Action:' in history_output_str and '<tool_call>' in history_output_str:
                history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
            previous_actions.append(f"Step {i + 1}: {history_output_str}")

    previous_actions_str = "\n".join(previous_actions) if previous_actions else "None"

    instruction_prompt = f"""Please generate the next move according to the UI screenshot, instruction and previous actions.

Instruction: {instruction}

Previous actions:
{previous_actions_str}"""

    # Construct the messages array
    messages = [{"role": "system", "content": [{"text": system_prompt}]}]

    history_len = min(history_n, len(history_output))
    if history_len > 0:
        # Add historical conversation
        for history_id, history_item in enumerate(history_output[-history_n:], 0):
            if history_id == 0:
                messages.append({
                    "role": "user",
                    "content": [
                        {"text": instruction_prompt},
                        {"image": "file://" + history_item['image']}
                    ]
                })
            else:
                messages.append({
                    "role": "user",
                    "content": [{"image": "file://" + history_item['image']}]
                })
            messages.append({
                "role": "assistant",
                "content": [{"text": history_item['output']}]
            })
        # Add the current screenshot, including the tool execution result from the previous turn
        messages.append({
            "role": "user",
            "content": [
                {"text": "<tool_response>\n"},
                {"text": history_output[-1]['tool_response']},
                {"image": "file://" + image},
                {"text": "\n</tool_response>"}
            ]
        })
    else:
        # First turn of the conversation
        messages.append({
            "role": "user",
            "content": [
                {"text": instruction_prompt},
                {"image": "file://" + image}
            ]
        })

    return messages

Mobile GUI tasks

def get_messages(image, instruction, history_output, system_prompt):
    """
    Construct multi-turn conversation messages (mobile)

    Parameters:
        image: Path to the current screenshot
        instruction: User instruction
        history_output: Historical conversation records [{"output": "...", "tool_response": "...", "image": "..."}]
        system_prompt: System prompt
    """
    from datetime import datetime

    history_n = 4  # Keep the last 4 turns of history
    current_step = len(history_output)
    history_start_idx = max(0, current_step - history_n)

    # Construct a summary of historical operations
    previous_actions = []
    for i in range(history_start_idx):
        if i < len(history_output):
            history_output_str = history_output[i]['output']
            if 'Action:' in history_output_str and '<tool_call>' in history_output_str:
                history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
            previous_actions.append(f"Step {i + 1}: {history_output_str}")

    previous_actions_str = "\n".join(previous_actions) if previous_actions else "None"

    # Add background information (date)
    today = datetime.today()
    weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    weekday = weekday_names[today.weekday()]
    formatted_date = today.strftime("%Y-%m-%d") + " " + weekday
    ground_info = f'''Today's date is: {formatted_date}.'''

    instruction_prompt = f"""Please generate the next move according to the UI screenshot, instruction and previous actions.

Instruction: {ground_info}{instruction}

Previous actions:
{previous_actions_str}"""

    # Construct the messages array
    messages = [{"role": "system", "content": [{"text": system_prompt}]}]

    history_len = min(history_n, len(history_output))
    if history_len > 0:
        for history_id, history_item in enumerate(history_output[-history_n:], 0):
            if history_id == 0:
                messages.append({
                    "role": "user",
                    "content": [
                        {"text": instruction_prompt},
                        {"image": "file://" + history_item['image']}
                    ]
                })
            else:
                messages.append({
                    "role": "user",
                    "content": [{"image": "file://" + history_item['image']}]
                })
            messages.append({
                "role": "assistant",
                "content": [{"text": history_item['output']}]
            })
        messages.append({
            "role": "user",
            "content": [
                {"text": "<tool_response>\n"},
                {"text": history_output[-1]['tool_response']},
                {"image": "file://" + image},
                {"text": "\n</tool_response>"}
            ]
        })
    else:
        messages.append({
            "role": "user",
            "content": [
                {"text": instruction_prompt},
                {"image": "file://" + image}
            ]
        })

    return messages
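
To make the `history_output` format concrete, the following minimal sketch builds one record and extracts the short Action summary in the same way as the `previous_actions` loop above. The helper name `summarize_action`, the screenshot path, and the sample model output are hypothetical and exist only for illustration.

```python
def summarize_action(output: str) -> str:
    """Return the text between 'Action:' and '<tool_call>', or the raw output."""
    if "Action:" in output and "<tool_call>" in output:
        return output.split("Action:")[1].split("<tool_call>")[0].strip()
    return output


# One history record: model output, tool execution result, and screenshot path.
# All values are made up for illustration.
record = {
    "output": (
        "Action: Click the search box.\n"
        "<tool_call>\n"
        '{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [500, 250]}}\n'
        "</tool_call>"
    ),
    "tool_response": "",
    "image": "/tmp/screenshot_0.png",
}

summary = summarize_action(record["output"])
print(f"Step 1: {summary}")
```

Only the summary line is carried forward for steps that fall outside the retained history window, which keeps the prompt short while still telling the model what has already been done.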

Step 3. Parse the model output

Extract the <tool_call> block from the model output, parse it as JSON, and then map the coordinates based on the image scaling ratio. For more information, see Parse model output and map coordinates.
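
As a rough illustration of this step, the sketch below extracts `<tool_call>` blocks with a regular expression and rescales a coordinate. It assumes that the model returns coordinates on a 0-1000 grid, as the tool description's "1000x1000" resolution suggests. The function names `parse_tool_calls` and `map_coordinate`, the sample output, and the 1920x1080 target size are hypothetical. The complete example later on this page scales to the size returned by `smart_resize` instead.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return every <tool_call> block that parses as JSON; skip malformed blocks."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(block.strip()))
        except json.JSONDecodeError:
            continue
    return calls


def map_coordinate(coordinate, width, height, scale=1000):
    """Map an (x, y) point from the 0..scale grid to pixel coordinates."""
    x, y = coordinate
    return int(x / scale * width), int(y / scale * height)


# Hypothetical model output for one step
sample_output = (
    "Action: Click the search box.\n"
    "<tool_call>\n"
    '{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [500, 250]}}\n'
    "</tool_call>"
)

calls = parse_tool_calls(sample_output)
point = map_coordinate(calls[0]["arguments"]["coordinate"], width=1920, height=1080)
print(point)
```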

Step 4. Execute the GUI operation

Implement a GUI operation utility class to perform actual interface operations such as clicking, typing, and scrolling. For more information, see Computer GUI operation utility class and Mobile GUI operation utility class.
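
The sketch below shows one possible way to route a parsed `arguments` object to a utility class. To stay runnable without a display, it uses a stand-in class that records calls instead of driving the mouse. `RecordingTools` and `dispatch` are hypothetical names, and only three actions are covered. A real implementation would call a library such as pyautogui inside each method.

```python
class RecordingTools:
    """Stand-in for a GUI utility class: records calls instead of moving the mouse."""

    def __init__(self):
        self.calls = []

    def left_click(self, x, y):
        self.calls.append(("left_click", x, y))

    def type(self, text):
        self.calls.append(("type", text))

    def scroll(self, pixels):
        self.calls.append(("scroll", pixels))


def dispatch(arguments, tools):
    """Route one parsed arguments object to a method; return False if unsupported."""
    action = arguments.get("action")
    if action == "left_click":
        x, y = arguments["coordinate"]
        tools.left_click(x, y)
    elif action == "type":
        tools.type(arguments["text"])
    elif action == "scroll":
        tools.scroll(arguments.get("pixels", 1))
    else:
        return False
    return True


tools = RecordingTools()
handled = dispatch({"action": "left_click", "coordinate": [960, 270]}, tools)
unknown = dispatch({"action": "fly"}, tools)
print(tools.calls, handled, unknown)
```

Returning a flag for unsupported actions lets the caller report the problem back to the model in the next `tool_response` instead of failing silently.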

Step 5. Complete automation flow

Integrate the preceding steps into a complete automation flow. This flow loops through the sequence of taking a screenshot, performing model inference, and executing the GUI operation until the task is complete. For more information, see the complete code example below.
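
Before the full listing, the control flow can be sketched on its own. The loop below takes the screenshot, inference, and execution steps as injected callables so that it runs without a model or a display. `run_loop` and the three stub functions are hypothetical and much simpler than the complete code.

```python
def run_loop(take_screenshot, infer, execute, max_steps=30):
    """Repeat screenshot -> inference -> execution until done or max_steps is reached."""
    history = []
    for step in range(max_steps):
        image = take_screenshot(step)
        output = infer(image, history)
        tool_response, done = execute(output)
        history.append({"output": output, "tool_response": tool_response, "image": image})
        if done:
            break
    return history


# Stubs that stand in for the real screenshot, model call, and GUI execution
def fake_screenshot(step):
    return f"screenshot_{step}.png"


def fake_infer(image, history):
    # Pretend the model finishes the task on its third step
    return "terminate" if len(history) == 2 else "left_click"


def fake_execute(output):
    return "", output == "terminate"


history = run_loop(fake_screenshot, fake_infer, fake_execute, max_steps=10)
print(len(history))
```

The `max_steps` bound matters in practice: it stops the loop if the model never emits a terminating action.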

Complete code for computer tool calling

import os
import re
import json
import math
import time
import uuid
import pyautogui
import pyperclip
import dashscope
from PIL import Image

# ===================== Step 1: System Prompt =====================

# User-defined tool list (pass as needed)
tool_list = [{
    "name": "osworld_mcp_libreoffice_calc.adjust_column_width",
    "description": "Adjust the width of specified columns.",
    "parameters": {
        "type": "object",
        "properties": {
            "columns": {
                "type": "string",
                "description": "Column range to adjust (e.g., 'A:C')"
            },
            "width": {
                "type": "number",
                "description": "Width to set (in characters)"
            },
            "autofit": {
                "type": "boolean",
                "description": "Whether to autofit columns to content"
            }
        },
        "required": ["columns"]
    }
}]

tools_def = {
    "type": "function",
    "function": tool_list
}

# Basic action space for computer
gui_tool = {"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\n* `type`: Type a string of text on the keyboard.\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\n* `scroll`: Performs a scroll of the mouse scroll wheel.\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\n* `wait`: Wait specified seconds for the change to happen.\n* `terminate`: Terminate the current task and report its completion status.\n* `answer`: Answer a question.\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}

# Concatenate the system prompt
system_prompt = """# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
""" + json.dumps(gui_tool) + json.dumps(tools_def) + """
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one sentence for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""


# ===================== Step 2: Construct multi-turn conversation messages =====================

def get_messages(image, instruction, history_output, system_prompt):
    history_n = 4
    current_step = len(history_output)

    history_start_idx = max(0, current_step - history_n)
    previous_actions = []
    for i in range(history_start_idx):
        if i < len(history_output):
            history_output_str = history_output[i]['output']
            if 'Action:' in history_output_str and '<tool_call>' in history_output_str:
                history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
            previous_actions.append(f"Step {i + 1}: {history_output_str}")

    previous_actions_str = "\n".join(previous_actions) if previous_actions else "None"

    instruction_prompt = f"""
      Please generate the next move according to the UI screenshot, instruction and previous actions.

      Instruction: {instruction}

      Previous actions:
      {previous_actions_str}"""

    messages = [{"role": "system", "content": [{"text": system_prompt}]}]

    history_len = min(history_n, len(history_output))
    if history_len > 0:
        for history_id, history_item in enumerate(history_output[-history_n:], 0):
            if history_id == 0:
                messages.append({
                    "role": "user",
                    "content": [
                        {"text": instruction_prompt},
                        {"image": "file://" + history_item['image']}
                    ]
                })
            else:
                messages.append({
                    "role": "user",
                    "content": [{"image": "file://" + history_item['image']}]
                })
            messages.append({
                "role": "assistant",
                "content": [{"text": history_item['output']}],
            })
        messages.append({
            "role": "user",
            "content": [
                {"text": "<tool_response>\n"},
                {"text": history_output[-1]['tool_response']},
                {"image": "file://" + image},
                {"text": "\n</tool_response>"}
            ]
        })
    else:
        messages.append({
            "role": "user",
            "content": [
                {"text": instruction_prompt},
                {"image": "file://" + image}
            ]
        })
    return messages


# ===================== Step 3: Parse model output and map coordinates =====================

def extract_tool_calls(text):
    pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE)
    blocks = pattern.findall(text)
    actions = []
    for blk in blocks:
        blk = blk.strip()
        try:
            actions.append(json.loads(blk))
        except json.JSONDecodeError as e:
            print(f'Parsing failed: {e} | Snippet: {blk[:80]}...')
    return actions

def smart_resize(height, width, factor=32, min_pixels=32*32*4, max_pixels=32*32*1280, max_long_side=8192):
    def round_by_factor(number, factor):
        return round(number / factor) * factor
    def ceil_by_factor(number, factor):
        return math.ceil(number / factor) * factor
    def floor_by_factor(number, factor):
        return math.floor(number / factor) * factor

    if height < 2 or width < 2:
        raise ValueError(f"height:{height} and width:{width} must both be at least 2")
    elif max(height, width) / min(height, width) > 200:
        raise ValueError(f"absolute aspect ratio must be smaller than 200, got {height} / {width}")

    if max(height, width) > max_long_side:
        beta = max(height, width) / max_long_side
        height, width = int(height / beta), int(width / beta)

    h_bar = round_by_factor(height, factor)
    w_bar = round_by_factor(width, factor)

    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar


# ===================== Step 4: GUI operation utility class =====================

class ComputerTools:
    def __init__(self):
        self.image_info = None

    def load_image_info(self, path):
        width, height = Image.open(path).size
        self.image_info = (width, height)

    def get_screenshot(self, image_path, retry_times=3):
        if os.path.exists(image_path):
            os.remove(image_path)
        for i in range(retry_times):
            screenshot = pyautogui.screenshot()
            screenshot.save(image_path)
            if os.path.exists(image_path):
                self.load_image_info(image_path)
                return True
            else:
                time.sleep(0.1)
        return False

    def reset(self):
        pyautogui.hotkey('win', 'd')

    def press_key(self, keys):
        if isinstance(keys, list):
            cleaned_keys = []
            for key in keys:
                if isinstance(key, str):
                    if key.startswith("keys=["): key = key[6:]
                    if key.endswith("]"): key = key[:-1]
                    if key.startswith("['") or key.startswith('["'): key = key[2:] if len(key) > 2 else key
                    if key.endswith("']") or key.endswith('"]'): key = key[:-2] if len(key) > 2 else key
                    key = key.strip()
                    key_map = {"arrowleft": "left", "arrowright": "right", "arrowup": "up", "arrowdown": "down"}
                    key = key_map.get(key, key)
                    cleaned_keys.append(key)
                else:
                    cleaned_keys.append(key)
            keys = cleaned_keys
        else:
            keys = [keys]
        if len(keys) > 1:
            pyautogui.hotkey(*keys)
        else:
            pyautogui.press(keys[0])

    def type(self, text):
        pyperclip.copy(text)
        pyautogui.keyDown('ctrl')
        pyautogui.keyDown('v')
        pyautogui.keyUp('v')
        pyautogui.keyUp('ctrl')

    def mouse_move(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.moveTo(x, y)

    def left_click(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.click()

    def left_click_drag(self, x, y):
        pyautogui.dragTo(x, y, duration=0.5)
        pyautogui.moveTo(x, y)

    def right_click(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.rightClick()

    def middle_click(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.middleClick()

    def double_click(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.doubleClick()

    def triple_click(self, x, y):
        pyautogui.moveTo(x, y)
        time.sleep(0.1)
        pyautogui.tripleClick()

    def scroll(self, pixels):
        pyautogui.scroll(pixels)


# ===================== Step 5: Complete automation flow =====================

def run_gui_automation(instruction, max_step=30):
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
    dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
    model_name = 'gui-plus-2026-02-26'

    computer_tools = ComputerTools()
    computer_tools.reset()

    output_dir = os.path.join(os.path.expanduser("~"), "Desktop", "gui_automation")
    os.makedirs(output_dir, exist_ok=True)

    history = []
    stop_flag = False
    session_id = str(uuid.uuid4())
    print('session_id ', session_id)

    print(f"[Task] {instruction}")
    print("=" * 60)

    for step_id in range(max_step):
        if stop_flag:
            break

        print(f"\n[Step {step_id + 1}]")

        screen_shot = os.path.join(output_dir, f'screenshot_{step_id}.png')
        computer_tools.get_screenshot(screen_shot)

        messages = get_messages(screen_shot, instruction, history, system_prompt)

        retry_time = 3
        for _ in range(retry_time):
            response = dashscope.MultiModalConversation.call(
                model=model_name,
                messages=messages,
                vl_high_resolution_images=True,
                headers={"x-dashscope-gui-session-id": session_id},
                stream=False
            )
            print(response['request_id'])
            try:
                output_text = response.output.choices[0].message.content[0]['text']
                break
            except Exception as e:
                print(response)
                print(e)
        else:
            raise Exception('retry_time out')
        print(f"[Model output]\n{output_text}\n")

        action_list = extract_tool_calls(output_text)
        if not action_list:
            print("No valid operation extracted")
            break

        tool_response = ""
        for action_id, action in enumerate(action_list):
            action_parameter = action['arguments']
            # Custom tool calls carry no 'action' field; fall back to the tool name
            action_type = action_parameter.get('action', action.get('name'))

            dummy_image = Image.open(screen_shot)
            resized_height, resized_width = smart_resize(
                dummy_image.height, dummy_image.width,
                factor=16, min_pixels=3136, max_pixels=1003520 * 200
            )

            for key in ['coordinate', 'coordinate1', 'coordinate2']:
                if key in action_parameter:
                    action_parameter[key][0] = int(action_parameter[key][0] / 1000 * resized_width)
                    action_parameter[key][1] = int(action_parameter[key][1] / 1000 * resized_height)

            if action_type in ['click', 'left_click']:
                computer_tools.left_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Left-click ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")
            elif action_type == 'mouse_move':
                computer_tools.mouse_move(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Move mouse")
            elif action_type == 'middle_click':
                computer_tools.middle_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Middle-click")
            elif action_type in ['right click', 'right_click']:
                computer_tools.right_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Right-click")
            elif action_type in ['key', 'hotkey']:
                computer_tools.press_key(action_parameter['keys'])
                print(f"✓ Key press {action_parameter['keys']}")
            elif action_type == 'type':
                computer_tools.type(action_parameter['text'])
                print(f"✓ Type text: {action_parameter['text']}")
            elif action_type in ['drag', 'left_click_drag']:
                computer_tools.left_click_drag(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Drag")
            elif action_type in ['scroll', 'hscroll']:
                # hscroll is mapped to a regular scroll, as stated in the tool description
                if 'coordinate' in action_parameter:
                    computer_tools.mouse_move(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                computer_tools.scroll(action_parameter.get("pixels", 1))
                print(f"✓ Scroll {action_parameter.get('pixels', 1)} pixels")
            elif action_type in ['computer_double_click', 'double_click']:
                computer_tools.double_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Double-click")
            elif action_type == 'triple_click':
                computer_tools.triple_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
                print(f"✓ Triple-click")
            elif action_type == 'wait':
                time.sleep(action_parameter.get('time', 2))
                print(f"✓ Wait {action_parameter.get('time', 2)} seconds")
            elif action_type == 'answer':
                print(f"✓ Task complete: {action_parameter.get('text', '')}")
                tool_response = action_parameter.get('text', '')
                stop_flag = True
                break
            elif action_type in ['stop', 'terminate', 'done']:
                print(f"✓ Task terminated: {action_parameter.get('status', 'success')}")
                tool_response = f"Task terminated with status: {action_parameter.get('status', 'success')}"
                stop_flag = True
                break
            else:
                print(f"Unknown action type: {action_type}")
                tool_response = f"Unknown action type: {action_type}"

        history.append({'output': output_text, 'tool_response': tool_response, 'image': screen_shot})
        time.sleep(2)

    print("\n" + "=" * 60)
    print(f"[Complete] Total executed {len(history)} steps")


if __name__ == '__main__':
    run_gui_automation(
        instruction='Open Chrome for me and search for Alibaba on Baidu',
        max_step=30
    )

Complete code for mobile tool calling

import json, os, subprocess
import dashscope, time, math, re
import uuid
from PIL import Image
from datetime import datetime

# ===================== Step 1: System Prompt =====================

# User-defined tool list (pass as needed)
tool_list = [{
    "name": "osworld_mcp_libreoffice_calc.adjust_column_width",
    "description": "Adjust the width of specified columns.",
    "parameters": {
        "type": "object",
        "properties": {
            "columns": {
                "type": "string",
                "description": "Column range to adjust (e.g., 'A:C')"
            },
            "width": {
                "type": "number",
                "description": "Width to set (in characters)"
            },
            "autofit": {
                "type": "boolean",
                "description": "Whether to autofit columns to content"
            }
        },
        "required": ["columns"]
    }
}]

tools_def = {
    "type": "function",
    "function": tool_list
}

# Basic action space for mobile
gui_tool = {"type": "function", "function": {"name_for_human": "mobile_use", "name": "mobile_use", "description": "Use a touchscreen to interact with a mobile device, and take screenshots.\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\n* `key`: Perform a key event on the mobile device.\n    - This supports adb's `keyevent` syntax.\n    - Examples: \"volume_up\", \"volume_down\", \"power\", \"camera\", \"clear\".\n* `click`: Click the point on the screen with coordinate (x, y).\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).\n* `type`: Input the specified text into the activated input box.\n* `system_button`: Press the system button.\n* `open`: Open an app on the device.\n* `wait`: Wait specified seconds for the change to happen.\n* `answer`: Terminate the current task and output the answer.\n* `interact`: Resolve the blocking window by interacting with the user.\n* `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "click", "long_press", "swipe", "type", "system_button", "open", "wait", "answer", "interact", "terminate"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.", "type": "array"}, "coordinate2": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.", "type": "array"}, "text": {"description": "Required only by `action=key`, `action=type`, `action=open`, `action=answer`, and `action=interact`.", "type": "string"}, "time": {"description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.", "type": "number"}, "button": {"description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`", "enum": ["Back", "Home", "Menu", "Enter"], "type": "string"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}

# Concatenate the system prompt
mobile_system_prompt = """# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
""" + json.dumps(gui_tool) + json.dumps(tools_def) + """
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one sentence for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""


# ===================== Step 2: Construct multi-turn conversation messages =====================

def get_messages(image, instruction, history_output, system_prompt):
    history_n = 4
    current_step = len(history_output)
    history_start_idx = max(0, current_step - history_n)

    previous_actions = []
    for i in range(history_start_idx):
        if i < len(history_output):
            history_output_str = history_output[i]['output']
            # Both markers must be present before the action summary can be extracted
            if 'Action:' in history_output_str and '<tool_call>' in history_output_str:
                history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip()
            previous_actions.append(f"Step {i + 1}: {history_output_str}")

    previous_actions_str = "\n".join(previous_actions) if previous_actions else "None"

    # Add background information
    today = datetime.today()
    weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
    weekday = weekday_names[today.weekday()]
    formatted_date = today.strftime("%Y-%m-%d") + " " + weekday
    ground_info = f'''Today's date is: {formatted_date}.'''

    instruction_prompt = f"""
Please generate the next move according to the UI screenshot, instruction and previous actions.

Instruction: {ground_info}{instruction}

Previous actions:
{previous_actions_str}"""

    messages = [{"role": "system", "content": [{"text": system_prompt}]}]

    history_len = min(history_n, len(history_output))
    if history_len > 0:
        for history_id, history_item in enumerate(history_output[-history_n:], 0):
            if history_id == 0:
                messages.append({
                    "role": "user",
                    "content": [
                        {"text": instruction_prompt},
                        {"image": "file://" + history_item['image']}
                    ]
                })
            else:
                messages.append({
                    "role": "user",
                    "content": [{"image": "file://" + history_item['image']}]
                })
            messages.append({
                "role": "assistant",
                "content": [{"text": history_item['output']}],
            })
        messages.append({
            "role": "user",
            "content": [
                {"text": "<tool_response>\n"},
                {"text": history_output[-1]['tool_response']},
                {"image": "file://" + image},
                {"text": "\n</tool_response>"}
            ]
        })
    else:
        messages.append({
            "role": "user",
            "content": [
                {"text": instruction_prompt},
                {"image": "file://" + image}
            ]
        })
    return messages


# ===================== Step 3: Parse model output and map coordinates =====================

def extract_tool_calls(text):
    pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE)
    blocks = pattern.findall(text)
    actions = []
    for blk in blocks:
        blk = blk.strip()
        try:
            actions.append(json.loads(blk))
        except json.JSONDecodeError as e:
            print(f'Parsing failed: {e} | Snippet: {blk[:80]}...')
    return actions

def smart_resize(height, width, factor=32, min_pixels=32*32*4, max_pixels=32*32*1280, max_long_side=8192):
    def round_by_factor(number, factor):
        return round(number / factor) * factor
    def ceil_by_factor(number, factor):
        return math.ceil(number / factor) * factor
    def floor_by_factor(number, factor):
        return math.floor(number / factor) * factor

    if height < 2 or width < 2:
        raise ValueError(f"height:{height} or width:{width} must be at least 2")
    elif max(height, width) / min(height, width) > 200:
        raise ValueError(f"absolute aspect ratio must be smaller than 200, got {height} / {width}")

    if max(height, width) > max_long_side:
        beta = max(height, width) / max_long_side
        height, width = int(height / beta), int(width / beta)

    h_bar = round_by_factor(height, factor)
    w_bar = round_by_factor(width, factor)

    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = floor_by_factor(height / beta, factor)
        w_bar = floor_by_factor(width / beta, factor)
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = ceil_by_factor(height * beta, factor)
        w_bar = ceil_by_factor(width * beta, factor)
    return h_bar, w_bar


# ===================== Step 4: GUI operation utility class =====================

class AdbTools:
    def __init__(self, adb_path, device=None):
        self.adb_path = adb_path
        self.device = device
        self.__device_str__ = f" -s {device} " if device is not None else ' '
        self.image_info = None

    def adb_shell(self, command):
        command = self.adb_path + self.__device_str__ + command
        subprocess.run(command, capture_output=True, text=True, shell=True)

    def load_image_info(self, path):
        width, height = Image.open(path).size
        self.image_info = (width, height)

    def get_screenshot(self, image_path, retry_times=3):
        command = self.adb_path + (f" -s {self.device}" if self.device is not None else '') + f" exec-out screencap -p > {image_path}"

        for i in range(retry_times):
            subprocess.run(command, capture_output=True, text=True, shell=True)
            if os.path.exists(image_path):
                self.load_image_info(image_path)
                return True
            else:
                time.sleep(0.1)
        else:
            return False

    def click(self, x, y, coordinate_size=None):
        command = self.adb_path + self.__device_str__ + f" shell input tap {x} {y}"
        subprocess.run(command, capture_output=True, text=True, shell=True)

    def long_press(self, x, y, time=800):
        command = self.adb_path + self.__device_str__ + f" shell input swipe {x} {y} {x} {y} {time}"
        subprocess.run(command, capture_output=True, text=True, shell=True)

    def slide(self, x1, y1, x2, y2, coordinate_size=None, slide_time=800):
        command = self.adb_path + self.__device_str__ + f" shell input swipe {x1} {y1} {x2} {y2} {slide_time}"
        subprocess.run(command, capture_output=True, text=True, shell=True)

    def back(self):
        command = self.adb_path + self.__device_str__ + f"  shell input keyevent 4"
        subprocess.run(command, capture_output=True, text=True, shell=True)

    def home(self):
        command = self.adb_path + self.__device_str__ + f" shell am start -a android.intent.action.MAIN -c android.intent.category.HOME"
        subprocess.run(command, capture_output=True, text=True, shell=True)

    def type(self, text):
        escaped_text = text.replace('"', '\\"').replace("'", "\\'")
        command_list = [
            f"shell ime enable com.android.adbkeyboard/.AdbIME ",
            f"shell ime set com.android.adbkeyboard/.AdbIME ",
            0.1,
            f'shell am broadcast -a ADB_INPUT_TEXT --es msg "{escaped_text}" ',
            0.1,
            f"shell ime disable com.android.adbkeyboard/.AdbIME"
        ]

        for command in command_list:
            if isinstance(command, float):
                time.sleep(command)
            elif isinstance(command, str):
                subprocess.run(self.adb_path + self.__device_str__ + command.strip(), capture_output=True, text=True, shell=True)


# ===================== Step 5: Complete automation flow =====================

def run_mobile_automation(instruction, adb_path, max_step=50):
    dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")
    dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'
    model_name = 'gui-plus-2026-02-26'
    
    adb_tools = AdbTools(adb_path=adb_path)
    adb_tools.home()
    time.sleep(1)
    
    task_dir = "mobile_automation"
    if os.path.exists(task_dir):
        import shutil
        shutil.rmtree(task_dir)
    os.mkdir(task_dir)
    
    history = []
    session_id = str(uuid.uuid4())
    print('session_id ', session_id)

    print(f"[Task] {instruction}")
    print("=" * 60)

    for step_id in range(max_step):
        print(f'\n[Step {step_id + 1}]')
        screen_shot = os.path.join(task_dir, f'screen_shot_{step_id}.png')
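        # Stop early if the screenshot could not be captured after the built-in retries
        if not adb_tools.get_screenshot(screen_shot):
            raise RuntimeError(f'Failed to capture screenshot: {screen_shot}')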

        messages = get_messages(screen_shot, instruction, history, mobile_system_prompt)

        retry_time = 3
        for _ in range(retry_time):
            response = dashscope.MultiModalConversation.call(
                model=model_name,
                messages=messages,
                vl_high_resolution_images=True,
                headers={"x-dashscope-gui-session-id": session_id},
                stream=False
            )
            print(response['request_id'])
            try:
                output_text = response.output.choices[0].message.content[0]['text']
                break
            except Exception as e:
                print(response)
                print(e)
        else:
            raise Exception(f'Model call failed after {retry_time} attempts')
        print(f"[Model output]\n{output_text}\n")

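        # Reuse the Step 3 parser instead of splitting the raw string by hand
        tool_calls = extract_tool_calls(output_text)
        if not tool_calls:
            raise ValueError('No valid <tool_call> block found in the model output')
        action = tool_calls[0]
        action_parameter = action['arguments']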
        
        dummy_image = Image.open(screen_shot)
        resized_height, resized_width = smart_resize(
            dummy_image.height, dummy_image.width,
            factor=16, min_pixels=3136, max_pixels=1003520*200
        )
        
        for key in ['coordinate', 'coordinate1', 'coordinate2']:
            if key in action_parameter:
                action_parameter[key][0] = int(action_parameter[key][0]/1000 * resized_width)
                action_parameter[key][1] = int(action_parameter[key][1]/1000 * resized_height)

        action_type = action_parameter['action']
        tool_response = ''

        if action_type == 'click':
            adb_tools.click(action_parameter['coordinate'][0], action_parameter['coordinate'][1])
            print(f"✓ Click ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})")
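        elif action_type == 'long_press':
            # The model reports the press duration in seconds; adb expects milliseconds
            press_ms = int(action_parameter.get('time', 0.8) * 1000)
            adb_tools.long_press(action_parameter['coordinate'][0], action_parameter['coordinate'][1], time=press_ms)
            print(f"✓ Long press ({press_ms} ms)")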
        elif action_type == 'type':
            adb_tools.type(action_parameter['text'])
            print(f"✓ Type text: {action_parameter['text']}")
        elif action_type in ['scroll', 'swipe']:
            adb_tools.slide(action_parameter['coordinate'][0], action_parameter['coordinate'][1],
                           action_parameter['coordinate2'][0], action_parameter['coordinate2'][1])
            print(f"✓ Swipe")
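        elif action_type == 'system_button':
            system = action_parameter['button']
            if system == 'Back':
                adb_tools.back()
            elif system == 'Home':
                adb_tools.home()
            elif system == 'Menu':
                # Assumption: "Menu" maps to KEYCODE_APP_SWITCH (187), the recent-apps view
                adb_tools.adb_shell('shell input keyevent 187')
            elif system == 'Enter':
                adb_tools.adb_shell('shell input keyevent 66')  # KEYCODE_ENTER
            print(f"✓ System button: {system}")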
        elif action_type == 'wait':
            time.sleep(action_parameter.get('time', 2))
            print(f"✓ Wait {action_parameter.get('time', 2)} seconds")
        elif action_type == 'terminate':
            print(f"✓ Task complete")
            tool_response = f"Task terminated with status: {action_parameter.get('status', 'success')}"
            break
        else:
            print(f"Unknown action type: {action_type}")
            tool_response = f"Unknown action type: {action_type}"

        history.append({'output': output_text, 'tool_response': tool_response, 'image': screen_shot})
        time.sleep(2)
    
    print("\n" + "=" * 60)
    print(f"[Complete] Executed {len(history)} steps in total")


if __name__ == '__main__':
    # Note: You need to enter your own ADB path
    run_mobile_automation(
        instruction='Help me search for the price of Jinan Sheraton Hotel today on Ctrip',
        adb_path="/path/to/adb",
        max_step=50
    )
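
The parsing and coordinate-mapping steps in the script above can be checked without a device or an API call. The following is a minimal sketch, assuming the model reports coordinates on the 1000x1000 grid declared in the system prompt. `parse_action` and `scale_point` are hypothetical helper names, not part of the DashScope SDK, and the sample model output and the 1080x2400 screen size are invented for illustration.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_action(output_text):
    """Return the arguments of the first <tool_call> block, or None if absent or invalid."""
    match = TOOL_CALL_RE.search(output_text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1).strip())["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None


def scale_point(point, width, height, grid=1000):
    """Map an (x, y) point from the grid x grid space to pixel coordinates."""
    x, y = point
    return int(x / grid * width), int(y / grid * height)


# Invented sample output that follows the "Action + <tool_call>" response format
sample_output = (
    "Action: Tap the search box.\n"
    "<tool_call>\n"
    '{"name": "mobile_use", "arguments": {"action": "click", "coordinate": [500, 250]}}\n'
    "</tool_call>"
)

arguments = parse_action(sample_output)
pixel_x, pixel_y = scale_point(arguments["coordinate"], width=1080, height=2400)
print(arguments["action"], pixel_x, pixel_y)
```

Returning `None` for missing or malformed blocks lets the caller decide whether to retry the model call or stop the task, rather than failing on an index error.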

More usage

Usage notes

Image restrictions

The gui-plus model has the following specific requirements for input images:

  • Supported image formats:

    | Image format | Common extensions | MIME type |
    |---|---|---|
    | BMP | .bmp | image/bmp |
    | JPEG | .jpe, .jpeg, .jpg | image/jpeg |
    | PNG | .png | image/png |
    | TIFF | .tif, .tiff | image/tiff |
    | WEBP | .webp | image/webp |
    | HEIC | .heic | image/heic |

  • Image size: A single image cannot exceed 10 MB. If you provide a Base64-encoded image, the encoded string must be smaller than 10 MB. For more information, see Pass local files. To reduce the file size, see Image or video compression methods.

  • Dimensions and aspect ratio: The width and height of the image must both be greater than 10 pixels. The aspect ratio (the ratio of the long side to the short side) of the image must not exceed 200.

  • Total pixels: The model accepts images with any total number of pixels, but it scales them down internally to a pixel limit that depends on the max_pixels and vl_high_resolution_images parameters. Images that exceed this limit lose detail.
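
The restrictions above can be checked before an image is uploaded. The following is a minimal sketch, assuming that 10 MB means 10 * 1024 * 1024 bytes and that the caller already knows the file size and pixel dimensions. `check_image` is a hypothetical helper, not part of any SDK.

```python
SUPPORTED_EXTENSIONS = {".bmp", ".jpe", ".jpeg", ".jpg", ".png", ".tif", ".tiff", ".webp", ".heic"}
MAX_FILE_BYTES = 10 * 1024 * 1024  # assumption: 10 MB is interpreted as 10 MiB
MIN_SIDE_PIXELS = 10               # width and height must both be greater than this
MAX_ASPECT_RATIO = 200             # long side divided by short side


def check_image(extension, file_bytes, width, height):
    """Return a list of human-readable problems; an empty list means the image passes."""
    problems = []
    if extension.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension}")
    if file_bytes > MAX_FILE_BYTES:
        problems.append(f"file is larger than 10 MB: {file_bytes} bytes")
    if width <= MIN_SIDE_PIXELS or height <= MIN_SIDE_PIXELS:
        problems.append(f"width and height must both exceed 10 pixels: {width}x{height}")
    elif max(width, height) / min(width, height) > MAX_ASPECT_RATIO:
        problems.append(f"aspect ratio exceeds 200: {width}x{height}")
    return problems


print(check_image(".png", 2_000_000, 1080, 2400))
print(check_image(".gif", 2_000_000, 5000, 20))
```

The function collects every violation instead of stopping at the first one, so a single call reports everything that needs fixing.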

Image input methods

  • Public URL: Provide a publicly accessible image address that uses the HTTP or HTTPS protocol. To obtain a public URL, you can upload a local image to OSS or upload the file to get a temporary URL.

  • Base64 encoding: Convert the image to a Base64-encoded string.

  • Local file path: Provide the path to the local image directly.
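
The three input methods differ only in the string placed in the `image` field of a message. The following is a minimal sketch of building that field. The `file://` prefix matches the mobile automation script in this topic, while the `data:<MIME type>;base64,<data>` form for Base64 input is an assumption that should be confirmed in Pass local files. The helper names and the example URL and path are hypothetical.

```python
import base64


def image_from_url(url):
    """Content entry for a publicly accessible HTTP or HTTPS image."""
    return {"image": url}


def image_from_local_path(path):
    """Content entry for a local file, using the file:// prefix."""
    return {"image": "file://" + path}


def image_from_base64(raw_bytes, mime_type="image/png"):
    """Content entry for Base64 input (assumed data URL format)."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {"image": f"data:{mime_type};base64,{encoded}"}


print(image_from_url("https://example.com/screenshot.png"))
print(image_from_local_path("/tmp/screenshot.png"))
print(image_from_base64(b"abc"))
```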

Billing and rate limiting

  • Rate limiting: For the rate limiting conditions of the Qwen GUI-Plus model, see Rate limiting.

  • Free quota: The model provides a free quota of 1 million tokens each for input and output, valid for 90 days from the date you activate Model Studio or your model application is approved.

  • Billing: Total cost = Input tokens × Input price + Output tokens × Output price. For the input and output prices, see the Model list.

    Rules for converting images to tokens

    Image tokens = (h_bar * w_bar) / token_pixels + 2.

    • h_bar, w_bar: The height and width of the scaled image. The model preprocesses the image before processing, scaling it down to a specific pixel limit. This limit depends on the values of the max_pixels and vl_high_resolution_images parameters. For more information, see GUI-Plus API reference.

    • token_pixels represents the number of pixels per visual token, currently fixed at 28 * 28 (which is 784).

    The following code demonstrates the approximate image scaling logic used by the model, which can be used to estimate the token count for an image. For actual billing details, refer to the API response.

    # Install the Pillow library with the following command: pip install Pillow
    import math
    from PIL import Image
    
    factor = 28
    def token_calculate(image_path, max_pixels, vl_high_resolution_images):
        # Open the specified image file
        image = Image.open(image_path)
    
        # Get the original dimensions of the image
        height = image.height
        width = image.width
    
        # Round the height and width to the nearest multiple of the factor
        h_bar = round(height / factor) * factor
        w_bar = round(width / factor) * factor
    
        # Lower limit for image tokens: 4 tokens
        min_pixels = 4 * factor * factor
        # If vl_high_resolution_images is True, max_pixels is ignored and the pixel limit is 16384 * 28 * 28,
        # which caps an image at 16384 visual tokens (16386 including the two marker tokens).
        # Otherwise, the limit is the max_pixels value passed in.
        if vl_high_resolution_images:
            max_pixels = 16384 * factor * factor
    
        # Scale the image to ensure the total number of pixels is within the range [min_pixels, max_pixels]
        if h_bar * w_bar > max_pixels:
            # Calculate the scaling factor beta so that the total pixels of the scaled image do not exceed max_pixels
            beta = math.sqrt((height * width) / max_pixels)
            # Recalculate the adjusted height and width
            h_bar = math.floor(height / beta / factor) * factor
            w_bar = math.floor(width / beta / factor) * factor
        elif h_bar * w_bar < min_pixels:
            # Calculate the scaling factor beta so that the total pixels of the scaled image are not less than min_pixels
            beta = math.sqrt(min_pixels / (height * width))
            # Recalculate the adjusted height and width
            h_bar = math.ceil(height * beta / factor) * factor
            w_bar = math.ceil(width * beta / factor) * factor
        return h_bar, w_bar
    
    if __name__ == "__main__":
        # Replace xxx/test.jpg with the path to your local image
        h_bar, w_bar = token_calculate("xxx/test.jpg", max_pixels=16384 * 28 * 28, vl_high_resolution_images=False)
        print(f"The scaled image dimensions are: height {h_bar}, width {w_bar}")
        # The system automatically adds the <vision_bos> and <vision_eos> visual markers (1 token each)
        token = int((h_bar * w_bar) / (28 * 28)) + 2
        print(f"The number of tokens for the image is {token}")

  • View bills: You can view your bills or top up your account on the Expenses and Costs page in the Alibaba Cloud Management Console.
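
The billing formula and the image token rule can be combined into a rough cost estimate. The following is a minimal sketch, assuming the prices listed for gui-plus in the model table (CNY 1.5 per million input tokens and CNY 4.5 per million output tokens). The function names and token counts are hypothetical, and the usage reported in the API response remains the authoritative figure.

```python
TOKEN_PIXELS = 28 * 28  # pixels per visual token


def image_tokens(h_bar, w_bar):
    """Tokens for one scaled image, including the two visual marker tokens."""
    return (h_bar * w_bar) // TOKEN_PIXELS + 2


def estimate_cost(input_tokens, output_tokens, input_price=1.5, output_price=4.5):
    """Cost in CNY; prices are per million tokens (assumed gui-plus list prices)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# A screenshot scaled to 1120 x 560 pixels, plus 200 text tokens of prompt
screenshot_tokens = image_tokens(h_bar=1120, w_bar=560)
cost = estimate_cost(input_tokens=screenshot_tokens + 200, output_tokens=100)
print(f"Image tokens: {screenshot_tokens}, estimated cost: CNY {cost:.6f}")
```

In a multi-step task every step sends at least one screenshot, so image tokens usually dominate the input side of the bill.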

API reference

For the input and output parameters of the Qwen GUI-Plus model, see GUI-Plus API reference.

Error codes

If a model call fails and returns an error message, see Error codes to resolve the issue.

Recommended prompts for the GUI-Plus model

Computer System Prompt

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:

<tools>
{
  "type": "function",
  "function": {
    "name": "computer_use",
    "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.
* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.
* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.
* The screen's resolution is {resized_width}x{resized_height}.
* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.
* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.
* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.",
    "parameters": {
      "properties": {
        "action": {
          "description": "The action to perform. The available actions are:
* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.
* `type`: Input a string of text. Use the `clear` parameter to decide whether to overwrite the existing text, and use the `enter` parameter to decide whether the enter key should be pressed after typing the text.
* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.
* `click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.
* `drag`: Click at a specified (x, y) pixel coordinate on the screen, and drag the cursor to another specified (x2, y2) pixel coordinate on the screen.
* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.
* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.
* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.
* `scroll`: Performs a scroll of the mouse scroll wheel.
* `wait`: Wait specified seconds for the change to happen.
* `call_user`: Call the user when the task is unsolvable, or when you need the user's help, such as log in or close the pop up.
* `terminate`: Terminate the current task and report its completion status.",
          "enum": ["key", "type", "mouse_move", "click", "drag", "right_click", "middle_click", "double_click", "scroll", "wait", "call_user", "terminate"],
          "type": "string"
        },
        "keys": {
          "description": "Required only by `action=key`.",
          "type": "array"
        },
        "text": {
          "description": "Required only by `action=type`.",
          "type": "string"
        },
        "clear": {
          "description": "Assign it to 1 if the text should overwrite the existing text, otherwise assign it to 0. Using this argument clears all text in an element. Required only by `action=type`.",
          "type": "number"
        },
        "enter": {
          "description": "Assign it to 1 if the enter key should be pressed after typing the text, otherwise assign it to 0. Required only by `action=type`.",
          "type": "number"
        },
        "coordinate": {
          "description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to.",
          "type": "array"
        },
        "coordinate2": {
          "description": "(x2, y2): The x2 (pixels from the left edge) and y2 (pixels from the top edge) coordinates to drag the cursor to. Required only by `action=drag`.",
          "type": "array"
        },
        "pixels": {
          "description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. This value should be between -5 and 5. Required only by `action=scroll`.",
          "type": "number"
        },
        "time": {
          "description": "The seconds to wait. Required only by `action=wait`.",
          "type": "number"
        },
        "status": {
          "description": "The status of the task. Required only by `action=terminate`.",
          "type": "string",
          "enum": ["success", "failure"]
        }
      },
      "required": ["action"],
      "type": "object"
    }
  }
}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:

<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

Mobile System Prompt

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:

<tools>
{
  "type": "function",
  "function": {
    "name": "mobile_use",
    "description": "Use a touchscreen to interact with a mobile device, and take screenshots.
* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.
* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.
* The screen's resolution is {resized_width}x{resized_height}.
* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.",
    "parameters": {
      "properties": {
        "action": {
          "description": "The action to perform. The available actions are:
* `key`: Perform a key event on the mobile device.
    - This supports adb's `keyevent` syntax.
    - Examples: \"volume_up\", \"volume_down\", \"power\", \"camera\", \"clear\".
* `click`: Click the point on the screen with coordinate (x, y).
* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.
* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).
* `type`: Input the specified text into the activated input box.
* `system_button`: Press the system button.
* `open`: Open an app on the device.
* `wait`: Wait specified seconds for the change to happen.
* `terminate`: Terminate the current task and report its completion status.",
          "enum": ["key", "click", "long_press", "swipe", "type", "system_button", "open", "wait", "terminate"],
          "type": "string"
        },
        "coordinate": {
          "description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.",
          "type": "array"
        },
        "coordinate2": {
          "description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.",
          "type": "array"
        },
        "text": {
          "description": "Required only by `action=key`, `action=type`, and `action=open`.",
          "type": "string"
        },
        "time": {
          "description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.",
          "type": "number"
        },
        "button": {
          "description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`",
          "enum": ["Back", "Home", "Menu", "Enter"],
          "type": "string"
        },
        "status": {
          "description": "The status of the task. Required only by `action=terminate`.",
          "type": "string",
          "enum": ["success", "failure"]
        }
      },
      "required": ["action"],
      "type": "object"
    }
  }
}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:

<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
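
Both the Computer System Prompt and the Mobile System Prompt contain the `{resized_width}` and `{resized_height}` placeholders. The following is a minimal sketch of filling them in, assuming the values are the dimensions of the scaled screenshot. Because the prompts also contain literal JSON braces, the sketch uses `str.replace` rather than `str.format`, which would treat every brace as a format field. `fill_resolution` is a hypothetical helper, and the shortened template stands in for the full prompt text.

```python
def fill_resolution(prompt_template, resized_width, resized_height):
    """Substitute the two resolution placeholders and leave all other braces untouched."""
    return (
        prompt_template
        .replace("{resized_width}", str(resized_width))
        .replace("{resized_height}", str(resized_height))
    )


# Shortened stand-in for the full system prompt
template = (
    '{"name": "mobile_use", "description": '
    '"* The screen\'s resolution is {resized_width}x{resized_height}."}'
)

system_prompt = fill_resolution(template, resized_width=1092, resized_height=2408)
print(system_prompt)
```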