AI-powered O&M

DataWorks AI-powered O&M is an operational capability driven by DataWorks Copilot that provides comprehensive health assessments and issue diagnosis for task instances. It analyzes multiple dimensions, including dependency chains, resource levels, historical run trends, change impacts, log anomalies, and data quality, and automatically generates structured diagnostic reports. These reports reveal the root cause of a problem and provide specific solutions and one-click operational actions. The goal is to help you shift from reactive troubleshooting to proactive problem detection and prevention, which improves O&M efficiency.

Features

AI-powered O&M is a one-stop, intelligent tool for task operations in DataWorks. It is an upgrade to the original intelligent O&M feature. When you encounter issues such as task failures, slow execution, or resource contention, DataWorks AI-powered O&M automatically analyzes the entire task lifecycle, quickly pinpoints the root cause, and provides solutions with one-click operational actions.

Core capabilities:

  • Comprehensive diagnosis: Covers every task state, from not running and waiting to running and completed, whether successful or failed. The diagnostic scope extends from a single instance or workflow to an entire workspace. It provides a comprehensive diagnosis by analyzing dependencies, resource usage, historical performance, and log content, and supports contextual follow-up questions.

  • Root cause analysis: Goes beyond presenting error logs by correlating multi-dimensional information to pinpoint the fundamental cause of an issue.

  • Interactive operations: Allows you to issue O&M commands, such as rerunning an instance, setting it as successful, or modifying a resource group, directly in the chat dialog. Complex operations are reduced to one-click buttons.

Quick start

This section guides you through a complete diagnostic process for a typical scenario: troubleshooting a failed task instance.

  1. Start a diagnosis

    1. Navigate to Operation Center > Scheduled Instance and locate the failed target instance.

    2. Click the instance name to expand its DAG. Hover over the instance and, in the shortcut menu that appears, click the AI Diagnostics button.

      In the DAG, failed nodes such as ods_user_info_d are marked with a red × icon and a red border, while their downstream nodes are shown in a "not run" state. The AI Diagnostics button is located near the line connecting the failed node to its upstream node.

  2. Wait for the AI analysis

    After you click the button, the DataWorks Copilot assistant automatically opens on the right side of the page and displays "DataWorks Copilot is processing...". While you wait, Copilot shows the analysis steps it is performing, which helps you follow the AI's reasoning. The following is a typical diagnostic flow; you can expand any step to view details.

    The analysis performs steps such as the following in sequence: Query task instance status and information, Query internal nodes of the workflow, Analyze failed instance logs, Analyze and diagnose, Query task instance logs, Query task publishing changes, Query instance operation records, and Query instance code. When the analysis is complete, Copilot displays "DataWorks Copilot has completed execution!".

    When you expand the Analyze and diagnose step, Copilot presents a detailed diagnosis, including three sections: Problem Analysis, Possible Causes, and Solution Suggestions. For example, if a task fails because an exclusive resource group has expired, Problem Analysis indicates that the error message is com.alibaba.phoenix.error.BillException, showing that a billing check failed. Possible Causes lists reasons such as the resource group's subscription period ending without renewal, configuration errors or accidental deletion, account arrears, or permission changes. Solution Suggestions guides you to the Resource Group Management page to check the resource group status and renew it, or to temporarily switch to another available resource group in the task's scheduling configuration.

  3. Interpret the diagnostic report

    After about 10 seconds, Copilot returns a structured diagnostic report. You should focus on the following sections:

    • Anomaly Detection: This is the primary conclusion to review. It identifies instance anomalies through a comprehensive analysis and deduces the root cause based on the context.

    • Analysis Process: This section lists the evidence chain that the AI used to reach its conclusions, increasing the credibility of the results.

    • Solution and prevention suggestions: This provides specific, actionable steps to fix the issue and long-term recommendations to prevent it from recurring.

    Anomaly Detection
    • Root Cause: The task failed because the exclusive resource group (ID: 70b78c03-d6b9-4234-bde9-e0da1e6e8c33) has expired. The system detected that the resource group was unavailable when trying to start the task, which threw a BillException and terminated the task execution.
    Analysis Process
    1. Initial Status Check:
       • Called get_task_instance and found the instance status was "Failed".
    2. Log Analysis:
       • Used analyze_task_instance_log and get_task_instance_log to view detailed error logs, confirming the failure reason as "exclusive resource group has expired".
    3. Change and Operation Record Check:
       • Used list_task_changes and found the last change was on 2026-02-01, which was an update to the task itself.
       • list_task_operations shows that the resource group for this task was changed on 2026-02-01 (from yunshi_dataxcdc_pre_resource to dgc_checker_test_rescource_dlf), but this did not resolve the issue.
    4. Code Review:
       • Reviewed the task code (get_task_instance_code) and found it to be a standard PyODPS data writing script with no apparent logical errors or improper resource requests.

  4. Implement the solution

    Follow the recommendations provided in the report.

    • Immediate measures: The report typically provides shortcut actions directly related to the problem. For example, for a resource group issue, it may offer a shortcut to modify the task's resource group. You can simply reply with Yes, and the AI will assist you in modifying the resource group.

      Copilot's Solution and prevention suggestions section includes both Immediate measures (go to the Exclusive Resource Group Management page to check the resource group status, renew the service, or reassign the task to a valid resource group) and Long-term optimization (establish a monitoring mechanism for resource group expiration and periodically review task resource group dependencies).

      Copilot displays a Modify Instance Resource Group confirmation form, automatically filling in parameters such as Environment, Task Instance ID List, Workspace, Resource Group, and Is Business Process Instance. After confirming the information is correct, click Confirm and Execute to complete the modification.

    • Interactive operations: If the report does not provide a specific action, you can continue to enter commands in the dialog box to resolve the issue. For example, you can type "modify the resource group for task xxx", and Copilot will guide you through the process. Because the interaction is in natural language, Copilot can interpret requests that depend on the conversation context, which suits O&M scenarios that do not map to a predefined action.

      Copilot automatically queries the task instance status and displays a confirmation form with fields such as Environment, Task Instance ID List, Workspace, Resource Group, and Is Business Process Instance. After confirming the information is correct, click Confirm and Execute to complete the operation.

Note

The diagnostic report and suggested solutions vary with the cause of the failure and with your specific environment. For a list of supported operations, see Supported O&M Operations.
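If you keep diagnostic reports for later review, for example in an incident ticket, the three report sections described in step 3 of the quick start map naturally onto a small record type. The following is a minimal sketch under stated assumptions: the DiagnosticReport and AnalysisStep classes are hypothetical names invented for this illustration and are not part of any DataWorks SDK, and the sample values are copied by hand from the example report above.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisStep:
    """One numbered entry in the report's Analysis Process section."""
    title: str
    findings: list[str] = field(default_factory=list)


@dataclass
class DiagnosticReport:
    """Hypothetical container mirroring the three report sections."""
    root_cause: str                     # Anomaly Detection
    analysis: list[AnalysisStep]        # Analysis Process
    immediate_measures: list[str]       # Solution and prevention suggestions
    long_term_optimizations: list[str]  # Solution and prevention suggestions

    def summary(self) -> str:
        """Render the report as indented plain text."""
        lines = [
            "Anomaly Detection",
            f"  Root Cause: {self.root_cause}",
            "Analysis Process",
        ]
        for number, step in enumerate(self.analysis, start=1):
            lines.append(f"  {number}. {step.title}")
            lines.extend(f"     - {finding}" for finding in step.findings)
        lines.append("Solution and prevention suggestions")
        lines.extend(f"  Immediate: {item}" for item in self.immediate_measures)
        lines.extend(f"  Long-term: {item}" for item in self.long_term_optimizations)
        return "\n".join(lines)


# Values transcribed manually from the example report in this topic.
report = DiagnosticReport(
    root_cause="The exclusive resource group has expired.",
    analysis=[
        AnalysisStep(
            "Initial Status Check",
            ['get_task_instance: instance status is "Failed"'],
        ),
        AnalysisStep(
            "Log Analysis",
            ["analyze_task_instance_log: exclusive resource group has expired"],
        ),
    ],
    immediate_measures=["Renew the resource group or switch to a valid one."],
    long_term_optimizations=["Monitor resource group expiration."],
)

print(report.summary())
```

Keeping the root cause separate from the evidence chain makes it easy to search archived reports by cause while still retaining the steps that support each conclusion.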

Usage notes

  • For workspace-level diagnostics or when many instances are involved, the response may be delayed by 1 to 5 minutes.

  • Cross-workspace dependency analysis is supported, but you must be a member of the target workspace to view analysis details.

Enable AI diagnosis

AI-powered O&M can be accessed from multiple entry points in DataWorks.

Global entry point (Copilot)

On any DataWorks page, open the Copilot dialog box in the upper-right corner, switch Copilot to the Agent mode, and select /Data O&M.

To start a diagnosis, enter Diagnose instance: <Instance ID>, or use @<Instance ID> to provide context.

Note

At the global entry point, you must use /Data O&M to specify the agent. In contextual entry points, this is not required as the O&M agent is used by default.

Contextual entry points

| Location | Actions |
| --- | --- |
| Operation Center > AI-powered O&M | In Operation Center, click AI-powered O&M in the left-side navigation pane. |
| Operation Center > Instance List | In the Actions column, click More > AI Diagnostics. This allows you to diagnose cycle, test, and backfill instances. |
| Operation Center > DAG | Hover over a node instance and click the AI Diagnostics button. |
| Instance Run Log tab | On the log diagnosis page, click the AI Diagnostics button at the top to automatically open Copilot and submit the diagnosis command. |
| Log Diagnosis page | In the dialog box in the middle of the page, enable AI Diagnostics, enter an instance or workspace ID, and start the diagnosis. |

Note: The original "Intelligent Diagnosis" button has been renamed to Log Diagnosis and now analyzes the current log content.

Supported diagnostic scenarios

Instance-level issues

| Issue type | Example command |
| --- | --- |
| Task failure | Diagnose instance: <Instance ID> or use @<Instance ID> to provide context. |
| Slow execution | Why is instance <Instance ID> running slower today? |
| Long wait time | Check why instance <Instance ID> is still waiting. |
| Dependency blocking | Which parent nodes of instance <Instance ID> have failed? |
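When you need to diagnose several instances, it can help to generate the command text rather than type it each time. The sketch below only builds the strings listed in the example commands above so that you can paste them into the Copilot dialog; it does not call any DataWorks API. The dictionary keys, the build_command function, and the sample instance ID are hypothetical and exist only for this illustration.

```python
# Command wording copied from the example commands above.
# The keys are hypothetical labels chosen for this sketch.
COMMAND_TEMPLATES = {
    "task_failure": "Diagnose instance: {instance_id}",
    "slow_execution": "Why is instance {instance_id} running slower today?",
    "long_wait_time": "Check why instance {instance_id} is still waiting.",
    "dependency_blocking": "Which parent nodes of instance {instance_id} have failed?",
}


def build_command(issue_type: str, instance_id: str) -> str:
    """Return the chat command text for one instance."""
    if issue_type not in COMMAND_TEMPLATES:
        raise ValueError(f"unsupported issue type: {issue_type}")
    cleaned_id = instance_id.strip()
    if not cleaned_id:
        raise ValueError("instance_id must not be empty")
    return COMMAND_TEMPLATES[issue_type].format(instance_id=cleaned_id)


# "1234567890" is a made-up placeholder, not a real instance ID.
print(build_command("long_wait_time", "1234567890"))
```

Rejecting unknown issue types and empty IDs up front avoids pasting a malformed command into the dialog.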

Supported O&M operations

In the diagnostic report or Copilot dialog, you can perform the following operations on tasks or instances within a workspace, either individually or in batches:

Important

You must review and authorize any operation in the AI dialog box before it is executed.

| Actions | Description |
| --- | --- |
| Rerun instance | Reruns the current instance. |
| Set as successful | Forcibly marks the instance as successful. |
| Suspend/Resume instance | Controls the scheduling state. |
| Modify resource group | Switches the resource group. |
| Modify priority | Adjusts the scheduling priority, which affects baseline scheduling. |
| Refresh instance | Refreshes the instance configuration to its latest state. |

To perform these operations, you must have the Project Owner or O&M role for the target workspace.
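The two preconditions stated in this section, a permitted role and an explicit confirmation before execution, can be pictured as a simple gate. The following sketch is purely illustrative and assumes nothing about how DataWorks implements these checks: the Operation enum, OperationRequest class, can_execute function, and sample values are all hypothetical names, and only the operation labels and role names come from this topic.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Operation(Enum):
    """Operation labels taken from the supported operations list."""
    RERUN = "Rerun instance"
    SET_SUCCESSFUL = "Set as successful"
    SUSPEND_RESUME = "Suspend/Resume instance"
    MODIFY_RESOURCE_GROUP = "Modify resource group"
    MODIFY_PRIORITY = "Modify priority"
    REFRESH = "Refresh instance"


# Roles named in this topic as required for the target workspace.
ALLOWED_ROLES = {"Project Owner", "O&M"}


@dataclass(frozen=True)
class OperationRequest:
    """Hypothetical request record; not a DataWorks data type."""
    operation: Operation
    workspace: str
    instance_ids: tuple[str, ...]
    user_role: str
    confirmed: bool = False  # set only after the user reviews the form


def can_execute(request: OperationRequest) -> tuple[bool, str]:
    """Return (allowed, reason) for a single or batch operation."""
    if request.user_role not in ALLOWED_ROLES:
        return False, "role not permitted"
    if not request.instance_ids:
        return False, "no instances selected"
    if not request.confirmed:
        return False, "awaiting user confirmation"
    return True, "ok"


# The workspace name and instance ID are made-up placeholders.
request = OperationRequest(
    operation=Operation.MODIFY_RESOURCE_GROUP,
    workspace="demo_workspace",
    instance_ids=("1234567890",),
    user_role="O&M",
)

print(can_execute(request))
print(can_execute(replace(request, confirmed=True)))
```

Modeling the confirmation as an explicit flag on an immutable request reflects the rule that nothing runs until you have reviewed and authorized the operation in the dialog box.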