Node development


DataWorks Data Studio provides nodes for different data processing tasks: data integration nodes for synchronization; engine compute nodes such as MaxCompute SQL, Hologres SQL, and EMR Hive for data cleaning; and general-purpose nodes such as virtual (zero-load) nodes and do-while nodes for complex logic processing.

Supported node types

The following table lists the node types supported by recurring schedules. Supported node types for one-time tasks and manually triggered workflows may differ.

Note
  • Node availability varies by edition and region. For the most accurate information, see the UI.

  • Some nodes cannot be run in a workflow. See the node details for specifics.

| Node type | Node name | Description | Node code | Task type |
| --- | --- | --- | --- | --- |
| Data Integration | Batch synchronization | Synchronizes data in recurring batches between various data sources. For more information about the data sources supported by batch synchronization, see Supported data sources and synchronization solutions. | 23 | DI |
| Data Integration | Real-time synchronization | Synchronizes data changes from a source to a destination database in real time. You can synchronize a single table or an entire database to maintain data consistency. For more information about the data sources supported by real-time synchronization, see Supported data sources and synchronization solutions. | 900 | RI |
| Notebook | Notebook | Notebook offers a flexible and interactive platform for data processing and analysis. Its intuitive, modular, and interactive environment streamlines data processing, exploration, visualization, and model building. | 1323 | NOTEBOOK |
| MaxCompute | MaxCompute SQL | Schedules recurring MaxCompute SQL tasks. These tasks use an SQL-like syntax for distributed processing of massive (terabyte-scale) datasets where real-time performance is not critical. | 10 | ODPS_SQL |
| MaxCompute | SQL script template | An SQL code template with multiple input and output parameters. It filters, joins, and aggregates data from source tables to generate a result table. Use these predefined components to quickly build data processing flows, which significantly improves development efficiency. | 1010 | COMPONENT_SQL |
| MaxCompute | MaxCompute Script | Combines multiple SQL statements into a single script for compilation and execution. This is ideal for complex query scenarios, such as nested subqueries or multi-step operations. Submitting the entire script at once generates a unified execution plan, meaning the job is queued and run only once, which improves resource utilization. | 24 | ODPS_SQL_SCRIPT |
| MaxCompute | PyODPS 2 | Integrates the Python SDK for MaxCompute. Use this node to write and edit Python code for data processing and analysis tasks in MaxCompute. | 221 | PY_ODPS |
| MaxCompute | PyODPS 3 | Use a PyODPS 3 node to write MaxCompute jobs directly in Python and configure them for recurring scheduling. | 1221 | PYODPS3 |
| MaxCompute | MaxCompute Spark | Runs offline Spark jobs in cluster mode on MaxCompute. | 225 | ODPS_SPARK |
| MaxCompute | MaxCompute MR | Write and schedule MapReduce programs using the Java API to process large-scale datasets in MaxCompute. | 11 | ODPS_MR |
| MaxCompute | Map metadata to Hologres | To accelerate queries on MaxCompute data, use this feature to map MaxCompute table metadata to Hologres. You can then use Hologres foreign tables to query the data in MaxCompute directly. | - | - |
| MaxCompute | Sync data to Hologres | Synchronizes data from a single MaxCompute table to Hologres, enabling efficient big data analysis and real-time queries. | - | - |
| Hologres | Hologres SQL | Queries data in Hologres instances. Because Hologres and MaxCompute are seamlessly connected, you can use this node to query and analyze large-scale data in MaxCompute by using standard PostgreSQL statements, delivering rapid results without data migration. | 1093 | HOLOGRES_SQL |
| Hologres | Sync data to MaxCompute | Migrates data from a single Hologres table to MaxCompute. | 1070 | HOLOGRES_SYNC_DATA_TO_MC |
| Hologres | MaxCompute table schema synchronization | Quickly creates Hologres foreign tables in batches by importing the schemas of source MaxCompute tables. | 1094 | HOLOGRES_SYNC_DDL |
| Hologres | MaxCompute data synchronization | Quickly synchronizes data from MaxCompute to a Hologres database. | 1095 | HOLOGRES_SYNC_DATA |
| Serverless Spark | Serverless Spark Batch | A Serverless Spark node for large-scale data processing. | 2100 | SERVERLESS_SPARK_BATCH |
| Serverless Spark | Serverless Spark SQL | An SQL query node that is based on Serverless Spark. It supports standard SQL syntax and provides high-performance data analysis capabilities. | 2101 | SERVERLESS_SPARK_SQL |
| Serverless Spark | Serverless Kyuubi | Connects to Serverless Spark through the Kyuubi JDBC/ODBC interface to provide a multi-tenant Spark SQL service. | 2103 | SERVERLESS_KYUUBI |
| Serverless StarRocks | Serverless StarRocks SQL | An SQL node that is based on E-MapReduce Serverless StarRocks. It is compatible with the SQL syntax of open source StarRocks and provides high-speed online analytical processing (OLAP) and data lakehouse query analysis. | 2104 | SERVERLESS_STARROCKS |
| Large language model (LLM) | LLM node | Uses a built-in engine that intelligently performs data cleaning, processing, analysis, and mining based on your natural language instructions. | 2200 | LLM_NODE |
| Flink | Flink SQL Streaming | Defines real-time task processing logic with standard SQL. This node is easy to use and features rich SQL support, powerful state management, and fault tolerance. It is compatible with both event time and processing time, offers flexible scalability, integrates with systems like Kafka and HDFS, and provides detailed logs and performance monitoring tools. | 2012 | FLINK_SQL_STREAM |
| Flink | Flink SQL Batch | Allows you to use standard SQL statements to define and execute data processing tasks. It is suitable for analyzing and transforming large datasets, including data cleaning and aggregation. This node supports visual configuration and provides an efficient and flexible solution for large-scale batch processing. | 2011 | FLINK_SQL_BATCH |
| E-MapReduce | EMR Hive | Uses SQL-like statements to read, write, and manage large datasets, enabling efficient analysis and development of massive log data. | 227 | EMR_HIVE |
| E-MapReduce | EMR Impala | An interactive SQL query engine for fast, real-time queries on petabyte-scale data. | 260 | EMR_IMPALA |
| E-MapReduce | EMR MR | Breaks down large datasets into multiple parallel Map tasks, which significantly improves data processing efficiency. | 230 | EMR_MR |
| E-MapReduce | EMR Presto | A flexible and scalable distributed SQL query engine that supports interactive analysis of big data using the standard SQL query language. | 259 | EMR_PRESTO |
| E-MapReduce | EMR Shell | Lets you edit custom Shell scripts to use advanced features such as data processing, calling Hadoop components, and file operations. | 257 | EMR_SHELL |
| E-MapReduce | EMR Spark | A general-purpose big data analysis engine known for its high performance, ease of use, and wide applicability. It supports complex in-memory computing and is ideal for building large-scale, low-latency data analysis applications. | 228 | EMR_SPARK |
| E-MapReduce | EMR Spark SQL | Uses a distributed SQL query engine to process structured data and improve job execution efficiency. | 229 | EMR_SPARK_SQL |
| E-MapReduce | EMR Spark Streaming | Processes high-throughput, real-time streaming data. It features a fault tolerance mechanism for quick recovery of failed data streams. | 264 | EMR_SPARK_STREAMING |
| E-MapReduce | EMR Trino | A distributed SQL query engine suitable for interactive analysis across multiple data sources. | 267 | EMR_TRINO |
| E-MapReduce | EMR Kyuubi | A distributed and multi-tenant gateway that provides SQL and other query services for data lake query engines such as Spark, Flink, or Trino. | 268 | EMR_KYUUBI |
| ADB | AnalyticDB for PostgreSQL | Lets you develop and schedule recurring AnalyticDB for PostgreSQL tasks. | 1000090 | - |
| ADB | AnalyticDB for MySQL | Lets you develop and schedule recurring AnalyticDB for MySQL tasks. | 1000126 | - |
| ADB | AnalyticDB Spark | Lets you develop and schedule recurring AnalyticDB Spark tasks. | 1990 | ADB_SPARK |
| ADB | ADB Spark SQL | Lets you develop and schedule recurring AnalyticDB Spark SQL tasks. | 1991 | ADB_SPARK_SQL |
| CDH | CDH Hive | Use this node if you have deployed a CDH cluster and want to use DataWorks to run Hive tasks. | 270 | CDH_HIVE |
| CDH | CDH Spark | A general-purpose big data analysis engine that features high performance, ease of use, and wide applicability. It supports complex in-memory analysis and is ideal for building large-scale, low-latency data analysis applications. | 271 | CDH_SPARK |
| CDH | CDH Spark SQL | Uses a distributed SQL query engine to process structured data and improve job execution efficiency. | 272 | CDH_SPARK_SQL |
| CDH | CDH MR | Processes massive datasets. | 273 | CDH_MR |
| CDH | CDH Presto | This node provides a distributed SQL query engine, which enhances the data analysis capabilities of the CDH environment. | 278 | CDH_PRESTO |
| CDH | CDH Impala | The CDH Impala node allows you to write and run Impala SQL scripts for faster query performance. | 279 | CDH_IMPALA |
| Lindorm | Lindorm Spark | Lets you develop and schedule recurring Lindorm Spark tasks. | 1800 | LINDORM_SPARK |
| Lindorm | Lindorm Spark SQL | Lets you develop and schedule recurring Lindorm Spark SQL tasks. | 1801 | LINDORM_SPARK_SQL |
| ClickHouse | ClickHouse SQL | Performs distributed SQL queries and processes structured data to improve job execution efficiency. | 1301 | CLICK_SQL |
| Data Quality | Data quality monitoring | You can configure data quality monitoring rules to monitor the data quality of tables in a data source, such as checking for bad data. You can also customize scheduling policies to periodically run monitoring jobs for data validation. | 1333 | DATA_QUALITY_MONITOR |
| Data Quality | Data comparison | Use this node to compare data from different tables in various ways. | 1331 | DATA_SYNCHRONIZATION_QUALITY_CHECK |
| General | Virtual node | A control-type, dry-run node that produces no data. It is typically used as the root node in a workflow to organize nodes and business processes. | 99 | VIRTUAL |
| General | Assignment node | Passes parameters between nodes. Its built-in output parameter passes the result of the last query or its own output to downstream nodes through the node context. | 1100 | CONTROLLER_ASSIGNMENT |
| General | Shell node | The Shell node supports standard Shell syntax but does not support interactive syntax. | 6 | DIDE_SHELL |
| General | Parameter node | Aggregates parameters from ancestor nodes and passes them to descendant nodes. | 1115 | PARAM_HUB |
| General | OSS object check node | Triggers the execution of descendant nodes by monitoring an OSS object. | 239 | OSS_INSPECT |
| General | Python node | Supports Python 3. It allows you to obtain upstream parameters and configure custom parameters by using the scheduling parameters in the scheduling configuration. You can also pass its own output as parameters to downstream nodes. | 1322 | PYTHON |
| General | Merge node | Merges the running statuses of ancestor nodes to resolve dependency and execution trigger issues for the descendant nodes of a branch node. | 1102 | CONTROLLER_JOIN |
| General | Branch node | Evaluates the result of an ancestor node to determine which branch of logic to follow. You can use this node together with an assignment node. | 1101 | CONTROLLER_BRANCH |
| General | For-each node | Traverses the result set passed by an assignment node. | 1106 | CONTROLLER_TRAVERSE |
| General | Do-while node | Loops through a part of the node logic. You can also use it with an assignment node to loop through the results passed by the assignment node. | 1103 | CONTROLLER_CYCLE |
| General | Check node | Checks whether a target object is available. If the check policy is met, the node runs successfully and triggers downstream tasks. The following target objects are supported: MaxCompute partitioned table, FTP file, OSS file, HDFS, and OSS-HDFS. | 241 | CHECK_NODE |
| General | Function Compute | Used for recurring scheduling of event-driven functions. | 1330 | FUNCTION_COMPUTE |
| General | HTTP trigger | Triggers a DataWorks task upon the completion of a task in an external scheduling system. Note: DataWorks no longer supports creating cross-tenant collaboration nodes. If you are using a cross-tenant collaboration node, replace it with an HTTP trigger node, which provides the same capabilities. | 1114 | SCHEDULER_TRIGGER |
| General | SSH | Allows you to specify an SSH data source to remotely access the host connected to that data source from DataWorks and trigger a script to run on the remote host. | 1321 | SSH |
| General | Data push | Pushes query results from other nodes in a Data Studio workflow to a configured destination. Supported destinations include DingTalk groups, Lark groups, WeCom groups, Teams, and email. | 1332 | DATA_PUSH |
| Database node | MySQL node | Lets you develop and schedule recurring MySQL tasks. | 1000125 | - |
| Database node | SQL Server | Lets you develop and schedule recurring SQL Server tasks. | 10001 | - |
| Database node | Oracle node | Lets you develop and schedule recurring Oracle tasks. | 10002 | - |
| Database node | PostgreSQL node | Lets you develop and schedule recurring PostgreSQL tasks. | 10003 | - |
| Database node | StarRocks node | Lets you develop and schedule recurring StarRocks tasks. | 10004 | - |
| Database node | DRDS node | Lets you develop and schedule recurring DRDS tasks. | 10005 | - |
| Database node | PolarDB for MySQL node | Lets you develop and schedule recurring PolarDB for MySQL tasks. | 10006 | - |
| Database node | PolarDB for PostgreSQL node | Lets you develop and schedule recurring PolarDB for PostgreSQL tasks. | 10007 | - |
| Database node | Doris node | Lets you develop and schedule recurring Doris tasks. | 10008 | - |
| Database node | MariaDB node | Lets you develop and schedule recurring MariaDB tasks. | 10009 | - |
| Database node | SelectDB node | Lets you develop and schedule recurring SelectDB tasks. | 10010 | - |
| Database node | Redshift node | Lets you develop and schedule recurring Redshift tasks. | 10011 | - |
| Database node | SAP HANA node | Lets you develop and schedule recurring SAP HANA tasks. | 10012 | - |
| Database node | Vertica node | Lets you develop and schedule recurring Vertica tasks. | 10013 | - |
| Database node | DM (Dameng) node | Lets you develop and schedule recurring DM tasks. | 10014 | - |
| Database node | KingbaseES node | Lets you develop and schedule recurring KingbaseES tasks. | 10015 | - |
| Database node | OceanBase node | Lets you develop and schedule recurring OceanBase tasks. | 10016 | - |
| Database node | DB2 node | Lets you develop and schedule recurring DB2 tasks. | 10017 | - |
| Database node | GBase 8a node | Lets you develop and schedule recurring GBase 8a tasks. | 10018 | - |
| Algorithm | PAI Designer | PAI Designer is a visual modeling tool for building end-to-end machine learning development workflows. | 1117 | PAI_STUDIO |
| Algorithm | PAI DLC | PAI DLC is a container-based training service used to run distributed training tasks. | 1119 | PAI_DLC |
| Algorithm | PAI Flow | Generates a PAI Flow node in DataWorks for a PAI knowledge base indexing workflow. | 1250 | PAI_FLOW |
| Logic node | SUB_PROCESS node | The SUB_PROCESS node integrates multiple workflows into a unified whole for management and scheduling. | 1122 | SUB_PROCESS |
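
If you reference nodes from your own scripts, it can help to keep the node code and task type pairs from the table above in one lookup structure. The following is a minimal sketch that copies a subset of rows from the table. The helper names `task_type_for` and `node_code_for` are hypothetical and are not part of any DataWorks SDK.

```python
from typing import Optional

# A subset of rows copied from the table above: node code -> task type.
NODE_CODE_TO_TASK_TYPE = {
    23: "DI",
    900: "RI",
    10: "ODPS_SQL",
    24: "ODPS_SQL_SCRIPT",
    221: "PY_ODPS",
    1221: "PYODPS3",
    1093: "HOLOGRES_SQL",
    227: "EMR_HIVE",
    99: "VIRTUAL",
    6: "DIDE_SHELL",
    1322: "PYTHON",
}

# Reverse index, built once, so lookups work in both directions.
TASK_TYPE_TO_NODE_CODE = {
    task_type: code for code, task_type in NODE_CODE_TO_TASK_TYPE.items()
}


def task_type_for(node_code: int) -> Optional[str]:
    """Return the task type for a node code, or None if it is not in this subset."""
    return NODE_CODE_TO_TASK_TYPE.get(node_code)


def node_code_for(task_type: str) -> Optional[int]:
    """Return the node code for a task type, or None if it is not in this subset."""
    return TASK_TYPE_TO_NODE_CODE.get(task_type)


print(task_type_for(1221))
print(node_code_for("EMR_HIVE"))
```

Both helpers return `None` for entries that are not in the subset, so extend the dictionary with the rows you need. Some nodes in the table, such as the database nodes, list no task type.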

Create a node

Create a node for a scheduled workflow

To run a task automatically on a recurring schedule, such as hourly, daily, or weekly, create a node for a scheduled workflow. You can do this by creating a new scheduled task node, adding an inner node to a scheduled workflow, or cloning an existing node.

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left navigation pane, click the Data Studio icon to go to the Data Studio page.

Create a scheduled task node

  1. Click the icon to the right of Workspace Directories, select New Node, and then select a node type.

    Important

    You can choose between Common Nodes and All Nodes. To see all available types, select All Nodes at the bottom of the list. You can then use the search box or filter by category (such as MaxCompute, Data Integration, and General) to find the node you need.

    You can create folders in advance to organize and manage your nodes.
  2. Enter a name for the node and save it to open the node editor page.

Create an inner node in a workflow

  1. Create a scheduled workflow.

  2. On the workflow canvas, click Create Node in the top toolbar. Select a node type for your task and drag it onto the canvas.

  3. Enter a name for the node and save it.

Create a node by cloning

You can use the clone feature to quickly create a new node from an existing one. Cloning copies the node's Scheduling Settings, such as Scheduling Parameters, Scheduling time, and Scheduling Dependency.

  1. In the Project Directory pane on the left, right-click the node that you want to clone and select Cloning from the context menu.

  2. In the dialog box, change the node's Name and Path or accept the defaults, then click Confirm.

  3. After the node is cloned, it appears in the Project Directory pane.

Create a node for a manual workflow

If a task does not require a recurring schedule but must be published to the production environment for on-demand execution, create an inner node in a manually triggered workflow.

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left navigation pane, click the manually triggered workflow icon to go to the manually triggered workflow page.

    1. Create a manually triggered workflow.

    2. On the manually triggered workflow editor page, click New Internal Node in the top toolbar and select a node type.

    3. Enter a name for the node and save it.

Create a manual task node

  1. Go to the Workspaces page in the DataWorks console. In the top navigation bar, select the desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.

  2. In the left navigation pane, click the manual task icon to go to the manual task page.

  3. In the Manual Tasks section at the bottom of the page, click the icon next to Manually Triggered Task, select New Node, and then choose a node type.

    Note

    Manual tasks support only the following node types: batch synchronization, Notebook, MaxCompute SQL, MaxCompute Script, PyODPS 2, MaxCompute MR, Hologres SQL, Python node, and Shell node.

  4. Enter a name for the node and save it to open the node editor page.

Batch edit nodes

Editing nodes individually in a large workflow is inefficient. The DataWorks Inner Node List feature allows you to quickly preview, search, and batch edit all nodes from a list on the right side of the canvas.

How to use

  1. Click the Show Inner Node List button in the toolbar at the top of the workflow canvas to open the right-side panel.


  2. When the panel opens, it lists all nodes in the current workflow.

    • Code preview and sorting:

      • For nodes that support code editing, such as MaxCompute SQL, the code editor expands by default.

      • Nodes without code editing support, such as virtual (zero-load) nodes, appear as cards and are automatically sorted to the bottom of the list.

    • Quick search and locating:

      • Search: Enter a keyword in the search box at the top to fuzzy search by node name.

      • Synchronized focus: Focus is synchronized between the canvas and the sidebar. Selecting a node on the canvas highlights it in the sidebar, and conversely, clicking a node in the sidebar focuses the canvas on that node.

    • Online editing:

      • Operations: Each node card's upper-right corner contains shortcuts, including Load Latest Code, Open Node, and Edit.

      • Auto-save: In edit mode, your changes save automatically when you move the cursor outside the code editor.

      • Conflict detection: If another user updates the code while you are editing it, a failure notification appears when you save. This prevents you from accidentally overwriting their changes.

    • Focus mode:

      • Select a node and click the icon in the upper-right corner of the floating window to enable Focus Mode. In this mode, the sidebar displays only the selected node, freeing up more space for code editing.
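
The conflict detection behavior described above resembles optimistic concurrency control: a save succeeds only if the stored code has not changed since the editor loaded it. The following is a minimal sketch of that general technique, not the DataWorks implementation. `StoredNode`, `SaveConflict`, and `save` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


class SaveConflict(Exception):
    """Raised when the stored code changed after the editor loaded it."""


@dataclass
class StoredNode:
    code: str
    version: int = 1  # Incremented on every successful save.


def save(node: StoredNode, new_code: str, loaded_version: int) -> int:
    """Save new_code only if nobody else has saved since loaded_version."""
    if node.version != loaded_version:
        raise SaveConflict(
            f"node is at version {node.version}, editor loaded version {loaded_version}"
        )
    node.code = new_code
    node.version += 1
    return node.version


node = StoredNode(code="SELECT 1;")

# Two editors load the same version of the node.
first_editor_version = node.version
second_editor_version = node.version

# The second editor saves first, which bumps the stored version.
save(node, "SELECT 2;", second_editor_version)

# The first editor's save is rejected instead of overwriting that change.
try:
    save(node, "SELECT 3;", first_editor_version)
except SaveConflict as exc:
    print(f"Save failed: {exc}")
```

After a rejected save, an editor would typically reload the latest code (comparable to Load Latest Code in the panel) and reapply its changes.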

Version management

Version management lets you restore a node to a previous version. It also provides tools for viewing and comparing versions to help you analyze differences and make adjustments.

  1. In the Project Directory pane on the left, double-click the target node name to open its editor.

  2. On the right side of the node editor, click Version. On the Version page, you can view and manage the node's Developer Record and Publish Record.

    • View a version:

      1. On the Developer Record or Publish Record tab, find the node version you want to view.

      2. Click View in the Operation column to open the details page. This page shows the node code and Scheduling Settings information.

        Note

        You can view the Scheduling Settings information in code editor or visualization mode. You can switch the view mode in the upper-right corner of the Scheduling Settings tab.

    • Compare versions:

      You can compare different versions of a node on the Developer Record or Publish Record tab. The following example shows how to compare versions from the Developer Record tab.

      • Compare versions in the same environment: On the Developer Record tab, select two versions and click Select Comparison at the top to compare the node code and scheduling information of the selected versions.

      • Compare versions between different environments:

        1. On the Developer Record tab, locate a version of the node.

        2. Click Compare in the Operation column. On the details page, select a version from the Publish Record or Build Records to compare with.

    • Restore a version:

      You can restore a node to a previous version only from the Developer Record tab. On this tab, find the target version and click Restore in the Operation column to restore the node's code and Scheduling Settings information to that version.
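
To illustrate what a version comparison surfaces, the following sketch diffs two versions of node code with Python's standard `difflib` module. It is an offline illustration of the idea, not how the Version page works internally. The SQL text and the `compare_versions` helper are hypothetical samples.

```python
import difflib

# Hypothetical sample: the published version of a node's SQL code.
published_version = """SELECT user_id, COUNT(*) AS orders
FROM orders
GROUP BY user_id;
"""

# Hypothetical sample: a newer development version that adds a filter.
development_version = """SELECT user_id, COUNT(*) AS orders
FROM orders
WHERE ds = '20240101'
GROUP BY user_id;
"""


def compare_versions(old: str, new: str, old_label: str, new_label: str) -> str:
    """Return a unified diff of two code versions, or an empty string if identical."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=old_label,
        tofile=new_label,
    )
    return "".join(diff)


print(compare_versions(published_version, development_version, "published", "development"))
```

Lines prefixed with `+` exist only in the newer version and lines prefixed with `-` exist only in the older one, which is the same kind of information you review before restoring a version.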


FAQ

Can I download node code, such as SQL or Python, to a local machine?

  • Answer: Direct downloads are not supported. As a workaround, you can copy the code to your local machine during development. Alternatively, the new Data Studio lets you add a local file to your personal folder for development. When you complete development, submit the code to the workspace directories.