DataWorks Data Development (DataStudio) provides various nodes for your data processing needs, including data integration nodes, computing resource nodes such as ODPS SQL, Hologres SQL, and EMR Hive, and general-purpose nodes such as virtual nodes and Check nodes.
If you cannot create a computing resource node, such as an ODPS SQL, Hologres SQL, or EMR Hive node, in Data Development, click Computing Resource in the left-side navigation pane and check whether the corresponding computing resource is attached. If the resource is attached but you still cannot create the node, refresh the page to update the cached data, or open the page in your browser's incognito mode.
Data synchronization nodes
Data integration node | Usage | Node code | TaskType
Offline synchronization | Used for periodic offline (batch) data synchronization. It supports data synchronization between various disparate data sources in complex scenarios. For more information about the data sources that offline synchronization supports, see Supported data sources and synchronization solutions. | 23 | DI
Real-time synchronization | Used for real-time synchronization of incremental data. Real-time synchronization includes three basic plug-ins: real-time read, transform, and write. These plug-ins interact through a defined intermediate data format. For more information about the data sources that real-time synchronization supports, see Supported data sources and synchronization solutions. | 900 | RI
In addition to the nodes that you create in the Data Development (DataStudio) interface, the Data Integration primary site supports various types of synchronization solutions. Examples include real-time synchronization of full and incremental data and offline synchronization of entire databases. For more information, see Synchronization task capabilities in Data Integration. The code for tasks on the Data Integration primary site is typically 24.
Engine compute nodes
In a business workflow, you can create a node that corresponds to a specific engine. You can then use this node for data development and send the code to the corresponding compute engine for execution.
Engine integrated with DataWorks | How DataWorks encapsulates engine capabilities | Node code | TaskType
MaxCompute | | 10 | ODPS_SQL
MaxCompute | | 225 | ODPS_SPARK
MaxCompute | | 221 | PY_ODPS
MaxCompute | | 1221 | PYODPS3
MaxCompute | | 24 | ODPS_SQL_SCRIPT
MaxCompute | | 11 | ODPS_MR
MaxCompute | | 1010 | COMPONENT_SQL
E-MapReduce | | 227 | EMR_HIVE
E-MapReduce | | 230 | EMR_MR
E-MapReduce | | 229 | EMR_SPARK_SQL
E-MapReduce | | 228 | EMR_SPARK
E-MapReduce | | 257 | EMR_SHELL
E-MapReduce | | 259 | EMR_PRESTO
E-MapReduce | | 260 | EMR_IMPALA
E-MapReduce | | 264 | EMR_SPARK_STREAMING
E-MapReduce | | 268 | EMR_KYUUBI
E-MapReduce | | 267 | EMR_TRINO
EMR JAR | | 231 | EMR_JAR
EMR File | | 232 | EMR_FILE
CDH | | 270 | CDH_HIVE
CDH | | 271 | CDH_SPARK
CDH | | 273 | CDH_MR
CDH | | 278 | CDH_PRESTO
CDH | | 279 | CDH_IMPALA
CDH | | 272 | CDH_SPARK_SQL
AnalyticDB for PostgreSQL | | - | -
AnalyticDB for MySQL | | 1000126 | -
Hologres | | 1093 | HOLOGRES_SQL
Hologres | | 1094 | HOLOGRES_SYNC_DDL
Hologres | | 1095 | HOLOGRES_SYNC_DATA
ClickHouse | | 1301 | CLICK_SQL
Algorithm (machine learning) | | 1117 | PAI_STUDIO
Algorithm (machine learning) | | 1119 | PAI_DLC
Database | | 1000125 | -
Database | | 10001 | -
Database | | 10002 | -
Database | | 10003 | -
Database | | 10004 | -
Database | | 10005 | -
Database | | 10006 | -
Database | | 10007 | -
Database | | 10008 | -
Database | | 10009 | -
Database | | 10010 | -
Database | | 10011 | -
Database | | 10012 | -
Database | | 10013 | -
Database | | 10014 | -
Database | | 10015 | -
Database | | 10016 | -
Database | | 10017 | -
Database | | 10018 | -
Other | | 1000023 | -
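When node codes appear in scripts or logs, a small lookup table can make them easier to read. The following is a minimal sketch, not part of DataWorks: the `NodeType` tuple, the `NODE_TYPES` dictionary, and the `describe` helper are hypothetical names, and the dictionary holds only a subset of the rows from the table above.

```python
from typing import Dict, NamedTuple, Optional


class NodeType(NamedTuple):
    """One row of the engine table (hypothetical helper type)."""
    engine: str
    task_type: Optional[str]  # None where the table shows "-"


# Subset of rows copied from the engine table; extend as needed.
NODE_TYPES: Dict[int, NodeType] = {
    10: NodeType("MaxCompute", "ODPS_SQL"),
    221: NodeType("MaxCompute", "PY_ODPS"),
    1221: NodeType("MaxCompute", "PYODPS3"),
    227: NodeType("E-MapReduce", "EMR_HIVE"),
    270: NodeType("CDH", "CDH_HIVE"),
    1093: NodeType("Hologres", "HOLOGRES_SQL"),
    1301: NodeType("ClickHouse", "CLICK_SQL"),
    1000126: NodeType("AnalyticDB for MySQL", None),
}


def describe(code: int) -> str:
    """Return a readable label for a node code, or a fallback if unknown."""
    node = NODE_TYPES.get(code)
    if node is None:
        return f"{code}: not in this subset"
    return f"{code}: {node.engine} / {node.task_type or 'no TaskType listed'}"


if __name__ == "__main__":
    for code in (10, 1000126, 42):
        print(describe(code))
```

Keying the dictionary by node code keeps lookups direct, and storing `None` for rows that list "-" avoids inventing a TaskType that the table does not provide.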
General-purpose nodes
In a business workflow, you can create nodes from the general-purpose node group and combine them with engine nodes to implement complex logic.
Business scenario | Node type | Usage notes | Node code | TaskType
Business management | | A virtual node is a control plane node. It is a dry-run node that does not generate any data. It is often used as the root node of a business workflow to help you manage nodes and workflows. | 99 | VIRTUAL
Event trigger | | You can use this node to trigger a task in DataWorks after a task in another scheduling system completes. Note: DataWorks no longer supports the creation of cross-tenant collaboration nodes. If you use cross-tenant collaboration nodes, replace them with HTTP trigger nodes. HTTP trigger nodes provide the same features. | 1114 | SCHEDULER_TRIGGER
Event trigger | | Triggers the execution of descendant nodes by monitoring the creation of OSS objects. | 239 | OSS_INSPECT
Event trigger | | Triggers the execution of descendant nodes by monitoring the creation of FTP files. Note: DataWorks officially recommends that you use a Check node instead of an FTP Check node. | 1320 | FTP_CHECK
Event trigger | | Used to check whether a target object is active. When the Check node meets the check policy, it returns a successful running state. If there are downstream dependencies, descendant tasks are triggered. The following target objects can be checked: | 241 | CHECK_NODE
Data Quality | | You can configure Data Quality monitoring rules to monitor the data quality of tables in related data sources, for example, to check for dirty data. You can also customize a scheduling policy to periodically run monitoring jobs for data validation. | 1333 | DATA_QUALITY_MONITOR
Data Quality | | A Data Comparison node lets you compare data from different tables in a workflow in multiple ways. | 1331 | DATA_SYNCHRONIZATION_QUALITY_CHECK
Parameter assignment and passing | | Used for passing parameters. The built-in output of the assignment node passes the result of the last query or output to descendant nodes through the node context feature. This implements cross-node parameter passing. | 1100 | CONTROLLER_ASSIGNMENT
Parameter assignment and passing | | Used to aggregate parameters from ancestor nodes and distribute them to descendant nodes. | 1115 | PARAM_HUB
Control | | Used to traverse the result set passed by an assignment node. | 1106 | CONTROLLER_TRAVERSE
Control | | Used to loop the execution of some node logic. You can also use it with an assignment node to loop the output of the result passed by the assignment node. | 1103 | CONTROLLER_CYCLE
Control | | Used to evaluate the result of an ancestor node and determine which branch logic to follow for different results. You can use it with an assignment node. | 1101 | CONTROLLER_BRANCH
Control | | Used to merge the running states of ancestor nodes. This resolves issues with dependency attachment and execution triggering for descendant nodes of a branch node. | 1102 | CONTROLLER_JOIN
Other | | A Shell node supports standard Shell syntax but does not support interactive syntax. | 6 | DIDE_SHELL
Other | | Used to periodically schedule and process event functions, and to integrate and jointly schedule with other types of nodes. | 1330 | FUNCTION_COMPUTE
Other | | Used to push query data from a business workflow to DingTalk groups, Lark groups, WeCom groups, and Teams. This allows team members to promptly receive and follow the latest data. | 1332 | DATA_PUSH
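The assignment, traverse, branch, and merge descriptions in the table above can be easier to follow as a plain-Python analogy. This is only an illustrative sketch, not DataWorks code: every function name, the sample result set, the threshold, and the state strings are hypothetical, and the rule that an unselected branch is treated as skipped rather than failed is an assumption made for the example.

```python
from typing import Callable, Dict, Iterable, List


def assignment_node() -> List[int]:
    """Stand-in for an assignment node: outputs the result of its last query."""
    return [3, 12, 7]  # hypothetical result set


def branch_node(value: int) -> str:
    """Stand-in for a branch node: choose a branch from an upstream result."""
    return "large" if value >= 10 else "small"  # hypothetical condition


def merge_node(states: Iterable[str]) -> str:
    """Stand-in for a merge node: collapse ancestor states into one state."""
    # Assumption: a skipped (unselected) branch does not count as a failure.
    return "FAILED" if "FAILED" in states else "SUCCESS"


def run_workflow() -> List[str]:
    """Traverse the assignment output, branch on each row, then merge states."""
    branches: Dict[str, Callable[[int], str]] = {
        "small": lambda value: "SUCCESS",
        "large": lambda value: "SUCCESS",
    }
    results = []
    for row in assignment_node():  # plays the role of the traverse node
        chosen = branch_node(row)
        states = {
            name: (task(row) if name == chosen else "SKIPPED")
            for name, task in branches.items()
        }
        results.append(f"{row}:{chosen}:{merge_node(states.values())}")
    return results


if __name__ == "__main__":
    print(run_workflow())
```

The merge step exists in the sketch for the same reason the table gives: after a branch, only one path actually runs, so downstream logic needs a single combined state to depend on.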