A same-cycle dependency requires a node to wait for its ancestor instance from the same cycle to complete successfully before it can run. Use this dependency when a node consumes data produced by an ancestor on the same day (the current cycle). DataWorks offers several ways to configure these dependencies, including a preview feature that helps you review and correct them to ensure your tasks are scheduled as expected.
How it works
You create a scheduling dependency by matching a node output with a node input. This links an ancestor node's output to a descendant node's input. After you configure the dependency, the descendant node starts only after the ancestor node runs successfully. Before you begin, we recommend you identify the dependency targets and types based on your table lineage. For more information, see Scheduling dependency configuration guide.
Node output
A node output, also known as the output name of the current node, identifies the node for other nodes to depend on. It does not represent the actual data the node produces. Other nodes use this output name to specify it as an ancestor dependency.
DataWorks automatically generates two output names for each node:
- projectName.randomNumber_out: This output is globally unique and cannot be modified or deleted.
- projectName.nodeName: This output includes the node name and can be modified. The output name remains unchanged even if the node is renamed.
DataWorks also allows you to add outputs manually or by automatically parsing inputs and outputs from your code. Support for automatic parsing varies by node type. For details, see Comparison of automatic parsing results.
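As a minimal sketch, the two default output names can be pictured as simple string patterns. The function name and the way the random number is supplied are assumptions for illustration; this document does not describe how DataWorks generates the number.

```python
def default_output_names(project_name: str, node_name: str, random_number: int) -> list[str]:
    """Build the two default output names in the formats described above.

    The random number is passed in by the caller (hypothetical); how DataWorks
    actually chooses it is not specified here.
    """
    return [
        f"{project_name}.{random_number}_out",  # globally unique, cannot be modified or deleted
        f"{project_name}.{node_name}",          # based on the node name, can be modified
    ]


names = default_output_names("old_ide", "mc2", 505487297)
print(names)
```

Because the second name is only a default, renaming the node later does not change it; it must be edited explicitly if you want it to follow the new node name.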
Node input
A node input specifies the ancestor nodes that the current node depends on. You can specify a dependency by using the ancestor's output name (recommended), node name, or node ID.
A node ID is generated only after the ancestor node is submitted to the production environment.
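The matching of node inputs to node outputs can be sketched as a lookup from output name to owning node. This is an illustrative model only, not the DataWorks implementation; the `Node` class and `resolve_ancestors` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    outputs: set[str] = field(default_factory=set)  # output names other nodes can reference
    inputs: set[str] = field(default_factory=set)   # ancestor output names this node depends on


def resolve_ancestors(nodes: list[Node]) -> dict[str, set[str]]:
    """Map each node name to the ancestor nodes whose outputs match its inputs."""
    owner_of = {out: n.name for n in nodes for out in n.outputs}
    result: dict[str, set[str]] = {}
    for n in nodes:
        missing = n.inputs - set(owner_of)
        if missing:
            # An input that no node declares as an output cannot be resolved.
            raise LookupError(f"{n.name}: no node declares output(s) {sorted(missing)}")
        result[n.name] = {owner_of[i] for i in n.inputs}
    return result


mc1 = Node("mc1", outputs={"yunwan_lingyi.dws_user_info_all_di"})
mc2 = Node("mc2", inputs={"yunwan_lingyi.dws_user_info_all_di"})
deps = resolve_ancestors([mc1, mc2])
print(deps["mc2"])
```

In this model, a descendant is linked to an ancestor purely through the shared output name, which is why output names need to be unique.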
Configuration guidelines
To improve development efficiency, we recommend using the automatic parsing feature to quickly configure dependencies for your nodes. When using automatic parsing, follow these guidelines:
- Node creation: We recommend naming each node after its output table.
- Code development: Avoid having multiple nodes write data to the same table.
- Dependency configuration: We recommend setting the table that a node produces as that node's output.
Entry points and methods
Go to the edit page of the Data Studio node, click Schedule Settings in the right-side navigation pane, and configure the scheduling dependencies for the node in the Scheduling Dependency section.
- Parent Nodes: Specifies the ancestor nodes that the current task depends on.
- Output Name of Current Node: Defines the output names through which other tasks can establish dependencies on this node.
- During code editing, dependencies are configured based on table lineage by default. DataWorks automatically checks whether the dependencies match the data lineage upon submission. You can choose whether to enable the automatic parsing before submission feature. For more information, see Configure automatic parsing before submission.
- If the current node depends on data that an upstream node produced on the previous day, or if an hourly or minute-level task depends on its own instance from the previous cycle, configure cross-cycle dependencies.
- If the current node and its upstream node have different scheduling frequencies, such as a daily task that depends on an hourly task, or an hourly task that depends on hourly tasks with a different frequency, see Configure dependencies between tasks with different scheduling frequencies.
You can configure dependencies in the following three ways. Regardless of the method, the underlying mechanism remains the same.
Configure node dependencies by parsing table lineage from code
Automatic parsing analyzes the table lineage in a node's code and automatically configures the node's output names and upstream dependencies. After parsing, tables that the node writes to are automatically added as node outputs in the projectname.tablename format, and tables that the node reads from are automatically added as node inputs.
For example, when a node uses SELECT on a table, that table is automatically parsed as an upstream dependency of the node. When a node uses INSERT on a table, that table is automatically parsed as an output of the node. For the keywords supported by automatic parsing for each node type, see Keywords supported by automatic parsing for each node type.
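The idea of deriving inputs and outputs from code can be sketched with two simplified regular expressions. This is a rough illustration under stated assumptions, not the DataWorks parser: it looks only at FROM/JOIN for tables that are read and INSERT OVERWRITE/INTO for tables that are written, and the `parse_lineage` name is hypothetical.

```python
import re

# Simplified patterns; a real SQL parser handles far more syntax than this.
_READ = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
_WRITE = re.compile(
    r"\bINSERT\s+(?:OVERWRITE|INTO)\s+(?:TABLE\s+)?([A-Za-z_][\w.]*)", re.IGNORECASE
)


def parse_lineage(sql: str, project: str) -> tuple[set[str], set[str]]:
    """Return (inputs, outputs) as projectname.tablename strings."""
    def qualify(table: str) -> str:
        # Tables without a project prefix are assumed to belong to `project`.
        return table if "." in table else f"{project}.{table}"

    inputs = {qualify(t) for t in _READ.findall(sql)}
    outputs = {qualify(t) for t in _WRITE.findall(sql)}
    return inputs, outputs


sql = """
INSERT OVERWRITE TABLE ads_user_info_1d PARTITION (dt='${workflow.var}')
SELECT uid, COUNT(0) AS pv
FROM dws_user_info_all_di
WHERE dt = '${workflow.var}'
GROUP BY uid;
"""
inputs, outputs = parse_lineage(sql, "yunwan_lingyi")
print(inputs, outputs)
```

The parsed outputs become the node's output names, and the parsed inputs are matched against the output names of other nodes to find ancestors.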
- Configure dependencies
Automatic parsing supports two methods: Parse Inputs and Outputs from Code and Automatic Parsing Before Committing. Both methods work on the same principle. Automatic parsing before committing automatically parses inputs and outputs when you submit the code and prompts you to configure dependencies.
For example, ODPS node mc2 depends on the output table dws_user_info_all_di of node mc1. The code of node mc2 is as follows:

    INSERT OVERWRITE TABLE ads_user_info_1d PARTITION (dt='${workflow.var}')
    SELECT  uid
            , MAX(region)
            , MAX(device)
            , COUNT(0) AS pv
            , MAX(gender)
            , MAX(age_range)
            , MAX(zodiac)
    FROM    dws_user_info_all_di
    WHERE   dt = '${workflow.var}'
    GROUP BY uid;

After you click Parse Inputs and Outputs from Code, the input of this node is parsed as the dws_user_info_all_di table, and the output table name and the name of the upstream node are automatically matched:

| Upstream node output name | Upstream node output table name | Upstream node name | Node ID | Workspace | Owner | Schedule | Method | Recent run status | Action |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| yunwan_lingyi.dws_user_info_all_di | | mc1 | - | Test workspace | lingyi01_testcloud_com | Day | Code parsing | No data | Delete |
At the same time, the output of this node is parsed as the ads_user_info_1d table. The parsing results are as follows:

| Output name | Output table name | Downstream node name | Owner | Method | Downstream node affected baselines | Action |
| --- | --- | --- | --- | --- | --- | --- |
| old_ide.505487297_out | - | - | - | Added by system | - | Delete |
| old_ide.mc2 | - | - | - | Manually added | - | Delete |
| yunwan_lingyi.ads_user_info_1d | yunwan_lingyi.ads_user_info_1d | - | - | Code parsing | - | Delete |
Node mc2 is now configured with a dependency on node mc1.
- Modify dependencies from code parsing
If the parsed dependencies are not what you expect, or if a parsed table does not support scheduling dependencies (for example, a table whose data is not produced by a periodically scheduled node), modify the automatically parsed dependencies as described below.
- Manually delete parsing results: In the upstream dependency list, delete the unexpected input, and then parse the code again. After the deletion, a corresponding comment is automatically added to the code so that the dependency is not re-added during the next parsing:
    --@exclude_input=Remove input
    --@exclude_output=Remove output
- Manually add inputs and outputs: Right-click a table name in the code editor and select Add Input or Add Output. After the input or output is added, a corresponding comment is automatically added to the code:
    --@extra_output=Add output
    --@extra_input=Add input
  Alternatively, you can add dependencies by using the methods described in Manually add upstream node dependencies from the schedule settings panel or Set node dependencies by dragging connections in the workflow panel.
Important: DataWorks does not allow you to directly delete a node output that has existing downstream dependencies, because doing so would cause downstream task execution or data retrieval errors. We recommend that you first adjust the downstream business logic, remove the upstream dependency from the downstream node, and then delete the node output from the upstream node.
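The effect of the exclude and extra comments on a parsing result can be sketched as follows. This sketch assumes each comment sits on its own line in the form `--@key=value`, where the value is a single output or table name; that value format and the `apply_annotations` function are assumptions for illustration, not documented behavior.

```python
import re

# Assumed comment shape: one annotation per line, value without spaces.
_ANNOTATION = re.compile(
    r"^\s*--@(exclude_input|exclude_output|extra_input|extra_output)=(\S+)\s*$",
    re.MULTILINE,
)


def apply_annotations(
    code: str, inputs: set[str], outputs: set[str]
) -> tuple[set[str], set[str]]:
    """Adjust parsed inputs/outputs according to annotation comments in the code."""
    inputs, outputs = set(inputs), set(outputs)  # copy so the caller's sets are untouched
    for key, name in _ANNOTATION.findall(code):
        if key == "exclude_input":
            inputs.discard(name)
        elif key == "exclude_output":
            outputs.discard(name)
        elif key == "extra_input":
            inputs.add(name)
        else:  # extra_output
            outputs.add(name)
    return inputs, outputs


code = "--@exclude_input=proj.tmp_src\n--@extra_output=proj.report\nSELECT 1;"
new_inputs, new_outputs = apply_annotations(
    code, {"proj.tmp_src", "proj.src"}, {"proj.dst"}
)
print(sorted(new_inputs), sorted(new_outputs))
```

Keeping the adjustment in the code itself means the same exclusions and additions are reapplied every time the code is parsed again.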
- Scenarios excluded from automatic parsing
Temporary tables defined in workspace table management with a fixed naming format (for example, tables prefixed with t_) are not automatically parsed as node outputs or upstream dependencies.
- Considerations for automatic parsing
When you use automatic parsing to configure dependencies, make sure that node outputs are unique within the current region. When developing in DataWorks, note the following:
- Node creation: Each node has a default output with the same name as the node. If nodes with the same name exist in the same workspace, you must manually modify the node output of one of them.
- Code development: Automatic parsing uses the output table of a node as the node output. If two scheduled nodes in the same workspace insert data into the same table, automatic parsing will cause an error for one of the nodes. For more information, see Multiple nodes write data to the same table, and automatic parsing reports duplicate node output names.
- Dependency configuration: When SQL tasks process the output tables of batch synchronization tasks, lineage-based automatic parsing can link an SQL task to the batch synchronization task only if the synchronization node exposes that table as a node output. Either manually configure the output table of the batch synchronization node as a node output, or name the batch synchronization node after its output table (the platform automatically creates a node output with the same name as the node). Otherwise, submitting the downstream SQL node may fail with the error "The parent node output name ${projectname.tablename} that the current node depends on does not exist. The current node cannot be submitted. Make sure that the parent node with this output name has been submitted."
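The uniqueness requirement can be checked with a short sketch that groups output names by the nodes declaring them. The function name and the input shape (node name mapped to its output names) are assumptions for illustration.

```python
from collections import defaultdict


def find_duplicate_outputs(node_outputs: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return each output name declared by more than one node, with the declaring nodes."""
    owners: dict[str, list[str]] = defaultdict(list)
    for node, outputs in node_outputs.items():
        for out in outputs:
            owners[out].append(node)
    return {out: sorted(nodes) for out, nodes in owners.items() if len(nodes) > 1}


duplicates = find_duplicate_outputs(
    {
        "load_a": {"proj.orders"},
        "load_b": {"proj.orders"},   # writes the same table as load_a
        "report": {"proj.report"},
    }
)
print(duplicates)
```

Any name reported here corresponds to the situation described above, where two nodes write to the same table or share a node name, and one of the outputs must be changed by hand.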
Manually add upstream node dependencies from the schedule settings panel
In the Schedule Settings > Scheduling Dependency > Parent Nodes section, manually add upstream node dependencies by entering the node output, node name, or node ID of the target node. Because node names can be duplicated, we recommend using node outputs to configure dependencies.
Set node dependencies by dragging connections in the workflow panel
When you set dependencies by dragging connections in the DAG panel of a workflow, DataWorks automatically adds the upstream node's _out output to the downstream node, creating a node dependency.
When a dependency connection is deleted from the workflow panel, the corresponding dependency is also removed from the node's schedule settings.
Impact of deleting or changing node outputs
When changes to the data produced by a node result in changes to the node output, or when you manually modify a node output, note the following:
- Deleting a node output does not directly affect the data produced by the node.
- If a node output already has downstream dependencies, changing or deleting it may severely impact downstream tasks.
- Output table deletion: When a node output configured through automatic parsing changes because the output table changes, downstream nodes may become orphan nodes that are not scheduled, or downstream data may be corrupted because of missing data dependencies.
- Output table change: If the table produced by the current node needs to be transferred to another node, see Transfer a node output table to another node.
Before you delete an output name that has downstream dependencies, we recommend notifying the owners of the downstream tasks so that they can adjust their dependencies in time and prevent their tasks from becoming orphan tasks.
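Before deleting an output, it can help to list which nodes reference it. The sketch below models this as a scan over each node's inputs; the function names and data shape are hypothetical, and "orphaned" here simply means a node whose only input is the output being deleted.

```python
def downstream_of_output(output_name: str, node_inputs: dict[str, set[str]]) -> list[str]:
    """Return the nodes that list output_name among their inputs."""
    return sorted(n for n, ins in node_inputs.items() if output_name in ins)


def would_be_orphaned(output_name: str, node_inputs: dict[str, set[str]]) -> list[str]:
    """Return the nodes whose only input is output_name, so no ancestor would remain."""
    return sorted(n for n, ins in node_inputs.items() if ins == {output_name})


node_inputs = {
    "mc2": {"yunwan_lingyi.dws_user_info_all_di"},
    "mc3": {"yunwan_lingyi.dws_user_info_all_di", "yunwan_lingyi.dim_region"},
    "mc4": {"yunwan_lingyi.dim_region"},
}
affected = downstream_of_output("yunwan_lingyi.dws_user_info_all_di", node_inputs)
orphans = would_be_orphaned("yunwan_lingyi.dws_user_info_all_di", node_inputs)
print(affected, orphans)
```

The first list identifies the task owners to notify; the second highlights the nodes most at risk of no longer being scheduled.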
Next steps: Verify that dependencies are as expected
After you complete the configuration, verify that the dependencies are correct to ensure that tasks are scheduled as expected:
- Preview dependencies: Verify that the dependencies are correct to prevent task scheduling delays caused by unexpected dependencies.
- Submission check: Confirm that dependency changes are as expected when you submit a node.
- Scheduled task dependency verification: After you deploy a node, verify that the dependencies of the production scheduled task in Operation Center are as expected. A scheduled task represents the latest state of the task in the production environment. The instance dependencies of scheduled instances are related to the instance generation method.
For more information, see Verify scheduling dependencies.
FAQ
For more frequently asked questions, see Scheduling dependencies.
Best practices
To configure node dependencies across workspaces or across workflows within the same workspace, see Configure node dependencies across workspaces or workflows.