Configure same-cycle scheduling dependencies

A same-cycle dependency requires a node to wait for its ancestor instance from the same cycle to complete successfully before it can run. Use this dependency when a node consumes data produced by an ancestor on the same day (the current cycle). DataWorks offers several ways to configure these dependencies, including a preview feature that helps you review and correct them to ensure your tasks are scheduled as expected.

How it works

You create a scheduling dependency by matching a node output with a node input. This links an ancestor node's output to a descendant node's input. After you configure the dependency, the descendant node starts only after the ancestor node runs successfully. Before you begin, we recommend that you use your table lineage to identify which nodes to depend on and which dependency type to use. For more information, see Scheduling dependency configuration guide.

Node output

A node output, also known as the output name of the current node, identifies the node for other nodes to depend on. It does not represent the actual data the node produces. Other nodes use this output name to specify it as an ancestor dependency.

DataWorks automatically generates two output names for each node:

  • projectName.randomNumber_out: This output is globally unique and cannot be modified or deleted.

  • projectName.nodeName: This output is based on the node name and can be modified. It is not updated automatically if the node is later renamed.

DataWorks also allows you to add outputs manually or by automatically parsing inputs and outputs from your code. Support for automatic parsing varies by node type. For details, see Comparison of automatic parsing results.

Node input

A node input specifies the ancestor nodes that the current node depends on. You can specify a dependency by using the ancestor's output name (recommended), node name, or node ID.

A node ID is generated only after the ancestor node is submitted to the production environment.
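The matching of node outputs to node inputs can be illustrated with a minimal sketch. This is only a conceptual model, not DataWorks code: the Node class and the resolve_ancestors function are hypothetical names introduced for illustration, and the output names are borrowed from the example later in this topic.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a scheduled node; not a DataWorks API."""
    name: str
    outputs: set[str] = field(default_factory=set)  # output names this node exposes
    inputs: set[str] = field(default_factory=set)   # ancestor output names it depends on


def resolve_ancestors(node: Node, all_nodes: list[Node]) -> list[str]:
    """Return the names of nodes that expose an output listed in `node.inputs`."""
    return sorted(
        other.name
        for other in all_nodes
        if other is not node and other.outputs & node.inputs
    )


mc1 = Node("mc1", outputs={"yunwan_lingyi.dws_user_info_all_di"})
mc2 = Node(
    "mc2",
    outputs={"yunwan_lingyi.ads_user_info_1d"},
    inputs={"yunwan_lingyi.dws_user_info_all_di"},
)

print(resolve_ancestors(mc2, [mc1, mc2]))  # ['mc1']
print(resolve_ancestors(mc1, [mc1, mc2]))  # []
```

In this model, a dependency exists only because a string in one node's inputs equals a string in another node's outputs, which is why unique output names matter.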

Configuration guidelines

To improve development efficiency, we recommend using the automatic parsing feature to quickly configure dependencies for your nodes. When using automatic parsing, follow these guidelines:

  • Node creation: We recommend naming each node after its output table.

  • Code development: Avoid having multiple nodes write data to the same table.

  • Dependency configuration: We recommend setting the table that a node produces as that node's output.

Entry points and methods

Go to the edit page of the Data Studio node, click Schedule Settings in the right-side navigation pane, and configure the scheduling dependencies for the node in the Scheduling Dependency section.

  • Parent Nodes: Specifies the ancestor nodes that the current task depends on.

  • Output Name of Current Node: Defines the output names through which other tasks can establish dependencies on this node.

Note
  • During code editing, dependencies are configured based on table lineage by default. DataWorks automatically checks whether the dependencies match the data lineage upon submission. You can choose whether to enable the automatic parsing before submission feature. For more information, see Configure automatic parsing before submission.

  • If the current node needs to depend on data produced by an upstream node yesterday, or if an hourly/minutely task needs to depend on its own instance from the previous cycle, configure cross-cycle dependencies.

  • If the current node and its upstream node have different scheduling frequencies, for example, a daily task that depends on an hourly task, or an hourly task that depends on hourly tasks scheduled at a different frequency, see Configure dependencies between tasks with different scheduling frequencies.

You can configure dependencies in the following three ways. Regardless of the method, the underlying mechanism remains the same.

Configure node dependencies by parsing table lineage from code

Automatic parsing analyzes the table lineage in a node's code and automatically configures the node's output names and upstream dependencies. After parsing, tables that the node writes to are automatically added as node outputs in the projectname.tablename format, and tables that the node reads from are automatically added as node inputs.

For example, when a node uses SELECT on a table, that table is automatically parsed as an upstream dependency of the node. When a node uses INSERT on a table, that table is automatically parsed as an output of the node. For the keywords supported by automatic parsing for each node type, see Keywords supported by automatic parsing for each node type.
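The idea behind lineage parsing can be sketched as follows. This is a deliberately simplified illustration that assumes only INSERT OVERWRITE/INTO TABLE, FROM, and JOIN need to be recognized; the actual parser in DataWorks is not shown in this topic and handles far more cases. The parse_lineage function, the regular expressions, and the literal partition value are all hypothetical.

```python
import re

# Simplified patterns (assumption): real lineage parsing covers many more keywords.
WRITE_RE = re.compile(r"\bINSERT\s+(?:OVERWRITE|INTO)\s+TABLE\s+([\w.]+)", re.IGNORECASE)
READ_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def parse_lineage(sql: str, project: str) -> tuple[set[str], set[str]]:
    """Return (inputs, outputs) as projectname.tablename strings."""
    def qualify(table: str) -> str:
        # Prefix the project name unless the table is already qualified.
        return table if "." in table else f"{project}.{table}"

    outputs = {qualify(t) for t in WRITE_RE.findall(sql)}
    inputs = {qualify(t) for t in READ_RE.findall(sql)}
    return inputs, outputs


sql = """
INSERT OVERWRITE TABLE ads_user_info_1d PARTITION (dt='20240101')
SELECT uid, COUNT(0) AS pv
FROM dws_user_info_all_di
WHERE dt = '20240101'
GROUP BY uid;
"""
inputs, outputs = parse_lineage(sql, project="yunwan_lingyi")
print(inputs)   # {'yunwan_lingyi.dws_user_info_all_di'}
print(outputs)  # {'yunwan_lingyi.ads_user_info_1d'}
```

The table that is read becomes a node input and the table that is written becomes a node output, both in the projectname.tablename format.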
  • Configure dependencies

    Automatic parsing supports two methods: Parse Inputs and Outputs from Code and Automatic Parsing Before Committing. Both methods work on the same principle. Automatic parsing before committing automatically parses inputs and outputs when you submit the code and prompts you to configure dependencies.

    For example, ODPS node mc2 depends on the output table dws_user_info_all_di of node mc1. The code of node mc2 is as follows:

    INSERT OVERWRITE TABLE ads_user_info_1d PARTITION (dt='${workflow.var}')
    SELECT uid
      , MAX(region)
      , MAX(device)
      , COUNT(0) AS pv
      , MAX(gender)
      , MAX(age_range)
      , MAX(zodiac)
    FROM dws_user_info_all_di
    WHERE dt = '${workflow.var}'
    GROUP BY uid;

    After you click Parse Inputs and Outputs from Code, the input of this node is parsed as the dws_user_info_all_di table, and the output table name and name of the upstream node are automatically matched:

    | Upstream node output name | Upstream node output table name | Upstream node name | Node ID | Workspace | Owner | Schedule | Method | Recent run status | Action |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | yunwan_lingyi.dws_user_info_all_di | yunwan_lingyi.dws_user_info_all_di | mc1 | - | Test workspace | lingyi01_testcloud_com | Day | Code parsing | No data | Delete |

    At the same time, the output of this node is parsed as the ads_user_info_1d table. The parsing results are as follows:

    | Output name | Output table name | Downstream node name | Owner | Method | Downstream node affected baselines | Action |
    | --- | --- | --- | --- | --- | --- | --- |
    | old_ide.505487297_out | - | - | - | Added by system | - | Delete |
    | old_ide.mc2 | - | - | - | Manually added | - | Delete |
    | yunwan_lingyi.ads_user_info_1d | yunwan_lingyi.ads_user_info_1d | - | - | Code parsing | - | Delete |

    Node mc2 is now configured with a dependency on node mc1.

  • Modify dependencies from code parsing

    If the parsed dependencies are not what you expect, or if the code references tables that cannot serve as scheduling dependencies (for example, tables whose data is not produced by periodically scheduled nodes), modify the automatically parsed dependencies as follows.

    • Manually delete parsing results: In the upstream node dependency list, delete the unexpected input, and then parse the code again. After the deletion, a corresponding comment is automatically added to the code so that the dependency is not re-added during the next parsing:

          --@exclude_input=Remove input
          --@exclude_output=Remove output

    • Manually add inputs and outputs: Right-click a table name in the code editor and select Add Input or Add Output. After the input or output is added, a corresponding comment is automatically added to the code:

          --@extra_output=Add output
          --@extra_input=Add input

      Alternatively, you can add dependencies by using the methods described in Manually add upstream node dependencies from the schedule settings panel or Set node dependencies by dragging connections in the workflow panel.

    Important

    DataWorks does not allow you to directly delete a node output that has existing downstream dependencies, because removing it would cause errors in downstream task execution or data retrieval. First adjust the downstream business logic, then remove the upstream dependency from the downstream node, and finally delete the node output from the upstream node.

  • Scenarios excluded from automatic parsing

    Temporary tables are not automatically parsed as node outputs or upstream dependencies. DataWorks treats a table as temporary if its name matches the fixed naming format defined in the workspace's table management settings (for example, tables prefixed with t_).

  • Considerations for automatic parsing

    When you use automatic parsing to configure dependencies, make sure that node outputs are unique within the current region. When developing in DataWorks, note the following:

    • Node creation: Each node has a default output with the same name as the node. If nodes with the same name exist in the same workspace, you must manually modify the node output of one of them.

    • Code development: Automatic parsing uses the output table of a node as the node output. If two scheduled nodes in the same workspace insert data into the same table, automatic parsing will cause an error for one of the nodes. For more information, see Multiple nodes write data to the same table, and automatic parsing reports duplicate node output names.

    • Dependency configuration: When SQL tasks process the output tables of batch synchronization tasks, lineage-based automatic parsing can link an SQL task to a batch synchronization task only if the batch synchronization node exposes its output table as a node output. Either manually configure the output table of the batch synchronization node as a node output, or name the batch synchronization node after its output table (the platform automatically creates a node output with the same name as the node). Otherwise, submitting the downstream SQL node may fail with the following error: "The parent node output name ${projectname.tablename} that the current node depends on does not exist. The current node cannot be submitted. Make sure that the parent node with this output name has been submitted."
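The uniqueness requirement described in these considerations can be checked conceptually with a short sketch. It assumes that you already have a mapping from node names to their output names; the find_duplicate_outputs function, the node names, and the demo_project prefix are hypothetical and are not part of DataWorks.

```python
from collections import defaultdict


def find_duplicate_outputs(node_outputs: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each output name claimed by more than one node to the claiming nodes."""
    owners: dict[str, list[str]] = defaultdict(list)
    for node, outputs in node_outputs.items():
        for output in outputs:
            owners[output].append(node)
    return {out: sorted(nodes) for out, nodes in owners.items() if len(nodes) > 1}


# Hypothetical nodes: load_a and load_b both insert into the same table,
# so automatic parsing would give both the same output name.
node_outputs = {
    "load_a": {"demo_project.load_a", "demo_project.ods_orders"},
    "load_b": {"demo_project.load_b", "demo_project.ods_orders"},
    "report": {"demo_project.report"},
}
print(find_duplicate_outputs(node_outputs))
# {'demo_project.ods_orders': ['load_a', 'load_b']}
```

A non-empty result corresponds to the situation in which one of the nodes must be given a different output, or the write logic must be consolidated into a single node.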

Manually add upstream node dependencies from the schedule settings panel

In the Schedule Settings > Scheduling Dependency > Parent Nodes section, manually add upstream node dependencies by entering the node output, node name, or node ID of the target node. Because node names can be duplicated, we recommend using node outputs to configure dependencies.

Set node dependencies by dragging connections in the workflow panel

When you set dependencies by dragging connections in the DAG panel of a workflow, DataWorks automatically adds the upstream node's _out output to the downstream node, creating a node dependency.

Note

When a dependency connection is deleted from the workflow panel, the corresponding dependency is also removed from the node's schedule settings.

Impact of deleting or changing node outputs

When changes to the data produced by a node result in changes to the node output, or when you manually modify a node output, note the following:

  • Deleting a node output does not directly affect the data produced by the node.

  • If a node output already has downstream dependencies, changing or deleting it may severely impact downstream tasks.

    • Output table deletion: If a node output that was configured through automatic parsing disappears because the node's output table is changed or removed, downstream nodes may become orphan nodes that are not scheduled, or downstream data may be corrupted because of missing data dependencies.

    • Output table change: If the table produced by the current node needs to be transferred to another node, see Transfer a node output table to another node.

    If a node output has downstream dependencies, notify the owners of the downstream tasks before you delete the output name, so that they can adjust their dependencies in time and prevent the downstream tasks from becoming orphan tasks.
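Before deleting an output, it can help to list which downstream nodes reference it. The following sketch models that check under two assumptions: that you have a mapping from node names to the output names they depend on, and that a node is treated as an orphan when it is left with no inputs at all. The function names and the demo_project names are hypothetical.

```python
def dependents_of(output_name: str, node_inputs: dict[str, set[str]]) -> list[str]:
    """List the nodes whose inputs reference `output_name`."""
    return sorted(node for node, inputs in node_inputs.items() if output_name in inputs)


def orphans_after_delete(output_name: str, node_inputs: dict[str, set[str]]) -> list[str]:
    """List the nodes that would have no inputs left if `output_name` were deleted."""
    return sorted(
        node
        for node, inputs in node_inputs.items()
        if output_name in inputs and not (inputs - {output_name})
    )


# Hypothetical downstream nodes and the output names they depend on.
node_inputs = {
    "mc2": {"demo_project.t1"},
    "mc3": {"demo_project.t1", "demo_project.t2"},
    "mc4": {"demo_project.t2"},
}
target = "demo_project.t1"
print(dependents_of(target, node_inputs))         # ['mc2', 'mc3']
print(orphans_after_delete(target, node_inputs))  # ['mc2']
```

Here mc2 and mc3 both need attention, but only mc2 would lose its only ancestor; mc3 would still be scheduled while silently missing one of its data dependencies.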

Next steps: Verify that dependencies are as expected

After you complete the configuration, verify that the dependencies are correct to ensure that tasks are scheduled as expected:

  • Preview dependencies: Verify that the dependencies are correct to prevent task scheduling delays caused by unexpected dependencies.

  • Submission check: Confirm that dependency changes are as expected when you submit a node.

  • Scheduled task dependency verification: After you deploy a node, verify in Operation Center that the dependencies of the production scheduled task are as expected. A scheduled task represents the latest state of the task in the production environment. The dependencies between scheduled instances depend on the instance generation method.

For more information, see Verify scheduling dependencies.

FAQ

For more frequently asked questions, see Scheduling dependencies.

Best practices

To configure node dependencies across workspaces or across workflows within the same workspace, see Configure node dependencies across workspaces or workflows.