Configuring scheduling dependencies

In DataWorks, a scheduling dependency defines the upstream and downstream relationship between periodically scheduled nodes. After you configure a dependency, the system ensures that a downstream node instance is triggered only after all its upstream node instances have succeeded. This guarantees that data is produced and consumed in the correct order. This document introduces the basic concepts, types, and configuration methods for scheduling dependencies to help you understand the basics and find relevant instructions.

How it works

By configuring a dependency, you specify that a node starts only after its upstream nodes succeed, which keeps data processing in the correct order. The DataWorks scheduling system then orchestrates the execution order automatically: a downstream instance is triggered only when all its upstream instances have succeeded and all other conditions, such as the scheduled time and resource availability, are met.
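
The trigger condition described above can be sketched as a single predicate. This is only an illustration of the rule, not DataWorks code: the `Status` enum, the `Instance` class, and the `can_trigger` function are hypothetical names, and resource availability is reduced to a boolean flag.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Status(Enum):
    NOT_RUN = "not_run"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class Instance:
    # Hypothetical, simplified model of a scheduled node instance.
    node: str
    scheduled_time: datetime
    status: Status = Status.NOT_RUN


def can_trigger(instance, upstream_instances, now, resources_available=True):
    """Return True only when every gating condition holds."""
    all_upstream_ok = all(u.status is Status.SUCCEEDED for u in upstream_instances)
    time_reached = now >= instance.scheduled_time
    return all_upstream_ok and time_reached and resources_available
```

In this sketch, a failed or still-running upstream instance keeps the downstream instance waiting even after its scheduled time has passed.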

DataWorks establishes dependencies by matching the output names of upstream nodes with the input names of downstream nodes. The core workflow for configuring a dependency is as follows:

  1. Configure the output on the upstream node: Add an output name to the upstream node, usually in the format project_name.table_name (for example, my_project.dim_user), to represent the data table produced by the node.

  2. Configure the input on the downstream node: In the downstream node, search for and select the output name of the upstream node as its input (dependency). This establishes the dependency relationship.

  3. Automatic parsing (optional): For SQL-based nodes, DataWorks can automatically parse INSERT and SELECT statements in your code to identify input and output tables and then generate the dependency configuration. You can also manually adjust the automatically parsed configuration. For a list of node types that support automatic parsing, see Automatic parsing scenarios for different node types.

Important

Each node must have at least one output name. The system automatically generates a default output for each node in the format project_name.nodeID_out. This default output is retained even if you delete all custom outputs.
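
As a rough illustration of name matching, the following sketch resolves dependencies by looking up which node produces each input name. The node names, the numeric node IDs, and the `default_output` and `resolve_dependencies` helpers are assumptions made for the example. Only the name formats (`project_name.table_name` and `project_name.nodeID_out`) come from the description above.

```python
def default_output(project, node_id):
    # Default output format described above: project_name.nodeID_out
    return f"{project}.{node_id}_out"


def resolve_dependencies(nodes):
    """Map each node to the set of upstream nodes whose outputs match its inputs.

    `nodes` maps a node name to {"outputs": [...], "inputs": [...]}.
    """
    producer_of = {}
    for name, cfg in nodes.items():
        for out in cfg["outputs"]:
            producer_of[out] = name

    upstream = {}
    for name, cfg in nodes.items():
        # Inputs with no known producer are ignored in this simplified model.
        upstream[name] = {producer_of[i] for i in cfg["inputs"] if i in producer_of}
    return upstream


# Hypothetical nodes: one produces my_project.dim_user, the other consumes it.
nodes = {
    "load_dim_user": {
        "outputs": [default_output("my_project", 1001), "my_project.dim_user"],
        "inputs": [],
    },
    "report_users": {
        "outputs": [default_output("my_project", 1002)],
        "inputs": ["my_project.dim_user"],
    },
}

dependencies = resolve_dependencies(nodes)
```

Here `dependencies["report_users"]` contains `load_dim_user`, because the input name of the downstream node equals an output name of the upstream node.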

Rules and limitations

  • Effective upon deployment: A scheduling dependency takes effect only after you commit and deploy the node to Operation Center. Configurations made in the development environment are not automatically synchronized to the production environment.

  • Upstream and downstream scheduling status: For a dependency to work, both the upstream and downstream node instances must be generated and in a normal scheduling state. If a node is configured incorrectly or an upstream instance fails, the downstream node may become isolated and fail to run as scheduled.

  • Circular dependency restriction: The system prohibits circular dependencies, both direct (A depends on B, and B depends on A) and indirect. If a circular dependency is detected during a commit, the system blocks the deployment and reports an error.
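
To see why both direct and indirect cycles can be caught at commit time, consider a depth-first search over the dependency graph. This sketch is not the check that DataWorks runs; it only shows one standard way to detect a cycle. The `find_cycle` function and the node names are hypothetical.

```python
def find_cycle(upstream):
    """Return one dependency cycle as a list of nodes, or None if there is none.

    `upstream` maps each node to the nodes it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for parent in upstream.get(node, ()):
            state = color.get(parent, WHITE)
            if state == GRAY:
                # The parent is already on the current path: a cycle is closed.
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(upstream):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


direct = find_cycle({"A": ["B"], "B": ["A"]})
indirect = find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
acyclic = find_cycle({"A": ["B"], "B": ["C"], "C": []})
```

A deployment gate built on this idea would reject the commit whenever the result is not `None`.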

Dependency types

DataWorks supports two types of scheduling dependency: same-cycle dependency and cross-cycle dependency. Each type is suitable for different business scenarios.

Prerequisite concepts

A cycle is a relative concept defined by a node's scheduling time. A "scheduling cycle" is the interval between two consecutive scheduled instances of a node and is determined by the node's scheduling frequency. For example, for a daily scheduled task, the previous cycle refers to the instance from the previous day. For an hourly scheduled task, it refers to the instance from the previous hour.

| Scheduling frequency | One cycle |
| --- | --- |
| Daily, weekly, monthly, yearly | 1 day |
| Hourly | An hourly interval |
| By minute | A minute-level interval (for example, every 5 minutes) |

Note

For weekly, monthly, and yearly tasks, instances are still generated on a daily basis, but they are dry-run instances on non-scheduled days. Therefore, dependency calculations are based on daily granularity, and a previous-cycle instance may be a dry-run instance.
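
The frequency-to-cycle mapping above can be expressed as a small helper that steps back one cycle from a scheduled time. This is a simplified sketch: the `previous_cycle` function, its frequency strings, and the `interval` parameter are hypothetical, and day-boundary handling for hourly and minute-level tasks is ignored.

```python
from datetime import datetime, timedelta


def previous_cycle(scheduled_time, frequency, interval=1):
    """Return the scheduled time of the previous-cycle instance.

    `interval` is the number of hours (hourly) or minutes (minute-level)
    between instances; it is unused for day-based frequencies.
    """
    if frequency in {"daily", "weekly", "monthly", "yearly"}:
        # Weekly, monthly, and yearly tasks still generate (possibly dry-run)
        # instances every day, so one cycle is always one day.
        return scheduled_time - timedelta(days=1)
    if frequency == "hourly":
        return scheduled_time - timedelta(hours=interval)
    if frequency == "minute":
        return scheduled_time - timedelta(minutes=interval)
    raise ValueError(f"unknown frequency: {frequency}")


t = datetime(2024, 3, 10, 8, 0)
previous_daily = previous_cycle(t, "daily")                    # 2024-03-09 08:00
previous_five_minute = previous_cycle(t, "minute", interval=5)  # 2024-03-10 07:55
```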

Two dependency types

For example, a daily scheduled node A produces the table dim_user, and a downstream node B consumes this table:

  • Same-cycle dependency: Node B's instance for today waits for Node A's instance for today to succeed. This means Node B consumes the data produced by Node A on the same day.

  • Cross-cycle dependency: Node B's instance for today waits for Node A's instance from yesterday to succeed. This means Node B consumes the data produced by Node A on the previous day.
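
For the daily Node A and Node B example, the difference between the two types comes down to which upstream instance the downstream instance waits for. The sketch below models only that lookup. `awaited_upstream_date` and `is_unblocked` are hypothetical helpers, and instances are identified by a (node, date) pair purely for illustration.

```python
from datetime import date, timedelta


def awaited_upstream_date(downstream_date, dependency_type):
    """Return the date of the upstream instance a daily downstream instance waits for."""
    if dependency_type == "same-cycle":
        return downstream_date
    if dependency_type == "cross-cycle":
        return downstream_date - timedelta(days=1)
    raise ValueError(f"unknown dependency type: {dependency_type}")


def is_unblocked(succeeded, downstream_date, dependency_type, upstream="A"):
    """Check whether the awaited upstream instance is in the set of succeeded instances."""
    awaited = awaited_upstream_date(downstream_date, dependency_type)
    return (upstream, awaited) in succeeded


# Only yesterday's instance of Node A has succeeded so far.
today = date(2024, 5, 2)
succeeded = {("A", date(2024, 5, 1))}
cross_cycle_ready = is_unblocked(succeeded, today, "cross-cycle")  # True
same_cycle_ready = is_unblocked(succeeded, today, "same-cycle")    # False
```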

| Item | Same-cycle dependency | Cross-cycle dependency (dependency on the previous cycle) |
| --- | --- | --- |
| Definition | The current-cycle instance of a node depends on the result of the current-cycle instance of its upstream node. | The current-cycle instance of a node depends on the result of the previous-cycle instance of a specified node. The specified node can be the node itself (a self-dependency), a direct downstream node, or any other node. |
| DAG representation | Represented by a solid line. | Represented by a dashed line. |
| Typical use case | Node B needs to read the data produced by Node A today. | A node depends on data produced yesterday (for example, T-1 reads). An hourly or minute-level task uses self-dependency to ensure serial execution and avoid concurrent runs. |
| Configuration method | Supports automatic parsing, drag-and-drop connection in the workflow editor, and manual configuration. | In the scheduling configuration panel, in the "Previous Cycle" section, select the dependency form and specify the node ID. |

Note: A same-cycle dependency and a cross-cycle dependency can exist between the same pair of nodes, but you must understand their respective business logic. If you only need a cross-cycle dependency, remember to delete the same-cycle dependency that may have been automatically generated. Otherwise, the downstream instance will still wait for the upstream current-cycle instance to complete, which can cause unexpected delays.

Configuration

To ensure the integrity and maintainability of your scheduling pipelines, all nodes must have an upstream dependency configured before they can be deployed to Operation Center for automatic scheduling. If a node has no data dependencies, it must depend on a zero load node or the root node. To configure a scheduling dependency, analyze the node's business logic, identify the dependency object and type, and select the best method to build a robust data workflow.

1. Identify the dependency object

Before you configure a dependency, complete the following preparations:

  • Analyze the data lineage: Confirm that the table or partition produced by the upstream node matches the table or partition read by the downstream node.

  • Check scheduling properties: Ensure that the node's scheduling cycle, effective date, and scheduling parameters are correctly set, as these properties directly affect how dependencies behave.

Select a dependency object based on your node's data dependency requirements.

Scenario 1: Dependency on direct output

  • Use case: The data required by a downstream node is directly sourced from a table produced by another upstream node that is automatically scheduled by DataWorks.

  • Configuration strategy: Base node dependencies on data lineage.

  • Core value: This is the most direct and robust method. The scheduling system ensures that the downstream node starts only after the upstream data is ready, which guarantees end-to-end data consistency.

Scenario 2: Dependency on non-scheduled data

  • Use case: The upstream data is not managed by the DataWorks scheduling system and does not generate schedulable instances for downstream dependencies. Examples include:

    • Files pushed from an external system to OSS or FTP.

    • Tables generated by real-time synchronization.

    • Tables generated by a third-party synchronization tool not scheduled by DataWorks.

    • Temporary tables generated by manual uploads or runs.

  • Configuration strategy: Configure a checker node (such as a Check node) to actively verify if the data is ready (for example, by checking if a file exists or a table partition has been generated). Downstream business nodes can depend on this checker node.

  • Core value: This approach converts a "data produced" state into a "schedulable event," allowing subsequent processes to be driven by data readiness. This ensures data correctness even in non-scheduled workflows.
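
The idea behind a checker node can be illustrated with a generic polling loop that succeeds once the data is ready and gives up after a timeout. This is not how the DataWorks Check node is implemented or configured. `wait_until_ready` and `file_arrived` are hypothetical helpers, and a local file path stands in for an OSS or FTP object.

```python
import time
from pathlib import Path


def wait_until_ready(is_ready, timeout_s, interval_s=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `is_ready` until it returns True or `timeout_s` seconds elapse.

    Returns True if the data became ready, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def file_arrived(path):
    # Stand-in readiness probe: a real check might list an OSS prefix
    # or look up a table partition instead.
    return Path(path).is_file()
```

A call such as `wait_until_ready(lambda: file_arrived("/data/in/_SUCCESS"), timeout_s=3600, interval_s=60)` (the path is made up) would turn "the file exists" into a success or failure result that downstream nodes can depend on.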

Scenario 3: Business logic dependency

  • Use case: A node is completely independent in its data processing and code logic, but it needs to belong to a specific business process or be scheduled at a regular interval.

  • Configuration strategy:

    • Depend on a zero load node: Grouping related tasks into a logical unit simplifies starting, stopping, monitoring, and maintenance, which helps keep your business logic organized.

    • Depend on the root node of a workspace: This ensures the task can be instantiated and executed on time by the scheduling system, preventing it from becoming an isolated node that cannot be automatically scheduled.

  • Core value: This prevents nodes from becoming isolated, clarifies process control and monitoring, and ensures the integrity of the business logic flow.

2. Select the dependency type

If your node depends on the direct output of an upstream node (Scenario 1), you need to determine whether it depends on the output from the same cycle or from a different cycle.

Key decision

Check which cycle's data the downstream node actually reads. In most cases, a node uses a scheduling parameter to dynamically write data to a specific partition of a table. You can refer to Supported formats of scheduling parameters to understand how scheduling parameter substitution works. If you need to depend on a node in the same workspace, you can check its scheduling parameter configuration.

How to confirm

  • For nodes in the same workspace: Check the scheduling parameter in the upstream node's code to see if the written partition corresponds to "today" or "yesterday" after parameter substitution.

    • In the development environment, check the upstream node's scheduling parameter configuration and code. In the production environment, check the parameter substitution results in the instance details.

  • For nodes in a different workspace: Use Data Map to view the upstream table's partition information and change history.

    • Confirm the actual partition value written each day.

Select a type

  • Downstream code fetches the upstream partition for the current day/cycle: same-cycle.

  • Downstream code fetches the upstream partition from the previous day/cycle: cross-cycle.

  • An hourly/minute task needs to run serially without concurrency: cross-cycle (self-dependency on the node itself).

Important

Consequences of incorrect data lineage verification:

  1. Risk of missing dependencies: If a data lineage relationship exists but a scheduling dependency is not configured, the downstream task may start before the upstream instance succeeds, leading to missing or incomplete data.

  2. Risk of parameter mismatch: If a dependency is configured but the partition parameters are mismatched (for example, the upstream node produces today's partition, but the downstream node reads yesterday's partition), this can lead to data logic errors and quality issues.
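
One way to reason about the mismatch risk is to compare partition offsets. The sketch below assumes daily tasks and expresses each partition as a day offset from the instance's scheduled date (0 for "today", -1 for "yesterday"). `infer_dependency_type` and `check_dependency` are hypothetical helpers, not DataWorks features.

```python
def infer_dependency_type(upstream_write_offset, downstream_read_offset):
    """Infer which dependency type the partition offsets imply.

    An upstream instance on day D writes partition D + upstream_write_offset.
    A downstream instance on day D reads partition D + downstream_read_offset.
    """
    lag = upstream_write_offset - downstream_read_offset
    if lag == 0:
        return "same-cycle"   # written by the upstream instance of the same day
    if lag == 1:
        return "cross-cycle"  # written by the upstream instance of the previous day
    return None               # not produced by the current or previous cycle


def check_dependency(configured_type, upstream_write_offset, downstream_read_offset):
    implied = infer_dependency_type(upstream_write_offset, downstream_read_offset)
    if implied is None:
        return "mismatch: the partition read is not written by the current or previous upstream cycle"
    if implied != configured_type:
        return f"mismatch: partitions imply {implied}, but {configured_type} is configured"
    return "ok"


# Upstream writes today's partition, downstream reads yesterday's partition.
result = check_dependency("same-cycle", upstream_write_offset=0, downstream_read_offset=-1)
```

In this example `result` reports a mismatch, which corresponds to the case described in item 2 above.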

3. Configure dependencies

Based on the dependency object and type identified in Steps 1 and 2, choose the appropriate method to configure the dependency.

DataWorks allows you to create dependencies between tasks with different scheduling frequencies. By combining same-cycle or cross-cycle dependencies with a scheduling parameter, you can implement a wide range of advanced scheduling patterns. For more information, see:

4. Verify the dependency

Verify the dependency at each of the following stages: during configuration, on commit, and after deployment.

During configuration: Preview scheduling dependencies

Preview the configured scheduling dependencies to ensure they meet your expectations.

  • Currently, you can view only the direct upstream and downstream dependencies of the current node.

  • To ensure the dependency is displayed correctly, make sure the upstream node is saved.

  • In the dependency preview graph, a solid line indicates a same-cycle dependency, and a dashed line indicates a cross-cycle dependency.

On commit: Compare code parsing results

When you commit a node, confirm whether the dependency changes are expected and assess their impact on the production environment.

If automatic parsing is enabled, you must also confirm any changes to the node's scheduling configuration during the commit. This helps ensure that dependency changes do not negatively affect production tasks.

After deployment: View Auto Triggered Nodes

After a node is deployed, go to Operation Center to confirm that the production task's scheduling dependencies are correct.

  • Confirm production task dependencies

    In a standard-mode workspace, the dependencies of a node can differ between the development environment and the production environment. The dependencies for production nodes must be configured in DataStudio and take effect only after deployment.

    After a node is deployed, you can go to the Auto Triggered Nodes page in Operation Center and expand the upstream and downstream nodes to view the scheduling dependency details.

    Important

    The Auto Triggered Nodes page displays the latest status of nodes in the production environment. However, whether new dependencies are added to or removed from an instance depends on the selected instance generation mode.

  • Confirm production data status

    After confirming that the scheduling dependencies are correct, you also need to verify the data partitions being read and written by the upstream and downstream nodes (that is, whether the scheduling parameter is configured correctly). This prevents the downstream node from reading incorrect data, which can lead to data quality issues.

    Note

    If your deployment workflow includes a review process, we recommend going to the Auto Triggered Nodes page in Operation Center after a task is deployed to check its dependencies and properties. If you find that the task is not behaving as expected, check whether its deployment process is blocked. For more information, see Deploy tasks.

Impact of removing dependencies

During task operation or iteration, you may need to remove or adjust existing scheduling dependencies.

Before removing a dependency, you must evaluate the impact on the scheduling behavior of downstream tasks to avoid creating isolated nodes or causing data incidents.

| Downstream dependency scenario | Impact of removal | Risk level |
| --- | --- | --- |
| Downstream task depends only on the current node | The downstream task becomes an isolated node, loses its upstream trigger, and no longer runs automatically. | High |
| Downstream task depends on multiple parent nodes | The downstream task may start before all required upstream data is ready, leading to missing data or calculation errors. | Medium |
| Downstream task depends on a cross-cycle instance | If a cross-cycle dependency is removed, the downstream task may read data from the wrong business date, causing data logic errors. | Medium |
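
Before removing dependencies, it can help to list which downstream nodes fall into the first two scenarios. The following sketch works on a plain dictionary of same-cycle dependencies. `downstream_impact` and the node names are hypothetical, and cross-cycle dependencies are not modeled.

```python
def downstream_impact(upstream, removed_node):
    """Classify downstream nodes if every dependency on `removed_node` is removed.

    `upstream` maps each node to the list of nodes it depends on.
    """
    impact = {}
    for node, parents in upstream.items():
        if removed_node not in parents:
            continue
        remaining = set(parents) - {removed_node}
        # No parent left: the node loses its only trigger (high risk).
        # Other parents left: the node may start before all data is ready (medium risk).
        impact[node] = "may start early" if remaining else "isolated"
    return impact


upstream = {"B": ["A"], "C": ["A", "D"], "E": ["D"]}
impact = downstream_impact(upstream, "A")
```

With this input, `B` is reported as isolated and `C` as possibly starting early, while `E` is unaffected.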

Use cases

  • Layered offline data warehouses: Configure dependencies across the entire ODS → DWD → DWS → ADS pipeline to ensure data is produced in the correct order at each layer.

  • Standard ETL pipelines: Configure a same-cycle dependency to ensure a downstream task waits for its upstream instance to succeed, guaranteeing sequential and consistent data processing.

  • Next-day (T+1) reports: Configure a cross-cycle dependency (with an offset of -1) to make today's task depend on the complete business data from yesterday, enabling accurate next-day analysis and output.

  • Multi-cycle hybrid aggregation: Configure a cross-cycle dependency to make a daily task depend on all instances of an hourly task, ensuring the underlying data is fully ready before aggregation.

  • External data readiness triggers: Configure a custom dependency or a checker node to confirm that an external file has arrived or an API is ready before triggering a workflow, enabling cross-system scheduling.

  • Complex workflow control: Use a zero load node to aggregate multiple dependency branches, acting as a control milestone to simplify the pipeline structure and improve monitoring visibility.
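
The multi-cycle aggregation case can be pictured as a daily instance that waits for every hourly instance of a given day. The sketch below only models that waiting rule. `hourly_instances_for_day` and `daily_can_aggregate` are hypothetical helpers, and which day is checked (the same day or the previous day) depends on the dependency type you configure.

```python
from datetime import date, datetime, timedelta


def hourly_instances_for_day(day, every_hours=1):
    """List the scheduled times of all hourly instances on `day`."""
    start = datetime(day.year, day.month, day.day)
    return [start + timedelta(hours=h) for h in range(0, 24, every_hours)]


def daily_can_aggregate(day, succeeded_hourly):
    """Return True only if every hourly instance of `day` has succeeded."""
    return all(t in succeeded_hourly for t in hourly_instances_for_day(day))


day = date(2024, 1, 1)
all_hours = hourly_instances_for_day(day)
ready = daily_can_aggregate(day, set(all_hours))            # every hour succeeded
not_ready = daily_can_aggregate(day, set(all_hours[:-1]))   # the 23:00 instance is missing
```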

FAQ

The following section describes common scenarios. For more frequently asked questions about scheduling dependencies, see FAQ about dependencies.

  • Node uniqueness

    • A node can have different configurations but is logically unique: A single node can have different scheduling dependency configurations in the development environment and the production environment. Although its configuration may differ between environments, it remains a single, unique node.

    • Downstream dependencies must be removed in both environments before decommissioning a node: Due to node uniqueness, to ensure downstream tasks run correctly, you must first remove the dependency from the downstream node's configuration, reconfigure it to depend on a new upstream node, and then commit and deploy the change. Decommission the original upstream task only after confirming its dependency has been removed from both the development environment and the production environment.

  • Instance generation modes

    • When you create a node, ensure that its instance generation mode is the same as its upstream and downstream nodes. If the instance generation mode differs, for example if an upstream node generates an instance on the current day but a downstream node generates it on the next day, the downstream instance may become an isolated node.

    • If you change the scheduling cycle of an existing node and choose to generate instances immediately after deployment, the previously generated instances are not automatically deleted. This can lead to misaligned dependencies for the instances on the day of the deployment. For more information, see Impact of real-time instance conversion on same-day instance dependencies.

  • An error indicating more than 200 upstream dependencies is reported when updating a job by using an API.

    • Error details: 'One file could not have more than 200 inputs'.

    • You can add a zero load node in DataStudio between the upstream and downstream nodes to reduce the number of direct upstream dependencies. For more information about how to configure a zero load node, see Zero load node.
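
The workaround can be planned ahead by grouping upstream nodes so that no node has more than 200 direct inputs. The sketch below only computes such a grouping. `group_upstreams` and the `zero_load_N` names are hypothetical, it assumes the intermediate zero load nodes are subject to the same 200-input limit, and it does not call any DataWorks API.

```python
def group_upstreams(upstream_nodes, max_inputs=200):
    """Plan intermediate nodes so that no node has more than `max_inputs` inputs.

    Returns {"final_inputs": [...], "zero_load_nodes": {name: [inputs...]}}.
    """
    if max_inputs < 2:
        raise ValueError("max_inputs must be at least 2")

    level = list(upstream_nodes)
    groups = {}
    counter = 0
    # Add layers of intermediate nodes until the final fan-in fits the limit.
    while len(level) > max_inputs:
        next_level = []
        for i in range(0, len(level), max_inputs):
            name = f"zero_load_{counter}"
            counter += 1
            groups[name] = level[i:i + max_inputs]
            next_level.append(name)
        level = next_level
    return {"final_inputs": level, "zero_load_nodes": groups}


plan = group_upstreams([f"node_{i}" for i in range(450)])
```

For 450 upstream nodes, the plan contains three intermediate nodes with 200, 200, and 50 inputs, and the downstream node depends only on those three.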