In DataWorks, a scheduling dependency defines the upstream and downstream relationship between periodically scheduled nodes. After you configure a dependency, the system ensures that a downstream node instance is triggered only after all its upstream node instances have succeeded. This guarantees that data is produced and consumed in the correct order. This document introduces the basic concepts, types, and configuration methods for scheduling dependencies to help you understand the basics and find relevant instructions.
How it works
A scheduling dependency is a mechanism in DataWorks that defines the upstream and downstream relationships between nodes. By configuring a dependency, you can specify that a node starts only after its upstream nodes succeed, ensuring the correct order of data processing. Once a dependency is configured, the DataWorks scheduling system automatically orchestrates the execution order. A downstream instance is triggered only when all its upstream instances have succeeded and all other conditions, such as time and resource availability, are met.
DataWorks establishes dependencies by matching the output names of upstream nodes with the input names of downstream nodes. The core workflow for configuring a dependency is as follows:
Configure the output on the upstream node: Add an output name to the upstream node, usually in the format project_name.table_name (for example, my_project.dim_user), to represent the data table produced by the node.
Configure the input on the downstream node: In the downstream node, search for and select the output name of the upstream node as its input (dependency). This establishes the dependency relationship.
Automatic parsing (optional): For SQL-based nodes, DataWorks can automatically parse INSERT and SELECT statements in your code to identify input and output tables and then generate the dependency configuration. You can also manually adjust the automatically parsed configuration. For a list of node types that support automatic parsing, see Automatic parsing scenarios for different node types.
Each node must have at least one output name. The system automatically generates a default output for each node in the format project_name.nodeID_out. This default output is retained even if you delete all custom outputs.
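The name-matching idea can be illustrated with a minimal Python sketch. It is not DataWorks code: the function build_dependencies, the node names, and the numeric IDs in the default outputs are all hypothetical, and the sketch only assumes that a dependency exists wherever a downstream input name equals an upstream output name.

```python
from collections import defaultdict


def build_dependencies(nodes):
    """Match each node's input names against other nodes' output names.

    `nodes` maps a node name to {"outputs": [...], "inputs": [...]}.
    Returns a dict: downstream node -> sorted list of upstream nodes.
    """
    # Index every output name by the node that produces it.
    producers = {}
    for name, cfg in nodes.items():
        for output in cfg["outputs"]:
            producers[output] = name

    # A node depends on whichever node produces one of its inputs.
    upstreams = defaultdict(set)
    for name, cfg in nodes.items():
        for input_name in cfg["inputs"]:
            producer = producers.get(input_name)
            if producer is not None and producer != name:
                upstreams[name].add(producer)
    return {node: sorted(ups) for node, ups in upstreams.items()}


# Hypothetical nodes; the IDs 1001 and 1002 are made up for illustration.
nodes = {
    "load_dim_user": {
        "outputs": ["my_project.1001_out", "my_project.dim_user"],
        "inputs": [],
    },
    "build_report": {
        "outputs": ["my_project.1002_out"],
        "inputs": ["my_project.dim_user"],
    },
}

deps = build_dependencies(nodes)
print(deps)
```

In this sketch, build_report ends up with load_dim_user as its only upstream node because its input my_project.dim_user matches that node's custom output.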
Rules and limitations
Effective upon deployment: A scheduling dependency takes effect only after you commit and deploy the node to Operation Center. Configurations made in the development environment are not automatically synchronized to the production environment.
Upstream and downstream scheduling status: For a dependency to work, both the upstream and downstream node instances must be generated and in a normal scheduling state. If a node is configured incorrectly or an upstream instance fails, the downstream node may become isolated and fail to run as scheduled.
Circular dependency restriction: The system prohibits circular dependencies, both direct (A depends on B, and B depends on A) and indirect. If a circular dependency is detected during a commit, the system blocks the deployment and reports an error.
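To make the circular dependency restriction concrete, the following Python sketch detects direct and indirect cycles with a depth-first search. It illustrates the general technique only and makes no claim about how DataWorks performs the check; find_cycle and the sample graphs are hypothetical.

```python
def find_cycle(upstreams):
    """Return one dependency cycle as a list of nodes, or None.

    `upstreams` maps each node to the list of nodes it depends on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for parent in upstreams.get(node, []):
            state = color.get(parent, WHITE)
            if state == GRAY:
                # The parent is already on the current path: cycle found.
                return path[path.index(parent):] + [parent]
            if state == WHITE:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(upstreams):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


direct = {"A": ["B"], "B": ["A"]}                # A depends on B, B on A
indirect = {"A": ["B"], "B": ["C"], "C": ["A"]}  # A -> B -> C -> A
acyclic = {"B": ["A"], "C": ["A", "B"]}

print(find_cycle(direct))
print(find_cycle(indirect))
print(find_cycle(acyclic))
```

A check like this could run before a commit in your own tooling to catch cycles early; the system's own validation remains the authority.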
Dependency types
DataWorks supports two types of scheduling dependency: same-cycle dependency and cross-cycle dependency. Each type is suitable for different business scenarios.
Prerequisite concepts
A cycle is a relative concept defined by a node's scheduling time. A "scheduling cycle" is the time offset between two consecutive scheduled instances of a node and is determined by its scheduling frequency. For example, for a daily scheduled task, the previous cycle refers to the instance from the previous day. For an hourly scheduled task, it refers to the instance from the previous hour.
Scheduling frequency | One cycle |
Daily, weekly, monthly, yearly | 1 day. Note: For weekly, monthly, and yearly tasks, instances are still generated on a daily basis, but they are dry-run instances on non-scheduled days. Therefore, dependency calculations are based on daily granularity, and a previous-cycle |
Hourly | An hourly interval |
By minute | A minute-level interval (for example, every 5 minutes) |
Two dependency types
For example, a daily scheduled node A produces the table dim_user, and a downstream node B consumes this table:
Same-cycle dependency: Node B's instance for today waits for Node A's instance for today to succeed. This means Node B consumes the data produced by Node A on the same day.
Cross-cycle dependency: Node B's instance for today waits for Node A's instance from yesterday to succeed. This means Node B consumes the data produced by Node A on the previous day.
Item | Same-cycle dependency | Cross-cycle dependency (dependency on the previous cycle) |
Definition | The current-cycle | The current-cycle |
DAG representation | Represented by a solid line. | Represented by a dashed line. |
Typical use case | Node B needs to read the data produced by Node A today. | A |
Configuration method | Supports automatic parsing, drag-and-drop connection in the workflow editor, and manual configuration. | In the scheduling configuration panel, in the "Previous Cycle" section, select the dependency form and specify the |
Note: A same-cycle dependency and a cross-cycle dependency can exist between the same pair of nodes, but you must understand their respective business logic. If you only need a cross-cycle dependency, remember to delete the same-cycle dependency that may have been automatically generated. Otherwise, the downstream instance will still wait for the upstream current-cycle instance to complete, which can cause unexpected delays.
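The difference between the two dependency types for daily nodes can be sketched in Python. The function names and the sample date are hypothetical, and the sketch assumes that one cycle equals one day, as it does for daily scheduled nodes.

```python
from datetime import date, timedelta


def upstream_instance_date(downstream_date, dependency_type):
    """Return the date of the upstream daily instance that must succeed first."""
    if dependency_type == "same-cycle":
        return downstream_date
    if dependency_type == "cross-cycle":
        # One cycle for a daily node is one day.
        return downstream_date - timedelta(days=1)
    raise ValueError(f"unknown dependency type: {dependency_type}")


def required_upstream_dates(downstream_date, dependency_types):
    """Dates of all upstream instances awaited when several types coexist."""
    return sorted(
        {upstream_instance_date(downstream_date, t) for t in dependency_types}
    )


today = date(2024, 5, 20)  # arbitrary example date
same = upstream_instance_date(today, "same-cycle")
cross = upstream_instance_date(today, "cross-cycle")
both = required_upstream_dates(today, ["same-cycle", "cross-cycle"])
print(same, cross, both)
```

The last call shows why a leftover same-cycle dependency matters: with both types configured, Node B's instance waits for Node A's instances of both days.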
Configuration
To ensure the integrity and maintainability of your scheduling pipelines, all nodes must have an upstream dependency configured before they can be deployed to Operation Center for automatic scheduling. If a node has no data dependencies, it must depend on a zero load node or the root node. To configure a scheduling dependency, analyze the node's business logic, identify the dependency object and type, and select the best method to build a robust data workflow.
1. Identify the dependency object
Before you configure a dependency, complete the following preparations:
Analyze the data lineage: Confirm that the table or partition produced by the upstream node matches the table or partition read by the downstream node.
Check scheduling properties: Ensure that the node's scheduling cycle, effective date, and scheduling parameter are correctly set, as these properties directly affect how dependencies behave.
Select a dependency object based on your node's data dependency requirements.
Scenario 1: Dependency on direct output
Scenario 2: Dependency on non-scheduled data
Scenario 3: Business logic dependency
2. Select the dependency type
If your node depends on the direct output of an upstream node (Scenario 1), you need to determine whether it depends on the output from the same cycle or from a different cycle.
Key decision
Check which cycle's data the downstream node actually reads. In most cases, a node uses a scheduling parameter to dynamically write data to a specific partition of a table. You can refer to Supported formats of scheduling parameters to understand how scheduling parameter substitution works. If you need to depend on a node in the same workspace, you can check its scheduling parameter configuration.
How to confirm
For nodes in the same workspace: Check the scheduling parameter in the upstream node's code to see if the written partition corresponds to "today" or "yesterday" after parameter substitution.
In the development environment, check the upstream node's scheduling parameter configuration and code. In the production environment, check the parameter substitution results in the instance details.
For nodes in a different workspace: Use Data Map to view the upstream table's partition information and change history.
Confirm the actual partition value written each day.
Select a type
Downstream code fetches the upstream partition for the current day/cycle: same-cycle.
Downstream code fetches the upstream partition from the previous day/cycle: cross-cycle.
An hourly/minute task needs to run serially without concurrency: cross-cycle (self-dependency on the node itself).
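The selection rules for daily nodes can be expressed as a small Python sketch. The function name and the dates are hypothetical. The sketch assumes that you have already resolved, for one scheduling day, which partition the upstream instance writes and which upstream partition the downstream instance reads.

```python
from datetime import date


def choose_dependency_type(upstream_partition, downstream_partition_read):
    """Suggest a dependency type for daily nodes on one scheduling day.

    upstream_partition: partition written by the upstream instance that day.
    downstream_partition_read: upstream partition read by the downstream
    instance of the same day.
    """
    offset_days = (upstream_partition - downstream_partition_read).days
    if offset_days == 0:
        # Downstream reads what upstream writes in the same cycle.
        return "same-cycle"
    if offset_days == 1:
        # Downstream reads what upstream wrote one cycle earlier.
        return "cross-cycle"
    return "review manually"


written = date(2024, 5, 19)  # partition the upstream instance writes
print(choose_dependency_type(written, date(2024, 5, 19)))
print(choose_dependency_type(written, date(2024, 5, 18)))
```

Any other offset falls through to a manual review, because it usually means the partition logic or the scheduling parameters need a closer look.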
Consequences of incorrect data lineage verification:
Risk of missing dependencies: If a data lineage relationship exists but a scheduling dependency is not configured, the downstream task may start before the upstream instance succeeds, leading to missing or incomplete data.
Risk of parameter mismatch: If a dependency is configured but the partition parameters are mismatched (for example, the upstream node produces today's partition, but the downstream node reads yesterday's partition), this can lead to data logic errors and quality issues.
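One way to look for missing dependencies is to compare lineage edges with configured dependency edges. The Python sketch below assumes that you can export both as (upstream, downstream) pairs; the function name and the node names are hypothetical.

```python
def missing_dependencies(lineage_edges, dependency_edges):
    """Return lineage edges that have no matching scheduling dependency.

    Both arguments are iterables of (upstream, downstream) tuples.
    """
    return sorted(set(lineage_edges) - set(dependency_edges))


# Hypothetical exports of lineage and configured dependencies.
lineage = [("ods_orders", "dwd_orders"), ("dwd_orders", "dws_sales")]
configured = [("ods_orders", "dwd_orders")]

gaps = missing_dependencies(lineage, configured)
print(gaps)
```

Each pair in the result is a downstream node that reads a table without waiting for the node that produces it.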
3. Configure dependencies
Based on the dependency object and type identified in Steps 1 and 2, choose the appropriate method to configure the dependency.
DataWorks allows you to create dependencies between tasks with different scheduling frequencies. By combining same-cycle or cross-cycle dependencies with a scheduling parameter, you can implement a wide range of advanced scheduling patterns. For more information, see:
4. Verify the dependency
After configuration and before deployment, you must verify the dependency:
Verification method | Description |
Preview the configured scheduling dependencies to ensure they meet your expectations.
| |
When you If automatic parsing is enabled, you must also confirm any changes to the node's scheduling configuration during the | |
After a
|
Impact of removing dependencies
During task operation or iteration, you may need to remove or adjust existing scheduling dependencies.
Before removing a dependency, you must evaluate the impact on the scheduling behavior of downstream tasks to avoid creating isolated nodes or causing data incidents.
Downstream dependency scenario | Impact of removal | Risk level |
Downstream task depends only on the current | The downstream task becomes an isolated | High |
Downstream task depends on multiple parent nodes | The downstream task may start before all required upstream data is ready, leading to missing data or calculation errors. | Medium |
Downstream task depends on a cross-cycle | If a | Medium |
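Before removing a dependency, it can help to list the downstream nodes that would be left without any upstream node. The following Python sketch does this under the assumption that the dependency graph is available as a mapping from each node to its upstream nodes; the function name and the graph are hypothetical.

```python
def nodes_left_without_upstream(upstreams, removed_node):
    """List downstream nodes whose only upstream node is `removed_node`."""
    return sorted(
        node
        for node, parents in upstreams.items()
        if set(parents) == {removed_node}
    )


# Hypothetical graph: B depends only on A, C depends on A and B.
upstreams = {"B": ["A"], "C": ["A", "B"], "D": ["C"]}

at_risk = nodes_left_without_upstream(upstreams, "A")
print(at_risk)
```

Nodes with other remaining upstream nodes, such as C here, are not listed, but they may still start before all required data is ready and deserve a separate review.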
Use cases
Layered offline data warehouses: Configure dependencies across the entire ODS → DWD → DWS → ADS pipeline to ensure data is produced in the correct order at each layer.
Standard ETL pipelines: Configure a same-cycle dependency to ensure a downstream task waits for its upstream instance to succeed, guaranteeing sequential and consistent data processing.
Next-day (T+1) reports: Configure a cross-cycle dependency (with an offset of -1) to make today's task depend on the complete business data from yesterday, enabling accurate next-day analysis and output.
Multi-cycle hybrid aggregation: Configure a cross-cycle dependency to make a daily task depend on all instances of an hourly task, ensuring the underlying data is fully ready before aggregation.
External data readiness triggers: Configure a custom dependency or a checker node to confirm that an external file has arrived or an API is ready before triggering a workflow, enabling cross-system scheduling.
Complex workflow control: Use a zero load node to aggregate multiple dependency branches, acting as a control milestone to simplify the pipeline structure and improve monitoring visibility.
FAQ
The following section describes common scenarios. For more frequently asked questions about scheduling dependencies, see FAQ about dependencies.
Node uniqueness
A node can have different configurations but is logically unique: A single node can have different scheduling dependency configurations in the development environment and the production environment. Although its form may differ, it remains a single, unique node.
Downstream dependencies must be removed in both environments before decommissioning a node: Due to node uniqueness, to ensure downstream tasks run correctly, you must first remove the dependency from the downstream node's configuration, reconfigure it to depend on a new upstream node, and then commit and deploy the change. Decommission the original upstream task only after confirming its dependency has been removed from both the development environment and the production environment.
Instance generation modes
When you create a node, ensure that its instance generation mode is the same as that of its upstream and downstream nodes. If the instance generation mode differs, for example if an upstream node generates an instance on the current day but a downstream node generates it on the next day, the downstream instance may become an isolated node.
If you change the scheduling cycle of an existing node and choose to generate instances immediately after deployment, the previously generated instances are not automatically deleted. This can lead to misaligned dependencies for the instances on the day of the deployment. For more information, see Impact of real-time instance conversion on same-day instance dependencies.
An error indicating more than 200 upstream dependencies is reported when updating a job by using an API.
Error details: 'One file could not have more than 200 inputs'.
You can add a zero load node in DataStudio between the upstream and downstream nodes to reduce the number of direct upstream dependencies. For more information about how to configure a zero load node, see Zero load node.
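The grouping can be planned with a short Python sketch before you create the zero load nodes. The function name, the generated names such as zero_load_1, and the upstream node names are hypothetical. The limit of 200 comes from the error message above.

```python
def group_upstreams(upstream_nodes, limit=200):
    """Split upstream nodes into groups, one zero load node per group.

    Returns a dict: hypothetical zero load node name -> its upstream nodes.
    An empty dict means no intermediate nodes are needed.
    """
    if len(upstream_nodes) <= limit:
        return {}
    groups = {}
    for start in range(0, len(upstream_nodes), limit):
        name = f"zero_load_{start // limit + 1}"
        groups[name] = upstream_nodes[start:start + limit]
    return groups


# 450 hypothetical upstream nodes.
upstream_nodes = [f"node_{i}" for i in range(450)]

plan = group_upstreams(upstream_nodes)
print({name: len(members) for name, members in plan.items()})
```

The downstream node would then depend on the three zero load nodes instead of 450 nodes directly. If the number of groups itself exceeded the limit, another layer of zero load nodes would be needed.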