In DataWorks, a workflow helps you organize data development tasks from a business perspective. You can create an auto triggered workflow for tasks that require periodic scheduling or a manually triggered workflow for on-demand tasks. This topic describes how to create, design, commit, and view workflows, and how to modify or delete nodes in bulk.
Background
A workspace can support multiple types of compute engines and contain multiple workflows. A workflow is a collection of various objects, including nodes for different engines such as Data Integration, MaxCompute, Hologres, and E-MapReduce (EMR). Examples include MaxCompute SQL nodes and MaxCompute table nodes.
Each object type has its own folder, and you can create subfolders within them. To ensure manageability, we recommend a subfolder depth of no more than four levels. If your structure exceeds this, consider splitting the workflow into two or more smaller workflows. You can then group these related workflows into a solution to improve efficiency.
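The depth recommendation above can be checked mechanically. The following is a minimal sketch, not a DataWorks API: it assumes node locations are available as slash-separated path strings, and the function names and sample paths are hypothetical.

```python
from pathlib import PurePosixPath

MAX_SUBFOLDER_DEPTH = 4  # recommended limit stated in the text above


def subfolder_depth(node_path: str, type_folder: str) -> int:
    """Count the subfolder levels between an object-type folder and a node."""
    relative = PurePosixPath(node_path).relative_to(type_folder)
    # The last path component is the node itself, so it is not a subfolder.
    return len(relative.parts) - 1


def too_deep(node_paths, type_folder):
    """Return the paths that are nested deeper than the recommended limit."""
    return [
        path for path in node_paths
        if subfolder_depth(path, type_folder) > MAX_SUBFOLDER_DEPTH
    ]


# Hypothetical sample paths, used only for illustration.
paths = [
    "MaxCompute/ods/orders/load_orders",         # 2 subfolder levels
    "MaxCompute/dw/sales/eu/2024/q1/agg_sales",  # 5 subfolder levels
]
print(too_deep(paths, "MaxCompute"))
```

Any path that this check reports is a candidate for moving into a separate, smaller workflow.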
Auto triggered and manually triggered workflows
DataWorks provides two types of workflows: auto triggered workflows and manually triggered workflows. These types cater to periodic and on-demand execution scenarios, respectively. You can develop tasks for both workflow types in DataStudio and then commit them to Operation Center to run in the production environment. The following table compares the two types.
Comparison item | Auto triggered workflow | Manually triggered workflow |
Use cases | For tasks that run automatically on a recurring schedule. | For tasks that are triggered manually on demand. |
Execution method | Runs automatically on a schedule. | Triggered manually. |
Key configuration points for task development | You must configure periodic scheduling parameters, such as the scheduled time and node dependencies. | Because tasks are triggered manually, you do not need to configure scheduling-related parameters, such as parent node dependencies, node outputs, or scheduled run times. Note: Except for these scheduling parameters, the configuration is identical to that of an auto triggered workflow. The task development examples in this topic are based on an auto triggered workflow. The configuration steps for shared parameters are the same for both workflow types. |
How to create | In the left-side navigation pane, click DataStudio. On the DataStudio page, click + New in the upper-right corner and select New Workflow from the drop-down list. | In the left-side navigation pane, select Manually Triggered Workflow. Click + New in the upper-right corner and select New Workflow from the drop-down list. |
If you cannot find the corresponding entry point in the left-side navigation pane, see Customize module display in DataStudio to configure the modules displayed in DataStudio.
Create an auto triggered workflow
In DataStudio, all development occurs within a workflow. Therefore, you must create a workflow before you can create a node. Before you create a workflow, you can refer to the Design a workflow section to plan a workflow that meets your business requirements. The following steps describe how to create an auto triggered workflow.
Go to the DataStudio page.
Log on to the DataWorks console. In the target region, click in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
Hover over the icon and click Create Workflow.
In the Create Workflow dialog box, enter a Workflow Name and Description.
Click Create.
After the workflow is created:
You can start developing tasks under the corresponding engine within the workflow. For general development guidelines, see Develop business logic.
After you finish developing tasks, you can commit the workflow to deploy the tasks to the production environment. For more information, see Commit a workflow.
You can also manage the workflow from various pages. For more information, see View all workflows, Delete nodes in a workflow, Quickly copy a workflow, and Quickly import and export multiple workflows to other DataWorks workspaces or open-source engines.
Design a workflow
You perform all code development within a workflow. DataWorks uses a hierarchical structure for data development: a solution groups a set of related workflows, and each workflow has a directory tree with categories such as Data Integration, DataStudio, table, resource, function, algorithm, Data Service, and control, which lets you organize your code. You can develop nodes in two ways: create nodes from the directory tree in list view, or double-click a workflow to open its canvas and drag development components (nodes) onto it to visually orchestrate multi-engine tasks as a DAG. When you design a workflow, note the following:
A large number of nodes in a workflow can degrade performance. We recommend limiting a single workflow to 100 nodes.
Note: You can create a maximum of 1,000 nodes in a workflow.
In DAG mode, you can drag dependency lines to configure scheduling dependencies between nodes. You can also open the scheduling configuration panel for a node to manually edit its dependencies. For more information, see Configure scheduling dependencies.
For nodes created in list mode, you can configure scheduling dependencies based on code lineage. For more information, see Configure scheduling dependencies.
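The design guidance above (node-count limits and parent-to-child scheduling dependencies) can be illustrated with a small sketch. This is not how DataWorks stores or validates workflows. It is a hypothetical model in which a workflow is a mapping from each node name to the set of its parent nodes, and all node names are invented for illustration. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

RECOMMENDED_NODES = 100  # recommended per-workflow limit from the text above
MAX_NODES = 1000         # maximum number of nodes from the text above


def check_workflow(dependencies: dict[str, set[str]]) -> list[str]:
    """Validate the node count and return an order in which parents run first.

    `dependencies` maps each node name to the set of its parent nodes.
    """
    nodes = set(dependencies)
    for parents in dependencies.values():
        nodes |= parents
    if len(nodes) > MAX_NODES:
        raise ValueError(f"{len(nodes)} nodes exceeds the limit of {MAX_NODES}")
    if len(nodes) > RECOMMENDED_NODES:
        print(f"warning: {len(nodes)} nodes; consider splitting the workflow")
    # static_order() raises CycleError if the dependencies form a loop.
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-node workflow: sync, then cleanse, then report.
order = check_workflow({
    "sync_orders": set(),
    "clean_orders": {"sync_orders"},
    "report": {"clean_orders"},
})
print(order)  # ['sync_orders', 'clean_orders', 'report']
```

A dependency loop, for example two nodes that list each other as parents, makes `check_workflow` raise `CycleError`, which mirrors the requirement that scheduling dependencies form a DAG.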
Develop business logic
DataWorks encapsulates engine capabilities, allowing you to develop data by using engine nodes without interacting with complex engine command lines. You can also use the platform's general-purpose nodes to handle complex logic.
Within a workflow, you can develop your business logic by using synchronization nodes and compute engine nodes.
In the Data Integration folder, you can use batch synchronization and real-time synchronization component nodes to synchronize data from one database to another.
You can cleanse data by using data development nodes under the corresponding engine group in the workflow. An example is an ODPS SQL node for the MaxCompute engine. If you need to use resources or functions during code development, DataWorks allows you to create resource nodes and function nodes through a visual interface.
For more information about the engine capabilities that DataWorks encapsulates and the development capabilities that the product supports, see DataStudio (Legacy).
For information about node scheduling dependencies and property configuration, see Scheduling configuration.
Commit a workflow
In a workspace in standard mode, the DataStudio interface serves as the development and testing environment for node tasks. To publish your code to the production environment, you need to commit the nodes in the workflow and then publish them in bulk from the deployment page.
After you design and test the workflow, click the icon in the toolbar.
In the Submission dialog box, select the nodes that you want to commit, fill in Remarks, and select whether to Ignore I/O Inconsistency Alerts based on your business requirements. If your input and output content does not match the code lineage analysis and you do not select Ignore I/O Inconsistency Alerts, an alert is triggered. For more information, see An alert that indicates a mismatch between I/O and code lineage is reported when I commit a node.
Click Submission.
Note: If a node has been committed and its content is unchanged, it cannot be re-selected. In this case, fill in Remarks and click Submission. Changes to node properties are still committed.
Publish the tasks. For more information, see Task publishing process for a workspace in standard mode.
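The selection rule in the note above can be expressed as a short sketch. This only models the stated behavior and does not call any DataWorks interface. The `Node` class, its fields, and the sample node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    committed: bool        # has the node been committed before?
    content_changed: bool  # has its content changed since the last commit?


def selectable_for_commit(nodes):
    """Return the names of nodes that can be selected in the dialog box.

    Per the note above, a node that has already been committed and whose
    content is unchanged cannot be re-selected.
    """
    return [n.name for n in nodes if not n.committed or n.content_changed]


# Hypothetical nodes, used only for illustration.
nodes = [
    Node("sync_orders", committed=True, content_changed=False),
    Node("clean_orders", committed=True, content_changed=True),
    Node("report", committed=False, content_changed=False),
]
print(selectable_for_commit(nodes))
```

The sketch covers node content only. As the note states, property changes on a node that cannot be re-selected are still committed.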
View operating history
The Operating History page in DataStudio displays all task run records for your account from the last three days.
When you run a task in DataStudio, DataStudio submits it to the corresponding engine service for execution. The task continues to run even if you accidentally close the task tab. On the Operating History page, you can view the run logs or stop a running task.
View all workflows
On the DataStudio page, right-click Workflow and select All Workflows to view all workflows in the workspace.
The page displays a list of workflow cards, including an entry point for New Workflow and cards for existing workflows. Click a card to go to the corresponding workflow canvas.
Manage workflows with solutions
You can group workflows into a solution. Solutions support the following features:
A solution can contain multiple workflows.
The same workflow can be reused across different solutions.
Custom-grouped solutions provide an immersive development experience.
To manage workflows by using solutions, you can perform the following operations:
At the bottom of the left-side navigation pane on the DataStudio page, click the icon. On the page, select Show Solution in the File Management module.
Add workflows to a solution in batches by using the solution edit panel. Right-click the target solution node, such as doctest, and click Edit.
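The relationship described in this section, where a solution contains multiple workflows and the same workflow can be reused across solutions, is a many-to-many mapping. The sketch below is an illustrative data model only. Apart from the solution name doctest, which comes from the example above, all names are hypothetical.

```python
from collections import defaultdict

# Each solution maps to the set of workflows that it groups.
solutions = {
    "doctest": {"orders_wf", "users_wf"},
    "reporting": {"orders_wf"},
}


def solutions_by_workflow(solutions):
    """Invert the mapping: for each workflow, list the solutions that use it."""
    index = defaultdict(set)
    for solution, workflows in solutions.items():
        for workflow in workflows:
            index[workflow].add(solution)
    return dict(index)


index = solutions_by_workflow(solutions)
print(sorted(index["orders_wf"]))  # ['doctest', 'reporting']
```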
Delete nodes in a workflow
Delete nodes in bulk
If you need to modify task scheduling properties, such as changing the resource group used by tasks, or submit tasks for review in bulk, you can use the Batch Operation feature. This feature allows you to filter and process target nodes based on criteria such as node type, workflow, and resource group for scheduling.
Batch operations modify task properties only in the development environment. To apply the changes to the production environment, you must go to the deployment page and publish them after the batch operation is complete.
On the DataStudio page, click the Task List icon next to Workflow to open the Tasks page. The icon is located in the toolbar above the search box.
Modify or delete the target nodes.
You can filter nodes based on criteria such as node name/ID, Node Type, and Workflow.
Select some or all of the nodes that you want to process.
Modify or delete the target nodes.
Modify the target nodes: You can batch-modify only the owner and the resource group for scheduling of the target nodes. Click Change Owner or Change Resource Group for Scheduling to make changes.
Setting the Forcefully Modify parameter to Yes in the modification dialog box allows you to modify all selected nodes. If you set this parameter to No, you can modify only the nodes that you have locked.
Delete the target nodes: Click to delete the selected nodes.
In the Delete Node dialog box, setting the Force deletion parameter to Yes allows you to delete all selected nodes. If you set this parameter to No, you can delete only the nodes that you have locked.
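The effect of the Forcefully Modify and Force deletion parameters described above can be summarized in a few lines. This is a sketch of the stated rule under a simplified model, not DataWorks code. The function name, parameter names, and node names are hypothetical.

```python
def nodes_affected(selected, locked_by_me, force):
    """Return the nodes that a batch modify or delete operation applies to.

    selected:     node names selected on the Tasks page
    locked_by_me: set of node names that the current user has locked
    force:        True models "Yes", False models "No"
    """
    if force:
        # "Yes": the operation applies to every selected node.
        return list(selected)
    # "No": the operation applies only to nodes that the user has locked.
    return [name for name in selected if name in locked_by_me]


# Hypothetical selection, used only for illustration.
selected = ["sync_orders", "clean_orders", "report"]
locked = {"clean_orders"}
print(nodes_affected(selected, locked, force=False))  # ['clean_orders']
print(nodes_affected(selected, locked, force=True))
```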
Delete nodes by using node groups
You can create, reference, split, and delete node groups, which lets you remove a set of nodes together. For more information, see Use node groups.
Quickly copy a workflow
You can use node groups to quickly bundle a workflow into a node group, and then reference this node group in a new workflow. For more information, see Use node groups.
Migrate workflows
To quickly export multiple workflows from one DataWorks workspace and import them into another, use DataWorks Migration Assistant. For more information, see Migration Assistant.