In DataWorks, you create data development tasks as nodes, each encapsulating a task for a specific compute engine. DataStudio also lets you develop complex tasks by using resources, functions, and various logical processing nodes. This topic describes the common data development workflow.
Prerequisites
- The required data sources must be bound. For more information, see Prepare for data development: Bind a compute resource or cluster.
- You must have the Development role. For more information about how to grant permissions, see Add workspace members and manage member role permissions.
Go to DataStudio
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
The following procedure describes how to create tasks in DataStudio.
Development workflow
The following figure shows the common workflow for data development.
| Step | Description | Related documentation |
| --- | --- | --- |
| Step 1: Create a workflow | In DataWorks, you organize all development tasks and code into workflows. | |
| Step 2: Create a table | DataWorks provides a visual interface to create and manage tables, which are displayed in a directory structure. Before you start data development, you must create tables in the engine to store raw data and the results of data cleansing. The type of table you create depends on your specific use case. | Create and use tables: View and manage tables: |
| Step 3: Create and upload a resource (Optional) | If your development process requires resources such as text files or JAR packages, you can upload and manage them in DataWorks for use with a specific compute engine. Note: The supported compute engines and resource types are displayed in the DataWorks UI. | |
| Step 4: Create a scheduling node | In DataWorks, you develop tasks by using nodes. Tasks for different compute engines are encapsulated into different types of nodes. You can select the appropriate node type for your engine task based on your business needs. DataWorks also provides convenient node management operations. For example, you can use node groups to clone nodes in batches or use the recycle bin to quickly restore deleted nodes. | DataWorks supports multiple engines, including: Different node types are available for different tasks in each engine. For a detailed list of supported node types, see Supported node types. For more information about node management, see the following topics: |
| Step 5: Reference a resource in a node (Optional) | To use a resource in DataWorks, you must load it into the node's runtime environment before using it in the node. | |
| Step 6: Register a function (Optional) | If your development process requires a function, you can register it by using the visual interface of DataWorks. Before you register a function, you must upload the function's required resources to DataWorks. Note: The compute engines that support function registration are displayed in the DataWorks UI. | |
| Step 7: Edit node code | On the node editor page, write business code by using the syntax of the corresponding engine and database. The syntax may vary for different node types. Note: After you finish editing the code, save it promptly. | For a detailed list of supported node types, see Supported node types. Usage notes for common engines: |
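The ordering constraints among the steps above can be summarized in a short sketch. The Python below is purely illustrative: `plan_steps` and its parameters are hypothetical names, not part of any DataWorks API. It only models which steps apply and in what order, assuming that optional steps are skipped when they are not needed and that registering a function implies uploading its resources first.

```python
from typing import List


def plan_steps(needs_resource: bool = False, needs_function: bool = False) -> List[str]:
    """Return the development steps, in order, for a hypothetical task.

    Illustrative only: the names are assumptions and no DataWorks API is called.
    """
    steps = ["Create a workflow", "Create a table"]

    # A function's required resources must be uploaded before it is registered,
    # so the upload step applies whenever a resource or a function is needed.
    if needs_resource or needs_function:
        steps.append("Create and upload a resource")

    steps.append("Create a scheduling node")

    # A resource is referenced in a node only if the node code uses it directly.
    if needs_resource:
        steps.append("Reference a resource in a node")

    if needs_function:
        steps.append("Register a function")

    steps.append("Edit node code")
    return steps


if __name__ == "__main__":
    for step in plan_steps(needs_function=True):
        print(f"- {step}")
```

Calling `plan_steps()` with no arguments returns only the four mandatory steps, which may be a useful baseline when a task needs neither resources nor functions.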
Next steps
After you develop a task, perform the following operations as needed:
- Debug the code: Run a single task or an entire workflow to debug it. For more information, see Task debugging process.
- Configure scheduling: Configure scheduling properties for the node. The node then runs periodically based on these settings. For more information, see Configure scheduling for a node.
- Commit and deploy the node: After development is complete, commit the node to the target environment for scheduled execution. If you use a workspace in standard mode, after a successful commit, you must also deploy the node by clicking Deploy in the upper-right corner. For more information, see Deploy a node.
- Perform O&M operations: After a node is deployed, it appears in Operation Center for the production environment by default. Go to Operation Center to view the run status of nodes and perform related O&M. For more information, see Operation Center overview.
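As a rough illustration of how the workspace mode affects these follow-up operations, the sketch below lists them in order. The function name `next_steps` and the `standard_mode` flag are hypothetical and exist only for this example. The sketch assumes every operation is wanted, whereas in practice you perform them as needed.

```python
from typing import List


def next_steps(standard_mode: bool) -> List[str]:
    """List follow-up operations after development (illustrative names only)."""
    actions = [
        "Debug the code",
        "Configure scheduling",
        "Commit the node",
    ]

    # In a standard-mode workspace, a successful commit must be followed by a deploy.
    if standard_mode:
        actions.append("Deploy the node")

    actions.append("View run status in Operation Center")
    return actions


if __name__ == "__main__":
    for action in next_steps(standard_mode=True):
        print(f"- {action}")
```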