Use a data quality monitoring node

Background

The data quality feature in DataWorks detects changes in source data and identifies dirty data generated during the extract, transform, and load (ETL) process. It can automatically block problematic tasks so that dirty data does not propagate to downstream nodes. This keeps unexpected data from affecting normal operations and business decisions, reduces troubleshooting time, and avoids the resource cost of rerunning tasks. For more information, see Data quality.

Limitations

Supported table types: MaxCompute, E-MapReduce, Hologres, CDH Hive, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and StarRocks.
Table monitoring scope:
- You can monitor only tables in a data source bound to the same workspace as the data quality monitoring node.
- Each node can monitor only one table, but you can configure multiple monitoring rules for that table. The monitoring scope varies by table type:
  - For a non-partitioned table, the entire table is monitored by default.
  - For a partitioned table, you must use a partition filter expression to specify a partition to monitor.
  Note
  To monitor multiple tables, create multiple data quality monitoring nodes.
Operation limits:
- Data quality monitoring rules created in DataStudio can be run, modified, published, and managed only in DataStudio. You can view these rules in the Data Quality module, but you cannot manage them there.
- If you modify the monitoring rules in a data quality monitoring node and then publish the node, the original monitoring rules are replaced.

Prerequisites

- A business flow is created.
  In DataStudio, development operations for different data sources rely on business flows, so you must create a business flow before you can create a node. For more information, see Create a business flow.
- A data source is created and bound to the current workspace, and the table to be monitored exists in that data source.
  Before you run a data quality monitoring task, create the table that you want to monitor in your data source. For more information, see Manage data sources, Manage compute engine resources, and Develop a node.
- A resource group is created.
  Data quality monitoring nodes can run only on a Serverless resource group. For more information, see Manage resource groups.
- (Optional, for RAM users) The RAM user for task development is added to the workspace and granted the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions, so grant it with caution. For more information about adding members and granting permissions, see Add workspace members.

Step 1: Create a data quality monitoring node

Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
Right-click the target business flow and choose Create Node > Data Quality > Data Quality Monitoring.
In the Create Node dialog box, enter a Name for the node and click OK. After the node is created, you can develop and configure the task on the node's configuration page.

Step 2: Configure data quality monitoring rules

1. Select table to monitor

Click Add Table. In the Add Table dialog box, search for and select the target table to monitor.

2. Configure data monitoring scope

For a non-partitioned table, the entire table is monitored by default. You can skip this step.
For a partitioned table, select the partition that you want to monitor. You can use scheduling parameters. Click Preview to verify that the partition filter expression is calculated correctly.

3. Configure quality rules

You can create new rules or import existing ones. By default, configured rules are enabled.

Create a new rule

Click Create Rule to create a data quality monitoring rule from a template or by using custom SQL. The following sections describe these methods.

Method 1: From a system template

The platform provides various built-in monitoring rules. You can use these rule templates to quickly create a data quality monitoring rule. The following figure shows the procedure.

Note

You can also find the required rule template in the system template list on the left and click +Use to create a rule.

System rule template parameters

Parameter	Description
Rule Name	A custom name for the rule.
Template	Defines the type of check to perform on the table. DataWorks provides a wide range of built-in templates for both table-level and field-level monitoring. For more information, see View built-in rule templates. Note The templates for average, sum, minimum, and maximum values apply only to numeric fields.
Rule Scope	The scope to which the rule applies. For a table-level rule, the scope defaults to the current table. For a field-level rule, select a specific field.
Comparison Method	Defines how the rule checks whether the table data meets expectations. Manual Settings: Manually define how the data output is compared against the rule. The available comparison methods vary depending on the rule template. The actual options available are shown in the UI. For Numeric Type results, you typically compare the output to a fixed value (expected value). The comparison methods include Greater Than, Greater Than or Equal To, Equal To, Unequal To, Less Than, and Less Than or Equal To. You can define a range for normal data (normal threshold) and abnormal data (error threshold). For Fluctuation results, you typically perform a range comparison. The comparison methods include Absolute Value, Raise, and Drop. You can define a range for normal data (normal threshold) and also define thresholds for data with anomalies (warning threshold) and data that does not meet expectations (error threshold) based on the degree of deviation. Intelligent Dynamic Threshold: You do not need to manually configure fluctuation thresholds or expected values. The system uses an intelligent algorithm to determine a reasonable threshold. If a data anomaly is detected, an alert is triggered or the task is blocked. Dynamic thresholds also support strong and weak rules. Note Only Custom SQL, Custom Range, and Dynamic Threshold data quality rules support the intelligent dynamic threshold comparison method.
Monitoring Threshold	If you set Comparison Method to Manual Settings, you can specify the Normal threshold and Error Threshold. Normal threshold: If the check result meets this value, the data passes validation. Error Threshold: If the check result meets this value, the data fails to meet expectations. If the rule is a Fluctuation check, you must specify a Warning Threshold. Warning Threshold: When the data quality check result meets the value set here, it indicates that an anomaly exists in the data but it does not affect business operations.
Retain problem data	If an enabled rule fails its check, the system automatically creates a table to store the identified problematic data. For more information, see Manage problematic data. Important This feature is currently supported for MaxCompute and Hologres tables, and only some data quality monitoring rules support it. For a list of rules that support retaining problematic data, see Appendix: List of rules that support retaining problematic data and their specifications. If the rule status is Deactivate, problematic data is not retained.
Status	The Enable and Deactivate statuses control whether the rule runs in the production environment. Important When the status is set to Deactivate, the rule cannot trigger a test run and will not be triggered by associated scheduling tasks.
Degree of importance	The importance of the rule in your business workflow. Strong rule: A critical rule. If an error alert occurs, the system blocks the associated scheduled task by default. Weak rule: A standard rule. If an error alert occurs, the system does not block the associated scheduled task by default.
Configuration Source	The source of the rule configuration, which defaults to Data Quality.
Description	You can add a description for the rule.
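
The interaction between fluctuation thresholds and rule importance described in the table can be sketched as follows. This is a minimal illustration, not the DataWorks implementation: the names `FluctuationThresholds`, `fluctuation_rate`, `classify`, and `blocks_by_default` are hypothetical, and treating the thresholds as inclusive lower bounds on the absolute deviation is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FluctuationThresholds:
    """Hypothetical container for the thresholds of a fluctuation rule."""
    warning: float  # deviation at or above this value counts as a warning
    error: float    # deviation at or above this value counts as an error


def fluctuation_rate(current: float, baseline: float) -> float:
    """Relative change of the current metric against a baseline value."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline


def classify(rate: float, thresholds: FluctuationThresholds) -> str:
    """Map a fluctuation rate to 'normal', 'warning', or 'error'.

    Assumption: the comparison uses the absolute value of the rate and
    the thresholds are inclusive.
    """
    deviation = abs(rate)
    if deviation >= thresholds.error:
        return "error"
    if deviation >= thresholds.warning:
        return "warning"
    return "normal"


def blocks_by_default(importance: str, outcome: str) -> bool:
    """A strong rule blocks the scheduled task on an error alert by default;
    a weak rule does not."""
    return importance == "strong" and outcome == "error"


thresholds = FluctuationThresholds(warning=0.10, error=0.50)
outcome = classify(fluctuation_rate(current=1200, baseline=1000), thresholds)
print(outcome, blocks_by_default("strong", outcome))
```

In this sketch a 20% rise falls between the two thresholds, so it is classified as a warning and would not block the task even for a strong rule.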

Method 2: From a custom template

Before you use this method, go to Data Quality > Quality Assets > Rule Template Library to create a custom rule template. You can then create a data quality monitoring rule based on that template. For more information, see Create and manage custom rule templates.

The following figure shows how to create a data quality rule from a custom template.

Note

You can also find the required rule template in the custom template list on the left and click +Use to create a rule.

Custom rule template parameters

This section describes only parameters that are unique to custom rule templates. For information about other parameters, see the system rule template parameter descriptions.

Parameter	Description
Flag parameters	Defines the SET commands that need to be executed before the data quality check SQL is run.
SQL	Defines the complete SQL check logic. The query must return a single numeric value.

In the custom SQL, use square brackets to match the partition filter expression of the table. Example:

SELECT count(*) FROM ${tableName} WHERE ds=$[yyyymmdd];

Note

The system dynamically replaces the ${tableName} variable with the name of the monitored table.
For more information about how to configure a partition filter expression, see .
If a quality monitoring rule already exists for the table, the Data Scope set in the quality monitoring settings does not take effect for a rule configured with this method. Instead, the WHERE clause in this SQL statement determines which table partition is checked.
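
To make the placeholder substitution in the template example concrete, the following sketch renders the SQL for one run. It only illustrates the idea: `render_check_sql` is a hypothetical helper, the table name `orders` is invented, and the date that `$[yyyymmdd]` resolves to is passed in by the caller because the actual resolution is performed by the platform.

```python
from datetime import date


def render_check_sql(template: str, table_name: str, run_date: date) -> str:
    """Fill in the two placeholders used in the template example.

    ${tableName} is replaced with the monitored table's name.
    $[yyyymmdd] is replaced with run_date formatted as yyyymmdd
    (assumption: the caller supplies the date the platform would use).
    """
    sql = template.replace("${tableName}", table_name)
    return sql.replace("$[yyyymmdd]", run_date.strftime("%Y%m%d"))


template = "SELECT count(*) FROM ${tableName} WHERE ds=$[yyyymmdd];"
print(render_check_sql(template, "orders", date(2024, 1, 15)))
```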

Method 3: From custom SQL

This method lets you define custom data quality check logic for a table.

Custom SQL parameters

This section describes only parameters that are unique to custom SQL. For information about other parameters, see the system rule template parameter descriptions.

Parameter	Description
Flag parameters	Defines the SET commands that need to be executed before the data quality check SQL is run.
SQL	Defines the complete SQL check logic. The query must return a single row and a single column, and the result must be a numeric value.

In the custom SQL, use square brackets to match the partition filter expression of the table. Example:

SELECT count(*) FROM <table_name> WHERE ds=$[yyyymmdd];

Note

You must replace <table_name> with the name of the target table. This SQL statement determines which table is monitored.
For more information about how to configure a partition filter expression, see .
If a quality monitoring rule already exists for the table, the Data Scope set in the quality monitoring settings does not take effect for a rule configured with this method. Instead, the WHERE clause in this SQL statement determines which table partition is checked.
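
Before saving a custom SQL rule, it can help to confirm locally that the query shape meets the single-row, single-column, numeric requirement. The sketch below does this against an in-memory SQLite database as a stand-in for the real engine. The `orders` table, its columns, and the helper `run_single_value_check` are hypothetical, and the partition value is written as a literal instead of `$[yyyymmdd]`.

```python
import sqlite3


def run_single_value_check(conn: sqlite3.Connection, sql: str):
    """Run a check query and enforce the one-row, one-column, numeric shape."""
    rows = conn.execute(sql).fetchall()
    if len(rows) != 1 or len(rows[0]) != 1:
        raise ValueError("check SQL must return exactly one row and one column")
    value = rows[0][0]
    if not isinstance(value, (int, float)):
        raise ValueError("check SQL must return a numeric value")
    return value


# Stand-in data: two rows in the checked partition, one with a missing ID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, ds TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "20240115"), (None, "20240115"), (3, "20240114")],
)

# Count rows with a missing order_id in one partition.
null_ids = run_single_value_check(
    conn,
    "SELECT count(*) FROM orders WHERE ds = '20240115' AND order_id IS NULL",
)
print(null_ids)
```

A query such as `SELECT order_id, ds FROM orders` would be rejected by this helper because it returns several rows and two columns.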

Import existing rules

If monitoring rules for the target table already exist in the Data Quality module, you can import them to quickly clone the rules. If no rules exist, you must first create them in the Data Quality module. For more information, see Configure rules for a single table.

Note
This method supports importing multiple rules in a batch and allows you to configure monitoring rules at the field level.

Click Import rules. You can search for and select the rules to import by rule ID or name, rule template, or associated scope (the entire table or specific fields).

Note

After you publish a data quality monitoring node, you can view the details of its quality monitoring rules in the Data Quality module. However, you cannot perform management operations, such as modifying or deleting the rules, in that module.

4. Configure compute resources

Select the compute resources required to run the quality rule check. This specifies the data source in which the data quality monitoring task runs. By default, the data source of the monitored table is used.

Note

If you select a different data source, ensure that it has access permissions to the table.

Step 3: Configure policy for check results

In the Handling Policy section of the node editor, you can configure how to handle exceptions from data quality rule checks and how to subscribe to notifications.

Exception categories

Each exception category combines a rule severity with a check outcome. The six categories are:

- Strong rule - check failed
- Strong rule - error alert
- Strong rule - warning alert
- Weak rule - check failed
- Weak rule - error alert
- Weak rule - warning alert

The terms used in these categories are defined as follows:

- Severity (Strong/Weak): Indicates the importance of the rule.
- Error: The data check metric reached the error threshold. This typically means the data does not meet expectations and will severely impact downstream operations.
- Warning: The data check metric reached the warning threshold. This typically means an exception was found, but it does not affect downstream operations.
- Check Failed: The check task failed. For example, this can happen if the monitored partition is not generated or the SQL statement used for monitoring fails to execute.

Exception handling policy

Configure a policy to handle exceptions found by the rule check as needed:

Do not ignore: Configure the system to stop the current node and set its status to failed when a specific exception category (for example, an error alert for a strong rule) is detected.
Note
- If the current node fails, its downstream nodes will not run. This blocks the production pipeline and prevents problematic data from spreading.
- You can add multiple exception categories for detection.
- This policy is typically used when an exception has a major impact and needs to block the execution of downstream tasks.
Ignore: Ignore the exception and continue to run downstream nodes.
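
The handling policy can be thought of as a set of exception categories that must not be ignored. The following sketch models that decision under this assumption. `ExceptionCategory` and `node_should_fail` are hypothetical names, and the logic is an illustration, not the platform's implementation.

```python
from enum import Enum


class ExceptionCategory(Enum):
    STRONG_CHECK_FAILED = "Strong rule - check failed"
    STRONG_ERROR = "Strong rule - error alert"
    STRONG_WARNING = "Strong rule - warning alert"
    WEAK_CHECK_FAILED = "Weak rule - check failed"
    WEAK_ERROR = "Weak rule - error alert"
    WEAK_WARNING = "Weak rule - warning alert"


def node_should_fail(detected, do_not_ignore) -> bool:
    """Return True if any detected category is configured as 'Do not ignore'.

    A failed node stops its downstream nodes from running; categories that
    are not in do_not_ignore are ignored and the pipeline continues.
    """
    return bool(set(detected) & set(do_not_ignore))


# Example policy: block only on the most severe strong-rule outcomes.
policy = {ExceptionCategory.STRONG_ERROR, ExceptionCategory.STRONG_CHECK_FAILED}

print(node_should_fail([ExceptionCategory.WEAK_WARNING], policy))
print(node_should_fail(
    [ExceptionCategory.WEAK_WARNING, ExceptionCategory.STRONG_ERROR], policy
))
```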

Exception notification method

You can configure how to receive exception notifications, such as by email. When an exception occurs, the platform sends a notification using the specified method so you can promptly find and handle the exception.

Note

The platform supports multiple notification methods. The methods available on the UI may vary. Note the following:

For Email, Email and SMS, and Phone, you can select only users within the current account as recipients. Confirm that the email addresses or phone numbers of the relevant personnel are configured correctly. For more information, see View and set alert contacts.
For other methods, you must enter a webhook URL. For information about how to obtain a webhook URL, see Obtain a webhook URL.

Step 4: Configure task scheduling

If you need to run the node task periodically, click Scheduling Settings on the right side of the node editor. In the Properties pane, configure the scheduling information for the task based on your business requirements. For more information, see Configure scheduling properties for a node.

Note

You must set the Rerun attribute and Parent Nodes properties for the node before you can submit it.

Step 5: Debug the task

Perform the following debugging operations as needed to verify that the task runs as expected.

(Optional) Select a resource group and assign values to custom parameters.
- Click the icon in the toolbar. In the Parameter dialog box, select the scheduling resource group to use for debugging.
- If your task uses scheduling parameters, you can assign values to the variables here for debugging. For more information about the parameter assignment logic, see Task debugging process.
  
  The following figure shows an example of scheduling parameter configuration.
Save and run the task.

Click the Save icon in the toolbar to save the task, and then click the Run icon to run it.

After the task is complete, you can view the run result at the bottom of the node editor. If the run fails, troubleshoot the issue based on the error message.
(Optional) Perform smoke testing.

If you want to perform smoke testing in the development environment to check whether the scheduling node task runs as expected, you can perform smoke testing when you submit the node or after it is submitted. For more information, see Perform smoke testing.

Step 6: Submit and deploy the task

After configuring the node task, you must submit and deploy it. After the task is deployed, the node runs periodically based on its scheduling configuration.

Note

Submitting and deploying the node also submits and deploys its configured quality rules.

Click the Save icon in the toolbar to save the node.
Click the Submit icon in the toolbar to submit the node task.

When you submit the task, enter a Change Description in the Submission dialog box. If needed, you can also select whether to perform a code review after the node is submitted.
Note
- You must set the Rerun attribute and Parent Nodes properties for the node before you can submit it.
- Code review helps control the quality of task configurations and prevents errors from unreviewed deployments. If you perform a code review, the submitted node can be deployed only after it is approved by a reviewer. For more information, see Code review.

If you use a workspace in standard mode, you must click Deploy in the upper-right corner of the node editor after the task is submitted. This deploys the task to the production environment. For more information, see Deploy tasks.

Next steps

Operations and maintenance: After the task is submitted and deployed, it runs periodically based on the node's configuration. You can click O&M in the upper-right corner of the node editor to go to the Operation Center, where you can view the scheduling and running status of the periodic task, including the node status and details of triggered rules. For more information, see Manage periodic tasks.
Data Quality: After the data quality monitoring rules are deployed, you can also go to the Data Quality module to view rule details. However, you cannot manage the rules, such as modifying or deleting them. For more information, see Data quality.
