Data quality monitoring node

In DataWorks, a data quality monitoring node lets you configure rules to monitor data quality in your data source tables, for example, by checking for dirty data. You can also define custom scheduling policies to periodically run validation tasks. This topic describes how to use a data quality monitoring node.

Background

The data quality feature in DataWorks detects changes in source data and identifies dirty data generated during the extract, transform, and load (ETL) process. It can automatically block problematic tasks so that dirty data does not propagate to downstream nodes. This keeps unexpected data from affecting normal operations and business decisions, significantly reduces troubleshooting time, and avoids the resource cost of rerunning tasks. For more information, see Data quality.

Limitations

  • Supported table types: MaxCompute, E-MapReduce, Hologres, CDH Hive, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and StarRocks.

  • Table monitoring scope:

    • You can monitor only tables in a data source bound to the same workspace as the data quality monitoring node.

    • Each node can monitor only one table, but you can configure multiple monitoring rules for that table. The monitoring scope varies by table type:

      • For a non-partitioned table, the entire table is monitored by default.

      • For a partitioned table, you must use a partition filter expression to specify a partition to monitor.

      Note

      To monitor multiple tables, create multiple data quality monitoring nodes.

  • Operation limits:

    • Data quality monitoring rules created in DataStudio can be run, modified, published, and managed only in DataStudio. You can view these rules in the Data Quality module, but you cannot manage them there.

    • If you modify the monitoring rules in a data quality monitoring node and then publish the node, the original monitoring rules are replaced.

Prerequisites

  • A business flow is created.

    In DataStudio, development operations for different data sources rely on business flows. Therefore, you must create a business flow before you can create a node. For more information, see Create a business flow.

  • A data source must be created and bound to the current workspace, and the table to be monitored must exist in that data source.

    Before you run a data quality monitoring task, create the table that you want to monitor in your data source. For more information, see Manage data sources, Manage compute engine resources, and Develop a node.

  • A resource group is created.

    Data quality monitoring nodes can run only on a Serverless resource group. For more information, see Manage resource groups.

  • (Optional, for RAM users) The RAM user for task development must be added to the workspace and granted the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions and should be granted with caution. For more information about adding members and granting permissions, see Add workspace members.

Step 1: Create a data quality monitoring node

  1. Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.

  2. Right-click the target business flow and choose Create Node > Data Quality > Data Quality Monitoring.

  3. In the Create Node dialog box, enter a Name for the node and click OK. After the node is created, you can develop and configure the task on the node's configuration page.

Step 2: Configure data quality monitoring rules

1. Select table to monitor

Click Add Table. In the Add Table dialog box, search for and select the target table to monitor.

2. Configure data monitoring scope

  • For a non-partitioned table, the entire table is monitored by default. You can skip this step.

  • For a partitioned table, select the partition that you want to monitor. You can use scheduling parameters. Click Preview to verify that the partition filter expression is calculated correctly.
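
The Preview step resolves the partition filter expression to a concrete partition value. The following is a minimal sketch of that idea, not the platform's implementation. It assumes that `$[yyyymmdd]` is resolved against the task's scheduled time and that an optional `-N` or `+N` suffix shifts the date by N days. The function name `resolve_partition_filter` is hypothetical, and only this one format is handled.

```python
import re
from datetime import datetime, timedelta

# Hypothetical, simplified resolver: handles only $[yyyymmdd], $[yyyymmdd-N], $[yyyymmdd+N].
_EXPRESSION = re.compile(r"\$\[yyyymmdd(?:([+-])(\d+))?\]")


def resolve_partition_filter(expression: str, scheduled_time: datetime) -> str:
    """Replace each $[yyyymmdd±N] placeholder with a date derived from scheduled_time."""

    def _replace(match: re.Match) -> str:
        sign, days = match.group(1), match.group(2)
        offset = int(days) if days else 0
        if sign == "-":
            offset = -offset
        return (scheduled_time + timedelta(days=offset)).strftime("%Y%m%d")

    return _EXPRESSION.sub(_replace, expression)


# Assumed scheduled time: 2024-03-01 00:30. The filter targets the previous day's partition.
preview = resolve_partition_filter("ds=$[yyyymmdd-1]", datetime(2024, 3, 1, 0, 30))
print(preview)  # ds=20240229
```

Comparing the sketch's output with the value shown by Preview is one way to confirm that an offset points at the partition you expect.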

3. Configure quality rules

You can create new rules or import existing ones. By default, configured rules are enabled.

  • Create a new rule

    Click Create Rule to create a data quality monitoring rule from a template or by using custom SQL. The following sections describe these methods.

    Method 1: From a system template

    The platform provides various built-in monitoring rules. You can use these rule templates to quickly create a data quality monitoring rule. The following figure shows the procedure.

    Note

    You can also find the required rule template in the system template list on the left and click +Use to create a rule.

    [Figure: creating a rule from a system template]

    System rule template parameters

    Parameter

    Description

    Rule Name

    A custom name for the rule.

    Template

    Defines the type of check to perform on the table.

    DataWorks provides a wide range of built-in templates for both table-level and field-level monitoring. For more information, see View built-in rule templates.

    Note

    The templates for average, sum, minimum, and maximum values apply only to numeric fields.

    Rule Scope

    The scope to which the rule applies. For a table-level rule, the scope defaults to the current table. For a field-level rule, select a specific field.

    Comparison Method

    Defines how the rule checks whether the table data meets expectations.

    • Manual Settings: Manually define how the data output is compared against the rule.

      The available comparison methods vary depending on the rule template. The actual options available are shown in the UI.

      • For Numeric Type results, you typically compare the output to a fixed value (expected value). The comparison methods include Greater Than, Greater Than or Equal To, Equal To, Unequal To, Less Than, and Less Than or Equal To. You can define a range for normal data (normal threshold) and abnormal data (error threshold).

      • For Fluctuation results, you typically perform a range comparison. The comparison methods include Absolute Value, Raise, and Drop. You can define a range for normal data (normal threshold) and also define thresholds for data with anomalies (warning threshold) and data that does not meet expectations (error threshold) based on the degree of deviation.

    • Intelligent Dynamic Threshold: You do not need to manually configure fluctuation thresholds or expected values. The system uses an intelligent algorithm to determine a reasonable threshold. If a data anomaly is detected, an alert is triggered or the task is blocked. Dynamic thresholds also support strong and weak rules.

      Note

      Only Custom SQL, Custom Range, and Dynamic Threshold data quality rules support the intelligent dynamic threshold comparison method.

    Monitoring Threshold

    • If you set Comparison Method to Manual Settings, you can specify the Normal threshold and Error Threshold.

      • Normal threshold: If the check result meets this value, the data passes validation.

      • Error Threshold: If the check result meets this value, the data fails to meet expectations.

    • If the rule is a Fluctuation check, you must specify a Warning Threshold.

      • Warning Threshold: If the check result meets this value, the data contains an anomaly, but the anomaly does not affect business operations.

    Retain problem data

    If an enabled rule fails its check, the system automatically creates a table to store the identified problematic data. For more information, see Manage problematic data.

    Status

    The Enable and Deactivate statuses control whether the rule runs in the production environment.

    Important

    When the status is set to Deactivate, the rule cannot trigger a test run and will not be triggered by associated scheduling tasks.

    Degree of importance

    The importance of the rule in your business workflow.

    • Strong rule: A critical rule. If an error alert occurs, the system blocks the associated scheduled task by default.

    • Weak rule: A standard rule. If an error alert occurs, the system does not block the associated scheduled task by default.

    Configuration Source

    The source of the rule configuration, which defaults to Data Quality.

    Description

    You can add a description for the rule.
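
The Comparison Method and Monitoring Threshold parameters described above can be read as a small classification step. The sketch below is one possible model, not the platform's logic. The `Threshold` class, the `classify` and `fluctuation_percent` functions, the precedence order (error, then warning, then normal), the `unclassified` fallback, and the percentage interpretation of Absolute Value, Raise, and Drop are all assumptions made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Comparison method labels as listed in the parameter description.
COMPARATORS = {
    "Greater Than": operator.gt,
    "Greater Than or Equal To": operator.ge,
    "Equal To": operator.eq,
    "Unequal To": operator.ne,
    "Less Than": operator.lt,
    "Less Than or Equal To": operator.le,
}


@dataclass(frozen=True)
class Threshold:
    """Hypothetical model of one threshold: a comparison method plus an expected value."""
    comparison: str
    expected: float

    def matches(self, value: float) -> bool:
        return COMPARATORS[self.comparison](value, self.expected)


def fluctuation_percent(current: float, baseline: float, method: str = "Absolute Value") -> float:
    """Assumed reading of Absolute Value / Raise / Drop as a percentage change from baseline."""
    change = (current - baseline) * 100 / baseline
    if method == "Absolute Value":
        return abs(change)
    if method == "Raise":
        return change
    if method == "Drop":
        return -change
    raise ValueError(f"unknown fluctuation method: {method}")


def classify(value: float, normal: Threshold, error: Threshold,
             warning: Optional[Threshold] = None) -> str:
    # Assumption: the most severe matching threshold wins.
    if error.matches(value):
        return "error"
    if warning is not None and warning.matches(value):
        return "warning"
    if normal.matches(value):
        return "normal"
    return "unclassified"  # behavior for this case is not described in the source


# Numeric check: a row count that must be greater than zero.
print(classify(1200, Threshold("Greater Than", 0), Threshold("Equal To", 0)))  # normal
print(classify(0, Threshold("Greater Than", 0), Threshold("Equal To", 0)))     # error

# Fluctuation check against an assumed previous value of 1000 rows.
rate = fluctuation_percent(1150, 1000)  # 15.0
print(classify(rate,
               normal=Threshold("Less Than or Equal To", 10),
               error=Threshold("Greater Than", 20),
               warning=Threshold("Greater Than", 10)))  # warning
```

In this model, whether an `error` result blocks the scheduled task depends on whether the rule is strong or weak, as described under Degree of importance.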

    Method 2: From a custom template

    Before you use this method, go to Data Quality > Quality Assets > Rule Template Library to create a custom rule template. You can then create a data quality monitoring rule based on that template. For more information, see Create and manage custom rule templates.

    The following figure shows how to create a data quality rule from a custom template.

    Note

    You can also find the required rule template in the custom template list on the left and click +Use to create a rule.

    [Figure: creating a rule from a custom template]

    Custom rule template parameters

    This section describes only parameters that are unique to custom rule templates. For information about other parameters, see the system rule template parameter descriptions.

    Parameter

    Description

    Flag parameters

    Defines the SET commands that need to be executed before the data quality check SQL is run.

    SQL

    Defines the complete SQL check logic. The query must return a single numeric value.

    In the custom SQL, use square brackets to match the partition filter expression of the table. Example:

    SELECT count(*) FROM ${tableName} WHERE ds=$[yyyymmdd];

    Note
    • The system dynamically replaces the ${tableName} variable with the name of the monitored table.

    • For more information about how to configure a partition filter expression, see .

    • When you configure a rule with this method for a table that already has quality monitoring rules, the Data Scope set in the quality monitoring settings does not take effect. The rule uses the WHERE clause in this SQL statement to determine which table partition to check.
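
To see how the template SQL above turns into an executable statement, the following sketch substitutes both placeholders and runs the result against an in-memory SQLite table. It is an illustration only: the `render_template_sql` function, the `orders` table, and the sample rows are hypothetical, SQLite stands in for the real data source, and the quoting of the partition value is a choice made for the SQLite stand-in.

```python
import sqlite3


def render_template_sql(template_sql: str, table_name: str, partition_value: str) -> str:
    """Simplified stand-in for the platform's substitution of ${tableName} and $[yyyymmdd]."""
    sql = template_sql.replace("${tableName}", table_name)
    # The value is quoted here because ds is stored as text in the sample table.
    return sql.replace("$[yyyymmdd]", f"'{partition_value}'")


template = "SELECT count(*) FROM ${tableName} WHERE ds=$[yyyymmdd];"
sql = render_template_sql(template, "orders", "20240301")

# Hypothetical sample data: two rows in the checked partition, one in another.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, ds TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "20240301"), (2, "20240301"), (3, "20240229")],
)

row_count = conn.execute(sql).fetchone()[0]
print(sql)        # SELECT count(*) FROM orders WHERE ds='20240301';
print(row_count)  # 2
```

Only the rows matched by the WHERE clause are counted, which mirrors the note above: the SQL statement, not the Data Scope setting, decides which partition is checked.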

    Method 3: From custom SQL

    This method lets you define custom data quality check logic for a table.

    Custom SQL parameters

    This section describes only parameters that are unique to custom SQL. For information about other parameters, see the system rule template parameter descriptions.

    Parameter

    Description

    Flag parameters

    Defines the SET commands that need to be executed before the data quality check SQL is run.

    SQL

    Defines the complete SQL check logic. The query must return a single row and a single column, and the result must be a numeric value.

    In the custom SQL, use square brackets to match the partition filter expression of the table. Example:

    SELECT count(*) FROM <table_name> WHERE ds=$[yyyymmdd];

    Note
    • You must replace <table_name> with the name of the target table. This SQL statement determines which table is monitored.

    • For more information about how to configure a partition filter expression, see .

    • When you configure a rule with this method for a table that already has quality monitoring rules, the Data Scope set in the quality monitoring settings does not take effect. The rule uses the WHERE clause in this SQL statement to determine which table partition to check.
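
Because a custom SQL check must return a single row with a single numeric column, it can help to validate a query locally before you paste it into the rule. The sketch below does this against an in-memory SQLite database. The `run_single_value_check` function, the `users` table, and its rows are hypothetical, and SQLite is only a stand-in for the actual data source.

```python
import sqlite3


def run_single_value_check(conn: sqlite3.Connection, sql: str) -> float:
    """Run a check query and enforce the single-row, single-column, numeric contract."""
    rows = conn.execute(sql).fetchmany(2)  # fetching two rows is enough to detect extras
    if len(rows) != 1:
        raise ValueError("check SQL must return exactly one row")
    if len(rows[0]) != 1:
        raise ValueError("check SQL must return exactly one column")
    value = rows[0][0]
    if not isinstance(value, (int, float)):
        raise ValueError("check SQL must return a numeric value")
    return float(value)


# Hypothetical sample table with one missing email in the checked partition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, ds TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "a@example.com", "20240301"), (2, None, "20240301"), (3, None, "20240229")],
)

null_emails = run_single_value_check(
    conn, "SELECT count(*) FROM users WHERE ds='20240301' AND email IS NULL"
)
print(null_emails)  # 1.0
```

A query such as `SELECT id, email FROM users` would be rejected by this helper, because it returns several rows and more than one column.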

  • Import existing rules

    If monitoring rules for the target table already exist in the Data Quality module, you can import them to quickly clone the rules. If no rules exist, you must first create them in the Data Quality module. For more information, see Configure rules for a single table.

    Note

    This method supports importing multiple rules in a batch and allows you to configure monitoring rules at the field level.

    Click Import rules. You can search for and select the rules to import by rule ID or name, rule template, or associated scope (the entire table or specific fields).

Note

After you publish a data quality monitoring node, you can view the details of its quality monitoring rules in the Data Quality module. However, you cannot perform management operations, such as modifying or deleting the rules, in that module.

4. Configure compute resources

Select the compute resources required to run the quality rule check. This specifies the data source in which the data quality monitoring task runs. By default, the data source of the monitored table is used.

Note

If you select a different data source, ensure that it has access permissions to the table.

Step 3: Configure policy for check results

In the Handling Policy section of the node editor, you can configure how to handle exceptions from data quality rule checks and how to subscribe to notifications.

Exception categories

Each exception category combines a rule severity (strong or weak) with a check outcome (check failed, error alert, or warning alert). The categories are:

  • Strong rule - check failed

  • Strong rule - error alert

  • Strong rule - warning alert

  • Weak rule - check failed

  • Weak rule - error alert

  • Weak rule - warning alert

The terms used in these categories are defined as follows:

  • Severity (Strong/Weak): Indicates the importance of the rule.

  • Error: The data check metric reached the error threshold. This typically means the data does not meet expectations and will severely impact downstream operations.

  • Warning: The data check metric reached the warning threshold. This typically means an exception was found, but it does not affect downstream operations.

  • Check Failed: The check task failed. For example, this can happen if the monitored partition is not generated or the SQL statement used for monitoring fails to execute.
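
The six exception categories are the product of two rule severities and three check outcomes. The following sketch makes that structure explicit. The enum and function names are hypothetical and exist only for illustration; the category labels are taken from this topic.

```python
from enum import Enum


class Severity(Enum):
    STRONG = "Strong rule"
    WEAK = "Weak rule"


class Outcome(Enum):
    CHECK_FAILED = "check failed"
    ERROR = "error alert"
    WARNING = "warning alert"


def exception_category(severity: Severity, outcome: Outcome) -> str:
    """Build the category label used in the handling policy, for example 'Strong rule - error alert'."""
    return f"{severity.value} - {outcome.value}"


# Enumerate every combination: 2 severities x 3 outcomes = 6 categories.
ALL_CATEGORIES = [exception_category(s, o) for s in Severity for o in Outcome]
for category in ALL_CATEGORIES:
    print(category)
```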

Exception handling policy

Configure a policy to handle exceptions found by the rule check as needed:

  • Do not ignore: Configure the system to stop the current node and set its status to failed when a specific exception category (for example, an error alert for a strong rule) is detected.

    Note
    • If the current node fails, its downstream nodes will not run. This blocks the production pipeline and prevents problematic data from spreading.

    • You can add multiple exception categories for detection.

    • This policy is typically used when an exception has a major impact and needs to block the execution of downstream tasks.

  • Ignore: Ignore the exception and continue to run downstream nodes.
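
The effect of the two policies can be sketched as a set intersection: the node fails only if at least one detected exception category is in the Do not ignore list. This is a simplified model under that assumption. The `NodeResult` class, the `apply_handling_policy` function, and the status labels are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class NodeResult:
    status: str                # "succeeded" or "failed" (illustrative labels)
    downstream_runs: bool      # whether downstream nodes are allowed to run
    blocking: FrozenSet[str]   # detected categories that caused the failure


def apply_handling_policy(detected: Iterable[str], do_not_ignore: Iterable[str]) -> NodeResult:
    """Fail the node if any detected category is configured as Do not ignore."""
    blocking = frozenset(detected) & frozenset(do_not_ignore)
    if blocking:
        return NodeResult("failed", False, blocking)
    return NodeResult("succeeded", True, blocking)


# Example policy: block on strong-rule errors and strong-rule check failures only.
policy = {"Strong rule - error alert", "Strong rule - check failed"}

ignored = apply_handling_policy(["Weak rule - warning alert"], policy)
print(ignored.status, ignored.downstream_runs)  # succeeded True

blocked = apply_handling_policy(
    ["Strong rule - error alert", "Weak rule - warning alert"], policy
)
print(blocked.status, blocked.downstream_runs)  # failed False
```

Categories that are not listed in the policy behave like Ignore in this model: the exception is recorded, but downstream nodes still run.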

Exception notification method

You can configure how to receive exception notifications, such as by email. When an exception occurs, the platform sends a notification using the specified method so you can promptly find and handle the exception.

Note

The platform supports multiple notification methods. The methods available on the UI may vary. Note the following:

  • For Email, Email and SMS, and Phone, you can select only users within the current account as recipients. Confirm that the email addresses or phone numbers of the relevant personnel are configured correctly. For more information, see View and set alert contacts.

  • For other methods, you must enter a webhook URL. For information about how to obtain a webhook URL, see Obtain a webhook URL.

Step 4: Configure task scheduling

If you need to run the node task periodically, click Scheduling Settings on the right side of the node editor. In the Properties pane, configure the scheduling information for the task based on your business requirements. For more information, see Configure scheduling properties for a node.

Note

You must set the Rerun attribute and Parent Nodes properties for the node before you can submit it.

Step 5: Debug the task

Perform the following debugging operations as needed to verify that the task runs as expected.

  1. (Optional) Select a resource group and assign values to custom parameters.

    • Click the Run with Parameters icon in the toolbar. In the Parameter dialog box, select the scheduling resource group to use for debugging.

    • If your task uses scheduling parameters, you can assign values to the variables here for debugging. For more information about the parameter assignment logic, see Task debugging process.

      The following figure shows an example of scheduling parameter configuration.

      [Figure: example of scheduling parameter configuration]

  2. Save and run the task.

    Click the Save icon in the toolbar to save the task. Click the Run icon to run the task.

    After the task is complete, you can view the run result at the bottom of the node editor. If the run fails, troubleshoot the issue based on the error message.

  3. (Optional) Perform smoke testing.

    To check in the development environment whether the scheduled node task runs as expected, perform smoke testing when you submit the node or after it is submitted. For more information, see Perform smoke testing.

Step 6: Submit and deploy the task

After configuring the node task, you must submit and deploy it. After the task is deployed, the node runs periodically based on its scheduling configuration.

Note

Submitting and deploying the node also submits and deploys its configured quality rules.

  1. Click the Save icon in the toolbar to save the node.

  2. Click the Submit icon in the toolbar to submit the node task.

    When you submit the task, enter a Change Description in the Submission dialog box. If needed, you can also select whether to perform a code review after the node is submitted.

    Note
    • You must set the Rerun attribute and Parent Nodes properties for the node before you can submit it.

    • Code review helps control the quality of task configurations and prevents errors from unreviewed deployments. If you perform a code review, the submitted node can be deployed only after it is approved by a reviewer. For more information, see Code review.

If you use a workspace in standard mode, you must click Deploy in the upper-right corner of the node editor after the task is submitted. This deploys the task to the production environment. For more information, see Deploy tasks.

Next steps

  • Operations and maintenance: After the task is submitted and deployed, it runs periodically based on the node's configuration. You can click O&M in the upper-right corner of the node editor to go to the Operation Center, where you can view the scheduling and running status of the periodic task, including the node status and details of triggered rules. For more information, see Manage periodic tasks.

  • Data Quality: After the data quality monitoring rules are deployed, you can also go to the Data Quality module to view rule details. However, you cannot manage the rules, such as modifying or deleting them. For more information, see Data quality.