Manage unexpected data-DataWorks(DataWorks)-阿里云帮助中心

Data Quality lets you retain data that fails monitoring checks, helping you to quickly troubleshoot issues and identify their root cause. This topic describes how to manage this problem data.

Background information

Problem Data Table

If some rules fail data quality checks, Data Quality automatically creates a Problem Data Table to store the records that failed the checks.

Naming rule: The problem data table is named in the format {source_table_name}_dirtydata_dw_system_dqc.
Workspace: The source table's environment determines the workspace of the problem data table. For example, if the source table is dev_project.table1, the problem data table is also stored in the dev_project workspace.
Owner: The problem data table shares the same owner as the source table.
Retention period: By default, problem data is retained for 15 days, so the lifecycle of the problem data table is also 15 days. If the table meets the lifecycle-based reclamation policy of MaxCompute, the data is deleted. For more information about MaxCompute table lifecycles, see Lifecycle.

Partition structure:

Partition Level	Partition Field Name	Description
Level-1 partition	dqc_task_run_dt	The collection time for the problem data. This value is the runtime of the monitoring rule, in `yyyymmdd` format.
Level-2 partition	dqc_task_id	The ID of the DQC rule instance that collected the problem data. Note Executing a DQC rule generates a DQC instance.

Supported monitoring rules and problem data specifications

For more information about the monitoring rules that support data retention and the definitions of problem data for each rule, see Appendix: Supported monitoring rules and problem data specifications.

Usage notes

You can enable the problem data retention feature only for MaxCompute tables.
Only some data quality monitoring rules support retaining problem data. For a list of supported rules, see Appendix: Supported monitoring rules and problem data specifications.
Problem data is collected by running rule-generated SQL statements on the MaxCompute compute engine. This process incurs MaxCompute computing fees.
The system stores the collected problem data in a MaxCompute temporary table. This process incurs MaxCompute storage fees.

Enable problem data retention

When you configure a monitoring rule for a table and add a template rule, you can enable problem data retention for specific field-level rules. If this feature is enabled, the system saves data that fails a rule check to an automatically generated Problem Data Table. For more information about how to configure monitoring rules for a table, see Configure rules for a single table.

Note

Only some rules support retaining problem data when a check fails. For a list of supported rules, see Appendix: Supported monitoring rules and problem data specifications.

View problem data

Go to the Quality O&M > Running Records page. Find a running record with a result of "Exception" or "Failed" and click Actions in the Details column. On the Monitor Running Details page, click the Problem Data Processing tab.

Note

If you do not have permission to preview the problem data or query the problem data table, go to Security Center to apply for query permissions on the table. For more information, see Access control for MaxCompute data.

View problem data collection logs

You can use the following methods to view logs for problem data collection, including the SQL code and execution process. These logs can help you quickly troubleshoot errors that occur during collection.

Method 1: View logs in the Data Quality module.

On the Quality O&M > Running Records page, view the Raw Log of the target monitoring rule to check the collection process. For more information, see View monitor execution details.
Method 2: View logs from the scheduling task list.

After you associate a data quality monitoring rule with a scheduling node, go to the Operation Details page of the node. On the Execution > DQC panel, click Actions in the View Details column for the target rule to view the collection process. For more information, see intelligent diagnosis.

Appendix: Supported monitoring rules and problem data specifications

Built-in templates

Check type	Check rule	Rule description	Problem data specifications
unique value	Number of unique values, fixed value.	Compares the count of distinct values against a specified fixed value.	If the unique value check fails: If all values are unique, the system creates an empty problem data table. If duplicate values exist, they are saved to the problem data table as problem data.
unique value	Number of unique values/Total rows.	Compares the ratio of unique values to the total number of rows against a fixed value.
null value	Number of null values, fixed value.	Compares the number of null values in the field against a fixed value. Note The check uses the SQL `is null` operator.	The system saves all rows containing a null value in the monitored field to the problem data table. These rows constitute the problem data.
null value	Number of null values/Total rows, fixed value.	Compares the ratio of null values to the total number of rows against a fixed value. Note The fixed value is a decimal number.
duplicate value	Number of duplicate values/Total rows, fixed value.	Compares the ratio of duplicate values to the total number of rows against a fixed value.	The system saves the duplicate values for the field to the problem data table. All duplicate values are considered problem data.
duplicate value	Number of duplicate values, fixed value.	Calculates the number of duplicate values by subtracting the distinct value count from the total row count and compares the result to a fixed value.

Custom SQL statement

After you enable Retain problem data, problem data is saved whenever a custom SQL monitoring rule triggers an alert.