Data Quality lets you retain data that fails monitoring checks, helping you to quickly troubleshoot issues and identify their root cause. This topic describes how to manage this problem data.
Background information
-
Problem Data Table
If some rules fail data quality checks, Data Quality automatically creates a Problem Data Table to store the records that failed the checks.
-
Naming rule: The problem data table is named in the format
{source_table_name}_dirtydata_dw_system_dqc. -
Workspace: The source table's environment determines the workspace of the problem data table. For example, if the source table is
dev_project.table1, the problem data table is also stored in thedev_projectworkspace. -
Owner: The problem data table shares the same owner as the source table.
-
Retention period: By default, problem data is retained for 15 days, so the lifecycle of the problem data table is also 15 days. If the table meets the lifecycle-based reclamation policy of MaxCompute, the data is deleted. For more information about MaxCompute table lifecycles, see Lifecycle.
-
Partition structure:
Partition Level
Partition Field Name
Description
Level-1 partition
dqc_task_run_dt
The collection time for the problem data. This value is the runtime of the monitoring rule, in
yyyymmddformat.Level-2 partition
dqc_task_id
The ID of the DQC rule instance that collected the problem data.
NoteExecuting a DQC rule generates a DQC instance.
-
-
Supported monitoring rules and problem data specifications
For more information about the monitoring rules that support data retention and the definitions of problem data for each rule, see Appendix: Supported monitoring rules and problem data specifications.
Usage notes
-
You can enable the problem data retention feature only for MaxCompute tables.
-
Only some data quality monitoring rules support retaining problem data. For a list of supported rules, see Appendix: Supported monitoring rules and problem data specifications.
-
Problem data is collected by running rule-generated SQL statements on the MaxCompute compute engine. This process incurs MaxCompute computing fees.
-
The system stores the collected problem data in a MaxCompute temporary table. This process incurs MaxCompute storage fees.
Enable problem data retention
When you configure a monitoring rule for a table and add a template rule, you can enable problem data retention for specific field-level rules. If this feature is enabled, the system saves data that fails a rule check to an automatically generated Problem Data Table. For more information about how to configure monitoring rules for a table, see Configure rules for a single table.
Only some rules support retaining problem data when a check fails. For a list of supported rules, see Appendix: Supported monitoring rules and problem data specifications.
View problem data
Go to the page. Find a running record with a result of "Exception" or "Failed" and click Actions in the Details column. On the Monitor Running Details page, click the Problem Data Processing tab.
If you do not have permission to preview the problem data or query the problem data table, go to Security Center to apply for query permissions on the table. For more information, see Access control for MaxCompute data.
View problem data collection logs
You can use the following methods to view logs for problem data collection, including the SQL code and execution process. These logs can help you quickly troubleshoot errors that occur during collection.
-
Method 1: View logs in the Data Quality module.
On the page, view the Raw Log of the target monitoring rule to check the collection process. For more information, see View monitor execution details.
-
Method 2: View logs from the scheduling task list.
After you associate a data quality monitoring rule with a scheduling node, go to the Operation Details page of the node. On the panel, click Actions in the View Details column for the target rule to view the collection process. For more information, see intelligent diagnosis.
Appendix: Supported monitoring rules and problem data specifications
-
Built-in templates
Check type
Check rule
Rule description
Problem data specifications
unique value
Number of unique values, fixed value.
Compares the count of distinct values against a specified fixed value.
If the unique value check fails:
-
If all values are unique, the system creates an empty problem data table.
-
If duplicate values exist, they are saved to the problem data table as problem data.
Number of unique values/Total rows.
Compares the ratio of unique values to the total number of rows against a fixed value.
null value
Number of null values, fixed value.
Compares the number of null values in the field against a fixed value.
NoteThe check uses the SQL
is nulloperator.The system saves all rows containing a null value in the monitored field to the problem data table. These rows constitute the problem data.
Number of null values/Total rows, fixed value.
Compares the ratio of null values to the total number of rows against a fixed value.
NoteThe fixed value is a decimal number.
duplicate value
Number of duplicate values/Total rows, fixed value.
Compares the ratio of duplicate values to the total number of rows against a fixed value.
The system saves the duplicate values for the field to the problem data table. All duplicate values are considered problem data.
Number of duplicate values, fixed value.
Calculates the number of duplicate values by subtracting the distinct value count from the total row count and compares the result to a fixed value.
-
-
Custom SQL statement
After you enable Retain problem data, problem data is saved whenever a custom SQL monitoring rule triggers an alert.