数据过滤配置

在数据质量监控场景中,通常不会将整张表的数据都纳入监控,而是每次监控只过滤出发生变动的数据,本文介绍如何设置数据过滤。

数据过滤配置示例

大部分场景中,“发生变动的数据”都是基于时间字段来过滤。例如在MaxCompute中,会使用时间分区来过滤、在MySQL业务流水表场景中,会根据业务流水的创建时间过滤。

您可以在监控和规则中设置filter来实现数据过滤,配置示例:

{
  "datasets": [
    {
      "type": "Table",
      "tables": [
        "tb_d_spec_demo"
      ],
      "filter": "dt = '$[yyyymmdd]' AND hh = '$[hh24-1/24]'",
      "dataSource": {
        "name": "odps_first", 
        "envType": "Dev"
      }
    }
  ],
  "rules": [
    {
      "assertion": "row_count > 0",
      // 在规则级别添加filter,会与Scan.dataset中的filter共同起作用
      // 实际执行时的filter会是 (dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]') AND (id IS NOT NULL)
      "filter": "id IS NOT NULL"
    }
  ],
  "computeResource": {
    "id": 2001
  }
}

filter语法说明

基本语法

filter语法与SQL-92标准的WHERE条件写法基本保持一致(除去matchoverlaps两种谓词),具体的语法格式如下:

<search condition> ::=
    <boolean term> | <search condition> OR <boolean term>

<boolean term> ::=
    <boolean factor> | <boolean term> AND <boolean factor>

<boolean factor> ::=
    [ NOT ] <boolean test>

<boolean test> ::=
    <predicate> | ( <search condition> )

<predicate> ::=
    <comparison predicate> | <between predicate> |
    <in predicate> | <like predicate> | <null predicate> |
    <quantified comparison predicate> | <exists predicate>

<comparison predicate> ::=
    <expression> <comparison_operator> <expression>

<between predicate> ::=
    <expression> [NOT] BETWEEN <lower_bound> AND <upper_bound>

<in predicate> ::=
    <expression> [NOT] IN (value_list | subquery)

<like predicate> ::=
    <expression> [NOT] LIKE pattern [ESCAPE escape_character]

<null predicate> ::=
    <expression> IS [NOT] NULL

<quantified comparison predicate> ::=
    <expression> <comparison_operator> { ALL | ANY | SOME } (subquery)

<exists predicate> ::=
    EXISTS (subquery)

配置示例

  • comparison predicate

    salary > 50000
  • between predicate

    price BETWEEN 10 AND 50
  • in predicate

    department IN ('Sales', 'Marketing')
  • like predicate

    name LIKE 'Jo%'
  • null predicate

    department IS NOT NULL
  • quantified comparison predicate

    salary > ALL (SELECT salary FROM employees WHERE department = 'Sales')
  • exists predicate

    EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)
  • 较复杂的search condition

    -- 使用AND逻辑操作符连接两个predict
    salary > ALL (SELECT salary FROM employees WHERE department = 'Sales') AND EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)
    -- 使用OR逻辑操作符连接
    department IS NOT NULL OR name LIKE 'Jo%'

使用原生的query定义Filter

如果底层数据存储的查询语句无法用SQL-92标准的SQL转化(例如使用了RLIKE、地理位置搜索、全文搜索等特性),则可以在filter中以query:关键字开头,直接写原生的query。

例如,使用RLIKEMaxCompute表的数据进行过滤,配置示例如下:

{
  "datasets": [
    {
      "type": "Table",
      "indices": [
        "ods_d_customers"
      ],
      "dataSource": {
        "name": "odps_first", 
        "envType": "Dev"
      }
    }
  ],
  "rules": [
    {
      "assertion": "row_count > 0",
      "filter": "query: email_address RLIKE '^info@'",
    }
  ]
}

filter中使用时间变量

如果使用时间分区,质量监控每天会扫描当天的分区或者T-1的分区,此时需要每天执行质量监控时自动获取到对应的日期值,您可以使用调度系统的时间偏移表达式写法。

例如,在filter中使用了类似$[yyyymmdd-1]的调度时间表达式写法,数据质量系统会自动以调度实例的定时时间为基准做偏移并替换。

说明

如果数据质量监控不是通过调度实例触发的,需要在CreateDataQualityScanRun的触发参数中指定triggerTime,系统会以triggerTime为基准做偏移,