在数据质量监控场景中,通常不会将整张表的数据都纳入监控,而是每次监控只过滤出发生变动的数据,本文介绍如何设置数据过滤。
数据过滤配置示例
大部分场景中,“发生变动的数据”都是基于时间字段来过滤。例如在MaxCompute中,会使用时间分区来过滤、在MySQL业务流水表场景中,会根据业务流水的创建时间过滤。
您可以在监控和规则中设置filter来实现数据过滤,配置示例:
{
  "datasets": [
    {
      "type": "Table",
      "tables": [
        "tb_d_spec_demo"
      ],
      "filter": "dt = '$[yyyymmdd]' AND hh = '$[hh24-1/24]'",
      "dataSource": {
        "name": "odps_first", 
        "envType": "Dev"
      }
    }
  ],
  "rules": [
    {
      "assertion": "row_count > 0",
      // 在规则级别添加filter,会与Scan.dataset中的filter共同起作用
      // 实际执行时的filter会是 (dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]') AND (id IS NOT NULL)
      "filter": "id IS NOT NULL"
    }
  ],
  "computeResource": {
    "id": 2001
  }
}filter语法说明
基本语法
filter语法与SQL-92标准的WHERE条件写法基本保持一致(除去match、overlaps两种谓词),具体的语法格式如下:
<search condition> ::=
    <boolean term> | <search condition> OR <boolean term>
<boolean term> ::=
    <boolean factor> | <boolean term> AND <boolean factor>
<boolean factor> ::=
    [ NOT ] <boolean test>
<boolean test> ::=
    <predicate> | ( <search condition> )
<predicate> ::=
    <comparison predicate> | <between predicate> |
    <in predicate> | <like predicate> | <null predicate> |
    <quantified comparison predicate> | <exists predicate>
<comparison predicate> ::=
    <expression> <comparison_operator> <expression>
<between predicate> ::=
    <expression> [NOT] BETWEEN <lower_bound> AND <upper_bound>
<in predicate> ::=
    <expression> [NOT] IN (value_list | subquery)
<like predicate> ::=
    <expression> [NOT] LIKE pattern [ESCAPE escape_character]
<null predicate> ::=
    <expression> IS [NOT] NULL
<quantified comparison predicate> ::=
    <expression> <comparison_operator> { ALL | ANY | SOME } (subquery)
<exists predicate> ::=
    EXISTS (subquery)配置示例
- comparison predicate - salary > 50000
- between predicate - price BETWEEN 10 AND 50
- in predicate - department IN ('Sales', 'Marketing')
- like predicate - name LIKE 'Jo%'
- null predicate - department IS NOT NULL
- quantified comparison predicate - salary > ALL (SELECT salary FROM employees WHERE department = 'Sales')
- exists predicate - EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)
- 较复杂的search condition - -- 使用AND逻辑操作符连接两个predict salary > ALL (SELECT salary FROM employees WHERE department = 'Sales') AND EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id) -- 使用OR逻辑操作符连接 department IS NOT NULL OR name LIKE 'Jo%'
使用原生的query定义Filter
如果底层数据存储的查询语句无法用SQL-92标准的SQL转化(例如使用了RLIKE、地理位置搜索、全文搜索等特性),则可以在filter中以query:关键字开头,直接写原生的query。
例如,使用RLIKE对MaxCompute表的数据进行过滤,配置示例如下:
{
  "datasets": [
    {
      "type": "Table",
      "indices": [
        "ods_d_customers"
      ],
      "dataSource": {
        "name": "odps_first", 
        "envType": "Dev"
      }
    }
  ],
  "rules": [
    {
      "assertion": "row_count > 0",
      "filter": "query: email_address RLIKE '^info@'",
    }
  ]
}filter中使用时间变量
如果使用时间分区,质量监控每天会扫描当天的分区或者T-1的分区,此时需要每天执行质量监控时自动获取到对应的日期值,您可以使用调度系统的时间偏移表达式写法。
例如,在filter中使用了类似$[yyyymmdd-1]的调度时间表达式写法,数据质量系统会自动以调度实例的定时时间为基准做偏移并替换。
如果数据质量监控不是通过调度实例触发的,需要在CreateDataQualityScanRun的触发参数中指定triggerTime,系统会以triggerTime为基准做偏移,