数据过滤配置

在数据质量监控场景中,通常不会将整张表的数据都纳入监控,而是每次监控只过滤出发生变动的数据,本文介绍如何设置数据过滤。

数据过滤配置示例

大部分场景中,发生变动的数据都是基于时间字段来过滤。例如在MaxCompute中,会使用时间分区来过滤;在MySQL业务流水表场景中,会根据业务流水的创建时间过滤。

您可以在监控和规则中设置filter来实现数据过滤,配置示例:

datasets:
  - type: Table
    tables:
      - tb_d_spec_demo
    filter: "dt = '$[yyyymmdd]' AND hh = '$[hh24-1/24]'"
    dataSource:
      name: odps_first
      envType: Dev
rules:
  - assertion: "row_count > 0"
    # 在规则级别添加filter,会与Scan.dataset中的filter共同起作用
    # 实际执行时的filter会是 (dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]') AND (id IS NOT NULL)
    filter: "id IS NOT NULL"
computeResource:
  id: 2001

filter语法说明

基本语法

filter语法与SQL-92标准的WHERE条件写法基本保持一致(除去matchoverlaps两种谓词),具体的语法格式如下:

<search condition> ::=
    <boolean term> | <search condition> OR <boolean term>

<boolean term> ::=
    <boolean factor> | <boolean term> AND <boolean factor>

<boolean factor> ::=
    [ NOT ] <boolean test>

<boolean test> ::=
    <predicate> | ( <search condition> )

<predicate> ::=
    <comparison predicate> | <between predicate> |
    <in predicate> | <like predicate> | <null predicate> |
    <quantified comparison predicate> | <exists predicate>

<comparison predicate> ::=
    <expression> <comparison_operator> <expression>

<between predicate> ::=
    <expression> [NOT] BETWEEN <lower_bound> AND <upper_bound>

<in predicate> ::=
    <expression> [NOT] IN (value_list | subquery)

<like predicate> ::=
    <expression> [NOT] LIKE pattern [ESCAPE escape_character]

<null predicate> ::=
    <expression> IS [NOT] NULL

<quantified comparison predicate> ::=
    <expression> <comparison_operator> { ALL | ANY | SOME } (subquery)

<exists predicate> ::=
    EXISTS (subquery)

配置示例

  • comparison predicate

    salary > 50000
  • between predicate

    price BETWEEN 10 AND 50
  • in predicate

    department IN ('Sales', 'Marketing')
  • like predicate

    name LIKE 'Jo%'
  • null predicate

    department IS NOT NULL
  • quantified comparison predicate

    salary > ALL (SELECT salary FROM employees WHERE department = 'Sales')
  • exists predicate

    EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)
  • 较复杂的search condition

    -- 使用AND逻辑操作符连接两个predict
    salary > ALL (SELECT salary FROM employees WHERE department = 'Sales') AND EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)
    -- 使用OR逻辑操作符连接
    department IS NOT NULL OR name LIKE 'Jo%'

使用原生的query定义Filter

如果底层数据存储的查询语句无法用SQL-92标准的SQL转化(例如使用了RLIKE、地理位置搜索、全文搜索等特性),则可以在filter中以query:关键字开头,直接写原生的query。

例如,使用RLIKEMaxCompute表的数据进行过滤,配置示例如下:

datasets:
  - type: Table
    indices:
      - ods_d_customers
    dataSource:
      name: odps_first
      envType: Dev
rules:
  - assertion: "row_count > 0"
    filter: "query: email_address RLIKE '^info@'"

filter中使用时间变量

如果使用时间分区,质量监控每天会扫描当天的分区或者T-1的分区,此时需要每天执行质量监控时自动获取到对应的日期值,您可以使用调度系统的时间偏移表达式写法。

例如,在filter中使用了类似$[yyyymmdd-1]的调度时间表达式写法,数据质量系统会自动以调度实例的定时时间为基准做偏移并替换。

说明

如果数据质量监控不是通过调度实例触发的,需要在CreateDataQualityScanRun的触发参数中指定triggerTime,系统会以triggerTime为基准做偏移,