在数据质量监控场景中,通常不会将整张表的数据都纳入监控,而是每次监控只过滤出发生变动的数据,本文介绍如何设置数据过滤。
数据过滤配置示例
大部分场景中,发生变动的数据都是基于时间字段来过滤。例如在MaxCompute中,会使用时间分区来过滤;在MySQL业务流水表场景中,会根据业务流水的创建时间过滤。
您可以在监控和规则中设置filter来实现数据过滤,配置示例:
datasets:
- type: Table
tables:
- tb_d_spec_demo
filter: "dt = '$[yyyymmdd]' AND hh = '$[hh24-1/24]'"
dataSource:
name: odps_first
envType: Dev
rules:
- assertion: "row_count > 0"
# 在规则级别添加filter,会与Scan.dataset中的filter共同起作用
# 实际执行时的filter会是 (dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]') AND (id IS NOT NULL)
filter: "id IS NOT NULL"
computeResource:
id: 2001filter语法说明
基本语法
filter语法与SQL-92标准的WHERE条件写法基本保持一致(除去match、overlaps两种谓词),具体的语法格式如下:
<search condition> ::=
<boolean term> | <search condition> OR <boolean term>
<boolean term> ::=
<boolean factor> | <boolean term> AND <boolean factor>
<boolean factor> ::=
[ NOT ] <boolean test>
<boolean test> ::=
<predicate> | ( <search condition> )
<predicate> ::=
<comparison predicate> | <between predicate> |
<in predicate> | <like predicate> | <null predicate> |
<quantified comparison predicate> | <exists predicate>
<comparison predicate> ::=
<expression> <comparison_operator> <expression>
<between predicate> ::=
<expression> [NOT] BETWEEN <lower_bound> AND <upper_bound>
<in predicate> ::=
<expression> [NOT] IN (value_list | subquery)
<like predicate> ::=
<expression> [NOT] LIKE pattern [ESCAPE escape_character]
<null predicate> ::=
<expression> IS [NOT] NULL
<quantified comparison predicate> ::=
<expression> <comparison_operator> { ALL | ANY | SOME } (subquery)
<exists predicate> ::=
EXISTS (subquery)配置示例
comparison predicate
salary > 50000between predicate
price BETWEEN 10 AND 50in predicate
department IN ('Sales', 'Marketing')like predicate
name LIKE 'Jo%'null predicate
department IS NOT NULLquantified comparison predicate
salary > ALL (SELECT salary FROM employees WHERE department = 'Sales')exists predicate
EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)较复杂的search condition
-- 使用AND逻辑操作符连接两个predict salary > ALL (SELECT salary FROM employees WHERE department = 'Sales') AND EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id) -- 使用OR逻辑操作符连接 department IS NOT NULL OR name LIKE 'Jo%'
使用原生的query定义Filter
如果底层数据存储的查询语句无法用SQL-92标准的SQL转化(例如使用了RLIKE、地理位置搜索、全文搜索等特性),则可以在filter中以query:关键字开头,直接写原生的query。
例如,使用RLIKE对MaxCompute表的数据进行过滤,配置示例如下:
datasets:
- type: Table
indices:
- ods_d_customers
dataSource:
name: odps_first
envType: Dev
rules:
- assertion: "row_count > 0"
filter: "query: email_address RLIKE '^info@'"filter中使用时间变量
如果使用时间分区,质量监控每天会扫描当天的分区或者T-1的分区,此时需要每天执行质量监控时自动获取到对应的日期值,您可以使用调度系统的时间偏移表达式写法。
例如,在filter中使用了类似$[yyyymmdd-1]的调度时间表达式写法,数据质量系统会自动以调度实例的定时时间为基准做偏移并替换。
说明
如果数据质量监控不是通过调度实例触发的,需要在CreateDataQualityScanRun的触发参数中指定triggerTime,系统会以triggerTime为基准做偏移,
该文章对您有帮助吗?