在数据质量监控场景中,通常不会将整张表的数据都纳入监控,而是每次监控只过滤出发生变动的数据,本文介绍如何设置数据过滤。
数据过滤配置示例
大部分场景中,“发生变动的数据”都是基于时间字段来过滤。例如在MaxCompute中,会使用时间分区来过滤、在MySQL业务流水表场景中,会根据业务流水的创建时间过滤。
您可以在监控和规则中设置filter
来实现数据过滤,配置示例:
{
"datasets": [
{
"type": "Table",
"tables": [
"tb_d_spec_demo"
],
"filter": "dt = '$[yyyymmdd]' AND hh = '$[hh24-1/24]'",
"dataSource": {
"name": "odps_first",
"envType": "Dev"
}
}
],
"rules": [
{
"assertion": "row_count > 0",
// 在规则级别添加filter,会与Scan.dataset中的filter共同起作用
// 实际执行时的filter会是 (dt = '$[yyyymmdd-1]' AND hh = '$[hh24-1/24]') AND (id IS NOT NULL)
"filter": "id IS NOT NULL"
}
],
"computeResource": {
"id": 2001
}
}
filter语法说明
基本语法
filter
语法与SQL-92标准的WHERE条件写法基本保持一致(除去match
、overlaps
两种谓词),具体的语法格式如下:
<search condition> ::=
<boolean term> | <search condition> OR <boolean term>
<boolean term> ::=
<boolean factor> | <boolean term> AND <boolean factor>
<boolean factor> ::=
[ NOT ] <boolean test>
<boolean test> ::=
<predicate> | ( <search condition> )
<predicate> ::=
<comparison predicate> | <between predicate> |
<in predicate> | <like predicate> | <null predicate> |
<quantified comparison predicate> | <exists predicate>
<comparison predicate> ::=
<expression> <comparison_operator> <expression>
<between predicate> ::=
<expression> [NOT] BETWEEN <lower_bound> AND <upper_bound>
<in predicate> ::=
<expression> [NOT] IN (value_list | subquery)
<like predicate> ::=
<expression> [NOT] LIKE pattern [ESCAPE escape_character]
<null predicate> ::=
<expression> IS [NOT] NULL
<quantified comparison predicate> ::=
<expression> <comparison_operator> { ALL | ANY | SOME } (subquery)
<exists predicate> ::=
EXISTS (subquery)
配置示例
comparison predicate
salary > 50000
between predicate
price BETWEEN 10 AND 50
in predicate
department IN ('Sales', 'Marketing')
like predicate
name LIKE 'Jo%'
null predicate
department IS NOT NULL
quantified comparison predicate
salary > ALL (SELECT salary FROM employees WHERE department = 'Sales')
exists predicate
EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id)
较复杂的search condition
-- 使用AND逻辑操作符连接两个predict salary > ALL (SELECT salary FROM employees WHERE department = 'Sales') AND EXISTS (SELECT 1 FROM customers cu WHERE cu.country_id = c.country_id) -- 使用OR逻辑操作符连接 department IS NOT NULL OR name LIKE 'Jo%'
使用原生的query定义Filter
如果底层数据存储的查询语句无法用SQL-92标准的SQL转化(例如使用了RLIKE、地理位置搜索、全文搜索等特性),则可以在filter
中以query:
关键字开头,直接写原生的query。
例如,使用RLIKE对MaxCompute表的数据进行过滤,配置示例如下:
{
"datasets": [
{
"type": "Table",
"indices": [
"ods_d_customers"
],
"dataSource": {
"name": "odps_first",
"envType": "Dev"
}
}
],
"rules": [
{
"assertion": "row_count > 0",
"filter": "query: email_address RLIKE '^info@'",
}
]
}
filter中使用时间变量
如果使用时间分区,质量监控每天会扫描当天的分区或者T-1的分区,此时需要每天执行质量监控时自动获取到对应的日期值,您可以使用调度系统的时间偏移表达式写法。
例如,在filter
中使用了类似$[yyyymmdd-1]
的调度时间表达式写法,数据质量系统会自动以调度实例的定时时间为基准做偏移并替换。
如果数据质量监控不是通过调度实例触发的,需要在CreateDataQualityScanRun的触发参数中指定triggerTime
,系统会以triggerTime
为基准做偏移,