空值规则_大数据开发治理平台 DataWorks(DataWorks)-阿里云帮助中心

您可以使用空值规则来对数据中某个字段的空值情况进行校验。

配置示例

{
  "datasets": [
    {
      "type": "Table",
      "tables": [
        "tb_d_spec_demo"
      ],
      "filter": "dt=$[yyyymmdd]/hh=$[hh24-1/24]",
      "dataSource": {
        "name": "odps_first", 
        "envType": "Dev"
      }
    }
  ],
  "rules": [
    {
      "missing_count(birthday) = 0"
    }, {
      "missing_percent(gender) < 5%"
    }, {
      "missing_count(first_name) = 0",
      "missing" : {
        "regex": "(?:N/A)"
      }
    }, {
      "missing_count(first_name) = 0",
      "missing" : {
        "values": [
          "n/a", 
          "NA", 
          "none"
        ]
      }
    }, {
      "missing_percent(email_address) = 0%"
    }
  ],
  "computeResource": {
    "id": 2001
  }
}

定义空值规则

执行空值规则校验时，系统会扫描数据中的指定字段，统计出指定字段值为空的数据行数或者数据行数占比。

系统会将NULL值当作空值统计。
可以通过missing_percent来校验空值行数的占比。

示例一：校验birthday字段为NULL的数据。

{
  "datasets": [
    {
      "type": "Table",
      "tables": [
        "tb_d_spec_demo"
      ],
      "filter": "dt=$[yyyymmdd]/hh=$[hh24-1/24]",
      "dataSource": {
        "name": "odps_first", 
        "envType": "Dev"
      }
    }
  ],
  "rules": [
    {
      "missing_count(birthday) = 0"
    }
  ],
  "computeResource": {
    "id": 2001
  }
}

示例二：定义number_employees字段为NULL的行数不能超过监控总行数的5%。

{
  "datasets": [
    {
      "type": "Table",
      "tables": [
        "tb_d_spec_demo"
      ],
      "filter": "dt=$[yyyymmdd]/hh=$[hh24-1/24]",
      "dataSource": {
        "name": "odps_first", 
        "envType": "Dev"
      }
    }
  ],
  "rules": [
    {
      "missing_percent(number_employees) < 5%"
    }
  ],
  "computeResource": {
    "id": 2001
  }
}

保留问题数据

如果规则的collectFailedRows设置为true，则在校验不通过（状态为warn或者fail）时，会将指定字段为NULL的数据自动保存到问题数据表中供后续问题排查使用。

{
  "datasets": [
    {
      "type": "Table",
      "tables": [
        "tb_d_spec_demo"
      ],
      "filter": "dt=$[yyyymmdd]/hh=$[hh24-1/24]",
      "dataSource": {
        "name": "odps_first", 
        "envType": "Dev"
      }
    }
  ],
  "rules": [
    {
      "missing_percent(number_employees) < 5%",
      "collectFailedRows": true
    }
  ],
  "computeResource": {
    "id": 2001
  }
}