您可以通过给定一个缺失值的配置列表,来实现将输入表的缺失值用指定的值来填充。

背景信息

  • 将数值型的空值替换为最大值,最小值,均值或者一个自定义的值。
  • 将字符型的空值,空字符串,空值和空字符串,指定值替换为一个自定义的值。
  • 待填充的缺失值可以选择空值或空字符,也可以自定义。

    缺失值如果选择空字符串,则填充的目标列应是STRING型。

  • 数值型替换可以自定义,也可以直接选择替换成数值最大值,最小值或者均值。

组件配置

您可以使用以下任意一种方式,配置缺失值填充组件参数。

方式一:可视化方式

在PAI-Designer(原PAI-Studio)工作流页面配置组件参数。
页签参数描述
参数设置填充的字段默认全选,多余列不影响预测结果。
原值
  • Null(数值和string)
  • 空字符串(string)
  • Null和空字符串(string)
  • 自定义(string)
替换为
  • Min(数值型)
  • Max(数值型)
  • Mean(数值型)
  • 自定义(数值型和string)
configsID列。
说明 勾选高级选项时展示。
执行调优计算核心数
每个核内存数

方式二:PAI命令方式

使用PAI命令方式,配置该组件参数。您可以使用SQL脚本组件进行PAI命令调用,详情请参见SQL脚本
PAI -name FillMissingValues
    -project algo_public
    -Dconfigs="poutcome,null-empty,testing" \
    -DoutputParaTableName="test_input_model_output"
    -DoutputTableName="test_3"
    -DinputTablePartitions="pt=20150501"
    -DinputTableName="bank_data_partition";
参数名称是否必选参数描述默认值
inputTableName输入表的表名。
inputTablePartitions输入表中,参与训练的分区。支持以下格式:
  • Partition_name=value
  • name1=value1/name2=value2:多级格式
说明 如果指定多个分区,则使用英文逗号(,)分隔。
所有分区
outputTableName输出结果表。
configs缺失值填充的配置。
例如格式col1, null, 3.14; col2, empty, hello; col3, empty-null, world,其中null表示空值,empty表示空字符。
  • 如果选择空字符,则填充的目标列应是STRING型。
  • 如果采用最大值、最小值、均值,可以采用变量,其命名规范形如:min, max, mean。
  • 如果用户自定义替换值,则使用user-defined,格式例如col4,user-defined,str,str123
outputParaTableName配置输出表。输出表1为非分区表
inputParaTableName配置输入表。
lifecycle输出表的生命周期,取值范围为[1,3650]
coreNum计算的核心数目,取值为正整数。系统自动分配
memSizePerCore每个核心的内存(单位是兆),取值范围为(1, 65536)系统自动分配

示例

  1. 使用SQL语句,生成测试数据。
    drop table if exists fill_missing_values_test_input;
    create table fill_missing_values_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime);
    insert overwrite table fill_missing_values_test_input
    select
        *
    from
    (
        select
            '01' as col_string,
            10 as col_bigint,
            10.1 as col_double,
            True as col_boolean,
            cast('2016-07-01 10:00:00' as datetime) as col_datetime
        from dual
        union all
            select
                cast(null as string) as col_string,
                11 as col_bigint,
                10.2 as col_double,
                False as col_boolean,
                cast('2016-07-02 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '02' as col_string,
                cast(null as bigint) as col_bigint,
                10.3 as col_double,
                True as col_boolean,
                cast('2016-07-03 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '03' as col_string,
                12 as col_bigint,
                cast(null as double) as col_double,
                False as col_boolean,
                cast('2016-07-04 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '04' as col_string,
                13 as col_bigint,
                10.4 as col_double,
                cast(null as boolean) as col_boolean,
                cast('2016-07-05 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '05' as col_string,
                14 as col_bigint,
                10.5 as col_double,
                True as col_boolean,
                cast(null as datetime) as col_datetime
            from dual
    ) tmp;
    输入数据说明。
    +------------+------------+------------+-------------+--------------+
    | col_string | col_bigint | col_double | col_boolean | col_datetime |
    +------------+------------+------------+-------------+--------------+
    | 04         | 13         | 10.4       | NULL        | 2016-07-05 10:00:00 |
    | 02         | NULL       | 10.3       | true        | 2016-07-03 10:00:00 |
    | 03         | 12         | NULL       | false       | 2016-07-04 10:00:00 |
    | NULL       | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
    | 01         | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
    | 05         | 14         | 10.5       | true        | NULL         |
    +------------+------------+------------+-------------+--------------+
  2. 运行命令。
    drop table if exists fill_missing_values_test_input_output;
    drop table if exists fill_missing_values_test_input_model_output;
    PAI -name FillMissingValues
    -project algo_public
    -Dconfigs="col_double,null,mean;col_string,null-empty,str_type_empty;col_bigint,null,max;col_boolean,null,true;col_datetime,null,2016-07-06 10:00:00"
    -DoutputParaTableName="fill_missing_values_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="fill_missing_values_test_input_output"
    -DinputTableName="fill_missing_values_test_input";
    drop table if exists fill_missing_values_test_input_output_using_model;
    drop table if exists fill_missing_values_test_input_output_using_model_model_output;
    PAI -name FillMissingValues
    -project algo_public
    -DoutputParaTableName="fill_missing_values_test_input_output_using_model_model_output"
    -DinputParaTableName="fill_missing_values_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="fill_missing_values_test_input_output_using_model"
    -DinputTableName="fill_missing_values_test_input";
  3. 运行结果。
    • fill_missing_values_test_input_output
      +------------+------------+------------+-------------+--------------+
      | col_string | col_bigint | col_double | col_boolean | col_datetime |
      +------------+------------+------------+-------------+--------------+
      | 04         | 13         | 10.4       | true        | 2016-07-05 10:00:00 |
      | 02         | 14         | 10.3       | true        | 2016-07-03 10:00:00 |
      | 03         | 12         | 10.3       | false       | 2016-07-04 10:00:00 |
      | str_type_empty | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
      | 01         | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
      | 05         | 14         | 10.5       | true        | 2016-07-06 10:00:00 |
      +------------+------------+------------+-------------+--------------+
    • fill_missing_values_test_input_model_output
      +------------+------------+
      | feature    | json       |
      +------------+------------+
      | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty",  "replaced_value": "str_type_empty"}} |
      | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null",  "replaced_value": 14}} |
      | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null",  "replaced_value": 10.3}} |
      | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null",  "replaced_value": 1}} |
      | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null",  "replaced_value": 1467770400000}} |
      +------------+------------+
    • fill_missing_values_test_input_output_using_model
      +------------+------------+------------+-------------+--------------+
      | col_string | col_bigint | col_double | col_boolean | col_datetime |
      +------------+------------+------------+-------------+--------------+
      | 04         | 13         | 10.4       | true        | 2016-07-05 10:00:00 |
      | 02         | 14         | 10.3       | true        | 2016-07-03 10:00:00 |
      | 03         | 12         | 10.3       | false       | 2016-07-04 10:00:00 |
      | str_type_empty | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
      | 01         | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
      | 05         | 14         | 10.5       | true        | 2016-07-06 10:00:00 |
      +------------+------------+------------+-------------+--------------+
    • fill_missing_values_test_input_output_using_model_model_output
      +------------+------------+
      | feature    | json       |
      +------------+------------+
      | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty",  "replaced_value": "str_type_empty"}} |
      | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null",  "replaced_value": 14}} |
      | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null",  "replaced_value": 10.3}} |
      | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null",  "replaced_value": 1}} |
      | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null",  "replaced_value": 1467770400000}} |
      +------------+------------+