缺失值填充是一种处理数据集中缺失数据的方法,旨在通过推断和替换缺失值来提高数据完整性和模型性能。常见的填充方法包括使用最小值、最大值、平均数和自定义值进行填补。这些方法能够帮助减少数据不完整对模型训练和预测的影响。
配置组件
方式一:可视化方式
在Designer工作流页面添加缺失值填充组件,并在界面右侧配置相关参数:
| 参数类型 | 参数 | 描述 | 
| 字段设置 | 填充的字段 | 选择待填充的字段。 | 
| 原值 | 待填充字段原值,取值: 
 | |
| 替换为 | 替换字段,取值: 
 | |
| 高级选项 | 自定义替换策略config。格式为: | |
| 执行调优 | 计算核心数 | 计算核心数。 | 
| 每个核内存数 | 每个核内存数,单位MB。 | 
方式二:PAI命令方式
使用PAI命令配置缺失值填充组件参数。您可以使用SQL脚本组件进行PAI命令调用,详情请参见场景4:在SQL脚本组件中执行PAI命令。
PAI -name FillMissingValues
    -project algo_public
    -Dconfigs="poutcome,null-empty,testing" 
    -DoutputParaTableName="test_input_model_output"
    -DoutputTableName="test_3"
    -DinputTablePartitions="pt=20150501"
    -DinputTableName="bank_data_partition";| 参数名称 | 是否必选 | 默认值 | 描述 | 
| inputTableName | 是 | 无 | 输入表的表名。 | 
| inputTablePartitions | 否 | 所有分区 | 输入表中,参与训练的分区。支持以下格式: 
 说明  如果指定多个分区,则使用英文逗号(,)分隔。 | 
| outputTableName | 是 | 无 | 输出结果表。 | 
| configs | 是 | 无 | 缺失值填充的配置。 例如格式 
 | 
| outputParaTableName | 是 | 输出表1为非分区表 | 配置输出表。 | 
| inputParaTableName | 否 | 无 | 配置输入表。 | 
| lifecycle | 否 | 无 | 输出表的生命周期,取值范围为[1,3650]。 | 
| coreNum | 否 | 系统自动分配 | 计算的核心数目,取值为正整数。 | 
| memSizePerCore | 否 | 系统自动分配 | 每个核心的内存(单位是兆),取值范围为(1, 65536)。 | 
使用示例
- 使用SQL语句,生成测试数据。 - drop table if exists fill_missing_values_test_input; create table fill_missing_values_test_input( col_string string, col_bigint bigint, col_double double, col_boolean boolean, col_datetime datetime); insert overwrite table fill_missing_values_test_input select * from ( select '01' as col_string, 10 as col_bigint, 10.1 as col_double, True as col_boolean, cast('2016-07-01 10:00:00' as datetime) as col_datetime union all select cast(null as string) as col_string, 11 as col_bigint, 10.2 as col_double, False as col_boolean, cast('2016-07-02 10:00:00' as datetime) as col_datetime union all select '02' as col_string, cast(null as bigint) as col_bigint, 10.3 as col_double, True as col_boolean, cast('2016-07-03 10:00:00' as datetime) as col_datetime union all select '03' as col_string, 12 as col_bigint, cast(null as double) as col_double, False as col_boolean, cast('2016-07-04 10:00:00' as datetime) as col_datetime union all select '04' as col_string, 13 as col_bigint, 10.4 as col_double, cast(null as boolean) as col_boolean, cast('2016-07-05 10:00:00' as datetime) as col_datetime union all select '05' as col_string, 14 as col_bigint, 10.5 as col_double, True as col_boolean, cast(null as datetime) as col_datetime ) tmp;- 输入数据说明: - +------------+------------+------------+-------------+--------------+ | col_string | col_bigint | col_double | col_boolean | col_datetime | +------------+------------+------------+-------------+--------------+ | 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | | 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | | 03 | 12 | NULL | false | 2016-07-04 10:00:00 | | NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | | 05 | 14 | 10.5 | true | NULL | +------------+------------+------------+-------------+--------------+
- 运行PAI命令。 - drop table if exists fill_missing_values_test_input_output; drop table if exists fill_missing_values_test_input_model_output; PAI -name FillMissingValues -project algo_public -Dconfigs="col_double,null,mean;col_string,null-empty,str_type_empty;col_bigint,null,max;col_boolean,null,true;col_datetime,null,2016-07-06 10:00:00" -DoutputParaTableName="fill_missing_values_test_input_model_output" -Dlifecycle="28" -DoutputTableName="fill_missing_values_test_input_output" -DinputTableName="fill_missing_values_test_input"; drop table if exists fill_missing_values_test_input_output_using_model; drop table if exists fill_missing_values_test_input_output_using_model_model_output; PAI -name FillMissingValues -project algo_public -DoutputParaTableName="fill_missing_values_test_input_output_using_model_model_output" -DinputParaTableName="fill_missing_values_test_input_model_output" -Dlifecycle="28" -DoutputTableName="fill_missing_values_test_input_output_using_model" -DinputTableName="fill_missing_values_test_input";
- 运行结果。 - fill_missing_values_test_input_output - +------------+------------+------------+-------------+--------------+ | col_string | col_bigint | col_double | col_boolean | col_datetime | +------------+------------+------------+-------------+--------------+ | 04 | 13 | 10.4 | true | 2016-07-05 10:00:00 | | 02 | 14 | 10.3 | true | 2016-07-03 10:00:00 | | 03 | 12 | 10.3 | false | 2016-07-04 10:00:00 | | str_type_empty | 11 | 10.2 | false | 2016-07-02 10:00:00 | | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | | 05 | 14 | 10.5 | true | 2016-07-06 10:00:00 | +------------+------------+------------+-------------+--------------+
- fill_missing_values_test_input_model_output - +------------+------------+ | feature | json | +------------+------------+ | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty", "replaced_value": "str_type_empty"}} | | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null", "replaced_value": 14}} | | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null", "replaced_value": 10.3}} | | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null", "replaced_value": 1}} | | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null", "replaced_value": 1467770400000}} | +------------+------------+
- fill_missing_values_test_input_output_using_model - +------------+------------+------------+-------------+--------------+ | col_string | col_bigint | col_double | col_boolean | col_datetime | +------------+------------+------------+-------------+--------------+ | 04 | 13 | 10.4 | true | 2016-07-05 10:00:00 | | 02 | 14 | 10.3 | true | 2016-07-03 10:00:00 | | 03 | 12 | 10.3 | false | 2016-07-04 10:00:00 | | str_type_empty | 11 | 10.2 | false | 2016-07-02 10:00:00 | | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | | 05 | 14 | 10.5 | true | 2016-07-06 10:00:00 | +------------+------------+------------+-------------+--------------+
- fill_missing_values_test_input_output_using_model_model_output - +------------+------------+ | feature | json | +------------+------------+ | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty", "replaced_value": "str_type_empty"}} | | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null", "replaced_value": 14}} | | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null", "replaced_value": 10.3}} | | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null", "replaced_value": 1}} | | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null", "replaced_value": 1467770400000}} | +------------+------------+