您可以通过给定一个缺失值的配置列表,来实现将输入表的缺失值用指定的值来填充。
背景信息
- 将数值型的空值替换为最大值,最小值,均值或者一个自定义的值。
- 将字符型的空值,空字符串,空值和空字符串,指定值替换为一个自定义的值。
- 待填充的缺失值可以选择空值或空字符,也可以自定义。
缺失值如果选择空字符串,则填充的目标列应是STRING型。
- 数值型替换可以自定义,也可以直接选择替换成数值最大值,最小值或者均值。
缺失值填充
PAI-Studio支持通过可视化或PAI命令的方式,配置该组件参数:
- 可视化方式
页签 参数 描述 参数设置 填充的字段 默认全选,多余列不影响预测结果。 原值 - Null(数值和string)
- 空字符串(string)
- Null和空字符串(string)
- 自定义(string)
替换为 - Min(数值型)
- Max(数值型)
- Mean(数值型)
- 自定义(数值型和string)
configs ID列。 说明 勾选高级选项时展示。执行调优 计算核心数 每个核内存数 - PAI命令方式
PAI -name FillMissingValues -project algo_public -Dconfigs="poutcome,null-empty,testing" \ -DoutputTableName="test_3" -DinputPartitions="pt=20150501" -DinputTableName="bank_data_partition";
参数名称 是否必选 参数描述 默认值 inputTableName 是 输入表的表名。 无 inputTablePartitions 否 输入表中,参与训练的分区。支持以下格式: - Partition_name=value
- name1=value1/name2=value2:多级格式
说明 如果指定多个分区,则使用英文逗号(,)分隔。所有分区 outputTableName 是 输出结果表。 无 configs 是 缺失值填充的配置。 例如格式col1, null, 3.14; col2, empty, hello; col3, empty-null, world
,其中null
表示空值,empty
表示空字符。- 如果选择空字符,则填充的目标列应是STRING型。
- 如果采用最大值、最小值、均值,可以采用变量,其命名规范形如:min, max, mean。
- 如果用户自定义替换值,则使用user-defined,格式例如
col4,user-defined,str,str123
。
无 outputParaTableName 否 配置输出表。 输出表1为非分区表 inputParaTableName 是 配置输入表。 无 lifecycle 否 输出表的生命周期,取值范围为[1,3650]。 无 coreNum 否 计算的核心数目,取值为正整数。 系统自动分配 memSizePerCore 否 每个核心的内存(单位是兆),取值范围为(1, 65536)。 系统自动分配
缺失值填充示例
- 使用SQL语句,生成测试数据。
drop table if exists fill_missing_values_test_input; create table fill_missing_values_test_input( col_string string, col_bigint bigint, col_double double, col_boolean boolean, col_datetime datetime); insert overwrite table fill_missing_values_test_input select * from ( select '01' as col_string, 10 as col_bigint, 10.1 as col_double, True as col_boolean, cast('2016-07-01 10:00:00' as datetime) as col_datetime from dual union all select cast(null as string) as col_string, 11 as col_bigint, 10.2 as col_double, False as col_boolean, cast('2016-07-02 10:00:00' as datetime) as col_datetime from dual union all select '02' as col_string, cast(null as bigint) as col_bigint, 10.3 as col_double, True as col_boolean, cast('2016-07-03 10:00:00' as datetime) as col_datetime from dual union all select '03' as col_string, 12 as col_bigint, cast(null as double) as col_double, False as col_boolean, cast('2016-07-04 10:00:00' as datetime) as col_datetime from dual union all select '04' as col_string, 13 as col_bigint, 10.4 as col_double, cast(null as boolean) as col_boolean, cast('2016-07-05 10:00:00' as datetime) as col_datetime from dual union all select '05' as col_string, 14 as col_bigint, 10.5 as col_double, True as col_boolean, cast(null as datetime) as col_datetime from dual ) tmp;
输入数据说明。+------------+------------+------------+-------------+--------------+ | col_string | col_bigint | col_double | col_boolean | col_datetime | +------------+------------+------------+-------------+--------------+ | 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | | 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | | 03 | 12 | NULL | false | 2016-07-04 10:00:00 | | NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | | 05 | 14 | 10.5 | true | NULL | +------------+------------+------------+-------------+--------------+
- 运行命令。
drop table if exists fill_missing_values_test_input_output; drop table if exists fill_missing_values_test_input_model_output; PAI -name FillMissingValues -project algo_public -Dconfigs="col_double,null,mean;col_string,null-empty,str_type_empty;col_bigint,null,max;col_boolean,null,true;col_datetime,null,2016-07-06 10:00:00" -DoutputParaTableName="fill_missing_values_test_input_model_output" -Dlifecycle="28" -DoutputTableName="fill_missing_values_test_input_output" -DinputTableName="fill_missing_values_test_input"; drop table if exists fill_missing_values_test_input_output_using_model; drop table if exists fill_missing_values_test_input_output_using_model_model_output; PAI -name FillMissingValues -project algo_public -DoutputParaTableName="fill_missing_values_test_input_output_using_model_model_output" -DinputParaTableName="fill_missing_values_test_input_model_output" -Dlifecycle="28" -DoutputTableName="fill_missing_values_test_input_output_using_model" -DinputTableName="fill_missing_values_test_input";
- 运行结果。
- fill_missing_values_test_input_output
+------------+------------+------------+-------------+--------------+ | col_string | col_bigint | col_double | col_boolean | col_datetime | +------------+------------+------------+-------------+--------------+ | 04 | 13 | 10.4 | true | 2016-07-05 10:00:00 | | 02 | 14 | 10.3 | true | 2016-07-03 10:00:00 | | 03 | 12 | 10.3 | false | 2016-07-04 10:00:00 | | str_type_empty | 11 | 10.2 | false | 2016-07-02 10:00:00 | | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | | 05 | 14 | 10.5 | true | 2016-07-06 10:00:00 | +------------+------------+------------+-------------+--------------+
- fill_missing_values_test_input_model_output
+------------+------------+ | feature | json | +------------+------------+ | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty", "replaced_value": "str_type_empty"}} | | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null", "replaced_value": 14}} | | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null", "replaced_value": 10.3}} | | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null", "replaced_value": 1}} | | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null", "replaced_value": 1467770400000}} | +------------+------------+
- fill_missing_values_test_input_output_using_model
+------------+------------+------------+-------------+--------------+ | col_string | col_bigint | col_double | col_boolean | col_datetime | +------------+------------+------------+-------------+--------------+ | 04 | 13 | 10.4 | true | 2016-07-05 10:00:00 | | 02 | 14 | 10.3 | true | 2016-07-03 10:00:00 | | 03 | 12 | 10.3 | false | 2016-07-04 10:00:00 | | str_type_empty | 11 | 10.2 | false | 2016-07-02 10:00:00 | | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | | 05 | 14 | 10.5 | true | 2016-07-06 10:00:00 | +------------+------------+------------+-------------+--------------+
- fill_missing_values_test_input_output_using_model_model_output
+------------+------------+ | feature | json | +------------+------------+ | col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty", "replaced_value": "str_type_empty"}} | | col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null", "replaced_value": 14}} | | col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null", "replaced_value": 10.3}} | | col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null", "replaced_value": 1}} | | col_datetime | {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null", "replaced_value": 1467770400000}} | +------------+------------+
- fill_missing_values_test_input_output
在文档使用中是否遇到以下问题
更多建议
匿名提交