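This topic describes the Normalization component provided by PAI-Designer (formerly PAI-Studio).
Configure the component
You can configure the parameters of the Normalization component in either of the following ways.
Method 1: Visual configuration
Configure the component parameters on the workflow page of PAI-Designer (formerly PAI-Studio).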
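Tab | Parameter | Description |
---|---|---|
Fields Setting | Select all by default | All columns are selected by default. Extra columns do not affect the prediction result. |
 | Keep original columns | Processed columns are given the "normalized_" prefix, as shown in the example output. The DOUBLE and BIGINT types are supported. |
Tuning | Number of cores | The system automatically allocates the number of training instances based on the input data volume. |
 | Memory per core | The system automatically allocates memory based on the input data volume. Unit: MB. |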
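Method 2: PAI command
Configure the component parameters by using a PAI command. You can call PAI commands from the SQL Script component. For more information, see SQL Script.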
- Command for dense data
    PAI -name Normalize
        -project algo_public
        -DkeepOriginal="true"
        -DoutputTableName="test_4"
        -DinputTablePartitions="pt=20150501"
        -DinputTableName="bank_data_partition"
        -DselectedColNames="emp_var_rate,euribor3m";
- Command for sparse data
    PAI -name Normalize
        -project projectxlib4
        -DkeepOriginal="true"
        -DoutputTableName="kv_norm_output"
        -DinputTableName=kv_norm_test
        -DselectedColNames="f0,f1,f2"
        -DenableSparse=true
        -DoutputParaTableName=kv_norm_model
        -DkvIndices=1,2,8,6
        -DitemDelimiter=",";
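When these commands are generated by a script rather than typed by hand, the flag list can be assembled from a dictionary. The following is a minimal sketch, not part of PAI: the helper name `build_normalize_command` is hypothetical, and it only assumes the `-Dname="value"` flag syntax shown in the commands above and the two parameters that the parameter table marks as required.

```python
def build_normalize_command(params, project="algo_public"):
    """Assemble a PAI Normalize command string from a parameter dict.

    Hypothetical helper for illustration; it only formats text and does
    not call PAI.
    """
    # inputTableName and outputTableName are marked as required in the parameter table.
    for required in ("inputTableName", "outputTableName"):
        if not params.get(required):
            raise ValueError(f"missing required parameter: {required}")

    parts = [f"PAI -name Normalize -project {project}"]
    for name, value in params.items():
        if isinstance(value, bool):
            # The commands above write booleans in lowercase.
            value = "true" if value else "false"
        parts.append(f'-D{name}="{value}"')
    return " ".join(parts) + ";"


command = build_normalize_command({
    "inputTableName": "bank_data_partition",
    "outputTableName": "test_4",
    "inputTablePartitions": "pt=20150501",
    "selectedColNames": "emp_var_rate,euribor3m",
    "keepOriginal": True,
})
print(command)
```

The printed string can then be pasted into an SQL Script component. Quoting every value keeps comma-separated lists such as `selectedColNames` in a single argument.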
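Parameter | Required | Description | Default value |
---|---|---|---|
inputTableName | Yes | The name of the input table. | None |
selectedColNames | No | The columns in the input table that are used for training. Separate column names with commas (,). The INT and DOUBLE types are supported. If the input is in sparse format, STRING columns are also supported. | All columns |
inputTablePartitions | No | The partitions in the input table that are used for training, for example pt=20150501. If you specify multiple partitions, separate them with commas (,). | All partitions |
outputTableName | Yes | The output result table. | None |
outputParaTableName | No | The output configuration table, which stores the normalization parameters of each feature. | Non-partitioned table |
inputParaTableName | No | The input configuration table. | None |
keepOriginal | No | Specifies whether to keep the original columns. Valid values: true and false. | false |
lifecycle | No | The lifecycle of the output table. Valid values: [1,3650]. | None |
coreNum | No | The number of cores used for computing. The value must be a positive integer. | Automatically allocated by the system |
memSizePerCore | No | The memory size of each core. Unit: MB. Valid values: (1, 65536). | Automatically allocated by the system |
enableSparse | No | Specifies whether to enable support for sparse data. Valid values: true and false. | false |
itemDelimiter | No | The delimiter between key-value pairs. | Comma (,) |
kvDelimiter | No | The delimiter between a key and its value. | Colon (:) |
kvIndices | No | The indexes of the features in the KV table that need to be normalized. | None |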
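To show how itemDelimiter, kvDelimiter, and kvIndices interact, the following sketch scales one sparse row in plain Python. It is an illustration under assumptions, not the component's implementation: it assumes that a sparse row is a string such as `1:5,2:1,3:7` whose keys are feature indexes, that features outside kvIndices are left unchanged, and that scaling is min-max. The function name `normalize_kv` and the `stats` values are hypothetical.

```python
def normalize_kv(row, stats, kv_indices, item_delimiter=",", kv_delimiter=":"):
    """Min-max scale the selected feature indexes of one KV-formatted string.

    stats maps a feature index to its (min, max) pair.
    """
    items = []
    for item in row.split(item_delimiter):
        key, value = item.split(kv_delimiter)
        index = int(key)
        if index in kv_indices:
            lo, hi = stats[index]
            # Mapping a constant feature (hi == lo) to 0.0 is an assumption.
            scaled = (float(value) - lo) / (hi - lo) if hi != lo else 0.0
            value = repr(scaled)
        items.append(f"{key}{kv_delimiter}{value}")
    return item_delimiter.join(items)


# Hypothetical (min, max) pairs per feature index.
stats = {1: (0.0, 10.0), 2: (0.0, 4.0)}
print(normalize_kv("1:5,2:1,3:7", stats, kv_indices={1, 2}))
```

Features 1 and 2 are scaled, while feature 3 passes through unchanged because it is not listed in `kv_indices`.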
Examples
- Generate data
    drop table if exists normalize_test_input;
    create table normalize_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime
    );
    insert overwrite table normalize_test_input
    select * from (
        select '01' as col_string, 10 as col_bigint, 10.1 as col_double, True as col_boolean, cast('2016-07-01 10:00:00' as datetime) as col_datetime from dual
        union all
        select cast(null as string) as col_string, 11 as col_bigint, 10.2 as col_double, False as col_boolean, cast('2016-07-02 10:00:00' as datetime) as col_datetime from dual
        union all
        select '02' as col_string, cast(null as bigint) as col_bigint, 10.3 as col_double, True as col_boolean, cast('2016-07-03 10:00:00' as datetime) as col_datetime from dual
        union all
        select '03' as col_string, 12 as col_bigint, cast(null as double) as col_double, False as col_boolean, cast('2016-07-04 10:00:00' as datetime) as col_datetime from dual
        union all
        select '04' as col_string, 13 as col_bigint, 10.4 as col_double, cast(null as boolean) as col_boolean, cast('2016-07-05 10:00:00' as datetime) as col_datetime from dual
        union all
        select '05' as col_string, 14 as col_bigint, 10.5 as col_double, True as col_boolean, cast(null as datetime) as col_datetime from dual
    ) tmp;
- PAI command line
    drop table if exists normalize_test_input_output;
    drop table if exists normalize_test_input_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output"
        -DinputTableName="normalize_test_input"
        -DselectedColNames="col_double,col_bigint"
        -DkeepOriginal="true";
    drop table if exists normalize_test_input_output_using_model;
    drop table if exists normalize_test_input_output_using_model_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_output_using_model_model_output"
        -DinputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output_using_model"
        -DinputTableName="normalize_test_input";
- Input description
normalize_test_input
col_string | col_bigint | col_double | col_boolean | col_datetime |
---|---|---|---|---|
01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 |
02 | NULL | 10.3 | true | 2016-07-03 10:00:00 |
03 | 12 | NULL | false | 2016-07-04 10:00:00 |
04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 |
05 | 14 | 10.5 | true | NULL |
- Output description
  - normalize_test_input_output
col_string | col_bigint | col_double | col_boolean | col_datetime | normalized_col_bigint | normalized_col_double |
---|---|---|---|---|---|---|
01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | 0.0 | 0.0 |
NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | 0.25 | 0.2499999999999989 |
02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.5000000000000022 |
03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.5 | NULL |
04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.75 | 0.7500000000000011 |
05 | 14 | 10.5 | true | NULL | 1.0 | 1.0 |
  - normalize_test_input_model_output
feature | json |
---|---|
col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}} |
col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}} |
  - normalize_test_input_output_using_model
col_string | col_bigint | col_double | col_boolean | col_datetime |
---|---|---|---|---|
01 | 0.0 | 0.0 | true | 2016-07-01 10:00:00 |
NULL | 0.25 | 0.2499999999999989 | false | 2016-07-02 10:00:00 |
02 | NULL | 0.5000000000000022 | true | 2016-07-03 10:00:00 |
03 | 0.5 | NULL | false | 2016-07-04 10:00:00 |
04 | 0.75 | 0.7500000000000011 | NULL | 2016-07-05 10:00:00 |
05 | 1.0 | 1.0 | true | NULL |
  - normalize_test_input_output_using_model_model_output
feature | json |
---|---|
col_bigint | {"name": "normalize", "type":"bigint", "paras":{"min":10, "max": 14}} |
col_double | {"name": "normalize", "type":"double", "paras":{"min":10.1, "max": 10.5}} |
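The output tables are consistent with min-max scaling, (x - min) / (max - min), where min and max are the values stored in the model table. This formula is inferred from the example data and is not stated by the component itself. The following sketch reproduces the example in plain Python under that assumption, with NULL represented as `None`; the function names `fit_min_max` and `apply_min_max` are hypothetical.

```python
def fit_min_max(values):
    """Return (min, max) over the non-NULL values of a column."""
    present = [v for v in values if v is not None]
    return min(present), max(present)


def apply_min_max(values, lo, hi):
    """Scale values with (x - lo) / (hi - lo); NULL (None) inputs stay NULL."""
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# Columns copied from normalize_test_input.
col_bigint = [10, 11, None, 12, 13, 14]
col_double = [10.1, 10.2, 10.3, None, 10.4, 10.5]

# First step: compute the parameters (the role of the model output table).
bigint_model = fit_min_max(col_bigint)
double_model = fit_min_max(col_double)

# Second step: apply stored parameters (the role of inputParaTableName).
print(apply_min_max(col_bigint, *bigint_model))
print(apply_min_max(col_double, *double_model))
```

Separating the fit step from the apply step mirrors the two PAI commands in the example: the first writes the min and max values to a model table, and the second reuses them to scale a table without recomputing them. The DOUBLE results differ from exact fractions such as 0.25 only by floating-point rounding, as the output tables also show.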