This topic describes the Normalization component provided by PAI-Designer (formerly PAI-Studio).

Component configuration

You can configure the parameters of the Normalization component in either of the following ways.

Method 1: Use the visual interface

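Configure the component parameters on the workflow page of PAI-Designer (formerly PAI-Studio).
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Select all by default | All columns are selected by default. Extra columns do not affect the prediction result. |
| Fields Setting | Keep original columns | Processed columns are given the "normalized_" prefix. Columns of the DOUBLE and BIGINT types are supported. |
| Tuning | Number of cores | The system automatically allocates the number of training instances based on the volume of input data. |
| Tuning | Memory per core | The system automatically allocates memory based on the volume of input data. Unit: MB. |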

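Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can call PAI commands from the SQL Script component. For more information, see SQL Script.
  • Command for dense data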
    PAI -name Normalize
        -project algo_public
        -DkeepOriginal="true"
        -DoutputTableName="test_4"
        -DinputTablePartitions="pt=20150501"
        -DinputTableName="bank_data_partition"
        -DselectedColNames="emp_var_rate,euribor3m";
  • Command for sparse data
    PAI -name Normalize
        -project projectxlib4
        -DkeepOriginal="true"
        -DoutputTableName="kv_norm_output"
        -DinputTableName=kv_norm_test
        -DselectedColNames="f0,f1,f2"
        -DenableSparse=true
        -DoutputParaTableName=kv_norm_model
        -DkvIndices=1,2,8,6
        -DitemDelimiter=",";
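The sparse command above passes a delimiter and a list of feature indices, but it does not show what a sparse cell looks like. The following Python sketch illustrates one plausible reading and is not the component's implementation. It assumes each cell is a string of key:value pairs such as "1:10,2:5", that only the indices listed in kvIndices are rescaled, that the minimum and maximum are taken over the rows in which an index is present, and that a feature with a single distinct value maps to 0.0. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_kv(cell: str, item_delimiter: str = ",", kv_delimiter: str = ":") -> Dict[int, float]:
    """Parse a sparse cell such as "1:10,2:5" into {1: 10.0, 2: 5.0}."""
    pairs: Dict[int, float] = {}
    for item in cell.split(item_delimiter):
        if not item.strip():
            continue  # tolerate empty fragments such as a trailing delimiter
        key, value = item.split(kv_delimiter, 1)
        pairs[int(key)] = float(value)
    return pairs


def format_kv(pairs: Dict[int, float], item_delimiter: str = ",", kv_delimiter: str = ":") -> str:
    """Serialize {1: 0.5, 2: 1.0} back to "1:0.5,2:1.0"."""
    return item_delimiter.join(f"{k}{kv_delimiter}{v}" for k, v in pairs.items())


def normalize_sparse_column(cells: List[str], kv_indices: List[int]) -> List[str]:
    """Min-max rescale the selected indices of a sparse column (illustrative only)."""
    rows = [parse_kv(cell) for cell in cells]

    # Pass 1: per-index (min, max) over the rows where the index is present.
    # Ignoring absent entries is an assumption, not documented behavior.
    bounds: Dict[int, Tuple[float, float]] = {}
    for row in rows:
        for k in kv_indices:
            if k in row:
                lo, hi = bounds.get(k, (row[k], row[k]))
                bounds[k] = (min(lo, row[k]), max(hi, row[k]))

    # Pass 2: rescale selected indices; leave all other indices untouched.
    result = []
    for row in rows:
        scaled = {}
        for k, v in row.items():
            if k in bounds:
                lo, hi = bounds[k]
                scaled[k] = 0.0 if hi == lo else (v - lo) / (hi - lo)
            else:
                scaled[k] = v
        result.append(format_kv(scaled))
    return result


cells = ["1:10,2:5", "1:20,3:7", "1:15,2:9"]
print(normalize_sparse_column(cells, kv_indices=[1, 2]))
```

Index 3 is not listed in kv_indices, so its value passes through unchanged in this sketch.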
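| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTableName | | The name of the input table. | |
| selectedColNames | | The columns in the input table that are used for training. Separate column names with commas (,). Columns of the INT and DOUBLE types are supported. If the input is in sparse format, columns of the STRING type are supported. | All columns |
| inputTablePartitions | | The partitions in the input table that are used for training. Supported formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: if you specify multiple partitions, separate them with commas (,). | All partitions |
| outputTableName | | The output result table. | |
| outputParaTableName | | The output configuration table. | Output table 1 is a non-partitioned table |
| inputParaTableName | | The input configuration table. | |
| keepOriginal | | Specifies whether to keep the original columns. true: the processed columns are renamed with the "normalized_" prefix and the original columns are kept. false: all columns are kept and none is renamed. In the example output in this topic, the selected columns then hold the normalized values. | false |
| lifecycle | | The lifecycle of the output table. Value range: [1,3650]. | |
| coreNum | | The number of cores used for computation. The value must be a positive integer. | Allocated automatically by the system |
| memSizePerCore | | The memory size of each core, in MB. Value range: (1, 65536). | Allocated automatically by the system |
| enableSparse | | Specifies whether to enable sparse data support. Valid values: true and false. | false |
| itemDelimiter | | The delimiter between key-value (KV) pairs. | "," |
| kvDelimiter | | The delimiter between a key and its value. | ":" |
| kvIndices | | The feature indices in the KV table that need to be normalized. | |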
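The value ranges in the parameter table can be checked before a command is submitted. The following Python sketch assembles the text of a dense-data Normalize command and rejects out-of-range values. build_normalize_command is a hypothetical helper written for this illustration; it is not part of PAI, and it only produces a string that you could paste into an SQL Script component. It takes the input and output table names as mandatory arguments only because every command in this topic sets them.

```python
from typing import List, Optional


def build_normalize_command(
    input_table: str,
    output_table: str,
    selected_cols: Optional[List[str]] = None,
    output_para_table: Optional[str] = None,
    keep_original: bool = False,
    lifecycle: Optional[int] = None,
    core_num: Optional[int] = None,
    mem_size_per_core: Optional[int] = None,
    project: str = "algo_public",
) -> str:
    """Return the text of a Normalize command (hypothetical helper, not a PAI API)."""
    # Range checks taken from the parameter table.
    if lifecycle is not None and not 1 <= lifecycle <= 3650:
        raise ValueError("lifecycle must be in [1, 3650]")
    if core_num is not None and core_num < 1:
        raise ValueError("coreNum must be a positive integer")
    if mem_size_per_core is not None and not 1 < mem_size_per_core < 65536:
        raise ValueError("memSizePerCore must be in (1, 65536)")

    params = {
        "inputTableName": input_table,
        "outputTableName": output_table,
        "keepOriginal": "true" if keep_original else "false",
    }
    if selected_cols:
        # Column names are joined with commas, as the table requires.
        params["selectedColNames"] = ",".join(selected_cols)
    if output_para_table:
        params["outputParaTableName"] = output_para_table
    optional = {"lifecycle": lifecycle, "coreNum": core_num, "memSizePerCore": mem_size_per_core}
    for name, value in optional.items():
        if value is not None:
            params[name] = str(value)

    lines = ["PAI -name Normalize", f"    -project {project}"]
    lines += [f'    -D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"


command = build_normalize_command(
    "normalize_test_input",
    "normalize_test_input_output",
    selected_cols=["col_double", "col_bigint"],
    output_para_table="normalize_test_input_model_output",
    keep_original=True,
    lifecycle=28,
)
print(command)
```

Parameters that are left unset are omitted from the generated text, so the defaults listed in the table apply.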

Example

  • Generate data
    drop table if exists normalize_test_input;
    create table normalize_test_input(
        col_string string,
        col_bigint bigint,
        col_double double,
        col_boolean boolean,
        col_datetime datetime);
    insert overwrite table normalize_test_input
    select
        *
    from
    (
        select
            '01' as col_string,
            10 as col_bigint,
            10.1 as col_double,
            True as col_boolean,
            cast('2016-07-01 10:00:00' as datetime) as col_datetime
        from dual
        union all
            select
                cast(null as string) as col_string,
                11 as col_bigint,
                10.2 as col_double,
                False as col_boolean,
                cast('2016-07-02 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '02' as col_string,
                cast(null as bigint) as col_bigint,
                10.3 as col_double,
                True as col_boolean,
                cast('2016-07-03 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '03' as col_string,
                12 as col_bigint,
                cast(null as double) as col_double,
                False as col_boolean,
                cast('2016-07-04 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '04' as col_string,
                13 as col_bigint,
                10.4 as col_double,
                cast(null as boolean) as col_boolean,
                cast('2016-07-05 10:00:00' as datetime) as col_datetime
            from dual
        union all
            select
                '05' as col_string,
                14 as col_bigint,
                10.5 as col_double,
                True as col_boolean,
                cast(null as datetime) as col_datetime
            from dual
    ) tmp;
  • PAI commands
    -- Run 1: normalize the selected columns and save the computed parameters
    -- to the table named by outputParaTableName.
    drop table if exists normalize_test_input_output;
    drop table if exists normalize_test_input_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output"
        -DinputTableName="normalize_test_input"
        -DselectedColNames="col_double,col_bigint"
        -DkeepOriginal="true";
    -- Run 2: reuse the parameters saved by run 1 through inputParaTableName.
    -- keepOriginal is not set, so its default value (false) applies.
    drop table if exists normalize_test_input_output_using_model;
    drop table if exists normalize_test_input_output_using_model_model_output;
    PAI -name Normalize
        -project algo_public
        -DoutputParaTableName="normalize_test_input_output_using_model_model_output"
        -DinputParaTableName="normalize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="normalize_test_input_output_using_model"
        -DinputTableName="normalize_test_input";
  • Input
    normalize_test_input
    | col_string | col_bigint | col_double | col_boolean | col_datetime |
    | --- | --- | --- | --- | --- |
    | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
    | NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 |
    | 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 |
    | 03 | 12 | NULL | false | 2016-07-04 10:00:00 |
    | 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 |
    | 05 | 14 | 10.5 | true | NULL |
  • Output
    • normalize_test_input_output
      | col_string | col_bigint | col_double | col_boolean | col_datetime | normalized_col_bigint | normalized_col_double |
      | --- | --- | --- | --- | --- | --- | --- |
      | 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | 0.0 | 0.0 |
      | NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | 0.25 | 0.2499999999999989 |
      | 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.5000000000000022 |
      | 03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.5 | NULL |
      | 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.75 | 0.7500000000000011 |
      | 05 | 14 | 10.5 | true | NULL | 1.0 | 1.0 |
    • normalize_test_input_model_output
      | feature | json |
      | --- | --- |
      | col_bigint | {"name": "normalize", "type": "bigint", "paras": {"min": 10, "max": 14}} |
      | col_double | {"name": "normalize", "type": "double", "paras": {"min": 10.1, "max": 10.5}} |
    • normalize_test_input_output_using_model
      | col_string | col_bigint | col_double | col_boolean | col_datetime |
      | --- | --- | --- | --- | --- |
      | 01 | 0.0 | 0.0 | true | 2016-07-01 10:00:00 |
      | NULL | 0.25 | 0.2499999999999989 | false | 2016-07-02 10:00:00 |
      | 02 | NULL | 0.5000000000000022 | true | 2016-07-03 10:00:00 |
      | 03 | 0.5 | NULL | false | 2016-07-04 10:00:00 |
      | 04 | 0.75 | 0.7500000000000011 | NULL | 2016-07-05 10:00:00 |
      | 05 | 1.0 | 1.0 | true | NULL |
    • normalize_test_input_output_using_model_model_output
      | feature | json |
      | --- | --- |
      | col_bigint | {"name": "normalize", "type": "bigint", "paras": {"min": 10, "max": 14}} |
      | col_double | {"name": "normalize", "type": "double", "paras": {"min": 10.1, "max": 10.5}} |
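The output tables are consistent with min-max scaling, (x - min) / (max - min), with NULL values skipped when the minimum and maximum are computed and passed through unchanged. That formula is inferred from the example values and is not stated by the component itself. The following Python sketch reproduces both runs under that assumption: the first computes the parameters and stores them as JSON in the same shape as the model table, and the second reloads them and applies them. The function names are hypothetical, and mapping a constant column to 0.0 is an additional assumption.

```python
import json
from typing import Dict, List, Optional


def fit_min_max(values: List[Optional[float]]) -> Dict[str, float]:
    """Compute min and max over the non-NULL values of one column."""
    present = [v for v in values if v is not None]
    return {"min": min(present), "max": max(present)}


def apply_min_max(values: List[Optional[float]], paras: Dict[str, float]) -> List[Optional[float]]:
    """Rescale a column with stored parameters; None (NULL) stays None."""
    lo, hi = paras["min"], paras["max"]
    span = hi - lo
    result: List[Optional[float]] = []
    for v in values:
        if v is None:
            result.append(None)
        elif span == 0:
            result.append(0.0)  # assumption: a constant column maps to 0.0
        else:
            result.append((v - lo) / span)
    return result


# Column values copied from normalize_test_input (None stands for NULL).
col_bigint = [10, 11, None, 12, 13, 14]
col_double = [10.1, 10.2, 10.3, None, 10.4, 10.5]

# First run: fit the parameters and store them as JSON, one entry per feature.
model = {
    name: json.dumps({"name": "normalize", "type": type_name, "paras": fit_min_max(values)})
    for name, type_name, values in [
        ("col_bigint", "bigint", col_bigint),
        ("col_double", "double", col_double),
    ]
}

# Second run: reload the stored parameters instead of recomputing them.
normalized_bigint = apply_min_max(col_bigint, json.loads(model["col_bigint"])["paras"])
normalized_double = apply_min_max(col_double, json.loads(model["col_double"])["paras"])
print(normalized_bigint)
print(normalized_double)
```

The col_double results can differ from exact fractions such as 0.25 in the last few digits because the inputs are binary floating-point numbers, which matches the kind of values shown in the output tables.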