This topic describes the Standardization component provided by PAI-Studio.

Background information

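  • The component standardizes one or more columns of a table and saves the result to a new table.
  • Standardization formula: (X - Mean)/(standard deviation).
    • Mean: the sample mean.
    • standard deviation: the sample standard deviation. Because the deviation of the population is estimated from a sample drawn from it, the computed value must be scaled up slightly to bring it closer to the population value. That is, the sum of squared deviations is divided by n - 1 instead of n.
    • Sample standard deviation formula: $S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{x})^2}$

      where $\bar{x}$ is the mean of the samples $X_1, X_2, \ldots, X_n$.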
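
As a worked illustration of the formula, the calculation below uses the five non-NULL col_bigint values (10, 11, 12, 13, 14) from the example table later in this topic. The symbol $z$ for the standardized value is introduced here only for illustration.

```latex
\begin{align*}
\bar{x} &= \frac{10 + 11 + 12 + 13 + 14}{5} = 12 \\
S &= \sqrt{\frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5 - 1}}
   = \sqrt{\frac{10}{4}} \approx 1.5811 \\
z(10) &= \frac{10 - 12}{1.5811} \approx -1.2649
\end{align*}
```

These figures agree with the mean (12), the std (1.58113883008419), and the first standardized value (-1.2649110640673518) shown in the example output of this topic.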

Standardization

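PAI-Studio lets you configure the parameters of this component in the GUI or with a PAI command:
  • GUI
    Tab | Parameter | Description
    Field settings | Select all by default | All columns are selected by default. Extra columns do not affect the prediction result.
    Field settings | Keep original columns | Processed columns get the "stdized_" prefix. DOUBLE and BIGINT columns are supported.
    Execution tuning | Number of cores | The system automatically allocates the number of training instances based on the input data volume.
    Execution tuning | Memory per core | The system automatically allocates memory based on the input data volume. Unit: MB.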
  • PAI command
    • Command for dense data
      PAI -name Standardize
          -project algo_public 
          -DkeepOriginal="false"    
          -DoutputTableName="test_5"
          -DinputTablePartitions="pt=20150501"  
          -DinputTableName="bank_data_partition" 
          -DselectedColNames="euribor3m,pdays";
    • Command for sparse data
      PAI -name Standardize    
          -project projectxlib4
          -DkeepOriginal="true"
          -DoutputTableName="kv_standard_output"
          -DinputTableName=kv_standard_test
          -DselectedColNames="f0,f1,f2"
          -DenableSparse=true
          -DoutputParaTableName=kv_standard_model    
          -DkvIndices=1,2,8,6
          -DitemDelimiter=",";
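    Parameter | Required | Description | Default value
    inputTableName |  | The name of the input table. | 
    selectedColNames |  | The columns of the input table that are used for training. Separate column names with commas (,). INT and DOUBLE columns are supported. If the input is in sparse format, STRING columns are supported. | All columns
    inputTablePartitions |  | The partitions of the input table that are used for training. Supported formats: Partition_name=value, or name1=value1/name2=value2 for multi-level partitions. To specify multiple partitions, separate them with commas (,). | All partitions
    outputTableName |  | The output result table. | 
    outputParaTableName |  | The output parameter table. It stores the mean and standard deviation computed for each feature. | 
    inputParaTableName |  | The input parameter table, such as one written through outputParaTableName by an earlier run. | 
    keepOriginal |  | Specifies whether to keep the original columns. true: the original columns are kept and the processed columns are added under new names with the "stdized_" prefix. false: all columns are kept under their original names, and the selected columns hold the standardized values. | false
    lifecycle |  | The lifecycle of the output table. | 
    coreNum |  | The number of cores. | Allocated automatically by the system
    memSizePerCore |  | The memory size used by each core. | Allocated automatically by the system
    enableSparse |  | Specifies whether to enable sparse format support: true or false. | false
    itemDelimiter |  | The delimiter between key-value pairs. | ","
    kvDelimiter |  | The delimiter between a key and its value. | ":"
    kvIndices |  | The indices of the features in the KV table to standardize. | 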
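
To make the roles of itemDelimiter, kvDelimiter, and kvIndices more concrete, here is a minimal Python sketch of how a sparse cell might be split and how only the selected indices would be standardized. Everything in it is an assumption for illustration: the cell layout ("key:value,key:value" with integer keys), the helper names parse_kv and standardize_kv, and the made-up mean and std values. It does not describe how the component itself is implemented.

```python
def parse_kv(cell, item_delimiter=",", kv_delimiter=":"):
    """Split a sparse cell such as "1:10,2:0.5" into {index: value}.

    Assumes integer keys and numeric values (hypothetical layout).
    """
    pairs = {}
    for item in cell.split(item_delimiter):
        if not item:
            continue  # tolerate empty items, e.g. a trailing delimiter
        key, value = item.split(kv_delimiter, 1)
        pairs[int(key)] = float(value)
    return pairs


def standardize_kv(pairs, params, kv_indices):
    """Standardize only the features listed in kv_indices.

    params maps a feature index to a (mean, std) tuple.
    Features outside kv_indices are copied through unchanged.
    """
    result = dict(pairs)
    for index in kv_indices:
        if index in pairs and index in params:
            mean, std = params[index]
            result[index] = (pairs[index] - mean) / std
    return result


cell = "1:10,2:0.5,8:3"
pairs = parse_kv(cell)

# Made-up statistics, chosen only so the arithmetic is easy to follow.
params = {1: (12.0, 2.0), 2: (0.5, 0.25)}
result = standardize_kv(pairs, params, kv_indices=[1, 2])
```

In this sketch, feature 8 is left untouched because it is not listed in kv_indices, which mirrors the idea that kvIndices restricts processing to the listed features.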

Standardization example

Detailed example. The following statements create the test table:
drop table if exists standardize_test_input;
create table standardize_test_input(
    col_string string,
    col_bigint bigint,
    col_double double,
    col_boolean boolean,
    col_datetime datetime);
insert overwrite table standardize_test_input
select
    *
from
(
    select
        '01' as col_string,
        10 as col_bigint,
        10.1 as col_double,
        True as col_boolean,
        cast('2016-07-01 10:00:00' as datetime) as col_datetime
    from dual
    union all
        select
            cast(null as string) as col_string,
            11 as col_bigint,
            10.2 as col_double,
            False as col_boolean,
            cast('2016-07-02 10:00:00' as datetime) as col_datetime
        from dual
    union all
        select
            '02' as col_string,
            cast(null as bigint) as col_bigint,
            10.3 as col_double,
            True as col_boolean,
            cast('2016-07-03 10:00:00' as datetime) as col_datetime
        from dual
    union all
        select
            '03' as col_string,
            12 as col_bigint,
            cast(null as double) as col_double,
            False as col_boolean,
            cast('2016-07-04 10:00:00' as datetime) as col_datetime
        from dual
    union all
        select
            '04' as col_string,
            13 as col_bigint,
            10.4 as col_double,
            cast(null as boolean) as col_boolean,
            cast('2016-07-05 10:00:00' as datetime) as col_datetime
        from dual
    union all
        select
            '05' as col_string,
            14 as col_bigint,
            10.5 as col_double,
            True as col_boolean,
            cast(null as datetime) as col_datetime
        from dual
) tmp;
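  • PAI commands. The first command standardizes the selected columns and writes the computed parameters to a parameter table. The second command reuses that parameter table through inputParaTableName.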
    drop table if exists standardize_test_input_output;
    drop table if exists standardize_test_input_model_output;
    PAI -name Standardize
        -project algo_public
        -DoutputParaTableName="standardize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="standardize_test_input_output"
        -DinputTableName="standardize_test_input"
        -DselectedColNames="col_double,col_bigint"
        -DkeepOriginal="true";
    drop table if exists standardize_test_input_output_using_model;
    drop table if exists standardize_test_input_output_using_model_model_output;
    PAI -name Standardize
        -project algo_public
        -DoutputParaTableName="standardize_test_input_output_using_model_model_output"
        -DinputParaTableName="standardize_test_input_model_output"
        -Dlifecycle="28"
        -DoutputTableName="standardize_test_input_output_using_model"
        -DinputTableName="standardize_test_input";
  • Input
    standardize_test_input
    col_string | col_bigint | col_double | col_boolean | col_datetime
    01 | 10 | 10.1 | true | 2016-07-01 10:00:00
    NULL | 11 | 10.2 | false | 2016-07-02 10:00:00
    02 | NULL | 10.3 | true | 2016-07-03 10:00:00
    03 | 12 | NULL | false | 2016-07-04 10:00:00
    04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00
    05 | 14 | 10.5 | true | NULL
  • Output
    • standardize_test_input_output
      col_string | col_bigint | col_double | col_boolean | col_datetime | stdized_col_bigint | stdized_col_double
      01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | -1.2649110640673518 | -1.2649110640683832
      NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | -0.6324555320336759 | -0.6324555320341972
      02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.0
      03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.0 | NULL
      04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.6324555320336759 | 0.6324555320341859
      05 | 14 | 10.5 | true | NULL | 1.2649110640673518 | 1.2649110640683718
    • standardize_test_input_model_output
      feature | json
      col_bigint | {"name": "standardize", "type": "bigint", "paras": {"mean": 12, "std": 1.58113883008419}}
      col_double | {"name": "standardize", "type": "double", "paras": {"mean": 10.3, "std": 0.1581138830082909}}
    • standardize_test_input_output_using_model
      col_string | col_bigint | col_double | col_boolean | col_datetime
      01 | -1.2649110640673515 | -1.264911064068383 | true | 2016-07-01 10:00:00
      NULL | -0.6324555320336758 | -0.6324555320341971 | false | 2016-07-02 10:00:00
      02 | NULL | 0.0 | true | 2016-07-03 10:00:00
      03 | 0.0 | NULL | false | 2016-07-04 10:00:00
      04 | 0.6324555320336758 | 0.6324555320341858 | NULL | 2016-07-05 10:00:00
      05 | 1.2649110640673515 | 1.2649110640683716 | true | NULL
    • standardize_test_input_output_using_model_model_output
      feature | json
      col_bigint | {"name": "standardize", "type": "bigint", "paras": {"mean": 12, "std": 1.58113883008419}}
      col_double | {"name": "standardize", "type": "double", "paras": {"mean": 10.3, "std": 0.1581138830082909}}
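
The numbers in the output tables can be checked outside of PAI with a short Python sketch. It assumes, based only on the sample output above, that NULL values are skipped when the mean and sample standard deviation are computed and that a NULL input stays NULL in the output. The function names fit and transform are hypothetical and are not part of any PAI API.

```python
import json
import math


def fit(values):
    """Return (mean, sample standard deviation) over the non-None values."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / n
    # Sample standard deviation: divide by n - 1, not n.
    std = math.sqrt(sum((v - mean) ** 2 for v in present) / (n - 1))
    return mean, std


def transform(values, mean, std):
    """Standardize each value; None (NULL) is passed through unchanged."""
    return [None if v is None else (v - mean) / std for v in values]


# col_bigint from standardize_test_input, with NULL written as None.
col_bigint = [10, 11, None, 12, 13, 14]

# First run: compute the parameters and standardize.
mean, std = fit(col_bigint)
stdized = transform(col_bigint, mean, std)

# Second run: reuse stored parameters, here taken from the col_bigint row
# of standardize_test_input_model_output.
model_row = (
    '{"name": "standardize", "type": "bigint", '
    '"paras": {"mean": 12, "std": 1.58113883008419}}'
)
paras = json.loads(model_row)["paras"]
reused = transform(col_bigint, paras["mean"], paras["std"])
```

Both runs give the same values up to floating-point rounding, which is consistent with the small last-digit differences between standardize_test_input_output and standardize_test_input_output_using_model.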