Feature encoding uses a GBDT model to encode nonlinear features into linear features.

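Overview

Feature encoding is a strategy that uses decision trees and ensemble algorithms to derive new features. Each new feature is the one-hot encoding of a leaf node in a decision tree, and each leaf node is reached through splits on one or more original features.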

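For example, the following figure shows three trees with 12 leaf nodes in total. The leaf nodes are encoded in tree order as features 0 to 11: the leaf nodes of the first tree take features 0 to 3, those of the second tree take features 4 to 7, and those of the third tree take features 8 to 11. This encoding strategy effectively converts the nonlinear features learned by the GBDT model into linear features.
[Figure: Feature encoding]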
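The index layout described above can be sketched in a few lines of Python. This is only an illustration of the numbering scheme, not the component's implementation; the function name `encode_leaves` and the sample leaf positions are assumptions made for the example.

```python
from typing import List


def encode_leaves(leaf_ids: List[int], leaves_per_tree: List[int]) -> List[int]:
    """Return a one-hot vector with one active feature per tree.

    leaf_ids[i] is the leaf that the sample reaches in tree i,
    counted from 0 within that tree.
    """
    if len(leaf_ids) != len(leaves_per_tree):
        raise ValueError("need exactly one leaf id per tree")
    encoded = [0] * sum(leaves_per_tree)
    offset = 0  # global index of the first leaf of the current tree
    for leaf_id, n_leaves in zip(leaf_ids, leaves_per_tree):
        if not 0 <= leaf_id < n_leaves:
            raise ValueError(f"leaf id {leaf_id} is out of range for {n_leaves} leaves")
        encoded[offset + leaf_id] = 1
        offset += n_leaves
    return encoded


# Three trees with four leaves each, as in the example.
# Hypothetical sample: leaf 1 of the first tree, leaf 3 of the second, leaf 0 of the third.
vector = encode_leaves([1, 3, 0], [4, 4, 4])
print(vector)  # [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
```

In this sketch the active features are 1, 7, and 8: each tree contributes exactly one active feature inside its own index range.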

Parameter configuration

You can configure the parameters of the Feature Encoding component in either of the following ways:
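  • Visual configuration
    Tab | Parameter | Description
    Fields Setting | Feature Columns | The feature columns in the input table that are used for training.
    Fields Setting | Label Column | Required. Click the directory icon. In the Select Field dialog box, enter a keyword to search for the column, select it, and then click OK.
    Fields Setting | Appended Output Columns | Optional. Keeps the original features in the output table.
    Parameters Setting | Cores | The number of cores used for computation. The value must be a positive integer.
    Parameters Setting | Memory Size per Core | The memory size of each core. The value must be a positive integer.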
  • PAI command
    PAI -name fe_encode_runner -project algo_public
        -DinputTable="pai_temp_2159_19087_1"
        -DencodeModel="xlab_m_GBDT_LR_1_19064"
        -DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign"
        -DlabelCol="y"
        -DoutputTable="pai_temp_2159_19061_1"
        -DcoreNum=10
        -DmemSizePerCore=1024;
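    Parameter | Required | Description | Default value
    inputTable | | The name of the input table. |
    inputTablePartitions | | The partitions of the input table that are used for training, in the format partition_name=value. For multi-level partitions, use the format name1=value1/name2=value2. Separate multiple partitions with commas (,). | All partitions of the input table
    encodeModel | | The input GBDT binary classification model that is used for encoding. |
    outputTable | | The output table that stores the encoded results. |
    selectedCols | | The features that the GBDT model uses for encoding. These are usually the training features of the GBDT component. |
    labelCol | | The label column. |
    lifecycle | | The lifecycle of the output table. | 7
    coreNum | | The total number of instances. The value is of the BIGINT type. | -1 (the number of instances is calculated from the input data volume)
    memSizePerCore | | The memory size. | -1 (the memory size is calculated from the input data volume)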
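The inputTablePartitions format (levels joined with "/", partitions joined with ",") can be illustrated with a small helper. This is a minimal sketch: `partition_spec` is a hypothetical function, and the partition columns `ds` and `region` are made-up names that do not come from this document.

```python
from typing import Dict, List


def partition_spec(partitions: List[Dict[str, str]]) -> str:
    """Build a value for inputTablePartitions.

    Each dict describes one partition. Its items are joined with "/"
    in insertion order (multi-level partition), and the partitions
    are joined with ",".
    """
    return ",".join(
        "/".join(f"{name}={value}" for name, value in partition.items())
        for partition in partitions
    )


# Hypothetical partition columns "ds" and "region".
single = partition_spec([{"ds": "20240101"}])
multi = partition_spec([
    {"ds": "20240101", "region": "cn"},
    {"ds": "20240102", "region": "cn"},
])
print(single)  # ds=20240101
print(multi)   # ds=20240101/region=cn,ds=20240102/region=cn
```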

Example

  1. Use SQL statements to generate the training data.
    CREATE TABLE IF NOT EXISTS tdl_pai_bank_test1
    (
        age            BIGINT COMMENT '',
        campaign       BIGINT COMMENT '',
        pdays          BIGINT COMMENT '',
        previous       BIGINT COMMENT '',
        emp_var_rate   DOUBLE COMMENT '',
        cons_price_idx DOUBLE COMMENT '',
        cons_conf_idx  DOUBLE COMMENT '',
        euribor3m      DOUBLE COMMENT '',
        nr_employed    DOUBLE COMMENT '',
        y              BIGINT COMMENT ''
    )
    LIFECYCLE 7;
    insert overwrite table tdl_pai_bank_test1
    select * from
    (select 53 as age,1 as campaign,999 as pdays,0 as previous,-0.1 as emp_var_rate,
           93.2 as cons_price_idx,-42.0 as cons_conf_idx, 4.021 as euribor3m,5195.8 as nr_employed,0 as y
    from dual
    union all
    select 28 as age,3 as campaign,6 as pdays,2 as previous,-1.7 as emp_var_rate,
           94.055 as cons_price_idx,-39.8 as cons_conf_idx, 0.729 as euribor3m,4991.6 as nr_employed,1 as y
    from dual
    union all
    select 39 as age,2 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx, 1.405 as euribor3m,5099.8 as nr_employed,0 as y
    from dual
    union all
    select 55 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.201 as cons_price_idx,-31.4 as cons_conf_idx, 0.869 as euribor3m,5076.2 as nr_employed,1 as y
    from dual
    union all
    select 30 as age,8 as campaign,999 as pdays,0 as previous,1.4 as emp_var_rate,
           93.918 as cons_price_idx,-42.7 as cons_conf_idx, 4.961 as euribor3m,5228.2 as nr_employed,0 as y
    from dual
    union all
    select 37 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx, 1.327 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
    union all
    select 39 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx, 1.313 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
    union all
    select 36 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.963 as cons_price_idx,-40.8 as cons_conf_idx, 1.266 as euribor3m,5076.2 as nr_employed,1 as y
    from dual
    union all
    select 27 as age,2 as campaign,999 as pdays,1 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx, 1.41 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
    ) a;
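  2. Build the following experiment. The Feature Encoding component is typically used together with the GBDT Binary Classification component. For details, see Algorithm modeling.
    Set the parameters of the GBDT Binary Classification component: set the number of trees to 5 and the maximum tree depth to 3, use y as the label column, and use all other fields as feature columns.
    [Figure: Modeling]
  3. Run the experiment and view the prediction results.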
    kv y
    2:1,5:1,8:1,12:1,15:1,18:1,28:1,34:1,41:1,50:1,53:1,63:1,72:1 0.0
    2:1,5:1,6:1,12:1,15:1,16:1,28:1,34:1,41:1,50:1,51:1,63:1,72:1 0.0
    2:1,3:1,12:1,13:1,28:1,34:1,36:1,39:1,55:1,61:1 1.0
    2:1,3:1,12:1,13:1,20:1,21:1,22:1,42:1,43:1,46:1,63:1,64:1,67:1,68:1 0.0
    0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1 1.0

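    The output can be fed directly into the Logistic Regression for Binary Classification or Logistic Regression for Multiclass Classification component. This combination usually performs better than LR or GBDT alone and is less prone to overfitting.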
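Each value in the kv column of the sample output is a sparse list of index:value pairs. If you need to inspect such a row outside of PAI, a parser could look like the following sketch. It assumes the format is exactly comma-separated `index:value` entries, as seen in the sample rows; `parse_kv` is a hypothetical helper and not part of PAI.

```python
from typing import Dict


def parse_kv(kv: str) -> Dict[int, float]:
    """Parse "index:value,index:value,..." into {index: value}."""
    features: Dict[int, float] = {}
    for item in kv.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        index, value = item.split(":")
        features[int(index)] = float(value)
    return features


# Last row of the sample output.
row = parse_kv("0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1")
print(sorted(row))  # [0, 10, 28, 29, 32, 36, 37, 55, 56, 59]
```

Indexes that do not appear in a row are implicitly 0, which is why the sparse form is a compact input for a linear model.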