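Feature encoding uses GBDT to encode nonlinear features into linear features.
Overview
Feature encoding is a strategy that mines new features by using decision trees and an ensemble algorithm. Each new feature comes from the one-hot encoding of the leaf nodes of decision trees that are built from one or more original features.
For example, the following figure shows three trees with 12 leaf nodes in total. The leaves are numbered in tree order as features 0 to 11: the leaf nodes of the first tree take features 0 to 3, those of the second tree take features 4 to 7, and those of the third tree take features 8 to 11. This encoding strategy effectively converts the nonlinear features learned by GBDT into linear features.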
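The index layout described above can be sketched in a few lines of Python. This is only an illustration of the numbering scheme, not the component's implementation: the function name `encode_leaves` is hypothetical, and it assumes each tree reports the 0-based local position of the leaf that a sample reaches.

```python
from itertools import accumulate

def encode_leaves(leaf_counts, leaf_positions):
    """One-hot encode the leaf reached in each tree into one concatenated vector.

    leaf_counts:    number of leaves in each tree, e.g. [4, 4, 4]
    leaf_positions: 0-based local leaf index the sample reaches in each tree
    """
    # Offset of each tree's first leaf in the global feature space, e.g. [0, 4, 8]
    offsets = [0] + list(accumulate(leaf_counts))[:-1]
    vector = [0] * sum(leaf_counts)
    for offset, count, pos in zip(offsets, leaf_counts, leaf_positions):
        if not 0 <= pos < count:
            raise ValueError(f"leaf position {pos} is out of range for a tree with {count} leaves")
        vector[offset + pos] = 1
    return vector

# A sample that reaches leaf 1 of the first tree, leaf 3 of the second, and leaf 0 of the third
encoded = encode_leaves([4, 4, 4], [1, 3, 0])
print(encoded)  # [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
```

Because every tree contributes its own block of indices, a linear model trained on this vector can assign a separate weight to each leaf.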

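Parameter configuration
You can configure the parameters of the feature encoding component in either of the following ways:
- Visual configuration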
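| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | The feature columns in the input table that are used for training. |
| Fields Setting | Label Column | Required. Click the icon, enter a keyword in the Select Field dialog box to search for a column, select the column, and then click OK. |
| Fields Setting | Additional Output Columns | Optional. Keeps the original features in the output result table. |
| Parameters Setting | Cores | The number of cores used for computation. The value must be a positive integer. |
| Parameters Setting | Memory Size per Core | The amount of memory for each core. The value must be a positive integer. |

- PAI command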
PAI -name fe_encode_runner
    -project algo_public
    -DinputTable="pai_temp_2159_19087_1"
    -DencodeModel="xlab_m_GBDT_LR_1_19064"
    -DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign"
    -DlabelCol="y"
    -DoutputTable="pai_temp_2159_19061_1"
    -DcoreNum=10
    -DmemSizePerCore=1024;
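| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputTable | Yes | The name of the input table. | None |
| inputTablePartitions | No | The partitions of the input table that are used for training, in the format partition_name=value. For multi-level partitions, use name1=value1/name2=value2. To specify multiple partitions, separate them with commas (,). | All partitions of the input table |
| encodeModel | Yes | The GBDT binary classification model that is used as the input for encoding. | None |
| outputTable | Yes | The result table that stores the encoded output. | None |
| selectedCols | Yes | The features that GBDT uses for encoding. These are usually the training features of the GBDT component. | None |
| labelCol | Yes | The label column. | None |
| lifecycle | No | The lifecycle of the result table. | 7 |
| coreNum | No | The total number of instances. The value is of the BIGINT type. | -1. The number of instances is calculated based on the amount of input data. |
| memSizePerCore | No | The memory size. | -1. The memory size is calculated based on the amount of input data. |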
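The `inputTablePartitions` format can be illustrated with a small helper. This is a minimal sketch, assuming each partition is given as an ordered list of (name, value) pairs. The function name `format_partitions` and the partition names `ds` and `hh` are hypothetical and serve only as an example.

```python
def format_partitions(partitions):
    """Build an inputTablePartitions value.

    partitions: list of partition specs; each spec is a list of (name, value)
                pairs ordered from the first partition level to the last.
    """
    specs = []
    for levels in partitions:
        # Levels of one multi-level partition are joined with "/"
        specs.append("/".join(f"{name}={value}" for name, value in levels))
    # Multiple partitions are separated by commas
    return ",".join(specs)

# Two hypothetical two-level partitions
value = format_partitions([
    [("ds", "20240101"), ("hh", "00")],
    [("ds", "20240102"), ("hh", "00")],
])
print(value)  # ds=20240101/hh=00,ds=20240102/hh=00
```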
Example
- Use the following SQL statements to generate the training data.
-- Create the training table.
CREATE TABLE IF NOT EXISTS tdl_pai_bank_test1
(
    age            BIGINT COMMENT '',
    campaign       BIGINT COMMENT '',
    pdays          BIGINT COMMENT '',
    previous       BIGINT COMMENT '',
    emp_var_rate   DOUBLE COMMENT '',
    cons_price_idx DOUBLE COMMENT '',
    cons_conf_idx  DOUBLE COMMENT '',
    euribor3m      DOUBLE COMMENT '',
    nr_employed    DOUBLE COMMENT '',
    y              BIGINT COMMENT ''
)
LIFECYCLE 7;

-- Fill the table with nine sample rows.
insert overwrite table tdl_pai_bank_test1
select * from
(
    select 53 as age,1 as campaign,999 as pdays,0 as previous,-0.1 as emp_var_rate,
           93.2 as cons_price_idx,-42.0 as cons_conf_idx,
           4.021 as euribor3m,5195.8 as nr_employed,0 as y
    from dual
    union all
    select 28 as age,3 as campaign,6 as pdays,2 as previous,-1.7 as emp_var_rate,
           94.055 as cons_price_idx,-39.8 as cons_conf_idx,
           0.729 as euribor3m,4991.6 as nr_employed,1 as y
    from dual
    union all
    select 39 as age,2 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx,
           1.405 as euribor3m,5099.8 as nr_employed,0 as y
    from dual
    union all
    select 55 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.201 as cons_price_idx,-31.4 as cons_conf_idx,
           0.869 as euribor3m,5076.2 as nr_employed,1 as y
    from dual
    union all
    select 30 as age,8 as campaign,999 as pdays,0 as previous,1.4 as emp_var_rate,
           93.918 as cons_price_idx,-42.7 as cons_conf_idx,
           4.961 as euribor3m,5228.2 as nr_employed,0 as y
    from dual
    union all
    select 37 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx,
           1.327 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
    union all
    select 39 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx,
           1.313 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
    union all
    select 36 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.963 as cons_price_idx,-40.8 as cons_conf_idx,
           1.266 as euribor3m,5076.2 as nr_employed,1 as y
    from dual
    union all
    select 27 as age,2 as campaign,999 as pdays,1 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx,
           1.41 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
) a;
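- Build the following experiment. The feature encoding component is usually used together with the GBDT binary classification component. For details, see Algorithm modeling.
Configure the GBDT binary classification component as follows: set the number of trees to 5 and the maximum tree depth to 3, use y as the label column, and use all other fields as feature columns.
- Run the experiment and view the prediction results.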
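| kv | y |
| --- | --- |
| 2:1,5:1,8:1,12:1,15:1,18:1,28:1,34:1,41:1,50:1,53:1,63:1,72:1 | 0.0 |
| 2:1,5:1,6:1,12:1,15:1,16:1,28:1,34:1,41:1,50:1,51:1,63:1,72:1 | 0.0 |
| 2:1,3:1,12:1,13:1,28:1,34:1,36:1,39:1,55:1,61:1 | 1.0 |
| 2:1,3:1,12:1,13:1,20:1,21:1,22:1,42:1,43:1,46:1,63:1,64:1,67:1,68:1 | 0.0 |
| 0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1 | 1.0 |

The output can be fed directly into the logistic regression binary or multiclass classification component. The result is usually better than using LR or GBDT alone, and the model is less prone to overfitting.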
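Before passing the kv column to a downstream model outside the platform, it may help to expand the sparse strings. The following sketch parses one kv value into an index-to-value mapping and, optionally, into a dense list. The function name `parse_kv` is hypothetical, and the dimension 73 is an assumption based only on the largest index (72) that appears in the sample rows.

```python
def parse_kv(kv, dim=None):
    """Parse a sparse 'index:value,index:value' string.

    Returns a dict {index: value}, or a dense list of length dim if dim is given.
    """
    pairs = {}
    for item in kv.split(","):
        index, value = item.split(":")
        pairs[int(index)] = float(value)
    if dim is None:
        return pairs
    dense = [0.0] * dim
    for index, value in pairs.items():
        dense[index] = value
    return dense

row = "2:1,3:1,12:1,13:1,28:1,34:1,36:1,39:1,55:1,61:1"
sparse = parse_kv(row)
print(sorted(sparse))  # [2, 3, 12, 13, 28, 34, 36, 39, 55, 61]

# Assumed dimension: largest index in the sample output (72) plus one
dense = parse_kv(row, dim=73)
print(sum(dense))  # 10.0
```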