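Feature encoding uses a GBDT model to encode nonlinear features into linear features.
Overview
Feature encoding is a strategy that mines new features by using decision trees and ensemble algorithms. Each new feature comes from the one-hot encoding of the leaf nodes of decision trees that are built from one or more original features.
For example, the following figure shows three trees with 12 leaf nodes in total. The leaf nodes are numbered in tree order as features 0 to 11: the first tree occupies features 0 to 3, the second tree occupies features 4 to 7, and the third tree occupies features 8 to 11. This encoding strategy effectively converts the nonlinear features learned by GBDT into linear features.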
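The numbering scheme in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not the component's implementation: the function name `encode_leaves` and the convention that each tree reports a 0-based leaf position are assumptions made for the sketch.

```python
from typing import List


def encode_leaves(leaf_per_tree: List[int], leaves_per_tree: List[int]) -> List[int]:
    """One-hot encode the leaf that each tree routes a sample to.

    leaf_per_tree[i]   : 0-based leaf position the sample reaches in tree i (assumed convention)
    leaves_per_tree[i] : number of leaves in tree i
    """
    if len(leaf_per_tree) != len(leaves_per_tree):
        raise ValueError("one leaf position is required per tree")

    encoded = [0] * sum(leaves_per_tree)
    offset = 0  # first global feature id owned by the current tree
    for leaf, n_leaves in zip(leaf_per_tree, leaves_per_tree):
        if not 0 <= leaf < n_leaves:
            raise ValueError(f"leaf position {leaf} is outside 0..{n_leaves - 1}")
        encoded[offset + leaf] = 1
        offset += n_leaves
    return encoded


# Three trees with four leaves each, as in the example above.
# The sample lands in leaf 1 of tree 1, leaf 0 of tree 2, and leaf 3 of tree 3.
vector = encode_leaves([1, 0, 3], [4, 4, 4])
active = [i for i, v in enumerate(vector) if v]
print(active)  # [1, 4, 11]
```

Each sample activates exactly one feature per tree, so a linear model trained on the encoded vector effectively learns one weight per leaf.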

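Component configuration
You can configure the parameters of the feature encoding component in either of the following ways.
Method 1: Visual configuration
Configure the component parameters on the workflow page of PAI-Designer (formerly PAI-Studio).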
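Tab | Parameter | Description |
---|---|---|
Field settings | Feature columns | The feature columns in the input table that are used for training. |
Field settings | Label column | Required. |
Field settings | Additional output columns | Optional. Retains the original features in the output table. |
Parameter settings | Number of cores | The number of cores used for computation. The value must be a positive integer. |
Parameter settings | Memory per core | The memory size of each core. The value must be a positive integer. |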
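Method 2: PAI command
Configure the component parameters by using a PAI command. You can call PAI commands from the SQL Script component. For details, see SQL Script.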
PAI -name fe_encode_runner -project algo_public
-DinputTable="pai_temp_2159_19087_1"
-DencodeModel="xlab_m_GBDT_LR_1_19064"
-DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign"
-DlabelCol="y"
-DoutputTable="pai_temp_2159_19061_1"
-DcoreNum=10
-DmemSizePerCore=1024;
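Parameter | Required | Description | Default |
---|---|---|---|
inputTable | Yes | The name of the input table. | None |
inputTablePartitions | No | The partitions of the input table that are used for training, in the format partition_name=value. For multi-level partitions, use the format name1=value1/name2=value2. To specify multiple partitions, separate them with commas (,). | All partitions of the input table |
encodeModel | Yes | The GBDT binary classification model that is used as the input for encoding. | None |
outputTable | Yes | The output table that stores the encoding results. | None |
selectedCols | Yes | The features that the GBDT model uses for encoding. These are usually the training features of the GBDT component. | None |
labelCol | Yes | The label column. | None |
lifecycle | No | The lifecycle of the output table. | 7 |
coreNum | No | The total number of instances. The value is of the BIGINT type. | -1. The number of instances is calculated based on the input data volume. |
memSizePerCore | No | The memory size. | -1. The memory size is calculated based on the input data volume. |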
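The inputTablePartitions format described in the table can be assembled programmatically. The following Python sketch is only an illustration: the helper name `format_partitions` and the partition keys `ds` and `region` are hypothetical and do not come from the component or its sample tables.

```python
from typing import Dict, List


def format_partitions(partitions: List[Dict[str, str]]) -> str:
    """Build an inputTablePartitions value.

    Each dict describes one partition. Its levels are joined with "/"
    (name1=value1/name2=value2), and partitions are joined with ",".
    Dict insertion order is kept, so list the levels from outermost to innermost.
    """
    return ",".join(
        "/".join(f"{name}={value}" for name, value in partition.items())
        for partition in partitions
    )


# Hypothetical two-level partitions (ds, region); replace with your own keys.
spec = format_partitions([
    {"ds": "20240101", "region": "cn"},
    {"ds": "20240102", "region": "cn"},
])
print(f'-DinputTablePartitions="{spec}"')
```

The printed line can be added to the PAI command alongside the other -D parameters.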
Example
- Use SQL statements to generate the training data.
  CREATE TABLE IF NOT EXISTS tdl_pai_bank_test1
  (
    age            BIGINT COMMENT '',
    campaign       BIGINT COMMENT '',
    pdays          BIGINT COMMENT '',
    previous       BIGINT COMMENT '',
    emp_var_rate   DOUBLE COMMENT '',
    cons_price_idx DOUBLE COMMENT '',
    cons_conf_idx  DOUBLE COMMENT '',
    euribor3m      DOUBLE COMMENT '',
    nr_employed    DOUBLE COMMENT '',
    y              BIGINT COMMENT ''
  )
  LIFECYCLE 7;

  insert overwrite table tdl_pai_bank_test1
  select * from
  (
    select 53 as age,1 as campaign,999 as pdays,0 as previous,-0.1 as emp_var_rate,
           93.2 as cons_price_idx,-42.0 as cons_conf_idx,4.021 as euribor3m,5195.8 as nr_employed,0 as y
    from dual
    union all
    select 28 as age,3 as campaign,6 as pdays,2 as previous,-1.7 as emp_var_rate,
           94.055 as cons_price_idx,-39.8 as cons_conf_idx,0.729 as euribor3m,4991.6 as nr_employed,1 as y
    from dual
    union all
    select 39 as age,2 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx,1.405 as euribor3m,5099.8 as nr_employed,0 as y
    from dual
    union all
    select 55 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.201 as cons_price_idx,-31.4 as cons_conf_idx,0.869 as euribor3m,5076.2 as nr_employed,1 as y
    from dual
    union all
    select 30 as age,8 as campaign,999 as pdays,0 as previous,1.4 as emp_var_rate,
           93.918 as cons_price_idx,-42.7 as cons_conf_idx,4.961 as euribor3m,5228.2 as nr_employed,0 as y
    from dual
    union all
    select 37 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx,1.327 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
    union all
    select 39 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx,1.313 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
    union all
    select 36 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.963 as cons_price_idx,-40.8 as cons_conf_idx,1.266 as euribor3m,5076.2 as nr_employed,1 as y
    from dual
    union all
    select 27 as age,2 as campaign,999 as pdays,1 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx,1.41 as euribor3m,5099.1 as nr_employed,0 as y
    from dual
  ) a
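- Build the following experiment. The feature encoding component is usually used together with the GBDT binary classification component. For details, see Algorithm modeling.
  Configure the GBDT binary classification component as follows: set the number of trees to 5, set the maximum tree depth to 3, use y as the label column, and use the other fields as feature columns.
- Run the experiment and view the prediction results.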
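kv | y |
---|---|
2:1,5:1,8:1,12:1,15:1,18:1,28:1,34:1,41:1,50:1,53:1,63:1,72:1 | 0.0 |
2:1,5:1,6:1,12:1,15:1,16:1,28:1,34:1,41:1,50:1,51:1,63:1,72:1 | 0.0 |
2:1,3:1,12:1,13:1,28:1,34:1,36:1,39:1,55:1,61:1 | 1.0 |
2:1,3:1,12:1,13:1,20:1,21:1,22:1,42:1,43:1,46:1,63:1,64:1,67:1,68:1 | 0.0 |
0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1 | 1.0 |

The output can be fed directly into the logistic regression binary classification or multiclass classification component. The combined model usually performs better than LR or GBDT alone and is less prone to overfitting.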
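If you want to inspect the kv column outside PAI, each value is a comma-separated list of index:value pairs. The following Python sketch parses one row into a sparse dictionary and expands it into a dense vector. The function names are hypothetical, and the vector length of 73 is an assumption based only on the largest index (72) visible in the sample output; the real feature count depends on the trained model.

```python
from typing import Dict, List


def parse_kv(kv: str) -> Dict[int, float]:
    """Parse a string such as "2:1,5:1" into {feature_index: value}."""
    features: Dict[int, float] = {}
    for item in kv.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty input or a trailing comma
        index, value = item.split(":")
        features[int(index)] = float(value)
    return features


def to_dense(features: Dict[int, float], size: int) -> List[float]:
    """Expand a sparse feature dict into a dense list of the given length."""
    dense = [0.0] * size
    for index, value in features.items():
        dense[index] = value
    return dense


# Last row of the sample output above.
row = "0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1"
sparse = parse_kv(row)
dense = to_dense(sparse, 73)  # 73 is an assumed length (max sample index + 1)
print(sorted(sparse))
```

A dense vector of this kind is what a linear model such as logistic regression consumes; the sparse form is usually preferable when the number of encoded features is large.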