Feature generation overview and configuration (Intelligent Recommendation AIRec, Alibaba Cloud Help Center)

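Feature generation (FeatureGenerator, hereafter FG) is a set of data transformations that convert raw input into the input (features) a model requires. Its purpose is to keep the results of offline and online sample generation consistent.

FG covers only the transformations that are needed in both offline and online sample generation. A transformation that applies only in the offline stage does not need to be defined as an FG operation. The position of the FG module in the recommendation system architecture is shown in the following figure:

Feature generation runs a series of feature transformation operators (hereafter FG operators) in parallel, following the topological order of the DAG defined in the configuration file.

Configuration file example

The features list configures the feature operators. Each feature operator must include the feature_name and feature_type items; for the remaining configuration items, see Built-in feature operators.

reserves specifies the fields that an offline task passes through. These fields are output unchanged, with no feature transformation applied.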

{
  "features": [
    {
      "feature_name": "goods_id",
      "feature_type": "id_feature",
      "value_type": "string",
      "expression": "item:goods_id",
      "default_value": "-1024",
      "need_prefix": false
    },
    {
      "feature_name": "color_pair",
      "feature_type": "combo_feature",
      "value_type": "string",
      "expression": ["user:query_color", "item:color"],
      "default_value": "",
      "need_prefix": false
    },
    {
      "feature_name": "current_price",
      "feature_type": "raw_feature",
      "value_type": "double",
      "expression": "item:current_price",
      "default_value": "0",
      "need_prefix": false
    }, 
    {
      "feature_name": "usr_cate1_clk_cnt_1d",
      "feature_type": "lookup_feature",
      "map": "user:usr_cate1_clk_cnt_1d",
      "key": "item:cate1",
      "need_discrete": false,
      "need_key": false,
      "default_value": "0",
      "combiner": "max",
      "need_prefix": false,
      "value_type": "double"
    },
    {
      "feature_name": "recommend_match",
      "feature_type": "overlap_feature",
      "method": "is_contain",
      "query": "user:query_recommend",
      "title": "item:recommend",
      "default_value": "0"
    },
    {
      "feature_name": "norm_title",
      "feature_type": "text_normalizer",
      "expression": "item:title",
      "max_length": 512,
      "parameter": 0,
      "remove_space": false,
      "is_gbk_input": false,
      "is_gbk_output": false
    },
    {
      "feature_name": "title_terms",
      "feature_type": "tokenize_feature",
      "expression": "feature:norm_title",
      "default_value": "",
      "vocab_file": "tokenizer.json",
      "output_type": "word_id",
      "output_delim": ","
    },
    {
      "feature_name": "query_title_match_ratio",
      "feature_type": "overlap_feature",
      "method": "query_common_ratio",
      "query": "user:query_terms",
      "title": "feature:title_terms",
      "default_value": "0"
    },
    {
      "feature_name": "title_term_match_ratio",
      "feature_type": "overlap_feature",
      "method": "title_common_ratio",
      "query": "user:query_terms",
      "title": "feature:title_terms",
      "default_value": "0"
    },
    {
      "feature_name": "term_proximity_min_cover",
      "feature_type": "overlap_feature",
      "method": "proximity_min_cover",
      "query": "user:query_terms",
      "title": "feature:title_terms",
      "default_value": "0"
    }
  ],
  "reserves": [
    "request_id",
    "user_id",
    "is_click",
    "is_pay",
    "sample_weight",
    "event_unix_time"
  ]
}
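The one hard rule stated above, that every feature operator carries both feature_name and feature_type, can be checked before a configuration is submitted. The following is a minimal sketch, not part of FG itself: the function name validate_fg_config and the message format are assumptions made for illustration.

```python
import json

# The two items that the text above says every feature operator must include.
REQUIRED_KEYS = ("feature_name", "feature_type")


def validate_fg_config(config_text):
    """Return a list of problems found in an FG configuration (hypothetical helper)."""
    config = json.loads(config_text)
    problems = []
    for index, feature in enumerate(config.get("features", [])):
        for key in REQUIRED_KEYS:
            if key not in feature:
                problems.append(f"features[{index}] is missing '{key}'")
    return problems


sample = """
{
  "features": [
    {"feature_name": "goods_id", "feature_type": "id_feature",
     "expression": "item:goods_id"},
    {"feature_name": "broken"}
  ],
  "reserves": ["request_id"]
}
"""
problems = validate_fg_config(sample)
print(problems)
```

The check only covers the required items; operator-specific items such as expression or method are not validated here.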

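Input domains

An input domain indicates which entity the current input comes from. The following 4 types are currently supported:

user: user-side features, including the user profile and user-level statistical features.
context: context features that change from moment to moment, such as time, location, and weather.
item: item-side features, including static content features and item-level statistical features.
feature: indicates that the current input is the output of another feature transformation.

The feature input domain is special: it is how dependencies between feature operators are configured. Taken together, all feature operators form a directed acyclic graph (DAG), and the framework executes the feature transformations in parallel in topological order. The corresponding topology is as follows:

In general, the outputs of intermediate DAG nodes are not part of the FG output. The feature configuration item stub_type can be used to change this behavior.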
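To illustrate how feature: references define the DAG, the sketch below derives a valid execution order from a features list using Python's standard graphlib module. It is an illustration under assumptions, not the framework's scheduler: the helper names are hypothetical, dependencies are detected by scanning every configuration value for the feature: prefix (a heuristic), and the result is a serial order, whereas the framework runs independent operators in parallel.

```python
from graphlib import TopologicalSorter

FEATURE_PREFIX = "feature:"


def feature_dependencies(feature):
    """Collect the names of other operators that this operator reads from."""
    deps = set()
    for value in feature.values():
        candidates = value if isinstance(value, list) else [value]
        for item in candidates:
            if isinstance(item, str) and item.startswith(FEATURE_PREFIX):
                deps.add(item[len(FEATURE_PREFIX):])
    return deps


def execution_order(features):
    """Return operator names so that every dependency comes before its consumer."""
    graph = {f["feature_name"]: feature_dependencies(f) for f in features}
    return list(TopologicalSorter(graph).static_order())


# Three operators taken from the configuration example, listed out of order.
features = [
    {"feature_name": "title_terms", "feature_type": "tokenize_feature",
     "expression": "feature:norm_title"},
    {"feature_name": "norm_title", "feature_type": "text_normalizer",
     "expression": "item:title"},
    {"feature_name": "query_title_match_ratio", "feature_type": "overlap_feature",
     "query": "user:query_terms", "title": "feature:title_terms"},
]
order = execution_order(features)
print(order)
```

TopologicalSorter raises CycleError when the references form a cycle, which matches the requirement that the operators form an acyclic graph.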

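Multi-value types and separators

FG supports complex input types such as Array and Map, consistent with the complex types of MaxCompute.

Multi-value features of the string type can use chr(29) as the separator.

For example, in v1^]v2^]v3, ^] denotes the multi-value separator. It is a single character whose ASCII code is "\x1D", not two characters. In emacs it is entered with C-q C-5; in vi it is entered with C-v C-5.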
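The separator can also be produced programmatically rather than typed. The following sketch shows how a multi-value string might be built and split in Python; the helper names are hypothetical, and treating an empty string as an empty list is an assumption of this sketch, not documented FG behavior.

```python
# chr(29) is the single control character "\x1d" (shown as ^] in many editors).
MULTI_VALUE_SEP = chr(29)


def join_multi_value(values):
    """Encode a list of strings as one multi-value string."""
    return MULTI_VALUE_SEP.join(values)


def split_multi_value(raw):
    """Decode a multi-value string; an empty input yields an empty list (assumption)."""
    return raw.split(MULTI_VALUE_SEP) if raw else []


encoded = join_multi_value(["v1", "v2", "v3"])
decoded = split_multi_value(encoded)
print(repr(encoded), decoded)
```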

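Feature binning (discretization)

The framework supports the following 5 types of binning operations:

hash_bucket_size: hashes the feature transformation result and takes the modulo.
vocab_list: converts the feature transformation result into an index in a list.
vocab_dict: converts the feature transformation result into a value in a dictionary (the value must be convertible to int64).
boundaries: specifies bin boundaries and converts the feature transformation result into the corresponding bucket number.
num_buckets: uses the feature transformation result directly as the bucket number.

hash_bucket_size

Hashes the feature transformation result and takes the modulo. Applies to feature values of any type.

Result range: [0, hash_bucket_size)
The binning result of an empty feature is hash(default_value) % hash_bucket_size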

{
  "hash_bucket_size": 128000,
  "default_value": "default value"
}
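The hash-and-modulo rule can be sketched as follows. The document does not say which hash function FG uses, so zlib.crc32 stands in purely as a deterministic placeholder; bucket numbers produced by this sketch will not match FG's actual output. The function name is hypothetical.

```python
import zlib


def hash_bucketize(value, hash_bucket_size, default_value=""):
    """Map any value to a bucket in [0, hash_bucket_size).

    zlib.crc32 is an assumed stand-in for FG's unspecified hash function.
    """
    # An empty feature falls back to default_value before hashing.
    text = default_value if value is None or value == "" else str(value)
    return zlib.crc32(text.encode("utf-8")) % hash_bucket_size


bucket = hash_bucketize("goods_123", 128000)
empty_bucket = hash_bucketize("", 128000, default_value="-1024")
print(bucket, empty_bucket)
```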

vocab_list

Bins by vocabulary: the input is mapped to its index in the vocabulary, so the binning result is the index of the feature value in the vocab_list array.

The element type of the vocab_list array must match the configured value_type.
num_oov_bucket: Non-negative integer, the number of out-of-vocabulary buckets.
- All out-of-vocabulary inputs will be assigned IDs in the range [vocabulary_size, vocabulary_size+num_oov_buckets) based on a hash of the input value.
- A positive num_oov_buckets cannot be specified with default_bucketize_value.
default_bucketize_value: The integer ID value to return for out-of-vocabulary feature values.
- This cannot be specified with a positive num_oov_buckets.
- Defaults to vocab_list.size().

{
  "vocab_list": [
    "",
    "<OOV>",
    "token1",
    "token2",
    "token3",
    "token4"
  ],
  "num_oov_bucket": 0,
  "default_bucketize_value": 1
}
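The lookup rules above can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the Python parameter num_oov_buckets corresponds to the num_oov_bucket key in the JSON above, and zlib.crc32 is a placeholder for the unspecified hash used to spread out-of-vocabulary values.

```python
import zlib


def vocab_list_bucketize(value, vocab_list, num_oov_buckets=0,
                         default_bucketize_value=None):
    """Return the index of value in vocab_list, or an out-of-vocabulary ID."""
    if num_oov_buckets > 0 and default_bucketize_value is not None:
        raise ValueError("a positive num_oov_buckets excludes default_bucketize_value")
    index = {token: i for i, token in enumerate(vocab_list)}
    if value in index:
        return index[value]
    if num_oov_buckets > 0:
        # OOV IDs fall in [vocabulary_size, vocabulary_size + num_oov_buckets).
        hashed = zlib.crc32(str(value).encode("utf-8"))  # assumed hash function
        return len(vocab_list) + hashed % num_oov_buckets
    if default_bucketize_value is None:
        return len(vocab_list)  # documented default: vocab_list.size()
    return default_bucketize_value


vocab = ["", "<OOV>", "token1", "token2", "token3", "token4"]
known = vocab_list_bucketize("token3", vocab, default_bucketize_value=1)
unknown = vocab_list_bucketize("token9", vocab, default_bucketize_value=1)
print(known, unknown)
```

With the configuration shown above, unknown tokens share index 1 with the "<OOV>" entry, which keeps the output range equal to the vocabulary size.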

vocab_dict

The binning result is the value that the feature value maps to in the vocab_dict dictionary. Different feature values may map to the same binning result.

The key type of the vocab_dict dictionary must match the configured value_type.
The values of vocab_dict must be convertible to int64.
num_oov_bucket: Non-negative integer, the number of out-of-vocabulary buckets.
- All out-of-vocabulary inputs will be assigned IDs in the range [vocabulary_size, vocabulary_size+num_oov_buckets) based on a hash of the input value.
- A positive num_oov_buckets cannot be specified with default_bucketize_value.
default_bucketize_value: The integer ID value to return for out-of-vocabulary feature values.
- This cannot be specified with a positive num_oov_buckets.
- Defaults to vocab_dict.size().

{
  "vocab_dict": {
    "token1": 1,
    "token2": 2,
    "token3": 3,
    "token4": 1
  },
  "num_oov_bucket": 0,
  "default_bucketize_value": 4
}
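A minimal sketch of the dictionary lookup, showing the many-to-one mapping that vocab_dict allows. The function name is hypothetical, and out-of-vocabulary hashing is left out for brevity, so only the default_bucketize_value path is modeled.

```python
def vocab_dict_bucketize(value, vocab_dict, default_bucketize_value=None):
    """Return the bucket configured for value, or the out-of-vocabulary default."""
    if value in vocab_dict:
        # Values must be convertible to an integer (int64 in FG).
        return int(vocab_dict[value])
    if default_bucketize_value is None:
        return len(vocab_dict)  # documented default: vocab_dict.size()
    return default_bucketize_value


vocab_dict = {"token1": 1, "token2": 2, "token3": 3, "token4": 1}
buckets = [vocab_dict_bucketize(t, vocab_dict, default_bucketize_value=4)
           for t in ("token1", "token4", "token9")]
print(buckets)
```

Here token1 and token4 share bucket 1, and the unknown token9 receives the configured default of 4.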

boundaries

Bins numeric features according to the specified bin boundaries. Represents discretized dense input bucketed by boundaries.

The element type of the boundaries array must match the configured value_type.
Buckets include the left boundary, and exclude the right boundary.
Namely, boundaries=[0., 1., 2.] generates buckets (-inf, 0.), [0., 1.), [1., 2.), and [2., +inf).

{
  "boundaries": [0.0, 1.0, 2.0],
  "default_value": -1
}
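Left-inclusive, right-exclusive buckets correspond to a right-bisection search over the sorted boundaries. The sketch below uses Python's standard bisect module. Numbering the buckets from 0, starting with (-inf, first boundary), is an assumption of this sketch, and the handling of default_value is not modeled.

```python
from bisect import bisect_right


def boundaries_bucketize(value, boundaries):
    """Return the bucket number of value for sorted, ascending boundaries.

    Bucket 0 is (-inf, boundaries[0]); the last bucket is [boundaries[-1], +inf).
    """
    # bisect_right counts the boundaries <= value, so a value equal to a
    # boundary lands in the bucket that starts at that boundary.
    return bisect_right(boundaries, value)


boundaries = [0.0, 1.0, 2.0]
results = [boundaries_bucketize(v, boundaries) for v in (-0.5, 0.0, 0.99, 1.0, 2.0, 7.5)]
print(results)
```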

num_buckets

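Uses the feature transformation result directly as the bucket number. Applies when the feature value can be converted to an integer.

Result range: [0, num_buckets)
If the feature value falls outside the configured range, it is assigned default_bucketize_value.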

{
  "num_buckets": 128000,
  "default_bucketize_value": 127999
}
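The range check can be sketched as follows. The function name is hypothetical, and sending values that cannot be converted to an integer to default_bucketize_value is an assumption of this sketch; the document specifies only the out-of-range case.

```python
def num_buckets_bucketize(value, num_buckets, default_bucketize_value):
    """Use value itself as the bucket number when it lies in [0, num_buckets)."""
    try:
        bucket = int(value)
    except (TypeError, ValueError):
        # Assumption: non-convertible input is treated like an out-of-range value.
        return default_bucketize_value
    if 0 <= bucket < num_buckets:
        return bucket
    return default_bucketize_value


in_range = num_buckets_bucketize("42", 128000, 127999)
too_large = num_buckets_bucketize(128000, 128000, 127999)
negative = num_buckets_bucketize(-1, 128000, 127999)
print(in_range, too_large, negative)
```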

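Built-in feature operators

Each feature operator is configured differently. All feature operators that can serve as leaf nodes of the DAG support feature binning.

For more information, see Built-in feature operators.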

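Feature type	Description
id_feature	Categorical feature
raw_feature	Numeric feature
expr_feature	Expression feature
combo_feature	Combination feature
lookup_feature	Dictionary lookup feature
match_feature	Dictionary lookup feature with primary and secondary keys
overlap_feature	Overlap feature
sequence_feature	Sequence feature
text_normalizer	Text normalization
tokenize_feature	Text tokenization feature
bm25_feature	BM25 text relevance feature
kv_dot_product	Dot product of key-value vectors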

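Custom feature operators

Custom feature operators can be loaded and executed dynamically by the framework as plug-ins.

For more information, see Custom feature operators.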

Notes

For floating-point features, FG guarantees only 6 digits of precision.