Before you configure features, read this article: Experience and Traps in Deploying the Contextual Bandit (LinUCB) Algorithm in a Production Recommendation System. This article describes the dummy variable trap and offers guidance on hyperparameter tuning.
`expression` is required for every feature type described below except lookup features, which use `map` and `key` instead. It names the source field for the feature. Use the prefix `user:` or `arm:` to specify whether the field comes from user features or item features. For example, `user:is_member` retrieves the value of the `is_member` feature from the `user_feature` parameter, and `arm:author_id` retrieves the value of `author_id` from the loaded item features. The default prefix is `user:`. `expression` has the same meaning for all feature types.
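The prefix rule can be sketched in a few lines of Python. The function names `parse_expression` and `resolve` are hypothetical, and the sketch assumes that user and item features arrive as plain dictionaries; the actual service may resolve fields differently.

```python
def parse_expression(expression):
    """Split an expression into (source, field_name).

    The source is "user" or "arm"; an expression without a prefix
    falls back to "user", the documented default.
    """
    prefix, sep, field = expression.partition(":")
    if sep and prefix in ("user", "arm"):
        return prefix, field
    return "user", expression


def resolve(expression, user_features, arm_features):
    """Fetch the raw value for an expression (None if the field is missing)."""
    source, field = parse_expression(expression)
    table = user_features if source == "user" else arm_features
    return table.get(field)


user = {"is_member": 1, "gender": "F"}
arm = {"author_id": "a42"}
print(resolve("user:is_member", user, arm))
print(resolve("arm:author_id", user, arm))
print(resolve("gender", user, arm))  # no prefix: read from user features
```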
`share_weight` is an optional parameter. When you use a hybrid LinUCB algorithm, features marked with `share_weight` share parameters between Arms. The default value of `share_weight` is `false`. For hybrid algorithms, set the `share_weight` of interaction features to `true`.
Id feature
An id feature is a sparse feature. It is the simplest type of discrete feature. It generates a feature vector using the multi-hot encoding method.
It supports four configuration methods: `vocab_list`, `num_buckets`, `hash_bucket_size`, and `boundaries`.
Configuration example:
{
  "FeatureConf": [
    {
      "feature_type": "id_feature",
      "expression": "gender",
      "vocab_list": ["M", "F"]
    },
    {
      "feature_type": "id_feature",
      "expression": "level",
      "num_buckets": 51
    },
    {
      "feature_type": "id_feature",
      "expression": "familyid",
      "hash_bucket_size": 200
    },
    {
      "feature_type": "id_feature",
      "expression": "is_member",
      "num_buckets": 2
    },
    {
      "feature_type": "id_feature",
      "expression": "fans_num",
      "boundaries": [1, 2, 3, 4, 7, 15, 30, 50, 120]
    }
  ]
}
Example: The default multi-value separator is ^]. Note that this is a single character with the ASCII code `\x1D`, not two separate characters. You can change the separator using the `separator` configuration item.
Type | Feature value | Intermediate output |
int64_t | 100 | 100 |
double | 5.2 | 5 (the decimal part is truncated) |
string | abc | abc |
Multi-value string | abc^]bcd | (abc, bcd) |
Multi-value int | 123^]456 | (123, 456) |
The final output feature is transformed into a multi-hot real number vector. The transformation method is determined by the `vocab_list`, `num_buckets`, `hash_bucket_size`, and `boundaries` configuration items.
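A minimal Python sketch of how the four configuration items could map intermediate values to a multi-hot vector. All function names are hypothetical. The hash function (`zlib.crc32`), the treatment of a value equal to a boundary, and the choice to ignore out-of-vocabulary or out-of-range values are assumptions made for illustration; the source does not specify them.

```python
import bisect
import zlib

SEPARATOR = "\x1d"  # default multi-value separator, displayed as ^] in the text


def split_values(raw):
    """Produce the intermediate output: doubles are truncated, strings are split."""
    if isinstance(raw, float):
        return [int(raw)]
    if isinstance(raw, str):
        return raw.split(SEPARATOR)
    return [raw]


def multi_hot(indices, size):
    """Set 1.0 at every valid index; invalid indices are ignored (assumption)."""
    vec = [0.0] * size
    for i in indices:
        if 0 <= i < size:
            vec[i] = 1.0
    return vec


def encode_vocab(values, vocab_list):
    index = {v: i for i, v in enumerate(vocab_list)}
    hits = [index[str(v)] for v in values if str(v) in index]
    return multi_hot(hits, len(vocab_list))


def encode_num_buckets(values, num_buckets):
    # The value itself is used as the bucket index.
    return multi_hot([int(v) for v in values], num_buckets)


def encode_hash(values, hash_bucket_size):
    # crc32 is a stand-in; the real hash function is not documented here.
    hits = [zlib.crc32(str(v).encode("utf-8")) % hash_bucket_size for v in values]
    return multi_hot(hits, hash_bucket_size)


def encode_boundaries(values, boundaries):
    # n boundaries give n + 1 buckets; a value equal to a boundary is
    # placed in the upper bucket here (assumption).
    hits = [bisect.bisect_right(boundaries, float(v)) for v in values]
    return multi_hot(hits, len(boundaries) + 1)


print(encode_vocab(split_values("M"), ["M", "F"]))
print(encode_boundaries(split_values(5), [1, 2, 3, 4, 7, 15, 30, 50, 120]))
```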
Raw feature
A raw feature is a dense feature. It directly uses the value of the source field as the feature value. Raw features only support numeric types, such as `int`, `float`, and `double`. For non-numeric features, use an id feature.
Field name | Description |
expression | Required. Describes the source field for the feature. |
separator | Multi-value separator. |
value_dimension | Optional. The dimension of the output field. The default value is 1. |
normalizer | Optional. The normalization method. For more information, see the following sections. |
Configuration example:
{
  "FeatureConf": [
    {
      "feature_type": "raw_feature",
      "expression": "userid_avg_hot_15",
      "normalizer": "method=minmax,min=0,max=60"
    },
    {
      "feature_type": "raw_feature",
      "expression": "userid_avg_duration_15",
      "normalizer": "method=log10"
    }
  ]
}
Normalizer
`raw_feature` and `lookup_feature` support normalizers. Three methods are available: `minmax`, `zscore`, and `log10`. The configuration and calculation methods are as follows:
log10
Configuration example: method=log10,threshold=1e-10,default=1e-10
Formula: x = x > threshold ? log10(x) : default
zscore
Configuration example: method=zscore,mean=0.0,standard_deviation=10.0
Formula: x = (x - mean) / standard_deviation
minmax
Configuration example: method=minmax,min=2.1,max=2.2
Formula: x = (x - min) / (max - min)
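The three formulas can be checked with a short Python sketch. `parse_normalizer` and `normalize` are hypothetical names. The fallback values used when `threshold` or `default` is omitted are an assumption taken from the configuration example above, not documented defaults.

```python
import math


def parse_normalizer(spec):
    """Parse a string such as 'method=minmax,min=0,max=60'."""
    params = dict(item.split("=", 1) for item in spec.split(","))
    method = params.pop("method")
    return method, {name: float(value) for name, value in params.items()}


def normalize(x, spec):
    method, p = parse_normalizer(spec)
    if method == "log10":
        threshold = p.get("threshold", 1e-10)  # assumed fallback
        default = p.get("default", 1e-10)      # assumed fallback
        return math.log10(x) if x > threshold else default
    if method == "zscore":
        return (x - p["mean"]) / p["standard_deviation"]
    if method == "minmax":
        return (x - p["min"]) / (p["max"] - p["min"])
    raise ValueError("unknown normalizer method: " + method)


print(normalize(30, "method=minmax,min=0,max=60"))
print(normalize(100, "method=log10"))
print(normalize(5, "method=zscore,mean=0.0,standard_deviation=10.0"))
```

Note that `minmax` as written does not clip: an input outside the configured `[min, max]` range yields a value outside `[0, 1]`.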
Combo feature
A combo feature is a combination (Cartesian product) of multiple fields or expressions. An id feature can be considered a special type of combo feature with only one field for interaction. Typically, the fields involved in the interaction come from different tables, such as user features and item features.
After feature interaction, a feature vector is generated for the combo feature using the one-hot encoding method.
Configuration example:
{
  "FeatureConf": [
    {
      "feature_type": "combo_feature",
      "expression": ["user:age_class", "arm:item_id"],
      "hash_bucket_size": 200,
      "share_weight": true
    },
    {
      "feature_type": "combo_feature",
      "expression": ["user:age_class", "arm:level"],
      "num_buckets": [5, 8],
      "share_weight": true
    }
  ]
}
The number of output features is equal to |F1| × |F2| × ... × |Fn|, where |Fi| is the number of distinct values of the i-th field in the combination. In the second example above, this is 5 × 8 = 40.
If hash_bucket_size is configured, the combo feature value is hashed into the specified number of buckets.
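As an illustration, the two variants can be sketched in Python. The function names are hypothetical, and the row-major index layout, the `_` join character, and the use of `zlib.crc32` are assumptions; the source states only the output size and that hashing is applied.

```python
import zlib


def combo_index_num_buckets(values, num_buckets):
    """Row-major position of one value combination in the Cartesian product."""
    index = 0
    for value, size in zip(values, num_buckets):
        if not 0 <= value < size:
            raise ValueError("value %r is outside [0, %d)" % (value, size))
        index = index * size + value
    return index


def combo_index_hash(values, hash_bucket_size):
    """Join the field values and hash the result into a fixed number of buckets."""
    key = "_".join(str(v) for v in values)
    return zlib.crc32(key.encode("utf-8")) % hash_bucket_size


def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# age_class = 2 with level = 3 under num_buckets [5, 8]: 2 * 8 + 3
idx = combo_index_num_buckets([2, 3], [5, 8])
print(idx, len(one_hot(idx, 5 * 8)))
print(combo_index_hash(["age_2", "item_981"], 200))
```

With `hash_bucket_size`, distinct combinations can collide in the same bucket, so the bucket count trades vector length against collision rate.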
Lookup feature
A lookup feature finds a matching result from a set of key-value pairs.
A lookup feature depends on two fields: `map` and `key`. The `map` field is a multi-value string (MultiString), where each string is a key-value pair in the format `k1:v1`. The `key` field can be of any type. To generate the feature, the system retrieves the value of the `key`, converts it to a string, and then searches the `map` field for a matching key to retrieve the corresponding value.
The source for the `map` and `key` can be any combination of item, user, and context. Multiple values for an item are separated by the `char(29)` character. Multiple values for a user and context are represented as a list during TPP access. This feature only supports configuration in JSON format.
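The lookup step described above can be sketched as follows. `lookup` is a hypothetical name. Returning `0.0` when no key matches, parsing the value as a float, and splitting each entry at its last colon are assumptions; the source does not describe these details.

```python
SEPARATOR = "\x1d"  # char(29), used for multi-value item fields


def lookup(map_value, key, default=0.0):
    """Find `key` in a collection of 'k:v' strings and return v as a float.

    `map_value` is either one string joined by char(29) (item side) or a
    list of strings (user and context side).
    """
    if isinstance(map_value, str):
        entries = map_value.split(SEPARATOR)
    else:
        entries = list(map_value)
    wanted = str(key)  # the key is converted to a string before matching
    for entry in entries:
        k, sep, v = entry.rpartition(":")
        if sep and k == wanted:
            return float(v)
    return default  # no match (assumed behavior)


# A user-side map of author id -> click count, looked up by an item-side key.
clicks_by_author = ["101:5", "202:9"]
print(lookup(clicks_by_author, 202))
print(lookup("a1:3" + SEPARATOR + "a2:7", "a2"))
```

Any configured `normalizer` would then be applied to the returned value.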
Configuration example:
{
  "FeatureConf": [
    {
      "feature_type": "lookup_feature",
      "map": "user:userid_kv__author__click_cnt_15",
      "key": "arm:userId",
      "normalizer": "method=log10",
      "share_weight": true
    }
  ]
}
Geohash feature
A geohash feature is generated by converting a latitude and longitude into a string of a specified length and then hashing the string. Geohash divides a geographical location into a grid and assigns an encoded hash value to each grid cell.
For more information about how geohash works, see this article.
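For orientation, here is a compact Python sketch of the standard geohash encoding followed by bucketing. `geohash_encode` and `geohash_bucket` are hypothetical names, and `zlib.crc32` is a stand-in for the undocumented hash function, so the bucket numbers will not match the real system.

```python
import zlib

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(latitude, longitude, precision):
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_longitude = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_longitude:
            mid = (lon_lo + lon_hi) / 2
            if longitude >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if latitude >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:  # every 5 bits form one base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def geohash_bucket(latitude, longitude, geohash_precision, hash_bucket_size):
    code = geohash_encode(latitude, longitude, geohash_precision)
    return zlib.crc32(code.encode("ascii")) % hash_bucket_size


print(geohash_encode(57.64911, 10.40744, 4))  # u4pr
print(geohash_bucket(57.64911, 10.40744, 4, 128))
```

A higher `geohash_precision` yields smaller grid cells, so more distinct strings compete for the same `hash_bucket_size` buckets.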
Configuration example:
{
  "FeatureConf": [
    {
      "feature_type": "geohash_feature",
      "expression": ["latitude", "longitude"],
      "geohash_precision": 4,
      "hash_bucket_size": 128
    }
  ]
}
The final geohash value is hashed into the number of buckets specified by `hash_bucket_size`.
Binary feature
A binary feature is a feature with a value of either 0 or 1. Features such as user gender are well-suited to be represented as binary features.
A binary feature is generated by checking if the source feature value exists in the collection specified by vocab_list. If a match is found, the feature value is 1. Otherwise, the value is 0.
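The membership check amounts to a single comparison, sketched here in Python. `binary_feature` is a hypothetical name, and converting the source value to a string before matching is an assumption.

```python
def binary_feature(value, vocab_list):
    """Return 1.0 if the source value is in vocab_list, otherwise 0.0."""
    return 1.0 if str(value) in set(vocab_list) else 0.0


print(binary_feature("M", ["M"]))
print(binary_feature("F", ["M"]))
```

Unlike an id feature with `vocab_list` set to `["M", "F"]`, which produces a two-element vector, this produces a single value.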
Configuration example:
{
  "FeatureConf": [
    {
      "feature_type": "binary_feature",
      "expression": "gender",
      "vocab_list": ["M"]
    }
  ]
}
If the source feature is already a binary type with values of 0 and 1, configure it as a `raw_feature` instead of a `binary_feature`.