特征平台当前提供的特征生产功能旨在简化特征创建过程,通过固化常用的和普遍的生产步骤,您仅需进行简单配置就能轻松生成特征,从而有效降低了特征生产的复杂性。特征生产在多个领域(包括推荐、广告、风控以及机器学习等)都有广泛应用,本文将以推荐场景为例,为您介绍从原始表到特征生产加工生成样本表,再到训练模型的完整过程。
前提条件
在开始执行操作前,请确认您已完成以下准备工作:
依赖产品 | 具体操作 |
人工智能平台PAI |
|
云原生大数据计算服务MaxCompute |
|
大数据开发治理平台DataWorks |
|
一、准备工作
准备原始数据
一般对于推荐场景,特征生产通常需要以下三张原始表,以这三张原始表为基础,特征生产可以生产出成百上千个特征,方便建立模型来拟合目标。
用户表(rec_sln_demo_user_table_preprocess_v1):包含一些基础的用户特征,例如性别、年龄、城市和关注数等。
物品表(rec_sln_demo_item_table_preprocess_v1):包含一些基础的物品特征,例如类别、作者、累计点击数和累计点赞数等。
行为表(rec_sln_demo_behavior_table_preprocess_v1):包含一些行为特征,例如某时用户点击某个物品等。
数据表存放在有公开读取权限的pai_online_project中,其数据均为模拟数据生成。您需要在DataWorks中执行SQL命令,将上表数据从pai_online_project项目同步到您的MaxCompute项目中。具体操作步骤如下:
登录DataWorks控制台。
在左侧导航栏单击数据开发与治理 > 数据开发。
选择已创建的DataWorks工作空间后,单击进入数据开发。
鼠标悬停至新建,选择新建节点 > MaxCompute > ODPS SQL,在弹出的页面中配置节点参数。
参数
取值建议
引擎实例
选择已创建的MaxCompute引擎。
节点类型
ODPS SQL
路径
业务流程/Workflow/MaxCompute
名称
可自定义名称。
单击确认。
在新建节点区域运行以下SQL命令,将用户表、物品表、行为表从pai_online_project项目同步到您自己的MaxCompute中。资源组选择已创建的独享资源组。
同步用户表:rec_sln_demo_user_table_preprocess_v1
CREATE TABLE IF NOT EXISTS rec_sln_demo_user_table_preprocess_v1 like pai_online_project.rec_sln_demo_user_table_preprocess_v1 STORED AS ALIORC LIFECYCLE 90; INSERT OVERWRITE TABLE rec_sln_demo_user_table_preprocess_v1 PARTITION(ds) SELECT * FROM pai_online_project.rec_sln_demo_user_table_preprocess_v1 WHERE ds >= '20240530' and ds <='20240605';
同步物品表:rec_sln_demo_item_table_preprocess_v1
CREATE TABLE IF NOT EXISTS rec_sln_demo_item_table_preprocess_v1 like pai_online_project.rec_sln_demo_item_table_preprocess_v1 STORED AS ALIORC LIFECYCLE 90; INSERT OVERWRITE TABLE rec_sln_demo_item_table_preprocess_v1 PARTITION(ds) SELECT * FROM pai_online_project.rec_sln_demo_item_table_preprocess_v1 WHERE ds >= '20240530' and ds <='20240605';
同步行为表:rec_sln_demo_behavior_table_preprocess_v1
CREATE TABLE IF NOT EXISTS rec_sln_demo_behavior_table_preprocess_v1 like pai_online_project.rec_sln_demo_behavior_table_preprocess_v1 STORED AS ALIORC LIFECYCLE 90; INSERT OVERWRITE TABLE rec_sln_demo_behavior_table_preprocess_v1 PARTITION(ds) SELECT * FROM pai_online_project.rec_sln_demo_behavior_table_preprocess_v1 WHERE ds >= '20240530' and ds <='20240605';
安装FeatureStore Python SDK
以下代码均建议在Jupyter Notebook环境下运行。
安装特征平台Python SDK,要求在Python3环境下运行。
%pip install https://feature-store-py.oss-cn-beijing.aliyuncs.com/package/feature_store_py-1.8.0-py3-none-any.whl
导入需要的功能模块:
import os from feature_store_py import FeatureStoreClient from feature_store_py.fs_datasource import MaxComputeDataSource from feature_store_py.feature_engineering import TableTransform, Condition, DayOf, ComboTransform, Feature, AggregationTransform, auto_count_feature_transform, WindowTransform, auto_window_feature_transform
二、原始表初步变换
准备流程完成后,您可以在自己的项目空间中查看已准备就绪的三张表:用户表(rec_sln_demo_user_table_preprocess_v1)、物品表(rec_sln_demo_item_table_preprocess_v1)以及行为表(rec_sln_demo_behavior_table_preprocess_v1),您可以以这三张表为基础进行特征生产。
在进行后续的特征生产加工之前,为了方便您后续做特征统计,您需要对数据表做以下预处理操作。您可以将以下SQL命令粘贴到已创建的ODPS SQL节点中执行,具体操作,请参见准备原始数据。
将用户表、物品表的特征和行为表进行连接。
CREATE TABLE IF NOT EXISTS rec_sln_demo_behavior_table_preprocess_wide_v1 ( request_id bigint ,user_id string ,page string ,net_type string ,day_h bigint COMMENT '行为发生在当天的第几小时' ,week_day bigint COMMENT '行为发生在当前周的第几天' ,event_unix_time bigint ,item_id string ,event string ,playtime double ,gender string ,age bigint ,city string ,item_cnt bigint ,follow_cnt bigint ,follower_cnt bigint ,is_new_user bigint ,tags string ,duration double ,category string ,author bigint ,click_count bigint ,praise_count bigint ,is_new_item bigint ) PARTITIONED BY ( ds string ) LIFECYCLE 90 ; INSERT OVERWRITE TABLE rec_sln_demo_behavior_table_preprocess_wide_v1 PARTITION(ds='${bdp.system.bizdate}') SELECT sq0.request_id ,sq0.user_id ,sq0.page ,sq0.net_type ,sq0.day_h ,sq0.week_day ,sq0.event_unix_time ,sq0.item_id ,sq0.event ,sq0.playtime ,sq1.gender ,sq1.age ,sq1.city ,sq1.item_cnt ,sq1.follow_cnt ,sq1.follower_cnt ,sq1.is_new_user ,sq1.tags ,sq2.duration ,sq2.category ,sq2.author ,sq2.click_count ,sq2.praise_count ,sq2.is_new_item FROM ( SELECT * FROM rec_sln_demo_behavior_table_preprocess_v1 WHERE ds = '${bdp.system.bizdate}' ) sq0 LEFT JOIN ( SELECT * FROM rec_sln_demo_user_table_preprocess_v1 WHERE ds = '${bdp.system.bizdate}' ) sq1 ON sq0.user_id = sq1.user_id LEFT JOIN ( SELECT * FROM rec_sln_demo_item_table_preprocess_v1 WHERE ds = '${bdp.system.bizdate}' ) sq2 ON sq0.item_id = sq2.item_id ;
因训练模型需要有标签,您需要对行为表进行预处理,将其转换成Label表。本示例将点击、播放时间、点赞作为标签。
CREATE TABLE IF NOT EXISTS rec_sln_demo_fs_label_table_v1 ( request_id bigint ,user_id string ,page string ,net_type string ,day_h bigint COMMENT '行为发生在当天的第几小时' ,week_day bigint COMMENT '行为发生在当前周的第几天' ,day_min string ,event_unix_time bigint ,item_id string ,playtime double ,is_click BIGINT ,ln_playtime DOUBLE ,is_praise BIGINT ) PARTITIONED BY ( ds string ) LIFECYCLE 90 ; INSERT OVERWRITE TABLE rec_sln_demo_fs_label_table_v1 PARTITION(ds='${bdp.system.bizdate}') SELECT request_id ,user_id ,MAX(page) page ,MAX(net_type) net_type ,MAX(day_h) day_h ,MAX(week_day) week_day ,TO_CHAR(FROM_UNIXTIME(MIN(event_unix_time)),'yyyymmddhhmi') day_min ,MAX(event_unix_time) event_unix_time ,item_id ,MAX(playtime) playtime ,max(if(event='click', 1, 0)) is_click ,ln(sum(playtime) + 1) ln_playtime ,max(if(event='praise', 1, 0)) is_praise FROM rec_sln_demo_behavior_table_preprocess_v1 WHERE ds = '${bdp.system.bizdate}' GROUP BY request_id ,user_id ,item_id ;
经过上面的初步变换后,您将拥有以下两种表:
新的行为表:rec_sln_demo_behavior_table_preprocess_wide_v1,后续的统计特征将以此表为基础进行变换。
Label表:rec_sln_demo_fs_label_table_v1,后续构建样本表时需要用到此表。
三、特征生产加工
您可以调用特征生产中的自动扩展函数进行特征生产,仅需几行代码,就可以生产出上百种特征。
特征生产加工分别需要加工用户侧特征和物品侧特征。具体操作步骤如下:
用户侧特征的特征生产
以下代码均建议在Jupyter Notebook环境下运行。
初始化Client。
access_key_id = os.environ.get("ALIBABA_CLOUD_ACCESS_KEY_ID") # 填入您的Access Key ID access_key_secret = os.environ.get("ALIBABA_CLOUD_ACCESS_KEY_SECRET") # 填入您的Access Key Secret project = 'project_name' # 填入您的项目名 region = 'cn-hangzhou' # 填入您的项目所在区域,例如华东1(杭州)为cn-hangzhou fs_client = FeatureStoreClient(access_key_id=access_key_id, access_key_secret=access_key_secret, region=region)
指定要进行特征变换的数据源。
input_bhv_table_name = "rec_sln_demo_behavior_table_preprocess_wide_v1" ds_bhv = MaxComputeDataSource(table=input_bhv_table_name, project=project) input_user_table_name = "rec_sln_demo_user_table_preprocess_v1" ds_user = MaxComputeDataSource(table=input_user_table_name, project=project) input_item_table_name = "rec_sln_demo_item_table_preprocess_v1" ds_item = MaxComputeDataSource(table=input_item_table_name, project=project)
指定JoinTransform和AggregationTransform后输出的用户侧特征表名称。
agg_user_table_v1 = 'rec_sln_demo_user_table_preprocess_agg_v1'
利用自动扩展函数,对统计特征进行自动扩展。
name_prefix = "user" input_list = ["playtime", "duration", "click_count", "praise_count"] event_name = 'event' event_type = 'expr' group_by_key = "user_id" window_size = [3,7,15] user_count_feature_list = auto_count_feature_transform(name_prefix, input_list, event_name, event_type, group_by_key, window_size) print("len_count_feature_list = ", len(user_count_feature_list)) print("count_feature_list = ", user_count_feature_list)
执行生产过程,并查看表运行结果。
创建Pipeline。
agg_user_bhv_pipeline = fs_client.create_pipeline(ds_bhv).add_feature_transform(user_count_feature_list) agg_user_pipeline =fs_client.create_pipeline(ds_user, agg_user_table_v1).merge(agg_user_bhv_pipeline, keep_input_columns=False)
执行Pipeline生产过程。
execute_date = '20240605' output_agg_user_table = agg_user_pipeline.execute(execute_date, drop_table=True)
查看表运行结果。
agg_user_ret = output_agg_user_table.to_pandas(execute_date, limit=20) agg_user_ret
进行WindowTransform变换,指定WindowTransform后输出的用户侧特征表名称。
win_user_table_v1 = 'rec_sln_demo_user_table_preprocess_win_v1'
利用内置的自动扩展函数,自动生成WindowTransform特征定义。
name_prefix = 'user' input_list = ['day_h', 'category'] agg_field = ['duration', 'click_count'] event_name = 'event' event_type = 'expr' group_by_key = 'user_id' window_size = [7, 15, 30, 45] user_win_feature_list = auto_window_feature_transform(name_prefix, input_list, agg_field, event_name, event_type, group_by_key, window_size) print("len_user_win_feature_list = ", len(user_win_feature_list)) print("user_win_feature_list = ", user_win_feature_list)
执行生产过程,并查看表运行结果。
创建pipeline。
win_user_bhv_pipeline = fs_client.create_pipeline(ds_bhv).add_feature_transform(user_win_feature_list) win_user_pipeline = fs_client.create_pipeline(ds_user, win_user_table_v1).merge(win_user_bhv_pipeline, keep_input_columns=False)
执行pipeline生产过程。
execute_date = '20240605' output_win_user_table = win_user_pipeline.execute(execute_date, drop_table=True) # 详细实现见功能文档。因为有中间表存在,第一次执行时需要补数据,可能需要较长时间,运行补数据可以执行下面的命令 # output_win_user_table = win_user_pipeline.execute(execute_date, drop_table=True, backfill_partitions=True)
查看表运行结果。
win_user_ret = output_win_user_table.to_pandas(execute_date, limit=20) win_user_ret
物品侧特征的特征生产
在完成用户侧特征的提取和加工之后,您可以继续对物品表进行特征生产加工,生成基于物品类型的特征。具体操作步骤如下:
指定AggregationTransform后输出的物品侧特征表名称。
agg_item_table_v1 = 'rec_sln_demo_item_table_preprocess_agg_v1'
利用自动扩展函数,对统计特征进行自动扩展。
name_prefix = "item" input_list = ["item_cnt", "follow_cnt", "follower_cnt"] event_name = 'event' event_type = 'expr' group_by_key = "item_id" window_size = [3,7,15] item_count_feature_list = auto_count_feature_transform(name_prefix, input_list, event_name, event_type, group_by_key, window_size) print("len_count_feature_list = ", len(item_count_feature_list)) print("count_feature_list = ", item_count_feature_list)
执行生产过程,并查看表运行结果。
创建pipeline。
agg_item_bhv_pipeline =fs_client.create_pipeline(ds_bhv).add_feature_transform(item_count_feature_list) agg_item_pipeline =fs_client.create_pipeline(ds_item, agg_item_table_v1).merge(agg_item_bhv_pipeline, keep_input_columns=False)
执行pipeline生产过程。
execute_date = '20240605' output_agg_item_table = agg_item_pipeline.execute(execute_date, drop_table=True)
查看表运行结果。
agg_item_ret = output_agg_item_table.to_pandas(execute_date, limit=20) agg_item_ret
指定WindowTransform后输出的用户侧特征表名称。
win_item_table_v1 = 'rec_sln_demo_item_table_preprocess_win_v1'
利用内置的自动扩展函数,自动生成WindowTransform特征定义。
name_prefix = 'item' input_list = ['day_h', 'category'] agg_field = ['click_count', 'praise_count'] event_name = 'event' event_type = 'expr' group_by_key = 'item_id' window_size = [7, 15, 30, 45] item_win_feature_list = auto_window_feature_transform(name_prefix, input_list, agg_field, event_name, event_type, group_by_key, window_size) print("len_item_win_feature_list = ", len(item_win_feature_list)) print("item_win_feature_list = ", item_win_feature_list)
执行生产过程,并查看表运行结果。
创建pipeline。
win_item_bhv_pipeline = fs_client.create_pipeline(ds_bhv).add_feature_transform(item_win_feature_list) win_item_pipeline = fs_client.create_pipeline(ds_item, win_item_table_v1).merge(win_item_bhv_pipeline, keep_input_columns=False)
执行pipeline生产过程。
execute_date = '20240605' output_win_item_table = win_item_pipeline.execute(execute_date, drop_table=True) # 详细实现见功能文档。因为有中间表存在,第一次执行时需要补数据,可能需要较长时间,运行补数据可以执行下面的命令 # output_win_item_table = win_item_pipeline.execute(execute_date, drop_table=True, backfill_partitions=True)
查看表运行结果。
win_item_ret = output_win_item_table.to_pandas(execute_date, limit=20) win_item_ret
四、生成样本表
经过上述流程,您已获取了Label表,并在物品侧和用户侧分别整理出三张特征表(包括一张原始表及两张经过特征生产加工出的表)。随后,您可以将以下SQL命令粘贴到已创建的ODPS SQL节点中执行,将这七张表进行合并,以构建完整的样本表。具体操作,请参见准备原始数据。
CREATE TABLE IF NOT EXISTS fs_demo_fs_engineering_v1_training_set(
request_id BIGINT,
user_id STRING,
page STRING,
net_type STRING,
day_h BIGINT,
week_day BIGINT,
day_min STRING,
event_unix_time BIGINT,
item_id STRING,
playtime DOUBLE,
is_click BIGINT,
ln_playtime DOUBLE,
is_praise BIGINT,
duration DOUBLE,
category STRING,
author BIGINT,
click_count BIGINT,
praise_count BIGINT,
is_new_item BIGINT,
item__sum_item_cnt_3d BIGINT,
item__sum_follow_cnt_3d BIGINT,
item__sum_follower_cnt_3d BIGINT,
item__max_item_cnt_3d BIGINT,
item__max_follow_cnt_3d BIGINT,
item__max_follower_cnt_3d BIGINT,
item__min_item_cnt_3d BIGINT,
item__min_follow_cnt_3d BIGINT,
item__min_follower_cnt_3d BIGINT,
item__avg_item_cnt_3d DOUBLE,
item__avg_follow_cnt_3d DOUBLE,
item__avg_follower_cnt_3d DOUBLE,
item__sum_item_cnt_7d BIGINT,
item__sum_follow_cnt_7d BIGINT,
item__sum_follower_cnt_7d BIGINT,
item__max_item_cnt_7d BIGINT,
item__max_follow_cnt_7d BIGINT,
item__max_follower_cnt_7d BIGINT,
item__min_item_cnt_7d BIGINT,
item__min_follow_cnt_7d BIGINT,
item__min_follower_cnt_7d BIGINT,
item__avg_item_cnt_7d DOUBLE,
item__avg_follow_cnt_7d DOUBLE,
item__avg_follower_cnt_7d DOUBLE,
item__sum_item_cnt_15d BIGINT,
item__sum_follow_cnt_15d BIGINT,
item__sum_follower_cnt_15d BIGINT,
item__max_item_cnt_15d BIGINT,
item__max_follow_cnt_15d BIGINT,
item__max_follower_cnt_15d BIGINT,
item__min_item_cnt_15d BIGINT,
item__min_follow_cnt_15d BIGINT,
item__min_follower_cnt_15d BIGINT,
item__avg_item_cnt_15d DOUBLE,
item__avg_follow_cnt_15d DOUBLE,
item__avg_follower_cnt_15d DOUBLE,
item__kv_day_h_click_count_sum_7d STRING,
item__kv_category_click_count_sum_7d STRING,
item__kv_day_h_praise_count_sum_7d STRING,
item__kv_category_praise_count_sum_7d STRING,
item__kv_day_h_click_count_max_7d STRING,
item__kv_category_click_count_max_7d STRING,
item__kv_day_h_praise_count_max_7d STRING,
item__kv_category_praise_count_max_7d STRING,
item__kv_day_h_click_count_min_7d STRING,
item__kv_category_click_count_min_7d STRING,
item__kv_day_h_praise_count_min_7d STRING,
item__kv_category_praise_count_min_7d STRING,
item__kv_day_h_click_count_avg_7d STRING,
item__kv_category_click_count_avg_7d STRING,
item__kv_day_h_praise_count_avg_7d STRING,
item__kv_category_praise_count_avg_7d STRING,
item__kv_day_h_click_count_sum_15d STRING,
item__kv_category_click_count_sum_15d STRING,
item__kv_day_h_praise_count_sum_15d STRING,
item__kv_category_praise_count_sum_15d STRING,
item__kv_day_h_click_count_max_15d STRING,
item__kv_category_click_count_max_15d STRING,
item__kv_day_h_praise_count_max_15d STRING,
item__kv_category_praise_count_max_15d STRING,
item__kv_day_h_click_count_min_15d STRING,
item__kv_category_click_count_min_15d STRING,
item__kv_day_h_praise_count_min_15d STRING,
item__kv_category_praise_count_min_15d STRING,
item__kv_day_h_click_count_avg_15d STRING,
item__kv_category_click_count_avg_15d STRING,
item__kv_day_h_praise_count_avg_15d STRING,
item__kv_category_praise_count_avg_15d STRING,
item__kv_day_h_click_count_sum_30d STRING,
item__kv_category_click_count_sum_30d STRING,
item__kv_day_h_praise_count_sum_30d STRING,
item__kv_category_praise_count_sum_30d STRING,
item__kv_day_h_click_count_max_30d STRING,
item__kv_category_click_count_max_30d STRING,
item__kv_day_h_praise_count_max_30d STRING,
item__kv_category_praise_count_max_30d STRING,
item__kv_day_h_click_count_min_30d STRING,
item__kv_category_click_count_min_30d STRING,
item__kv_day_h_praise_count_min_30d STRING,
item__kv_category_praise_count_min_30d STRING,
item__kv_day_h_click_count_avg_30d STRING,
item__kv_category_click_count_avg_30d STRING,
item__kv_day_h_praise_count_avg_30d STRING,
item__kv_category_praise_count_avg_30d STRING,
item__kv_day_h_click_count_sum_45d STRING,
item__kv_category_click_count_sum_45d STRING,
item__kv_day_h_praise_count_sum_45d STRING,
item__kv_category_praise_count_sum_45d STRING,
item__kv_day_h_click_count_max_45d STRING,
item__kv_category_click_count_max_45d STRING,
item__kv_day_h_praise_count_max_45d STRING,
item__kv_category_praise_count_max_45d STRING,
item__kv_day_h_click_count_min_45d STRING,
item__kv_category_click_count_min_45d STRING,
item__kv_day_h_praise_count_min_45d STRING,
item__kv_category_praise_count_min_45d STRING,
item__kv_day_h_click_count_avg_45d STRING,
item__kv_category_click_count_avg_45d STRING,
item__kv_day_h_praise_count_avg_45d STRING,
item__kv_category_praise_count_avg_45d STRING,
gender STRING,
age BIGINT,
city STRING,
item_cnt BIGINT,
follow_cnt BIGINT,
follower_cnt BIGINT,
is_new_user BIGINT,
tags STRING,
user__sum_playtime_3d DOUBLE,
user__sum_duration_3d DOUBLE,
user__sum_click_count_3d BIGINT,
user__sum_praise_count_3d BIGINT,
user__max_playtime_3d DOUBLE,
user__max_duration_3d DOUBLE,
user__max_click_count_3d BIGINT,
user__max_praise_count_3d BIGINT,
user__min_playtime_3d DOUBLE,
user__min_duration_3d DOUBLE,
user__min_click_count_3d BIGINT,
user__min_praise_count_3d BIGINT,
user__avg_playtime_3d DOUBLE,
user__avg_duration_3d DOUBLE,
user__avg_click_count_3d DOUBLE,
user__avg_praise_count_3d DOUBLE,
user__sum_playtime_7d DOUBLE,
user__sum_duration_7d DOUBLE,
user__sum_click_count_7d BIGINT,
user__sum_praise_count_7d BIGINT,
user__max_playtime_7d DOUBLE,
user__max_duration_7d DOUBLE,
user__max_click_count_7d BIGINT,
user__max_praise_count_7d BIGINT,
user__min_playtime_7d DOUBLE,
user__min_duration_7d DOUBLE,
user__min_click_count_7d BIGINT,
user__min_praise_count_7d BIGINT,
user__avg_playtime_7d DOUBLE,
user__avg_duration_7d DOUBLE,
user__avg_click_count_7d DOUBLE,
user__avg_praise_count_7d DOUBLE,
user__sum_playtime_15d DOUBLE,
user__sum_duration_15d DOUBLE,
user__sum_click_count_15d BIGINT,
user__sum_praise_count_15d BIGINT,
user__max_playtime_15d DOUBLE,
user__max_duration_15d DOUBLE,
user__max_click_count_15d BIGINT,
user__max_praise_count_15d BIGINT,
user__min_playtime_15d DOUBLE,
user__min_duration_15d DOUBLE,
user__min_click_count_15d BIGINT,
user__min_praise_count_15d BIGINT,
user__avg_playtime_15d DOUBLE,
user__avg_duration_15d DOUBLE,
user__avg_click_count_15d DOUBLE,
user__avg_praise_count_15d DOUBLE,
user__kv_day_h_duration_sum_7d STRING,
user__kv_category_duration_sum_7d STRING,
user__kv_day_h_click_count_sum_7d STRING,
user__kv_category_click_count_sum_7d STRING,
user__kv_day_h_duration_max_7d STRING,
user__kv_category_duration_max_7d STRING,
user__kv_day_h_click_count_max_7d STRING,
user__kv_category_click_count_max_7d STRING,
user__kv_day_h_duration_min_7d STRING,
user__kv_category_duration_min_7d STRING,
user__kv_day_h_click_count_min_7d STRING,
user__kv_category_click_count_min_7d STRING,
user__kv_day_h_duration_avg_7d STRING,
user__kv_category_duration_avg_7d STRING,
user__kv_day_h_click_count_avg_7d STRING,
user__kv_category_click_count_avg_7d STRING,
user__kv_day_h_duration_sum_15d STRING,
user__kv_category_duration_sum_15d STRING,
user__kv_day_h_click_count_sum_15d STRING,
user__kv_category_click_count_sum_15d STRING,
user__kv_day_h_duration_max_15d STRING,
user__kv_category_duration_max_15d STRING,
user__kv_day_h_click_count_max_15d STRING,
user__kv_category_click_count_max_15d STRING,
user__kv_day_h_duration_min_15d STRING,
user__kv_category_duration_min_15d STRING,
user__kv_day_h_click_count_min_15d STRING,
user__kv_category_click_count_min_15d STRING,
user__kv_day_h_duration_avg_15d STRING,
user__kv_category_duration_avg_15d STRING,
user__kv_day_h_click_count_avg_15d STRING,
user__kv_category_click_count_avg_15d STRING,
user__kv_day_h_duration_sum_30d STRING,
user__kv_category_duration_sum_30d STRING,
user__kv_day_h_click_count_sum_30d STRING,
user__kv_category_click_count_sum_30d STRING,
user__kv_day_h_duration_max_30d STRING,
user__kv_category_duration_max_30d STRING,
user__kv_day_h_click_count_max_30d STRING,
user__kv_category_click_count_max_30d STRING,
user__kv_day_h_duration_min_30d STRING,
user__kv_category_duration_min_30d STRING,
user__kv_day_h_click_count_min_30d STRING,
user__kv_category_click_count_min_30d STRING,
user__kv_day_h_duration_avg_30d STRING,
user__kv_category_duration_avg_30d STRING,
user__kv_day_h_click_count_avg_30d STRING,
user__kv_category_click_count_avg_30d STRING,
user__kv_day_h_duration_sum_45d STRING,
user__kv_category_duration_sum_45d STRING,
user__kv_day_h_click_count_sum_45d STRING,
user__kv_category_click_count_sum_45d STRING,
user__kv_day_h_duration_max_45d STRING,
user__kv_category_duration_max_45d STRING,
user__kv_day_h_click_count_max_45d STRING,
user__kv_category_click_count_max_45d STRING,
user__kv_day_h_duration_min_45d STRING,
user__kv_category_duration_min_45d STRING,
user__kv_day_h_click_count_min_45d STRING,
user__kv_category_click_count_min_45d STRING,
user__kv_day_h_duration_avg_45d STRING,
user__kv_category_duration_avg_45d STRING,
user__kv_day_h_click_count_avg_45d STRING,
user__kv_category_click_count_avg_45d STRING
)
PARTITIONED BY (ds STRING)
LIFECYCLE 90;
insert overwrite table fs_demo_fs_engineering_v1_training_set partition (ds = '${bdp.system.bizdate}')
select
sq0.request_id,
sq0.user_id,
sq0.page,
sq0.net_type,
sq0.day_h,
sq0.week_day,
sq0.day_min,
sq0.event_unix_time,
sq0.item_id,
sq0.playtime,
sq0.is_click,
sq0.ln_playtime,
sq0.is_praise,
sq2.duration,
sq2.category,
sq2.author,
sq2.click_count,
sq2.praise_count,
sq2.is_new_item,
sq5.item__sum_item_cnt_3d,
sq5.item__sum_follow_cnt_3d,
sq5.item__sum_follower_cnt_3d,
sq5.item__max_item_cnt_3d,
sq5.item__max_follow_cnt_3d,
sq5.item__max_follower_cnt_3d,
sq5.item__min_item_cnt_3d,
sq5.item__min_follow_cnt_3d,
sq5.item__min_follower_cnt_3d,
sq5.item__avg_item_cnt_3d,
sq5.item__avg_follow_cnt_3d,
sq5.item__avg_follower_cnt_3d,
sq5.item__sum_item_cnt_7d,
sq5.item__sum_follow_cnt_7d,
sq5.item__sum_follower_cnt_7d,
sq5.item__max_item_cnt_7d,
sq5.item__max_follow_cnt_7d,
sq5.item__max_follower_cnt_7d,
sq5.item__min_item_cnt_7d,
sq5.item__min_follow_cnt_7d,
sq5.item__min_follower_cnt_7d,
sq5.item__avg_item_cnt_7d,
sq5.item__avg_follow_cnt_7d,
sq5.item__avg_follower_cnt_7d,
sq5.item__sum_item_cnt_15d,
sq5.item__sum_follow_cnt_15d,
sq5.item__sum_follower_cnt_15d,
sq5.item__max_item_cnt_15d,
sq5.item__max_follow_cnt_15d,
sq5.item__max_follower_cnt_15d,
sq5.item__min_item_cnt_15d,
sq5.item__min_follow_cnt_15d,
sq5.item__min_follower_cnt_15d,
sq5.item__avg_item_cnt_15d,
sq5.item__avg_follow_cnt_15d,
sq5.item__avg_follower_cnt_15d,
sq6.item__kv_day_h_click_count_sum_7d,
sq6.item__kv_category_click_count_sum_7d,
sq6.item__kv_day_h_praise_count_sum_7d,
sq6.item__kv_category_praise_count_sum_7d,
sq6.item__kv_day_h_click_count_max_7d,
sq6.item__kv_category_click_count_max_7d,
sq6.item__kv_day_h_praise_count_max_7d,
sq6.item__kv_category_praise_count_max_7d,
sq6.item__kv_day_h_click_count_min_7d,
sq6.item__kv_category_click_count_min_7d,
sq6.item__kv_day_h_praise_count_min_7d,
sq6.item__kv_category_praise_count_min_7d,
sq6.item__kv_day_h_click_count_avg_7d,
sq6.item__kv_category_click_count_avg_7d,
sq6.item__kv_day_h_praise_count_avg_7d,
sq6.item__kv_category_praise_count_avg_7d,
sq6.item__kv_day_h_click_count_sum_15d,
sq6.item__kv_category_click_count_sum_15d,
sq6.item__kv_day_h_praise_count_sum_15d,
sq6.item__kv_category_praise_count_sum_15d,
sq6.item__kv_day_h_click_count_max_15d,
sq6.item__kv_category_click_count_max_15d,
sq6.item__kv_day_h_praise_count_max_15d,
sq6.item__kv_category_praise_count_max_15d,
sq6.item__kv_day_h_click_count_min_15d,
sq6.item__kv_category_click_count_min_15d,
sq6.item__kv_day_h_praise_count_min_15d,
sq6.item__kv_category_praise_count_min_15d,
sq6.item__kv_day_h_click_count_avg_15d,
sq6.item__kv_category_click_count_avg_15d,
sq6.item__kv_day_h_praise_count_avg_15d,
sq6.item__kv_category_praise_count_avg_15d,
sq6.item__kv_day_h_click_count_sum_30d,
sq6.item__kv_category_click_count_sum_30d,
sq6.item__kv_day_h_praise_count_sum_30d,
sq6.item__kv_category_praise_count_sum_30d,
sq6.item__kv_day_h_click_count_max_30d,
sq6.item__kv_category_click_count_max_30d,
sq6.item__kv_day_h_praise_count_max_30d,
sq6.item__kv_category_praise_count_max_30d,
sq6.item__kv_day_h_click_count_min_30d,
sq6.item__kv_category_click_count_min_30d,
sq6.item__kv_day_h_praise_count_min_30d,
sq6.item__kv_category_praise_count_min_30d,
sq6.item__kv_day_h_click_count_avg_30d,
sq6.item__kv_category_click_count_avg_30d,
sq6.item__kv_day_h_praise_count_avg_30d,
sq6.item__kv_category_praise_count_avg_30d,
sq6.item__kv_day_h_click_count_sum_45d,
sq6.item__kv_category_click_count_sum_45d,
sq6.item__kv_day_h_praise_count_sum_45d,
sq6.item__kv_category_praise_count_sum_45d,
sq6.item__kv_day_h_click_count_max_45d,
sq6.item__kv_category_click_count_max_45d,
sq6.item__kv_day_h_praise_count_max_45d,
sq6.item__kv_category_praise_count_max_45d,
sq6.item__kv_day_h_click_count_min_45d,
sq6.item__kv_category_click_count_min_45d,
sq6.item__kv_day_h_praise_count_min_45d,
sq6.item__kv_category_praise_count_min_45d,
sq6.item__kv_day_h_click_count_avg_45d,
sq6.item__kv_category_click_count_avg_45d,
sq6.item__kv_day_h_praise_count_avg_45d,
sq6.item__kv_category_praise_count_avg_45d,
sq1.gender,
sq1.age,
sq1.city,
sq1.item_cnt,
sq1.follow_cnt,
sq1.follower_cnt,
sq1.is_new_user,
sq1.tags,
sq3.user__sum_playtime_3d,
sq3.user__sum_duration_3d,
sq3.user__sum_click_count_3d,
sq3.user__sum_praise_count_3d,
sq3.user__max_playtime_3d,
sq3.user__max_duration_3d,
sq3.user__max_click_count_3d,
sq3.user__max_praise_count_3d,
sq3.user__min_playtime_3d,
sq3.user__min_duration_3d,
sq3.user__min_click_count_3d,
sq3.user__min_praise_count_3d,
sq3.user__avg_playtime_3d,
sq3.user__avg_duration_3d,
sq3.user__avg_click_count_3d,
sq3.user__avg_praise_count_3d,
sq3.user__sum_playtime_7d,
sq3.user__sum_duration_7d,
sq3.user__sum_click_count_7d,
sq3.user__sum_praise_count_7d,
sq3.user__max_playtime_7d,
sq3.user__max_duration_7d,
sq3.user__max_click_count_7d,
sq3.user__max_praise_count_7d,
sq3.user__min_playtime_7d,
sq3.user__min_duration_7d,
sq3.user__min_click_count_7d,
sq3.user__min_praise_count_7d,
sq3.user__avg_playtime_7d,
sq3.user__avg_duration_7d,
sq3.user__avg_click_count_7d,
sq3.user__avg_praise_count_7d,
sq3.user__sum_playtime_15d,
sq3.user__sum_duration_15d,
sq3.user__sum_click_count_15d,
sq3.user__sum_praise_count_15d,
sq3.user__max_playtime_15d,
sq3.user__max_duration_15d,
sq3.user__max_click_count_15d,
sq3.user__max_praise_count_15d,
sq3.user__min_playtime_15d,
sq3.user__min_duration_15d,
sq3.user__min_click_count_15d,
sq3.user__min_praise_count_15d,
sq3.user__avg_playtime_15d,
sq3.user__avg_duration_15d,
sq3.user__avg_click_count_15d,
sq3.user__avg_praise_count_15d,
sq4.user__kv_day_h_duration_sum_7d,
sq4.user__kv_category_duration_sum_7d,
sq4.user__kv_day_h_click_count_sum_7d,
sq4.user__kv_category_click_count_sum_7d,
sq4.user__kv_day_h_duration_max_7d,
sq4.user__kv_category_duration_max_7d,
sq4.user__kv_day_h_click_count_max_7d,
sq4.user__kv_category_click_count_max_7d,
sq4.user__kv_day_h_duration_min_7d,
sq4.user__kv_category_duration_min_7d,
sq4.user__kv_day_h_click_count_min_7d,
sq4.user__kv_category_click_count_min_7d,
sq4.user__kv_day_h_duration_avg_7d,
sq4.user__kv_category_duration_avg_7d,
sq4.user__kv_day_h_click_count_avg_7d,
sq4.user__kv_category_click_count_avg_7d,
sq4.user__kv_day_h_duration_sum_15d,
sq4.user__kv_category_duration_sum_15d,
sq4.user__kv_day_h_click_count_sum_15d,
sq4.user__kv_category_click_count_sum_15d,
sq4.user__kv_day_h_duration_max_15d,
sq4.user__kv_category_duration_max_15d,
sq4.user__kv_day_h_click_count_max_15d,
sq4.user__kv_category_click_count_max_15d,
sq4.user__kv_day_h_duration_min_15d,
sq4.user__kv_category_duration_min_15d,
sq4.user__kv_day_h_click_count_min_15d,
sq4.user__kv_category_click_count_min_15d,
sq4.user__kv_day_h_duration_avg_15d,
sq4.user__kv_category_duration_avg_15d,
sq4.user__kv_day_h_click_count_avg_15d,
sq4.user__kv_category_click_count_avg_15d,
sq4.user__kv_day_h_duration_sum_30d,
sq4.user__kv_category_duration_sum_30d,
sq4.user__kv_day_h_click_count_sum_30d,
sq4.user__kv_category_click_count_sum_30d,
sq4.user__kv_day_h_duration_max_30d,
sq4.user__kv_category_duration_max_30d,
sq4.user__kv_day_h_click_count_max_30d,
sq4.user__kv_category_click_count_max_30d,
sq4.user__kv_day_h_duration_min_30d,
sq4.user__kv_category_duration_min_30d,
sq4.user__kv_day_h_click_count_min_30d,
sq4.user__kv_category_click_count_min_30d,
sq4.user__kv_day_h_duration_avg_30d,
sq4.user__kv_category_duration_avg_30d,
sq4.user__kv_day_h_click_count_avg_30d,
sq4.user__kv_category_click_count_avg_30d,
sq4.user__kv_day_h_duration_sum_45d,
sq4.user__kv_category_duration_sum_45d,
sq4.user__kv_day_h_click_count_sum_45d,
sq4.user__kv_category_click_count_sum_45d,
sq4.user__kv_day_h_duration_max_45d,
sq4.user__kv_category_duration_max_45d,
sq4.user__kv_day_h_click_count_max_45d,
sq4.user__kv_category_click_count_max_45d,
sq4.user__kv_day_h_duration_min_45d,
sq4.user__kv_category_duration_min_45d,
sq4.user__kv_day_h_click_count_min_45d,
sq4.user__kv_category_click_count_min_45d,
sq4.user__kv_day_h_duration_avg_45d,
sq4.user__kv_category_duration_avg_45d,
sq4.user__kv_day_h_click_count_avg_45d,
sq4.user__kv_category_click_count_avg_45d
from
(
select *
from rec_sln_demo_fs_label_table_v1
where ds = '${bdp.system.bizdate}'
) sq0
left join (
select
*
from <project_name>.rec_sln_demo_user_table_preprocess_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq1 on sq0.user_id = sq1.user_id
left join (
select
*
from <project_name>.rec_sln_demo_item_table_preprocess_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq2 on sq0.item_id = sq2.item_id
left join (
select
*
from <project_name>.rec_sln_demo_user_table_preprocess_win_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq4 on sq0.user_id = sq4.user_id
left join (
select
*
from <project_name>.rec_sln_demo_item_table_preprocess_agg_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq5 on sq0.item_id = sq5.item_id
left join (
select
*
from <project_name>.rec_sln_demo_item_table_preprocess_win_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq6 on sq0.item_id = sq6.item_id
left join (
select
*
from <project_name>.rec_sln_demo_user_table_preprocess_agg_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq3 on sq0.user_id = sq3.user_id
;
其中<project_name>需要替换为您的项目名称。
五、训练模型
获取样本表后,您可以直接根据样本表来进行模型训练,详情请参见模型训练。
相关文档
特征生产的详细过程,请参见特征平台与特征生产。
阿里云特征平台(FeatureStore)基本适用于所有需要特征的场景,比如推荐场景、金融风控场景、用户增长场景。同时,FeatureStore与阿里云常用数据源引擎、推荐服务引擎完成对接,可为您提供端到端高效便捷的一站式从特征注册管理到模型开发应用的全流程操作平台。更多关于FeatureStore的信息,请参见FeatureStore概述。
如果您在配置或使用过程中有任何问题,可以搜索钉钉群号:34415007523,进入答疑群联系技术人员进行咨询。