This guide uses a public dataset to walk you through configuring feature engineering, recall, and fine-grained ranking in PAI-Rec, then deploying the generated code to DataWorks. The modular approach lets you publish or unpublish individual functions without affecting the rest of the pipeline.
Why modularize recommendation solution customization?
-
Feature groups simplify model iteration.
-
One-click publishing deploys code and starts data backfill automatically.
-
Only the ranking module fully backfills all required components (feature groups, samples, and feature generation modules), reducing rework from feature engineering errors.
Prerequisites
Complete the following preparations before you begin:
-
Create a bucket in OSS.
-
Create a VPC and a vSwitch (Build a VPC with an IPv4 CIDR block).
-
Activate PAI-FeatureStore (see the Prerequisites section in Create a Data Source). Do not activate Hologres. Select FeatureDB as your data source (Create an Online Data Source: FeatureDB).
-
Activate MaxCompute and create a MaxCompute project named project_mc.
-
Activate DataWorks and complete the following operations:
-
Create a workspace in DataWorks.
-
Purchase a Serverless resource group (Use a Serverless resource group). This resource group synchronizes data for PAI-FeatureStore and runs eascmd commands to create and update PAI-EAS services.
-
Configure DataWorks data sources:
-
Create and bind an OSS data source (Data Source Management).
-
Create and bind a MaxCompute data source (Bind MaxCompute compute resources).
-
Create a FeatureStore project and a feature entity (Step 2: Create and register FeatureStore). Skip this step if you use a Serverless resource group. If you use a dedicated resource group, install the FeatureStore Python SDK on the resource group (Install the FeatureStore Python SDK).
-
Activate Realtime Compute for Apache Flink. For Storage type, select OSS bucket (not Fully managed storage). The OSS bucket must match the one configured for PAI-Rec. Flink records real-time user behavior data and computes real-time user features.
-
If you use EasyRec (a TensorFlow-based framework), training runs on MaxCompute by default.
-
If you use TorchEasyRec (a PyTorch-based framework), training runs on PAI-DLC by default. To download MaxCompute data on PAI-DLC, activate Data Transmission Service (Purchase and use a dedicated Data Transmission Service resource group).
1. Create a PAI-Rec instance and initialize the service
-
Log in to the Recommendation System Development Platform homepage and click Buy Now.
-
On the PAI-Rec instance purchase page, configure the following key parameters, and then click Buy Now.
Parameter | Description
Region and zone | The region where your cloud service is deployed.
Service Type | Select Standard Edition and enable Recommendation Solution Customization.
-
Log in to the PAI-Rec Management Console. In the upper-left corner of the top menu bar, select a region.
-
In the left navigation pane, choose Instances, and then click the instance name to open the instance details page.
-
In the Procedure section, click Cloud Service Configuration to open the System Configuration > Cloud Service Configuration page. Click Edit, configure the parameters as shown in the Resource configuration table, and then click Exit.
-
In the left navigation pane, choose System Configuration > Permissions. On the Services tab, verify the authorization status of each cloud service.
2. Clone the public dataset
1. Synchronize data table
This solution offers two data input options:
-
Clone data from the pai_online_project within a fixed time window. This option does not support scheduled tasks.
-
Use a Python script to generate data. You can run the task in DataWorks to generate data for a specified time range.
To schedule daily data generation and model training, use the second option, which deploys Python code to generate data. See the Code generation tab.
Fixed time window
PAI-Rec prepares three tables in the public project pai_online_project:
-
User table:
pai_online_project.rec_sln_demo_user_table -
Item table:
pai_online_project.rec_sln_demo_item_table -
Behavior table:
pai_online_project.rec_sln_demo_behavior_table
The operations in this topic are based on these three tables. The data is randomly generated for simulation with no real-world business meaning, which may result in low metrics (such as AUC) during training. Run SQL commands in DataWorks to synchronize the table data from pai_online_project to your DataWorks project (for example, DataWorks_a):
-
Log in to the DataWorks console and select a region from the upper-left corner of the top menu bar.
-
In the left-side navigation pane, click Data Development and O&M > Data Development.
-
Select the DataWorks workspace that you created and click Enter Data Development.
-
Hover over Create, and then choose New node > MaxCompute > ODPS SQL. Configure the parameters as described in the following table, and then click Confirm.
-
In the new node editor, set the scheduling variables that define a 100-day data window ending on bizdate. Typically, you set bizdate to the previous day. In the Scheduling parameters area, click Add Parameter and add two parameters: bizdate with the value $[yyyymmdd-1], and bizdate_100 with the value $[yyyymmdd-100]. Then copy and run the following code to synchronize the user, item, and behavior tables from the pai_online_project project to your MaxCompute project (for example, project_mc):
-- User table: one row per user, partitioned by date (ds).
CREATE TABLE IF NOT EXISTS rec_sln_demo_user_table_v1(
    user_id BIGINT COMMENT 'Unique user ID',
    gender STRING COMMENT 'Gender',
    age BIGINT COMMENT 'Age',
    city STRING COMMENT 'City',
    item_cnt BIGINT COMMENT 'Number of published items',
    follow_cnt BIGINT COMMENT 'Number of followed accounts',
    follower_cnt BIGINT COMMENT 'Number of followers',
    register_time BIGINT COMMENT 'Registration time',
    tags STRING COMMENT 'User tags'
) PARTITIONED BY (ds STRING) STORED AS ALIORC;

-- Copy the partitions in the window [bizdate_100, bizdate] from the public project.
INSERT OVERWRITE TABLE rec_sln_demo_user_table_v1 PARTITION(ds)
SELECT *
FROM pai_online_project.rec_sln_demo_user_table
WHERE ds >= "${bizdate_100}" AND ds <= "${bizdate}";

-- Item table: one row per item, partitioned by date (ds).
CREATE TABLE IF NOT EXISTS rec_sln_demo_item_table_v1(
    item_id BIGINT COMMENT 'Item ID',
    duration DOUBLE COMMENT 'Video duration',
    title STRING COMMENT 'Title',
    category STRING COMMENT 'Primary category',
    author BIGINT COMMENT 'Author',
    click_count BIGINT COMMENT 'Total clicks',
    praise_count BIGINT COMMENT 'Total likes',
    pub_time BIGINT COMMENT 'Publish time'
) PARTITIONED BY (ds STRING) STORED AS ALIORC;

-- Copy the same date window for items.
INSERT OVERWRITE TABLE rec_sln_demo_item_table_v1 PARTITION(ds)
SELECT *
FROM pai_online_project.rec_sln_demo_item_table
WHERE ds >= "${bizdate_100}" AND ds <= "${bizdate}";

-- Behavior table: one row per user event, partitioned by date (ds).
CREATE TABLE IF NOT EXISTS rec_sln_demo_behavior_table_v1(
    request_id STRING COMMENT 'Tracking ID/Request ID',
    user_id STRING COMMENT 'Unique user ID',
    exp_id STRING COMMENT 'Experiment ID',
    page STRING COMMENT 'Page',
    net_type STRING COMMENT 'Network type',
    event_time BIGINT COMMENT 'Event time',
    item_id STRING COMMENT 'Item ID',
    event STRING COMMENT 'Event type',
    playtime DOUBLE COMMENT 'Playback/Reading duration'
) PARTITIONED BY (ds STRING) STORED AS ALIORC;

-- Copy the same date window for behavior logs.
INSERT OVERWRITE TABLE rec_sln_demo_behavior_table_v1 PARTITION(ds)
SELECT *
FROM pai_online_project.rec_sln_demo_behavior_table
WHERE ds >= "${bizdate_100}" AND ds <= "${bizdate}";
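To sanity-check which partitions the two scheduling parameters select, the following minimal Python sketch reproduces the date arithmetic. It assumes that $[yyyymmdd-N] resolves to the scheduled run date minus N days; the function name scheduling_window is hypothetical and is not part of DataWorks or PAI-Rec.

```python
from datetime import date, timedelta
from typing import Tuple


def scheduling_window(run_date: date, days: int = 100) -> Tuple[str, str]:
    """Return (bizdate_100, bizdate) as yyyymmdd strings for a run date.

    Assumption: $[yyyymmdd-1] and $[yyyymmdd-100] are offsets in days
    from the scheduled run date.
    """
    bizdate = run_date - timedelta(days=1)         # mirrors $[yyyymmdd-1]
    bizdate_100 = run_date - timedelta(days=days)  # mirrors $[yyyymmdd-100]
    return bizdate_100.strftime("%Y%m%d"), bizdate.strftime("%Y%m%d")


start, end = scheduling_window(date(2024, 5, 10))
print(start, end)  # prints "20240131 20240509"
```

Because both bounds in the WHERE clause are inclusive, the window from bizdate_100 to bizdate covers exactly 100 daily partitions.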
Code generation
Data from a fixed time window cannot be used for scheduled tasks. To run tasks on a schedule, deploy a Python script to generate data:
-
In the DataWorks console, create a PyODPS 3 node (Create and manage MaxCompute nodes).
-
Download create_data.py, and then paste the contents of the file into the PyODPS 3 node editor.
-
On the right side of the editor, click Configure Scheduling and configure the parameters. Then, click the Save and Submit icons in the upper-right corner.
-
Configure scheduling parameters:
-
Note on variable substitution:
In the scheduling parameters, set the $user_table_name argument to rec_sln_demo_user_table.
Set the $item_table_name argument to rec_sln_demo_item_table.
Set the $behavior_table_name argument to rec_sln_demo_behavior_table.
In addition to the three table name arguments, the script also uses a bizdate scheduling parameter, which is passed as the $bizdate variable.
Parameter configuration:
Add a parameter named bizdate and set its value to $[yyyymmdd-1].
-
Configure scheduling dependencies.
-
-
Click Operation Center, and select .
-
Click in the Actions column of the target task.
-
In the Data Backfill panel, set the data timestamp, and then click Submit and Navigate.
For an optimal 60-day data backfill period, set the data timestamp to Scheduled Task Date - 60 to ensure data integrity.
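After the backfill finishes, you may want to confirm that every daily partition in the 60-day range exists. The following Python sketch lists the expected partitions and reports any gaps. It assumes that partitions are named by yyyymmdd and that the range runs from Scheduled Task Date - 60 through the day before the scheduled date. The function names are hypothetical, and the list of existing partitions would come from your own MaxCompute query.

```python
from datetime import date, timedelta
from typing import Iterable, List


def expected_partitions(scheduled_date: date, days: int = 60) -> List[str]:
    """List the yyyymmdd partitions from scheduled_date - days to scheduled_date - 1."""
    start = scheduled_date - timedelta(days=days)
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]


def missing_partitions(existing: Iterable[str], scheduled_date: date,
                       days: int = 60) -> List[str]:
    """Return the expected partitions that are absent from `existing`."""
    existing_set = set(existing)
    return [ds for ds in expected_partitions(scheduled_date, days)
            if ds not in existing_set]
```

For example, if the last two days failed to backfill, missing_partitions returns those two dates so that you can rerun only the affected instances.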
2. Configure dependency nodes
To ensure smooth code generation and deployment, add three SQL code nodes to your DataWorks project with scheduling dependencies set to the root node. After configuring the nodes, publish them:
-
Hover over Create and select New node > General-purpose > Virtual node. Create three virtual nodes using the following resource configuration, then click Confirm.
-
Select each node and set its content to select 1;. Then, click Configure Scheduling on the right and complete the following settings:
-
In the Time attributes section, set the Rerun attribute to Rerun node after success or failure.
-
In the Scheduling dependencies > Ancestor Nodes section, enter the DataWorks workspace name, select the node with the _root suffix, and click Add.
Apply this dependency configuration to all three virtual nodes.
-
-
Click the Submit icon for each virtual node to submit the node.
3. Register data
Before configuring feature engineering, recall, and ranking, register the three synchronized tables in PAI-Rec:
-
Log in to the PAI-Rec Management Console and select a region from the top-left corner.
-
In the left-side navigation pane, click Instances, and then click the name of an instance to open its details page.
-
In the left-side navigation pane, navigate to Recommendation Solution Customization > Data Registration. On the MaxCompute Table tab, click Create Data Table. Add a user table, an item table, and a behavior table using the configuration in the table below. Then, click Import.
Parameter | Description | Example
MaxCompute project | Select the MaxCompute project you created. | project_mc
MaxCompute table | Select the data tables synchronized to your DataWorks workspace. | User table: rec_sln_demo_user_table_v1; Item table: rec_sln_demo_item_table_v1; Behavior table: rec_sln_demo_behavior_table_v1
Data table name | Enter a custom name. | User table; Item table; Behavior table
4. Create a recommendation scenario
Create a recommendation scenario before configuring recommendation tasks. For basic concepts and traffic encoding, see Basic Concepts.
In the left navigation pane, click Recommendation Scenario and then Create Scenario. Configure a recommendation scenario with the following resource configuration and click Determine.
5. Set up an algorithm
For a full production configuration, use the following recall and fine-grained ranking settings:
-
Global hot recall: Retrieves the top-K most popular items based on log data.
-
Grouped hot recall: Retrieves popular item candidates from specific groups, such as city or gender, to improve relevance.
-
Etrec u2i recall: Uses the etrec collaborative filtering algorithm.
-
Swing u2i recall (optional): Uses the Swing algorithm.
-
Vector recall (optional): Generates candidates using the Deep Structured Semantic Model (DSSM).
-
Fine-grained ranking: Uses the MultiTower model for single-objective ranking and the DBMTL model for multi-objective ranking.
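To make the two hot recall strategies in the list above concrete, here is a minimal Python sketch. It assumes that popularity means the number of click events per item and that each log row is a (group, item_id, event) tuple. Both are illustrative assumptions, not the scoring that PAI-Rec generates, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical log row: (group value such as city, item_id, event type)
Event = Tuple[str, str, str]


def global_hot(events: Iterable[Event], k: int) -> List[str]:
    """Return the k items with the most click events across all users."""
    counts = Counter(item for _, item, event in events if event == "click")
    return [item for item, _ in counts.most_common(k)]


def grouped_hot(events: Iterable[Event], k: int) -> Dict[str, List[str]]:
    """Return the k most-clicked items within each group (for example, per city)."""
    per_group: Dict[str, Counter] = defaultdict(Counter)
    for group, item, event in events:
        if event == "click":
            per_group[group][item] += 1
    return {group: [item for item, _ in counts.most_common(k)]
            for group, counts in per_group.items()}
```

Grouped hot recall trades coverage for relevance: each group gets its own top-K list, so a user is shown items that are popular among similar users rather than items that are popular overall.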
This guide covers global hot recall, etrec u2i recall (eTREC, a collaborative filtering implementation from RECommender), and fine-grained ranking. Steps:
-
In the left-side navigation pane, choose Recommendation Solution Customization. Select a scenario you have created, and then click Create Modular Recommendation Solution. Create a solution with the following resource configurations, and then click Save and Go to Algorithm Solution Configuration.
Leave unspecified parameters at their default values.
-
On the Configure Table tab, click Add next to a data table. Configure the behavior log table, user table, and item table by setting their corresponding partition, event, feature, and timestamp fields, and then click Next.
Leave unspecified parameters at their default values.
-
Under Feature Group Configuration, click Add. Configure the parameters as shown in the following table. Set the feature module name and version, and select the user table, item table, and behavior log table that you configured on the Configure Table tab.
Click Determine. This action generates various statistical features for both users and items. Click Configure Feature for this feature group to view the validation features. For this solution, keep the default derived features. In your own projects, you can edit derived features to suit your business needs (Feature Configuration).
Click Publish for this feature group, select the latest data partition date as the task run date, set the maximum task parallelism to 10, and keep other settings at their defaults. Then, click Determine. This action generates a node, deploys it to DataWorks, and backfills data by date. Wait for the module's status to change to Published before you proceed to the next step.
Click Online Details to view more information about the module. In the Data Refill Task List, you can check the run status of each task. If a task fails, click View Task Node to go to DataWorks and view detailed error information. After fixing the issue, find the node in the Data Refill Task List and click Rerun. After the task succeeds, you can proceed.
-
Under Label Table Configuration, this module builds sample targets from the behavior table. Click Add, choose Label Module as the module type, enter a name for the label module, and select the feature module that you configured in the previous step. Click Create next to Fine-grained ranking target settings (labels) and add the following two labels:
-
Target 1: Set Fine-grained ranking target name to is_click, set Fine-grained ranking target expression to max(if(event='click',1,0)), and select CLASSIFICATION for Target type.
-
Target 2 (note that the 'l' in 'ln' is lowercase): Set Fine-grained ranking target name to ln_playtime, set Fine-grained ranking target expression to ln(sum(playtime)+1), set Fine-grained ranking target dependency to is_click, and select REGRESSION for Target type. Then, click OK.
-
After you click Determine, publish the module in the same way you published the feature module. Wait for the status to change to Published before proceeding to the next step.
-
-
Under Sample Configuration, this module associates sample target tables with features and generates model features in FeatureStore. Click Add, choose Sample Module as the module type, name the model feature, and select the feature and label modules from the previous steps. Publish the module and wait for the status to change to Published before proceeding to the next step.
-
Under Feature Generation Configuration, this module derives additional features from all features in the sample table. Custom configuration is not currently supported. Click Add, choose Feature Generation as the module type, choose Ranking as the feature generation type, and select the model feature module from the previous sample configuration step. Publish the module and wait for the status to change to Published before proceeding to the next step.
-
Under Configure Sorting Method, click Add in the fine-grained ranking section. For the feature module output, select the feature generation module from the previous step. Leave the other options at their default values and click Determine. Then, publish the module. After the status changes to Published, the model service is deployed to PAI-EAS.
-
Under Retrieval Configuration, click Add next to the target category, configure the parameters, click Confirm, and then publish the configuration. This document includes several recall configuration methods. To complete the deployment quickly, you only need to configure global hot recall and etrec u2i recall. The other methods, such as vector recall and collaborative metric recall, are for reference only.
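The two label expressions from the Label Table Configuration step, max(if(event='click',1,0)) and ln(sum(playtime)+1), are aggregations over behavior rows. The Python sketch below shows roughly what they compute. The grouping key (request_id, user_id, item_id) and the event value expose are assumptions made for illustration; the sketch does not model the dependency of ln_playtime on is_click, and the SQL that PAI-Rec actually generates may differ.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]  # assumed sample key: (request_id, user_id, item_id)


def build_labels(rows: Iterable[dict]) -> Dict[Key, Dict[str, float]]:
    """Aggregate behavior rows into one label record per sample key."""
    groups: Dict[Key, List[dict]] = defaultdict(list)
    for row in rows:
        groups[(row["request_id"], row["user_id"], row["item_id"])].append(row)

    labels: Dict[Key, Dict[str, float]] = {}
    for key, group in groups.items():
        # is_click = max(if(event='click', 1, 0))
        is_click = max(1 if row["event"] == "click" else 0 for row in group)
        # ln_playtime = ln(sum(playtime) + 1); a missing playtime counts as 0
        total_playtime = sum(row["playtime"] or 0.0 for row in group)
        labels[key] = {"is_click": is_click,
                       "ln_playtime": math.log(total_playtime + 1)}
    return labels
```

Adding 1 before taking the logarithm keeps the regression target at 0 for samples with no playtime, and the logarithm compresses long playback durations so that a few very long sessions do not dominate training.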