This topic explains how to use a public dataset to quickly get started with PAI-Rec. You will configure key features for a custom recommendation algorithm, such as feature engineering, recall, and fine-grained ranking. You will then generate and deploy the code to the corresponding workflow in DataWorks.
Prerequisites
Before you begin, complete the following:
-
Activate PAI. For details, see Activate PAI and create a default workspace.
-
Create a Virtual Private Cloud (VPC) and a vSwitch. For details, see Create a VPC with an IPv4 CIDR block.
-
Activate PAI-FeatureStore. For details, see the Prerequisites section of Create a data source. Hologres activation is not required. For the data source, select FeatureDB. For details, see Create an online data source: FeatureDB.
-
Activate MaxCompute and create a MaxCompute project named project_mc. For details, see Activate MaxCompute and Create a MaxCompute project.
-
Create an Object Storage Service (OSS) bucket. For details, see Create a bucket.
-
Create a DataWorks workspace. For details, see Create a workspace.
-
Purchase a Serverless resource group for DataWorks. For details, see Use a Serverless resource group. This resource group synchronizes data for PAI-FeatureStore and runs eascmd commands to create and update the PAI-EAS service.
-
Configure DataWorks data sources:
-
Create and bind an OSS data source. For details, see Data Source Management.
-
Create and bind a MaxCompute data source. For details, see Bind a MaxCompute compute engine.
-
Create a FeatureStore project and feature entities. Skip this step if you use a Serverless resource group. If you use a dedicated resource group for DataWorks, you must install the FeatureStore Python SDK on the resource group. For details, see Step 2: Create and register a FeatureStore and Install the FeatureStore Python SDK.
-
Activate Realtime Compute for Apache Flink. For details, see Activate Realtime Compute for Apache Flink. Note: Set the storage type to OSS bucket, not Fully Managed Storage. Make sure that the OSS bucket for Flink is the same as the one configured in PAI-Rec. Flink records real-time user behavior data and computes real-time user features.
-
If you later choose to use EasyRec (TensorFlow framework), training runs on MaxCompute by default.
-
If you later choose to use TorchEasyRec (PyTorch framework), training runs on PAI-DLC by default. To download MaxCompute data on PAI-DLC, you must activate Data Transmission Service (see Purchase and use a dedicated resource group for Data Transmission Service).
1. Create a PAI-Rec instance
-
Log in to the PAI-Rec homepage and click Buy Now.
-
On the PAI-Rec instance purchase page, configure the following parameters and click Buy Now.
Parameter | Description
Region and availability zone | The region where your service is deployed.
Service Type | For this solution, select Standard Edition and enable the Recommendation Solution Customization feature.
-
Log in to the PAI-Rec console. In the upper-left corner of the top menu bar, select a region.
-
In the navigation pane on the left, choose Instance list. Click the instance name to go to the instance details page.
-
In the Procedure section, click Cloud Service Configuration. This takes you to the System Configuration > Cloud Service Configuration page. Click Edit, configure the following parameters, and then click Exit.
-
In the navigation pane on the left, choose System Configuration > Permissions. On the Services tab, verify that each cloud product is correctly authorized.
2. Clone the public dataset
1. Synchronize tables
This solution accepts input data in two ways:
-
Clone data for a fixed time window from the pai_online_project project. This method does not support routine scheduling.
-
Use a Python script to generate data. You can run a task in DataWorks to generate this data for a specific period.
To schedule daily data generation and model training, use the second method and deploy the provided Python code to generate the data. For more information, see the Code generation tab.
Fixed time window
PAI-Rec provides three common tables for recommendation algorithms in the publicly accessible pai_online_project project:
-
user table: pai_online_project.rec_sln_demo_user_table
-
item table: pai_online_project.rec_sln_demo_item_table
-
behavior table: pai_online_project.rec_sln_demo_behavior_table
The operations in this solution use these three tables. The data is randomly generated for demonstration and is not real business data. Therefore, metrics such as Area Under the Curve (AUC) obtained from training will be low. You must run SQL commands in DataWorks to synchronize the table data from the pai_online_project project to your DataWorks project, for example, DataWorks_a. Follow these steps:
-
Log on to the DataWorks console. In the top menu bar, select a region.
-
In the left-side navigation pane, click Data Development and O&M > Data Development.
-
Select the DataWorks workspace that you created and click Go To Data Development.
-
Hover over Create and choose Create Node > MaxCompute > ODPS SQL. Configure the parameters as described in the following table and click Confirm.
-
In the new node's editor, paste and run the following code to synchronize the user, item, and behavior tables from the pai_online_project project to your MaxCompute project, for example, project_mc. The code requires a 100-day data window. To define this window to end on the previous day, go to the Scheduling Parameters section, click Add Parameter, and add two parameters: bizdate with the value $[yyyymmdd-1] and bizdate_100 with the value $[yyyymmdd-100].
CREATE TABLE IF NOT EXISTS rec_sln_demo_user_table_v1(
user_id BIGINT COMMENT 'Unique user ID',
gender STRING COMMENT 'Gender',
age BIGINT COMMENT 'Age',
city STRING COMMENT 'City',
item_cnt BIGINT COMMENT 'Number of created items',
follow_cnt BIGINT COMMENT 'Number of follows',
follower_cnt BIGINT COMMENT 'Number of followers',
register_time BIGINT COMMENT 'Registration time',
tags STRING COMMENT 'User tags'
) PARTITIONED BY (ds STRING) STORED AS ALIORC;
INSERT OVERWRITE TABLE rec_sln_demo_user_table_v1 PARTITION(ds)
SELECT *
FROM pai_online_project.rec_sln_demo_user_table
WHERE ds >= "${bizdate_100}" and ds <= "${bizdate}";
CREATE TABLE IF NOT EXISTS rec_sln_demo_item_table_v1(
item_id BIGINT COMMENT 'Item ID',
duration DOUBLE COMMENT 'Video duration',
title STRING COMMENT 'Title',
category STRING COMMENT 'Primary tag',
author BIGINT COMMENT 'Author',
click_count BIGINT COMMENT 'Total clicks',
praise_count BIGINT COMMENT 'Total likes',
pub_time BIGINT COMMENT 'Publication time'
) PARTITIONED BY (ds STRING) STORED AS ALIORC;
INSERT OVERWRITE TABLE rec_sln_demo_item_table_v1 PARTITION(ds)
SELECT *
FROM pai_online_project.rec_sln_demo_item_table
WHERE ds >= "${bizdate_100}" and ds <= "${bizdate}";
CREATE TABLE IF NOT EXISTS rec_sln_demo_behavior_table_v1(
request_id STRING COMMENT 'Instrumentation ID/Request ID',
user_id STRING COMMENT 'Unique user ID',
exp_id STRING COMMENT 'Experiment ID',
page STRING COMMENT 'Page',
net_type STRING COMMENT 'Network type',
event_time BIGINT COMMENT 'Behavior time',
item_id STRING COMMENT 'Item ID',
event STRING COMMENT 'Behavior type',
playtime DOUBLE COMMENT 'Playback/Read duration'
) PARTITIONED BY (ds STRING) STORED AS ALIORC;
INSERT OVERWRITE TABLE rec_sln_demo_behavior_table_v1 PARTITION(ds)
SELECT *
FROM pai_online_project.rec_sln_demo_behavior_table
WHERE ds >= "${bizdate_100}" and ds <= "${bizdate}";
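The two scheduling parameters used in the SQL above resolve to date strings relative to the run date. The following is a minimal Python sketch of that arithmetic, assuming that $[yyyymmdd-N] means "the scheduled date minus N days, formatted as yyyymmdd". The function name resolve_schedule_params is hypothetical and is not part of DataWorks.

```python
from datetime import date, timedelta


def resolve_schedule_params(scheduled_date: date) -> dict:
    """Approximate how bizdate and bizdate_100 resolve for one scheduled run.

    Assumption: $[yyyymmdd-N] is the scheduled date minus N days.
    """
    def minus_days(n: int) -> str:
        return (scheduled_date - timedelta(days=n)).strftime("%Y%m%d")

    return {"bizdate": minus_days(1), "bizdate_100": minus_days(100)}


def window_size(params: dict) -> int:
    """Count the ds partitions selected by bizdate_100 <= ds <= bizdate."""
    def parse(s: str) -> date:
        return date(int(s[:4]), int(s[4:6]), int(s[6:8]))

    return (parse(params["bizdate"]) - parse(params["bizdate_100"])).days + 1


params = resolve_schedule_params(date(2024, 8, 15))
print(params)               # the two resolved date strings
print(window_size(params))  # number of daily partitions in the window
```

Under this assumption, a run scheduled for 2024-08-15 selects the partitions from 20240507 through 20240814, which is exactly 100 daily partitions.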
Code generation
Periodic task scheduling is not supported for fixed time-window data. To run scheduled tasks, you must deploy a Python script to generate data. Follow these steps:
-
In the DataWorks console, create a PyODPS 3 node. For more information, see Create and manage MaxCompute nodes.
-
Download create_data.py and paste its content into the PyODPS 3 node.
-
Click Configure Scheduling on the right, configure the following parameters, and then click Save and Submit in the upper-right corner.
-
Configure scheduling parameters:
-
Set the $user_table_name parameter to rec_sln_demo_user_table.
Set the $item_table_name parameter to rec_sln_demo_item_table.
Set the $behavior_table_name parameter to rec_sln_demo_behavior_table.
In addition to the table name parameters, you must also add a parameter for the business date. This parameter is named bizdate. Set its value to $[yyyymmdd-1].
-
Configure a scheduling dependency.
-
Go to the Operation Center and choose .
-
In the Actions column of the target task, choose .
-
In the Backfill Data panel, set a business date and click Submit and Go.
The optimal time range for data backfill is 60 days. We recommend that you set the business date to Task scheduled date - 60 to ensure data integrity.
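To make the 60-day recommendation concrete, the following sketch lists the business dates that such a backfill would cover. It assumes the backfill starts at the scheduled date minus 60 days and ends on the day before the scheduled date. The helper name backfill_business_dates is hypothetical, and the actual date range is whatever you select in the Backfill Data panel.

```python
from datetime import date, timedelta


def backfill_business_dates(scheduled_date: date, days: int = 60) -> list:
    """List business dates (yyyymmdd), oldest first.

    Assumption: the range runs from scheduled_date - days
    to scheduled_date - 1, inclusive.
    """
    start = scheduled_date - timedelta(days=days)
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]


dates = backfill_business_dates(date(2024, 8, 15))
print(dates[0], dates[-1], len(dates))
```

For a task scheduled on 2024-08-15, the first business date is 20240616 and the last is 20240814.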
2. Configure dependency nodes
To ensure smooth code generation and deployment, add three SQL code nodes to your DataWorks project. Set the scheduling dependencies of these nodes to the root node of the workspace. Finally, publish the nodes. The procedure is as follows:
-
Hover over Create and choose Create Node > General-purpose > Virtual Node. Create three virtual nodes with the following resource configuration, and click Confirm.
-
For each virtual node, set its content to select 1; and then click Configure Scheduling on the right and configure the following settings:
-
In the Time property section, set Rerun property to Rerun When Succeeded Or Failed.
-
In the Scheduling dependencies > Upstream dependencies section, enter your DataWorks workspace name, select the node with the _root suffix, and click Add.
Apply these settings to all three virtual nodes.
-
Click the icon next to each virtual node to submit it.
3. Register data
To configure feature engineering, recall, and ranking algorithms in the custom recommendation solution, you must first register the three tables that you synchronized to your DataWorks project. The procedure is as follows:
-
Log in to the PAI-Rec console. In the top menu bar, select a region.
-
In the left-side navigation pane, choose Instance List. Click the instance name to open the instance details page.
-
In the left-side navigation pane, choose Recommendation Solution Customization > Data Registration. On the MaxCompute Table tab, click Create Data Table. Add one user table, one item table, and one behavior table as described in the following table, and then click Import.
Parameter | Description | Example
MaxCompute project | Select your MaxCompute project. | project_mc
MaxCompute table | Select the data tables synchronized to your DataWorks workspace. | user table: rec_sln_demo_user_table_v1; item table: rec_sln_demo_item_table_v1; behavior table: rec_sln_demo_behavior_table_v1
data table name | Enter a custom name. | user table; item table; behavior table
4. Create recommendation scenario
Before you can configure a recommendation task, you must first create a recommendation scenario. For an explanation of recommendation scenarios and traffic IDs, see Basic Concepts.
In the left navigation pane, click Recommendation Scenario. Click Create Scenario, configure the new scenario according to the following table, and click Determine.
5. Configure an algorithm solution
To fully configure a real-world scenario, use the following recall and fine-grained ranking configurations:
-
global hot recall: Selects the top-k most popular items from log data.
-
global hot fallback recall: Uses Redis as a fallback to prevent empty responses from the recommendation API.
-
grouped hot recall: Recalls items by groups (e.g., city, gender) to provide more accurate recommendations of popular items.
-
etrec u2i recall: Uses the etrec collaborative filtering algorithm.
-
Swing u2i recall (optional): Uses the Swing algorithm.
-
cold-start recall (optional): Handles cold-start scenarios using the DropoutNet algorithm.
-
fine-grained ranking: Uses MultiTower for single-objective ranking and DBMTL for multi-objective ranking.
You should enable algorithms such as vector recall or PDN recall only after the initial recall covers most of your candidates. Vector recall requires a vector recall engine. FeatureDB does not support vector recall, so this example does not configure it.
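To illustrate the idea behind global hot recall and grouped hot recall, the following is a minimal in-memory Python sketch. It is not the code that PAI-Rec generates. The field names mirror the demo behavior table, while the function names, the sample rows, and the choice to count only click events are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def top_k(counts: Counter, k: int) -> list:
    """Return the k most frequent item IDs, breaking ties by item ID."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


def global_hot(events: list, k: int, event: str = "click") -> list:
    """Global hot recall: the top-k items over all matching log rows."""
    counts = Counter(e["item_id"] for e in events if e["event"] == event)
    return top_k(counts, k)


def grouped_hot(events: list, user_group: dict, k: int, event: str = "click") -> dict:
    """Grouped hot recall: the top-k items within each user group (for example, city)."""
    per_group = defaultdict(Counter)
    for e in events:
        group = user_group.get(e["user_id"])
        if group is not None and e["event"] == event:
            per_group[group][e["item_id"]] += 1
    return {g: top_k(c, k) for g, c in per_group.items()}


# Hypothetical sample rows, not taken from the demo tables.
events = [
    {"user_id": 1, "item_id": "a", "event": "click"},
    {"user_id": 2, "item_id": "a", "event": "click"},
    {"user_id": 2, "item_id": "b", "event": "click"},
    {"user_id": 3, "item_id": "b", "event": "click"},
    {"user_id": 3, "item_id": "c", "event": "click"},
    {"user_id": 1, "item_id": "a", "event": "click"},
]
user_city = {1: "hangzhou", 2: "hangzhou", 3: "beijing"}

print(global_hot(events, k=2))
print(grouped_hot(events, user_city, k=1))
```

In the deployed solution these results are computed from the MaxCompute behavior table. The sketch only shows the counting and grouping logic.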
This article shows a sample configuration and deployment. To simplify the process, this example uses only global hot recall and the u2i recall strategy from RECommender (eTREC, an implementation of collaborative filtering) for the recall configuration, and fine-grained ranking for the ranking configuration. The steps are as follows:
-
In the left-side navigation pane, choose Recommendation Solution Customization > Solution Configurations. Select the scenario that you created, click Create Recommendation Solution, provide the following resource configuration, and then click Save and Go to Algorithm Solution Configuration.
Keep the default values for unspecified parameters. For more information, see Data table configuration.
-
In the Configure Table node, click Add next to the target data table. Configure the partitions, events, features, and timestamps for the behavior log table, user table, and item table as detailed in the following sections. Click Next.
You can keep the default values for any parameters that are not described. For more information, see Data Table Configuration.
-
In the Configure Feature node, configure the parameters as described in the following table, click Generate Feature, set a feature version, and then click Next.
After you click Generate Feature, the system derives various statistical features for users and items. This solution uses the default derived features, but you can edit them based on your business requirements. For more information, see Feature Configuration.
-
In the Retrieval Configuration node, click Add next to the target category, configure the parameters, click Confirm, and then click Next.
This section describes several recall configuration methods. To get started quickly, you only need to configure global hot recall and etrec u2i recall. The other methods are provided for reference.
-
In the Configure Sorting Method node, click Add next to Fine-grained Ranking, configure the parameters as described below, click Confirm, and then click Next.
-
In the Generate Script node, click Generate Deployment Script.
Important: After the deployment script is generated, the system provides an OSS address that contains all the deployment files. You can save this address to run the deployment script manually later.
-
Once the script is generated, click Determine in the dialog box to go to the Recommendation Solution Customization > Deployment History page.
If script generation fails, check the run logs to identify and fix the error, then regenerate the script.
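For readers unfamiliar with collaborative-filtering u2i recall, such as the etrec u2i recall configured in the Retrieval Configuration node, the following sketch shows the general idea with a simplified item-based method: compute item-to-item similarity from co-occurrence, then score unseen items for a user. This is a stand-in for illustration, not the etrec implementation. The similarity formula, function names, and sample data are all assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarity(user_items: dict) -> dict:
    """Item-to-item similarity: co-occurrence normalized by item popularity."""
    item_users = defaultdict(int)                    # number of users per item
    co = defaultdict(lambda: defaultdict(int))       # co-occurrence counts
    for items in user_items.values():
        for i in items:
            item_users[i] += 1
        for i, j in combinations(sorted(items), 2):
            co[i][j] += 1
            co[j][i] += 1
    return {
        i: {j: c / sqrt(item_users[i] * item_users[j]) for j, c in nbrs.items()}
        for i, nbrs in co.items()
    }


def u2i_recall(user_id: str, user_items: dict, sim: dict, k: int) -> list:
    """Score items the user has not seen by summing similarity to seen items."""
    seen = user_items.get(user_id, set())
    scores = defaultdict(float)
    for i in seen:
        for j, s in sim.get(i, {}).items():
            if j not in seen:
                scores[j] += s
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


# Hypothetical interaction data: user ID -> set of item IDs.
user_items = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
    "u4": {"a"},
}
sim = item_similarity(user_items)
print(u2i_recall("u4", user_items, sim, k=2))
print(u2i_recall("u1", user_items, sim, k=2))
```

Here user u4 has only interacted with item a, so the sketch recalls b ahead of c because b co-occurs with a more often.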
6. Deploy the recommendation
Once the script is generated, you can deploy it to DataWorks in one of two ways.
Method 1: Integrated deployment
-
Click Go to Deploy to the right of the target solution. On the Deployment Records page, you can filter records by scenario and Deployment Status. Find the target recommendation solution, such as pai_rec_testdemo_v1, and confirm that its Deployment Status is Ready.
-
On the Deployment Preview page, in the Diff File section, select the files to deploy. For a first-time deployment, click Select All and then click Deploy to DataWorks.
The page returns to the Deployment History page. On this page, the Deployment Status for the pai_rec_testdemo_v1 recommendation solution in the HomePage scenario is displayed as Running.
-
Wait for a moment, then click the refresh icon to refresh the list and check the deployment status.
-
If the deployment fails, click View Log in the Actions column. Analyze and resolve the specific error, and then regenerate and deploy the script.
-
When the Deployment Status changes to Successful, you can go to DataStudio in the DataWorks workspace configured for this solution to view the deployed code. For more information, see Data Development: developer. In the left-side navigation pane of DataStudio, expand Business Flow > Workflow > MaxCompute > Data Development to view the deployed node folders, including feature, feature_v726, rank_v1, recall_v1 (with etrec_u2i_recall and global_hot_recall sub-nodes), and test.
-
View the task data backfill process.
-
On the page, click Details in the Actions column for the successfully deployed solution.
-
On the Deployment Preview page, click View Task Data Backfill Process to review the process and its instructions.
-
Ensure that the partitions for the user table, item table, and user behavior table contain data for the most recent n days, where n is the sum of the training time window and the maximum feature time window. If you are using the demo data from this topic, synchronize the latest data partitions. If you are generating data with a Python script, perform a data backfill in the DataWorks Operation Center to generate the latest partitions.
-
Click Create Data Refill Task in the upper-left corner, and then click Start Task by Order under the Data Refill Task List. Ensure that all tasks run successfully. If a task fails, click Details to view the log information, analyze and resolve the error, and then rerun the task. After a successful rerun, click Continue Run in the upper-left corner of the page until all tasks are successful.
The data backfill process DAG shows the complete feature engineering pipeline: at the top are the three data source tables (pairec_demo.rec_sln_demo_item_table_v1, pairec_demo.rec_sln_demo_behavior_table_v1, and pairec_demo.rec_sln_demo_user_table_v1), which sequentially go through preprocessing, wide table construction (such as behavior_table_v1_wide), aggregate statistics (such as item_id_30d_agg and user_id_30d_agg), and static feature extraction to generate full feature tables (such as item_table_v1_all_feat and user_table_v1_all_feat). Finally, a PyODPS3-type create_sync_onlinestore node synchronizes the data to the online feature store. Each node card includes information such as the start and end dates and the order of the data backfill. The Create Backfill Task button in the upper-left corner initiates this data backfill process.
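The partition requirement described above (data for the most recent n days, where n is the training time window plus the maximum feature time window) can be checked with simple date arithmetic. The following sketch assumes daily ds partitions in yyyymmdd format. The function names and window sizes are hypothetical, and in practice the list of existing partitions would come from your MaxCompute tables.

```python
from datetime import date, timedelta


def required_partitions(latest: date, train_days: int, max_feature_days: int) -> list:
    """Partitions needed: the most recent n days, n = train_days + max_feature_days."""
    n = train_days + max_feature_days
    return [(latest - timedelta(days=i)).strftime("%Y%m%d") for i in range(n)]


def missing_partitions(existing, latest: date, train_days: int, max_feature_days: int) -> list:
    """Return required partitions that are absent from the existing ones, sorted."""
    have = set(existing)
    needed = required_partitions(latest, train_days, max_feature_days)
    return sorted(p for p in needed if p not in have)


# Hypothetical windows: 2 training days plus a 3-day maximum feature window.
existing = ["20240814", "20240813", "20240811"]
missing = missing_partitions(existing, date(2024, 8, 14), train_days=2, max_feature_days=3)
print(missing)
```

Any partition reported as missing should be synchronized or backfilled before you start the data refill tasks.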
Method 2: Migration Assistant
After the script is generated, you can also manually deploy it using Migration Assistant in the DataWorks console. The key parameters are described below. For other operations, see Create and view a DataWorks import task.
-
Import Name: Set this according to the console prompt.
-
Upload Method: Select OSS File, enter the OSS Link, and click Verify.
The deployment file is stored at the OSS address generated in Step 5, for example, oss://examplebucket/algoconfig/plan/1723717372/package.zip. You can log on to the OSS console and follow the steps below to get the file's URL. In the left-side navigation pane of the OSS console, click Bucket List. Go to the target bucket and click File List. Find and click the target file, such as package.zip. In the details panel that appears on the right, enable the HTTPS toggle, and then click Copy File URL to get a signed URL. Use this URL as the OSS Link.
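When you look for the deployment file in the OSS console, it helps to split the generated address into the bucket name and the object path. The following sketch does this with the Python standard library. The helper name split_oss_uri is hypothetical, and the sketch does not produce the signed URL, which you still copy from the console.

```python
from urllib.parse import urlparse


def split_oss_uri(uri: str) -> tuple:
    """Split an oss://bucket/key address into (bucket, object key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "oss" or not parsed.netloc:
        raise ValueError(f"not an OSS address: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_oss_uri("oss://examplebucket/algoconfig/plan/1723717372/package.zip")
print(bucket)  # bucket to open in Bucket List
print(key)     # object path to find in File List
```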
7. Freeze nodes
After the backfill completes, freeze the tasks in the Operation Center (the three nodes in step 2.2) to prevent them from running daily.
In the DataWorks Operation Center, navigate to Auto Triggered Task O&M > Recurring Job. Search for the node (for example, rec_sln_demo_user_table_v1), select the node (Workspace.node name), and choose Suspend (Freeze).