Collaborative filtering (etrec)

The Collaborative Filtering (etrec) component uses an item-based collaborative filtering algorithm to compute pairwise item similarity from user-item interaction data. Given a user column and an item column, it outputs the top N most similar items for each item. It is typically used as the recall stage in a recommendation pipeline.

How it works

  1. Read user-item interaction records from the input table (user column + item column).

  2. For each item, identify users who interacted with it and compute similarity scores against all other items using the selected similarity type.

  3. Write the top N similar items per item to the output table as key-value pairs (item_id:similarity_score).
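
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: it uses Jaccard similarity as a stand-in for the selected similarity type, and the function name `top_n_similar_items` and its arguments are hypothetical.

```python
from collections import defaultdict


def top_n_similar_items(interactions, top_n=2000, item_delim=" ", kv_delim=":"):
    """Return {item: "other:score other:score ..."} using Jaccard similarity.

    Jaccard is an assumption made for illustration only; the real component
    applies whichever similarity type is configured.
    """
    # Step 1: read (user, item) records and group users by item.
    users_by_item = defaultdict(set)
    for user, item in interactions:
        users_by_item[item].add(user)

    result = {}
    for item, users in users_by_item.items():
        # Step 2: score this item against every other item.
        scored = []
        for other, other_users in users_by_item.items():
            if other == item:
                continue
            shared = len(users & other_users)
            if shared == 0:
                continue  # no common users, so no similarity entry
            scored.append((other, shared / len(users | other_users)))

        # Step 3: keep the top N by score and serialize as key-value pairs.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        if scored:
            result[item] = item_delim.join(
                f"{other}{kv_delim}{score:g}" for other, score in scored[:top_n]
            )
    return result


rows = [("0", "0"), ("0", "1"), ("1", "0"), ("1", "1")]
print(top_n_similar_items(rows))  # {'0': '1:1', '1': '0:1'}
```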

Choose a similarity type

Select the similarity algorithm based on your interaction data characteristics:

| Similarity type | When to use |
| --- | --- |
| WbCosine | General-purpose item-based collaborative filtering. Use this as the default for most recommendation scenarios. |
| asymcosine | When item relationships are directional. Tune alpha (smoothing factor) and weight (weighting coefficient) to control the degree of asymmetry. |
| Jaccard | When your interaction data is binary (clicked or not clicked, purchased or not). |
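
To make the difference between a symmetric and a directional measure concrete, the sketch below compares Jaccard similarity with a common textbook form of asymmetric cosine on sets of users. The formulas are assumptions chosen for illustration: the exact formulas used by etrec, including how `weight` enters the computation, are not specified here, so `weight` is not modeled.

```python
def jaccard(users_a, users_b):
    """Symmetric: shared users divided by all users of either item."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def asymmetric_cosine(users_a, users_b, alpha=0.5):
    """Textbook asymmetric cosine on binary data (assumed form).

    alpha in (0, 1) shifts the normalization between the two items;
    alpha = 0.5 reduces to the ordinary (symmetric) cosine.
    """
    if not users_a or not users_b:
        return 0.0
    shared = len(users_a & users_b)
    return shared / (len(users_a) ** alpha * len(users_b) ** (1 - alpha))


popular = {"u1", "u2", "u3", "u4"}  # item seen by many users
niche = {"u1"}                      # item seen by one of them

print(jaccard(popular, niche))                 # 0.25
print(asymmetric_cosine(popular, niche, 0.5))  # 0.5
# With alpha != 0.5, the score depends on the direction of the comparison.
print(asymmetric_cosine(popular, niche, 0.2) != asymmetric_cosine(niche, popular, 0.2))  # True
```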

Configure the component

Two configuration methods are available: the PAI console (visual) and PAI commands (SQL-based). Both expose the same parameters.

Method 1: Configure in the PAI console

On the pipeline page of Machine Learning Designer, select the Collaborative Filtering (etrec) component and set the following parameters.

Fields Setting tab

| Parameter | Description |
| --- | --- |
| User Column | The name of the user column in the input table. |
| Item Column | The name of the item column in the input table. |
| Delimiter between items in the output table | The separator between items in the output table. Default: space. |
| Delimiter between key-value in the output table | The separator between keys and values in the output table. Default: colon (:). Spaces are not supported. |

Parameters Setting tab

| Parameter | Description |
| --- | --- |
| Similarity Type | The algorithm used to compute item similarity. See Choose a similarity type. Valid values: WbCosine, asymcosine, Jaccard. |
| TopN | The maximum number of similar items to retain per item in the output table. |
| Calculation Behavior | Discontinued. This parameter is no longer effective. Valid values: Add, Mul, Min, Max. |
| Minimum Item Value | Users with fewer interactions than this threshold are excluded from computation. Increase this value to filter out users with sparse interaction histories, whose behavior tends to reflect item popularity rather than personal preference. |
| Maximum Item Value | Users with more interactions than this threshold are excluded from computation. Decrease this value to filter out high-volume users (such as automated bots) whose records may introduce noise into similarity scores. |
| Smoothing Factor | Valid only when Similarity Type is asymcosine. Valid values: (0, 1). Default: 0.5. |
| Weighting Coefficient | Valid only when Similarity Type is asymcosine. The weight index applied to the asymmetric cosine formula. Default: 1.0. |
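
The effect of Minimum Item Value and Maximum Item Value can be pictured as a pre-filter on users. The sketch below is an illustration under stated assumptions: it counts raw interaction rows per user (the component may count distinct items instead), and the function name `filter_users` is hypothetical.

```python
from collections import Counter


def filter_users(interactions, min_behavior=2, max_behavior=500):
    """Keep only records from users whose interaction count is within bounds.

    Users with fewer than min_behavior or more than max_behavior
    interactions are dropped; both bounds are inclusive.
    """
    counts = Counter(user for user, _ in interactions)
    return [
        (user, item)
        for user, item in interactions
        if min_behavior <= counts[user] <= max_behavior
    ]


rows = [
    ("a", "x"),                                      # 1 interaction: too sparse
    ("b", "x"), ("b", "y"),                          # 2 interactions: kept
    ("c", "w"), ("c", "x"), ("c", "y"), ("c", "z"),  # 4 interactions: too many
]
print(filter_users(rows, min_behavior=2, max_behavior=3))
# [('b', 'x'), ('b', 'y')]
```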

Method 2: Configure using PAI commands

Call the pai_etrec command from an SQL Script. For more information, see SQL Script.

PAI -name pai_etrec
    -project algo_public
    -DsimilarityType="wbcosine"
    -Dweight="1"
    -DminUserBehavior="2"
    -Dlifecycle="28"
    -DtopN="2000"
    -Dalpha="0.5"
    -DoutputTableName="etrec_test_result"
    -DmaxUserBehavior="500"
    -DinputTableName="etrec_test_input"
    -DuserColName="user"
    -DitemColName="item";

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| inputTableName | Yes | The name of the input table. | |
| userColName | Yes | The name of the user column in the input table. | |
| itemColName | Yes | The name of the item column in the input table. | |
| inputTablePartitions | No | The partitions to read from the input table. | Full table |
| outputTableName | Yes | The name of the output table. | |
| outputTablePartition | No | The partition to write to in the output table. | |
| similarityType | No | The similarity algorithm. Valid values: wbcosine, asymcosine, jaccard. | wbcosine |
| topN | No | The number of most similar items to retain per item. Valid values: 1–10000. | 2000 |
| minUserBehavior | No | The minimum number of interactions a user must have to be included. Users below this threshold are excluded; their behavior tends to reflect item popularity rather than personal preference. | 2 |
| maxUserBehavior | No | The maximum number of interactions a user can have to be included. Users above this threshold are excluded; high-volume records (for example, from automated bots) may introduce noise. | 500 |
| itemDelimiter | No | The delimiter between items in the output table. | Space |
| kvDelimiter | No | The delimiter between keys and values in the output table. | : |
| alpha | No | The smoothing factor when similarityType=asymcosine. Valid values: (0, 1). | 0.5 |
| weight | No | The weight index when similarityType=asymcosine. | 1.0 |
| lifecycle | No | The lifecycle of the output table. | 1 |
| coreNum | No | The number of cores to allocate. | System-determined |
| memSizePerCore | No | The memory per core, in MB. | System-determined |
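
If the command text is generated from a script, the documented value ranges can be checked before the command is submitted. The helper below is a hypothetical sketch, not part of PAI: `build_etrec_command` only assembles the command string and validates similarityType, topN, and alpha against the ranges listed in the table above.

```python
VALID_SIMILARITY_TYPES = {"wbcosine", "asymcosine", "jaccard"}


def build_etrec_command(input_table, output_table, user_col, item_col, **options):
    """Assemble a pai_etrec command string (hypothetical helper).

    Optional parameters are passed by their documented names,
    for example topN=100 or similarityType="jaccard".
    """
    similarity = options.get("similarityType", "wbcosine")
    if similarity not in VALID_SIMILARITY_TYPES:
        raise ValueError(f"unsupported similarityType: {similarity}")
    if not 1 <= int(options.get("topN", 2000)) <= 10000:
        raise ValueError("topN must be between 1 and 10000")
    if not 0 < float(options.get("alpha", 0.5)) < 1:
        raise ValueError("alpha must be strictly between 0 and 1")

    params = {
        "inputTableName": input_table,
        "outputTableName": output_table,
        "userColName": user_col,
        "itemColName": item_col,
        **options,
    }
    lines = ["PAI -name pai_etrec", "    -project algo_public"]
    lines += [f'    -D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"


print(build_etrec_command(
    "etrec_test_input", "etrec_test_result", "user", "item",
    similarityType="jaccard", topN=100,
))
```

The helper only produces text; running the command still requires an SQL Script component or another MaxCompute client.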

Example

This example creates a minimal input table, runs the collaborative filtering algorithm, and inspects the output.

Step 1: Create the training data table.

DROP TABLE IF EXISTS etrec_test_input;
CREATE TABLE etrec_test_input AS
SELECT *
FROM (
    SELECT CAST(0 AS STRING) AS user, CAST(0 AS STRING) AS item
    UNION ALL SELECT CAST(0 AS STRING), CAST(1 AS STRING)
    UNION ALL SELECT CAST(1 AS STRING), CAST(0 AS STRING)
    UNION ALL SELECT CAST(1 AS STRING), CAST(1 AS STRING)
) a;

The resulting etrec_test_input table:

| user | item |
| --- | --- |
| 0 | 0 |
| 0 | 1 |
| 1 | 0 |
| 1 | 1 |

Step 2: Run the PAI command.

DROP TABLE IF EXISTS etrec_test_result;
PAI -name pai_etrec
    -project algo_public
    -DsimilarityType="wbcosine"
    -Dweight="1"
    -DminUserBehavior="2"
    -Dlifecycle="28"
    -DtopN="2000"
    -Dalpha="0.5"
    -DoutputTableName="etrec_test_result"
    -DmaxUserBehavior="500"
    -DinputTableName="etrec_test_input"
    -DuserColName="user"
    -DitemColName="item";

Step 3: View the output table.

Query etrec_test_result. Each row represents an item and its most similar items, formatted as item_id:similarity_score key-value pairs.

| itemid | similarity |
| --- | --- |
| 0 | 1:1 |
| 1 | 0:1 |

Both users interacted with item 0 and item 1, so the two items have identical user sets and each receives a similarity score of 1 with the other.
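
Downstream code usually needs the similarity column as structured data rather than a delimited string. The sketch below parses one value using the default delimiters (a space between items, a colon between key and value). The function name `parse_similarity` and the second sample string are hypothetical.

```python
def parse_similarity(value, item_delim=" ", kv_delim=":"):
    """Split a similarity string into a list of (item_id, score) tuples."""
    pairs = []
    for entry in value.split(item_delim):
        if not entry:
            continue  # tolerate empty values and repeated delimiters
        # Split on the last key-value delimiter so that item IDs
        # containing the delimiter are still parsed correctly.
        item_id, score = entry.rsplit(kv_delim, 1)
        pairs.append((item_id, float(score)))
    return pairs


print(parse_similarity("1:1"))          # [('1', 1.0)]
print(parse_similarity("7:0.8 3:0.5"))  # [('7', 0.8), ('3', 0.5)]
```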