The Collaborative Filtering (etrec) component uses an item-based collaborative filtering algorithm to compute pairwise item similarity from user-item interaction data. Given a user column and an item column, it outputs the top N most similar items for each item. This output is typically used as the recall stage in a recommendation pipeline.
How it works
Read user-item interaction records from the input table (user column + item column).
For each item, identify users who interacted with it and compute similarity scores against all other items using the selected similarity type.
Write the top N similar items per item to the output table as key-value pairs (item_id:similarity_score).
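The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: it assumes plain cosine similarity on binary interactions as a stand-in for the component's similarity types, and the function names `top_n_similar_items` and `format_row` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def top_n_similar_items(interactions, top_n=2):
    """Return {item: [(other_item, score), ...]} sorted by descending score."""
    # Step 1: group the distinct users who interacted with each item.
    users_by_item = defaultdict(set)
    for user, item in interactions:
        users_by_item[item].add(user)

    # Step 2: score every item pair that shares at least one user.
    # Assumption: plain cosine on binary interactions, used as a stand-in.
    scores = defaultdict(dict)
    for a, b in combinations(sorted(users_by_item), 2):
        shared = len(users_by_item[a] & users_by_item[b])
        if shared == 0:
            continue
        score = shared / sqrt(len(users_by_item[a]) * len(users_by_item[b]))
        scores[a][b] = score
        scores[b][a] = score

    # Step 3: keep only the top N neighbours per item.
    return {
        item: sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        for item, neighbours in scores.items()
    }


def format_row(neighbours, item_delimiter=" ", kv_delimiter=":"):
    """Serialize neighbours as item_id:similarity_score pairs."""
    return item_delimiter.join(
        f"{item}{kv_delimiter}{score:g}" for item, score in neighbours
    )


interactions = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u2", "c")]
result = top_n_similar_items(interactions)
print(format_row(result["a"]))  # b:1 c:0.707107
```

Items "a" and "b" share both of their users and therefore score 1, while "c" shares only one user with each of them and ranks lower.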
Choose a similarity type
Select the similarity algorithm based on your interaction data characteristics:
| Similarity type | When to use |
|---|---|
| WbCosine | General-purpose item-based collaborative filtering. Use this as the default for most recommendation scenarios. |
| asymcosine | When item relationships are directional, that is, when the similarity of item A to item B should differ from that of B to A. Tune alpha (smoothing factor) and weight (weighting coefficient) to control the degree of asymmetry. |
| Jaccard | When your interaction data is binary (clicked or not clicked, purchased or not). |
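To make the difference between the similarity types concrete, the sketch below implements the textbook set-based definitions of Jaccard, cosine, and asymmetric cosine similarity. These are assumptions for illustration only: this page does not give the exact formulas the component uses, the weighting applied by WbCosine is not modelled, and the `weight` parameter is left out because its role in the formula is not specified here.

```python
from math import sqrt


def jaccard(users_a, users_b):
    """Size of the intersection divided by the size of the union."""
    union = len(users_a | users_b)
    return len(users_a & users_b) / union if union else 0.0


def cosine(users_a, users_b):
    """Symmetric cosine on binary interactions: shared / sqrt(|A| * |B|)."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / sqrt(len(users_a) * len(users_b))


def asymmetric_cosine(users_a, users_b, alpha=0.5):
    """shared / (|A|**alpha * |B|**(1 - alpha)); equals cosine at alpha = 0.5."""
    if not users_a or not users_b:
        return 0.0
    shared = len(users_a & users_b)
    return shared / (len(users_a) ** alpha * len(users_b) ** (1 - alpha))


# Users who interacted with a popular item and with a niche item.
popular = {"u1", "u2", "u3", "u4"}
niche = {"u1", "u2"}

print(jaccard(popular, niche))
print(cosine(popular, niche))
print(asymmetric_cosine(popular, niche, alpha=0.2))
print(asymmetric_cosine(niche, popular, alpha=0.2))
```

With alpha at 0.5 the asymmetric form reduces to ordinary cosine. Moving alpha away from 0.5 makes the score depend on argument order, which is what "directional" means in the table above.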
Configure the component
Two configuration methods are available: the PAI console (visual) and PAI commands (SQL-based). Both expose the same parameters.
Method 1: Configure in the PAI console
On the pipeline page of Machine Learning Designer, select the Collaborative Filtering (etrec) component and set the following parameters.
Fields Setting tab
| Parameter | Description |
|---|---|
| User Column | The name of the user column in the input table. |
| Item Column | The name of the item column in the input table. |
| Delimiter between items in the output table | The separator between items in the output table. Default: space. |
| Delimiter between key-value in the output table | The separator between keys and values in the output table. Default: colon (:). Spaces are not supported. |
Parameters Setting tab
| Parameter | Description |
|---|---|
| Similarity Type | The algorithm used to compute item similarity. See Choose a similarity type. Valid values: WbCosine, asymcosine, Jaccard. |
| TopN | The maximum number of similar items to retain per item in the output table. |
| Calculation Behavior | Deprecated. This parameter no longer has any effect. Former valid values: Add, Mul, Min, Max. |
| Minimum Item Value | Users with fewer interactions than this threshold are excluded from computation. Increase this value to filter out users with sparse interaction histories, whose behavior tends to reflect item popularity rather than personal preference. |
| Maximum Item Value | Users with more interactions than this threshold are excluded from computation. Decrease this value to filter out high-volume users (such as automated bots) whose records may introduce noise into similarity scores. |
| Smoothing Factor | Valid only when Similarity Type is asymcosine. Valid values: (0, 1). Default: 0.5. |
| Weighting Coefficient | Valid only when Similarity Type is asymcosine. The weight index applied to the asymmetric cosine formula. Default: 1.0. |
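The effect of Minimum Item Value and Maximum Item Value can be pictured as a pre-filter on the interaction records. The sketch below is an illustration under assumptions: it counts raw rows per user (the component may count distinct items instead), it treats both bounds as inclusive, and the helper name `filter_users` is hypothetical.

```python
from collections import Counter


def filter_users(interactions, min_user_behavior=2, max_user_behavior=500):
    """Keep only rows from users whose interaction count is within the bounds."""
    # Assumption: every (user, item) row counts as one interaction.
    counts = Counter(user for user, _ in interactions)
    return [
        (user, item)
        for user, item in interactions
        if min_user_behavior <= counts[user] <= max_user_behavior
    ]


rows = [
    ("u1", "a"),                                            # 1 interaction: too sparse
    ("u2", "a"), ("u2", "b"),                               # 2 interactions: kept
    ("bot", "a"), ("bot", "b"), ("bot", "c"), ("bot", "d"),  # 4 interactions: too many
]
kept = filter_users(rows, min_user_behavior=2, max_user_behavior=3)
print(kept)  # [('u2', 'a'), ('u2', 'b')]
```

Only the rows of "u2" survive, so only that user contributes to the item similarity scores.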
Method 2: Configure using PAI commands
Call the pai_etrec command from an SQL Script. For more information, see SQL Script.
PAI -name pai_etrec
-project algo_public
-DsimilarityType="wbcosine"
-Dweight="1"
-DminUserBehavior="2"
-Dlifecycle="28"
-DtopN="2000"
-Dalpha="0.5"
-DoutputTableName="etrec_test_result"
-DmaxUserBehavior="500"
-DinputTableName="etrec_test_input"
-DuserColName="user"
-DitemColName="item";

| Parameter | Required | Description | Default |
|---|---|---|---|
| inputTableName | Yes | The name of the input table. | None |
| userColName | Yes | The name of the user column in the input table. | None |
| itemColName | Yes | The name of the item column in the input table. | None |
| inputTablePartitions | No | The partitions to read from the input table. | Full table |
| outputTableName | Yes | The name of the output table. | None |
| outputTablePartition | No | The partition to write to in the output table. | None |
| similarityType | No | The similarity algorithm. Valid values: wbcosine, asymcosine, jaccard. | wbcosine |
| topN | No | The maximum number of similar items to retain per item. Valid values: 1 to 10000. | 2000 |
| minUserBehavior | No | The minimum number of interactions a user must have to be included. Users below this threshold are excluded; their behavior tends to reflect item popularity rather than personal preference. | 2 |
| maxUserBehavior | No | The maximum number of interactions a user can have to be included. Users above this threshold are excluded; high-volume records (for example, from automated bots) may introduce noise. | 500 |
| itemDelimiter | No | The delimiter between items in the output table. | Space |
| kvDelimiter | No | The delimiter between keys and values in the output table. Spaces are not supported. | : |
| alpha | No | The smoothing factor. Takes effect only when similarityType=asymcosine. Valid values: (0, 1). | 0.5 |
| weight | No | The weight index. Takes effect only when similarityType=asymcosine. | 1.0 |
| lifecycle | No | The lifecycle of the output table, in days. | 1 |
| coreNum | No | The number of cores to allocate. | System-determined |
| memSizePerCore | No | The memory per core, in MB. | System-determined |
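If you generate the command text from a script, a small helper can check the documented ranges before anything is submitted. The sketch below is a convenience illustration, not part of PAI: `build_etrec_command` is a hypothetical helper that only formats a string from the parameter names in the table above and does not call any PAI API.

```python
def build_etrec_command(input_table, output_table, user_col, item_col, **options):
    """Render a pai_etrec command string from the documented parameter names."""
    # Validate the two ranges that the parameter table states explicitly.
    top_n = int(options.get("topN", 2000))
    if not 1 <= top_n <= 10000:
        raise ValueError("topN must be between 1 and 10000")
    alpha = float(options.get("alpha", 0.5))
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in the open interval (0, 1)")

    params = {
        "inputTableName": input_table,
        "outputTableName": output_table,
        "userColName": user_col,
        "itemColName": item_col,
        **options,
    }
    lines = ["PAI -name pai_etrec", "-project algo_public"]
    lines += [f'-D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"


command = build_etrec_command(
    "etrec_test_input",
    "etrec_test_result",
    "user",
    "item",
    similarityType="jaccard",
    topN=100,
)
print(command)
```

The resulting text can be pasted into an SQL Script; parameters that are not passed keep the defaults listed in the table.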
Example
This example creates a minimal input table, runs the collaborative filtering algorithm, and inspects the output.
Step 1: Create the training data table.
DROP TABLE IF EXISTS etrec_test_input;
CREATE TABLE etrec_test_input AS
SELECT *
FROM (
SELECT CAST(0 AS STRING) AS user, CAST(0 AS STRING) AS item
UNION ALL SELECT CAST(0 AS STRING), CAST(1 AS STRING)
UNION ALL SELECT CAST(1 AS STRING), CAST(0 AS STRING)
UNION ALL SELECT CAST(1 AS STRING), CAST(1 AS STRING)
) a;
The resulting etrec_test_input table:
| user | item |
|---|---|
| 0 | 0 |
| 0 | 1 |
| 1 | 0 |
| 1 | 1 |
Step 2: Run the PAI command.
DROP TABLE IF EXISTS etrec_test_result;
PAI -name pai_etrec
-project algo_public
-DsimilarityType="wbcosine"
-Dweight="1"
-DminUserBehavior="2"
-Dlifecycle="28"
-DtopN="2000"
-Dalpha="0.5"
-DoutputTableName="etrec_test_result"
-DmaxUserBehavior="500"
-DinputTableName="etrec_test_input"
-DuserColName="user"
-DitemColName="item";
Step 3: View the output table.
Query etrec_test_result. Each row represents an item and its most similar items, formatted as item_id:similarity_score key-value pairs.
| itemid | similarity |
|---|---|
| 0 | 1:1 |
| 1 | 0:1 |
Both users interacted with item 0 and with item 1, so the two items have identical user sets and each has a similarity score of 1 with the other.
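Downstream jobs usually need to split the similarity column back into pairs. The sketch below parses the two example rows, assuming the default delimiters (a space between items and a colon between key and value); `parse_similarity` is a hypothetical helper name.

```python
def parse_similarity(value, item_delimiter=" ", kv_delimiter=":"):
    """Split 'item:score item:score' into a list of (item_id, score) tuples."""
    pairs = []
    for token in value.split(item_delimiter):
        if not token:
            continue  # tolerate repeated or trailing delimiters
        # Split on the last delimiter so item IDs may contain the delimiter.
        item_id, score = token.rsplit(kv_delimiter, 1)
        pairs.append((item_id, float(score)))
    return pairs


# The two rows of etrec_test_result shown above.
rows = {"0": "1:1", "1": "0:1"}
neighbours = {item: parse_similarity(value) for item, value in rows.items()}
print(neighbours)  # {'0': [('1', 1.0)], '1': [('0', 1.0)]}
```

If you change itemDelimiter or kvDelimiter in the command, pass the same values to the parser.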