Automatic similar tag classification

This topic describes how to use PAI's text analysis components to automatically classify product tags.

Background information

Product descriptions often contain tags for multiple dimensions, such as time, origin, and style. For example, a shoe description might include tags like "British style", "lace-up", "genuine leather", and "short boots". A bag description could include "2016", "autumn", "crossbody", and "tassel". Manually classifying tens of thousands of products based on these tags is a major challenge for e-commerce platforms. The main difficulty is extracting dimensional tags from product descriptions. The text analysis components in PAI can automatically learn tag words, which enables automated tag classification.

Prerequisites

Prepare the dataset

The dataset for this pipeline is a curated 2016 Double 11 shopping list with over 2,000 product descriptions. Each row represents the aggregated tags for a single product.

In DataWorks, go to the data development module to create a table that contains only one column named content. Then, upload the prepared dataset to this table. For more information, see Create a table and upload data.

Automatic similar tag classification

  1. Go to the Machine Learning Designer page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

    3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer).

  2. Create a custom pipeline and go to the pipeline page. For more information, see Create a custom pipeline.

  3. Build and run the pipeline.

    1. From the component list on the left, drag the Read Table component under Source/Target to the canvas and rename it to shopping_data-1.

    2. From the component list, drag the Split Word, Word Count, and Word2Vec components under Natural Language Processing > Basic NLP to the canvas.

    3. From the component list, drag the Add ID Column and Type Transform components under Data Preprocessing to the canvas.

    4. From the component list, drag the K-Means Clustering component under Machine Learning > Clustering to the canvas.

    5. From the component list, drag the SQL Script component under Custom Script to the canvas.

    6. Connect the components to build the pipeline as shown in the figure. Use the following table to configure the key parameters for each component, and then run them.

    Figure: Automatic similar tag classification pipeline

      No.

      Description

      Load the shopping_data table and use the Split Word component to tokenize it. To do this:

      1. On the canvas, click the shopping_data-1 component. On the Select Table tab in the right-side pane, select the table that you prepared.

      2. On the canvas, click the Split Word-1 component. On the Field Settings tab, select the column named content.

      3. Click the shopping_data-1 component and choose Run Current Node from the shortcut menu. After the component runs, run the Split Word-1 component in the same way.

      Add an ID column. Because the uploaded data has only one field, you must add an ID column to assign a unique primary key to each row.

      Click the Add ID Column-1 component and choose Run Current Node from the shortcut menu. After the component runs, run the Type Transform-1 component in the same way.

      Count word frequency. This shows how often each word appears in each product description.

      1. On the canvas, click the Word Count-1 component. On the Field Settings tab, set Select Document ID Column to append_id and Select Document Content Column to content.

      2. Click the Word Count-1 component and choose Run Current Node from the shortcut menu.
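The word frequency step can be sketched outside PAI as follows. This is a minimal illustration, not the Word Count component's implementation: it assumes that the tokenized content column holds space-separated words, and the sample rows and the `count_words` helper are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized rows: (append_id, space-separated content).
rows = [
    (0, "british lace-up leather boots leather"),
    (1, "2016 autumn crossbody tassel"),
]


def count_words(rows):
    """Return (doc_id, word, count) triples, one per distinct word per document."""
    triples = []
    for doc_id, content in rows:
        # Assumption: tokens are separated by whitespace after word splitting.
        counts = Counter(content.split())
        for word, count in sorted(counts.items()):
            triples.append((doc_id, word, count))
    return triples


triples = count_words(rows)
for triple in triples:
    print(triple)
```

Each triple pairs a document ID with one of its words and the number of times that word occurs in the document, which is the kind of per-document frequency the step above describes.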

      Use the Word2Vec component to generate word embeddings. This process maps each word to a vector in a multi-dimensional space according to its meaning. Key concepts of word embeddings include:

      • Words that are close in vector distance have similar real-world meanings.

      • The distance between word vectors carries semantic meaning.

      The Word2Vec component maps each word to a 100-dimensional space.

      1. On the canvas, click the Word2Vec-1 component. On the Field Settings tab, set Select Word Column to word. On the Parameters tab, select Use hierarchical softmax.

      2. Click the Word2Vec-1 component and choose Run Current Node from the shortcut menu.

      Example result: The output table of the Word2Vec algorithm contains columns for the sequence number, word, and f0 to f99. The word column shows product keywords (such as thickened, Korean style, new style, free shipping, simple, winter, autumn/winter, and pure cotton), and the f0 to f99 columns show the corresponding floating-point values of their word embeddings.
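To make "close in vector distance" concrete, the following sketch compares word vectors with cosine similarity. It is an illustration only: the three-dimensional vectors are invented (the component produces 100 dimensions, f0 to f99), cosine similarity is one common distance-like measure and not necessarily the one PAI uses, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional embeddings, invented for illustration only.
embeddings = {
    "winter": [0.9, 0.1, 0.0],
    "autumn/winter": [0.8, 0.2, 0.1],
    "free shipping": [0.0, 0.1, 0.9],
}


def most_similar(word, embeddings):
    """Return the (other_word, similarity) pair closest to the given word."""
    target = embeddings[word]
    candidates = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return max(candidates, key=lambda pair: pair[1])


nearest, score = most_similar("winter", embeddings)
print(nearest, round(score, 3))
```

With these toy values, the seasonal tags point in nearly the same direction, while the shipping tag does not.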

      The K-Means Clustering algorithm automatically groups semantically similar tags by calculating the distance between their word embeddings.

      1. On the canvas, click the K-Means Clustering-1 component. On the Field Settings tab, set Feature Columns to f0 and Append Columns to word.

        Note

        When this component runs, the number of rows in its upstream input data table must be greater than or equal to the configured number of clusters.

      2. Click the K-Means Clustering-1 component and choose Run Current Node from the shortcut menu.

      The result shows the cluster to which each word belongs. For example, the words household and g are assigned to cluster 83, men and children to cluster 79, warm and fleece-lined to cluster 98, set to cluster 94, trendy to cluster 90, and authentic to cluster 87.
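The clustering idea can be sketched with a small, pure-Python version of Lloyd's k-means algorithm. This is not the K-Means Clustering component's implementation: the two-dimensional vectors are invented, the initial centroids are simply the first k points so that the run is deterministic, and the `kmeans` function is hypothetical. The check at the top mirrors the note above about the row count and the number of clusters.

```python
import math


def kmeans(points, k, iterations=20):
    """Assign each point to one of k clusters using Lloyd's algorithm."""
    if len(points) < k:
        raise ValueError("number of rows must be >= number of clusters")
    # Assumption: deterministic initialization from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: pick the nearest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Toy embeddings, invented for illustration only.
words = ["warm", "men", "fleece-lined", "children"]
vectors = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]

assignments = kmeans(vectors, k=2)
for word, cluster_index in zip(words, assignments):
    print(word, cluster_index)
```

In this toy run, warm and fleece-lined end up in one cluster and men and children in the other, which parallels the grouping described in the result above.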

      Verify the results. Use the SQL Script-1 component to select a specific cluster to check whether similar tags are grouped together. This example selects cluster 10. On the canvas, click the SQL Script-1 component. On the Parameters tab, set SQL Script to select * from ${t1} where cluster_index=10.

      In the results, geographically related tags are grouped into the same cluster. However, some mismatched tags, such as nuts, are also included. This mismatch can occur when the training dataset is small. A larger training dataset improves the accuracy of the tag clustering.
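The verification query can be tried locally with Python's built-in sqlite3 module. This is a sketch under assumptions: the table named t1 stands in for the upstream table that the ${t1} placeholder refers to, the rows are invented sample data, and SQLite is used here only for illustration in place of the engine that PAI runs SQL on.

```python
import sqlite3

# In-memory database standing in for the clustering output (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (word TEXT, cluster_index INTEGER)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [
        ("Hangzhou", 10),
        ("Xinjiang", 10),
        ("nuts", 10),  # a mismatched tag that landed in the same cluster
        ("warm", 98),
    ],
)

# Same filter as the SQL Script step, restricted to one cluster.
rows = conn.execute(
    "SELECT word FROM t1 WHERE cluster_index = ? ORDER BY word", (10,)
).fetchall()
cluster_words = [row[0] for row in rows]
print(cluster_words)
conn.close()
```

Listing the words of a single cluster this way makes outliers such as nuts easy to spot by eye.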

Related documents

For more information about the algorithm components, see the following topics: