PMI

The PMI component calculates pointwise mutual information (PMI) across all word pairs in a document corpus, quantifying how strongly two words are associated based on their co-occurrence frequency. Use it in Machine Learning Designer pipelines to identify meaningful word associations.

How PMI works

In information theory, mutual information (MI) measures how much knowing one random variable reduces uncertainty about another. PMI is its pointwise counterpart: it scores one specific pair of outcomes, here a pair of words, so each word pair gets a single association score.

Formula: PMI(x,y) = ln(p(x,y) / (p(x) × p(y))) = ln(#(x,y) × D / (#x × #y))

Where:

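  • #(x,y): number of times the word pair (x, y) co-occurs within a window

  • D: total number of word pairs counted across the corpus

  • #x, #y: counts of words x and y, accumulated over the counted pairs as described below (these are not raw word frequencies)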

Each time words x and y appear together in the same window, the component increments #x, #y, and #(x,y) by 1.
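
The counting rule above can be sketched in Python. This is an illustrative reimplementation, not the component's actual code. The function name pmi_table is hypothetical, pairs are assumed to be unordered, and the window is assumed to cover up to window_size words to the right of the current word (the windowSize behavior described under the parameters). minCount filtering is not modeled.

```python
import math
from collections import Counter


def pmi_table(docs, window_size):
    """Return {(word1, word2): (count1, count2, co_count, pmi)}.

    docs is an iterable of strings whose words are separated by spaces.
    """
    word_count = Counter()  # #x: grows by 1 for every pair slot the word fills
    pair_count = Counter()  # #(x,y): keyed by the sorted (unordered) pair
    total_pairs = 0         # D

    for doc in docs:
        words = doc.split()
        for i, left in enumerate(words):
            # Pair the current word with up to window_size words to its right.
            for right in words[i + 1 : i + 1 + window_size]:
                word_count[left] += 1
                word_count[right] += 1
                pair_count[tuple(sorted((left, right)))] += 1
                total_pairs += 1

    table = {}
    for (x, y), co in pair_count.items():
        score = math.log(co * total_pairs / (word_count[x] * word_count[y]))
        table[(x, y)] = (word_count[x], word_count[y], co, score)
    return table


# Usage: print every pair found in two short documents.
for pair, row in sorted(pmi_table(["w0 w0", "w9 w1 w9 w1 w9"], window_size=2).items()):
    print(pair, row)
```

Run on the five documents from the Example section with window_size=2, this sketch counts 32 pairs in total and yields (2, 2, 1, ln 8 ≈ 2.0794) for the pair (w0, w0), which agrees with the corresponding row of the sample output.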

Interpreting PMI scores:

| PMI value | Meaning |
| --- | --- |
| Positive | The two words co-occur more often than chance predicts, so they are likely meaningfully associated |
| Near zero | No strong association |
| Negative | The two words co-occur less often than chance predicts |

Without frequency filtering, rare words can produce artificially high PMI scores because they appear in very few contexts. Set minCount to filter out words that appear too infrequently to yield reliable statistics.
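
To see why rare words inflate the score, the formula can be evaluated on made-up counts. The numbers below are hypothetical and chosen only for illustration. The helper pmi is not part of the component; it simply applies the formula from "How PMI works".

```python
import math


def pmi(co_count, x_count, y_count, total_pairs):
    """PMI(x,y) = ln(#(x,y) * D / (#x * #y))."""
    return math.log(co_count * total_pairs / (x_count * y_count))


total_pairs = 10_000  # hypothetical D

# Two words that each take part in exactly one pair, and that pair is with each other.
rare = pmi(co_count=1, x_count=1, y_count=1, total_pairs=total_pairs)

# Two common words that co-occur 200 times.
frequent = pmi(co_count=200, x_count=500, y_count=400, total_pairs=total_pairs)

print(f"rare pair: {rare:.2f}, frequent pair: {frequent:.2f}")
```

With these counts, the pair seen only once scores ln(10000), while the pair seen 200 times scores ln(10). A single chance co-occurrence therefore outranks a well-supported association, which is the situation minCount is meant to suppress.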

For background on the algorithm, see Pointwise mutual information.

Configure the component

Method 1: Configure in the PAI console

On the pipeline page of Machine Learning Designer, configure the following parameters:

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields setting | Columns of documents with words separated with spaces | The document column whose words are separated by spaces. |
| Parameters setting | Minimum frequency of words | Words appearing fewer times than this threshold are filtered out. Default: 5. Increase this value to exclude rare words that inflate PMI scores. |
| Parameters setting | Window size | The number of words to the right of the current word that form the co-occurrence window. For example, a window size of 5 pairs the current word with each of the 5 words immediately to its right. |
| Tuning | Computing cores | Number of CPU cores for the calculation. Default: determined by the system. |
| Tuning | Memory size per core (unit: MB) | Memory allocated to each core. Default: determined by the system. |

Method 2: Configure using PAI commands

Use an SQL Script component or an ODPS SQL node to run PAI commands. The command name is PointwiseMutualInformation.

PAI -name PointwiseMutualInformation
    -project algo_public
    -DinputTableName=maple_test_pmi_basic_input
    -DdocColName=doc
    -DoutputTableName=maple_test_pmi_basic_output
    -DminCount=0
    -DwindowSize=2
    -DcoreNum=1
    -DmemSizePerCore=110;

Parameters

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| inputTableName | Yes | Input table. | |
| outputTableName | Yes | Output table. | |
| docColName | Yes | Name of the document column after word segmentation. Words must be separated by spaces. | |
| windowSize | No | Number of words to the right of the current word that form the co-occurrence window. For example, a value of 5 pairs the current word with each of the five words immediately to its right. | All words in the row |
| minCount | No | Minimum word frequency. Words appearing fewer times are filtered out. Increase this value if low-frequency words are producing unreliable PMI scores. | 5 |
| inputTablePartitions | No | Partitions to use from the input table, in Partition_name=value format. Specify a multi-level partition as name1=value1/name2=value2. Separate multiple partitions with commas. | All partitions |
| lifecycle | No | Lifecycle of the output table. | |
| coreNum | No | Number of CPU cores. Valid values: 1–9999. | Determined by the system |
| memSizePerCore | No | Memory per core, in MB. Valid values: 1024–65536. | Determined by the system |

Example

This example demonstrates the full workflow: create an input table, run the PMI command, and inspect the output.

Input

Create a table named maple_test_pmi_basic_input using an ODPS SQL node. For details, see Develop a MaxCompute SQL task.

CREATE TABLE maple_test_pmi_basic_input AS
SELECT * FROM (
    SELECT "w1 w2 w3 w4 w5 w6 w7 w8 w8 w9" AS doc
    UNION ALL SELECT "w1 w3 w5 w6 w9" AS doc
    UNION ALL SELECT "w0" AS doc
    UNION ALL SELECT "w0 w0" AS doc
    UNION ALL SELECT "w9 w1 w9 w1 w9" AS doc
) tmp;

The resulting table has one document per row, with words separated by spaces:

doc
w1 w2 w3 w4 w5 w6 w7 w8 w8 w9
w1 w3 w5 w6 w9
w0
w0 w0
w9 w1 w9 w1 w9

Run the command

Run the following command using an SQL Script component or an ODPS SQL node:

PAI -name PointwiseMutualInformation
    -project algo_public
    -DinputTableName=maple_test_pmi_basic_input
    -DdocColName=doc
    -DoutputTableName=maple_test_pmi_basic_output
    -DminCount=0
    -DwindowSize=2
    -DcoreNum=1
    -DmemSizePerCore=110;

Output

The output table maple_test_pmi_basic_output contains one row per word pair:

| Column | Description |
| --- | --- |
| word1, word2 | The word pair. |
| word1_count, word2_count | Count of each word accumulated over all counted pairs. A word's count grows by 1 for every pair it takes part in, so it can exceed the word's raw frequency in the corpus. |
| co_occurrences_count | Number of times the pair co-occurs within the window. |
| pmi | Natural logarithm of the co-occurrence ratio. Positive values indicate words that co-occur more than chance predicts; negative values indicate less than chance; zero indicates no association. |

Sample output for maple_test_pmi_basic_output:

| word1 | word2 | word1_count | word2_count | co_occurrences_count | pmi |
| --- | --- | --- | --- | --- | --- |
| w0 | w0 | 2 | 2 | 1 | 2.0794415416798357 |
| w1 | w1 | 10 | 10 | 1 | -1.1394342831883648 |
| w1 | w2 | 10 | 3 | 1 | 0.06453852113757116 |
| w1 | w3 | 10 | 7 | 2 | -0.08961215868968704 |
| w1 | w5 | 10 | 8 | 1 | -0.916290731874155 |
| w1 | w9 | 10 | 12 | 4 | 0.06453852113757116 |
| w2 | w3 | 3 | 7 | 1 | 0.4212134650763035 |
| w2 | w4 | 3 | 4 | 1 | 0.9808292530117262 |
| w3 | w4 | 7 | 4 | 1 | 0.13353139262452257 |
| w3 | w5 | 7 | 8 | 2 | 0.13353139262452257 |
| w3 | w6 | 7 | 7 | 1 | -0.42608439531090014 |
| w4 | w5 | 4 | 8 | 1 | 0.0 |
| w4 | w6 | 4 | 7 | 1 | 0.13353139262452257 |
| w5 | w6 | 8 | 7 | 2 | 0.13353139262452257 |
| w5 | w7 | 8 | 4 | 1 | 0.0 |
| w5 | w9 | 8 | 12 | 1 | -1.0986122886681098 |
| w6 | w7 | 7 | 4 | 1 | 0.13353139262452257 |
| w6 | w8 | 7 | 7 | 1 | -0.42608439531090014 |
| w6 | w9 | 7 | 12 | 1 | -0.9650808960435872 |
| w7 | w8 | 4 | 7 | 2 | 0.8266785731844679 |
| w8 | w8 | 7 | 7 | 1 | -0.42608439531090014 |
| w8 | w9 | 7 | 12 | 2 | -0.2719337154836418 |
| w9 | w9 | 12 | 12 | 2 | -0.8109302162163288 |
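
As a check on the formula, two of these rows can be recomputed by hand. The value of D is not printed in the output. It is inferred here under the assumption that, with windowSize=2, a document of n ≥ 2 words contributes (n − 1) + (n − 2) pairs, which gives 17, 7, 0, 1, and 7 pairs for the five input rows. The same total is obtained by summing the co_occurrences_count column.

```latex
\begin{aligned}
D &= 17 + 7 + 0 + 1 + 7 = 32 \\
\mathrm{PMI}(\text{w0}, \text{w0}) &= \ln\frac{1 \times 32}{2 \times 2} = \ln 8 \approx 2.0794 \\
\mathrm{PMI}(\text{w4}, \text{w5}) &= \ln\frac{1 \times 32}{4 \times 8} = \ln 1 = 0
\end{aligned}
```

Both results agree with the pmi column for the corresponding rows.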

What's next