What is the PMI algorithm component

How PMI works

In information theory, mutual information (MI) measures how much knowing one random variable reduces uncertainty about another. Pointwise mutual information (PMI) applies the same idea to a single pair of outcomes, here a pair of words, and produces one association score per pair.

Formula: PMI(x,y) = ln(p(x,y) / (p(x) × p(y))) = ln(#(x,y) × D / (#x × #y))

Where:

#(x,y) — number of times word pair (x, y) co-occurs
D — total number of pairs
#x, #y — individual word counts

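Each time words x and y appear together in the same window, the component increments #x, #y, and #(x,y) by 1. As a result, #x is the number of counted pairs that contain x, not the raw number of times x appears in the text, and D is the total number of pairs counted. A pair of identical words, such as (w0, w0), adds 2 to that word's count.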
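The counting rule and formula above can be sketched in Python. This is a minimal illustration, not the component's implementation: the function name pmi_table is hypothetical, pairs are assumed to be unordered, the window is assumed to hold the next window_size words to the right, and minimum-frequency filtering is left out. The documents are the ones used in the example later on this page.

```python
import math
from collections import Counter


def pmi_table(docs, window_size):
    """Count right-window pairs and score them with ln(#(x,y) * D / (#x * #y))."""
    word_count = Counter()  # #x: number of counted pairs that contain x
    pair_count = Counter()  # #(x,y): co-occurrences of the unordered pair
    total_pairs = 0         # D: total number of pairs

    for doc in docs:
        words = doc.split()
        for i, x in enumerate(words):
            # Pair the current word with up to window_size words to its right.
            for y in words[i + 1 : i + 1 + window_size]:
                word_count[x] += 1
                word_count[y] += 1
                # Assumption: (x, y) and (y, x) are the same pair.
                pair_count[tuple(sorted((x, y)))] += 1
                total_pairs += 1

    scores = {
        (x, y): math.log(n * total_pairs / (word_count[x] * word_count[y]))
        for (x, y), n in pair_count.items()
    }
    return word_count, pair_count, total_pairs, scores


docs = [
    "w1 w2 w3 w4 w5 w6 w7 w8 w8 w9",
    "w1 w3 w5 w6 w9",
    "w0",
    "w0 w0",
    "w9 w1 w9 w1 w9",
]
word_count, pair_count, total_pairs, scores = pmi_table(docs, window_size=2)

print(total_pairs)                         # 32
print(word_count["w1"])                    # 10
print(round(scores[("w0", "w0")], 4))      # 2.0794
```

Under these assumptions the sketch yields D = 32 and the same counts and scores as the sample output table, which suggests the window and counting rules are read correctly, though the component may differ in details not covered here.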

Interpreting PMI scores:

PMI value	Meaning
Positive	The two words co-occur more often than chance predicts — likely meaningfully associated
Near zero	No strong association
Negative	The two words co-occur less often than chance predicts

Without frequency filtering, rare words can produce artificially high PMI scores: a pair seen only once or twice rests on very little evidence, yet the small word counts in the denominator make the ratio large. Set minCount to filter out words that appear too infrequently to yield reliable statistics.
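A short arithmetic sketch shows why rare words score so high. The corpus size and counts below are made-up numbers chosen for illustration, and the helper name pmi is hypothetical; only the formula comes from this page.

```python
import math


def pmi(pair_count, x_count, y_count, total_pairs):
    """PMI(x,y) = ln(#(x,y) * D / (#x * #y))."""
    return math.log(pair_count * total_pairs / (x_count * y_count))


D = 1_000_000  # hypothetical total number of pairs

# Two words that each occur once, and only together: the score is ln(D),
# the upper bound of the formula, even though it rests on one observation.
rare = pmi(1, 1, 1, D)

# Two frequent words that co-occur 10 times more often than chance predicts.
frequent = pmi(10_000, 10_000, 100_000, D)

print(round(rare, 2), round(frequent, 2))
```

The single-observation pair outscores the well-attested pair by a wide margin, which is the distortion a minimum-frequency threshold is meant to suppress.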

For background on the algorithm, see Pointwise mutual information.

Configure the component

Method 1: Configure in the PAI console

On the pipeline page of Machine Learning Designer, configure the following parameters:

Tab	Parameter	Description
Fields setting	Columns of documents with words separated with spaces	The document column whose words are separated by spaces
Parameters setting	Minimum frequency of words	Words appearing fewer times than this threshold are filtered out. Default: 5. Increase this value to exclude rare words that inflate PMI scores.
Parameters setting	Window size	The number of words to the right of the current word that form the co-occurrence window. For example, a window size of 5 pairs the current word with each of the 5 words immediately to its right.
Tuning	Computing cores	Number of CPU cores for the calculation. Default: determined by the system.
Tuning	Memory size per core (unit: MB)	Memory allocated to each core. Default: determined by the system.

Method 2: Configure using PAI commands

Use an SQL Script component or an ODPS SQL node to run PAI commands. The command name is PointwiseMutualInformation.

PAI -name PointwiseMutualInformation
    -project algo_public
    -DinputTableName=maple_test_pmi_basic_input
    -DdocColName=doc
    -DoutputTableName=maple_test_pmi_basic_output
    -DminCount=0
    -DwindowSize=2
    -DcoreNum=1
    -DmemSizePerCore=110;

Parameters

Parameter	Required	Description	Default
`inputTableName`	Yes	Input table	—
`outputTableName`	Yes	Output table	—
`docColName`	Yes	Name of the document column after word segmentation; words must be separated by spaces	—
`windowSize`	No	Number of words to the right of the current word that form the co-occurrence window. For example, a value of 5 pairs the current word with each of the five words immediately to its right.	All words in the row
`minCount`	No	Minimum word frequency. Words appearing fewer times are filtered out. Increase this value if low-frequency words are producing unreliable PMI scores.	5
`inputTablePartitions`	No	Partitions to use from the input table, in `Partition_name=value` format. For multi-level partitions, use `name1=value1/name2=value2`. Separate multiple partitions with commas.	All partitions
`lifecycle`	No	Lifecycle of the output table	—
`coreNum`	No	Number of CPU cores. Valid values: 1–9999.	Determined by the system
`memSizePerCore`	No	Memory per core, in MB. Valid values: 1024–65536.	Determined by the system
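When the command is generated from a script rather than typed by hand, the parameter table can be turned into a small builder. The following is a sketch only: build_pmi_command is a hypothetical helper, not part of PAI, and it only assembles the command text and checks that the three required parameters are present.

```python
def build_pmi_command(params):
    """Render a PointwiseMutualInformation command from a dict of -D parameters."""
    required = ("inputTableName", "outputTableName", "docColName")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    lines = ["PAI -name PointwiseMutualInformation", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"


command = build_pmi_command({
    "inputTableName": "maple_test_pmi_basic_input",
    "docColName": "doc",
    "outputTableName": "maple_test_pmi_basic_output",
    "minCount": 0,
    "windowSize": 2,
})
print(command)
```

The resulting text can be pasted into an SQL Script component or an ODPS SQL node; the helper does not submit anything itself.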

Example

This example demonstrates the full workflow: create an input table, run the PMI command, and inspect the output.

Input

Create a table named maple_test_pmi_basic_input using an ODPS SQL node. For details, see Develop a MaxCompute SQL task.

CREATE TABLE maple_test_pmi_basic_input AS
SELECT * FROM (
    SELECT "w1 w2 w3 w4 w5 w6 w7 w8 w8 w9" AS doc
    UNION ALL SELECT "w1 w3 w5 w6 w9" AS doc
    UNION ALL SELECT "w0" AS doc
    UNION ALL SELECT "w0 w0" AS doc
    UNION ALL SELECT "w9 w1 w9 w1 w9" AS doc
) tmp;

The resulting table has one document per row, with words separated by spaces:

doc
w1 w2 w3 w4 w5 w6 w7 w8 w8 w9
w1 w3 w5 w6 w9
w0
w0 w0
w9 w1 w9 w1 w9

Run the command

Run the following command using an SQL Script component or an ODPS SQL node:

PAI -name PointwiseMutualInformation
    -project algo_public
    -DinputTableName=maple_test_pmi_basic_input
    -DdocColName=doc
    -DoutputTableName=maple_test_pmi_basic_output
    -DminCount=0
    -DwindowSize=2
    -DcoreNum=1
    -DmemSizePerCore=110;

Output

The output table maple_test_pmi_basic_output contains one row per word pair:

Column	Description
`word1`, `word2`	The word pair
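`word1_count`, `word2_count`	Number of counted pairs that contain each word. Every co-occurrence adds 1 to both words, so these values are larger than raw token counts (w1 appears 4 times in the input but has a count of 10).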
`co_occurrences_count`	Number of times the pair co-occurs within the window
`pmi`	PMI score: the natural logarithm of #(x,y) × D / (#x × #y). Positive values indicate words that co-occur more often than chance predicts, negative values less often, and zero exactly as often as chance predicts.

Sample output for maple_test_pmi_basic_output:

word1	word2	word1_count	word2_count	co_occurrences_count	pmi
w0	w0	2	2	1	2.0794415416798357
w1	w1	10	10	1	-1.1394342831883648
w1	w2	10	3	1	0.06453852113757116
w1	w3	10	7	2	-0.08961215868968704
w1	w5	10	8	1	-0.916290731874155
w1	w9	10	12	4	0.06453852113757116
w2	w3	3	7	1	0.4212134650763035
w2	w4	3	4	1	0.9808292530117262
w3	w4	7	4	1	0.13353139262452257
w3	w5	7	8	2	0.13353139262452257
w3	w6	7	7	1	-0.42608439531090014
w4	w5	4	8	1	0.0
w4	w6	4	7	1	0.13353139262452257
w5	w6	8	7	2	0.13353139262452257
w5	w7	8	4	1	0.0
w5	w9	8	12	1	-1.0986122886681098
w6	w7	7	4	1	0.13353139262452257
w6	w8	7	7	1	-0.42608439531090014
w6	w9	7	12	1	-0.9650808960435872
w7	w8	4	7	2	0.8266785731844679
w8	w8	7	7	1	-0.42608439531090014
w8	w9	7	12	2	-0.2719337154836418
w9	w9	12	12	2	-0.8109302162163288
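As a sanity check, the formula can be inverted to recover D from any output row: D = exp(pmi) × word1_count × word2_count / co_occurrences_count. The sketch below does this for a few rows copied from the table above; it is an illustration, not part of the component.

```python
import math

# (word1, word2, word1_count, word2_count, co_occurrences_count, pmi)
rows = [
    ("w0", "w0", 2, 2, 1, 2.0794415416798357),
    ("w1", "w1", 10, 10, 1, -1.1394342831883648),
    ("w4", "w5", 4, 8, 1, 0.0),
    ("w7", "w8", 4, 7, 2, 0.8266785731844679),
    ("w9", "w9", 12, 12, 2, -0.8109302162163288),
]

# Invert PMI = ln(co * D / (c1 * c2)) to get the D implied by each row.
implied_d = [
    math.exp(pmi) * count1 * count2 / co_count
    for _, _, count1, count2, co_count, pmi in rows
]

for (word1, word2, *_), d in zip(rows, implied_d):
    print(word1, word2, round(d, 6))
```

Every row implies the same value, about 32, which matches the number of pairs produced by a window of 2 over the five input documents.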

What's next

Machine Learning Designer overview — learn about the pipeline-based workflow
Component reference: overview of all components — explore other text analysis and feature engineering components