The PMI component calculates pointwise mutual information (PMI) across all word pairs in a document corpus, quantifying how strongly two words are associated based on their co-occurrence frequency. Use it in Machine Learning Designer pipelines to identify meaningful word associations.
How PMI works
In information theory, mutual information (MI) measures how much knowing one random variable reduces uncertainty about another. PMI applies the same idea to one specific pair of outcomes, here a pair of words, and produces a single association score per pair.
Formula: PMI(x,y) = ln(p(x,y) / (p(x) × p(y))) = ln(#(x,y) × D / (#x × #y))
Where:
- #(x,y): number of times the word pair (x, y) co-occurs
- D: total number of pairs
- #x, #y: individual word counts
Each time words x and y appear together in the same window, the component increments #x, #y, and #(x,y) by 1.
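The counting rule and formula above can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: the function names count_pairs and pmi_scores are hypothetical, the window is assumed to cover up to window_size words to the right of the current word (as the Window size parameter describes), and pairs are assumed to be unordered, which is consistent with the sample output later in this topic.

```python
import math
from collections import Counter


def count_pairs(docs, window_size):
    """Count co-occurrences for space-separated documents.

    Each word is paired with up to `window_size` words to its right.
    Every pair adds 1 to #(x,y), 1 to #x, and 1 to #y.
    """
    pair_counts = Counter()
    word_counts = Counter()
    total_pairs = 0  # D in the formula
    for doc in docs:
        tokens = doc.split()
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + 1 + window_size]:
                # Assumption: pairs are unordered, so (a, b) and (b, a) share a key.
                pair_counts[tuple(sorted((left, right)))] += 1
                word_counts[left] += 1
                word_counts[right] += 1
                total_pairs += 1
    return pair_counts, word_counts, total_pairs


def pmi_scores(docs, window_size):
    """Return {(x, y): ln(#(x,y) * D / (#x * #y))} for every observed pair."""
    pair_counts, word_counts, total_pairs = count_pairs(docs, window_size)
    return {
        (x, y): math.log(n_xy * total_pairs / (word_counts[x] * word_counts[y]))
        for (x, y), n_xy in pair_counts.items()
    }


pair_counts, word_counts, total_pairs = count_pairs(["w9 w1 w9 w1 w9"], window_size=2)
print(total_pairs)                            # 7
print(word_counts["w1"], word_counts["w9"])   # 6 8
```

Under this rule a word's count is the number of pairs it takes part in, not the number of times it appears in the text, so a document that contains a single word contributes nothing to any count.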
Interpreting PMI scores:
| PMI value | Meaning |
|---|---|
| Positive | The two words co-occur more often than chance predicts — likely meaningfully associated |
| Near zero | No strong association |
| Negative | The two words co-occur less often than chance predicts |
Without frequency filtering, rare words can produce artificially high PMI scores because they appear in very few contexts. Set minCount to filter out words that appear too infrequently to yield reliable statistics.
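To make the effect of rare words concrete, the sketch below scores two hypothetical word pairs with the PMI formula and then applies a minimum-count threshold. All words, counts, and the corpus size are invented for illustration, and the filter (keep a pair only if both word counts reach the threshold) is an assumption about how such a threshold could be applied, not a description of the component's internals.

```python
import math


def pmi(pair_count, x_count, y_count, total_pairs):
    """PMI from raw counts: ln(#(x,y) * D / (#x * #y))."""
    return math.log(pair_count * total_pairs / (x_count * y_count))


TOTAL_PAIRS = 10_000  # hypothetical D

# (word1, word2, word1_count, word2_count, co_occurrences_count), all hypothetical
rows = [
    ("zyzzyva", "quokka", 1, 1, 1),   # two words seen once, together
    ("new", "york", 500, 400, 200),   # a frequent, genuinely associated pair
]


def score(rows, total_pairs, min_count=0):
    """Score pairs whose words both appear at least `min_count` times."""
    kept = [r for r in rows if r[2] >= min_count and r[3] >= min_count]
    return {(w1, w2): pmi(co, c1, c2, total_pairs) for w1, w2, c1, c2, co in kept}


unfiltered = score(rows, TOTAL_PAIRS)
filtered = score(rows, TOTAL_PAIRS, min_count=5)
print(unfiltered)  # the one-off pair scores ln(10000), about 9.21; the frequent pair ln(10), about 2.30
print(filtered)    # only ("new", "york") remains
```

A single chance co-occurrence outranks a pair observed 200 times, which is why a higher threshold usually gives a more trustworthy ranking.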
For background on the algorithm, see Pointwise mutual information.
Configure the component
Method 1: Configure in the PAI console
On the pipeline page of Machine Learning Designer, configure the following parameters:
| Tab | Parameter | Description |
|---|---|---|
| Fields setting | Columns of documents with words separated with spaces | The document column whose words are separated by spaces |
| Parameters setting | Minimum frequency of words | Words appearing fewer times than this threshold are filtered out. Default: 5. Increase this value to exclude rare words that inflate PMI scores. |
| Parameters setting | Window size | The number of words to the right of the current word that form the co-occurrence window. For example, a window size of 5 pairs the current word with each of the 5 words immediately to its right. |
| Tuning | Computing cores | Number of CPU cores for the calculation. Default: determined by the system. |
| Tuning | Memory size per core (unit: MB) | Memory allocated to each core. Default: determined by the system. |
Method 2: Configure using PAI commands
Use an SQL Script component or an ODPS SQL node to run PAI commands. The command name is PointwiseMutualInformation.
PAI -name PointwiseMutualInformation
-project algo_public
-DinputTableName=maple_test_pmi_basic_input
-DdocColName=doc
-DoutputTableName=maple_test_pmi_basic_output
-DminCount=0
-DwindowSize=2
-DcoreNum=1
-DmemSizePerCore=110;
Parameters
| Parameter | Required | Description | Default |
|---|---|---|---|
| inputTableName | Yes | Input table | None |
| outputTableName | Yes | Output table | None |
| docColName | Yes | Name of the document column after word segmentation; words must be separated by spaces | None |
| windowSize | No | Number of words to the right of the current word that form the co-occurrence window. For example, a value of 5 pairs the current word with the five adjacent words on its right. | All words in the row |
| minCount | No | Minimum word frequency. Words appearing fewer times are filtered out. Increase this value if low-frequency words are producing unreliable PMI scores. | 5 |
| inputTablePartitions | No | Partitions to use from the input table, in Partition_name=value format. Specify multi-level partitions as name1=value1/name2=value2, and separate multiple partitions with commas. | All partitions |
| lifecycle | No | Lifecycle of the output table | None |
| coreNum | No | Number of CPU cores. Valid values: 1–9999. | Determined by the system |
| memSizePerCore | No | Memory per core, in MB. Valid values: 1024–65536. | Determined by the system |
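If you generate the command from a script, a small helper can assemble it from the parameters in the table above. The sketch below is only an illustration: build_pmi_command is a hypothetical helper, it emits only the parameter names documented here, and it does not submit anything to MaxCompute.

```python
def build_pmi_command(params, project="algo_public"):
    """Render a PointwiseMutualInformation PAI command as text (hypothetical helper).

    `params` maps documented parameter names to values; insertion order is kept.
    """
    required = ("inputTableName", "outputTableName", "docColName")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    lines = ["PAI -name PointwiseMutualInformation", f"-project {project}"]
    lines += [f"-D{name}={value}" for name, value in params.items()]
    return "\n".join(lines) + ";"


command = build_pmi_command(
    {
        "inputTableName": "maple_test_pmi_basic_input",
        "docColName": "doc",
        "outputTableName": "maple_test_pmi_basic_output",
        "minCount": 0,
        "windowSize": 2,
    }
)
print(command)
```

The printed text can then be pasted into an SQL Script component or an ODPS SQL node.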
Example
This example demonstrates the full workflow: create an input table, run the PMI command, and inspect the output.
Input
Create a table named maple_test_pmi_basic_input using an ODPS SQL node. For details, see Develop a MaxCompute SQL task.
CREATE TABLE maple_test_pmi_basic_input AS
SELECT * FROM (
SELECT "w1 w2 w3 w4 w5 w6 w7 w8 w8 w9" AS doc
UNION ALL SELECT "w1 w3 w5 w6 w9" AS doc
UNION ALL SELECT "w0" AS doc
UNION ALL SELECT "w0 w0" AS doc
UNION ALL SELECT "w9 w1 w9 w1 w9" AS doc
) tmp;
The resulting table has one document per row, with words separated by spaces:
| doc |
|---|
| w1 w2 w3 w4 w5 w6 w7 w8 w8 w9 |
| w1 w3 w5 w6 w9 |
| w0 |
| w0 w0 |
| w9 w1 w9 w1 w9 |
Run the command
Run the following command using an SQL Script component or an ODPS SQL node:
PAI -name PointwiseMutualInformation
-project algo_public
-DinputTableName=maple_test_pmi_basic_input
-DdocColName=doc
-DoutputTableName=maple_test_pmi_basic_output
-DminCount=0
-DwindowSize=2
-DcoreNum=1
-DmemSizePerCore=110;
Output
The output table maple_test_pmi_basic_output contains one row per word pair:
| Column | Description |
|---|---|
| word1, word2 | The word pair |
| word1_count, word2_count | Number of co-occurring pairs that each word takes part in across the corpus (a pair of identical words counts twice). This is not the raw number of times the word appears: w0 appears three times in the input but has a count of 2. |
| co_occurrences_count | Number of times the pair co-occurs within the window |
| pmi | Natural logarithm of the co-occurrence ratio. Positive values indicate words that co-occur more than chance predicts; negative values indicate less than chance; zero indicates no association. |
Sample output for maple_test_pmi_basic_output:
| word1 | word2 | word1_count | word2_count | co_occurrences_count | pmi |
|---|---|---|---|---|---|
| w0 | w0 | 2 | 2 | 1 | 2.0794415416798357 |
| w1 | w1 | 10 | 10 | 1 | -1.1394342831883648 |
| w1 | w2 | 10 | 3 | 1 | 0.06453852113757116 |
| w1 | w3 | 10 | 7 | 2 | -0.08961215868968704 |
| w1 | w5 | 10 | 8 | 1 | -0.916290731874155 |
| w1 | w9 | 10 | 12 | 4 | 0.06453852113757116 |
| w2 | w3 | 3 | 7 | 1 | 0.4212134650763035 |
| w2 | w4 | 3 | 4 | 1 | 0.9808292530117262 |
| w3 | w4 | 7 | 4 | 1 | 0.13353139262452257 |
| w3 | w5 | 7 | 8 | 2 | 0.13353139262452257 |
| w3 | w6 | 7 | 7 | 1 | -0.42608439531090014 |
| w4 | w5 | 4 | 8 | 1 | 0.0 |
| w4 | w6 | 4 | 7 | 1 | 0.13353139262452257 |
| w5 | w6 | 8 | 7 | 2 | 0.13353139262452257 |
| w5 | w7 | 8 | 4 | 1 | 0.0 |
| w5 | w9 | 8 | 12 | 1 | -1.0986122886681098 |
| w6 | w7 | 7 | 4 | 1 | 0.13353139262452257 |
| w6 | w8 | 7 | 7 | 1 | -0.42608439531090014 |
| w6 | w9 | 7 | 12 | 1 | -0.9650808960435872 |
| w7 | w8 | 4 | 7 | 2 | 0.8266785731844679 |
| w8 | w8 | 7 | 7 | 1 | -0.42608439531090014 |
| w8 | w9 | 7 | 12 | 2 | -0.2719337154836418 |
| w9 | w9 | 12 | 12 | 2 | -0.8109302162163288 |
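As a consistency check, a few of these scores can be recomputed by hand from the count columns. The working below assumes that D is the sum of the co_occurrences_count column, which is 32 for this output.

```latex
\begin{aligned}
D &= \sum_{(x,y)} \#(x,y) = 32 \\
\mathrm{PMI}(w0, w0) &= \ln\frac{1 \times 32}{2 \times 2} = \ln 8 \approx 2.0794 \\
\mathrm{PMI}(w7, w8) &= \ln\frac{2 \times 32}{4 \times 7} = \ln\frac{16}{7} \approx 0.8267 \\
\mathrm{PMI}(w4, w5) &= \ln\frac{1 \times 32}{4 \times 8} = \ln 1 = 0
\end{aligned}
```

All three values agree with the pmi column, which supports reading the word counts as pair memberships rather than raw word frequencies.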
What's next
- Machine Learning Designer overview. Learn about the pipeline-based workflow.
- Component reference: overview of all components. Explore other text analysis and feature engineering components.