Word frequency statistics is a text analysis method that converts text into numerical features by counting word occurrences. The results provide foundational data for feature extraction in subsequent Natural Language Processing tasks, such as text classification, clustering, and information retrieval.
Algorithm
Word frequency is the number of times a word appears in a given text (here, a single document), and it serves as a simple measure of the word's importance in that text. To calculate it, the component first tokenizes the document content (docContent) into words. For each document, it then outputs the document ID (docId) together with each word, in the original input order. Finally, it counts the occurrences of each word within that document. The result exposes the lexical structure of the text and feeds downstream tasks such as text classification, topic modeling, and information retrieval.
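The steps above can be sketched in a few lines of Python. This is only an illustration of the counting logic, not the component's implementation: the function name `word_frequency` is hypothetical, and whitespace splitting stands in for whatever tokenization the component actually applies.

```python
from collections import Counter


def word_frequency(docs):
    """Return (multi_rows, triple_rows) for an iterable of (doc_id, content) pairs.

    Assumption: content is tokenized by splitting on whitespace.
    """
    multi_rows = []   # (id, word), one row per token, in original order
    triple_rows = []  # (id, word, count), one row per distinct word
    for doc_id, content in docs:
        words = content.split()
        multi_rows.extend((doc_id, word) for word in words)
        # Counter keeps words in first-seen order (Python 3.7+).
        counts = Counter(words)
        triple_rows.extend((doc_id, word, n) for word, n in counts.items())
    return multi_rows, triple_rows


docs = [("doc1", "cloud analytics cloud storage cloud analytics")]
multi, triple = word_frequency(docs)
for row in triple:
    print(row)
```

For this sample document, the word "cloud" gets a count of 3, "analytics" a count of 2, and "storage" a count of 1, while `multi` holds six rows, one per token.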
Input and output
Input port
Output ports
Component configuration
Method 1: Designer
In the Designer pipeline page, add the Word Frequency Statistics component and configure its parameters in the right pane.
| Parameter group | Parameter | Description |
| --- | --- | --- |
| Field Settings | Document ID column | Select the document ID column ( |
| Field Settings | Document content column | Select the document content column ( |
| Tuning | Number of cores | The number of cores for the job. |
| Tuning | Memory size per core | The memory allocated to each core, in MB. |
Method 2: PAI command
You can also configure the Word Frequency Statistics component by running a PAI command in the SQL Script component. For more information, see Run a PAI command in the SQL Script component.
pai -name doc_word_stat
-project algo_public
-DinputTableName=tdl_doc_test_split_word
-DdocId=docid
-DdocContent=content
-DoutputTableNameMulti=doc_test_stat_multi
-DoutputTableNameTriple=doc_test_stat_triple
-DinputTablePartitions="region=cctv_news"
-Dlifecycle=7
| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | The name of the input table. |
| docId | Yes | None | The name of the document ID column. You can specify only one column. |
| docContent | Yes | None | The name of the document content column. You can specify only one column. |
| outputTableNameMulti | Yes | None | The name of the output table that stores tokenized words in their original order. This table contains the document ID (from the |
| outputTableNameTriple | No | None | The name of the output table for word frequency counts. The output is in a triple format, containing the document ID, the word, and its frequency count. |
| inputTablePartitions | No | All partitions | The partitions of the input table for processing. The following formats are supported: Note: To specify multiple partition values, separate them with a comma, for example, |
| lifecycle | No | -1 | The lifecycle of the output table, in days. Specify a positive integer, or keep the default of -1 to make the table permanent. |
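As a rough aid for scripting, the sketch below assembles the command text from a dictionary of parameter values. The helper `build_doc_word_stat_command` is hypothetical and only formats a string; it does not submit a job. The set of required parameters and the lifecycle check simply mirror the table above.

```python
# Parameters marked "Yes" in the Required column of the table above.
REQUIRED_PARAMS = ("inputTableName", "docId", "docContent", "outputTableNameMulti")


def build_doc_word_stat_command(params):
    """Format a doc_word_stat PAI command from a dict of parameter values.

    Hypothetical helper for illustration only; it builds text and runs nothing.
    """
    missing = [name for name in REQUIRED_PARAMS if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    # lifecycle: a positive integer, or -1 for a permanent table.
    lifecycle = params.get("lifecycle", -1)
    if not (isinstance(lifecycle, int) and (lifecycle == -1 or lifecycle > 0)):
        raise ValueError("lifecycle must be a positive integer or -1")

    lines = ["pai -name doc_word_stat", "-project algo_public"]
    lines += [f"-D{name}={value}" for name, value in params.items()]
    return "\n".join(lines)


command = build_doc_word_stat_command({
    "inputTableName": "tdl_doc_test_split_word",
    "docId": "docid",
    "docContent": "content",
    "outputTableNameMulti": "doc_test_stat_multi",
    "outputTableNameTriple": "doc_test_stat_triple",
    "lifecycle": 7,
})
print(command)
```

Values are inserted exactly as given, so any quoting a value needs (as in the partition example earlier in this section) has to be part of the string you pass in.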
Component outputs
- Output port 1: Triple output
  The triple output contains three columns: id, word, and count. These columns represent the document ID, the tokenized word, and its frequency count, respectively. For example, the word "cloud" might have a count of 3, "analytics" a count of 2, and other words a count of 1.
- Output port 2: Multi-row output
  The multi-row output contains two columns: id and word, representing the document ID and the corresponding tokenized word. It lists words sequentially as they appear in the document and does not count occurrences, so a word may appear multiple times for the same document. This output format is designed primarily for compatibility with the Word2Vec component.
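To make the shape of the multi-row output concrete, the sketch below regroups (id, word) rows into one ordered token list per document, which is the kind of sequence that word-embedding training typically consumes. The function name and sample rows are hypothetical, and how the Word2Vec component actually reads the table is not described here.

```python
from collections import defaultdict


def group_tokens_by_document(multi_rows):
    """Collect (id, word) rows into {id: [word, ...]}, preserving row order."""
    sequences = defaultdict(list)
    for doc_id, word in multi_rows:
        sequences[doc_id].append(word)
    return dict(sequences)


# Sample rows in the multi-row layout: repeated words stay as separate rows.
rows = [
    ("d1", "cloud"),
    ("d1", "analytics"),
    ("d1", "cloud"),
    ("d2", "storage"),
]
sequences = group_tokens_by_document(rows)
print(sequences)
```

Because repeated words are kept as separate rows, the word order and repetition of each document survive the round trip, which a count-only table cannot provide.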