Word frequency statistics - Platform for AI (PAI) - Alibaba Cloud Help Center

Word frequency statistics is a text analysis method that converts text into numerical features by counting word occurrences. The results provide foundational data for feature extraction in subsequent Natural Language Processing tasks, such as text classification, clustering, and information retrieval.

Algorithm

Word frequency is the number of times a word appears in a specific corpus, and it serves as a simple measure of the word's importance in the text. To calculate word frequency, the component first splits the document content (docContent) into words. Then, for each document, it outputs the document ID (docId) together with the document's words in their original order. Finally, it counts the occurrences of each word within that document. The result reveals the lexical structure of the text and supports subsequent analysis tasks such as text classification, topic modeling, and information retrieval.
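The steps above can be sketched in a few lines of Python. This is only an illustration of the counting logic, not the component's implementation. It assumes that each document's content is already a string of space-separated tokens, and the function name `word_frequency` and the sample documents are made up for this example.

```python
from collections import Counter


def word_frequency(docs):
    """Return (multi_rows, triple_rows) for an iterable of (doc_id, content).

    Assumption: content is a string of space-separated tokens.
    """
    multi_rows = []   # (doc_id, word), one row per token, in original order
    triple_rows = []  # (doc_id, word, count), one row per distinct word
    for doc_id, content in docs:
        words = content.split()
        multi_rows.extend((doc_id, word) for word in words)
        # Counter keeps first-seen order, so the triples follow the text.
        counts = Counter(words)
        triple_rows.extend((doc_id, word, n) for word, n in counts.items())
    return multi_rows, triple_rows


# Hypothetical sample input.
docs = [("d1", "cloud data cloud"), ("d2", "data")]
multi, triple = word_frequency(docs)
print(multi)
print(triple)
```

The two lists correspond to the two output tables described later on this page: one row per token, and one row per distinct word with its count.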

Input and output

Input port

Split Word

Output ports

Component configuration

Method 1: Designer

On the pipeline page in Designer, add the Word Frequency Statistics component and configure its parameters in the right pane.

Parameter group	Parameter	Description
Field Settings	Document ID column	Select the document ID column (`docId`).
Field Settings	Document content column	Select the document content column (`docContent`). The text in this column is tokenized and used to calculate word frequencies.
Tuning	Number of cores	The number of cores for the job.
Tuning	Memory size per core	The memory allocated to each core, in MB.

Method 2: PAI command

You can also configure the Word Frequency Statistics component by running a PAI command in the SQL Script component. For more information, see Run a PAI command in the SQL Script component.

pai -name doc_word_stat
    -project algo_public
    -DinputTableName=tdl_doc_test_split_word
    -DdocId=docid
    -DdocContent=content
    -DoutputTableNameMulti=doc_test_stat_multi
    -DoutputTableNameTriple=doc_test_stat_triple
    -DinputTablePartitions="region=cctv_news"
    -Dlifecycle=7

Parameter	Required	Default	Description
inputTableName	Yes	None	The name of the input table.
docId	Yes	None	The name of the document ID column. You can specify only one column.
docContent	Yes	None	The name of the document content column. You can specify only one column.
outputTableNameMulti	Yes	None	The name of the output table that stores tokenized words in their original order. This table contains the document ID (from the `docId` column) and the corresponding tokenized data. Words are listed sequentially as they appear in the original document.
outputTableNameTriple	No	None	The name of the output table for word frequency counts. The output is in a triple format, containing the document ID, the word, and its frequency count.
inputTablePartitions	No	All partitions	The partitions of the input table to process. Supported formats: `partition_name=value` for a single partition, and `name1=value1/name2=value2` for multi-level partitions. Note: To specify multiple partition values, separate them with commas, for example, `name1=value1,value2`.
lifecycle	No	-1	The lifecycle of the output table, in days. Set this to a positive integer, or to -1 (the default) to make the table permanent.
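When the command is generated from a script, the required and optional parameters in the table can be assembled programmatically. The following is a minimal sketch. The helper `build_doc_word_stat_command` is hypothetical and is not part of PAI; it only formats the parameters listed above into the command text shown earlier.

```python
def build_doc_word_stat_command(input_table, doc_id, doc_content, output_multi,
                                output_triple=None, partitions=None,
                                lifecycle=None):
    """Format a doc_word_stat PAI command string (hypothetical helper)."""
    # Per the parameter table: a positive integer, or -1 for a permanent table.
    if lifecycle is not None and not (lifecycle == -1 or lifecycle >= 1):
        raise ValueError("lifecycle must be a positive integer or -1")

    # Required parameters.
    args = [
        ("inputTableName", input_table),
        ("docId", doc_id),
        ("docContent", doc_content),
        ("outputTableNameMulti", output_multi),
    ]
    # Optional parameters are emitted only when a value is supplied.
    if output_triple is not None:
        args.append(("outputTableNameTriple", output_triple))
    if partitions is not None:
        args.append(("inputTablePartitions", f'"{partitions}"'))
    if lifecycle is not None:
        args.append(("lifecycle", lifecycle))

    lines = ["pai -name doc_word_stat", "    -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in args]
    return "\n".join(lines)


command = build_doc_word_stat_command(
    "tdl_doc_test_split_word", "docid", "content", "doc_test_stat_multi",
    output_triple="doc_test_stat_triple",
    partitions="region=cctv_news",
    lifecycle=7,
)
print(command)
```

With these arguments, the helper reproduces the example command above. Omitted optional arguments are left out of the command, so the component falls back to the defaults in the table.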

Component outputs

Output port 1: Triple output

The triple output contains three columns: id, word, and count. These columns represent the document ID, the tokenized word, and its frequency count, respectively. For example, the word "cloud" might have a count of 3, "analytics" a count of 2, and other words a count of 1.

Output port 2: Multi-row output

This output table contains two columns: id and word, representing the document ID and the corresponding tokenized word.

This port outputs a table that lists words sequentially as they appear in the document. It does not count word occurrences, so a word may appear multiple times for the same document. This output format is designed primarily for compatibility with the Word2Vec component.
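To show how a downstream consumer might use the multi-row format, the sketch below groups (id, word) rows back into one ordered token list per document, which is the shape that embedding training typically expects. It assumes the rows are read in their original order; the function name `rows_to_sequences` and the sample rows are hypothetical.

```python
def rows_to_sequences(rows):
    """Group (doc_id, word) rows into {doc_id: [word, ...]}.

    Assumption: rows arrive in the order the words appear in each document.
    """
    sequences = {}
    for doc_id, word in rows:
        # Repeated words are kept, because this format does not aggregate.
        sequences.setdefault(doc_id, []).append(word)
    return sequences


# Hypothetical multi-row output for two documents.
rows = [("d1", "cloud"), ("d1", "data"), ("d1", "cloud"), ("d2", "data")]
sequences = rows_to_sequences(rows)
print(sequences)
```

Because repeated words and their positions are preserved, the word context that the triple output discards remains available here.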