Word frequency statistics


Word frequency statistics is a text analysis method that converts text into numerical features by counting word occurrences. The results provide foundational data for feature extraction in subsequent Natural Language Processing tasks, such as text classification, clustering, and information retrieval.

Algorithm

Word frequency is the number of times a word appears in a given corpus, and it serves as a basic measure of the word's importance in the text. To calculate word frequency, the component first tokenizes the document content (docContent) into words. For each document, it then outputs the document ID (docId) together with the words in their input order, and counts the occurrences of each word within that document. The result reveals the lexical structure of the text and supports downstream tasks such as text classification, topic modeling, and information retrieval.
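
The steps above can be illustrated with a minimal Python sketch. It is not the component's implementation: the function name doc_word_stat_sketch, the sample rows, and the assumption that words in docContent are separated by whitespace are all illustrative.

```python
from collections import Counter


def doc_word_stat_sketch(rows):
    """Illustrative only. rows is an iterable of (doc_id, doc_content) pairs.

    Assumes doc_content is already segmented, with words separated by
    whitespace (an assumption, not confirmed by the component reference).
    """
    multi = []   # (id, word) rows, one per word, in input order
    triple = []  # (id, word, count) rows, one per distinct word per document
    for doc_id, content in rows:
        words = content.split()
        multi.extend((doc_id, word) for word in words)
        # Counter keeps words in order of first appearance (Python 3.7+).
        counts = Counter(words)
        triple.extend((doc_id, word, n) for word, n in counts.items())
    return multi, triple


sample_rows = [
    ("d1", "cloud analytics cloud storage cloud analytics"),
    ("d2", "storage cloud"),
]
multi, triple = doc_word_stat_sketch(sample_rows)
for row in triple:
    print(row)
```

For document d1, this sketch yields a count of 3 for "cloud", 2 for "analytics", and 1 for "storage". The row order of the real component's output tables is not specified here and may differ.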

Input and output

Input port

Split Word

Output ports

Component configuration

Method 1: Designer

In the Designer pipeline page, add the Word Frequency Statistics component and configure its parameters in the right pane.

| Parameter group | Parameter | Description |
| --- | --- | --- |
| Field Settings | Document ID column | Select the document ID column (docId). |
| Field Settings | Document content column | Select the document content column (docContent). The text in this column is tokenized and used to calculate word frequencies. |
| Tuning | Number of cores | The number of cores for the job. |
| Tuning | Memory size per core | The memory allocated to each core, in MB. |

Method 2: PAI command

You can also configure the Word Frequency Statistics component by running a PAI command in the SQL Script component. For more information, see Run a PAI command in the SQL Script component.

pai -name doc_word_stat
    -project algo_public
    -DinputTableName=tdl_doc_test_split_word
    -DdocId=docid
    -DdocContent=content
    -DoutputTableNameMulti=doc_test_stat_multi
    -DoutputTableNameTriple=doc_test_stat_triple
    -DinputTablePartitions="region=cctv_news"
    -Dlifecycle=7

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | The name of the input table. |
| docId | Yes | None | The name of the document ID column. You can specify only one column. |
| docContent | Yes | None | The name of the document content column. You can specify only one column. |
| outputTableNameMulti | Yes | None | The name of the output table that stores tokenized words in their original order. This table contains the document ID (from the docId column) and the corresponding tokenized data. Words are listed sequentially as they appear in the original document. |
| outputTableNameTriple | No | None | The name of the output table for word frequency counts. The output is in a triple format, containing the document ID, the word, and its frequency count. |
| inputTablePartitions | No | All partitions | The partitions of the input table to process. Supported formats: partition_name=value, or name1=value1/name2=value2 for multi-level partitions. Note: To specify multiple partition values, separate them with a comma, for example, name1=value1,value2. |
| lifecycle | No | -1 | The lifecycle of the output table, in days. The value must be a positive integer or -1. A value of -1 indicates that the table is permanent. |
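
As a rough illustration of how these parameters fit together, the following Python sketch assembles the command text and checks the required parameters before the command is pasted into a SQL Script component. The helper build_doc_word_stat_command is hypothetical and not part of any PAI SDK. It only formats a string, and the lifecycle check reflects one reading of the parameter description.

```python
# Parameter names and required flags are taken from the parameter list above.
REQUIRED = ("inputTableName", "docId", "docContent", "outputTableNameMulti")
OPTIONAL = ("outputTableNameTriple", "inputTablePartitions", "lifecycle")


def build_doc_word_stat_command(**params):
    """Hypothetical helper: returns the PAI command as text, nothing more."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    unknown = [name for name in params if name not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(unknown))

    lifecycle = params.get("lifecycle")
    if lifecycle is not None:
        # Assumption: a positive number of days, or -1 for a permanent table.
        if not isinstance(lifecycle, int) or (lifecycle != -1 and lifecycle <= 0):
            raise ValueError("lifecycle must be a positive integer or -1")

    parts = ["pai -name doc_word_stat", "-project algo_public"]
    for name in REQUIRED + OPTIONAL:
        if name in params:
            value = params[name]
            if name == "inputTablePartitions":
                value = '"{}"'.format(value)  # quoted, as in the example command
            parts.append("-D{}={}".format(name, value))
    return "\n    ".join(parts)


command = build_doc_word_stat_command(
    inputTableName="tdl_doc_test_split_word",
    docId="docid",
    docContent="content",
    outputTableNameMulti="doc_test_stat_multi",
    outputTableNameTriple="doc_test_stat_triple",
    inputTablePartitions="region=cctv_news",
    lifecycle=7,
)
print(command)
```

With the values shown, the printed text reproduces the example command, so the sketch is mainly useful for catching a missing required parameter or an invalid lifecycle value before submission.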

Component outputs

  • Output port 1: Triple output

    The triple output contains three columns: id, word, and count. These columns represent the document ID, the tokenized word, and its frequency count, respectively. For example, the word "cloud" might have a count of 3, "analytics" a count of 2, and other words a count of 1.

  • Output port 2: Multi-row output

    The multi-row output contains two columns: id and word. These columns represent the document ID and the corresponding tokenized word. Words are listed sequentially as they appear in the document. Occurrences are not counted, so a word may appear multiple times for the same document. This output format is designed primarily for compatibility with the Word2Vec component.
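
A downstream consumer of the multi-row output often needs one word sequence per document. The following minimal Python sketch shows that regrouping. The function name rows_to_sequences and the sample rows are illustrative, and the sketch assumes the rows are read back in the order in which they were written, which a distributed table read may not guarantee.

```python
def rows_to_sequences(rows):
    """Group (id, word) rows into one ordered word list per document.

    Illustrative only: relies on rows arriving in their stored order.
    """
    sequences = {}
    for doc_id, word in rows:
        sequences.setdefault(doc_id, []).append(word)
    return sequences


multi_rows = [
    ("d1", "cloud"),
    ("d1", "analytics"),
    ("d1", "cloud"),
    ("d2", "storage"),
    ("d2", "cloud"),
]
sequences = rows_to_sequences(multi_rows)
for doc_id, words in sequences.items():
    print(doc_id, words)
```

Repeated words are kept in each sequence, which is the property that distinguishes this output from the triple output.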