The Split Word component in PAI Designer tokenizes text in a specified column using the Alibaba Word Segmenter (AliWS). Tokens in the output are separated by spaces. If you enable part-of-speech (POS) tagging or semantic tagging, the output also includes POS tags (separated by /) and semantic tags (separated by |).
The component supports two tokenizers:
- TAOBAO_CHN
- INTERNET_CHN
Output format
The output format depends on which tagging options you enable:
| Configuration | Output format | Example token |
|---|---|---|
| Default (no POS, no semantic) | token | Beijing |
| POS tagging enabled | token/POS_tag | Beijing/ns |
| Semantic tagging enabled | token/POS_tag\|semantic_tag | Beijing/ns\|place |
Tokens are separated by spaces. When both POS and semantic tagging are enabled, each token follows the token/POS_tag|semantic_tag format.
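To make the format concrete, the following is a minimal Python sketch of how downstream code might parse one output cell. It is not part of the component. The names `Token` and `parse_split_word_output` are hypothetical, and the sketch assumes that tags never contain `/` and that the caller knows whether tagging was enabled.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    text: str
    pos: Optional[str]       # POS tag, or None if absent
    semantic: Optional[str]  # semantic tag, or None if absent


def parse_split_word_output(line: str, tagged: bool = True) -> List[Token]:
    """Parse a space-separated output cell into tokens (illustrative helper)."""
    tokens = []
    for item in line.split():
        if not tagged:
            # Untagged output: the whole item is the token text.
            tokens.append(Token(item, None, None))
            continue
        # Split on the LAST "/" so a token that itself contains "/" survives.
        text, sep, tags = item.rpartition("/")
        if not sep:
            tokens.append(Token(item, None, None))
            continue
        # The part after "/" is POS_tag, optionally followed by "|semantic_tag".
        pos, _, semantic = tags.partition("|")
        tokens.append(Token(text, pos or None, semantic or None))
    return tokens


print(parse_split_word_output("Beijing/ns|place"))
```

Splitting on the last `/` rather than the first is a deliberate choice: it keeps punctuation tokens such as a literal `/` intact when a POS tag is appended.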
Configure the Split Word component
Configure the component using the GUI or a PAI command.
Configure using the GUI
On the workflow canvas in Designer, click the Split Word component to open its configuration panel.
| Tab | Parameter | Description |
|---|---|---|
| Fields Setting | Column Name | The column to tokenize. |
| Parameters Setting | Recognition Options | The entity types to detect. Valid values: Detect simple entities, Detect person names, Detect organization names, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters. Default: Detect simple entities, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters. |
| Parameters Setting | Merge Options | The token types to merge into a single retrieval unit. Valid values: Merge Chinese numbers, Merge Arabic numerals, Merge Chinese dates, and Merge Chinese times. Default: Merge Arabic numerals. |
| Parameters Setting | Filter | The tokenizer type. Valid values: TAOBAO_CHN and INTERNET_CHN. Default: TAOBAO_CHN. |
| Parameters Setting | Pos Tagger | Whether to add POS tags to each token. Enabled by default. |
| Parameters Setting | Semantic Tagger | Whether to add semantic tags to each token. Disabled by default. |
| Parameters Setting | Filter tokens that are numbers | Whether to exclude numeric tokens from the output. Disabled by default. |
| Parameters Setting | Filter tokens that are all-English words | Whether to exclude tokens that consist entirely of English letters. Disabled by default. |
| Parameters Setting | Filter tokens that are punctuation marks | Whether to exclude punctuation tokens from the output. Disabled by default. |
| Execution Tuning | Number of cores | The number of worker cores. Automatically allocated by the system. |
| Execution Tuning | Memory per core | The memory per core, in MB. Automatically allocated by the system. |
Configure using a PAI command
Use the SQL Script component to run PAI commands.
The following command shows all available parameters with their default values:
pai -name split_word_model
-project algo_public
-DoutputModelName=aliws_model
-DcolName=content
-Dtokenizer=TAOBAO_CHN
-DenableDfa=true
-DenablePersonNameTagger=false
-DenableOrgnizationTagger=false
-DenablePosTagger=false
-DenableTelephoneRetrievalUnit=true
-DenableTimeRetrievalUnit=true
-DenableDateRetrievalUnit=true
-DenableNumberLetterRetrievalUnit=true
-DenableChnNumMerge=false
-DenableNumMerge=true
-DenableChnTimeMerge=false
-DenableChnDateMerge=false
-DenableSemanticTagger=false
Parameters:
| Parameter | Required | Description | Default |
|---|---|---|---|
| inputTableName | Yes | The input table name. | None |
| inputTablePartitions | No | The partitions to tokenize. Format: partition_name=value. For multi-level partitions: name1=value1/name2=value2. Separate multiple partitions with commas. | All partitions |
| selectedColNames | Yes | The columns to tokenize. Separate multiple column names with commas. | None |
| dictTableName | No | A custom dictionary table. The table must have one column, with one word per row. | None |
| tokenizer | No | The tokenizer type. Valid values: TAOBAO_CHN, INTERNET_CHN. | TAOBAO_CHN |
| enableDfa | No | Whether to detect simple entities. | True |
| enablePersonNameTagger | No | Whether to detect person names. | False |
| enableOrgnizationTagger | No | Whether to detect organization names. | False |
| enablePosTagger | No | Whether to add POS tags. | False |
| enableTelephoneRetrievalUnit | No | Whether to detect phone numbers. | True |
| enableTimeRetrievalUnit | No | Whether to detect time expressions. | True |
| enableDateRetrievalUnit | No | Whether to detect dates. | True |
| enableNumberLetterRetrievalUnit | No | Whether to detect numbers and letters. | True |
| enableChnNumMerge | No | Whether to merge Chinese numbers into a single retrieval unit. | False |
| enableNumMerge | No | Whether to merge Arabic numerals into a single retrieval unit. | True |
| enableChnTimeMerge | No | Whether to merge Chinese time expressions into a single semantic unit. | False |
| enableChnDateMerge | No | Whether to merge Chinese date expressions into a single semantic unit. | False |
| enableSemanticTagger | No | Whether to add semantic tags. | False |
| outputTableName | Yes | The output table name. | None |
| outputTablePartition | No | The output table partition name. | None |
| coreNum | No | The number of workers. Must be a positive integer in [1, 9999]. Takes effect only when memSizePerCore is also set. | Automatically allocated |
| memSizePerCore | No | The memory per worker, in MB. Must be a positive integer in [1024, 65536]. | Automatically allocated |
| lifecycle | No | The lifecycle of the output table. Must be a positive integer. | None |
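As an illustration of how these parameters combine, the following Python sketch assembles a `split_word` command string and formats `inputTablePartitions` as described in the table. The helpers `format_partitions` and `build_split_word_command` are hypothetical and are not provided by PAI. The sketch only builds text for the SQL Script component and does not run anything.

```python
from typing import Dict, Sequence, Union

ParamValue = Union[str, int, bool]


def format_partitions(partitions: Sequence[Dict[str, str]]) -> str:
    """Join partition levels with "/" and separate partitions with ","."""
    return ",".join(
        "/".join(f"{name}={value}" for name, value in spec.items())
        for spec in partitions
    )


def build_split_word_command(params: Dict[str, ParamValue]) -> str:
    """Build a PAI command string from a parameter mapping (illustrative)."""
    required = ("inputTableName", "selectedColNames", "outputTableName")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")

    lines = ["pai -name split_word", "-project algo_public"]
    for name, value in params.items():
        # Render Python booleans in the lowercase form used by PAI commands.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"-D{name}={value}")
    return "\n".join(lines)


print(build_split_word_command({
    "inputTableName": "pai_aliws_test",
    "selectedColNames": "content",
    "outputTableName": "doc_test_split_word",
    "enablePosTagger": True,
}))
```

Checking the three required parameters before building the string surfaces a missing table or column name early, instead of at job submission.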
For standard tables, leave coreNum and memSizePerCore unset. The component calculates them automatically.
If resources are limited, use the following function to calculate appropriate values:
def CalcCoreNumAndMem(row, col, kOneCoreDataSize=1024):
    """Calculate the number of workers and memory per worker.

    Args:
        row: The number of rows in the input table.
        col: The number of columns in the input table.
        kOneCoreDataSize: The data volume per worker, in MB. Must be a positive integer. Default: 1024.

    Returns:
        coreNum, memSizePerCore

    Example:
        coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 99, kOneCoreDataSize=2048)
    """
    kMBytes = 1024.0 * 1024.0
    # Calculate the number of workers based on data volume (at least 1).
    coreNum = max(1, int(row * col * 1000 / kMBytes / kOneCoreDataSize))
    # Memory per worker = twice the data volume per worker, with a floor of 1024 MB.
    memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
    return coreNum, memSizePerCore
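The function above does not cap its results. The following sketch is a variant that also clamps both values to the ranges documented for coreNum ([1, 9999]) and memSizePerCore ([1024, 65536]). The name `calc_core_num_and_mem_clamped` is hypothetical, and clamping is an assumption about how out-of-range results should be handled, not documented component behavior.

```python
def calc_core_num_and_mem_clamped(row, col, k_one_core_data_size=1024):
    """Same formula as CalcCoreNumAndMem, clamped to the documented ranges."""
    k_mbytes = 1024.0 * 1024.0
    # Same sizing formula as above: workers scale with rows * columns.
    core_num = int(row * col * 1000 / k_mbytes / k_one_core_data_size)
    # Memory per worker is twice the data volume per worker.
    mem_size_per_core = int(k_one_core_data_size * 2)
    # Clamp to the documented ranges (assumption: clamp rather than reject).
    core_num = min(9999, max(1, core_num))
    mem_size_per_core = min(65536, max(1024, mem_size_per_core))
    return core_num, mem_size_per_core


# The docstring example: 1000 rows x 99 columns with 2048 MB per worker.
print(calc_core_num_and_mem_clamped(1000, 99, k_one_core_data_size=2048))  # (1, 4096)
```

For the small table in the docstring example, the raw worker count rounds down to 0, so the lower bound of 1 applies. Because coreNum takes effect only when memSizePerCore is also set, pass both values together.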
Example
The following example tokenizes a text column using the default settings.
Step 1: Create the input table.
create table pai_aliws_test
as select
1 as id,
'Today is a good day. The weather is fine and sunny.' as content;
Step 2: Run the Split Word component.
pai -name split_word
-project algo_public
-DinputTableName=pai_aliws_test
-DselectedColNames=content
-DoutputTableName=doc_test_split_word
Input:
| id | content |
|---|---|
| 1 | Today is a good day. The weather is fine and sunny. |
Output behavior:
- The component tokenizes the specified column and leaves all other columns unchanged.
- With a custom dictionary, tokenization is based on both the dictionary and context. Results may not strictly follow the custom dictionary.
Limitations
- Only the TAOBAO_CHN and INTERNET_CHN tokenizers are supported.