Split Word - Platform for AI (PAI) - Alibaba Cloud Help Center

The Split Word component in PAI Designer tokenizes text in a specified column using the Alibaba Word Segmenter (AliWS). Tokens in the output are separated by spaces. If you enable part-of-speech (POS) tagging or semantic tagging, the output also includes POS tags (separated by /) and semantic tags (separated by |).

The component supports two tokenizers:

TAOBAO_CHN
INTERNET_CHN

Output format

The output format depends on which tagging options you enable:

Configuration	Output format	Example token
Default (no POS, no semantic)	`token`	`Beijing`
POS tagging enabled	`token/POS_tag`	`Beijing/ns`
POS and semantic tagging enabled	`token/POS_tag|semantic_tag`	`Beijing/ns|place`

Tokens are separated by spaces. When both POS and semantic tagging are enabled, each token follows the token/POS_tag|semantic_tag format.
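
As a rough illustration of the format above, the following Python sketch splits one output string back into tokens, POS tags, and semantic tags. The names `Token` and `parse_split_word_output` are hypothetical helpers, not part of PAI, and the sketch assumes that tags themselves never contain `/` or `|`.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    """One token from the Split Word output (hypothetical helper type)."""
    text: str
    pos: Optional[str]
    semantic: Optional[str]


def parse_split_word_output(line: str,
                            has_pos: bool = False,
                            has_semantic: bool = False) -> List[Token]:
    """Parse a space-separated output string into tokens.

    Assumption: tags never contain '/' or '|', so splitting from the
    right keeps any '/' or '|' that belongs to the token text itself.
    """
    tokens = []
    for item in line.split(" "):
        if not item:
            continue  # skip empty pieces caused by repeated spaces
        pos = None
        semantic = None
        if has_semantic and "|" in item:
            item, semantic = item.rsplit("|", 1)
        if has_pos and "/" in item:
            item, pos = item.rsplit("/", 1)
        tokens.append(Token(item, pos, semantic))
    return tokens
```

For the example token in the table, `parse_split_word_output("Beijing/ns|place", has_pos=True, has_semantic=True)` returns a single token with text `Beijing`, POS tag `ns`, and semantic tag `place`. The flags must match the tagging options used when the component ran, because the output itself does not say which tags are present.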

Configure the Split Word component

Configure the component using the GUI or a PAI command.

Configure using the GUI

On the workflow canvas in Designer, click the Split Word component to open its configuration panel.

Tab	Parameter	Description
Fields Setting	Column Name	The column to tokenize.
Parameters Setting	Recognition Options	The entity types to detect. Valid values: Detect simple entities, Detect person names, Detect organization names, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters. Default: Detect simple entities, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters.
Parameters Setting	Merge Options	The token types to merge into a single retrieval unit. Valid values: Merge Chinese numbers, Merge Arabic numerals, Merge Chinese dates, and Merge Chinese times. Default: Merge Arabic numerals.
Parameters Setting	Filter	The tokenizer type. Valid values: TAOBAO_CHN and INTERNET_CHN. Default: TAOBAO_CHN.
Parameters Setting	Pos Tagger	Whether to add POS tags to each token. Enabled by default.
Parameters Setting	Semantic Tagger	Whether to add semantic tags to each token. Disabled by default.
Parameters Setting	Filter tokens that are numbers	Whether to exclude numeric tokens from the output. Disabled by default.
Parameters Setting	Filter tokens that are all-English words	Whether to exclude tokens that consist entirely of English letters. Disabled by default.
Parameters Setting	Filter tokens that are punctuation marks	Whether to exclude punctuation tokens from the output. Disabled by default.
Execution Tuning	Number of cores	The number of worker cores. Automatically allocated by the system.
Execution Tuning	Memory per core	The memory per core, in MB. Automatically allocated by the system.
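
To make the three filter options more concrete, the sketch below approximates them in Python on an already tokenized list. It is only an illustration of the described behavior: `filter_tokens` is a hypothetical helper, and the exact rules AliWS uses to classify a token as a number, an English word, or punctuation are not documented here and may differ.

```python
import unicodedata
from typing import List


def _is_punctuation(token: str) -> bool:
    # Assumption: a token counts as punctuation when every character
    # falls in a Unicode punctuation category (Pc, Pd, Po, ...).
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def filter_tokens(tokens: List[str],
                  drop_numbers: bool = False,
                  drop_english: bool = False,
                  drop_punctuation: bool = False) -> List[str]:
    """Approximate the three GUI filter options (all off by default)."""
    kept = []
    for token in tokens:
        if drop_numbers and token.isdigit():
            continue  # token made up only of digits
        if drop_english and token.isascii() and token.isalpha():
            continue  # token made up only of ASCII letters
        if drop_punctuation and _is_punctuation(token):
            continue
        kept.append(token)
    return kept
```

With all three options on, `filter_tokens(["Today", "is", "2024", "好", ",", "."], True, True, True)` keeps only `"好"`.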

Configure using a PAI command

Use the SQL Script component to run PAI commands.

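The following command sets the tokenizer, recognition, merge, and tagging options explicitly. The values match the defaults in the parameter table below, except `enableSemanticTagger`, which is set to `true` here (default: `False`):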

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model
    -DcolName=content
    -Dtokenizer=TAOBAO_CHN
    -DenableDfa=true
    -DenablePersonNameTagger=false
    -DenableOrgnizationTagger=false
    -DenablePosTagger=false
    -DenableTelephoneRetrievalUnit=true
    -DenableTimeRetrievalUnit=true
    -DenableDateRetrievalUnit=true
    -DenableNumberLetterRetrievalUnit=true
    -DenableChnNumMerge=false
    -DenableNumMerge=true
    -DenableChnTimeMerge=false
    -DenableChnDateMerge=false
    -DenableSemanticTagger=true

Parameters:

Parameter	Required	Description	Default
`inputTableName`	Yes	The input table name.	—
`inputTablePartitions`	No	The partitions to tokenize. Format: `partition_name=value`. For multi-level partitions: `name1=value1/name2=value2`. Separate multiple partitions with commas.	All partitions
`selectedColNames`	Yes	The columns to tokenize. Separate multiple column names with commas.	—
`dictTableName`	No	A custom dictionary table. The table must have one column, with one word per row.	—
`tokenizer`	No	The tokenizer type. Valid values: `TAOBAO_CHN`, `INTERNET_CHN`.	`TAOBAO_CHN`
`enableDfa`	No	Whether to detect simple entities.	`True`
`enablePersonNameTagger`	No	Whether to detect person names.	`False`
`enableOrgnizationTagger`	No	Whether to detect organization names.	`False`
`enablePosTagger`	No	Whether to add POS tags.	`False`
`enableTelephoneRetrievalUnit`	No	Whether to detect phone numbers.	`True`
`enableTimeRetrievalUnit`	No	Whether to detect time expressions.	`True`
`enableDateRetrievalUnit`	No	Whether to detect dates.	`True`
`enableNumberLetterRetrievalUnit`	No	Whether to detect numbers and letters.	`True`
`enableChnNumMerge`	No	Whether to merge Chinese numbers into a single retrieval unit.	`False`
`enableNumMerge`	No	Whether to merge Arabic numerals into a single retrieval unit.	`True`
`enableChnTimeMerge`	No	Whether to merge Chinese time expressions into a single semantic unit.	`False`
`enableChnDateMerge`	No	Whether to merge Chinese date expressions into a single semantic unit.	`False`
`enableSemanticTagger`	No	Whether to add semantic tags.	`False`
`outputTableName`	Yes	The output table name.	—
`outputTablePartition`	No	The output table partition name.	—
`coreNum`	No	The number of workers. Must be a positive integer in [1, 9999]. Takes effect only when `memSizePerCore` is also set.	Automatically allocated
`memSizePerCore`	No	The memory per worker, in MB. Must be a positive integer in [1024, 65536].	Automatically allocated
`lifecycle`	No	The lifecycle of the output table. Must be a positive integer.	—
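
When commands are generated from scripts, it can help to check the constraints in the table before submitting. The following Python sketch assembles a `split_word` command string from a dictionary of parameters. `build_split_word_command` is a hypothetical helper, not a PAI API; it only formats text and does not run anything.

```python
REQUIRED_PARAMS = ("inputTableName", "selectedColNames", "outputTableName")


def build_split_word_command(params: dict) -> str:
    """Format a `pai -name split_word` command from a parameter dict."""
    missing = [name for name in REQUIRED_PARAMS if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    # Range checks taken from the parameter table.
    if "coreNum" in params:
        if not 1 <= int(params["coreNum"]) <= 9999:
            raise ValueError("coreNum must be in [1, 9999]")
        # Stricter than the component: the table says coreNum is simply
        # ignored without memSizePerCore, but failing early is clearer.
        if "memSizePerCore" not in params:
            raise ValueError("coreNum requires memSizePerCore to be set")
    if "memSizePerCore" in params:
        if not 1024 <= int(params["memSizePerCore"]) <= 65536:
            raise ValueError("memSizePerCore must be in [1024, 65536]")

    lines = ["pai -name split_word", "    -project algo_public"]
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # booleans as in the sample command
        lines.append(f"    -D{name}={value}")
    return "\n".join(lines)
```

Passing only the three required parameters produces a command equivalent to the one in the Example section. Optional switches such as `enablePosTagger` can be added to the dictionary as Python booleans.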

For standard tables, leave coreNum and memSizePerCore unset. The component calculates them automatically.

If resources are limited, use the following function to calculate appropriate values:

def CalcCoreNumAndMem(row, col, kOneCoreDataSize=1024):
    """Calculate the number of workers and memory per worker.
       Args:
           row: The number of rows in the input table.
           col: The number of columns in the input table.
           kOneCoreDataSize: The data volume per worker, in MB. Must be a positive integer. Default: 1024.
       Returns:
           coreNum, memSizePerCore
       Example:
           coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 99, kOneCoreDataSize=2048)
    """
    kMBytes = 1024.0 * 1024.0
    # Estimate the data volume in MB (the formula assumes roughly 1000 bytes
    # per cell), then divide by the data volume per worker. Use at least 1 worker.
    coreNum = max(1, int(row * col * 1000 / kMBytes / kOneCoreDataSize))
    # Memory per worker = twice the data volume per worker, at least 1024 MB.
    memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
    return coreNum, memSizePerCore
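
To see how the formula behaves, the sketch below restates it under a hypothetical name, `estimate_resources`, and additionally clamps the results to the ranges documented for `coreNum` and `memSizePerCore`. The clamping is an assumption added here; the function above does not apply it.

```python
def estimate_resources(rows, cols, data_mb_per_worker=1024):
    """Same formula as CalcCoreNumAndMem, plus clamping to documented ranges."""
    bytes_per_mb = 1024.0 * 1024.0
    # Roughly 1000 bytes per cell, as in the original formula.
    core_num = int(rows * cols * 1000 / bytes_per_mb / data_mb_per_worker)
    mem_per_core = int(data_mb_per_worker * 2)
    # Clamp to [1, 9999] workers and [1024, 65536] MB per worker.
    core_num = min(9999, max(1, core_num))
    mem_per_core = min(65536, max(1024, mem_per_core))
    return core_num, mem_per_core


# Small table: the data fits in one worker.
small = estimate_resources(1000, 99, data_mb_per_worker=2048)   # (1, 4096)
# 500 million rows x 10 columns with the default 1024 MB per worker.
large = estimate_resources(500_000_000, 10)                     # (4656, 2048)
```

Because `coreNum` takes effect only when `memSizePerCore` is also set, both values should be passed to the command together.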

Example

The following example tokenizes a text column using the default settings.

Step 1: Create the input table.

create table pai_aliws_test
as select
    1 as id,
    'Today is a good day. The weather is fine and sunny.' as content;

Step 2: Run the Split Word component.

pai -name split_word
    -project algo_public
    -DinputTableName=pai_aliws_test
    -DselectedColNames=content
    -DoutputTableName=doc_test_split_word

Input:

id	content
1	Today is a good day. The weather is fine and sunny.

Output behavior:

The component tokenizes the specified column and leaves all other columns unchanged.
With a custom dictionary, tokenization is based on both the dictionary and context. Results may not strictly follow the custom dictionary.

Limitations

Only the TAOBAO_CHN and INTERNET_CHN tokenizers are supported.