Split word

The Split Word component in PAI Designer tokenizes text in a specified column using the Alibaba Word Segmenter (AliWS). Tokens in the output are separated by spaces. If you enable part-of-speech (POS) tagging or semantic tagging, the output also includes POS tags (separated by /) and semantic tags (separated by |).

The component supports two tokenizers:

  • TAOBAO_CHN

  • INTERNET_CHN

Output format

The output format depends on which tagging options you enable:

Configuration                       Output format                Example token
Default (no POS, no semantic)       token                        Beijing
POS tagging enabled                 token/POS_tag                Beijing/ns
POS and semantic tagging enabled    token/POS_tag|semantic_tag   Beijing/ns|place

Tokens are separated by spaces. When both POS and semantic tagging are enabled, each token follows the token/POS_tag|semantic_tag format.
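The formats above can be read back into structured records with a small parser. The following is a minimal sketch, not part of the component: the names Token and parse_split_word_output are hypothetical, and the caller is assumed to know whether POS tagging was enabled when the output was produced. It also assumes that tags themselves never contain "/".

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    """One parsed token (hypothetical helper type, not a PAI API)."""
    text: str
    pos: Optional[str] = None
    semantic: Optional[str] = None


def parse_split_word_output(line: str, pos_enabled: bool = False) -> List[Token]:
    """Parse one space-separated output string into Token tuples."""
    tokens = []
    for item in line.split():
        if not pos_enabled:
            # Default format: the item is the token itself.
            tokens.append(Token(item))
            continue
        # Split on the last "/" so a token that itself contains "/" stays intact.
        text, sep, tags = item.rpartition("/")
        if not sep:
            # No tag found; keep the item as a bare token.
            tokens.append(Token(item))
            continue
        # The semantic tag, if present, follows the POS tag after "|".
        pos, _, semantic = tags.partition("|")
        tokens.append(Token(text, pos, semantic or None))
    return tokens
```

Splitting on the last "/" rather than the first is a deliberate choice so that tokens such as punctuation or fractions are not cut in the wrong place.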

Configure the Split Word component

Configure the component using the GUI or a PAI command.

Configure using the GUI

On the workflow canvas in Designer, click the Split Word component to open its configuration panel.

Tab | Parameter | Description
--- | --- | ---
Fields Setting | Column Name | The column to tokenize.
Parameters Setting | Recognition Options | The entity types to detect. Valid values: Detect simple entities, Detect person names, Detect organization names, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters. Default: Detect simple entities, Detect phone numbers, Detect time, Detect date, and Detect numbers and letters.
Parameters Setting | Merge Options | The token types to merge into a single retrieval unit. Valid values: Merge Chinese numbers, Merge Arabic numerals, Merge Chinese dates, and Merge Chinese times. Default: Merge Arabic numerals.
Parameters Setting | Filter | The tokenizer type. Valid values: TAOBAO_CHN and INTERNET_CHN. Default: TAOBAO_CHN.
Parameters Setting | Pos Tagger | Whether to add POS tags to each token. Enabled by default.
Parameters Setting | Semantic Tagger | Whether to add semantic tags to each token. Disabled by default.
Parameters Setting | Filter tokens that are numbers | Whether to exclude numeric tokens from the output. Disabled by default.
Parameters Setting | Filter tokens that are all-English words | Whether to exclude tokens that consist entirely of English letters. Disabled by default.
Parameters Setting | Filter tokens that are punctuation marks | Whether to exclude punctuation tokens from the output. Disabled by default.
Execution Tuning | Number of cores | The number of worker cores. Automatically allocated by the system.
Execution Tuning | Memory per core | The memory per core, in MB. Automatically allocated by the system.

Configure using a PAI command

Use the SQL Script component to run PAI commands.

The following command lists the tokenization options with their default values:

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model
    -DcolName=content
    -Dtokenizer=TAOBAO_CHN
    -DenableDfa=true
    -DenablePersonNameTagger=false
    -DenableOrgnizationTagger=false
    -DenablePosTagger=false
    -DenableTelephoneRetrievalUnit=true
    -DenableTimeRetrievalUnit=true
    -DenableDateRetrievalUnit=true
    -DenableNumberLetterRetrievalUnit=true
    -DenableChnNumMerge=false
    -DenableNumMerge=true
    -DenableChnTimeMerge=false
    -DenableChnDateMerge=false
    -DenableSemanticTagger=false

Parameters:

Parameter | Required | Description | Default
--- | --- | --- | ---
inputTableName | Yes | The input table name. |
inputTablePartitions | No | The partitions to tokenize. Format: partition_name=value. For multi-level partitions: name1=value1/name2=value2. Separate multiple partitions with commas. | All partitions
selectedColNames | Yes | The columns to tokenize. Separate multiple column names with commas. |
dictTableName | No | A custom dictionary table. The table must have one column, with one word per row. |
tokenizer | No | The tokenizer type. Valid values: TAOBAO_CHN, INTERNET_CHN. | TAOBAO_CHN
enableDfa | No | Whether to detect simple entities. | True
enablePersonNameTagger | No | Whether to detect person names. | False
enableOrgnizationTagger | No | Whether to detect organization names. | False
enablePosTagger | No | Whether to add POS tags. | False
enableTelephoneRetrievalUnit | No | Whether to detect phone numbers. | True
enableTimeRetrievalUnit | No | Whether to detect time expressions. | True
enableDateRetrievalUnit | No | Whether to detect dates. | True
enableNumberLetterRetrievalUnit | No | Whether to detect numbers and letters. | True
enableChnNumMerge | No | Whether to merge Chinese numbers into a single retrieval unit. | False
enableNumMerge | No | Whether to merge Arabic numerals into a single retrieval unit. | True
enableChnTimeMerge | No | Whether to merge Chinese time expressions into a single semantic unit. | False
enableChnDateMerge | No | Whether to merge Chinese date expressions into a single semantic unit. | False
enableSemanticTagger | No | Whether to add semantic tags. | False
outputTableName | Yes | The output table name. |
outputTablePartition | No | The output table partition name. |
coreNum | No | The number of workers. Must be a positive integer in [1, 9999]. Takes effect only when memSizePerCore is also set. | Automatically allocated
memSizePerCore | No | The memory per worker, in MB. Must be a positive integer in [1024, 65536]. | Automatically allocated
lifecycle | No | The lifecycle of the output table. Must be a positive integer. |

For standard tables, leave coreNum and memSizePerCore unset. The component calculates them automatically.
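
The required parameters and the resource constraints listed above can be checked before a command is submitted. The following is a minimal sketch under stated assumptions: build_split_word_command is a hypothetical helper, not a PAI API, and it only assembles the command text. Rejecting coreNum when memSizePerCore is missing is a design choice made here, because the table states that coreNum takes effect only when both are set.

```python
from typing import Dict, Union

# Parameters marked "Yes" in the Required column.
REQUIRED = ("inputTableName", "selectedColNames", "outputTableName")


def build_split_word_command(params: Dict[str, Union[str, int, bool]]) -> str:
    """Validate parameters and return the PAI command as text (hypothetical helper)."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    if "coreNum" in params:
        if "memSizePerCore" not in params:
            raise ValueError("coreNum takes effect only when memSizePerCore is also set")
        if not 1 <= params["coreNum"] <= 9999:
            raise ValueError("coreNum must be in [1, 9999]")
    if "memSizePerCore" in params and not 1024 <= params["memSizePerCore"] <= 65536:
        raise ValueError("memSizePerCore must be in [1024, 65536]")

    lines = ["pai -name split_word", "    -project algo_public"]
    for name, value in params.items():
        if isinstance(value, bool):
            # Boolean options are written in lowercase, as in the commands in this topic.
            value = "true" if value else "false"
        lines.append(f"    -D{name}={value}")
    return "\n".join(lines)
```

The returned text can then be pasted into the SQL Script component. The sketch does not check parameter names against the table, so a misspelled option passes through unchanged.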

If resources are limited, use the following function to calculate appropriate values:

def CalcCoreNumAndMem(row, col, kOneCoreDataSize=1024):
    """Calculate the number of workers and memory per worker.
       Args:
           row: The number of rows in the input table.
           col: The number of columns in the input table.
           kOneCoreDataSize: The data volume per worker, in MB. Must be a positive integer. Default: 1024.
       Returns:
           coreNum, memSizePerCore
       Example:
           coreNum, memSizePerCore = CalcCoreNumAndMem(1000, 99, kOneCoreDataSize=2048)
    """
    kMBytes = 1024.0 * 1024.0
    # Calculate the number of workers based on data volume.
    coreNum = max(1, int(row * col * 1000 / kMBytes / kOneCoreDataSize))
    # Memory per worker = 2 x the data volume per worker, with a minimum of 1024 MB.
    memSizePerCore = max(1024, int(kOneCoreDataSize * 2))
    return coreNum, memSizePerCore
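
The function above enforces only the lower bounds, while the parameter table limits coreNum to [1, 9999] and memSizePerCore to [1024, 65536]. The following variant is a sketch that applies the same formula and additionally clamps both results to those documented ranges. The name calc_core_num_and_mem_clamped is hypothetical, and the clamping is an addition, not part of the original function.

```python
def calc_core_num_and_mem_clamped(row, col, one_core_data_size_mb=1024):
    """Same formula as CalcCoreNumAndMem, with results clamped to the documented ranges."""
    k_mbytes = 1024.0 * 1024.0
    # Number of workers derived from the estimated data volume.
    core_num = int(row * col * 1000 / k_mbytes / one_core_data_size_mb)
    # Memory per worker is twice the data volume assigned to one worker.
    mem_size_per_core = int(one_core_data_size_mb * 2)
    # Clamp to the ranges given for coreNum and memSizePerCore.
    core_num = min(9999, max(1, core_num))
    mem_size_per_core = min(65536, max(1024, mem_size_per_core))
    return core_num, mem_size_per_core


# The docstring example: 1000 rows, 99 columns, 2048 MB per worker.
core_num, mem_size_per_core = calc_core_num_and_mem_clamped(1000, 99, 2048)
```

For the docstring example, the estimated volume is far below one worker's share, so the result is the minimum of 1 worker with 4096 MB of memory.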

Example

The following example tokenizes a text column using the default settings.

Step 1: Create the input table.

create table pai_aliws_test
as select
    1 as id,
    'Today is a good day. The weather is fine and sunny.' as content;

Step 2: Run the Split Word component.

pai -name split_word
    -project algo_public
    -DinputTableName=pai_aliws_test
    -DselectedColNames=content
    -DoutputTableName=doc_test_split_word

Input:

id | content
--- | ---
1 | Today is a good day. The weather is fine and sunny.

Output behavior:

  • The component tokenizes the specified column and leaves all other columns unchanged.

  • With a custom dictionary, tokenization is based on both the dictionary and context. Results may not strictly follow the custom dictionary.

Limitations

  • Only the TAOBAO_CHN and INTERNET_CHN tokenizers are supported.