Introduction to the Split Word (Generate Model) component

The Split Word (Generate Model) component trains a Chinese word segmentation model using the Alibaba Word Segmenter (AliWS) lexical analysis system. Unlike the Split Word component, which segments text directly, this component produces a deployable model — you must deploy it first, then run online or offline predictions.

Use this component when you need a reusable segmentation model for repeated inference, online API calls, or batch offline predictions.

How it works

1. Configure the component parameters and, optionally, supply a custom dictionary.
2. Run the component to generate an offline word segmentation model.
3. Deploy the model as an online model.
4. Run predictions, either through the online API or via batch offline segmentation.

Quick start

The only required parameter is outputModelName. Run the following PAI command in the SQL Script component to generate a model with default settings:

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model

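Then deploy the model and run predictions. The commands below use the model name ning_test_aliws_model; replace it with the value you passed to outputModelName (aliws_model in the command above):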

-- Deploy the model
create onlinemodel ning_test_aliws_model_2 -offlinemodelName ning_test_aliws_model -instanceNum 1 -cpu 100 -memory 4096;

// Online prediction (Java)
KVJsonRequest request = new KVJsonRequest();
Map<String, JsonFeatureValue> row = request.addRow();
row.put(col_name, new JsonFeatureValue("The big data algorithm platform is a new platform"));
KVJsonResponse res = predictClient.syncPredict(new JsonPredictRequest(project_name, model_name, request));
List<ResponseItem> ri = res.getOutputs();
for (ResponseItem item : ri) {
    System.out.println(item.getOutputLabel());
}

-- Offline batch segmentation
pai -name prediction
    -DmodelName=ning_test_aliws_model
    -DinputTableName=ning_test_aliws
    -DoutputTableName=ning_test_aliws_offline_predict;

Configure the component

Configure the component using the Designer GUI or PAI commands.

GUI — use for interactive exploration and one-off model generation.
PAI commands — use for automation, reproducibility, and pipeline integration via the SQL Script component.

Method 1: Use the GUI

On the Designer workflow page, configure the component across three tabs.

Fields Setting tab

Parameter	Description
Selected field column	The field column used to generate the model

Parameters Setting tab

Parameter	Description	Default
Recognized options	Entity types to detect. Options: Detect simple entities, Detect person names, Detect organization names, Detect phone numbers, Detect time, Detect dates, Detect numbers and letters	Detect simple entities, Detect phone numbers, Detect time, Detect dates, and Detect numbers and letters
Merge options	Token types to merge into a single retrieval unit. Options: Merge Chinese numerals, Merge Arabic numerals, Merge Chinese dates, Merge Chinese time	Merge Arabic numerals
Tokenizer	The segmentation domain. `TAOBAO_CHN` is optimized for product listing and e-commerce text; `INTERNET_CHN` is suited for general web content.	`TAOBAO_CHN`
POS tagger	Whether to perform part-of-speech (POS) tagging	Disabled
Semantic tagger	Whether to perform semantic tagging	Disabled
Filter out words that contain only numbers	Whether to discard number-only tokens from results	Disabled
Filter out words that contain only English letters	Whether to discard all-English tokens from results	Disabled
Filter out words that contain only punctuation marks	Whether to discard punctuation-only tokens from results	Disabled
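
When moving a GUI configuration into a script, it can help to see which PAI command parameter each checkbox corresponds to. The following Java sketch pairs the Recognized options and Merge options labels with the parameter names listed under Method 2. The pairing is inferred by matching each label to the parameter description on this page, so treat it as an assumption rather than a documented mapping. The class `GuiOptionFlags` and its method are hypothetical helpers, not part of any PAI SDK. The three filter options are omitted because this page lists no PAI command parameter for them.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: translates selected GUI labels into -D flags.
final class GuiOptionFlags {
    // Label-to-parameter pairing inferred from the parameter descriptions (assumption).
    static final Map<String, String> LABEL_TO_PARAM = new LinkedHashMap<>();
    static {
        LABEL_TO_PARAM.put("Detect simple entities", "enableDfa");
        LABEL_TO_PARAM.put("Detect person names", "enablePersonNameTagger");
        LABEL_TO_PARAM.put("Detect organization names", "enableOrgnizationTagger");
        LABEL_TO_PARAM.put("Detect phone numbers", "enableTelephoneRetrievalUnit");
        LABEL_TO_PARAM.put("Detect time", "enableTimeRetrievalUnit");
        LABEL_TO_PARAM.put("Detect dates", "enableDateRetrievalUnit");
        LABEL_TO_PARAM.put("Detect numbers and letters", "enableNumberLetterRetrievalUnit");
        LABEL_TO_PARAM.put("Merge Chinese numerals", "enableChnNumMerge");
        LABEL_TO_PARAM.put("Merge Arabic numerals", "enableNumMerge");
        LABEL_TO_PARAM.put("Merge Chinese dates", "enableChnDateMerge");
        LABEL_TO_PARAM.put("Merge Chinese time", "enableChnTimeMerge");
    }

    // Emits one explicit flag per known option: true if selected, false otherwise.
    static List<String> flagsFor(Collection<String> selectedLabels) {
        for (String label : selectedLabels) {
            if (!LABEL_TO_PARAM.containsKey(label)) {
                throw new IllegalArgumentException("Unknown option: " + label);
            }
        }
        List<String> flags = new ArrayList<>();
        for (Map.Entry<String, String> e : LABEL_TO_PARAM.entrySet()) {
            flags.add("-D" + e.getValue() + "=" + selectedLabels.contains(e.getKey()));
        }
        return flags;
    }

    public static void main(String[] args) {
        // The GUI defaults from the Parameters Setting table.
        Set<String> defaults = Set.of(
            "Detect simple entities", "Detect phone numbers", "Detect time",
            "Detect dates", "Detect numbers and letters", "Merge Arabic numerals");
        flagsFor(defaults).forEach(System.out::println);
    }
}
```

Writing every flag explicitly, rather than only the ones that differ from the defaults, keeps a scripted run reproducible even if the component defaults change later.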

Execution Tuning tab

Parameter	Default
Number of cores	System assigned
Memory per core	System assigned

Method 2: Use PAI commands

Run PAI commands in the SQL Script component. All parameters except outputModelName are optional.

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model
    -DcolName=content
    -Dtokenizer=TAOBAO_CHN
    -DenableDfa=true
    -DenablePersonNameTagger=false
    -DenableOrgnizationTagger=false
    -DenablePosTagger=false
    -DenableTelephoneRetrievalUnit=true
    -DenableTimeRetrievalUnit=true
    -DenableDateRetrievalUnit=true
    -DenableNumberLetterRetrievalUnit=true
    -DenableChnNumMerge=false
    -DenableNumMerge=true
    -DenableChnTimeMerge=false
    -DenableChnDateMerge=false
    -DenableSemanticTagger=true

Parameter	Required	Description	Default
`outputModelName`	Yes	Name of the output model	—
`colName`	No	Column name of the text for prediction	`context`
`userDictTableName`	No	Custom dictionary table name. The table must have one column, with one word per row.	—
`dictTableName`	No	Custom dictionary table name. The table must have one column, with one word per row.	—
`tokenizer`	No	Segmentation domain. `TAOBAO_CHN` for e-commerce and product text; `INTERNET_CHN` for general web content.	`TAOBAO_CHN`
`enableDfa`	No	Detect simple entities	`true`
`enablePersonNameTagger`	No	Detect person names	`false`
`enableOrgnizationTagger`	No	Detect organization names	`false`
`enablePosTagger`	No	Perform part-of-speech tagging	`false`
`enableTelephoneRetrievalUnit`	No	Detect phone numbers	`true`
`enableTimeRetrievalUnit`	No	Detect time expressions	`true`
`enableDateRetrievalUnit`	No	Detect date expressions	`true`
`enableNumberLetterRetrievalUnit`	No	Detect numbers and letters	`true`
`enableChnNumMerge`	No	Merge Chinese numerals into a single retrieval unit	`false`
`enableNumMerge`	No	Merge Arabic numerals into a single retrieval unit	`true`
`enableChnTimeMerge`	No	Merge Chinese time expressions into a semantic unit	`false`
`enableChnDateMerge`	No	Merge Chinese date expressions into a semantic unit	`false`
`enableSemanticTagger`	No	Perform semantic tagging	`false`
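
For pipeline integration, the command text can be assembled programmatically before it is pasted into or submitted through the SQL Script component. The following Java sketch only builds the command string and does not call any PAI API. The class `SplitWordModelCommand` is a hypothetical helper, and the trailing semicolon is an assumption based on the offline prediction command in the Quick start. It checks the one required parameter and the two tokenizer values named in the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of any PAI SDK: renders a split_word_model command as text.
final class SplitWordModelCommand {
    // Tokenizer values listed in the parameter table.
    private static final Set<String> TOKENIZERS = Set.of("TAOBAO_CHN", "INTERNET_CHN");

    // Builds the command; options are emitted as -Dkey=value in insertion order.
    static String build(String outputModelName, Map<String, String> options) {
        if (outputModelName == null || outputModelName.isBlank()) {
            throw new IllegalArgumentException("outputModelName is required");
        }
        String tokenizer = options.get("tokenizer");
        if (tokenizer != null && !TOKENIZERS.contains(tokenizer)) {
            throw new IllegalArgumentException("Unknown tokenizer: " + tokenizer);
        }
        StringBuilder sb = new StringBuilder("pai -name split_word_model\n")
            .append("    -project algo_public\n")
            .append("    -DoutputModelName=").append(outputModelName);
        for (Map.Entry<String, String> e : options.entrySet()) {
            sb.append("\n    -D").append(e.getKey()).append('=').append(e.getValue());
        }
        // Statement terminator (assumed, matching the offline prediction example).
        return sb.append(';').toString();
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("colName", "content");
        options.put("tokenizer", "INTERNET_CHN");
        options.put("enablePersonNameTagger", "true");
        System.out.println(build("aliws_model", options));
    }
}
```

Parameters left out of the map fall back to the defaults in the table, so the map only needs the values you want to override.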

For more information about the SQL Script component, see SQL Script.