The Split Word (Generate Model) component trains a Chinese word segmentation model using the Alibaba Word Segmenter (AliWS) lexical analysis system. Unlike the Split Word component, which segments text directly, this component produces a deployable model — you must deploy it first, then run online or offline predictions.
Use this component when you need a reusable segmentation model for repeated inference, online API calls, or batch offline predictions.
How it works
Configure the component parameters and, optionally, supply a custom dictionary.
Run the component to generate an offline word segmentation model.
Deploy the model as an online model.
Run predictions — either through the online API or via batch offline segmentation.
Quick start
The only required parameter is outputModelName. Run the following PAI command in the SQL Script component to generate a model with default settings:
pai -name split_word_model
-project algo_public
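-DoutputModelName=aliws_model
Then deploy the model and run predictions. The examples below use an offline model named ning_test_aliws_model; substitute the name you passed to outputModelName.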
-- Deploy the model
create onlinemodel ning_test_aliws_model_2 -offlinemodelName ning_test_aliws_model -instanceNum 1 -cpu 100 -memory 4096;
// Online prediction (Java)
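// predictClient, project_name, model_name, and col_name are assumed to be defined by the caller.
KVJsonRequest request = new KVJsonRequest();
// Add one input row and set the text column to segment.
Map<String, JsonFeatureValue> row = request.addRow();
row.put(col_name, new JsonFeatureValue("The big data algorithm platform is a new platform"));
// Send the request synchronously to the deployed online model.
KVJsonResponse res = predictClient.syncPredict(new JsonPredictRequest(project_name, model_name, request));
// Print the output label returned for each row.
List<ResponseItem> ri = res.getOutputs();
for (ResponseItem item : ri) {
    System.out.println(item.getOutputLabel());
}
-- Offline batch segmentation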
pai -name prediction
-DmodelName=ning_test_aliws_model
-DinputTableName=ning_test_aliws
-DoutputTableName=ning_test_aliws_offline_predict;
Configure the component
Configure the component using the Designer GUI or PAI commands.
GUI — use for interactive exploration and one-off model generation.
PAI commands — use for automation, reproducibility, and pipeline integration via the SQL Script component.
Method 1: Use the GUI
On the Designer workflow page, configure the component across three tabs.
Fields Setting tab
| Parameter | Description |
|---|---|
| Selected field column | The field column used to generate the model |
Parameters Setting tab
| Parameter | Description | Default |
|---|---|---|
| Recognized options | Entity types to detect. Options: Detect simple entities, Detect person names, Detect organization names, Detect phone numbers, Detect time, Detect dates, Detect numbers and letters | Detect simple entities, Detect phone numbers, Detect time, Detect dates, and Detect numbers and letters |
| Merge options | Token types to merge into a single retrieval unit. Options: Merge Chinese numerals, Merge Arabic numerals, Merge Chinese dates, Merge Chinese time | Merge Arabic numerals |
| Tokenizer | The segmentation domain. TAOBAO_CHN is optimized for product listing and e-commerce text; INTERNET_CHN is suited for general web content. | TAOBAO_CHN |
| POS tagger | Whether to perform part-of-speech (POS) tagging | Disabled |
| Semantic tagger | Whether to perform semantic tagging | Disabled |
| Filter out words that contain only numbers | Whether to discard number-only tokens from results | Disabled |
| Filter out words that contain only English letters | Whether to discard all-English tokens from results | Disabled |
| Filter out words that contain only punctuation marks | Whether to discard punctuation-only tokens from results | Disabled |
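To make the three filter options concrete, the following sketch shows one plausible reading of "number-only", "English-only", and "punctuation-only" tokens and how dropping them changes a token list. It is an illustration only, not AliWS code: the class `TokenFilterSketch`, its method names, and the exact character classes it uses are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the filter options; not part of AliWS or the PAI SDK.
class TokenFilterSketch {

    // Assumed meaning of "only numbers": every character is a Unicode digit.
    static boolean isNumberOnly(String token) {
        return !token.isEmpty() && token.chars().allMatch(Character::isDigit);
    }

    // Assumed meaning of "only English letters": every character is an ASCII letter.
    static boolean isEnglishOnly(String token) {
        return !token.isEmpty()
                && token.chars().allMatch(c -> (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
    }

    // Assumed meaning of "only punctuation marks": every character is in the
    // Unicode punctuation category (symbols such as '+' or '$' are not included).
    static boolean isPunctuationOnly(String token) {
        return token.matches("\\p{P}+");
    }

    // Returns the tokens that survive the enabled filters, in their original order.
    static List<String> filter(List<String> tokens,
                               boolean dropNumbers,
                               boolean dropEnglish,
                               boolean dropPunctuation) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (dropNumbers && isNumberOnly(token)) continue;
            if (dropEnglish && isEnglishOnly(token)) continue;
            if (dropPunctuation && isPunctuationOnly(token)) continue;
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("大数据", "2024", "PAI", "，", "平台");
        // With all three filters enabled, only the Chinese words remain.
        System.out.println(filter(tokens, true, true, true)); // prints [大数据, 平台]
    }
}
```

Mixed tokens such as `big2` pass every filter in this sketch, because each filter only matches tokens made up entirely of one character class.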
Execution Tuning tab
| Parameter | Default |
|---|---|
| Number of cores | System assigned |
| Memory per core | System assigned |
Method 2: Use PAI commands
Run PAI commands in the SQL Script component. All parameters except outputModelName are optional.
pai -name split_word_model
-project algo_public
-DoutputModelName=aliws_model
-DcolName=content
-Dtokenizer=TAOBAO_CHN
-DenableDfa=true
-DenablePersonNameTagger=false
-DenableOrgnizationTagger=false
-DenablePosTagger=false
-DenableTelephoneRetrievalUnit=true
-DenableTimeRetrievalUnit=true
-DenableDateRetrievalUnit=true
-DenableNumberLetterRetrievalUnit=true
-DenableChnNumMerge=false
-DenableNumMerge=true
-DenableChnTimeMerge=false
-DenableChnDateMerge=false
-DenableSemanticTagger=true
| Parameter | Required | Description | Default |
|---|---|---|---|
| outputModelName | Yes | Name of the output model | — |
| colName | No | Column name of the text for prediction | context |
| userDictTableName | No | Custom dictionary table name. The table must have one column, with one word per row. | — |
| dictTableName | No | Custom dictionary table name. The table must have one column, with one word per row. | — |
| tokenizer | No | Segmentation domain. TAOBAO_CHN for e-commerce and product text; INTERNET_CHN for general web content. | TAOBAO_CHN |
| enableDfa | No | Detect simple entities | true |
| enablePersonNameTagger | No | Detect person names | false |
| enableOrgnizationTagger | No | Detect organization names | false |
| enablePosTagger | No | Perform part-of-speech tagging | false |
| enableTelephoneRetrievalUnit | No | Detect phone numbers | true |
| enableTimeRetrievalUnit | No | Detect time expressions | true |
| enableDateRetrievalUnit | No | Detect date expressions | true |
| enableNumberLetterRetrievalUnit | No | Detect numbers and letters | true |
| enableChnNumMerge | No | Merge Chinese numerals into a single retrieval unit | false |
| enableNumMerge | No | Merge Arabic numerals into a single retrieval unit | true |
| enableChnTimeMerge | No | Merge Chinese time expressions into a semantic unit | false |
| enableChnDateMerge | No | Merge Chinese date expressions into a semantic unit | false |
| enableSemanticTagger | No | Perform semantic tagging | false |
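When the command is generated by a pipeline rather than typed by hand, it can help to assemble it from a parameter map so that the required outputModelName is always present. The sketch below only builds the command text in the layout of the example above; it does not submit anything to PAI. The class `SplitWordModelCommand` and its `build` method are hypothetical names introduced for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that assembles the command text; it does not call PAI.
class SplitWordModelCommand {

    // Builds a split_word_model command. outputModelName is the only required
    // parameter; every entry in optionalParams is emitted as -D<name>=<value>.
    static String build(String outputModelName, Map<String, String> optionalParams) {
        if (outputModelName == null || outputModelName.isEmpty()) {
            throw new IllegalArgumentException("outputModelName is required");
        }
        StringBuilder command = new StringBuilder("pai -name split_word_model\n");
        command.append("    -project algo_public\n");
        command.append("    -DoutputModelName=").append(outputModelName);
        for (Map.Entry<String, String> param : optionalParams.entrySet()) {
            command.append("\n    -D")
                   .append(param.getKey())
                   .append('=')
                   .append(param.getValue());
        }
        // No statement terminator is appended, matching the command example above.
        return command.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("colName", "content");
        params.put("tokenizer", "INTERNET_CHN");
        params.put("enablePersonNameTagger", "true");
        System.out.println(build("aliws_model", params));
    }
}
```

Parameters that are left out of the map fall back to the defaults listed in the table, so the map only needs the values you want to override.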
For more information about the SQL Script component, see SQL Script.