Split Word (Generate Model)

The Split Word (Generate Model) component trains a Chinese word segmentation model using the Alibaba Word Segmenter (AliWS) lexical analysis system. Unlike the Split Word component, which segments text directly, this component produces a deployable model — you must deploy it first, then run online or offline predictions.

Use this component when you need a reusable segmentation model for repeated inference, online API calls, or batch offline predictions.

How it works

  1. Configure the component parameters and, optionally, supply a custom dictionary.

  2. Run the component to generate an offline word segmentation model.

  3. Deploy the model as an online model.

  4. Run predictions — either through the online API or via batch offline segmentation.

Quick start

The only required parameter is outputModelName. Run the following PAI command in the SQL Script component to generate a model with default settings:

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model;

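Then deploy the model and run predictions. The examples below use ning_test_aliws_model as the offline model name; substitute the value you passed to outputModelName.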

-- Deploy the model
create onlinemodel ning_test_aliws_model_2 -offlinemodelName ning_test_aliws_model -instanceNum 1 -cpu 100 -memory 4096;

// Online prediction (Java)
// predictClient, project_name, model_name, and col_name are assumed to be defined by the caller.
KVJsonRequest request = new KVJsonRequest();
// Add one input row, keyed by the name of the text column.
Map<String, JsonFeatureValue> row = request.addRow();
row.put(col_name, new JsonFeatureValue("The big data algorithm platform is a new platform"));
// Send the request synchronously and print each returned output label.
KVJsonResponse res = predictClient.syncPredict(new JsonPredictRequest(project_name, model_name, request));
List<ResponseItem> ri = res.getOutputs();
for (ResponseItem item : ri) {
    System.out.println(item.getOutputLabel());
}

-- Offline batch segmentation
pai -name prediction
    -DmodelName=ning_test_aliws_model
    -DinputTableName=ning_test_aliws
    -DoutputTableName=ning_test_aliws_offline_predict;

Configure the component

Configure the component using the Designer GUI or PAI commands.

  • GUI — use for interactive exploration and one-off model generation.

  • PAI commands — use for automation, reproducibility, and pipeline integration via the SQL Script component.

Method 1: Use the GUI

On the Designer workflow page, configure the component across three tabs.

Fields Setting tab

| Parameter | Description |
| --- | --- |
| Selected field column | The field column used to generate the model |

Parameters Setting tab

| Parameter | Description | Default |
| --- | --- | --- |
| Recognized options | Entity types to detect. Options: Detect simple entities, Detect person names, Detect organization names, Detect phone numbers, Detect time, Detect dates, Detect numbers and letters | Detect simple entities, Detect phone numbers, Detect time, Detect dates, and Detect numbers and letters |
| Merge options | Token types to merge into a single retrieval unit. Options: Merge Chinese numerals, Merge Arabic numerals, Merge Chinese dates, Merge Chinese time | Merge Arabic numerals |
| Tokenizer | The segmentation domain. TAOBAO_CHN is optimized for product listing and e-commerce text; INTERNET_CHN is suited for general web content. | TAOBAO_CHN |
| POS tagger | Whether to perform part-of-speech (POS) tagging | Disabled |
| Semantic tagger | Whether to perform semantic tagging | Disabled |
| Filter out words that contain only numbers | Whether to discard number-only tokens from results | Disabled |
| Filter out words that contain only English letters | Whether to discard all-English tokens from results | Disabled |
| Filter out words that contain only punctuation marks | Whether to discard punctuation-only tokens from results | Disabled |
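
The three filter options can be pictured as simple predicates over the segmented tokens. The following Java sketch is an illustration only: the class TokenFilterSketch, its filter method, and the regular expressions are assumptions made for this example, and AliWS may classify tokens differently (for example, full-width digits).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: approximates the three filter options with regular
// expressions. The actual AliWS rules may differ.
class TokenFilterSketch {
    // Assumptions: "numbers" means ASCII digits, "English letters" means
    // A-Z and a-z, and "punctuation" means Unicode punctuation (\p{P}).
    private static final Pattern NUMBERS_ONLY = Pattern.compile("[0-9]+");
    private static final Pattern ENGLISH_ONLY = Pattern.compile("[A-Za-z]+");
    private static final Pattern PUNCT_ONLY = Pattern.compile("\\p{P}+");

    // Returns the tokens that survive the enabled filters, in input order.
    static List<String> filter(List<String> tokens, boolean dropNumbers,
                               boolean dropEnglish, boolean dropPunctuation) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (dropNumbers && NUMBERS_ONLY.matcher(token).matches()) continue;
            if (dropEnglish && ENGLISH_ONLY.matcher(token).matches()) continue;
            if (dropPunctuation && PUNCT_ONLY.matcher(token).matches()) continue;
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Chinese tokens are written as Unicode escapes to keep the source ASCII:
        // "big data", a full-width comma, and "platform".
        List<String> tokens = List.of("\u5927\u6570\u636e", "2024", "PAI",
                "\uFF0C", "iPhone15", "\u5e73\u53f0");
        System.out.println(filter(tokens, true, true, true));
    }
}
```

With all three filters enabled, only the two Chinese words and the mixed token iPhone15 remain, because a token is dropped only when it consists entirely of one character class.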

Execution Tuning tab

| Parameter | Default |
| --- | --- |
| Number of cores | System assigned |
| Memory per core | System assigned |

Method 2: Use PAI commands

Run PAI commands in the SQL Script component. All parameters except outputModelName are optional.

pai -name split_word_model
    -project algo_public
    -DoutputModelName=aliws_model
    -DcolName=content
    -Dtokenizer=TAOBAO_CHN
    -DenableDfa=true
    -DenablePersonNameTagger=false
    -DenableOrgnizationTagger=false
    -DenablePosTagger=false
    -DenableTelephoneRetrievalUnit=true
    -DenableTimeRetrievalUnit=true
    -DenableDateRetrievalUnit=true
    -DenableNumberLetterRetrievalUnit=true
    -DenableChnNumMerge=false
    -DenableNumMerge=true
    -DenableChnTimeMerge=false
    -DenableChnDateMerge=false
    -DenableSemanticTagger=true;

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| outputModelName | Yes | Name of the output model | |
| colName | No | Column name of the text for prediction | context |
| userDictTableName | No | Custom dictionary table name. The table must have one column, with one word per row. | |
| dictTableName | No | Custom dictionary table name. The table must have one column, with one word per row. | |
| tokenizer | No | Segmentation domain. TAOBAO_CHN for e-commerce and product text; INTERNET_CHN for general web content. | TAOBAO_CHN |
| enableDfa | No | Detect simple entities | true |
| enablePersonNameTagger | No | Detect person names | false |
| enableOrgnizationTagger | No | Detect organization names | false |
| enablePosTagger | No | Perform part-of-speech tagging | false |
| enableTelephoneRetrievalUnit | No | Detect phone numbers | true |
| enableTimeRetrievalUnit | No | Detect time expressions | true |
| enableDateRetrievalUnit | No | Detect date expressions | true |
| enableNumberLetterRetrievalUnit | No | Detect numbers and letters | true |
| enableChnNumMerge | No | Merge Chinese numerals into a single retrieval unit | false |
| enableNumMerge | No | Merge Arabic numerals into a single retrieval unit | true |
| enableChnTimeMerge | No | Merge Chinese time expressions into a semantic unit | false |
| enableChnDateMerge | No | Merge Chinese date expressions into a semantic unit | false |
| enableSemanticTagger | No | Perform semantic tagging | false |
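
For pipeline integration, the command text can be assembled from a parameter map before it is passed to the SQL Script component. The sketch below is a hypothetical helper: SplitWordModelCommand is not part of any PAI SDK, it only formats the command string from the parameter names listed above, and it neither validates parameter values nor submits the command.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any PAI SDK: it only formats a command string.
class SplitWordModelCommand {
    // Builds a single-line "pai -name split_word_model" command.
    // outputModelName is the only required parameter; every entry in
    // options is appended as -Dkey=value in insertion order.
    static String build(String outputModelName, Map<String, String> options) {
        if (outputModelName == null || outputModelName.isEmpty()) {
            throw new IllegalArgumentException("outputModelName is required");
        }
        StringBuilder sb = new StringBuilder("pai -name split_word_model -project algo_public");
        sb.append(" -DoutputModelName=").append(outputModelName);
        for (Map.Entry<String, String> e : options.entrySet()) {
            sb.append(" -D").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.append(';').toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in the order they were added.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("colName", "content");
        options.put("tokenizer", "INTERNET_CHN");
        options.put("enablePersonNameTagger", "true");
        System.out.println(build("aliws_model", options));
    }
}
```

Parameters that are left out of the map fall back to the defaults in the table, so the map only needs the settings that differ from them.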

For more information about the SQL Script component, see SQL Script.