PS-SMART multiclass classification

Parameter Server (PS) is designed for large-scale offline and online training tasks. Scalable Multiple Additive Regression Tree (SMART) is an iterative algorithm that is based on Gradient Boosting Decision Tree (GBDT) and implemented on PS. PS-SMART supports training tasks with tens of billions of samples and hundreds of thousands of features, and can run on thousands of nodes. It also supports multiple data formats and optimization techniques, such as histogram approximation.
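
As a rough illustration of what histogram approximation means in general, the sketch below groups the values of one feature into equal-frequency buckets so that only the bucket boundaries need to be considered as split candidates. This is a simplified, single-machine stand-in and not the sketch algorithm that PS-SMART actually uses; the function name quantile_cut_points and the parameter bucket_fraction are hypothetical.

```python
def quantile_cut_points(values, bucket_fraction):
    """Return cut points that divide the values into about
    1 / bucket_fraction equal-frequency buckets (illustration only)."""
    ordered = sorted(values)
    n_buckets = max(1, round(1.0 / bucket_fraction))
    cuts = []
    for i in range(1, n_buckets):
        idx = (i * len(ordered)) // n_buckets
        cut = ordered[min(idx, len(ordered) - 1)]
        # Skip repeated cut points caused by duplicate feature values.
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


cuts = quantile_cut_points(range(100), bucket_fraction=0.25)
print(cuts)  # [25, 50, 75]
```

A smaller bucket fraction yields more cut points, which gives finer split candidates at a higher cost.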

Limits

The PS-SMART multiclass classification component has the following input data requirements and usage limits:

  • The target column supports only numeric types. If the data in a MaxCompute table is of the STRING type, you must convert its data type. For example, if the classification targets are Good/Medium/Bad strings, you must convert them to 0/1/2.

  • If the data is in KV format, feature IDs must be positive integers and feature values must be real numbers. If feature IDs are strings, you can use the serialize component to serialize them. If feature values are categorical strings, you can perform feature engineering, such as feature discretization.

  • The PS-SMART multiclass classification component supports tasks with hundreds of thousands of features. However, these tasks consume significant resources and run slowly. GBDT algorithms can be trained directly on continuous features. Therefore, apart from one-hot encoding categorical features (and filtering out low-frequency features), we recommend that you do not discretize continuous numerical features.

  • The PS-SMART algorithm introduces randomness. Examples include data and feature sampling indicated by data_sample_ratio and fea_sample_ratio, histogram approximation optimization used by the algorithm, and the random order of merging local sketches into a global sketch. Although the tree structures may differ when multiple workers run in a distributed manner, the model performance is theoretically similar. It is normal to obtain inconsistent results in multiple runs with the same data and parameters.

  • To accelerate training, you can increase the number of computing cores. However, the PS-SMART algorithm will not start training until all servers have acquired the necessary resources. Therefore, requesting more resources when the cluster is busy will increase the waiting time.
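
The label and KV format requirements above can be met with a small preprocessing step before the data is loaded into MaxCompute. The following is a minimal sketch under stated assumptions, not part of PS-SMART: the helper names encode_labels and to_kv and the sample feature names are hypothetical, and the Good/Medium/Bad order follows the example in the first requirement.

```python
def encode_labels(labels, classes):
    """Map string class names to 0..n-1 in the order given by `classes`."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[label] for label in labels]


def to_kv(features, feature_ids):
    """Serialize {feature_name: value} into 'id:value' pairs separated by spaces.

    `feature_ids` maps each string feature name to a positive integer ID.
    Zero-valued features are skipped; this is an assumption about how sparse
    rows are usually written, not a documented PS-SMART requirement.
    """
    pairs = []
    for name, value in features.items():
        fid = feature_ids[name]
        if fid <= 0:
            raise ValueError(f"feature ID for {name!r} must be a positive integer")
        if value != 0:
            pairs.append((fid, float(value)))
    return " ".join(f"{fid}:{value:g}" for fid, value in sorted(pairs))


labels = encode_labels(["Good", "Bad", "Medium"], classes=["Good", "Medium", "Bad"])
row = to_kv({"age": 0.3, "income": 0.9, "clicks": 0},
            {"age": 1, "clicks": 2, "income": 3})
print(labels)  # [0, 2, 1]
print(row)     # 1:0.3 3:0.9
```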

Configure the component

Method 1: Use the GUI

In the Designer workflow, add the PS-SMART multiclass classification component and configure its parameters in the right-side pane:

| Parameter type | Parameter | Description |
| --- | --- | --- |
| Fields setting | Is sparse format | Specifies whether the input data is in sparse KV format. In sparse format, KV pairs are separated by spaces, and the key and value in each pair are separated by a colon (:). Example: 1:0.3 3:0.9. |
| Fields setting | Feature columns | The feature columns from the input table used for training. If the input data is in dense format, you can select only numeric (BIGINT or DOUBLE) columns. If the input data is in sparse KV format and both the key and value are numeric, you can select only STRING columns. |
| Fields setting | Label column | The label column of the input table. The column type can be STRING or numeric, but the stored values must be numeric. For example, {0,1,2,...,n-1} for n classes. |
| Fields setting | Weight column | The column used to weight each sample row. Only numeric types are supported. |
| Parameters setting | Number of classes | The number of classes for multiclass classification. If the number of classes is n, the values in the label column must be {0,1,2,...,n-1}. |
| Parameters setting | Evaluation metric type | Supported types are multiclass negative log likelihood and multiclass classification error. |
| Parameters setting | Number of trees | The number of trees. The value must be a positive integer. The training time is proportional to the number of trees. |
| Parameters setting | Maximum tree depth | The maximum depth of a tree. The default value is 5, which allows a maximum of 32 leaf nodes. |
| Parameters setting | Data sampling ratio | The fraction of the data sampled when each tree is built. Sampling builds weak learners and accelerates training. |
| Parameters setting | Feature sampling ratio | The fraction of the features sampled when each tree is built. Sampling builds weak learners and accelerates training. |
| Parameters setting | L1 penalty coefficient | Controls the size of leaf nodes. A larger value results in a more uniform distribution of leaf node sizes. Increase this value if overfitting occurs. |
| Parameters setting | L2 penalty coefficient | Controls the size of leaf nodes. A larger value results in a more uniform distribution of leaf node sizes. Increase this value if overfitting occurs. |
| Parameters setting | Learning rate | The learning rate. The value must be in the range (0,1). |
| Parameters setting | Approximate sketch precision | The quantile threshold for constructing the sketch. A smaller value results in more buckets. In most cases, keep the default value 0.03. |
| Parameters setting | Minimum split loss change | The minimum loss change required to split a node. A larger value makes splitting more conservative. |
| Parameters setting | Number of features | The number of features or the maximum feature ID. Configure this parameter so that resource usage can be estimated. |
| Parameters setting | Global bias | The initial prediction value for all samples. |
| Parameters setting | Random number generator seed | The random number seed. The value must be an integer. |
| Parameters setting | Feature importance type | One of the following: the number of times the feature is used for splitting in the model; the information gain brought by the feature in the model; the number of samples covered by the feature at the split node in the model. |
| Execution tuning | Number of cores | The number of cores. By default, the system allocates cores automatically. |
| Execution tuning | Memory size per core | The memory used by a single core, in MB. Manual configuration is usually not required because the system allocates memory automatically. |
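
To make the sparse format and the Number of features parameter concrete, the following minimal sketch parses sparse KV rows and derives the maximum feature ID from them. The helper names parse_kv and max_feature_id are hypothetical and the sample rows are made up; the last line only checks the leaf count stated for the default maximum tree depth.

```python
def parse_kv(row):
    """Parse '1:0.3 3:0.9' into {1: 0.3, 3: 0.9}."""
    features = {}
    for pair in row.split():
        key, value = pair.split(":", 1)
        features[int(key)] = float(value)
    return features


def max_feature_id(rows):
    """Largest feature ID seen across all rows, or 0 if there are none."""
    return max((fid for row in rows for fid in parse_kv(row)), default=0)


rows = ["1:0.3 3:0.9", "2:1.5 7:0.1"]
print(parse_kv(rows[0]))     # {1: 0.3, 3: 0.9}
print(max_feature_id(rows))  # 7
print(2 ** 5)                # 32 leaf nodes at the default maximum depth of 5
```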

Method 2: Use PAI commands

You can use a SQL script component to call PAI commands and configure the parameters of the PS-SMART multiclass classification component. For more information, see Scenario 4: Execute PAI commands in a SQL script component.

--Train
PAI -name ps_smart
    -project algo_public
    -DinputTableName="smart_multiclass_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545859_2"
    -DoutputImportanceTableName="pai_temp_24515_545859_3"
    -DlabelColName="label"
    -DfeatureColNames="features"
    -DenableSparse="true"
    -Dobjective="multi:softprob"
    -Dmetric="mlogloss"
    -DfeatureImportanceType="gain"
    -DtreeCount="5"
    -DmaxDepth="5"
    -Dshrinkage="0.3"
    -Dl2="1.0"
    -Dl1="0"
    -Dlifecycle="3"
    -DsketchEps="0.03"
    -DsampleRatio="1.0"
    -DfeatureRatio="1.0"
    -DbaseScore="0.5"
    -DminSplitLoss="0";
--Predict
PAI -name prediction
    -project algo_public
    -DinputTableName="smart_multiclass_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545860_1"
    -DfeatureColNames="features"
    -DappendColNames="label,features"
    -DenableSparse="true"
    -DkvDelimiter=":"
    -Dlifecycle="28";
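
If the commands are generated by a program rather than written by hand, a small helper can keep the option formatting consistent. The following is a minimal sketch: build_pai_command is a hypothetical helper that only assembles the command text and does not submit anything to PAI, and the model name and the classNum value of 3 are assumptions made for illustration.

```python
def build_pai_command(name, params, project="algo_public"):
    """Render a PAI command as text, one -D option per line, ending with ';'."""
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        lines.append(f'    -D{key}="{value}"')
    return "\n".join(lines) + ";"


train_cmd = build_pai_command("ps_smart", {
    "inputTableName": "smart_multiclass_input",
    "modelName": "my_ps_smart_model",  # hypothetical model name
    "labelColName": "label",
    "featureColNames": "features",
    "enableSparse": "true",
    "objective": "multi:softprob",
    "classNum": 3,                     # assumed number of classes
    "treeCount": 5,
})
print(train_cmd)
```

The generated text can then be pasted into a SQL script component.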

| Module | Parameter | Required | Default value | Description |
| --- | --- | --- | --- | --- |
| Data parameters | featureColNames | Yes | None | The feature columns from the input table used for training. If the input table is in dense format, you can select only numeric (BIGINT or DOUBLE) columns. If the input table is in sparse KV format and both the key and value are numeric, you can select only STRING columns. |
| Data parameters | labelColName | Yes | None | The label column of the input table. The column type can be STRING or numeric, but the stored values must be numeric. For multiclass classification, the values must be {0,1,2,...,n-1}, where n is the number of classes. |
| Data parameters | weightCol | No | None | The column used to weight each sample row. Only numeric types are supported. |
| Data parameters | enableSparse | No | false | Specifies whether the data is in sparse format. Valid values: true and false. In sparse format, KV pairs are separated by spaces, and the key and value in each pair are separated by a colon (:). Example: 1:0.3 3:0.9. |
| Data parameters | inputTableName | Yes | None | The name of the input table. |
| Data parameters | modelName | Yes | None | The name of the output model. |
| Data parameters | outputImportanceTableName | No | None | The name of the output table for feature importance. |
| Data parameters | inputTablePartitions | No | None | The partitions of the input table. The format is ds=1/pt=1. |
| Data parameters | outputTableName | No | None | The output table in MaxCompute. The table is in binary format and cannot be read directly. It can be accessed only by the SMART prediction component. |
| Data parameters | lifecycle | No | 3 | The lifecycle of the output table, in days. |
| Algorithm parameters | classNum | Yes | None | The number of classes for multiclass classification. If the number of classes is n, the values in the label column must be {0,1,2,...,n-1}. |
| Algorithm parameters | objective | Yes | None | The type of objective function. For multiclass classification training, set this parameter to multi:softprob. |
| Algorithm parameters | metric | No | None | The evaluation metric type for the training dataset. The output is written to stdout in the Coordinator area of the Logview file. Valid values: mlogloss (the multiclass negative log likelihood type in the GUI) and merror (the multiclass classification error type in the GUI). |
| Algorithm parameters | treeCount | No | 1 | The number of trees. The training time is proportional to this value. |
| Algorithm parameters | maxDepth | No | 5 | The maximum depth of a tree. The value must be in the range of 1 to 20. |
| Algorithm parameters | sampleRatio | No | 1.0 | The data sampling ratio. The value must be in the range (0,1]. A value of 1.0 means no sampling. |
| Algorithm parameters | featureRatio | No | 1.0 | The feature sampling ratio. The value must be in the range (0,1]. A value of 1.0 means no sampling. |
| Algorithm parameters | l1 | No | 0 | The L1 penalty coefficient. A larger value results in a more uniform distribution of leaf nodes. Increase this value if overfitting occurs. |
| Algorithm parameters | l2 | No | 1.0 | The L2 penalty coefficient. A larger value results in a more uniform distribution of leaf nodes. Increase this value if overfitting occurs. |
| Algorithm parameters | shrinkage | No | 0.3 | The learning rate. The value must be in the range (0,1). |
| Algorithm parameters | sketchEps | No | 0.03 | The quantile threshold for constructing the sketch. The number of buckets is O(1.0/sketchEps), so a smaller value results in more buckets. In most cases, keep the default value. The value must be in the range (0,1). |
| Algorithm parameters | minSplitLoss | No | 0 | The minimum loss change required to split a node. A larger value makes splitting more conservative. |
| Algorithm parameters | featureNum | No | None | The number of features or the maximum feature ID. Configure this parameter so that resource usage can be estimated. |
| Algorithm parameters | baseScore | No | 0.5 | The initial prediction value for all samples. |
| Algorithm parameters | randSeed | No | None | The random number seed. The value must be an integer. |
| Algorithm parameters | featureImportanceType | No | gain | The type of feature importance to calculate. Valid values: weight (the number of times the feature is used as a split feature in the model), gain (the information gain brought by the feature in the model), and cover (the number of samples covered by the feature at the split node in the model). |
| Tuning parameters | coreNum | No | System allocated | The number of cores. A larger value makes the algorithm run faster. |
| Tuning parameters | memSizePerCore | No | System allocated | The memory used by each core, in MB. |
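
The required flags and value ranges listed above can be checked before a command is submitted. The following is a minimal sketch that covers only some of the parameters; check_params and check_labels are hypothetical helpers, not part of PAI, and the sample parameter values are made up.

```python
def check_params(params):
    """Return a list of problems found in a ps_smart parameter dict."""
    problems = []
    for required in ("inputTableName", "modelName", "featureColNames",
                     "labelColName", "classNum", "objective"):
        if required not in params:
            problems.append(f"missing required parameter: {required}")
    # Defaults below are the documented default values.
    if not 1 <= params.get("maxDepth", 5) <= 20:
        problems.append("maxDepth must be in [1, 20]")
    for key in ("sampleRatio", "featureRatio"):
        if not 0 < params.get(key, 1.0) <= 1:
            problems.append(f"{key} must be in (0, 1]")
    for key, default in (("shrinkage", 0.3), ("sketchEps", 0.03)):
        if not 0 < params.get(key, default) < 1:
            problems.append(f"{key} must be in (0, 1)")
    return problems


def check_labels(labels, class_num):
    """True if every label is an integer in {0, 1, ..., class_num - 1}."""
    return all(isinstance(y, int) and 0 <= y < class_num for y in labels)


params = {
    "inputTableName": "smart_multiclass_input",
    "modelName": "my_ps_smart_model",  # hypothetical model name
    "featureColNames": "features",
    "labelColName": "label",
    "objective": "multi:softprob",
    "maxDepth": 25,
    "shrinkage": 1.0,
}
problems = check_params(params)
for problem in problems:
    print(problem)
```

For the sample dictionary, the check reports the missing classNum parameter and the out-of-range maxDepth and shrinkage values.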

PS-SMART model deployment instructions

To deploy the model generated by the PS-SMART component as an online service, you must add the General-purpose Model Export component downstream of the PS-SMART component. You can configure the component parameters in the same way as for other PS-series components. For more information, see General-purpose Model Export.

Upon successful execution, you can go to the PAI-EAS Model Online Service page to deploy the model service. For more information, see Deploy a service in the console.