Parameter Server (PS) is designed for large-scale offline and online training tasks. Scalable Multiple Additive Regression Tree (SMART) is an iterative algorithm that is based on Gradient Boosting Decision Tree (GBDT) and implemented on PS. PS-SMART supports training tasks with tens of billions of samples and hundreds of thousands of features, and can run on thousands of nodes. It also supports multiple data formats and optimization techniques such as histogram approximation.
Limits
The input data for the PS-SMART multiclass classification component must meet the following requirements:
The target column supports only numeric types. If the data in a MaxCompute table is of the STRING type, you must convert its data type. For example, if the classification targets are Good/Medium/Bad strings, you must convert them to 0/1/2.
If the data is in KV format, feature IDs must be positive integers and feature values must be real numbers. If feature IDs are strings, you can use the serialize component to serialize them. If feature values are categorical strings, you can perform feature engineering, such as feature discretization.
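The two input requirements above can be sketched in Python. This is a minimal illustration only, not part of PS-SMART: the helper names `encode_labels` and `to_kv` and the sample values are hypothetical, and in practice the conversion would more likely be done in MaxCompute SQL or with the serialize component.

```python
def encode_labels(labels, classes):
    """Map string class names to the integers 0..n-1, in the order given by `classes`."""
    mapping = {name: idx for idx, name in enumerate(classes)}
    return [mapping[label] for label in labels]


def to_kv(features):
    """Render {feature_id: value} as a sparse KV string such as '1:0.3 3:0.9'.

    Feature IDs must be positive integers and feature values must be real numbers.
    """
    parts = []
    for fid, value in sorted(features.items()):
        if not isinstance(fid, int) or isinstance(fid, bool) or fid <= 0:
            raise ValueError(f"feature ID must be a positive integer: {fid!r}")
        parts.append(f"{fid}:{float(value)}")
    return " ".join(parts)


if __name__ == "__main__":
    # Good/Medium/Bad become 0/1/2, as in the example above.
    print(encode_labels(["Good", "Bad", "Medium"], ["Good", "Medium", "Bad"]))
    # KV pairs are separated by spaces, key and value by a colon.
    print(to_kv({3: 0.9, 1: 0.3}))
```

The class order passed to `encode_labels` decides which integer each class receives, so the same order must be reused when the predictions are interpreted.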
Notes
Note the following when you use the PS-SMART multiclass classification component:
The PS-SMART multiclass classification component supports tasks with hundreds of thousands of features. However, these tasks consume significant resources and run slowly. GBDT algorithms are well suited to training directly on continuous features. Therefore, apply one-hot encoding only to categorical features (to filter out low-frequency values), and do not discretize continuous numerical features.
The PS-SMART algorithm introduces randomness. Sources include the data and feature sampling controlled by data_sample_ratio and fea_sample_ratio, the histogram approximation used by the algorithm, and the random order in which local sketches are merged into a global sketch. Although the tree structures may differ when multiple workers run in a distributed manner, the model performance is theoretically similar. It is normal to obtain different results across multiple runs with the same data and parameters.
You can increase the number of computing cores to accelerate training. However, the PS-SMART algorithm starts training only after all servers acquire the necessary resources. Therefore, requesting more resources when the cluster is busy increases the waiting time.
Configure the component
Method 1: Use the GUI
In the Designer workflow, add the PS-SMART multiclass classification component and configure its parameters in the right-side pane:
Parameter type | Parameter | Description |
Fields setting | Is sparse format | Specifies whether the input data is in sparse format. In sparse format, KV pairs are separated by spaces, and each key is separated from its value by a colon (:). For example: 1:0.3 3:0.9. |
Fields setting | Feature columns | The feature columns from the input table used for training. If the input data is in dense format, you can select only numeric (BIGINT or DOUBLE) columns. If the input data is in sparse KV format and both the keys and values are numeric, you can select only STRING columns. |
Fields setting | Label column | The label column of the input table. The column can be of the STRING type or a numeric type, but the stored values must be numeric. For example, 0 and 1 in binary classification. |
Fields setting | Weight column | The column used to weight each sample row. Only numeric types are supported. |
Parameters setting | Number of classes | The number of classes for multiclass classification. If the number of classes is n, the values in the label column must be {0,1,2,...,n-1}. |
Parameters setting | Evaluation metric type | Supported types are multiclass negative log likelihood and multiclass classification error. |
Parameters setting | Number of trees | The number of trees. The value must be a positive integer. The training time is proportional to the number of trees. |
Parameters setting | Maximum tree depth | The maximum depth of a tree. The default value is 5, which allows a maximum of 32 leaf nodes. |
Parameters setting | Data sampling ratio | When each tree is built, a portion of the data is sampled to build a weak learner and accelerate training. |
Parameters setting | Feature sampling ratio | When each tree is built, a portion of the features is sampled to build a weak learner and accelerate training. |
Parameters setting | L1 penalty coefficient | Controls the size of leaf nodes. A larger value results in a more uniform distribution of leaf node sizes. Increase this value if overfitting occurs. |
Parameters setting | L2 penalty coefficient | Controls the size of leaf nodes. A larger value results in a more uniform distribution of leaf node sizes. Increase this value if overfitting occurs. |
Parameters setting | Learning rate | The value must be in the range of (0,1). |
Parameters setting | Approximate sketch precision | The quantile threshold for constructing the sketch. A smaller value results in more buckets. The default value 0.03 is usually appropriate, and manual configuration is not required. |
Parameters setting | Minimum split loss change | The minimum loss change required to split a node. A larger value makes splitting more conservative. |
Parameters setting | Number of features | The number of features or the maximum feature ID. You must configure this parameter to estimate resource usage. |
Parameters setting | Global bias | The initial prediction value for all samples. |
Parameters setting | Random number generator seed | The random number seed. The value must be an integer. |
Parameters setting | Feature importance type | The type of feature importance to calculate. The corresponding PAI parameter, featureImportanceType, defaults to gain. |
Execution tuning | Number of cores | The number of cores. By default, the system allocates cores automatically. |
Execution tuning | Memory size per core | The memory used by a single core, in MB. Manual configuration is usually not required because the system allocates memory automatically. |
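Two of the descriptions above involve simple arithmetic: a maximum depth of 5 gives at most 32 leaf nodes, and a smaller sketch precision gives more buckets. The sketch below assumes a full binary tree for the leaf bound and treats the bucket count as roughly 1/sketchEps, following the O(1.0/sketchEps) statement in the PAI parameter description. The function names are hypothetical, and the actual bucket count inside PS-SMART may differ by a constant factor.

```python
import math


def max_leaf_nodes(max_depth):
    """Upper bound on leaf nodes for a binary tree of the given depth."""
    if max_depth < 0:
        raise ValueError("max_depth must be non-negative")
    return 2 ** max_depth


def approx_sketch_buckets(sketch_eps):
    """Rough bucket count, assuming about 1/sketch_eps buckets (an order-of-magnitude estimate)."""
    if not 0 < sketch_eps < 1:
        raise ValueError("sketch_eps must be in the range (0,1)")
    return math.ceil(1.0 / sketch_eps)


if __name__ == "__main__":
    print(max_leaf_nodes(5))            # 32, matching the default depth of 5
    print(approx_sketch_buckets(0.03))  # 34 under the 1/eps assumption
    print(approx_sketch_buckets(0.01))  # 100: a smaller value gives more buckets
```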
Method 2: Use PAI commands
You can use a SQL script component to call PAI commands and configure the parameters of the PS-SMART multiclass classification component. For more information, see Scenario 4: Execute PAI commands in a SQL script component.
--Train
-- Note: classNum is marked as required in the parameter table but is not set in this example. Add -DclassNum with the number of classes in your data.
PAI -name ps_smart
    -project algo_public
    -DinputTableName="smart_multiclass_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545859_2"
    -DoutputImportanceTableName="pai_temp_24515_545859_3"
    -DlabelColName="label"
    -DfeatureColNames="features"
    -DenableSparse="true"
    -Dobjective="multi:softprob"
    -Dmetric="mlogloss"
    -DfeatureImportanceType="gain"
    -DtreeCount="5"
    -DmaxDepth="5"
    -Dshrinkage="0.3"
    -Dl2="1.0"
    -Dl1="0"
    -Dlifecycle="3"
    -DsketchEps="0.03"
    -DsampleRatio="1.0"
    -DfeatureRatio="1.0"
    -DbaseScore="0.5"
    -DminSplitLoss="0";
--Predict
PAI -name prediction
    -project algo_public
    -DinputTableName="smart_multiclass_input"
    -DmodelName="xlab_m_pai_ps_smart_bi_545859_v0"
    -DoutputTableName="pai_temp_24515_545860_1"
    -DfeatureColNames="features"
    -DappendColNames="label,features"
    -DenableSparse="true"
    -DkvDelimiter=":"
    -Dlifecycle="28";
Module | Parameter | Required | Default value | Description |
Data parameters | featureColNames | Yes | None | The feature columns from the input table used for training. If the input table is in dense format, you can select only numeric (BIGINT or DOUBLE) columns. If the input table is in sparse KV format and both the keys and values are numeric, you can select only STRING columns. |
Data parameters | labelColName | Yes | None | The label column of the input table. The column can be of the STRING type or a numeric type, but the stored values must be numeric. For example, for multiclass classification, the values can be {0,1,2,...,n-1}, where n is the number of classes. |
Data parameters | weightCol | No | None | The column used to weight each sample row. Only numeric types are supported. |
Data parameters | enableSparse | No | false | Specifies whether the data is in sparse format. Valid values are {true,false}. In sparse format, KV pairs are separated by spaces, and each key is separated from its value by a colon (:). For example: 1:0.3 3:0.9. |
Data parameters | inputTableName | Yes | None | The name of the input table. |
Data parameters | modelName | Yes | None | The name of the output model. |
Data parameters | outputImportanceTableName | No | None | The name of the output table for feature importance. |
Data parameters | inputTablePartitions | No | None | The partitions of the input table to use, in the format ds=1/pt=1. |
Data parameters | outputTableName | No | None | The output table in MaxCompute. It is in binary format and cannot be read directly. It can only be accessed by the SMART prediction component. |
Data parameters | lifecycle | No | 3 | The lifecycle of the output table. |
Algorithm parameters | classNum | Yes | None | The number of classes for multiclass classification. If the number of classes is n, the values in the label column must be {0,1,2,...,n-1}. |
Algorithm parameters | objective | Yes | None | The type of objective function. For multiclass classification training, set this to multi:softprob. |
Algorithm parameters | metric | No | None | The evaluation metric type for the training dataset. The output is written to stdout in the Coordinator area of the Logview file. Supported types are multiclass negative log likelihood (mlogloss, as in the example command) and multiclass classification error. |
Algorithm parameters | treeCount | No | 1 | The number of trees. The training time is proportional to this value. |
Algorithm parameters | maxDepth | No | 5 | The maximum depth of a tree. The value must be in the range of 1 to 20. |
Algorithm parameters | sampleRatio | No | 1.0 | The data sampling ratio. The value must be in the range of (0,1]. A value of 1.0 means no sampling. |
Algorithm parameters | featureRatio | No | 1.0 | The feature sampling ratio. The value must be in the range of (0,1]. A value of 1.0 means no sampling. |
Algorithm parameters | l1 | No | 0 | The L1 penalty coefficient. A larger value results in a more uniform distribution of leaf nodes. Increase this value if overfitting occurs. |
Algorithm parameters | l2 | No | 1.0 | The L2 penalty coefficient. A larger value results in a more uniform distribution of leaf nodes. Increase this value if overfitting occurs. |
Algorithm parameters | shrinkage | No | 0.3 | The learning rate. The value must be in the range of (0,1). |
Algorithm parameters | sketchEps | No | 0.03 | The quantile threshold for constructing the sketch. The number of buckets is O(1.0/sketchEps), so a smaller value results in more buckets. The default value is usually appropriate, and manual configuration is not required. The value must be in the range of (0,1). |
Algorithm parameters | minSplitLoss | No | 0 | The minimum loss change required to split a node. A larger value makes splitting more conservative. |
Algorithm parameters | featureNum | No | None | The number of features or the maximum feature ID. You must configure this parameter to estimate resource usage. |
Algorithm parameters | baseScore | No | 0.5 | The initial prediction value for all samples. |
Algorithm parameters | randSeed | No | None | The random number seed. The value must be an integer. |
Algorithm parameters | featureImportanceType | No | gain | The type of feature importance to calculate. |
Tuning parameters | coreNum | No | System allocated | The number of cores. A larger value makes the algorithm run faster. |
Tuning parameters | memSizePerCore | No | System allocated | The memory used by each core, in MB. |
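The required flags and value ranges in the table above can be checked before a job is submitted. The following Python sketch is an illustration under stated assumptions, not an official client: `validate_params` and `build_train_command` are hypothetical helpers, the checks cover only the ranges documented above, and the `classNum` value of 3 is assumed for the example.

```python
# Parameters marked "Yes" in the Required column of the table.
REQUIRED = ("inputTableName", "modelName", "featureColNames",
            "labelColName", "classNum", "objective")

# (low, high, low_inclusive, high_inclusive), taken from the descriptions above.
RANGES = {
    "maxDepth": (1, 20, True, True),
    "sampleRatio": (0.0, 1.0, False, True),
    "featureRatio": (0.0, 1.0, False, True),
    "shrinkage": (0.0, 1.0, False, False),
    "sketchEps": (0.0, 1.0, False, False),
}


def validate_params(params):
    """Return a list of problems found in a dict of training parameters."""
    problems = [f"missing required parameter: {name}"
                for name in REQUIRED if name not in params]
    for name, (low, high, low_ok, high_ok) in RANGES.items():
        if name not in params:
            continue  # optional parameter, so the documented default applies
        value = float(params[name])
        above = value >= low if low_ok else value > low
        below = value <= high if high_ok else value < high
        if not (above and below):
            problems.append(f"{name}={params[name]} is out of range")
    return problems


def build_train_command(params):
    """Render the parameters in the PAI command layout used in this topic."""
    lines = ["PAI -name ps_smart", "    -project algo_public"]
    lines += [f'    -D{name}="{value}"' for name, value in params.items()]
    return "\n".join(lines) + ";"


if __name__ == "__main__":
    params = {
        "inputTableName": "smart_multiclass_input",
        "modelName": "xlab_m_pai_ps_smart_bi_545859_v0",
        "labelColName": "label",
        "featureColNames": "features",
        "enableSparse": "true",
        "classNum": 3,  # assumed value for illustration only
        "objective": "multi:softprob",
        "maxDepth": 5,
        "shrinkage": 0.3,
    }
    problems = validate_params(params)
    print("\n".join(problems) if problems else build_train_command(params))
```

Returning a list of problems instead of raising on the first one lets a caller report every misconfigured flag at once, which is convenient when a job would otherwise wait in the resource queue before failing.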
PS-SMART model deployment instructions
To deploy the model generated by the PS-SMART component as an online service, you must add the General-purpose Model Export component downstream of the PS-SMART component. You can configure the component parameters in the same way as for other PS-series components. For more information, see General-purpose Model Export.
Upon successful execution, you can go to the PAI-EAS Model Online Service page to deploy the model service. For more information, see Deploy a service in the console.