GBDT binary classification training V2

Gradient Boosting Decision Trees (GBDT) Binary Classification V2 trains a binary classification model by combining multiple weak decision trees into a single strong learner. It incorporates XGBoost's second-order optimization and LightGBM's histogram approximation, making it fast, accurate, and interpretable.

This component runs on MaxCompute only.

How it works

The model is an ensemble of CART decision trees, where each new tree corrects the residual errors of the ensemble built so far. The process follows the recursive gradient boosting formula:

$$F_m(x) = F_{m-1}(x) + \rho_m \, h(x; \theta_m)$$

In this formula, $h(x; \theta_m)$ is a CART decision tree, $\theta_m$ are the tree's parameters, and $\rho_m$ is the step size. Each tree is fitted to reduce the objective function given the predictions of the trees before it, and the final model is the sum of all the trees.
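
The recursion described above (each new tree is added to the running ensemble, scaled by a step size) can be illustrated with a small sketch. This is not the component's implementation: it assumes plain first-order gradient boosting on log loss, one-split trees (stumps) on a single numerical feature, and hypothetical names such as `fit_stump` and `boost`. The component itself is described as using second-order optimization and histogram approximation, which the sketch omits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def log_loss(ys, scores):
    total = 0.0
    for y, s in zip(ys, scores):
        p = sigmoid(s)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(ys)


def boost(xs, ys, num_trees=20, learning_rate=0.5):
    """Return the initial score, the fitted trees, and the loss after each round."""
    prior = sum(ys) / len(ys)
    f0 = math.log(prior / (1 - prior))  # assumption: start from the log-odds of the prior
    scores = [f0] * len(xs)
    trees, losses = [], [log_loss(ys, scores)]
    for _ in range(num_trees):
        # Negative gradient of log loss with respect to the current score.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # The recursive update: previous ensemble plus the scaled new tree.
        scores = [s + learning_rate * tree(x) for s, x in zip(scores, xs)]
        losses.append(log_loss(ys, scores))
    return f0, trees, losses


def predict_proba(f0, trees, learning_rate, x):
    return sigmoid(f0 + learning_rate * sum(tree(x) for tree in trees))


# Toy data with labels 0 and 1; the values are made up for illustration.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
f0, trees, losses = boost(xs, ys)
print(f"log loss before boosting: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

The learning rate plays the role of the step size here: a smaller value shrinks each tree's contribution, so more rounds are needed to reach the same training loss.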

Label requirement: Binary class labels must be 0 and 1.
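
If the raw label column uses other values (for example, strings), it needs to be mapped to 0 and 1 before training. The helper below is a minimal sketch of such a preprocessing step outside the component; the function name `to_binary_labels` and its arguments are hypothetical.

```python
def to_binary_labels(labels, positive_label):
    """Map raw labels to 0/1, treating `positive_label` as class 1."""
    distinct = set(labels)
    if len(distinct) > 2:
        raise ValueError(f"expected at most two classes, found {len(distinct)}")
    if positive_label not in distinct:
        raise ValueError(f"positive label {positive_label!r} is not present")
    return [1 if label == positive_label else 0 for label in labels]


print(to_binary_labels(["yes", "no", "yes"], positive_label="yes"))
```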

Supported input formats

The component accepts two input formats:

| Format | Column selection | Data format |
| --- | --- | --- |
| Multiple feature columns (default) | Multiple columns of double, bigint, or string type | Numerical features are binned; categorical features use a many-vs-many splitting strategy (no one-hot encoding needed) |
| Sparse vector format | One string column | Key-value pairs separated by spaces; key and value separated by a colon. Example: 1:0.3 3:0.9 |
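
To make the sparse vector format concrete, the sketch below parses a string such as `1:0.3 3:0.9` into a mapping from feature ID to value, and derives the value that the Number of features parameter expects (max feature ID + 1). It assumes keys are non-negative integer feature IDs; the function names are hypothetical and are not part of the component.

```python
def parse_sparse_vector(text):
    """Parse 'key:value key:value' into {feature_id: value}."""
    features = {}
    for token in text.split():  # pairs are separated by spaces
        key, value = token.split(":", 1)  # key and value are separated by a colon
        features[int(key)] = float(value)
    return features


def infer_number_of_features(rows):
    """Return max feature ID + 1 across all rows (0 if no features appear)."""
    max_id = -1
    for row in rows:
        parsed = parse_sparse_vector(row)
        if parsed:
            max_id = max(max_id, max(parsed))
    return max_id + 1


rows = ["1:0.3 3:0.9", "0:1.0"]
print(parse_sparse_vector(rows[0]))
print(infer_number_of_features(rows))
```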

Configure the component

Pair this component with GBDT Binary Classification Prediction V2 to score new data. After training, deploy the model as an online service. For details, see Deploy a pipeline as an online service.

Input ports

| Port | Required | Recommended upstream component |
| --- | --- | --- |
| Input Data | Yes | Read Table |

Fields setting

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| Use Sparse Vector Format | No | No | Enable this if your feature data is in sparse vector format (key:value pairs). When enabled, select exactly one string column as the feature column. |
| Select Feature Columns | Yes | | Feature columns used for training. In non-sparse mode, select columns of double, bigint, or string type. In sparse mode, select one string column. |
| Select Categorical Feature Columns | No | | Columns to treat as categorical features. All other selected feature columns are treated as numerical. Only applies in non-sparse mode. |
| Select Label Column | Yes | | The label column. Values must be 0 or 1. |
| Select Weight Column | No | | An optional column of sample weights for training. |

Parameter setting

| Parameter | Default | Valid values | Description |
| --- | --- | --- | --- |
| Number of Trees | 1 | Positive integer | The number of trees in the ensemble. More trees improve accuracy but increase training time. Use alongside Learning Rate: smaller learning rates generally require more trees. |
| Maximum Number of Leaf Nodes | 32 | Positive integer | Maximum leaf nodes per tree. Larger values let each tree capture more complex patterns but increase the risk of overfitting. |
| Learning Rate | 0.05 | Float | Shrinks each tree's contribution. Lower values make training more conservative and reduce overfitting risk, but require more trees to reach the same accuracy. |
| Ratio of Samples | 0.6 | (0, 1] | Fraction of training samples used per tree. Values below 1.0 introduce randomness (stochastic gradient boosting), which reduces variance and helps prevent overfitting. |
| Ratio of Features | 0.6 | (0, 1] | Fraction of features considered per tree. Values below 1.0 increase diversity among trees and reduce overfitting, at the cost of some accuracy. |
| Minimum Number of Samples in a Leaf Node | 500 | Positive integer | Minimum samples required in a leaf node. Higher values prevent the model from fitting to very small data subsets, helping to control overfitting. |
| Maximum Number of Bins | 32 | Positive integer | Maximum bins when discretizing continuous features. More bins produce more precise splits but increase training cost. Equivalent to 1 / Sketch-based Approximate Precision in PS-SMART. |
| Maximum Number of Distinct Categories | 1024 | Positive integer | Maximum distinct categories for categorical features. Categories exceeding this rank (sorted by frequency) are merged into one bucket. More categories allow finer splits but increase overfitting risk and training cost. |
| Number of features | Auto-calculated | Positive integer | For sparse vector format only. Set to max feature ID + 1. Leave blank to let the system scan the data automatically. |
| Initial Prediction | Auto-calculated | Float | The prior probability of positive samples. Leave blank to let the system estimate from the data. |
| Random Seed | 0 | Integer | Seed for random sampling. Set a fixed value for reproducible runs. |
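
The effect of Maximum Number of Distinct Categories can be pictured with a short sketch: categories are ranked by frequency, the most frequent ones are kept, and the rest share one bucket. This is an illustration under assumptions, not the component's code. The function name and the `__OTHER__` placeholder are hypothetical, and whether the merged bucket itself counts toward the limit is not stated in this document.

```python
from collections import Counter


def merge_rare_categories(values, max_categories, other="__OTHER__"):
    """Keep the `max_categories` most frequent categories; merge the rest."""
    counts = Counter(values)
    # most_common sorts by descending frequency; ties keep first-seen order.
    kept = {category for category, _ in counts.most_common(max_categories)}
    return [value if value in kept else other for value in values]


values = ["a", "a", "a", "b", "b", "c", "d"]
print(merge_rare_categories(values, max_categories=2))
```

A lower limit produces a coarser categorical feature, which matches the trade-off described in the table: fewer candidate splits and less risk of fitting to rare categories.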

Tuning

These parameters control compute resources and do not affect model accuracy.

| Parameter | Default | Description |
| --- | --- | --- |
| Choose Running Mode | MaxCompute | Running environment. Valid values: MaxCompute, Flink. |
| Number of Instances | Auto-calculated | Number of compute instances. Adjust from the auto-generated value if jobs fail or are slow. |
| Memory Per Instance | Auto-calculated | Memory per instance, in MB. Adjust from the auto-generated value if jobs run out of memory. |
| Num of Threads | 1 | Threads per instance. Multi-threading increases resource use; performance gains are non-linear and may decrease if you exceed the optimal thread count. |

Output ports

| Port | Data type | Content | Recommended downstream component |
| --- | --- | --- | --- |
| Output Model | MaxCompute table | Trained GBDT model, ready for prediction or online deployment | GBDT Binary Classification Prediction V2 |
| Output Feature Importance | MaxCompute table | Feature importance scores using the gain metric by default. | View directly; cannot connect to PAI command-based components such as GBDT Feature Importance V2. |

The area under the curve (AUC) is the default evaluation metric. After the job completes, view the AUC values in the worker log.
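
For readers who want to sanity-check the logged value on a small sample, AUC can be computed by hand as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below uses that pairwise definition; it is quadratic in the sample count and is meant for illustration only, not as the component's evaluation code.

```python
def auc(labels, scores):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs are ordered correctly: 0.75
```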

Migrate from PS-SMART Binary Classification Training

If you previously used PS-SMART Binary Classification Training, use the following table to map parameters to their GBDT Binary Classification V2 equivalents.

| PS-SMART parameter | GBDT V2 equivalent | Notes |
| --- | --- | --- |
| Use Sparse Format | Use Sparse Vector Format | |
| Feature Columns | Select Feature Columns | |
| Label Column | Select Label Column | |
| Weight Column | Select Weight Column | |
| Evaluation Indicator Type | Not supported | AUC is used by default. View metrics in the worker log. |
| Trees | Number of Trees | |
| Maximum Tree Depth | Maximum Number of Leaf Nodes | Maximum Number of Leaf Nodes = 2 ^ (Maximum Tree Depth - 1) |
| Data Sampling Fraction | Ratio of Samples | |
| Feature Sampling Fraction | Ratio of Features | |
| L1 Penalty Coefficient | Not supported | |
| L2 Penalty Coefficient | Not supported | |
| Learning Rate | Learning Rate | |
| Sketch-based Approximate Precision | Maximum Number of Bins | Maximum Number of Bins = 1 / Sketch-based Approximate Precision |
| Minimum Split Loss Change | Minimum Number of Samples in a Leaf Node | Cannot be converted directly. Both parameters help prevent overfitting. |
| Features | Number of features | |
| Global Offset | Initial Prediction | |
| Random Seed | Random Seed | |
| Feature Importance Type | Not supported | Defaults to gain. |
| Cores | Number of Instances | Values are not equivalent. Start from the system-generated value and adjust. |
| Memory Size per Core | Memory Per Instance | Values are not equivalent. Start from the system-generated value and adjust. |
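
The two conversion formulas in the table can be applied mechanically. The helper below is a small sketch with a hypothetical name; rounding 1 / precision to the nearest integer is an assumption, since the table does not say how non-integer results should be handled.

```python
def convert_ps_smart_params(max_tree_depth, sketch_precision):
    """Translate two PS-SMART settings into their GBDT V2 counterparts."""
    if max_tree_depth < 1:
        raise ValueError("Maximum Tree Depth must be at least 1")
    if not 0 < sketch_precision <= 1:
        raise ValueError("Sketch-based Approximate Precision must be in (0, 1]")
    return {
        # Maximum Number of Leaf Nodes = 2 ^ (Maximum Tree Depth - 1)
        "Maximum Number of Leaf Nodes": 2 ** (max_tree_depth - 1),
        # Maximum Number of Bins = 1 / Sketch-based Approximate Precision
        # (rounded to an integer here, which is an assumption)
        "Maximum Number of Bins": round(1 / sketch_precision),
    }


print(convert_ps_smart_params(max_tree_depth=6, sketch_precision=0.03125))
```

With these example inputs, the result matches the GBDT V2 defaults listed earlier: 32 leaf nodes and 32 bins.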

What's next