GBDT Binary Classification Training V2 - Platform for AI (PAI)

Gradient Boosting Decision Trees (GBDT) Binary Classification V2 trains a binary classification model by combining multiple weak decision trees into a single strong learner. It incorporates XGBoost's second-order optimization and LightGBM's histogram approximation, making it fast, accurate, and interpretable.

This component runs on MaxCompute only.

How it works

The model is an ensemble of CART decision trees, where each tree corrects the residual errors of the ensemble built before it. Training follows the recursive gradient boosting formula:

$$F_m(x) = F_{m-1}(x) + \rho_m \, h(x; \theta_m)$$

In this formula, $h(x; \theta_m)$ is a CART decision tree, $\theta_m$ are the tree's parameters, and $\rho_m$ is the step size. Each new tree is fit to improve the objective function given the output of the trees before it, and the final model is the sum of all the trees.
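
The recursive update described above (previous model, plus a step size times a newly fitted tree) can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the component's implementation: it uses one numeric feature, one-split trees as a stand-in for CART, a fixed step size, and first-order gradients of the log loss rather than the second-order optimization the component incorporates. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, targets):
    """Fit a one-split regression tree (a tiny stand-in for CART) by least squares."""
    mean = sum(targets) / len(targets)
    best_sse = None
    best_stump = (float("inf"), mean, mean)  # fallback: no split, constant prediction
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best_sse is None or sse < best_sse:
            best_sse, best_stump = sse, (threshold, left_mean, right_mean)
    return best_stump


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def train(xs, ys, num_trees=20, step_size=0.5):
    """Return (initial_score, step_size, stumps). Assumes both classes are present."""
    prior = sum(ys) / len(ys)
    initial_score = math.log(prior / (1.0 - prior))
    scores = [initial_score] * len(xs)
    stumps = []
    for _ in range(num_trees):
        # Negative gradient of the log loss with respect to the current score.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Recursive update: new score = previous score + step size * new tree.
        scores = [s + step_size * stump_predict(stump, x) for s, x in zip(scores, xs)]
    return initial_score, step_size, stumps


def predict_proba(model, x):
    initial_score, step_size, stumps = model
    score = initial_score + sum(step_size * stump_predict(s, x) for s in stumps)
    return sigmoid(score)
```

For example, training on `xs = [1.0, 2.0, ..., 8.0]` with labels `0` for the first four values and `1` for the rest yields a model whose predicted probability is low for `x = 2.0` and high for `x = 7.0`.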

Label requirement: Binary class labels must be 0 and 1.
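
If the source labels are not already 0 and 1, they need to be recoded before training. The sketch below shows one way to do that in plain Python. The helper name `encode_binary_labels` and the choice of which value counts as positive are assumptions for illustration; in practice the recoding would more likely be done in SQL on the MaxCompute table.

```python
def encode_binary_labels(labels, positive_label):
    """Map a two-valued label column to 0/1, with positive_label mapped to 1."""
    distinct = set(labels)
    if len(distinct) != 2:
        raise ValueError(f"expected exactly 2 distinct labels, found {len(distinct)}")
    if positive_label not in distinct:
        raise ValueError(f"positive label {positive_label!r} not present in data")
    return [1 if label == positive_label else 0 for label in labels]
```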

Supported input formats

The component accepts two input formats:

Format	Column selection	Data format
Multiple feature columns (default)	Multiple columns of `double`, `bigint`, or `string` type	Numerical features are binned; categorical features use a many-vs-many splitting strategy (no one-hot encoding needed)
Sparse vector format	One `string` column	Key-value pairs separated by spaces; key and value separated by a colon. Example: `1:0.3 3:0.9`
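
To make the sparse vector format concrete, the following sketch parses one such string into a dictionary. It assumes keys are integer feature IDs and values are floats; the function name `parse_sparse_vector` is hypothetical and is not part of the component.

```python
def parse_sparse_vector(text):
    """Parse 'key:value' pairs separated by spaces into {feature_id: value}."""
    features = {}
    for token in text.split():
        key, sep, value = token.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in token {token!r}")
        features[int(key)] = float(value)
    return features
```

Calling `parse_sparse_vector("1:0.3 3:0.9")` gives a dictionary with entries for feature IDs 1 and 3; features that are not listed are simply absent.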

Configure the component

Pair this component with GBDT Binary Classification Prediction V2 to score new data. After training, deploy the model as an online service. For details, see Deploy a pipeline as an online service.

Input ports

Port	Required	Recommended upstream component
Input Data	Yes	Read Table

Fields setting

Parameter	Required	Default	Description
Use Sparse Vector Format	No	No	Enable this if your feature data is in sparse vector format (`key:value` pairs). When enabled, select exactly one `string` column as the feature column.
Select Feature Columns	Yes	—	Feature columns used for training. In non-sparse mode, select columns of `double`, `bigint`, or `string` type. In sparse mode, select one `string` column.
Select Categorical Feature Columns	No	—	Columns to treat as categorical features. All other selected feature columns are treated as numerical. Only applies in non-sparse mode.
Select Label Column	Yes	—	The label column. Values must be `0` or `1`.
Select Weight Column	No	—	An optional column of sample weights for training.

Parameter setting

Parameter	Default	Valid values	Description
Number of Trees	1	Positive integer	The number of trees in the ensemble. More trees usually improve accuracy, up to the point where the model starts to overfit, but increase training time. Use alongside Learning Rate: smaller learning rates generally require more trees.
Maximum Number of Leaf Nodes	32	Positive integer	Maximum leaf nodes per tree. Larger values let each tree capture more complex patterns but increase the risk of overfitting.
Learning Rate	0.05	Float	Shrinks each tree's contribution. Lower values make training more conservative and reduce overfitting risk, but require more trees to reach the same accuracy.
Ratio of Samples	0.6	(0, 1]	Fraction of training samples used per tree. Values below 1.0 introduce randomness (stochastic gradient boosting), which reduces variance and helps prevent overfitting.
Ratio of Features	0.6	(0, 1]	Fraction of features considered per tree. Values below 1.0 increase diversity among trees and reduce overfitting, at the cost of some accuracy.
Minimum Number of Samples in a Leaf Node	500	Positive integer	Minimum samples required in a leaf node. Higher values prevent the model from fitting to very small data subsets, helping to control overfitting.
Maximum Number of Bins	32	Positive integer	Maximum bins when discretizing continuous features. More bins produce more precise splits but increase training cost. Equivalent to `1 / Sketch-based Approximate Precision` in PS-SMART.
Maximum Number of Distinct Categories	1024	Positive integer	Maximum distinct categories for categorical features. Categories exceeding this rank (sorted by frequency) are merged into one bucket. More categories allow finer splits but increase overfitting risk and training cost.
Number of features	Auto-calculated	Positive integer	For sparse vector format only. Set to `max feature ID + 1`. Leave blank to let the system scan the data automatically.
Initial Prediction	Auto-calculated	Float	The prior probability of positive samples. Leave blank to let the system estimate from the data.
Random Seed	0	Integer	Seed for random sampling. Set a fixed value for reproducible runs.
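
The two auto-calculated parameters above can be checked by hand before you override them. The sketch below computes `max feature ID + 1` from sparse rows and the fraction of positive labels as a prior probability. It is an illustration only: the helper names are hypothetical, integer feature IDs are assumed, and the way the component itself estimates these values (for example, whether sample weights are taken into account) is not specified here.

```python
def infer_num_features(sparse_rows):
    """Return max feature ID + 1 over rows written as 'id:value id:value ...'."""
    max_id = -1
    for row in sparse_rows:
        for token in row.split():
            feature_id = int(token.split(":", 1)[0])
            max_id = max(max_id, feature_id)
    return max_id + 1


def positive_prior(labels):
    """Return the unweighted fraction of samples whose label is 1."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for label in labels if label == 1) / len(labels)
```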

Tuning

These parameters control compute resources and do not affect model accuracy.

Parameter	Default	Description
Choose Running Mode	MaxCompute	Running environment. Valid values: `MaxCompute`, `Flink`.
Number of Instances	Auto-calculated	Number of compute instances. Adjust from the auto-generated value if jobs fail or are slow.
Memory Per Instance	Auto-calculated	Memory per instance, in MB. Adjust from the auto-generated value if jobs run out of memory.
Num of Threads	1	Threads per instance. Multi-threading increases resource use; performance gains are non-linear and may decrease if you exceed the optimal thread count.

Output ports

Port	Data type	Content	Recommended downstream component
Output Model	MaxCompute table	Trained GBDT model, ready for prediction or online deployment	GBDT Binary Classification Prediction V2
Output Feature Importance	MaxCompute table	Feature importance scores, computed with the `gain` metric by default. View the table directly; this port cannot be connected to PAI command-based components such as GBDT Feature Importance V2.	—

The area under the curve (AUC) is the default evaluation metric. After the job completes, view the AUC values in the worker log.
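
To help interpret the AUC value reported in the log, here is a minimal sketch of what AUC measures: the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, with ties counted as half. This pairwise version is written for clarity rather than speed and is not the component's own implementation; the function name is hypothetical.

```python
def auc(labels, scores):
    """Pairwise AUC for 0/1 labels; ties between a positive and a negative count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

With labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ordered correctly, so the result is 0.75.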

Migrate from PS-SMART Binary Classification Training

If you previously used PS-SMART Binary Classification Training, use the following table to map parameters to their GBDT Binary Classification V2 equivalents.

PS-SMART parameter	GBDT V2 equivalent	Notes
Use Sparse Format	Use Sparse Vector Format	—
Feature Columns	Select Feature Columns	—
Label Column	Select Label Column	—
Weight Column	Select Weight Column	—
Evaluation Indicator Type	Not supported	AUC is used by default. View metrics in the worker log.
Trees	Number of Trees	—
Maximum Tree Depth	Maximum Number of Leaf Nodes	`Maximum Number of Leaf Nodes = 2 ^ (Maximum Tree Depth - 1)`
Data Sampling Fraction	Ratio of Samples	—
Feature Sampling Fraction	Ratio of Features	—
L1 Penalty Coefficient	Not supported	—
L2 Penalty Coefficient	Not supported	—
Learning Rate	Learning Rate	—
Sketch-based Approximate Precision	Maximum Number of Bins	`Maximum Number of Bins = 1 / Sketch-based Approximate Precision`
Minimum Split Loss Change	Minimum Number of Samples in a Leaf Node	Cannot be converted directly. Both parameters help prevent overfitting.
Features	Number of features	—
Global Offset	Initial Prediction	—
Random Seed	Random Seed	—
Feature Importance Type	Not supported	Defaults to `gain`.
Cores	Number of Instances	Values are not equivalent. Start from the system-generated value and adjust.
Memory Size per Core	Memory Per Instance	Values are not equivalent. Start from the system-generated value and adjust.
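
The two numeric conversions in the table can be applied mechanically. The sketch below encodes them in a small helper; the function name and dictionary keys are hypothetical labels chosen for illustration and do not correspond to actual component parameter identifiers. Rounding the bin count to the nearest integer is also an assumption, since the parameter must be a positive integer.

```python
def convert_ps_smart_params(max_tree_depth, sketch_precision):
    """Convert PS-SMART depth and sketch precision to GBDT V2 leaf and bin counts."""
    if max_tree_depth < 1:
        raise ValueError("max_tree_depth must be at least 1")
    if not 0 < sketch_precision <= 1:
        raise ValueError("sketch_precision must be in (0, 1]")
    return {
        # Maximum Number of Leaf Nodes = 2 ^ (Maximum Tree Depth - 1)
        "max_leaf_nodes": 2 ** (max_tree_depth - 1),
        # Maximum Number of Bins = 1 / Sketch-based Approximate Precision
        "max_bins": round(1 / sketch_precision),
    }
```

For instance, a PS-SMART job with a maximum tree depth of 6 and a sketch precision of 0.02 maps to 32 leaf nodes and 50 bins.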

What's next

GBDT Binary Classification Prediction V2 — score new data using the trained model
Deploy a pipeline as an online service — serve the model for real-time inference
PS-SMART Binary Classification Training — the predecessor component, for reference