Feature discretization

Continuous numeric features with many unique values or extreme outliers can reduce model accuracy, especially in classification tasks. Feature Discretization converts continuous numeric columns into discrete bins, making them compatible with algorithms that expect categorical inputs and reducing the influence of outliers.

The component works only with dense features of numeric data types. Sparse features are automatically filtered out.

Supported discretization methods

Unsupervised methods

These methods partition values based on the data distribution, without using label information.

Method	How it works	Default
Equal width discretization	Divides the value range into bins of equal width. Each bin covers the same range of values.	Yes
Equal frequency discretization	Places cut points so that each bin contains approximately the same number of data points. Reduces the influence of outliers, because an extreme value lands in an end bin instead of stretching the bin widths.	No
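The difference between the two unsupervised methods can be sketched in a few lines of Python. This is an illustrative approximation, not the component's implementation: the function names are hypothetical, bin indices are 0-based here, and the way the component handles boundary values and ties is an assumption.

```python
from bisect import bisect_right


def equal_width_edges(values, bins):
    """Return interior cut points that split [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def equal_frequency_edges(values, bins):
    """Return interior cut points so each bin holds roughly len(values) / bins points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // bins] for i in range(1, bins)]


def assign_bins(values, edges):
    """Map each value to the 0-based index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # 100 is an outlier

# Equal width: the outlier stretches the range, so 1..9 all share one bin.
width_bins = assign_bins(data, equal_width_edges(data, 3))

# Equal frequency: cut points follow the sorted data, so bins stay balanced.
freq_bins = assign_bins(data, equal_frequency_edges(data, 3))

print(width_bins)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
print(freq_bins)   # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
```

With equal width, the single outlier leaves the middle bin empty and collapses the other nine values into one bin. Equal frequency keeps the bins roughly balanced.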

Supervised methods

These methods use label information to find split points. The algorithm traverses the candidate split points of each feature and evaluates the Gini gain or entropy gain of each one. Because this is a full traversal of the data, supervised discretization takes significantly longer than unsupervised methods.

The label column must be of type ENUM, STRING, or BIGINT. The number of bins produced by supervised methods is not controlled by the maxBins parameter.

Method	How it works
Gini gain-based discretization	Finds split points that minimize Gini impurity at each step.
Entropy gain-based discretization	Finds split points that maximize information gain at each step.
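The following Python sketch shows a single step of such a traversal: every midpoint between adjacent distinct values is tried as a threshold, and the one with the largest impurity reduction wins. It is a minimal illustration under stated assumptions. The function names are hypothetical, and how the component repeats the step or decides when to stop splitting is not described here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels, impurity):
    """Try every candidate threshold; return (threshold, gain) with the largest gain."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    sorted_labels = [y for _, y in pairs]
    parent = impurity(sorted_labels)
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        left, right = sorted_labels[:i], sorted_labels[i:]
        weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        gain = parent - weighted
        if gain > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best


x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = ["a", "a", "a", "b", "b", "b"]

entropy_threshold, entropy_gain = best_split(x, y, entropy)
gini_threshold, gini_gain = best_split(x, y, gini)
```

On this toy data both criteria pick the threshold 6.5, which separates the two classes perfectly. Each call scans all candidate thresholds and recomputes the impurity on both sides, which illustrates why supervised discretization costs more than the unsupervised methods.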

Choose a discretization method

Scenario	Recommended method
Column has too many unique values to model effectively	Equal width or equal frequency
Values have extreme outliers that distort the model	Equal frequency (spreads values evenly, reducing outlier influence)
You have label data and want splits to reflect class boundaries	Gini gain-based or entropy gain-based
Need Weight of Evidence (WOE) metrics for credit scoring or risk modeling	Use the Binning component instead

Configure the component

Method 1: Configure on the pipeline page (recommended)

Configure the Feature Discretization component in Machine Learning Designer (formerly Machine Learning Studio) on the pipeline page.

Fields Setting tab

Parameter	Description
Discrete features	The features to discretize.
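Label column	(Optional for unsupervised methods) The label column. Supervised methods rely on label information, so specify this column when you use Gini gain-based or entropy gain-based discretization. When specified, x-y histograms showing the relationship between each feature and the label are available in the output.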

Parameters Setting tab

Parameter	Description
Discretization method	The method to use. Valid values: Equal Width Discretization, Equal Frequency Discretization, Gini Gain-based Discretization, Entropy Gain-based Discretization. Default: Equal Width Discretization.
Discretization interval	The number of bins. Must be an integer greater than 1. Does not control the number of bins produced by supervised methods.

Tuning tab

Parameter	Description
Cores	The number of cores for computation. Must be a positive integer.
Memory size per core	The memory allocated to each core.

Method 2: Use PAI commands

Call PAI commands through the SQL Script component. For more information, see SQL Script.

PAI -name fe_discrete_runner_1 -project algo_public
   -DdiscreteMethod=SameFrequecy
   -Dlifecycle=28
   -DmaxBins=5
   -DinputTable=pai_dense_10_1
   -DdiscreteCols=nr_employed
   -DoutputTable=pai_temp_2262_20382_1
   -DmodelTable=pai_temp_2262_20382_2;

Parameters

Parameter	Required	Default	Description
`inputTable`	Yes	—	The input table name.
`inputTablePartitions`	No	All partitions	Partitions to use for training. Format: `Partition_name=value`. For multi-level partitions: `name1=value1/name2=value2`. Separate multiple partitions with commas.
`outputTable`	Yes	—	The output table containing discretized values.
`discreteCols`	Yes	`""`	The features to discretize. Sparse features are automatically filtered.
`labelCol`	No	—	The label column. Enables x-y histograms in the output.
`discreteMethod`	No	`Isometric Discretization`	The discretization method. Valid values: `Isometric Discretization` (equal width), `Isofrequecy Discretization` (equal frequency), `Gini-gain-based Discretization`, `Entropy-gain-based Discretization`.
`maxBins`	No	`100`	The number of bins. Must be an integer greater than 1. Not applicable to supervised methods.
`lifecycle`	No	`7`	The lifecycle of the output table in days. Must be a positive integer.
`coreNum`	No	Determined by the system	The number of cores. Used together with `memSizePerCore`. Must be a positive integer.
`memSizePerCore`	No	Determined by the system	The memory size per core, in MB. Must be a positive integer.
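When the command is generated from a script, the parameters in the table can be assembled programmatically. The helper below is a hypothetical sketch: `build_pai_command` is not part of any PAI SDK, and it only formats text for pasting into the SQL Script component. It assumes values contain no spaces or special characters, so quoting is not handled.

```python
def build_pai_command(params, name="fe_discrete_runner_1", project="algo_public"):
    """Render a PAI command string from a dict of parameter names to values."""
    # These three parameters are marked as required in the parameter table.
    required = ("inputTable", "outputTable", "discreteCols")
    missing = [key for key in required if key not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    lines = [f"PAI -name {name} -project {project}"]
    lines += [f"   -D{key}={value}" for key, value in params.items()]
    return "\n".join(lines) + ";"


# Values are taken from the sample command in this topic.
command = build_pai_command({
    "inputTable": "pai_dense_10_1",
    "discreteCols": "nr_employed",
    "maxBins": 5,
    "lifecycle": 28,
    "outputTable": "pai_temp_2262_20382_1",
})
print(command)
```

Optional parameters that are left out of the dictionary are simply omitted from the command, so the component falls back to the defaults listed in the table.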

Example

This example discretizes a single numeric column into 5 bins using equal width discretization.

Prepare input data

Run the following SQL statement to create the input table:

CREATE TABLE IF NOT EXISTS pai_dense_10_1 AS
SELECT nr_employed
FROM bank_data
LIMIT 10;

Configure the component

Input table: pai_dense_10_1
Fields Setting tab: Set Discrete features to nr_employed.
Parameters Setting tab: Set Discretization method to Equal Width Discretization and Discretization interval to 5.

Output

The discretized values for nr_employed are:

nr_employed
4.0
3.0
1.0
3.0
2.0
4.0
3.0
3.0
2.0
3.0