Continuous numeric features with many unique values or extreme outliers can reduce model accuracy, especially in classification tasks. Feature Discretization converts continuous numeric columns into discrete bins, making them compatible with algorithms that expect categorical inputs and reducing the influence of outliers.
The component works only with dense features of numeric data types. Sparse features are automatically filtered out.
Supported discretization methods
Unsupervised methods
These methods partition values based on the data distribution, without using label information.
| Method | How it works | Default |
|---|---|---|
| Equal width discretization | Divides the value range into bins of equal width. Each bin covers the same range of values. | Yes |
| Equal frequency discretization | Distributes values so each bin contains approximately the same number of data points. Reduces the influence of outliers because bin boundaries follow the distribution of the data rather than its full range. | No |
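
The difference between the two unsupervised methods can be illustrated with a minimal Python sketch. This is not the component's implementation: the function names `equal_width_bins` and `equal_frequency_bins` are hypothetical, bins are numbered from 0, and the equal frequency version assigns bins by rank, which is a simplification (tied values can land in different bins).

```python
from typing import List


def equal_width_bins(values: List[float], max_bins: int) -> List[int]:
    """Assign each value to one of max_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant column: everything falls in one bin
    width = (hi - lo) / max_bins
    # The maximum value would map to index max_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), max_bins - 1) for v in values]


def equal_frequency_bins(values: List[float], max_bins: int) -> List[int]:
    """Assign bins by rank so each bin holds roughly the same number of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * max_bins // len(values)
    return bins


data = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]  # 100.0 is an outlier
print(equal_width_bins(data, 3))      # [0, 0, 0, 0, 0, 2]
print(equal_frequency_bins(data, 3))  # [0, 0, 1, 1, 2, 2]
```

With equal width, the single outlier stretches the range so that five of the six values share one bin and the middle bin stays empty. With equal frequency, each bin receives two values regardless of the outlier.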
Supervised methods
These methods find optimal split points based on label information, using entropy gain traversal. Because the algorithm performs a full traversal of the data, supervised discretization takes significantly longer than unsupervised methods.
The label column must be of type ENUM, STRING, or BIGINT. The number of bins produced by supervised methods is not controlled by the maxBins parameter.
| Method | How it works |
|---|---|
| Gini gain-based discretization | Finds split points that minimize Gini impurity at each step. |
| Entropy gain-based discretization | Finds split points that maximize information gain at each step. |
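
As a rough illustration of how a label-aware split point can be chosen, the following Python sketch scans every candidate threshold of one feature and keeps the one with the largest impurity reduction. It is a sketch under stated assumptions, not the component's algorithm: the function names are hypothetical, only a single split is searched, and the way the component repeats the search or decides when to stop is not described in this topic.

```python
import math
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple


def gini(labels: Sequence) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels: Sequence) -> float:
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(
    values: Sequence[float],
    labels: Sequence,
    impurity: Callable[[Sequence], float] = entropy,
) -> Tuple[Optional[float], float]:
    """Return (threshold, gain) for the single split with the largest gain."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    parent = impurity([y for _, y in pairs])
    best_threshold, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        gain = parent - weighted
        if gain > best_gain:
            best_gain = gain
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold, best_gain


feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
label = [0, 0, 0, 1, 1, 1]
print(best_split(feature, label, entropy))  # threshold 6.5, gain 1.0
print(best_split(feature, label, gini))     # threshold 6.5, gain 0.5
```

Every candidate threshold requires re-evaluating the label distribution on both sides, which is consistent with supervised methods taking longer than unsupervised ones.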
Choose a discretization method
| Scenario | Recommended method |
|---|---|
| Column has too many unique values to model effectively | Equal width or equal frequency |
| Values have extreme outliers that distort the model | Equal frequency (spreads values evenly, reducing outlier influence) |
| You have label data and want splits to reflect class boundaries | Gini gain-based or entropy gain-based |
| Need Weight of Evidence (WOE) metrics for credit scoring or risk modeling | Use the Binning component instead |
Configure the component
Method 1: Configure on the pipeline page (recommended)
Configure the Feature Discretization component in Machine Learning Designer (formerly Machine Learning Studio) on the pipeline page.
Fields Setting tab
| Parameter | Description |
|---|---|
| Discrete features | The features to discretize. |
| Label column | (Optional) The label column. When specified, x-y histograms showing the relationship between each feature and the label are available in the output. |
Parameters Setting tab
| Parameter | Description |
|---|---|
| Discretization method | The method to use. Valid values: Equal Width Discretization, Equal Frequency Discretization, Gini Gain-based Discretization, Entropy Gain-based Discretization. Default: Equal Width Discretization. |
| Discretization interval | The number of bins. Must be a positive integer greater than 1. |
Tuning tab
| Parameter | Description |
|---|---|
| Cores | The number of cores for computation. Must be a positive integer. |
| Memory size per core | The memory allocated to each core. |
Method 2: Use PAI commands
Call PAI commands through the SQL Script component. For more information, see SQL Script.
PAI -name fe_discrete_runner_1 -project algo_public
-DdiscreteMethod=SameFrequecy
-Dlifecycle=28
-DmaxBins=5
-DinputTable=pai_dense_10_1
-DdiscreteCols=nr_employed
-DoutputTable=pai_temp_2262_20382_1
-DmodelTable=pai_temp_2262_20382_2;
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| inputTable | Yes | — | The input table name. |
| inputTablePartitions | No | All partitions | Partitions to use for training. Format: Partition_name=value. For multi-level partitions: name1=value1/name2=value2. Separate multiple partitions with commas. |
| outputTable | Yes | — | The output table containing discretized values. |
| discreteCols | Yes | "" | The features to discretize. Sparse features are automatically filtered. |
| labelCol | No | — | The label column. Enables x-y histograms in the output. |
| discreteMethod | No | Isometric Discretization | The discretization method. Valid values: Isometric Discretization (equal width), Isofrequecy Discretization (equal frequency), Gini-gain-based Discretization, Entropy-gain-based Discretization. |
| maxBins | No | 100 | The number of bins. Must be a positive integer greater than 1. Not applicable to supervised methods. |
| lifecycle | No | 7 | The lifecycle of the output table in days. Must be a positive integer. |
| coreNum | No | Determined by the system | The number of cores. Used together with memSizePerCore. Must be a positive integer. |
| memSizePerCore | No | Determined by the system | The memory size per core, in MB. Must be a positive integer. |
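
If you generate PAI commands from a script, the constraints in the preceding table can be checked before the command is submitted. The following Python sketch only assembles the command text; it does not call any PAI or MaxCompute API. The helper name `build_discretization_command` is hypothetical, and the checks cover only the required parameters and integer ranges listed in the table.

```python
from typing import Dict, Union

REQUIRED = ("inputTable", "outputTable", "discreteCols")
# Minimum allowed value for each integer parameter, per the parameter table.
INTEGER_MINIMUMS = {"maxBins": 2, "lifecycle": 1, "coreNum": 1, "memSizePerCore": 1}


def build_discretization_command(params: Dict[str, Union[str, int]]) -> str:
    """Validate params and return the PAI command as a multi-line string."""
    for name in REQUIRED:
        if not params.get(name):
            raise ValueError(f"missing required parameter: {name}")
    for name, minimum in INTEGER_MINIMUMS.items():
        if name in params:
            value = params[name]
            if not isinstance(value, int) or value < minimum:
                raise ValueError(f"{name} must be an integer >= {minimum}")
    lines = ["PAI -name fe_discrete_runner_1 -project algo_public"]
    lines += [f"    -D{key}={value}" for key, value in params.items()]
    return "\n".join(lines) + ";"


command = build_discretization_command(
    {
        "inputTable": "pai_dense_10_1",
        "discreteCols": "nr_employed",
        "outputTable": "pai_temp_2262_20382_1",
        "maxBins": 5,
        "lifecycle": 28,
    }
)
print(command)
```

The resulting string can be pasted into the SQL Script component. Optional parameters that are omitted from the dictionary fall back to the defaults listed in the table.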
Example
This example discretizes a single numeric column into 5 bins using equal width discretization.
Prepare input data
Run the following SQL statement to create the input table:
CREATE TABLE IF NOT EXISTS pai_dense_10_1 AS
SELECT nr_employed
FROM bank_data
LIMIT 10;
Configure the component
Input table: pai_dense_10_1
Fields Setting tab: Set Discrete features to nr_employed.
Parameters Setting tab: Set Discretization method to Equal Width Discretization and Discretization interval to 5.
Output
The discretized values for nr_employed are:
| nr_employed |
|---|
| 4.0 |
| 3.0 |
| 1.0 |
| 3.0 |
| 2.0 |
| 4.0 |
| 3.0 |
| 3.0 |
| 2.0 |
| 3.0 |