Feature discretization

Continuous numeric features with many unique values or extreme outliers can reduce model accuracy, especially in classification tasks. Feature Discretization converts continuous numeric columns into discrete bins, making them compatible with algorithms that expect categorical inputs and reducing the influence of outliers.

The component works only with dense features of numeric data types. Sparse features are automatically filtered out.

Supported discretization methods

Unsupervised methods

These methods partition values based on the data distribution, without using label information.

| Method | How it works | Default |
| --- | --- | --- |
| Equal width discretization | Divides the value range into bins of equal width. Each bin covers the same range of values. | Yes |
| Equal frequency discretization | Distributes values so each bin contains the same number of data points. Reduces the influence of outliers by spreading values evenly. | No |
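
The difference between the two unsupervised methods is easiest to see on data with an outlier. The following Python sketch is illustrative only and is not the component's implementation; the function names `equal_width_bins` and `equal_frequency_bins` are hypothetical, and the rank-based equal frequency rule is an assumption (a production implementation would more likely derive quantile cut points so that identical values always share a bin).

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index in [0, n_bins) using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # all values identical: a single bin
    width = (hi - lo) / n_bins
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bin indices by rank so each bin holds about the same number of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # one extreme outlier
print(equal_width_bins(data, 5))      # the outlier stretches the range
print(equal_frequency_bins(data, 5))  # two points per bin
```

With equal width, the single outlier pushes the nine ordinary values into the first bin, while equal frequency still separates them.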

Supervised methods

These methods use label information to find optimal split points. The algorithm traverses the candidate split points and scores each one by Gini gain or entropy gain, depending on the method you select. Because it performs a full traversal of the data, supervised discretization takes significantly longer than the unsupervised methods.

The label column must be of type ENUM, STRING, or BIGINT. The number of bins produced by supervised methods is not controlled by the maxBins parameter.

| Method | How it works |
| --- | --- |
| Gini gain-based discretization | Finds split points that minimize Gini impurity at each step. |
| Entropy gain-based discretization | Finds split points that maximize information gain at each step. |
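
To make the idea of a gain-based split concrete, the following Python sketch scans every candidate threshold for one feature and keeps the one with the largest impurity reduction. It is a simplified illustration under stated assumptions, not the component's algorithm: it finds a single split only (the rule for repeating or stopping the search is not described here), and the names `entropy`, `gini`, and `best_split` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels, impurity=entropy):
    """Return (threshold, gain) for the split with the largest impurity reduction."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    parent = impurity([y for _, y in pairs])
    best = (None, 0.0)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = parent - (len(left) / n) * impurity(left) - (len(right) / n) * impurity(right)
        if gain > best[1]:
            # Place the threshold midway between the two neighboring values.
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gain)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(values, labels, impurity=entropy))
print(best_split(values, labels, impurity=gini))
```

Both criteria pick the threshold that separates the two classes cleanly in this toy data. On real data they can disagree, and every candidate threshold must be evaluated, which is why the supervised methods run slower.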

Choose a discretization method

| Scenario | Recommended method |
| --- | --- |
| Column has too many unique values to model effectively | Equal width or equal frequency |
| Values have extreme outliers that distort the model | Equal frequency (spreads values evenly, reducing outlier influence) |
| You have label data and want splits to reflect class boundaries | Gini gain-based or entropy gain-based |
| Need Weight of Evidence (WOE) metrics for credit scoring or risk modeling | Use the Binning component instead |

Configure the component

Method 1: Configure on the pipeline page (recommended)

Configure the Feature Discretization component in Machine Learning Designer (formerly Machine Learning Studio) on the pipeline page.

Fields Setting tab

| Parameter | Description |
| --- | --- |
| Discrete features | The features to discretize. |
| Label column | (Optional) The label column. When specified, x-y histograms showing the relationship between each feature and the label are available in the output. |

Parameters Setting tab

| Parameter | Description |
| --- | --- |
| Discretization method | The method to use. Valid values: Equal Width Discretization, Equal Frequency Discretization, Gini Gain-based Discretization, Entropy Gain-based Discretization. Default: Equal Width Discretization. |
| Discretization interval | The number of bins. Must be a positive integer greater than 1. |

Tuning tab

| Parameter | Description |
| --- | --- |
| Cores | The number of cores for computation. Must be a positive integer. |
| Memory size per core | The memory allocated to each core. |

Method 2: Use PAI commands

Call PAI commands through the SQL Script component. For more information, see SQL Script.

PAI -name fe_discrete_runner_1 -project algo_public
   -DdiscreteMethod=SameFrequecy
   -Dlifecycle=28
   -DmaxBins=5
   -DinputTable=pai_dense_10_1
   -DdiscreteCols=nr_employed
   -DoutputTable=pai_temp_2262_20382_1
   -DmodelTable=pai_temp_2262_20382_2;

Parameters

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTable | Yes | | The input table name. |
| inputTablePartitions | No | All partitions | Partitions to use for training. Format: Partition_name=value. For multi-level partitions: name1=value1/name2=value2. Separate multiple partitions with commas. |
| outputTable | Yes | | The output table containing discretized values. |
| discreteCols | Yes | "" | The features to discretize. Sparse features are automatically filtered. |
| labelCol | No | | The label column. Enables x-y histograms in the output. |
| discreteMethod | No | Isometric Discretization | The discretization method. Valid values: Isometric Discretization (equal width), Isofrequecy Discretization (equal frequency), Gini-gain-based Discretization, Entropy-gain-based Discretization. |
| maxBins | No | 100 | The number of bins. Must be a positive integer greater than 1. Not applicable to supervised methods. |
| lifecycle | No | 7 | The lifecycle of the output table in days. Must be a positive integer. |
| coreNum | No | Determined by the system | The number of cores. Used together with memSizePerCore. Must be a positive integer. |
| memSizePerCore | No | Determined by the system | The memory size per core, in MB. Must be a positive integer. |
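
If you generate PAI commands from a script, it can help to check the documented constraints before submitting. The following Python sketch assembles a command string from the parameters in the table above. The helper `build_discretize_command` is hypothetical and is not part of any PAI SDK; it only formats text and validates the integer constraints stated in the table, and it passes all values through verbatim.

```python
def build_discretize_command(input_table, output_table, discrete_cols, **options):
    """Assemble a fe_discrete_runner_1 command string (hypothetical helper).

    `options` holds optional parameters from the table, such as maxBins,
    lifecycle, discreteMethod, or labelCol, passed through unchanged.
    """
    # Constraints taken from the parameter table.
    max_bins = options.get("maxBins")
    if max_bins is not None and (not isinstance(max_bins, int) or max_bins <= 1):
        raise ValueError("maxBins must be an integer greater than 1")
    for name in ("lifecycle", "coreNum", "memSizePerCore"):
        value = options.get(name)
        if value is not None and (not isinstance(value, int) or value <= 0):
            raise ValueError(f"{name} must be a positive integer")

    params = {
        "inputTable": input_table,
        "outputTable": output_table,
        "discreteCols": discrete_cols,
    }
    params.update(options)

    lines = ["PAI -name fe_discrete_runner_1 -project algo_public"]
    lines += [f"   -D{key}={value}" for key, value in params.items()]
    return "\n".join(lines) + ";"


cmd = build_discretize_command(
    "pai_dense_10_1", "pai_temp_2262_20382_1", "nr_employed",
    maxBins=5, lifecycle=28,
)
print(cmd)
```

The resulting string can be pasted into the SQL Script component. The sketch does not check that the tables or columns exist.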

Example

This example discretizes a single numeric column into 5 bins using equal width discretization.

Prepare input data

Run the following SQL statement to create the input table:

CREATE TABLE IF NOT EXISTS pai_dense_10_1 AS
SELECT nr_employed
FROM bank_data
LIMIT 10;

Configure the component

  • Input table: pai_dense_10_1

  • Fields Setting tab: Set Discrete features to nr_employed.

  • Parameters Setting tab: Set Discretization method to Equal Width Discretization and Discretization interval to 5.

Output

The discretized values for nr_employed are:

nr_employed
4.0
3.0
1.0
3.0
2.0
4.0
3.0
3.0
2.0
3.0