The Discrete Feature Analysis component scores categorical features in your training data to identify which ones are most predictive of a label. For each feature, the component computes two distributional metrics — Gini index and entropy — and three feature importance scores: Gini Gain, Information Gain, and Information Gain Ratio.
Configure the component
Two configuration methods are available: the visual pipeline editor in Machine Learning Designer, or PAI commands via the SQL Script component.
Method 1: Configure on the pipeline page
In Machine Learning Designer, add the Discrete Feature Analysis component to your pipeline and set the following parameters.
| Parameter | Description |
|---|---|
| Feature Columns | The feature columns from the input table to analyze. |
| Label Column | The target label column. |
| Sparse Matrix | Enable when input data is in sparse format, where features are represented as key-value pairs. |
Method 2: Use PAI commands
Run the component via PAI commands using the SQL Script component. For details, see Scenario 4: Execute PAI commands within the SQL script component.
Minimal example (key parameters only):
PAI
-name enum_feature_selection
-project algo_public
-DinputTableName=enumfeautreselection_input
-DlabelColName=label
-DfeatureColNames=col0,col1
-DenableSparse=false
-DoutputCntTableName=enumfeautreselection_output_cntTable
-DoutputValueTableName=enumfeautreselection_output_valuetable
-DoutputEnumValueTableName=enumfeautreselection_output_enumvaluetable;
The following table describes all PAI command parameters.
| Parameter | Required | Default value | Description |
|---|---|---|---|
| inputTableName | Yes | — | The name of the input table. |
| inputTablePartitions | No | Full table | The partitions to read from the input table. Supported formats: Partition_name=value for a single partition, or name1=value1/name2=value2 for multi-level partitions. Separate multiple partitions with commas (,). |
| featureColNames | No | — | The feature columns to analyze. |
| labelColName | No | — | The name of the label column in the input table. |
| enableSparse | No | false | Whether the input data is in sparse format. Valid values: true and false. |
| kvFeatureColNames | No | Full table | The names of the feature columns in key-value pair format. |
| kvDelimiter | No | : | The delimiter separating keys and values in sparse input. |
| itemDelimiter | No | , | The delimiter separating key-value pairs in sparse input. |
| outputCntTableName | No | — | Output table containing the value distribution of each discrete feature. Columns: colname, colvalue, labelvalue, cnt. |
| outputValueTableName | No | — | Output table containing per-feature Gini index, entropy, and feature importance scores. Columns: colname, gini, entropy, infogain, ginigain, infogainratio. |
| outputEnumValueTableName | No | — | Output table containing per-value Gini index and entropy for each feature. Columns: colname, colvalue, gini, entropy. A value of 0 means all samples at that category value share the same label (no impurity). |
| lifecycle | No | — | The lifecycle (in days) of the output tables. |
| coreNum | No | System-determined | The number of cores for computation. Must be a positive integer. |
| memSizePerCore | No | System-determined | Memory per core, in MB. Valid values: 1–65536. |
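The sparse-format parameters above can be easier to picture with a small example. The Python sketch below splits one sparse cell into key-value pairs using the default kvDelimiter (:) and itemDelimiter (,). It only illustrates the format; it is not the component's parser. The function name parse_sparse_cell and the sample cell "1:0.3,7:1" are hypothetical.

```python
def parse_sparse_cell(cell, kv_delimiter=":", item_delimiter=","):
    """Split one sparse cell into a {key: value} dict of strings.

    Illustrative only: mirrors the kvDelimiter / itemDelimiter defaults
    from the parameter table, not the component's internal parsing.
    """
    pairs = {}
    if not cell:
        return pairs  # empty or missing cell: no features present
    for item in cell.split(item_delimiter):
        # partition() splits on the first delimiter only, so a value may
        # itself contain the key-value delimiter.
        key, _, value = item.partition(kv_delimiter)
        pairs[key.strip()] = value.strip()
    return pairs


# Hypothetical sparse cell: feature 1 has value 0.3, feature 7 has value 1.
print(parse_sparse_cell("1:0.3,7:1"))
```

If the data uses other separators, pass them explicitly, for example parse_sparse_cell("a=1;b=2", "=", ";"), in the same way that kvDelimiter and itemDelimiter would be set in the PAI command.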
Example
This example runs Discrete Feature Analysis on a dataset with three columns: col_string (string), col_bigint (bigint), and col_double (double). The label column is col_bigint.
Create the input table
Run the following SQL to create and populate the input table:
drop table if exists enum_feature_selection_test_input;
create table enum_feature_selection_test_input
as
select
*
from
(
select
'00' as col_string,
1 as col_bigint,
0.0 as col_double
from dual
union all
select
cast(null as string) as col_string,
0 as col_bigint,
0.0 as col_double
from dual
union all
select
'01' as col_string,
0 as col_bigint,
1.0 as col_double
from dual
union all
select
'01' as col_string,
1 as col_bigint,
cast(null as double) as col_double
from dual
union all
select
'01' as col_string,
1 as col_bigint,
1.0 as col_double
from dual
union all
select
'00' as col_string,
0 as col_bigint,
0.0 as col_double
from dual
) tmp;
The input table contains six rows:
+------------+------------+------------+
| col_string | col_bigint | col_double |
+------------+------------+------------+
| 01         | 1          | 1.0        |
| 01         | 0          | 1.0        |
| 01         | 1          | NULL       |
| NULL       | 0          | 0.0        |
| 00         | 1          | 0.0        |
| 00         | 0          | 0.0        |
+------------+------------+------------+
Run the analysis
Drop any existing output tables, then submit the PAI command:
drop table if exists enum_feature_selection_test_input_enum_value_output;
drop table if exists enum_feature_selection_test_input_cnt_output;
drop table if exists enum_feature_selection_test_input_value_output;
PAI
-name enum_feature_selection
-project algo_public
-DitemDelimiter=":"
-Dlifecycle="28"
-DoutputValueTableName="enum_feature_selection_test_input_value_output"
-DkvDelimiter=","
-DlabelColName="col_bigint"
-DfeatureColNames="col_double,col_string"
-DoutputEnumValueTableName="enum_feature_selection_test_input_enum_value_output"
-DenableSparse="false"
-DinputTableName="enum_feature_selection_test_input"
-DoutputCntTableName="enum_feature_selection_test_input_cnt_output";
Output
The command produces three output tables.
enum_feature_selection_test_input_cnt_output contains the value distribution per feature:
+------------+------------+------------+------------+
| colname    | colvalue   | labelvalue | cnt        |
+------------+------------+------------+------------+
| col_double | NULL       | 1          | 1          |
| col_double | 0          | 0          | 2          |
| col_double | 0          | 1          | 1          |
| col_double | 1          | 0          | 1          |
| col_double | 1          | 1          | 1          |
| col_string | NULL       | 0          | 1          |
| col_string | 00         | 0          | 1          |
| col_string | 00         | 1          | 1          |
| col_string | 01         | 0          | 1          |
| col_string | 01         | 1          | 2          |
+------------+------------+------------+------------+
enum_feature_selection_test_input_value_output contains the feature-level importance scores:
+------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| colname    | gini               | entropy            | infogain           | ginigain           | infogainratio      |
+------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| col_double | 0.3888888888888889 | 0.792481250360578  | 0.20751874963942196| 0.1111111111111111 | 0.14221913160264427|
| col_string | 0.38888888888888884| 0.792481250360578  | 0.20751874963942196| 0.11111111111111116| 0.14221913160264427|
+------------+--------------------+--------------------+--------------------+--------------------+--------------------+
enum_feature_selection_test_input_enum_value_output contains the per-value Gini index and entropy. A value of 0 means no impurity at that category value: all samples with that value share the same label. In this example, the per-value figures for each feature add up to that feature's gini and entropy in enum_feature_selection_test_input_value_output, which indicates that each category's impurity is weighted by its share of the rows:
+------------+------------+--------------------+--------------------+
| colname    | colvalue   | gini               | entropy            |
+------------+------------+--------------------+--------------------+
| col_double | NULL       | 0.0                | 0.0                |
| col_double | 0          | 0.22222222222222224| 0.4591479170272448 |
| col_double | 1          | 0.16666666666666666| 0.3333333333333333 |
| col_string | NULL       | 0.0                | 0.0                |
| col_string | 00         | 0.16666666666666666| 0.3333333333333333 |
| col_string | 01         | 0.2222222222222222 | 0.4591479170272448 |
+------------+------------+--------------------+--------------------+
In this example, col_double and col_string have the same feature importance scores (up to floating-point rounding), so both features are equally predictive of col_bigint. The NULL category of each feature scores 0 on both Gini index and entropy because it contains a single row, so its samples trivially belong to one label class.
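To make the numbers above easier to trace, the following Python sketch recomputes them from the six input rows. It is not the component's implementation. The formulas are assumptions inferred from the example output: Gini impurity and base-2 entropy per category, each weighted by the category's share of the rows; NULL treated as a category of its own; and Information Gain Ratio taken as Information Gain divided by the entropy of the feature's value distribution. The names ROWS, gini, entropy, and analyze are illustrative.

```python
from collections import Counter, defaultdict
from math import log2

# The six rows of the example input table: (col_string, col_bigint, col_double).
# None stands in for NULL and is treated as its own category (assumption).
ROWS = [
    ("01", 1, 1.0),
    ("01", 0, 1.0),
    ("01", 1, None),
    (None, 0, 0.0),
    ("00", 1, 0.0),
    ("00", 0, 0.0),
]


def gini(counts):
    """Gini impurity of a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Base-2 entropy of a list of class counts."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c)


def analyze(values, labels):
    """Return (per_value, summary) for one feature column."""
    n = len(labels)
    by_value = defaultdict(Counter)  # feature value -> label -> row count
    for value, label in zip(values, labels):
        by_value[value][label] += 1

    # Each category's impurity, weighted by its share of the rows.
    per_value = {}
    for value, label_counts in by_value.items():
        counts = list(label_counts.values())
        weight = sum(counts) / n
        per_value[value] = (weight * gini(counts), weight * entropy(counts))

    feature_gini = sum(g for g, _ in per_value.values())
    feature_entropy = sum(e for _, e in per_value.values())
    label_counts = list(Counter(labels).values())
    info_gain = entropy(label_counts) - feature_entropy
    # Entropy of the feature's own value distribution (assumed denominator).
    split_info = entropy([sum(c.values()) for c in by_value.values()])
    summary = {
        "gini": feature_gini,
        "entropy": feature_entropy,
        "infogain": info_gain,
        "ginigain": gini(label_counts) - feature_gini,
        # A single-valued feature has zero split entropy; report 0.0 then.
        "infogainratio": info_gain / split_info if split_info else 0.0,
    }
    return per_value, summary


labels = [row[1] for row in ROWS]
for name, index in (("col_double", 2), ("col_string", 0)):
    per_value, summary = analyze([row[index] for row in ROWS], labels)
    print(name, summary)
    for value, (g, e) in per_value.items():
        print("   ", value, g, e)
```

Under these assumptions, the printed values agree with the three output tables to within floating-point rounding, which is a convenient way to sanity-check the component's output on a small sample before running it on a full table.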