Discrete Feature Analysis - Platform for AI (PAI) - Alibaba Cloud Help Center

The Discrete Feature Analysis component scores categorical features in your training data to identify which ones are most predictive of a label. For each feature, the component computes two distributional metrics — Gini index and entropy — and three feature importance scores: Gini Gain, Information Gain, and Information Gain Ratio.
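For readers who want the two distributional metrics spelled out, the following is a minimal sketch of the textbook definitions of Gini index and entropy for a label distribution given as class counts. The function names are illustrative and are not part of the component, and the use of base-2 logarithms is an assumption (it is consistent with the example output later on this page).

```python
import math

def gini(class_counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def entropy(class_counts):
    """Shannon entropy in bits (base-2 log, an assumption); empty classes are skipped."""
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c > 0)

balanced = [3, 3]   # three samples of each label: maximal impurity for two classes
pure = [4, 0]       # every sample has the same label: no impurity

print(gini(balanced), entropy(balanced))
print(gini(pure), entropy(pure))
```

Both metrics are 0 for a pure distribution and grow as the labels become more evenly mixed.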

Configure the component

Two configuration methods are available: the visual pipeline editor in Machine Learning Designer, or PAI commands via the SQL Script component.

Method 1: Configure on the pipeline page

In Machine Learning Designer, add the Discrete Feature Analysis component to your pipeline and set the following parameters.

Parameter	Description
Feature Columns	The feature columns from the input table to analyze.
Label Column	The target label column.
Sparse Matrix	Enable when input data is in sparse format, where features are represented as key-value pairs.

Method 2: Use PAI commands

Run the component via PAI commands using the SQL Script component. For details, see Scenario 4: Execute PAI commands within the SQL script component.

Minimal example (key parameters only):

PAI
-name enum_feature_selection
-project algo_public
-DinputTableName=enumfeatureselection_input
-DlabelColName=label
-DfeatureColNames=col0,col1
-DenableSparse=false
-DoutputCntTableName=enumfeatureselection_output_cntTable
-DoutputValueTableName=enumfeatureselection_output_valuetable
-DoutputEnumValueTableName=enumfeatureselection_output_enumvaluetable;
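If the command is generated from a script rather than typed by hand, it can be assembled from a parameter dictionary. The sketch below is one possible way to do that; `build_pai_command` is a hypothetical helper, not a PAI API, and the table names `my_input`, `my_cnt_output`, and `my_value_output` are placeholders. It only renders the command text, which would still have to be submitted through the SQL Script component.

```python
def build_pai_command(name, project, params):
    """Render a PAI command as text from a dict of -D parameters (hypothetical helper)."""
    parts = [f"PAI -name {name}", f"-project {project}"]
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"   # PAI expects lowercase booleans
        elif isinstance(value, (list, tuple)):
            value = ",".join(value)                # column lists are comma-separated
        parts.append(f'-D{key}="{value}"')
    return "\n".join(parts) + ";"

command = build_pai_command(
    "enum_feature_selection",
    "algo_public",
    {
        "inputTableName": "my_input",              # placeholder table name
        "labelColName": "label",
        "featureColNames": ["col0", "col1"],
        "enableSparse": False,
        "outputCntTableName": "my_cnt_output",     # placeholder table name
        "outputValueTableName": "my_value_output", # placeholder table name
    },
)
print(command)
```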

The following table describes all PAI command parameters.

Parameter	Required	Default value	Description
inputTableName	Yes	—	The name of the input table.
inputTablePartitions	No	Full table	The partitions to read from the input table. Supported formats: `Partition_name=value` for a single partition, or `name1=value1/name2=value2` for multi-level partitions. Separate multiple partitions with commas (,).
featureColNames	No	—	The feature columns to analyze.
labelColName	No	—	The name of the label column in the input table.
enableSparse	No	false	Whether the input data is in sparse format. Valid values: true and false.
kvFeatureColNames	No	Full table	The names of the feature columns in key-value pair format.
kvDelimiter	No	`:`	The delimiter separating keys and values in sparse input.
itemDelimiter	No	`,`	The delimiter separating key-value pairs in sparse input.
outputCntTableName	No	—	Output table containing the value distribution of each discrete feature. Columns: `colname`, `colvalue`, `labelvalue`, `cnt`.
outputValueTableName	No	—	Output table containing per-feature Gini index, entropy, and feature importance scores. Columns: `colname`, `gini`, `entropy`, `infogain`, `ginigain`, `infogainratio`.
outputEnumValueTableName	No	—	Output table containing per-value Gini index and entropy for each feature. Columns: `colname`, `colvalue`, `gini`, `entropy`. A value of 0 means all samples at that category value share the same label (no impurity).
lifecycle	No	—	The lifecycle (in days) of the output tables.
coreNum	No	System-determined	The number of cores for computation. Must be a positive integer.
memSizePerCore	No	System-determined	Memory per core, in MB. Valid values: 1–65536.
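To illustrate how kvDelimiter and itemDelimiter relate to each other, the sketch below splits a sparse feature string into key-value pairs using the default delimiters from the table above. This is only an illustration of the format: the component's actual parsing rules are not described here, `parse_sparse_row` is a hypothetical name, and the sample strings are made up.

```python
def parse_sparse_row(text, kv_delimiter=":", item_delimiter=","):
    """Split 'k1:v1,k2:v2' into {'k1': 'v1', 'k2': 'v2'} (illustrative only)."""
    features = {}
    if not text:
        return features
    for item in text.split(item_delimiter):
        # partition() splits on the first kv_delimiter only
        key, _, value = item.partition(kv_delimiter)
        features[key.strip()] = value.strip()
    return features

default_row = parse_sparse_row("color:red,size:L")
custom_row = parse_sparse_row("color=red;size=L", kv_delimiter="=", item_delimiter=";")
print(default_row)
print(custom_row)
```

With this reading, itemDelimiter separates the pairs from each other and kvDelimiter separates the key from the value inside each pair.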

Example

This example runs Discrete Feature Analysis on a dataset with three columns: col_string (string), col_bigint (bigint), and col_double (double). The label column is col_bigint.

Create the input table

Run the following SQL to create and populate the input table:

drop table if exists enum_feature_selection_test_input;
create table enum_feature_selection_test_input
as
select
    *
from
(
    select
        '00' as col_string,
        1 as col_bigint,
        0.0 as col_double
    from dual
    union all
        select
            cast(null as string) as col_string,
            0 as col_bigint,
            0.0 as col_double
        from dual
    union all
        select
            '01' as col_string,
            0 as col_bigint,
            1.0 as col_double
        from dual
    union all
        select
            '01' as col_string,
            1 as col_bigint,
            cast(null as double) as col_double
        from dual
    union all
        select
            '01' as col_string,
            1 as col_bigint,
            1.0 as col_double
        from dual
    union all
        select
            '00' as col_string,
            0 as col_bigint,
            0.0 as col_double
        from dual
) tmp;

The input table contains six rows:

+------------+------------+------------+
| col_string | col_bigint | col_double |
+------------+------------+------------+
| 01         | 1          | 1.0        |
| 01         | 0          | 1.0        |
| 01         | 1          | NULL       |
| NULL       | 0          | 0.0        |
| 00         | 1          | 0.0        |
| 00         | 0          | 0.0        |
+------------+------------+------------+

Run the analysis

Drop any existing output tables, then submit the PAI command:

drop table if exists enum_feature_selection_test_input_enum_value_output;
drop table if exists enum_feature_selection_test_input_cnt_output;
drop table if exists enum_feature_selection_test_input_value_output;
PAI
-name enum_feature_selection
-project algo_public
-DinputTableName="enum_feature_selection_test_input"
-DlabelColName="col_bigint"
-DfeatureColNames="col_double,col_string"
-DenableSparse="false"
-DkvDelimiter=":"
-DitemDelimiter=","
-DoutputCntTableName="enum_feature_selection_test_input_cnt_output"
-DoutputValueTableName="enum_feature_selection_test_input_value_output"
-DoutputEnumValueTableName="enum_feature_selection_test_input_enum_value_output"
-Dlifecycle="28";

Output

The command produces three output tables.

enum_feature_selection_test_input_cnt_output — value distribution per feature:

+------------+------------+------------+------------+
| colname    | colvalue   | labelvalue | cnt        |
+------------+------------+------------+------------+
| col_double | NULL       | 1          | 1          |
| col_double | 0          | 0          | 2          |
| col_double | 0          | 1          | 1          |
| col_double | 1          | 0          | 1          |
| col_double | 1          | 1          | 1          |
| col_string | NULL       | 0          | 1          |
| col_string | 00         | 0          | 1          |
| col_string | 00         | 1          | 1          |
| col_string | 01         | 0          | 1          |
| col_string | 01         | 1          | 2          |
+------------+------------+------------+------------+
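The counts in this table can be reproduced offline by tallying (feature value, label) pairs over the six input rows. The sketch below does this in plain Python; it is not how the component is implemented, and it assumes NULL is treated as its own category, which matches the table above.

```python
from collections import Counter

columns = ("col_string", "col_bigint", "col_double")
rows = [
    ("01", 1, 1.0),
    ("01", 0, 1.0),
    ("01", 1, None),   # None stands in for SQL NULL
    (None, 0, 0.0),
    ("00", 1, 0.0),
    ("00", 0, 0.0),
]
label_idx = columns.index("col_bigint")

def count_table(feature):
    """Count how often each (feature value, label) pair occurs."""
    idx = columns.index(feature)
    return Counter((row[idx], row[label_idx]) for row in rows)

for feature in ("col_double", "col_string"):
    for (value, label), cnt in count_table(feature).items():
        print(feature, value, label, cnt)
```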

enum_feature_selection_test_input_value_output — feature-level importance scores:

+------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| colname    | gini                | entropy             | infogain            | ginigain            | infogainratio       |
+------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| col_double | 0.3888888888888889  | 0.792481250360578   | 0.20751874963942196 | 0.1111111111111111  | 0.14221913160264427 |
| col_string | 0.38888888888888884 | 0.792481250360578   | 0.20751874963942196 | 0.11111111111111116 | 0.14221913160264427 |
+------------+---------------------+---------------------+---------------------+---------------------+---------------------+
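The col_double row can be checked by hand from the counts in the first output table. The sketch below uses the standard decision-tree definitions: `gini` and `entropy` are the label impurity conditioned on the feature (weighted by each value's share of rows), the gains subtract those from the impurity of the label alone, and the gain ratio divides information gain by the entropy of the feature's own value distribution. These formulas are an assumption about what the component computes; they do reproduce the figures above to within floating-point rounding.

```python
import math

# (feature value, label) -> count, copied from the col_double rows of the cnt output table
counts = {(None, 1): 1, (0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 1}

def gini(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def entropy(class_counts):
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c > 0)

def feature_scores(counts):
    total = sum(counts.values())
    by_value, by_label = {}, {}
    for (value, label), cnt in counts.items():
        by_value.setdefault(value, []).append(cnt)      # label counts per feature value
        by_label[label] = by_label.get(label, 0) + cnt  # overall label counts
    groups = list(by_value.values())
    labels = list(by_label.values())
    cond_gini = sum(sum(g) / total * gini(g) for g in groups)
    cond_entropy = sum(sum(g) / total * entropy(g) for g in groups)
    info_gain = entropy(labels) - cond_entropy
    split_info = entropy([sum(g) for g in groups])      # entropy of the feature itself
    return {
        "gini": cond_gini,
        "entropy": cond_entropy,
        "infogain": info_gain,
        "ginigain": gini(labels) - cond_gini,
        "infogainratio": info_gain / split_info,
    }

scores = feature_scores(counts)
for name, value in scores.items():
    print(name, value)
```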

enum_feature_selection_test_input_enum_value_output — per-value Gini index and entropy (0 = no impurity at that category value, meaning all samples share the same label):

+------------+------------+---------------------+--------------------+
| colname    | colvalue   | gini                | entropy            |
+------------+------------+---------------------+--------------------+
| col_double | NULL       | 0.0                 | 0.0                |
| col_double | 0          | 0.22222222222222224 | 0.4591479170272448 |
| col_double | 1          | 0.16666666666666666 | 0.3333333333333333 |
| col_string | NULL       | 0.0                 | 0.0                |
| col_string | 00         | 0.16666666666666666 | 0.3333333333333333 |
| col_string | 01         | 0.2222222222222222  | 0.4591479170272448 |
+------------+------------+---------------------+--------------------+
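The per-value figures in this table appear to be weighted by each value's share of the input rows rather than being raw impurities: for col_string value 01, the raw Gini index of a 1-versus-2 label split is 4/9, and 3 of the 6 rows carry that value, giving 0.2222. The sketch below recomputes the col_string rows under that assumption; the weighting rule is inferred from the numbers and is not stated by the component's parameter descriptions.

```python
import math

# (feature value, label) -> count, copied from the col_string rows of the cnt output table
counts = {(None, 0): 1, ("00", 0): 1, ("00", 1): 1, ("01", 0): 1, ("01", 1): 2}

def gini(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def entropy(class_counts):
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c > 0)

def per_value_scores(counts):
    """Return {value: (weighted gini, weighted entropy)} for one feature."""
    total = sum(counts.values())
    by_value = {}
    for (value, _label), cnt in counts.items():
        by_value.setdefault(value, []).append(cnt)
    return {
        value: (sum(c) / total * gini(c), sum(c) / total * entropy(c))
        for value, c in by_value.items()
    }

per_value = per_value_scores(counts)
for value, (g, h) in per_value.items():
    print(value, g, h)
```

Under this reading, summing the per-value figures for a feature gives the feature-level gini and entropy reported in enum_feature_selection_test_input_value_output.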

In this example, col_double and col_string have identical feature importance scores (up to floating-point rounding), so by these metrics the two features are equally informative about col_bigint. The NULL category of each feature scores 0 on both Gini index and entropy because it contains a single row, which necessarily belongs to a single label class.