Discrete feature analysis


The Discrete Feature Analysis component scores categorical features in your training data to identify which ones are most predictive of a label. For each feature, the component computes two distributional metrics — Gini index and entropy — and three feature importance scores: Gini Gain, Information Gain, and Information Gain Ratio.
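The source does not spell out the formulas behind these five numbers. As a hedged reference, the textbook definitions are sketched below; the example output later on this page is consistent with them when base-2 logarithms are used. The notation is an assumption introduced here: Y is the label, X is one feature, p_k is the share of samples with label k, n_v is the number of samples where X takes value v (out of n), and p_{k|v} is the share of label k among those samples.

```latex
\begin{aligned}
\mathrm{Gini}(Y) &= 1 - \sum_k p_k^2, &
H(Y) &= -\sum_k p_k \log_2 p_k \\
\mathrm{Gini}(Y \mid X) &= \sum_v \frac{n_v}{n}\Bigl(1 - \sum_k p_{k \mid v}^2\Bigr), &
H(Y \mid X) &= -\sum_v \frac{n_v}{n}\sum_k p_{k \mid v} \log_2 p_{k \mid v} \\
\mathrm{GiniGain} &= \mathrm{Gini}(Y) - \mathrm{Gini}(Y \mid X), &
\mathrm{InfoGain} &= H(Y) - H(Y \mid X) \\
\mathrm{InfoGainRatio} &= \frac{\mathrm{InfoGain}}{H(X)}, &
H(X) &= -\sum_v \frac{n_v}{n} \log_2 \frac{n_v}{n}
\end{aligned}
```

Under this reading, the per-feature Gini index and entropy are the conditional quantities Gini(Y|X) and H(Y|X), and a larger gain means the feature separates the label classes better.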

Configure the component

Two configuration methods are available: the visual pipeline editor in Machine Learning Designer, or PAI commands via the SQL Script component.

Method 1: Configure on the pipeline page

In Machine Learning Designer, add the Discrete Feature Analysis component to your pipeline and set the following parameters.

| Parameter | Description |
| --- | --- |
| Feature Columns | The feature columns from the input table to analyze. |
| Label Column | The target label column. |
| Sparse Matrix | Enable when input data is in sparse format, where features are represented as key-value pairs. |

Method 2: Use PAI commands

Run the component via PAI commands using the SQL Script component. For details, see Scenario 4: Execute PAI commands within the SQL script component.

Minimal example (key parameters only):

PAI
-name enum_feature_selection
-project algo_public
-DinputTableName=enumfeautreselection_input
-DlabelColName=label
-DfeatureColNames=col0,col1
-DenableSparse=false
-DoutputCntTableName=enumfeautreselection_output_cntTable
-DoutputValueTableName=enumfeautreselection_output_valuetable
-DoutputEnumValueTableName=enumfeautreselection_output_enumvaluetable;

The following table describes all PAI command parameters.

| Parameter | Required | Default value | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | | The name of the input table. |
| inputTablePartitions | No | Full table | The partitions to read from the input table. Supported formats: Partition_name=value for a single partition, or name1=value1/name2=value2 for multi-level partitions. Separate multiple partitions with commas (,). |
| featureColNames | No | | The feature columns to analyze. |
| labelColName | No | | The name of the label column in the input table. |
| enableSparse | No | false | Whether the input data is in sparse format. Valid values: true and false. |
| kvFeatureColNames | No | Full table | The names of the feature columns in key-value pair format. |
| kvDelimiter | No | : | The delimiter separating keys and values in sparse input. |
| itemDelimiter | No | , | The delimiter separating key-value pairs in sparse input. |
| outputCntTableName | No | | Output table containing the value distribution of each discrete feature. Columns: colname, colvalue, labelvalue, cnt. |
| outputValueTableName | No | | Output table containing per-feature Gini index, entropy, and feature importance scores. Columns: colname, gini, entropy, infogain, ginigain, infogainratio. |
| outputEnumValueTableName | No | | Output table containing per-value Gini index and entropy for each feature. Columns: colname, colvalue, gini, entropy. A value of 0 means all samples at that category value share the same label (no impurity). |
| lifecycle | No | | The lifecycle (in days) of the output tables. |
| coreNum | No | System-determined | The number of cores for computation. Must be a positive integer. |
| memSizePerCore | No | System-determined | Memory per core, in MB. Valid values: 1–65536. |
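To make the roles of kvDelimiter and itemDelimiter concrete, here is a minimal Python sketch of how one sparse cell could be split with the default delimiters. The function name parse_kv and the sample cell contents are hypothetical, and the component's actual handling of whitespace, empty cells, or malformed pairs is not documented here, so treat this only as an illustration of the format.

```python
def parse_kv(cell, kv_delimiter=":", item_delimiter=","):
    """Split a sparse cell such as "1:red,4:large" into a {key: value} dict.

    Hypothetical helper: mirrors the documented defaults of kvDelimiter (":")
    and itemDelimiter (","), not the component's internal parser.
    """
    if cell is None or cell == "":
        return {}
    pairs = {}
    for item in cell.split(item_delimiter):
        # partition() splits on the first delimiter only, so the value part
        # may itself contain the delimiter character.
        key, _, value = item.partition(kv_delimiter)
        pairs[key.strip()] = value.strip()
    return pairs


# Default delimiters: pairs separated by ",", key and value separated by ":".
print(parse_kv("1:red,4:large"))

# Custom delimiters, analogous to passing -DkvDelimiter and -DitemDelimiter.
print(parse_kv("a=1;b=2", kv_delimiter="=", item_delimiter=";"))
```

Each key in the resulting dict plays the role of a feature identifier and each value the role of that feature's category for the row.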

Example

This example runs Discrete Feature Analysis on a dataset with three columns: col_string (string), col_bigint (bigint), and col_double (double). The label column is col_bigint.

Create the input table

Run the following SQL to create and populate the input table:

drop table if exists enum_feature_selection_test_input;
create table enum_feature_selection_test_input
as
select
    *
from
(
    select
        '00' as col_string,
        1 as col_bigint,
        0.0 as col_double
    from dual
    union all
    select
        cast(null as string) as col_string,
        0 as col_bigint,
        0.0 as col_double
    from dual
    union all
    select
        '01' as col_string,
        0 as col_bigint,
        1.0 as col_double
    from dual
    union all
    select
        '01' as col_string,
        1 as col_bigint,
        cast(null as double) as col_double
    from dual
    union all
    select
        '01' as col_string,
        1 as col_bigint,
        1.0 as col_double
    from dual
    union all
    select
        '00' as col_string,
        0 as col_bigint,
        0.0 as col_double
    from dual
) tmp;

The input table contains six rows:

+------------+------------+------------+
| col_string | col_bigint | col_double |
+------------+------------+------------+
| 01         | 1          | 1.0        |
| 01         | 0          | 1.0        |
| 01         | 1          | NULL       |
| NULL       | 0          | 0.0        |
| 00         | 1          | 0.0        |
| 00         | 0          | 0.0        |
+------------+------------+------------+

Run the analysis

Drop any existing output tables, then submit the PAI command:

drop table if exists enum_feature_selection_test_input_enum_value_output;
drop table if exists enum_feature_selection_test_input_cnt_output;
drop table if exists enum_feature_selection_test_input_value_output;
PAI
-name enum_feature_selection
-project algo_public
-DitemDelimiter=":"
-Dlifecycle="28"
-DoutputValueTableName="enum_feature_selection_test_input_value_output"
-DkvDelimiter=","
-DlabelColName="col_bigint"
-DfeatureColNames="col_double,col_string"
-DoutputEnumValueTableName="enum_feature_selection_test_input_enum_value_output"
-DenableSparse="false"
-DinputTableName="enum_feature_selection_test_input"
-DoutputCntTableName="enum_feature_selection_test_input_cnt_output";

Output

The command produces three output tables.

enum_feature_selection_test_input_cnt_output — value distribution per feature:

+------------+------------+------------+------------+
| colname    | colvalue   | labelvalue | cnt        |
+------------+------------+------------+------------+
| col_double | NULL       | 1          | 1          |
| col_double | 0          | 0          | 2          |
| col_double | 0          | 1          | 1          |
| col_double | 1          | 0          | 1          |
| col_double | 1          | 1          | 1          |
| col_string | NULL       | 0          | 1          |
| col_string | 00         | 0          | 1          |
| col_string | 00         | 1          | 1          |
| col_string | 01         | 0          | 1          |
| col_string | 01         | 1          | 2          |
+------------+------------+------------+------------+
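The count table is a group-by over (feature, feature value, label value), with NULL treated as its own category. The following Python sketch reproduces those counts from the six input rows. It is an illustration of the aggregation, not the component's implementation, and the names ROWS, FEATURES, and count_table are introduced here for the example.

```python
from collections import Counter

# The six input rows as (col_string, col_bigint, col_double); None stands for NULL.
ROWS = [
    ("00", 1, 0.0),
    (None, 0, 0.0),
    ("01", 0, 1.0),
    ("01", 1, None),
    ("01", 1, 1.0),
    ("00", 0, 0.0),
]
FEATURES = {"col_string": 0, "col_double": 2}  # feature name -> tuple index
LABEL = 1                                      # index of col_bigint


def count_table(rows):
    """Count rows per (colname, colvalue, labelvalue), keeping None as a category."""
    counts = Counter()
    for row in rows:
        for name, idx in FEATURES.items():
            counts[(name, row[idx], row[LABEL])] += 1
    return counts


cnt = count_table(ROWS)
# str() in the sort key lets None sort alongside strings and floats.
for key in sorted(cnt, key=lambda k: (k[0], str(k[1]), k[2])):
    print(*key, cnt[key])
```

The ten resulting groups match the ten rows of the table above; Python prints the col_double values as 0.0 and 1.0 where the table shows 0 and 1.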

enum_feature_selection_test_input_value_output — feature-level importance scores:

+------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| colname    | gini               | entropy            | infogain           | ginigain           | infogainratio      |
+------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| col_double | 0.3888888888888889 | 0.792481250360578  | 0.20751874963942196| 0.1111111111111111 | 0.14221913160264427|
| col_string | 0.38888888888888884| 0.792481250360578  | 0.20751874963942196| 0.11111111111111116| 0.14221913160264427|
+------------+--------------------+--------------------+--------------------+--------------------+--------------------+

enum_feature_selection_test_input_enum_value_output — per-value Gini index and entropy (0 = no impurity at that category value, meaning all samples share the same label):

+------------+------------+--------------------+--------------------+
| colname    | colvalue   | gini               | entropy            |
+------------+------------+--------------------+--------------------+
| col_double | NULL       | 0.0                | 0.0                |
| col_double | 0          | 0.22222222222222224| 0.4591479170272448 |
| col_double | 1          | 0.16666666666666666| 0.3333333333333333 |
| col_string | NULL       | 0.0                | 0.0                |
| col_string | 00         | 0.16666666666666666| 0.3333333333333333 |
| col_string | 01         | 0.2222222222222222 | 0.4591479170272448 |
+------------+------------+--------------------+--------------------+
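The numbers in the two score tables can be recomputed from the count table. The Python sketch below does so under an assumption inferred from the example rather than stated by the source: each per-value Gini index and entropy is weighted by that value's share of the samples, so the per-value rows sum to the feature-level gini and entropy columns. Logarithms are base 2. The names CNT, gini, entropy, and analyze are introduced here for illustration.

```python
from math import log2

# (colvalue, labelvalue) -> cnt, copied from the cnt output table; None stands for NULL.
CNT = {
    "col_double": {(None, 1): 1, (0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 1},
    "col_string": {(None, 0): 1, ("00", 0): 1, ("00", 1): 1, ("01", 0): 1, ("01", 1): 2},
}


def gini(counts):
    """Gini impurity of a list of class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def analyze(cells):
    """Return (per-value weighted impurities, feature-level scores) for one feature."""
    n = sum(cells.values())
    by_value, by_label = {}, {}
    for (value, label), c in cells.items():
        by_value.setdefault(value, []).append(c)
        by_label[label] = by_label.get(label, 0) + c

    # Assumption: per-value impurity is scaled by the value's share n_v / n.
    per_value = {
        v: (sum(cs) / n * gini(cs), sum(cs) / n * entropy(cs))
        for v, cs in by_value.items()
    }
    cond_gini = sum(g for g, _ in per_value.values())
    cond_entropy = sum(e for _, e in per_value.values())
    label_counts = list(by_label.values())
    info_gain = entropy(label_counts) - cond_entropy
    split_info = entropy([sum(cs) for cs in by_value.values()])
    scores = {
        "gini": cond_gini,
        "entropy": cond_entropy,
        "infogain": info_gain,
        "ginigain": gini(label_counts) - cond_gini,
        # Guard for a single-valued feature is this sketch's choice, not documented behavior.
        "infogainratio": info_gain / split_info if split_info else 0.0,
    }
    return per_value, scores


for name, cells in CNT.items():
    per_value, scores = analyze(cells)
    print(name, per_value, scores)
```

The results agree with the tables above to within floating-point rounding, which supports the weighted reading of the per-value columns for this data set without confirming it for the component in general.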

In this example, col_double and col_string receive the same feature importance scores (the gini and ginigain columns differ only in the last digit, which is floating-point rounding), so by these metrics the two features are equally informative about col_bigint. Each feature has exactly one NULL sample, so its NULL category trivially contains a single label class and scores 0 on both Gini index and entropy.