Feature importance filtering


After a Linear Model Feature Importance, GBDT Feature Importance, or Random Forest Feature Importance component produces a scored weight table, Feature Importance Filtering selects the top N features from that table and writes the filtered dataset to an output table. This lets you reduce input dimensionality before training without manually inspecting or ranking feature scores.

Prerequisites

Before you begin, ensure that you have:

  • A completed run of one of the upstream feature importance components: Linear Model Feature Importance, GBDT Feature Importance, or Random Forest Feature Importance

  • The output weight table from that component (used as weightTable)

  • An input data table whose features you want to filter (used as inputTable)

How it works

Feature Importance Filtering reads the feature scores from weightTable and keeps the top N features. The filtered feature set is written to outputTable. A model file capturing the filter configuration is saved to modelTable.
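
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the component's actual implementation: the function name select_top_n is hypothetical, the weight table is modeled as a plain mapping from feature name to score, features are ranked by descending score (whether the component ranks signed linear-model weights by absolute value is not stated here), and the sample scores are made up.

```python
from typing import Dict, List, Optional


def select_top_n(weights: Dict[str, float], top_n: int = 10,
                 selected_cols: Optional[List[str]] = None) -> List[str]:
    """Return the names of the top_n highest-scoring features (hypothetical helper)."""
    if top_n <= 0:
        raise ValueError("top_n must be a positive integer")
    # Restrict candidates to selected_cols when given; otherwise use every scored feature.
    if selected_cols is None:
        candidates = dict(weights)
    else:
        candidates = {name: weights[name] for name in selected_cols if name in weights}
    # Assumption: rank by descending score; ties are broken by name for determinism.
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Made-up scores standing in for the rows of a weight table.
scores = {"age": 0.12, "campaign": 0.05, "euribor3m": 0.40, "pdays": 0.31, "previous": 0.08}
print(select_top_n(scores, top_n=3))  # ['euribor3m', 'pdays', 'age']
```

If fewer than top_n candidates have a score, the sketch simply returns all of them.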

Configure the component

PAI -name fe_filter_runner -project algo_public
    -DselectedCols=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome
    -DinputTable=pai_dense_10_10
    -DweightTable=pai_temp_2252_20319_1
    -DtopN=5
    -DmodelTable=pai_temp_2252_20320_2
    -DoutputTable=pai_temp_2252_20320_1;

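This example ranks the ten candidate columns listed in selectedCols using the weight table pai_temp_2252_20319_1 from an upstream feature importance component, keeps the 5 highest-ranked features from pai_dense_10_10, writes the result to pai_temp_2252_20320_1, and saves the filter model to pai_temp_2252_20320_2.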
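
When the command is generated by a script, it may help to assemble it from a parameter mapping. The following is a minimal sketch: build_fe_filter_command is a hypothetical helper, not part of any PAI SDK, and it only reproduces the command layout shown above and checks that the four required parameters are present.

```python
from typing import Dict


def build_fe_filter_command(params: Dict[str, object]) -> str:
    """Assemble a fe_filter_runner PAI command string (hypothetical helper)."""
    required = ("inputTable", "weightTable", "outputTable", "modelTable")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    flags = []
    for name, value in params.items():
        # Lists such as selectedCols are written as comma-separated values.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        flags.append(f"    -D{name}={value}")
    return "PAI -name fe_filter_runner -project algo_public\n" + "\n".join(flags) + ";"


command = build_fe_filter_command({
    "selectedCols": ["pdays", "previous", "age"],
    "inputTable": "pai_dense_10_10",
    "weightTable": "pai_temp_2252_20319_1",
    "topN": 5,
    "modelTable": "pai_temp_2252_20320_2",
    "outputTable": "pai_temp_2252_20320_1",
})
print(command)
```

The helper only builds text. Submitting the command to a project is left to whatever client you already use.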

Parameters

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| inputTable | Name of the input table. | Yes | |
| inputTablePartitions | Partitions to read from the input table. Specify a single partition as partition_name=value, multiple partitions as name1=value1,name2=value2 (comma-separated), or multi-level partitions as name1=value1/name2=value2. | No | All partitions |
| weightTable | The feature importance weight table. Must be an output table from the Linear Model Feature Importance, GBDT Feature Importance, or Random Forest Feature Importance component. | Yes | |
| outputTable | The output table after the top N features are filtered. | Yes | |
| modelTable | The model file generated by feature filtering. | Yes | |
| selectedCols | Columns from inputTable to consider as candidates for filtering. | No | All columns |
| topN | Number of top-ranked features to keep. Must be a positive integer. | No | 10 |
| lifecycle | Retention period of the output table, in days. Must be a positive integer. | No | 7 |
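
The inputTablePartitions syntax combines two separators: commas between partitions and slashes between the levels of one partition. The sketch below shows one way to read such a value. parse_partitions is a hypothetical helper written for illustration, and the partition names ds and hour are made up.

```python
from typing import Dict, List


def parse_partitions(spec: str) -> List[Dict[str, str]]:
    """Split an inputTablePartitions value into one dict per partition (hypothetical helper)."""
    partitions = []
    for part in spec.split(","):            # commas separate partitions
        levels = {}
        for level in part.split("/"):       # slashes separate partition levels
            name, sep, value = level.partition("=")
            name, value = name.strip(), value.strip()
            if not sep or not name or not value:
                raise ValueError(f"malformed partition level: {level!r}")
            levels[name] = value
        partitions.append(levels)
    return partitions


# Two multi-level partitions, using made-up partition names.
print(parse_partitions("ds=20240101/hour=00,ds=20240102/hour=00"))
# [{'ds': '20240101', 'hour': '00'}, {'ds': '20240102', 'hour': '00'}]
```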

What's next

  • Connect outputTable to a training component to build a model using the selected features.

  • To understand how each upstream component calculates feature scores, see the Linear Model Feature Importance, GBDT Feature Importance, and Random Forest Feature Importance documentation.