Random Forest Feature Importance Evaluation

Random Forest Feature Importance Evaluation measures how much each feature contributes to the predictions of a random forest model. Importance is derived from the mean decrease in impurity that a feature produces across all decision trees, or from permutation importance. The result helps you identify the features with the greatest impact on model performance.
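As a rough illustration of what "decrease in impurity" means, the following sketch computes the Gini and entropy decrease for a single split of a toy label set. It is not the PAI implementation: the function names, the toy labels, and the choice of base-2 logarithm for entropy are all assumptions made for the example. In an impurity-based scheme, a feature's importance is typically obtained by accumulating such decreases over every node that splits on that feature and averaging over the trees.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution (base 2 is an assumption)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * impurity(left) + len(right) / n * impurity(right)
    )
    return impurity(parent) - weighted_children


# Toy split (hypothetical data): 10 samples divided into two groups of 5.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

gini_gain = impurity_decrease(parent, left, right, gini)
entropy_gain = impurity_decrease(parent, left, right, entropy)
print(f"gini decrease:    {gini_gain:.4f}")
print(f"entropy decrease: {entropy_gain:.4f}")
```

A split that separates the classes more cleanly yields a larger decrease under both criteria, which is why features used in such splits receive higher importance values.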

Component configuration

Method 1: GUI method

In Machine Learning Designer, add the Random Forest Feature Importance Evaluation component to your pipeline and configure its parameters in the right-side pane:

Tab	Parameter	Description
Field Settings	Feature Columns	The feature columns to use from the input table. This parameter is optional. By default, all columns except the label column are selected.
Field Settings	Target Column	The label column in the input table. This parameter is required.
Parameter Settings	Parallel Computing Cores	The number of cores for parallel computing.
Parameter Settings	Memory Size per Core	The memory size per core, in MB.

Method 2: PAI command

You can configure the Random Forest Feature Importance Evaluation component by running a PAI command in the SQL Script component. For more information, see Scenario 4: Execute PAI commands in the SQL Script component.

```
pai -name feature_importance -project algo_public
    -DinputTableName=pai_dense_10_10
    -DmodelName=xlab_m_random_forests_1_20318_v0
    -DoutputTableName=erkang_test_dev.pai_temp_2252_20319_1
    -DlabelColName=y
    -DfeatureColNames="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign,poutcome"
    -Dlifecycle=28;
```

Parameter	Required	Default	Description
inputTableName	Yes	None	The name of the input table.
outputTableName	Yes	None	The name of the output table.
labelColName	Yes	None	The name of the label column in the input table.
modelName	Yes	None	The name of the input model.
featureColNames	No	All columns except the label column	The feature columns to use from the input table.
inputTablePartitions	No	The entire table	The partitions to use from the input table.
lifecycle	No	Not specified	The lifecycle of the output table, in days.
coreNum	No	Calculated automatically	The number of cores.
memSizePerCore	No	Calculated automatically	The memory size per core, in MB.
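The parameter table above can be turned into a small validation step before a command is submitted. The sketch below assembles the command text from a dictionary and rejects missing required parameters or unknown names. The helper `build_pai_command` is hypothetical and not part of any PAI SDK; it only produces a string in the same shape as the sample command and does not run anything.

```python
# Parameter names are taken from the table above; the helper itself is hypothetical.
REQUIRED = ("inputTableName", "outputTableName", "labelColName", "modelName")
OPTIONAL = (
    "featureColNames",
    "inputTablePartitions",
    "lifecycle",
    "coreNum",
    "memSizePerCore",
)


def build_pai_command(params):
    """Return the PAI command text for the feature_importance algorithm."""
    missing = [key for key in REQUIRED if not params.get(key)]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")

    unknown = [key for key in params if key not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError(f"unknown parameters: {', '.join(unknown)}")

    parts = ["pai -name feature_importance -project algo_public"]
    for key in REQUIRED + OPTIONAL:
        value = params.get(key)
        if value is None:
            continue  # optional parameter left at its default
        if isinstance(value, (list, tuple)):
            # Column lists are passed as one quoted, comma-separated string.
            value = '"' + ",".join(value) + '"'
        parts.append(f"-D{key}={value}")
    return "\n    ".join(parts) + ";"


command = build_pai_command(
    {
        "inputTableName": "pai_dense_10_10",
        "modelName": "xlab_m_random_forests_1_20318_v0",
        "outputTableName": "erkang_test_dev.pai_temp_2252_20319_1",
        "labelColName": "y",
        "featureColNames": ["pdays", "previous", "age", "campaign"],
        "lifecycle": 28,
    }
)
print(command)
```

Omitting an optional key leaves the corresponding default from the table in effect, because the parameter is simply not written into the command.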

Example

Use an SQL statement to generate training data.

This example creates a table named pai_dense_10_10 by selecting specific columns and 10 rows from the bank_data table. Because the statement uses limit without an order by clause, which 10 rows are returned is not guaranteed. You can create a table based on your actual business needs.
```
drop table if exists pai_dense_10_10;
create table pai_dense_10_10 as
select
    age,campaign,pdays, previous, poutcome, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed, y
from bank_data limit 10;
```
Build the pipeline. For more information, see Custom pipelines.

Set y as the label column and all other columns as feature columns. For the Columns Forced to Convert parameter, select age and campaign to process them as enumerated features. Use the default values for all other parameters.

Run the pipeline and view the results. The following output shows the feature importance values generated by the Random Forest Feature Importance Evaluation component.

colname                gini                        entropy
age                    0.06625000000000003          0.13978726292803723
campaign               0.00175000000000003          0.00434851554559677
cons_conf_idx          0.01399999999999999          0.0290840949701885
cons_price_idx         0.002                        0.00498044991346125
emp_var_rate           0.01470000000000003          0.0267863068026093
euribor3m              0.06300000000000003          0.132193634884603
nr_employed            0.10499999999999998          0.220322724807673
pdays                  0.0845                       0.177503292343975
poutcome               0.03360000000000001          0.070503271938455
previous               0.01770000000000004          0.038103810058015

After the run completes, right-click the Random Forest Feature Importance Evaluation component and select View Analytics Report.
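To read the output more easily, you can rank the features by each column. The following sketch copies the values from the sample output above into a string, parses them, and compares the ordering by gini with the ordering by entropy. The variable names are illustrative, and in practice you would read the rows from the output table rather than from a literal string.

```python
# Values copied from the sample output above.
OUTPUT = """\
age                    0.06625000000000003          0.13978726292803723
campaign               0.00175000000000003          0.00434851554559677
cons_conf_idx          0.01399999999999999          0.0290840949701885
cons_price_idx         0.002                        0.00498044991346125
emp_var_rate           0.01470000000000003          0.0267863068026093
euribor3m              0.06300000000000003          0.132193634884603
nr_employed            0.10499999999999998          0.220322724807673
pdays                  0.0845                       0.177503292343975
poutcome               0.03360000000000001          0.070503271938455
previous               0.01770000000000004          0.038103810058015
"""

# Parse each line into (colname, gini, entropy).
rows = []
for line in OUTPUT.strip().splitlines():
    name, gini_value, entropy_value = line.split()
    rows.append((name, float(gini_value), float(entropy_value)))

# Rank feature names from most to least important under each criterion.
by_gini = [r[0] for r in sorted(rows, key=lambda r: r[1], reverse=True)]
by_entropy = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)]

# Positions where the two rankings disagree.
disagreements = [
    (rank, g, e)
    for rank, (g, e) in enumerate(zip(by_gini, by_entropy), start=1)
    if g != e
]

print("Ranked by gini:   ", by_gini)
print("Ranked by entropy:", by_entropy)
print("Rank differences: ", disagreements)
```

For this sample, both criteria place nr_employed, pdays, age, and euribor3m at the top, and the rankings differ only in the order of emp_var_rate and cons_conf_idx.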