The feature scaling component supports common scaling transformations for dense or sparse numeric features.
Overview
The feature scaling component can:
- Apply common scaling functions, such as log2, log10, ln, abs, and sqrt.
- Process data in both dense and sparse formats.
Component configuration
You can configure the feature scaling component in one of the following ways.
Method 1: Use the console
Configure the component parameters in the Designer pipeline.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields setting | Scaled features | The features to scale. |
| Fields setting | Label column | If you specify this parameter, you can view the x-y distribution histogram of features against the target variable. |
| Fields setting | Is k:v,k:v sparse feature | Specifies whether the training data is sparse. Sparse data is typically stored in a single field instead of multiple fields. |
| Fields setting | Keep original transformed features | Specifies whether to keep the original features. If this option is selected, new scaled features are created with the scale_ prefix. |
| Parameters setting | Scaling function | The scaling function to apply. The component supports functions such as log2, log10, ln, abs, and sqrt. |
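To illustrate what these options compute, the following Python sketch applies a chosen scaling function to one column and optionally keeps the original column next to a new scale_-prefixed column. It is not the component's implementation. The helper name scale_column, the dict-of-lists table representation, and the behavior for invalid inputs (Python raises an error for the logarithm of a non-positive number or the square root of a negative number) are assumptions made for the example.

```python
import math

# Scaling functions named in the overview; the mapping itself is illustrative.
SCALE_FUNCTIONS = {
    "log2": math.log2,
    "log10": math.log10,
    "ln": math.log,
    "abs": abs,
    "sqrt": math.sqrt,
}


def scale_column(table, column, method="log2", keep_original=False):
    """Return a copy of `table` (dict of column name -> list) with `column` scaled.

    When keep_original is True, the scaled values go into a new column named
    "scale_<column>"; otherwise they overwrite the original column.
    """
    func = SCALE_FUNCTIONS[method]
    scaled = [func(value) for value in table[column]]
    result = {name: list(values) for name, values in table.items()}
    target = f"scale_{column}" if keep_original else column
    result[target] = scaled
    return result


demo = {"x": [1.0, 8.0, 1024.0]}
print(scale_column(demo, "x", method="log2"))
print(scale_column(demo, "x", method="log2", keep_original=True))
```

The first call overwrites x with its log2 values; the second leaves x untouched and adds scale_x.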
Method 2: Use the CLI
Configure the component parameters using a PAI command. You can run the command in the SQL script component.
PAI -name fe_scale_runner -project algo_public
-Dlifecycle=28
-DscaleMethod=log2
-DscaleCols=nr_employed
-DinputTable=pai_dense_10_1
-DoutputTable=pai_temp_2262_20380_1;
| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| inputTable | Yes | The name of the input table. | N/A |
| inputTablePartitions | No | The partitions of the input table to use for training. Specify partitions in the format For multi-level partitions, use the format Separate multiple partitions with a comma (,). | All partitions in the input table. |
| outputTable | Yes | The output table for the scaled results. | N/A |
| scaleCols | Yes | The features to scale. The component automatically filters out sparse features. Only numeric features can be selected. | N/A |
| labelCol | No | The label column. If you specify this parameter, you can view the x-y distribution histogram of features against the target variable. | N/A |
| categoryCols | No | The columns to be treated as categorical features. These columns are not scaled. | "" |
| scaleMethod | No | The scaling method. Supported functions include log2, log10, ln, abs, and sqrt. | log2 |
| scaleTopN | No | If the scaleCols parameter is not specified, the system automatically selects the top N features to scale. | 10 |
| isSparse | No | Specifies whether the features are sparse and in the k:v format. | dense |
| itemSpliter | No | The delimiter for items in a sparse feature. | , |
| kvSpliter | No | The delimiter between a key and a value in a sparse feature item. | : |
| lifecycle | No | The lifecycle of the output table, in days. | 7 |
| coreNum | No | The number of nodes. The value must be a positive integer in the range [1, 9999]. This parameter is used together with the memSizePerCore parameter. | Automatically allocated. |
| memSizePerCore | No | The memory size per core, in MB. The value must be a positive integer in the range [2048, 64 * 1024]. | Automatically allocated. |
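A sparse feature is stored in a single field as delimited key-value items, for example 1:8,3:1024 with the default itemSpliter (,) and kvSpliter (:). The Python sketch below shows one way such a field could be split, scaled, and reassembled. It is an illustration only: the function name scale_sparse_field is hypothetical, and the assumptions that every value in the field is scaled and that keys are left unchanged are not confirmed by this page.

```python
import math


def scale_sparse_field(field, func=math.log2, item_spliter=",", kv_spliter=":"):
    """Scale every value in a sparse "k:v,k:v" string and return the new string."""
    scaled_items = []
    for item in field.split(item_spliter):
        if not item:
            continue  # skip empty items, e.g. from a trailing delimiter
        key, value = item.split(kv_spliter, 1)
        scaled_items.append(f"{key}{kv_spliter}{func(float(value))}")
    return item_spliter.join(scaled_items)


print(scale_sparse_field("1:8,3:1024"))  # prints 1:3.0,3:10.0
```

Passing different delimiters, such as item_spliter=";" and kv_spliter="=", mirrors overriding itemSpliter and kvSpliter in the command.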
Examples
- Input data
  Use the following SQL script to generate the input data.
  create table if not exists pai_dense_10_1 as select nr_employed from bank_data limit 10;
- Parameter settings
  Select nr_employed as the feature to scale. Only numeric features are supported. For Scaling Function, select log2.
- Results
nr_employed
12.352071021075528
12.34313018339218
12.285286613666395
12.316026916036957
12.309533196497519
12.352071021075528
12.316026916036957
12.316026916036957
12.309533196497519
12.316026916036957
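One quick way to sanity-check log2 output is to invert it: raising 2 to each result should recover the corresponding input value. The Python sketch below does this for the distinct values in the results above. It only inverts the published numbers and makes no claim about the exact contents of bank_data.

```python
import math

# Distinct log2-scaled values copied from the results above.
scaled = [
    12.352071021075528,
    12.34313018339218,
    12.285286613666395,
    12.316026916036957,
    12.309533196497519,
]

# Invert log2 to estimate the original nr_employed values.
recovered = [2 ** y for y in scaled]

for y, x in zip(scaled, recovered):
    # Re-applying log2 should reproduce the scaled value up to rounding error.
    assert math.isclose(math.log2(x), y, rel_tol=1e-12)
    print(f"{y} -> {x:.1f}")
```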