Feature scaling


The feature scaling component supports common scaling transformations for dense or sparse numeric features.

Overview

The feature scaling component can:

  • Apply common scaling functions, such as log2, log10, ln, abs, and sqrt.

  • Process data in both dense and sparse formats.
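
To make the listed functions concrete, the following minimal Python sketch maps each function name to a standard-library callable and applies it to a dense column. The names SCALE_FUNCTIONS and scale_values are hypothetical and only illustrate the math; they are not part of the component. How the component treats zero or negative inputs for the logarithm and square-root functions is not covered here, so the sketch simply lets Python raise an error in those cases.

```python
import math

# Assumption: each function name from the list above corresponds to the
# usual mathematical function. This table is illustrative only.
SCALE_FUNCTIONS = {
    "log2": math.log2,
    "log10": math.log10,
    "ln": math.log,      # natural logarithm
    "abs": abs,
    "sqrt": math.sqrt,
}


def scale_values(values, method="log2"):
    """Apply one named scaling function to every value in a dense column."""
    func = SCALE_FUNCTIONS[method]
    return [func(v) for v in values]


print(scale_values([1.0, 8.0, 1024.0], "log2"))  # [0.0, 3.0, 10.0]
print(scale_values([-2.5, 4.0], "abs"))          # [2.5, 4.0]
```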

Component configuration

You can configure the feature scaling component in one of the following ways.

Method 1: Use the console

Configure the component parameters in the Designer pipeline.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields setting | Scaled features | The features to scale. |
| Fields setting | Label column | If you specify this parameter, you can view the x-y distribution histogram of features against the target variable. |
| Fields setting | Is k:v,k:v sparse feature | Specifies whether the training data is sparse. Sparse data is typically stored in a single field instead of multiple fields. |
| Fields setting | Keep original transformed features | Specifies whether to keep the original features. If this option is selected, new scaled features are created with the scale_ prefix. |
| Parameters setting | Scaling function | The scaling function to apply. Valid values: log2, log10, ln, abs, and sqrt. |
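
The effect of the option that keeps the original features can be sketched in Python as follows. This is an illustration under stated assumptions, not the component's implementation: scale_row is a hypothetical helper, log2 is used as the scaling function, and the behavior when the option is cleared (the scaled value replaces the original column) is an assumption.

```python
import math


def scale_row(row, scale_cols, keep_original=True):
    """Return a copy of `row` with log2 applied to the selected columns.

    keep_original=True:  originals stay; scaled copies are added as scale_<name>.
    keep_original=False: assumption, the scaled value overwrites the column.
    """
    out = dict(row)
    for col in scale_cols:
        scaled = math.log2(row[col])
        if keep_original:
            out["scale_" + col] = scaled
        else:
            out[col] = scaled
    return out


row = {"nr_employed": 1024.0, "age": 30}
print(scale_row(row, ["nr_employed"]))
print(scale_row(row, ["nr_employed"], keep_original=False))
```

Columns that are not listed in scale_cols, such as age in this example, pass through unchanged.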

Method 2: Use the CLI

Configure the component parameters using a PAI command. You can run the command in the SQL script component.

PAI -name fe_scale_runner -project algo_public
    -Dlifecycle=28
    -DscaleMethod=log2
    -DscaleCols=nr_employed
    -DinputTable=pai_dense_10_1
    -DoutputTable=pai_temp_2262_20380_1;

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| inputTable | Yes | The name of the input table. | N/A |
| inputTablePartitions | No | The partitions of the input table to use for training. Specify each partition in the format partition_name=value. For multi-level partitions, use the format name1=value1/name2=value2. Separate multiple partitions with a comma (,). | All partitions in the input table. |
| outputTable | Yes | The output table for the scaled results. | N/A |
| scaleCols | Yes | The features to scale. The component automatically filters out sparse features. Only numeric features can be selected. | N/A |
| labelCol | No | The label column. If you specify this parameter, you can view the x-y distribution histogram of features against the target variable. | N/A |
| categoryCols | No | The columns to be treated as categorical features. These columns are not scaled. | "" |
| scaleMethod | No | The scaling method. Valid values: log2, log10, ln, abs, and sqrt. | log2 |
| scaleTopN | No | If scaleCols is not specified, the system automatically selects the top N features to scale. | 10 |
| isSparse | No | Specifies whether the features are sparse and in the k:v format. | dense |
| itemSpliter | No | The delimiter between items in a sparse feature. | , |
| kvSpliter | No | The delimiter between a key and a value in a sparse feature item. | : |
| lifecycle | No | The lifecycle of the output table, in days. | 7 |
| coreNum | No | The number of nodes. The value must be a positive integer in the range of [1, 9999]. This parameter is used in conjunction with the memSizePerCore parameter. | Automatically allocated. |
| memSizePerCore | No | The memory size per core, in MB. The value must be a positive integer in the range of [2048, 64 * 1024]. | Automatically allocated. |
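
To show how itemSpliter and kvSpliter describe a sparse field, the following Python sketch parses a k:v string, scales each value, and joins the items back together. It is a rough illustration under assumptions: scale_sparse is a hypothetical helper, every key in the field is scaled, and the output keeps the same delimiters as the input. The component itself may select keys and format values differently.

```python
import math


def scale_sparse(field, item_spliter=",", kv_spliter=":", func=math.log2):
    """Scale the value of every key:value item in one sparse field.

    item_spliter separates items; kv_spliter separates a key from its value.
    Assumption: all keys are scaled and the delimiters are preserved.
    """
    items = []
    for item in field.split(item_spliter):
        key, value = item.split(kv_spliter, 1)
        items.append(key + kv_spliter + repr(func(float(value))))
    return item_spliter.join(items)


print(scale_sparse("1:8,3:1024"))                    # 1:3.0,3:10.0
print(scale_sparse("a=16;b=4", ";", "=", math.sqrt))  # a=4.0;b=2.0
```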

Examples

  • Input data

    Use the following SQL script to generate the input data.

    create table if not exists pai_dense_10_1 as
    select
        nr_employed
    from bank_data limit 10;

  • Parameter settings

    Select nr_employed as the feature to scale. Only numeric features are supported. For Scaling Function, select log2.

  • Results

    nr_employed

    12.352071021075528

    12.34313018339218

    12.285286613666395

    12.316026916036957

    12.309533196497519

    12.352071021075528

    12.316026916036957

    12.316026916036957

    12.309533196497519

    12.316026916036957
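
As a quick sanity check on these results, the log2 transformation can be inverted by raising 2 to each output value. The sketch below does this for the first few rows. The raw nr_employed values are not listed in this example, so the recovered numbers are only what the inversion implies (values in the low thousands, roughly 5228 for the first row); invert_log2 is a hypothetical helper.

```python
def invert_log2(values):
    """Undo a log2 scaling by raising 2 to each scaled value."""
    return [2.0 ** v for v in values]


# First three rows of the results shown above.
scaled = [12.352071021075528, 12.34313018339218, 12.285286613666395]

for s, raw in zip(scaled, invert_log2(scaled)):
    print(f"log2 result {s} -> approximate original {raw:.1f}")
```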