Feature anomaly smoothing

Feature Anomaly Smoothing is a machine learning preprocessing technique for handling outliers in input features. This technique smooths anomalous data to a specified range, creating a more uniform data distribution and improving the model's stability and predictive performance. The component supports both sparse and dense data formats, making it effective across different types of datasets.

How it works

The following smoothing methods are available:

  • Z-score smoothing

    If a feature follows a normal distribution, outliers are typically values that lie more than three standard deviations from the mean, that is, outside the range [mean-3×alpha, mean+3×alpha], where alpha is the standard deviation. This method smooths these outliers by capping them at the boundaries of that range.

    For example, if a feature follows a normal distribution with a mean of 0 and a standard deviation of 3, a feature value of -10 is identified as an outlier and smoothed to -3×3+0, resulting in -9. Similarly, a value of 10 is smoothed to 3×3+0, resulting in 9.

  • Percentile smoothing

    Smooths data points that fall outside the [minPer, maxPer] percentile range to the values at the minPer and maxPer percentiles.

    For example, for an age feature whose values are evenly spread from 0 to 200, if you set minPer to 0 and maxPer to 0.5 (the 50th percentile), the percentile range is 0 to 100, and any feature value outside that range is corrected to 0 or 100.

  • Threshold smoothing

    Smooths data points that fall outside a specified [minThresh, maxThresh] range by capping the values at the minThresh and maxThresh boundaries.

    For example, for an age feature with values from 0 to 200, if you set minThresh to 10 and maxThresh to 80, any feature value outside the range of 10 to 80 is corrected to 10 or 80.

  • Boxplot smoothing

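    This method constructs upper and lower smoothing thresholds based on the data's quartiles, where q1 is the first quartile and q3 is the third quartile. The thresholds are calculated as follows: minThresh=q1-1.5*(q3-q1) and maxThresh=q3+1.5*(q3-q1). Values outside the [minThresh, maxThresh] range are capped at these thresholds.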

Note

The Feature Anomaly Smoothing component only smooths anomalous values and does not filter or delete any records. Therefore, the number of rows and columns in the data remains unchanged.
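
The four methods above differ only in how the lower and upper bounds are derived. Once the bounds are known, every method caps values to them. The following Python sketch illustrates this idea. It is not the component's implementation: the function names are hypothetical, the Z-score bounds assume the population standard deviation and a fixed multiplier of 3, and the quantile function assumes linear interpolation, which is only one of several common conventions.

```python
import statistics


def quantile(values, p):
    """Quantile for p in [0, 1] using linear interpolation (an assumed convention)."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    i = int(pos)
    frac = pos - i
    if i + 1 < len(s):
        return s[i] + frac * (s[i + 1] - s[i])
    return s[i]


def zscore_bounds(values, k=3.0):
    """Bounds of mean +/- k standard deviations (population std is an assumption)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - k * std, mean + k * std


def percentile_bounds(values, min_per, max_per):
    """Bounds at the minPer and maxPer percentiles, given as fractions in [0, 1]."""
    return quantile(values, min_per), quantile(values, max_per)


def boxplot_bounds(values):
    """Bounds of q1 - 1.5*(q3 - q1) and q3 + 1.5*(q3 - q1)."""
    q1, q3 = quantile(values, 0.25), quantile(values, 0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def smooth(values, bounds):
    """Cap every value to [lo, hi]; the number of values never changes."""
    lo, hi = bounds
    return [max(lo, min(hi, v)) for v in values]


# Z-score example from the text: mean 0, standard deviation 3.
print(smooth([-10, 10], (-9.0, 9.0)))

# Percentile example from the text: ages 0 to 200, minPer 0, maxPer 0.5.
ages = list(range(0, 201))
print(percentile_bounds(ages, 0.0, 0.5))
```

Threshold smoothing needs no bound calculation: pass the configured pair directly, for example `smooth(ages, (10, 80))`.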

Configure the component

Method 1: Use the visual interface

In Machine Learning Designer, add the Feature Anomaly Smoothing component to your pipeline and configure its parameters on the right-side panel.

| Group | Parameter | Description |
| --- | --- | --- |
| Field Settings | Select Feature Columns to Smooth | Specifies the feature columns to smooth. |
| Field Settings | Label Column | When specified, this parameter allows you to view an x-y distribution histogram of features against the target variable. |
| Parameter Settings | Smoothing Method | The smoothing method. Valid values: Z-score smoothing, Percentile smoothing, Threshold smoothing, Boxplot smoothing. |
| Parameter Settings | Confidence Range | The confidence level. This parameter is required when Smoothing Method is set to Z-score smoothing. |
| Parameter Settings | Lower Threshold | A default value of -9999 indicates that no minimum threshold is set. This parameter is required when Smoothing Method is set to threshold smoothing. |
| Parameter Settings | Upper Threshold | A default value of -9999 indicates that no maximum threshold is set. This parameter is required when Smoothing Method is set to threshold smoothing. |
| Parameter Settings | Lower Percentile | The minimum percentile. This parameter is required when Smoothing Method is set to percentile smoothing or boxplot smoothing. |
| Parameter Settings | Upper Percentile | The maximum percentile. This parameter is required when Smoothing Method is set to percentile smoothing or boxplot smoothing. |
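
The threshold parameters use -9999 to mean that a bound is not set. The following Python sketch shows one way to read that rule, in which an unset bound leaves that side of the range open. The function name and the exact handling are assumptions for illustration; the source states only what the default value indicates.

```python
NO_THRESHOLD = -9999  # default value that indicates "no threshold is set"


def threshold_smooth(value, min_thresh=NO_THRESHOLD, max_thresh=NO_THRESHOLD):
    """Cap value at whichever bounds are set; an unset bound is ignored (assumed)."""
    if min_thresh != NO_THRESHOLD and value < min_thresh:
        return min_thresh
    if max_thresh != NO_THRESHOLD and value > max_thresh:
        return max_thresh
    return value


print(threshold_smooth(5, min_thresh=10))    # only a lower bound is set
print(threshold_smooth(500, min_thresh=10))  # no upper bound, value is kept
print(threshold_smooth(500, 10, 80))         # both bounds are set
```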

Method 2: Use a PAI command

You can also configure the Feature Anomaly Smoothing component by using a PAI command inside an SQL Script component. For more information, see SQL Script component.

PAI -name fe_soften_runner -project algo_public
    -DminThresh=5000
    -Dlifecycle=28
    -DsoftenMethod=min-max-thresh
    -DsoftenCols=nr_employed
    -DmaxThresh=6000
    -DinputTable=pai_dense_10_1
    -DoutputTable=pai_temp_2262_20381_1;

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTable | Yes | N/A | The name of the input table. |
| inputTablePartitions | No | All partitions of the input table. | Specify the partitions in the input table to use for training in the format Partition_name=value. For multi-level partitions, the format is name1=value1/name2=value2;. If you specify multiple partitions, use a , to separate them. |
| outputTable | Yes | N/A | The output table that stores the smoothed results. |
| labelCol | No | None | The label column. If you specify this parameter, you can view an x-y distribution histogram of features against the target variable. |
| categoryCols | No | None | Specifies the columns to process as categorical features. |
| softenCols | Yes | N/A | Specifies the feature columns to smooth. If this parameter is omitted for sparse features, the component uses the softenTopN parameter to automatically select features. |
| softenMethod | No | ZScore | The smoothing method. Valid values: ZScore (Z-score smoothing), min-max-per (Percentile smoothing), min-max-thresh (Threshold smoothing), boxplot (Boxplot smoothing). |
| softenTopN | No | 10 | If you do not specify the softenCols parameter, the system automatically selects the top N features to smooth. The value must be a positive integer. |
| cl | No | 10 | The confidence level. This parameter is required when softenMethod is set to ZScore. |
| minPer | No | 0.0 | The minimum percentile. This parameter is required when softenMethod is set to min-max-per or boxplot. |
| maxPer | No | 1.0 | The maximum percentile. This parameter is required when softenMethod is set to min-max-per or boxplot. |
| minThresh | No | -9999 | The minimum threshold. This parameter is required when softenMethod is set to min-max-thresh. |
| maxThresh | No | -9999 | The maximum threshold. This parameter is required when softenMethod is set to min-max-thresh. |
| isSparse | No | false | Specifies whether the input data uses a sparse key-value format. Valid values: true, false. By default, the input is treated as dense data. |
| itemSpliter | No | , | The delimiter for items in a sparse feature. |
| kvSpliter | No | : | The delimiter for the key and value within each item of a sparse feature. |
| lifecycle | No | 7 | The lifecycle of the output table in days. The value must be a positive integer. |
| coreNum | No | System-assigned | The number of cores. This parameter is used with memSizePerCore. The value must be an integer from 1 to 9,999. |
| memSizePerCore | No | System-assigned | The memory size per core, in MB. The value must be an integer from 2,048 to 65,536. |
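
For sparse input, the itemSpliter and kvSpliter parameters describe how each row is split into items and then into key-value pairs. The following Python sketch shows how a row such as `1:4991.6,2:0.5` could be parsed, smoothed, and written back with the default delimiters. The function name, the `soften_keys` argument, and the output number formatting are hypothetical; the component's actual sparse handling may differ.

```python
def smooth_sparse_row(row, lo, hi, soften_keys, item_spliter=",", kv_spliter=":"):
    """Cap the values of the selected keys in one sparse key-value row."""
    items = []
    for item in row.split(item_spliter):
        if not item:
            continue  # skip empty items, for example from an empty row
        key, raw = item.split(kv_spliter, 1)
        value = float(raw)
        if key in soften_keys:
            value = max(float(lo), min(float(hi), value))
        items.append(f"{key}{kv_spliter}{value:g}")
    return item_spliter.join(items)


# Only key "1" is smoothed; key "2" passes through unchanged.
print(smooth_sparse_row("1:4991.6,2:0.5", 5000, 6000, {"1"}))
```

Keys that are not selected keep their values, and no item is dropped, which matches the component's guarantee that smoothing does not remove data.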

Example

  • Input data

    create table if not exists pai_dense_10_1 as
    select
        nr_employed
    from bank_data limit 10;

    | nr_employed |
    | --- |
    | 5228.1 |
    | 5195.8 |
    | 4991.6 |
    | 5099.1 |
    | 5076.2 |
    | 5228.1 |
    | 5099.1 |
    | 5099.1 |
    | 5076.2 |
    | 5099.1 |

  • Parameter configuration

    For Select Feature Columns to Smooth, select nr_employed. In the Parameter Settings section, set Smoothing Method to threshold smoothing, Lower Threshold to 5000, and Upper Threshold to 6000.

  • Results

    | nr_employed |
    | --- |
    | 5228.1 |
    | 5195.8 |
    | 5000.0 |
    | 5099.1 |
    | 5076.2 |
    | 5228.1 |
    | 5099.1 |
    | 5099.1 |
    | 5076.2 |
    | 5099.1 |
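
To check the result by hand, the same threshold rule can be applied to the ten input values in a few lines of Python. This is an illustrative sketch that runs outside PAI; the variable names are hypothetical.

```python
# Input values of nr_employed from the example above.
nr_employed = [5228.1, 5195.8, 4991.6, 5099.1, 5076.2,
               5228.1, 5099.1, 5099.1, 5076.2, 5099.1]
min_thresh, max_thresh = 5000.0, 6000.0

# Cap each value to [min_thresh, max_thresh]; no rows are added or removed.
smoothed = [max(min_thresh, min(max_thresh, v)) for v in nr_employed]

# Collect (row index, original value, smoothed value) for rows that changed.
changed = [(i, before, after)
           for i, (before, after) in enumerate(zip(nr_employed, smoothed))
           if before != after]
print(changed)
```

Only the third value, 4991.6, falls below the lower threshold, so it is the only one replaced (by 5000.0). The other nine values already lie within the range of 5000 to 6000.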