Feature anomaly smoothing - Platform for AI (PAI) - Alibaba Cloud Help Center

Feature Anomaly Smoothing is a machine learning preprocessing technique for handling outliers in input features. This technique smooths anomalous data to a specified range, creating a more uniform data distribution and improving the model's stability and predictive performance. The component supports both sparse and dense data formats, making it effective across different types of datasets.

How it works

The following smoothing methods are available:

Z-score smoothing

If a feature follows a normal distribution, outliers are typically values that fall outside the range of mean-3×alpha to mean+3×alpha, where mean is the feature's mean and alpha is its standard deviation. This method smooths these outliers by constraining them to the [mean-3×alpha, mean+3×alpha] range.
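The clipping rule can be sketched in a few lines of Python. This is an illustration only, not the component's implementation: the function name zscore_clip and the multiplier argument k are hypothetical, and the mean and standard deviation are assumed to be known in advance.

```python
def zscore_clip(value, mean, std, k=3.0):
    """Clamp a value to [mean - k*std, mean + k*std].

    Hypothetical helper for illustration; k=3.0 mirrors the 3x rule
    described in the text.
    """
    lower = mean - k * std
    upper = mean + k * std
    return min(max(value, lower), upper)


# Feature with mean 0 and standard deviation 3.
print(zscore_clip(-10, mean=0, std=3))  # -9.0
print(zscore_clip(10, mean=0, std=3))   # 9.0
print(zscore_clip(4, mean=0, std=3))    # 4 (inside the range, unchanged)
```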

For example, if a feature follows a normal distribution with a mean of 0 and a standard deviation of 3, a feature value of -10 is identified as an outlier and smoothed to -3×3+0, resulting in -9. Similarly, a value of 10 is smoothed to 3×3+0, resulting in 9.

Percentile smoothing

Smooths data points that fall outside the [minPer, maxPer] percentile range to the values at the minPer and maxPer percentiles.
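A rough Python sketch of percentile smoothing follows. It assumes percentiles are given as fractions between 0.0 and 1.0 and are computed by linear interpolation over the sorted values; the component may compute percentiles differently, and the function names are hypothetical.

```python
def percentile(sorted_vals, p):
    """Value at fraction p (0.0 to 1.0) of an already sorted list.

    Uses linear interpolation between neighbors (an assumption made
    for this sketch).
    """
    if not sorted_vals:
        raise ValueError("percentile() needs at least one value")
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac


def percentile_smooth(values, min_per=0.0, max_per=1.0):
    """Clamp every value to the values at the min_per and max_per percentiles."""
    ordered = sorted(values)
    lower = percentile(ordered, min_per)
    upper = percentile(ordered, max_per)
    return [min(max(v, lower), upper) for v in values]


# Ages spread evenly from 0 to 200; the 50th percentile is 100.
ages = [float(a) for a in range(0, 201, 20)]
smoothed = percentile_smooth(ages, min_per=0.0, max_per=0.5)
print(smoothed)  # values above 100 are pulled down to 100
```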

For example, for an age feature with values spread evenly from 0 to 200, if you set minPer to 0 and maxPer to 0.5 (the 50th percentile), the values at those percentiles are 0 and 100. Any feature value outside the range of 0 to 100 is corrected to 0 or 100.

Threshold smoothing

Smooths data points that fall outside a specified [minThresh, maxThresh] range by capping the values at the minThresh and maxThresh boundaries.
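The capping rule can be sketched in Python as follows. The sketch assumes that the documented default of -9999 means "do not clip on this side"; the function name and the constant NO_THRESHOLD are hypothetical.

```python
NO_THRESHOLD = -9999  # documented default: no threshold is set on that side


def threshold_smooth(values, min_thresh=NO_THRESHOLD, max_thresh=NO_THRESHOLD):
    """Cap values at min_thresh and max_thresh, skipping any unset side."""
    result = []
    for v in values:
        if min_thresh != NO_THRESHOLD and v < min_thresh:
            v = min_thresh
        if max_thresh != NO_THRESHOLD and v > max_thresh:
            v = max_thresh
        result.append(v)
    return result


ages = [0, 5, 10, 42, 80, 150, 200]
print(threshold_smooth(ages, min_thresh=10, max_thresh=80))
# [10, 10, 10, 42, 80, 80, 80]
```

The output has the same length as the input: values are capped, never dropped.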

For example, for an age feature with values from 0 to 200, if you set minThresh to 10 and maxThresh to 80, any feature value outside the range of 10 to 80 is corrected to 10 or 80.

Boxplot smoothing

This method constructs upper and lower smoothing thresholds based on the data's quartiles, where q1 is the first quartile and q3 is the third quartile. The thresholds are calculated as follows: minThresh=q1-1.5*(q3-q1) and maxThresh=q3+1.5*(q3-q1).
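The threshold formulas can be tried out with Python's standard statistics module. This is a sketch under one assumption: quartiles are computed with the "inclusive" interpolation method, which may differ from how the component computes them. The function names are hypothetical.

```python
import statistics


def boxplot_thresholds(values):
    """Return (minThresh, maxThresh) from the quartile formulas in the text."""
    # quantiles(n=4) returns the three cut points q1, median, q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def boxplot_smooth(values):
    """Clamp every value to the boxplot thresholds."""
    lower, upper = boxplot_thresholds(values)
    return [min(max(v, lower), upper) for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(boxplot_thresholds(data))  # (-3.5, 14.5)
print(boxplot_smooth(data))      # 100 is pulled down to 14.5
```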

Note

The Feature Anomaly Smoothing component only smooths anomalous values and does not filter or delete any records. Therefore, the number of rows and columns in the data remains unchanged.

Configure the component

Method 1: Use the visual interface

In Machine Learning Designer, add the Feature Anomaly Smoothing component to your pipeline and configure its parameters on the right-side panel.

Group	Parameter	Description
Field Settings	Select Feature Columns to Smooth	Specifies the feature columns to smooth.
Field Settings	Label Column	When specified, this parameter allows you to view an x-y distribution histogram of features against the target variable.
Parameter Settings	Smoothing Method	The smoothing method. Valid values: Z-score smoothing, percentile smoothing, threshold smoothing, and boxplot smoothing.
Parameter Settings	Confidence Range	The confidence level. This parameter is required when Smoothing Method is set to Z-score smoothing.
Parameter Settings	Lower Threshold	The minimum threshold. The default value of -9999 indicates that no minimum threshold is set. This parameter is required when Smoothing Method is set to threshold smoothing.
Parameter Settings	Upper Threshold	The maximum threshold. The default value of -9999 indicates that no maximum threshold is set. This parameter is required when Smoothing Method is set to threshold smoothing.
Parameter Settings	Lower Percentile	The minimum percentile. This parameter is required when Smoothing Method is set to percentile smoothing or boxplot smoothing.
Parameter Settings	Upper Percentile	The maximum percentile. This parameter is required when Smoothing Method is set to percentile smoothing or boxplot smoothing.

Method 2: Use a PAI command

You can also configure the Feature Anomaly Smoothing component by using a PAI command inside an SQL Script component. For more information, see SQL Script component.

PAI -name fe_soften_runner -project algo_public
    -DminThresh=5000
    -Dlifecycle=28
    -DsoftenMethod=min-max-thresh
    -DsoftenCols=nr_employed
    -DmaxThresh=6000
    -DinputTable=pai_dense_10_1
    -DoutputTable=pai_temp_2262_20381_1;

Parameter	Required	Default	Description
inputTable	Yes	N/A	The name of the input table.
inputTablePartitions	No	All partitions of the input table.	The partitions of the input table to process, in the format `Partition_name=value`. For multi-level partitions, the format is `name1=value1/name2=value2`. To specify multiple partitions, separate them with commas (,).
outputTable	Yes	N/A	The output table that stores the smoothed results.
labelCol	No	None	The label column. If you specify this parameter, you can view an x-y distribution histogram of features against the target variable.
categoryCols	No	None	Specifies the columns to process as categorical features.
softenCols	Yes	N/A	Specifies the feature columns to smooth. If this parameter is omitted for sparse features, the component uses the softenTopN parameter to automatically select features.
softenMethod	No	ZScore	The smoothing method. Valid values: ZScore (Z-score smoothing), min-max-per (percentile smoothing), min-max-thresh (threshold smoothing), and boxplot (boxplot smoothing).
softenTopN	No	10	If you do not specify the softenCols parameter, the system automatically selects the top N features to smooth. The value must be a positive integer.
cl	No	10	The confidence level. This parameter is required when softenMethod is set to ZScore.
minPer	No	0.0	The minimum percentile. This parameter is required when softenMethod is set to min-max-per or boxplot.
maxPer	No	1.0	The maximum percentile. This parameter is required when softenMethod is set to min-max-per or boxplot.
minThresh	No	-9999	The minimum threshold. This parameter is required when softenMethod is set to min-max-thresh.
maxThresh	No	-9999	The maximum threshold. This parameter is required when softenMethod is set to min-max-thresh.
isSparse	No	false	Specifies whether the input data uses a sparse key-value format. Valid values: true and false. By default, the input is treated as dense data.
itemSpliter	No	,	The delimiter for items in a sparse feature.
kvSpliter	No	:	The delimiter for the key and value within each item of a sparse feature.
lifecycle	No	7	The lifecycle of the output table in days. The value must be a positive integer.
coreNum	No	System-assigned	The number of cores. This parameter is used with memSizePerCore. The value must be an integer from 1 to 9,999.
memSizePerCore	No	System-assigned	The memory size per core, in MB. The value must be an integer from 2,048 to 65,536.
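To make the sparse key-value format more concrete, the following Python sketch parses a sparse feature string using the default itemSpliter (,) and kvSpliter (:) values, caps one feature, and serializes the result again. The sample row, the feature keys, and the helper functions are hypothetical; the component's actual handling of sparse input is not shown here.

```python
def parse_sparse(row, item_spliter=",", kv_spliter=":"):
    """Parse 'key:value,key:value' into a dict of float values."""
    features = {}
    for item in row.split(item_spliter):
        if not item:
            continue  # tolerate empty items such as a trailing delimiter
        key, value = item.split(kv_spliter, 1)
        features[key.strip()] = float(value)
    return features


def format_sparse(features, item_spliter=",", kv_spliter=":"):
    """Serialize a dict back into the sparse key-value string."""
    return item_spliter.join(f"{k}{kv_spliter}{v}" for k, v in features.items())


row = "1:0.5,7:120.0,9:3.0"  # hypothetical sparse row
features = parse_sparse(row)
# Cap feature "7" to the range [0, 100], as threshold smoothing would.
features["7"] = min(max(features["7"], 0.0), 100.0)
print(format_sparse(features))  # 1:0.5,7:100.0,9:3.0
```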

Example

Input data

create table if not exists pai_dense_10_1 as
select
    nr_employed
from bank_data limit 10;

nr_employed
5228.1
5195.8
4991.6
5099.1
5076.2
5228.1
5099.1
5099.1
5076.2
5099.1

Parameter configuration

For Select Feature Columns to Smooth, select nr_employed. In the Parameter Settings section, set Smoothing Method to threshold smoothing, Lower Threshold to 5000, and Upper Threshold to 6000.

Results

nr_employed
5228.1
5195.8
5000.0
5099.1
5076.2
5228.1
5099.1
5099.1
5076.2
5099.1