Feature Anomaly Smoothing is a machine learning preprocessing technique for handling outliers in input features. It caps anomalous values to a specified range, which produces a more uniform data distribution and can improve model stability and predictive performance. The component supports both sparse and dense data formats.
How it works
The following smoothing methods are available:
- Z-score smoothing
If a feature follows a normal distribution, outliers are typically values that fall outside the range [mean-3×alpha, mean+3×alpha], where alpha is the standard deviation of the feature. This method caps such outliers at the nearest boundary of that range.
For example, if a feature follows a normal distribution with a mean of 0 and a standard deviation of 3, a feature value of -10 is identified as an outlier and smoothed to -3×3+0, resulting in -9. Similarly, a value of 10 is smoothed to 3×3+0, resulting in 9.
- Percentile smoothing
Caps data points that fall outside the [minPer, maxPer] percentile range at the values found at the minPer and maxPer percentiles.
For example, for an age feature whose values are evenly spread from 0 to 200, if you set minPer to 0 and maxPer to 0.5 (50%), the values at those percentiles are 0 and 100, so any feature value outside the range of 0 to 100 is corrected to 0 or 100.
- Threshold smoothing
Caps data points that fall outside a specified [minThresh, maxThresh] range at the minThresh and maxThresh boundaries.
For example, for an age feature with values from 0 to 200, if you set minThresh to 10 and maxThresh to 80, any feature value below 10 is corrected to 10 and any value above 80 is corrected to 80.
- Boxplot smoothing
Constructs the lower and upper smoothing thresholds from the quartiles of the data, where q1 is the first quartile and q3 is the third quartile: minThresh=q1-1.5*(q3-q1) and maxThresh=q3+1.5*(q3-q1). Values outside [minThresh, maxThresh] are capped at the nearest threshold.
The Feature Anomaly Smoothing component only smooths anomalous values and does not filter or delete any records. Therefore, the number of rows and columns in the data remains unchanged.
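The four methods described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the component's implementation: the function names are hypothetical, the percentile definition (linear interpolation) is an assumption, and the Z-score multiplier is fixed at 3 as in the description, without modeling how the confidence setting maps to it.

```python
import statistics

UNSET = -9999  # documented default meaning "no threshold set"


def clip(values, low=None, high=None):
    """Cap each value to [low, high]; a bound of None is not applied."""
    result = []
    for v in values:
        if low is not None and v < low:
            v = low
        elif high is not None and v > high:
            v = high
        result.append(v)
    return result


def percentile(sorted_values, p):
    """Percentile for p in [0, 1] using linear interpolation (assumed definition)."""
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    span = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + span * (pos - lower)


def zscore_smooth(values, mean=None, std=None, k=3.0):
    """Cap values to [mean - k*std, mean + k*std]; stats are estimated if not given."""
    mean = statistics.fmean(values) if mean is None else mean
    std = statistics.pstdev(values) if std is None else std
    return clip(values, mean - k * std, mean + k * std)


def percentile_smooth(values, min_per=0.0, max_per=1.0):
    """Cap values to the values found at the min_per and max_per percentiles."""
    ordered = sorted(values)
    return clip(values, percentile(ordered, min_per), percentile(ordered, max_per))


def threshold_smooth(values, min_thresh=UNSET, max_thresh=UNSET):
    """Cap values to [min_thresh, max_thresh]; -9999 leaves that side unbounded."""
    low = None if min_thresh == UNSET else min_thresh
    high = None if max_thresh == UNSET else max_thresh
    return clip(values, low, high)


def boxplot_smooth(values):
    """Cap values to [q1 - 1.5*(q3-q1), q3 + 1.5*(q3-q1)]."""
    ordered = sorted(values)
    q1, q3 = percentile(ordered, 0.25), percentile(ordered, 0.75)
    iqr = q3 - q1
    return clip(values, q1 - 1.5 * iqr, q3 + 1.5 * iqr)


print(zscore_smooth([-10, 2, 10], mean=0, std=3))  # [-9.0, 2, 9.0]
print(threshold_smooth([5, 50, 150], 10, 80))      # [10, 50, 80]
```

Every function returns a list of the same length as its input, which mirrors the statement that the component changes values but never removes records.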
Configure the component
Method 1: Use the visual interface
In Machine Learning Designer, add the Feature Anomaly Smoothing component to your pipeline and configure its parameters on the right-side panel.
| Group | Parameter | Description |
| --- | --- | --- |
| Field Settings | Select Feature Columns to Smooth | The feature columns to smooth. |
| Field Settings | Label Column | When specified, this parameter allows you to view an x-y distribution histogram of features against the target variable. |
| Parameter Settings | Smoothing Method | The smoothing method. Valid values: Z-Score smoothing, percentile smoothing, threshold smoothing, and boxplot smoothing. |
| Parameter Settings | Confidence Range | The confidence level. This parameter is required when Smoothing Method is set to Z-Score smoothing. |
| Parameter Settings | Lower Threshold | The minimum threshold. The default value -9999 indicates that no minimum threshold is set. This parameter is required when Smoothing Method is set to threshold smoothing. |
| Parameter Settings | Upper Threshold | The maximum threshold. The default value -9999 indicates that no maximum threshold is set. This parameter is required when Smoothing Method is set to threshold smoothing. |
| Parameter Settings | Lower Percentile | The minimum percentile. This parameter is required when Smoothing Method is set to percentile smoothing or boxplot smoothing. |
| Parameter Settings | Upper Percentile | The maximum percentile. This parameter is required when Smoothing Method is set to percentile smoothing or boxplot smoothing. |
Method 2: Use a PAI command
You can also configure the Feature Anomaly Smoothing component by using a PAI command inside an SQL Script component. For more information, see SQL Script component.
PAI -name fe_soften_runner -project algo_public
-DminThresh=5000
-Dlifecycle=28
-DsoftenMethod=min-max-thresh
-DsoftenCols=nr_employed
-DmaxThresh=6000
-DinputTable=pai_dense_10_1
-DoutputTable=pai_temp_2262_20381_1;
| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTable | Yes | N/A | The name of the input table. |
| inputTablePartitions | No | All partitions of the input table | The partitions of the input table to use. If you specify multiple partitions, separate them with a comma (,). |
| outputTable | Yes | N/A | The output table that stores the smoothed results. |
| labelCol | No | None | The label column. If you specify this parameter, you can view an x-y distribution histogram of features against the target variable. |
| categoryCols | No | None | The columns to process as categorical features. |
| softenCols | Yes | N/A | The feature columns to smooth. If this parameter is omitted for sparse features, the component uses the softenTopN parameter to automatically select features. |
| softenMethod | No | ZScore | The smoothing method. Valid values: ZScore, min-max-thresh, min-max-per, and boxplot. |
| softenTopN | No | 10 | If you do not specify the softenCols parameter, the system automatically selects the top N features to smooth. The value must be a positive integer. |
| cl | No | 10 | The confidence level. This parameter is required when softenMethod is set to ZScore. |
| minPer | No | 0.0 | The minimum percentile. This parameter is required when softenMethod is set to min-max-per or boxplot. |
| maxPer | No | 1.0 | The maximum percentile. This parameter is required when softenMethod is set to min-max-per or boxplot. |
| minThresh | No | -9999 | The minimum threshold. The default value -9999 indicates that no minimum threshold is set. This parameter is required when softenMethod is set to min-max-thresh. |
| maxThresh | No | -9999 | The maximum threshold. The default value -9999 indicates that no maximum threshold is set. This parameter is required when softenMethod is set to min-max-thresh. |
| isSparse | No | false | Specifies whether the input data uses a sparse key-value format. Valid values: true (sparse) and false (dense). By default, the input is treated as dense data. |
| itemSpliter | No | , | The delimiter between items in a sparse feature. |
| kvSpliter | No | : | The delimiter between the key and the value within each item of a sparse feature. |
| lifecycle | No | | The lifecycle of the output table in days. The value must be a positive integer. |
| coreNum | No | System-assigned | The number of cores. This parameter is used with memSizePerCore. The value must be an integer from 1 to 9,999. |
| memSizePerCore | No | System-assigned | The memory size per core, in MB. The value must be an integer from 2,048 to 65,536. |
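To make the sparse key-value format concrete, the following Python sketch parses a sparse feature string by using the default itemSpliter (,) and kvSpliter (:) delimiters, caps selected keys to a range, and writes the string back. The function names, the sample row, and the per-key bounds dictionary are hypothetical and only illustrate the data layout; they are not part of the component.

```python
def parse_sparse(row, item_spliter=",", kv_spliter=":"):
    """Split 'k1:v1,k2:v2' into an ordered dict of key -> float value."""
    items = {}
    for item in row.split(item_spliter):
        if not item:
            continue  # tolerate empty items such as a trailing delimiter
        key, value = item.split(kv_spliter, 1)
        items[key] = float(value)
    return items


def smooth_sparse_row(row, bounds, item_spliter=",", kv_spliter=":"):
    """Cap the values of keys listed in bounds; other keys pass through unchanged.

    bounds maps a key to a (low, high) pair, for example {"1": (5000, 6000)}.
    """
    items = parse_sparse(row, item_spliter, kv_spliter)
    for key, (low, high) in bounds.items():
        if key in items:
            items[key] = float(max(low, min(high, items[key])))
    return item_spliter.join(f"{k}{kv_spliter}{v}" for k, v in items.items())


# Hypothetical sparse row: keys 1 and 7 are capped to [5000, 6000], key 3 is untouched.
print(smooth_sparse_row("1:4800,3:5100,7:7000", {"1": (5000, 6000), "7": (5000, 6000)}))
# 1:5000.0,3:5100.0,7:6000.0
```

Keys that are absent from a row are simply skipped, so the number of items per row stays the same after smoothing.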
Example
- Input data
create table if not exists pai_dense_10_1 as select nr_employed from bank_data limit 10;
nr_employed
5228.1
5195.8
4991.6
5099.1
5076.2
5228.1
5099.1
5099.1
5076.2
5099.1
- Parameter configuration
For Select Feature Columns to Smooth, select nr_employed. In the Parameter Settings section, set Smoothing Method to threshold smoothing, Lower Threshold to 5000, and Upper Threshold to 6000.
- Results
nr_employed
5228.1
5195.8
5000.0
5099.1
5076.2
5228.1
5099.1
5099.1
5076.2
5099.1
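The result above can be checked by hand with a short Python sketch that applies the same thresholds (5000 and 6000) to the ten input values. This only reproduces the arithmetic of threshold smoothing; it does not call the component.

```python
# Input values of nr_employed from the example table.
nr_employed = [5228.1, 5195.8, 4991.6, 5099.1, 5076.2,
               5228.1, 5099.1, 5099.1, 5076.2, 5099.1]

MIN_THRESH = 5000.0  # Lower Threshold / minThresh
MAX_THRESH = 6000.0  # Upper Threshold / maxThresh

# Cap each value to [MIN_THRESH, MAX_THRESH]; rows are never dropped.
smoothed = [min(max(v, MIN_THRESH), MAX_THRESH) for v in nr_employed]

for before, after in zip(nr_employed, smoothed):
    marker = "  <- smoothed" if before != after else ""
    print(f"{before} -> {after}{marker}")
```

Only the third value, 4991.6, lies below the lower threshold, so it is the only one that changes (to 5000.0).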