Missing data imputation-Platform For AI(PAI)-阿里云帮助中心

The Missing Data Imputation component replaces null values or empty strings in your dataset with computed or custom values—such as the column mean, maximum, or a user-defined constant. Use it in a Machine Learning Designer pipeline to clean data before training or inference.

How it works

All imputation rules are defined as column-level policies: for each column, you specify what counts as a missing value (the trigger) and what value replaces it. The component writes results to a new output table—the source table is unchanged.

Each run also produces a parameter table that stores the computed replacement values (for example, the calculated mean of a numeric column). Pass this parameter table to a subsequent run via inputParaTableName to apply the same imputation logic to new data with the same schema—useful for batch preprocessing pipelines that re-import data regularly.

Configure the component

Method 1: Machine Learning Designer

In Machine Learning Designer, add the Missing Data Imputation component to your pipeline and configure the following parameters.

Fields Setting tab

Parameter	Description
Columns to Impute	The columns to apply imputation to.
Original value	The condition that triggers imputation. Valid values: `Null (Numerical and String)`, `Empty String`, `Null and Empty String (String)`, `Custom (String)`.
Replace with	The replacement value. Valid values: `Minimum (Numerical)`, `Maximum (Numerical)`, `Mean (Numerical)`, `Custom (Numerical and String)`. `Minimum`, `Maximum`, and `Mean` apply to numeric columns only. To replace missing values in string columns, use `Custom`.
Configs	A custom replacement policy in the format `Column 1, Original value, New value; Column 2, Original value, New value; ...`. Displayed only when Advanced Options is selected.

Tuning tab

Parameter	Description
Cores	Number of compute cores.
Memory size per core	Memory allocated per core, in MB.

Method 2: PAI commands

Use the FillMissingValues algorithm via PAI commands. To run PAI commands in a pipeline, use the SQL Script component (Scenario 4).

PAI -name FillMissingValues
    -project algo_public
    -Dconfigs="poutcome,null-empty,testing"
    -DoutputParaTableName="test_input_model_output"
    -DoutputTableName="test_3"
    -DinputTablePartitions="pt=20150501"
    -DinputTableName="bank_data_partition";

Parameters

Parameter	Required	Default	Description
`inputTableName`	Yes	—	Name of the input table.
`outputTableName`	Yes	—	Name of the output table.
`configs`	Yes	—	Imputation rules. Format: `col1, null, 3.14; col2, empty, hello; col3, empty-null, world`. Use `null` for null values and `empty` for empty strings. For computed replacements, use the variable names `max`, `min`, or `mean` (numeric columns only). For custom string replacements, use the format `col4, user-defined, str, str123`. Columns with empty string replacements must be of the `string` type.
`outputParaTableName`	Yes	Output Parameter Table 1 (non-partitioned)	Name of the output parameter table. Stores the computed replacement values for reuse.
`inputParaTableName`	No	—	Name of a previously saved parameter table. When specified, the component applies saved imputation rules instead of computing new ones from `configs`.
`inputTablePartitions`	No	All partitions	Partitions to read from the input table. Formats: `partition_name=value` or `name1=value1/name2=value2` for multi-level partitions. Separate multiple partitions with commas.
`lifecycle`	No	—	Retention period of the output table, in days. Valid values: [1, 3650].
`coreNum`	No	System default	Number of compute cores. Must be a positive integer.
`memSizePerCore`	No	System default	Memory per core, in MB. Valid values: (1, 65536).

Apply a saved imputation model to new data

After running the component, outputParaTableName stores the computed replacement values as a model artifact. To apply the same imputation rules to a new dataset with the same schema, pass the saved parameter table as inputParaTableName in the next run:

PAI -name FillMissingValues
    -project algo_public
    -DinputParaTableName="fill_missing_values_test_input_model_output"
    -DoutputParaTableName="new_run_model_output"
    -DoutputTableName="new_run_output"
    -DinputTableName="new_input_table";

When inputParaTableName is set, configs is ignored and the component uses the stored rules directly. This is useful for batch preprocessing pipelines that regularly re-import data with the same schema.

Example

The following example imputes missing values across five data types: string, bigint, double, boolean, and datetime.

Step 1: Create the test table

drop table if exists fill_missing_values_test_input;
create table fill_missing_values_test_input(
    col_string  string,
    col_bigint  bigint,
    col_double  double,
    col_boolean boolean,
    col_datetime datetime);
insert overwrite table fill_missing_values_test_input
select * from
(
    select '01' as col_string, 10 as col_bigint, 10.1 as col_double, True as col_boolean, cast('2016-07-01 10:00:00' as datetime) as col_datetime
    union all
    select cast(null as string), 11, 10.2, False, cast('2016-07-02 10:00:00' as datetime)
    union all
    select '02', cast(null as bigint), 10.3, True, cast('2016-07-03 10:00:00' as datetime)
    union all
    select '03', 12, cast(null as double), False, cast('2016-07-04 10:00:00' as datetime)
    union all
    select '04', 13, 10.4, cast(null as boolean), cast('2016-07-05 10:00:00' as datetime)
    union all
    select '05', 14, 10.5, True, cast(null as datetime)
) tmp;

Input data:

+------------+------------+------------+-------------+---------------------+
| col_string | col_bigint | col_double | col_boolean | col_datetime        |
+------------+------------+------------+-------------+---------------------+
| 01         | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
| NULL       | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
| 02         | NULL       | 10.3       | true        | 2016-07-03 10:00:00 |
| 03         | 12         | NULL       | false       | 2016-07-04 10:00:00 |
| 04         | 13         | 10.4       | NULL        | 2016-07-05 10:00:00 |
| 05         | 14         | 10.5       | true        | NULL                |
+------------+------------+------------+-------------+---------------------+

Step 2: Run imputation

The following rules are applied: replace null col_double values with the column mean, null-or-empty col_string values with a custom string, null col_bigint values with the column maximum, and null col_boolean and col_datetime values with fixed values.

drop table if exists fill_missing_values_test_input_output;
drop table if exists fill_missing_values_test_input_model_output;
PAI -name FillMissingValues
    -project algo_public
    -Dconfigs="col_double,null,mean;col_string,null-empty,str_type_empty;col_bigint,null,max;col_boolean,null,true;col_datetime,null,2016-07-06 10:00:00"
    -DoutputParaTableName="fill_missing_values_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="fill_missing_values_test_input_output"
    -DinputTableName="fill_missing_values_test_input";

Step 3: View results

Output table fill_missing_values_test_input_output — all missing values replaced:

+----------------+------------+------------+-------------+---------------------+
| col_string     | col_bigint | col_double | col_boolean | col_datetime        |
+----------------+------------+------------+-------------+---------------------+
| 01             | 10         | 10.1       | true        | 2016-07-01 10:00:00 |
| str_type_empty | 11         | 10.2       | false       | 2016-07-02 10:00:00 |
| 02             | 14         | 10.3       | true        | 2016-07-03 10:00:00 |
| 03             | 12         | 10.3       | false       | 2016-07-04 10:00:00 |
| 04             | 13         | 10.4       | true        | 2016-07-05 10:00:00 |
| 05             | 14         | 10.5       | true        | 2016-07-06 10:00:00 |
+----------------+------------+------------+-------------+---------------------+

The computed replacement values are stored in fill_missing_values_test_input_model_output:

+-------------+-----------------------------------------------------------------------------------------------+
| feature     | json                                                                                          |
+-------------+-----------------------------------------------------------------------------------------------+
| col_string  | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty", "replaced_value": "str_type_empty"}} |
| col_bigint  | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null", "replaced_value": 14}} |
| col_double  | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null", "replaced_value": 10.3}} |
| col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null", "replaced_value": 1}} |
| col_datetime| {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null", "replaced_value": 1467770400000}} |
+-------------+-----------------------------------------------------------------------------------------------+

The mean of col_double is 10.3 (computed from the five non-null values), and the maximum of col_bigint is 14. Pass fill_missing_values_test_input_model_output as inputParaTableName in a subsequent run to apply these same values to new data.