The Missing Data Imputation component replaces null values or empty strings in your dataset with computed or custom values—such as the column mean, maximum, or a user-defined constant. Use it in a Machine Learning Designer pipeline to clean data before training or inference.
How it works
All imputation rules are defined as column-level policies: for each column, you specify what counts as a missing value (the trigger) and what value replaces it. The component writes results to a new output table—the source table is unchanged.
Each run also produces a parameter table that stores the computed replacement values (for example, the calculated mean of a numeric column). Pass this parameter table to a subsequent run via inputParaTableName to apply the same imputation logic to new data with the same schema—useful for batch preprocessing pipelines that re-import data regularly.
Configure the component
Method 1: Machine Learning Designer
In Machine Learning Designer, add the Missing Data Imputation component to your pipeline and configure the following parameters.
Fields Setting tab
| Parameter | Description |
|---|---|
| Columns to Impute | The columns to apply imputation to. |
| Original value | The condition that triggers imputation. Valid values: Null (Numerical and String), Empty String, Null and Empty String (String), Custom (String). |
| Replace with | The replacement value. Valid values: Minimum (Numerical), Maximum (Numerical), Mean (Numerical), Custom (Numerical and String). Minimum, Maximum, and Mean apply to numeric columns only. To replace missing values in string columns, use Custom. |
| Configs | A custom replacement policy in the format Column 1, Original value, New value; Column 2, Original value, New value; .... Displayed only when Advanced Options is selected. |
Tuning tab
| Parameter | Description |
|---|---|
| Cores | Number of compute cores. |
| Memory size per core | Memory allocated per core, in MB. |
Method 2: PAI commands
Use the FillMissingValues algorithm via PAI commands. To run PAI commands in a pipeline, use the SQL Script component (Scenario 4).
PAI -name FillMissingValues
-project algo_public
-Dconfigs="poutcome,null-empty,testing"
-DoutputParaTableName="test_input_model_output"
-DoutputTableName="test_3"
-DinputTablePartitions="pt=20150501"
-DinputTableName="bank_data_partition";Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
inputTableName | Yes | — | Name of the input table. |
outputTableName | Yes | — | Name of the output table. |
configs | Yes | — | Imputation rules. Format: col1, null, 3.14; col2, empty, hello; col3, empty-null, world. Use null for null values and empty for empty strings. For computed replacements, use the variable names max, min, or mean (numeric columns only). For custom string replacements, use the format col4, user-defined, str, str123. Columns with empty string replacements must be of the string type. |
outputParaTableName | Yes | Output Parameter Table 1 (non-partitioned) | Name of the output parameter table. Stores the computed replacement values for reuse. |
inputParaTableName | No | — | Name of a previously saved parameter table. When specified, the component applies saved imputation rules instead of computing new ones from configs. |
inputTablePartitions | No | All partitions | Partitions to read from the input table. Formats: partition_name=value or name1=value1/name2=value2 for multi-level partitions. Separate multiple partitions with commas. |
lifecycle | No | — | Retention period of the output table, in days. Valid values: [1, 3650]. |
coreNum | No | System default | Number of compute cores. Must be a positive integer. |
memSizePerCore | No | System default | Memory per core, in MB. Valid values: (1, 65536). |
Apply a saved imputation model to new data
After running the component, outputParaTableName stores the computed replacement values as a model artifact. To apply the same imputation rules to a new dataset with the same schema, pass the saved parameter table as inputParaTableName in the next run:
PAI -name FillMissingValues
-project algo_public
-DinputParaTableName="fill_missing_values_test_input_model_output"
-DoutputParaTableName="new_run_model_output"
-DoutputTableName="new_run_output"
-DinputTableName="new_input_table";When inputParaTableName is set, configs is ignored and the component uses the stored rules directly. This is useful for batch preprocessing pipelines that regularly re-import data with the same schema.
Example
The following example imputes missing values across five data types: string, bigint, double, boolean, and datetime.
Step 1: Create the test table
drop table if exists fill_missing_values_test_input;
create table fill_missing_values_test_input(
col_string string,
col_bigint bigint,
col_double double,
col_boolean boolean,
col_datetime datetime);
insert overwrite table fill_missing_values_test_input
select * from
(
select '01' as col_string, 10 as col_bigint, 10.1 as col_double, True as col_boolean, cast('2016-07-01 10:00:00' as datetime) as col_datetime
union all
select cast(null as string), 11, 10.2, False, cast('2016-07-02 10:00:00' as datetime)
union all
select '02', cast(null as bigint), 10.3, True, cast('2016-07-03 10:00:00' as datetime)
union all
select '03', 12, cast(null as double), False, cast('2016-07-04 10:00:00' as datetime)
union all
select '04', 13, 10.4, cast(null as boolean), cast('2016-07-05 10:00:00' as datetime)
union all
select '05', 14, 10.5, True, cast(null as datetime)
) tmp;Input data:
+------------+------------+------------+-------------+---------------------+
| col_string | col_bigint | col_double | col_boolean | col_datetime |
+------------+------------+------------+-------------+---------------------+
| 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
| NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 |
| 02 | NULL | 10.3 | true | 2016-07-03 10:00:00 |
| 03 | 12 | NULL | false | 2016-07-04 10:00:00 |
| 04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 |
| 05 | 14 | 10.5 | true | NULL |
+------------+------------+------------+-------------+---------------------+Step 2: Run imputation
The following rules are applied: replace null col_double values with the column mean, null-or-empty col_string values with a custom string, null col_bigint values with the column maximum, and null col_boolean and col_datetime values with fixed values.
drop table if exists fill_missing_values_test_input_output;
drop table if exists fill_missing_values_test_input_model_output;
PAI -name FillMissingValues
-project algo_public
-Dconfigs="col_double,null,mean;col_string,null-empty,str_type_empty;col_bigint,null,max;col_boolean,null,true;col_datetime,null,2016-07-06 10:00:00"
-DoutputParaTableName="fill_missing_values_test_input_model_output"
-Dlifecycle="28"
-DoutputTableName="fill_missing_values_test_input_output"
-DinputTableName="fill_missing_values_test_input";Step 3: View results
Output table fill_missing_values_test_input_output — all missing values replaced:
+----------------+------------+------------+-------------+---------------------+
| col_string | col_bigint | col_double | col_boolean | col_datetime |
+----------------+------------+------------+-------------+---------------------+
| 01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
| str_type_empty | 11 | 10.2 | false | 2016-07-02 10:00:00 |
| 02 | 14 | 10.3 | true | 2016-07-03 10:00:00 |
| 03 | 12 | 10.3 | false | 2016-07-04 10:00:00 |
| 04 | 13 | 10.4 | true | 2016-07-05 10:00:00 |
| 05 | 14 | 10.5 | true | 2016-07-06 10:00:00 |
+----------------+------------+------------+-------------+---------------------+The computed replacement values are stored in fill_missing_values_test_input_model_output:
+-------------+-----------------------------------------------------------------------------------------------+
| feature | json |
+-------------+-----------------------------------------------------------------------------------------------+
| col_string | {"name": "fillMissingValues", "type": "string", "paras":{"missing_value_type": "null-empty", "replaced_value": "str_type_empty"}} |
| col_bigint | {"name": "fillMissingValues", "type": "bigint", "paras":{"missing_value_type": "null", "replaced_value": 14}} |
| col_double | {"name": "fillMissingValues", "type": "double", "paras":{"missing_value_type": "null", "replaced_value": 10.3}} |
| col_boolean | {"name": "fillMissingValues", "type": "boolean", "paras":{"missing_value_type": "null", "replaced_value": 1}} |
| col_datetime| {"name": "fillMissingValues", "type": "datetime", "paras":{"missing_value_type": "null", "replaced_value": 1467770400000}} |
+-------------+-----------------------------------------------------------------------------------------------+The mean of col_double is 10.3 (computed from the five non-null values), and the maximum of col_bigint is 14. Pass fill_missing_values_test_input_model_output as inputParaTableName in a subsequent run to apply these same values to new data.