One-hot encoding

One-hot encoding converts a categorical feature with m possible values into m mutually exclusive binary features, with only one active per row. The output is stored in a sparse key-value (KV) structure.

How it works

For each column you specify, the component maps every distinct value to a unique index and outputs a binary indicator for each mapping. A column with three distinct values produces three binary columns, with exactly one set to 1.0 per row.
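The mapping step described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the component's implementation: the function names `build_mapping` and `one_hot` are hypothetical, and assigning indexes in first-seen order is an assumption made for simplicity.

```python
def build_mapping(values):
    """Assign each distinct value a unique index (first-seen order, by assumption)."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


def one_hot(value, mapping):
    """Return a dense indicator vector with a single 1.0 at the value's index."""
    row = [0.0] * len(mapping)
    row[mapping[value]] = 1.0
    return row


colors = ["red", "green", "red", "blue"]
mapping = build_mapping(colors)           # three distinct values -> three indexes
encoded = [one_hot(c, mapping) for c in colors]
```

Each row of `encoded` has as many entries as there are distinct values, and exactly one of them is 1.0.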

The component supports two modes:

  • Training: Reads the input data, builds a mapping table, and writes both the encoded result and the model table. Only the left input node (Input Table Name) is required.

  • Prediction: Reads a previously built model table from the right input node (Input Model Table) and applies the existing encoding to new data. Values absent from the mapping table are ignored and not encoded.
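The difference between the two modes can be illustrated with a small Python sketch. The `train` and `predict` functions and the sample data are hypothetical, and the sorted index order is an assumption; the only behavior taken from the description above is that prediction reuses an existing mapping and leaves unseen values unencoded.

```python
def train(rows, column):
    """Training mode: build the value -> index mapping (the 'model table')."""
    distinct = sorted({row[column] for row in rows})
    return {value: index for index, value in enumerate(distinct)}


def predict(rows, column, model):
    """Prediction mode: reuse an existing mapping; unseen values get no entry."""
    encoded = []
    for row in rows:
        index = model.get(row[column])
        encoded.append({} if index is None else {index: 1.0})
    return encoded


train_rows = [{"city": "paris"}, {"city": "tokyo"}, {"city": "paris"}]
new_rows = [{"city": "tokyo"}, {"city": "lima"}]   # "lima" was never seen in training

model = train(train_rows, "city")
result = predict(new_rows, "city", model)
```

Here the second new row produces an empty encoding because `lima` is absent from the mapping built during training.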

Configure the component

Method 1: Configure on the pipeline page

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields setting | Select binary column | Required. The columns to encode. Accepts any data type. |
| | Other reserved features | Columns passed through to the output in KV format. Must be DOUBLE type. These columns are not encoded; their indexes start at 0. |
| | Appended columns | Columns copied to the output table without any transformation. |
| Parameters setting | Output table type | KV (sparse, recommended for high-cardinality features) or Table (dense, maximum 1,024 columns). |
| | Delete encoding of last enumeration | When set to true, the last binary column of each feature is dropped so that the encoded columns are linearly independent. |
| | Ignore empty elements | When set to true, NULL values are not encoded. When set to false (default), NULL is treated as a distinct category and assigned its own index. |
| | Lifecycle | Retention period of the output table in days. Default: 7. |
| | Cores | Number of cores for the job. Determined by the system if left blank. |
| | Memory size per node | Memory allocated to each core, in MB. Valid range: 2,048–65,536 MB. Determined by the system if left blank. |

Method 2: Use PAI commands

Call the component from the SQL Script component using the following syntax:

PAI -name one_hot
    -project algo_public
    -DinputTable=one_hot_test
    -DbinaryCols=f0,f1,f2
    -DoutputModelTable=one_hot_model
    -DoutputTable=one_hot_output
    -Dlifecycle=28;

For more information on calling PAI commands, see SQL script.

Parameter reference

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTable | Yes | | Name of the input table. |
| inputTablePartitions | No | All partitions | Partitions to use from the input table during training. |
| binaryCols | Yes | | Comma-separated list of columns to encode. Accepts enumeration features of any data type. |
| reserveCols | No | Empty | Columns passed through in KV format. Must be DOUBLE type. These columns are not encoded and are indexed from 0. |
| appendCols | No | | Columns copied to the output table without transformation. |
| outputTable | Yes | | Name of the output table. Results are stored in KV format by default. |
| inputModelTable | No | Empty | An existing model table to use for prediction. Either inputModelTable or outputModelTable must be a non-empty string. When specified, the table must be a valid, non-empty model table. |
| outputModelTable | No | Empty | Name of the model table to write during training. Either inputModelTable or outputModelTable must be a non-empty string. |
| lifecycle | No | 7 | Retention period of the output table in days. |
| dropLast | No | false | When true, drops the last binary column for each feature to ensure linear independence of the encoded data. |
| outputTableType | No | kv | Output format. kv produces a sparse table suitable for high-cardinality features. table produces a dense table with a maximum of 1,024 columns; an error is raised if the column count exceeds this limit. |
| ignoreNull | No | false | When true, NULL values are skipped during encoding. When false, NULL is treated as a distinct category with its own index. |
| coreNum | No | System determined | Number of cores for the job. |
| memSizePerCore | No | System determined | Memory per core in MB. Valid range: 2,048–65,536 MB. |
Note: When using a trained model for subsequent encodings, the parameters dropLast, ignoreNull, and reserveCols are locked to the values set during training and cannot be changed. To use different values for these parameters, retrain the model.
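Before choosing outputTableType=table, it can help to estimate how many encoded columns a job will produce. The following Python sketch is a rough estimate under stated assumptions, not the component's own check: the helper names are hypothetical, the counting rule (one column per distinct value, plus one for NULL unless ignoreNull is true, minus one per feature when dropLast is true, plus one per reserved column) is inferred from the parameter descriptions, and columns listed in appendCols are left out because the reference does not say whether they count toward the limit.

```python
TABLE_COLUMN_LIMIT = 1024  # dense-table limit stated in the parameter reference


def encoded_column_count(distinct_counts, has_null, reserve_cols=0,
                         drop_last=False, ignore_null=False):
    """Estimate the number of encoded columns (assumed counting rule, see lead-in)."""
    total = reserve_cols
    for name, n_distinct in distinct_counts.items():
        cols = n_distinct                      # one column per distinct non-NULL value
        if has_null.get(name, False) and not ignore_null:
            cols += 1                          # NULL gets its own index
        if drop_last and cols > 0:
            cols -= 1                          # last column of the feature is dropped
        total += cols
    return total


def pick_output_type(column_count):
    """Suggest 'table' only when the estimate fits within the dense limit."""
    return "table" if column_count <= TABLE_COLUMN_LIMIT else "kv"


# Three features with 3, 3, and 2 distinct non-NULL values; two of them contain NULLs.
counts = {"f0": 3, "f2": 3, "f4": 2}
nulls = {"f0": True, "f4": True}
default_total = encoded_column_count(counts, nulls, reserve_cols=1)
dropped_total = encoded_column_count(counts, nulls, reserve_cols=1, drop_last=True)
no_null_total = encoded_column_count(counts, nulls, reserve_cols=1, ignore_null=True)
```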

Example

The following example encodes three columns — f0 (BIGINT), f2 (DATETIME), and f4 (BOOLEAN) — while passing f3 (DOUBLE) through as a reserved feature and appending the original f0, f2, and f4 columns to the output.

Step 1: Sample input data

| f0 (BIGINT) | f1 (STRING) | f2 (DATETIME) | f3 (DOUBLE) | f4 (BOOLEAN) |
| --- | --- | --- | --- | --- |
| 12 | prefix1 | 1970-09-15 12:50:22 | 0.1 | true |
| 12 | prefix3 | 1971-01-22 03:15:33 | 0.4 | false |
| NULL | prefix3 | 1970-01-01 08:00:00 | 0.2 | NULL |
| 3 | NULL | 1970-01-01 08:00:00 | 0.3 | false |
| 34 | NULL | 1970-09-15 12:50:22 | 0.4 | NULL |
| 3 | prefix1 | 1970-09-15 12:50:22 | 0.2 | true |
| 3 | prefix1 | 1970-09-15 12:50:22 | 0.3 | false |
| 3 | prefix3 | 1970-01-01 08:00:00 | 0.2 | true |
| 3 | prefix3 | 1971-01-22 03:15:33 | 0.1 | false |
| NULL | prefix3 | 1970-01-01 08:00:00 | 0.3 | false |

Step 2: PAI command

PAI -project algo_public -name one_hot
    -DinputTable=one_hot
    -DbinaryCols=f0,f2,f4
    -DoutputModelTable=one_hot_model_8
    -DoutputTable=one_hot_in_table_1_output_8
    -DdropLast=false
    -DappendCols=f0,f2,f4
    -DignoreNull=false
    -DoutputTableType=table
    -DreserveCols=f3
    -DcoreNum=4
    -DmemSizePerCore=2048;

Step 3: Output

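The model table maps each distinct value to an index. The first row (with col_name fixed to _reserve_) stores the reserved column mapping. All other rows are the encoding mappings. In this example, DATETIME values appear in col_value as millisecond timestamps (22222222000 corresponds to 1970-09-15 12:50:22 in UTC+8), and BOOLEAN values appear as 0 and 1.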

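| col_name | col_value | mapping |
| --- | --- | --- |
| _reserve_ | f3 | 0 |
| f0 | 12 | 1 |
| f0 | 3 | 2 |
| f0 | 34 | 3 |
| f0 | null | 4 |
| f2 | 0 | 5 |
| f2 | 22222222000 | 6 |
| f2 | 33333333000 | 7 |
| f4 | 0 | 8 |
| f4 | 1 | 9 |
| f4 | null | 10 |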
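The index assignment in this model table can be reproduced with a short Python sketch. It rests on assumptions inferred from this one example rather than from documented behavior: reserved columns take the first indexes, encoded columns follow in binaryCols order, values within a column are sorted as strings, and null comes last. The function `build_model_table` is hypothetical, and the input rows are a five-row subset of the sample data that still covers every distinct value, with f2 given as millisecond timestamps and f4 as 0/1.

```python
def build_model_table(rows, binary_cols, reserve_cols):
    """Build (col_name, col_value, mapping) rows under the assumed ordering rules."""
    table = []
    index = 0
    for name in reserve_cols:                      # reserved columns come first
        table.append(("_reserve_", name, index))
        index += 1
    for col in binary_cols:
        values = {row[col] for row in rows}
        non_null = sorted(str(v) for v in values if v is not None)  # string order
        ordered = non_null + (["null"] if None in values else [])   # null last
        for value in ordered:
            table.append((col, value, index))
            index += 1
    return table


rows = [
    {"f0": 12, "f2": 22222222000, "f4": 1},
    {"f0": 12, "f2": 33333333000, "f4": 0},
    {"f0": None, "f2": 0, "f4": None},
    {"f0": 3, "f2": 0, "f4": 0},
    {"f0": 34, "f2": 22222222000, "f4": None},
]
model_table = build_model_table(rows, ["f0", "f2", "f4"], ["f3"])
```

Under these assumptions the result has 11 rows with indexes 0 through 10, matching the table above.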

Table

The encoded result table (Table format) uses the column naming pattern {field}_{value}_{index}:

| f0 | f2 | f4 | _reserve__f3_0 | f0_12_1 | f0_3_2 | f0_34_3 | f0_null_4 | f2_0_5 | f2_22222222_6 | f2_33333333_7 | f4_0_8 | f4_1_9 | f4_null_10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | 1970-09-15 12:50:22 | true | 0.1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 12 | 1971-01-22 03:15:33 | false | 0.4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| NULL | 1970-01-01 08:00:00 | NULL | 0.2 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1970-01-01 08:00:00 | false | 0.3 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 34 | 1970-09-15 12:50:22 | NULL | 0.4 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1970-09-15 12:50:22 | true | 0.2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1970-09-15 12:50:22 | false | 0.3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 1970-01-01 08:00:00 | true | 0.2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1971-01-22 03:15:33 | false | 0.1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| NULL | 1970-01-01 08:00:00 | false | 0.3 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
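The {field}_{value}_{index} naming pattern and the dense layout can be sketched in Python as follows. This is an illustration only: `column_names` and `dense_row` are hypothetical helpers, and `MODEL` is an excerpt of the model table limited to the reserved column and f0, so it assumes the indexes run contiguously from 0.

```python
# Excerpt of the model table: (col_name, col_value, mapping)
MODEL = [
    ("_reserve_", "f3", 0),
    ("f0", "12", 1),
    ("f0", "3", 2),
    ("f0", "34", 3),
    ("f0", "null", 4),
]


def column_names(model):
    """Build dense column names following the {field}_{value}_{index} pattern."""
    ordered = sorted(model, key=lambda entry: entry[2])
    return [f"{name}_{value}_{index}" for name, value, index in ordered]


def dense_row(model, record):
    """Fill one dense output row; assumes indexes are contiguous from 0."""
    row = [0.0] * len(model)
    for name, value, index in model:
        if name == "_reserve_":
            row[index] = float(record[value])      # reserved column: copy the number
        else:
            raw = record[name]
            key = "null" if raw is None else str(raw)
            if key == value:
                row[index] = 1.0                   # indicator for the matching value
    return row


names = column_names(MODEL)
first = dense_row(MODEL, {"f0": 12, "f3": 0.1})
third = dense_row(MODEL, {"f0": None, "f3": 0.2})
```

The values in `first` and `third` correspond to the first five encoded columns of the first and third rows in the table above.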

KV

The same data in KV format:

| f0 | f2 | f4 | kv |
| --- | --- | --- | --- |
| 12 | 1970-09-15 12:50:22 | true | 0:0.1,1:1,6:1,9:1 |
| 12 | 1971-01-22 03:15:33 | false | 0:0.4,1:1,7:1,8:1 |
| NULL | 1970-01-01 08:00:00 | NULL | 0:0.2,4:1,5:1,10:1 |
| 3 | 1970-01-01 08:00:00 | false | 0:0.3,2:1,5:1,8:1 |
| 34 | 1970-09-15 12:50:22 | NULL | 0:0.4,3:1,6:1,10:1 |
| 3 | 1970-09-15 12:50:22 | true | 0:0.2,2:1,6:1,9:1 |
| 3 | 1970-09-15 12:50:22 | false | 0:0.3,2:1,6:1,8:1 |
| 3 | 1970-01-01 08:00:00 | true | 0:0.2,2:1,5:1,9:1 |
| 3 | 1971-01-22 03:15:33 | false | 0:0.1,2:1,7:1,8:1 |
| NULL | 1970-01-01 08:00:00 | false | 0:0.3,4:1,5:1,8:1 |
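To show how a kv string relates to the model table, here is a minimal Python sketch that serializes one record. The `to_kv` helper and the `LOOKUP` and `RESERVE` structures are hypothetical; the index values are copied from the model table above, f2 is given as a millisecond timestamp and f4 as 0/1, and values missing from the mapping are simply skipped, as described for prediction.

```python
# Index assignments copied from the example model table.
RESERVE = {"f3": 0}
LOOKUP = {
    "f0": {"12": 1, "3": 2, "34": 3, "null": 4},
    "f2": {"0": 5, "22222222000": 6, "33333333000": 7},
    "f4": {"0": 8, "1": 9, "null": 10},
}


def to_kv(record, lookup, reserve):
    """Serialize one record as 'index:value' pairs, sorted by index."""
    entries = {}
    for name, index in reserve.items():
        entries[index] = record[name]              # reserved column keeps its value
    for col, mapping in lookup.items():
        raw = record[col]
        key = "null" if raw is None else str(raw)
        if key in mapping:                         # values not in the model are skipped
            entries[mapping[key]] = 1
    return ",".join(f"{i}:{entries[i]}" for i in sorted(entries))


row1 = to_kv({"f0": 12, "f2": 22222222000, "f3": 0.1, "f4": 1}, LOOKUP, RESERVE)
row3 = to_kv({"f0": None, "f2": 0, "f3": 0.2, "f4": None}, LOOKUP, RESERVE)
unseen = to_kv({"f0": 99, "f2": 0, "f3": 0.5, "f4": 0}, LOOKUP, RESERVE)
```

`row1` and `row3` reproduce the kv strings of the first and third rows above, while `unseen` contains no entry for f0 because 99 is not in the mapping.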

Scalability test

Test data: 200 million samples, 100,000 distinct enumeration values.

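| Core count | Training time | Prediction time | Acceleration ratio (training/prediction, relative to 5 cores) |
| --- | --- | --- | --- |
| 5 | 84 s | 181 s | 1/1 |
| 10 | 60 s | 93 s | 1.4/1.95 |
| 20 | 46 s | 56 s | 1.8/3.23 |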
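The acceleration ratios are the 5-core time divided by the time at each core count, for training and prediction respectively. A small Python sketch of that arithmetic, using the timings from the table (the `acceleration` helper is hypothetical):

```python
# core count -> (training seconds, prediction seconds), from the table above
timings = {5: (84, 181), 10: (60, 93), 20: (46, 56)}


def acceleration(timings, baseline_cores):
    """Return (training, prediction) speedups relative to the baseline core count."""
    base_train, base_predict = timings[baseline_cores]
    return {
        cores: (round(base_train / train, 2), round(base_predict / predict, 2))
        for cores, (train, predict) in timings.items()
    }


ratios = acceleration(timings, baseline_cores=5)
```

Rounded to two decimals, the training speedup at 20 cores is 1.83, which the table reports as 1.8. Prediction scales more closely with core count than training does: going from 5 to 20 cores gives about 3.23x for prediction against about 1.83x for training.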

Usage notes

  • A single encoding column can have up to tens of millions of distinct values.

  • The KV output uses zero-based indexing.

  • When predicting with a trained model, dropLast, ignoreNull, and reserveCols are fixed to the values from training and cannot be overridden. Retrain if you need different values.

  • When encoding new data using the model, if the data contains discrete values not present in the mapping table of the model, these values are ignored and not encoded. To encode these values, the model mapping table must be retrained.

Console workflow examples:

  • Directly use the component for encoding.

  • Train a model with the component, then use the model to encode new data.