Split

The Split component in Machine Learning Designer of Platform for AI (PAI) divides a dataset into two output tables. Use it to create training and test sets before model training.

Choose your splitting method based on your data:

| Method | When to use |
| --- | --- |
| Split by ratio | Random, proportion-based splits; suitable for most datasets |
| Split by threshold | Ordered splits based on a numeric column value, such as time-series data where all rows before a cutoff date must go to one table |

If both ratio and threshold parameters are configured, the threshold-based method takes precedence.

Configure the component

Method 1: Use the PAI console

On the pipeline details page, find the Split component in the left-side component list, drag it to the canvas, and connect it to an upstream node. Click the component to open its configuration panel.

Split by ratio

Configure the following parameters under the Parameters Setting tab:

| Parameter | Description | Required |
| --- | --- | --- |
| Splitting Method | Select Split by Ratio. | Yes |
| Splitting Fraction | Proportion of data in Output Table 1 relative to the full dataset. Valid values: (0, 1). For example, set it to 0.8 to put 80% of rows in Output Table 1 and 20% in Output Table 2. | Yes |
| Random Seed | Fixes the random generator state so that re-running the pipeline with the same seed produces the same split. If left blank, the system generates a seed automatically. | No |
| ID Column (available under Advanced Options) | Keeps all rows sharing the same ID value together in the same output table. Use this when your dataset has multiple rows per entity (for example, multiple transactions per customer) to prevent the same entity from appearing in both tables, which would cause data leakage. Only one column can be selected. | No |
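To illustrate how Splitting Fraction and Random Seed interact, here is a minimal Python sketch of a seeded ratio split. It is not PAI's implementation: the function name `split_by_ratio` and the shuffle-then-cut approach are assumptions made for illustration, and the component may sample rows differently.

```python
import random


def split_by_ratio(rows, fraction, seed=None):
    """Randomly split rows into two lists; about `fraction` of them go to the first.

    Hypothetical helper: illustrates the parameter semantics only.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be in the open interval (0, 1)")
    rng = random.Random(seed)  # same seed -> same shuffle -> same split
    shuffled = list(rows)      # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))
train, test = split_by_ratio(rows, 0.8, seed=42)
print(len(train), len(test))  # 80 20
```

Calling the function again with the same seed returns the same two lists, which is the reproducibility that Random Seed is meant to provide.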

Split by threshold

| Parameter | Description | Required |
| --- | --- | --- |
| Splitting Method | Select Split by Threshold. | Yes |
| Threshold Column | The numeric column used for splitting. STRING columns are not supported. | Yes |
| Threshold | Rows where the column value is less than the threshold go to Output Table 1. Rows where the value is greater than or equal to the threshold go to Output Table 2. For example, with a threshold of 100: a row with value 80 goes to Output Table 1, and a row with value 100 or 120 goes to Output Table 2. | Yes |
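The threshold rule can be sketched in a few lines of Python. This mirrors the example values above (80, 100, 120 with a threshold of 100); the function name `split_by_threshold` and the column name `amount` are hypothetical, and the sketch only illustrates the comparison, not how the component runs.

```python
def split_by_threshold(rows, column, threshold):
    """Send rows with row[column] < threshold to the first list, all others to the second."""
    below, at_or_above = [], []
    for row in rows:
        if row[column] < threshold:
            below.append(row)        # Output Table 1: strictly less than the threshold
        else:
            at_or_above.append(row)  # Output Table 2: greater than or equal to the threshold
    return below, at_or_above


rows = [{"amount": 80}, {"amount": 100}, {"amount": 120}]
table1, table2 = split_by_threshold(rows, "amount", 100)
print(table1)  # [{'amount': 80}]
print(table2)  # [{'amount': 100}, {'amount': 120}]
```

Note that a value exactly equal to the threshold lands in the second table.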

Tuning parameters

| Parameter | Description |
| --- | --- |
| Cores | Number of cores for the job. Allocated automatically based on input data size. |
| Memory size per core | Memory per core in MB. Allocated automatically based on input data size. |

Method 2: Run PAI commands

On the pipeline details page, drag the SQL Script component to the canvas and click it to open the configuration panel. Under Parameters Setting, clear the Whether the system adds a create table statement option, paste the following command into the SQL Script editor, and run it.

The following example splits the wbpc table by ratio, placing 25% of the data in wpbc_split1 and the remaining 75% in wpbc_split2:

PAI -name split -project algo_public
    -DinputTableName=wbpc
    -Doutput1TableName=wpbc_split1
    -Doutput2TableName=wpbc_split2
    -Dfraction=0.25;

For more information on the SQL Script component, see SQL Script.

Ratio parameters (fraction, randomSeed, idColName) and threshold parameters (thresholdColName, threshold) cannot be used in the same command.
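If you generate PAI commands from a script, the rule above can be checked before the command is submitted. The following Python sketch assembles the command text and rejects mixed parameter groups. The helper `build_split_command` is hypothetical, not part of any PAI SDK; it only produces a string in the format shown earlier.

```python
RATIO_PARAMS = {"fraction", "randomSeed", "idColName"}
THRESHOLD_PARAMS = {"thresholdColName", "threshold"}


def build_split_command(input_table, output1, output2, **params):
    """Return the text of a PAI split command (hypothetical helper, string building only)."""
    used_ratio = RATIO_PARAMS.intersection(params)
    used_threshold = THRESHOLD_PARAMS.intersection(params)
    if used_ratio and used_threshold:
        raise ValueError("ratio and threshold parameters cannot be combined")
    if used_threshold and used_threshold != THRESHOLD_PARAMS:
        raise ValueError("threshold splits need both thresholdColName and threshold")
    if not used_threshold and "fraction" not in params:
        raise ValueError("ratio splits need fraction")
    args = {
        "inputTableName": input_table,
        "output1TableName": output1,
        "output2TableName": output2,
        **params,
    }
    lines = ["PAI -name split -project algo_public"]
    lines += [f"    -D{name}={value}" for name, value in args.items()]
    return "\n".join(lines) + ";"


command = build_split_command("wbpc", "wpbc_split1", "wpbc_split2", fraction=0.25)
print(command)
```

The resulting text can be pasted into the SQL Script editor like the hand-written command.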

Common parameters

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| inputTableName | Yes | Name of the input table. | |
| inputTablePartitions | No | Partitions to read from the input table. Formats: partition_name=value for single-level partitions; name1=value1/name2=value2 for multi-level partitions. Separate multiple partitions with commas. | All partitions |
| output1TableName | Yes | Name of Output Table 1. | |
| output1TablePartition | No | Partition name for Output Table 1. | Non-partitioned |
| output2TableName | Yes | Name of Output Table 2. | |
| output2TablePartition | No | Partition name for Output Table 2. | Non-partitioned |
| lifecycle | No | Lifecycle of the output tables, in days. Valid values: [1, 3650]. | |
| coreNum | No | Number of cores. Tuning parameter; auto-allocated based on input data size. | Auto |
| memSizePerCore | No | Memory per core in MB. Tuning parameter; auto-allocated based on input data size. Valid values: (1, 65536). | Auto |
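The inputTablePartitions format can be easier to read with a concrete value. The Python sketch below parses such a value into one dictionary per partition; the function `parse_partitions` and the partition names `ds` and `hr` are hypothetical and serve only to show how slashes separate levels and commas separate partitions.

```python
def parse_partitions(spec):
    """Parse 'name1=value1/name2=value2,...' into a list of {name: value} dicts."""
    partitions = []
    for part in spec.split(","):              # commas separate partitions
        levels = {}
        for level in part.strip().split("/"):  # slashes separate partition levels
            name, sep, value = level.partition("=")
            if not sep or not name.strip() or not value.strip():
                raise ValueError(f"malformed partition level: {level!r}")
            levels[name.strip()] = value.strip()
        partitions.append(levels)
    return partitions


print(parse_partitions("ds=20240101/hr=01,ds=20240102/hr=02"))
# [{'ds': '20240101', 'hr': '01'}, {'ds': '20240102', 'hr': '02'}]
```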

Split by ratio parameters

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| fraction | Yes | Proportion of data in Output Table 1. Valid values: (0, 1). | |
| randomSeed | No | Random seed; must be a positive integer. | Auto |
| idColName | No | ID column name. Keeps all rows with the same ID in the same output table, preventing data leakage between the training and test sets. Only one column can be selected. | |
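To show what idColName guards against, here is a minimal Python sketch of an ID-grouped split. It is an assumption about the general technique, not PAI's implementation: the function `split_by_ratio_grouped` is hypothetical, and it applies the fraction to distinct IDs rather than rows, so the row proportions are only approximate.

```python
import random


def split_by_ratio_grouped(rows, id_column, fraction, seed=None):
    """Split rows so that every row sharing an ID lands in the same output list."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be in the open interval (0, 1)")
    ids = sorted({row[id_column] for row in rows})  # sorted so the seed fully fixes the result
    rng = random.Random(seed)
    rng.shuffle(ids)
    cut = round(len(ids) * fraction)  # the fraction is applied to IDs, not to rows
    first_ids = set(ids[:cut])
    table1 = [row for row in rows if row[id_column] in first_ids]
    table2 = [row for row in rows if row[id_column] not in first_ids]
    return table1, table2


# Hypothetical data: several transactions per customer.
transactions = [
    {"customer": c, "amount": a}
    for c, a in [("c1", 10), ("c1", 20), ("c2", 5), ("c3", 7), ("c3", 9), ("c4", 1), ("c5", 3)]
]
train, test = split_by_ratio_grouped(transactions, "customer", 0.6, seed=7)
overlap = {r["customer"] for r in train} & {r["customer"] for r in test}
print(overlap)  # set()
```

Because whole customers are assigned to one side, no customer appears in both the training and test sets.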

Split by threshold parameters

| Parameter | Required | Description | Default |
| --- | --- | --- | --- |
| thresholdColName | Yes | Threshold column name. STRING columns are not supported. | |
| threshold | Yes | Split threshold. Rows with values less than the threshold go to Output Table 1; rows with values greater than or equal to the threshold go to Output Table 2. | |