The Split component in Machine Learning Designer of Platform for AI (PAI) divides a dataset into two output tables. Use it to create training and test sets before model training.
Choose your splitting method based on your data:
| Method | When to use |
|---|---|
| Split by ratio | Random, proportion-based splits — suitable for most datasets |
| Split by threshold | Ordered splits based on a numeric column value, such as time-series data where all rows before a cutoff date must go to one table |
If both ratio and threshold parameters are configured, the threshold-based method takes precedence.
Configure the component
Method 1: Use the PAI console
On the pipeline details page, find the Split component in the left-side component list, drag it to the canvas, and connect it to an upstream node. Click the component to open its configuration panel.
Split by ratio
Configure the following parameters under the Parameters Setting tab:
| Parameter | Description | Required |
|---|---|---|
| Splitting Method | Select Split by Ratio. | Yes |
| Splitting Fraction | Proportion of data in Output Table 1 relative to the full dataset. Valid values: (0, 1). For example, set 0.8 to put 80% of rows in Output Table 1 and 20% in Output Table 2. | Yes |
| Random Seed | Fixes the random generator state so that re-running the pipeline with the same seed produces the same split. If left blank, the system generates a seed automatically. | No |
| ID Column (available under Advanced Options) | Keeps all rows sharing the same ID value together in the same output table. Use this when your dataset has multiple rows per entity — for example, multiple transactions per customer — to prevent the same entity from appearing in both tables (data leakage). Only one column is selectable. | No |
Split by threshold
| Parameter | Description | Required |
|---|---|---|
| Splitting Method | Select Split by Threshold. | Yes |
| Threshold Column | The numeric column used for splitting. STRING columns are not supported. | Yes |
| Threshold | Rows where the column value is less than the threshold go to Output Table 1. Rows where the value is greater than or equal to the threshold go to Output Table 2. For example, with a threshold of 100: a row with value 80 goes to Output Table 1, and a row with value 100 or 120 goes to Output Table 2. | Yes |
Tuning parameters
| Parameter | Description |
|---|---|
| Cores | Number of cores for the job. Allocated automatically based on input data size. |
| Memory size per core | Memory per core in MB. Allocated automatically based on input data size. |
Method 2: Run PAI commands
On the pipeline details page, drag the SQL Script component to the canvas and click it to open the configuration panel. Under Parameters Setting, clear the option labeled "Whether the system adds a create table statement", then paste the command below into the SQL Script editor and run it.
The following example splits the wbpc table by ratio, placing 25% of the data in wpbc_split1 and the remaining 75% in wpbc_split2:
```sql
PAI -name split -project algo_public
-DinputTableName=wbpc
-Doutput1TableName=wpbc_split1
-Doutput2TableName=wpbc_split2
-Dfraction=0.25;
```
For more information on the SQL Script component, see SQL Script.
Ratio parameters (fraction, randomSeed, idColName) and threshold parameters (thresholdColName, threshold) cannot be used in the same command.
Common parameters
| Parameter | Required | Description | Default |
|---|---|---|---|
| inputTableName | Yes | Name of the input table. | None |
| inputTablePartitions | No | Partitions to read from the input table. Formats: partition_name=value for single-level partitions; name1=value1/name2=value2 for multi-level partitions. Separate multiple partitions with commas. | All partitions |
| output1TableName | Yes | Name of Output Table 1. | None |
| output1TablePartition | No | Partition name for Output Table 1. | Non-partitioned |
| output2TableName | Yes | Name of Output Table 2. | None |
| output2TablePartition | No | Partition name for Output Table 2. | Non-partitioned |
| lifecycle | No | Lifecycle of the output tables, in days. Valid values: [1, 3650]. | None |
| coreNum | No | Number of cores. Tuning parameter; auto-allocated based on input data size. | Auto |
| memSizePerCore | No | Memory per core in MB. Tuning parameter; auto-allocated based on input data size. Valid values: (1, 65536). | Auto |
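As an illustration of the optional common parameters, the ratio split could be restricted to a single input partition and given a lifecycle for both output tables. This is a sketch rather than a verified command: the partition dt=20240101 is hypothetical, and wrapping the partition value in double quotes is an assumption about quoting that this page does not specify.

```sql
-- Sketch only. Assumptions: the wbpc table has a partition column named dt,
-- and the partition value may be wrapped in double quotes.
PAI -name split -project algo_public
-DinputTableName=wbpc
-DinputTablePartitions="dt=20240101"
-Doutput1TableName=wpbc_split1
-Doutput2TableName=wpbc_split2
-Dfraction=0.25
-Dlifecycle=30;
```

Here lifecycle=30 falls inside the documented range [1, 3650], and inputTablePartitions follows the single-level partition_name=value format described in the table.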
Split by ratio parameters
| Parameter | Required | Description | Default |
|---|---|---|---|
| fraction | Yes | Proportion of data in Output Table 1. Valid values: (0, 1). | None |
| randomSeed | No | Random seed; must be a positive integer. | Auto |
| idColName | No | ID column name. Keeps all rows with the same ID in the same output table, preventing data leakage between the training and test sets. Only one column is selectable. | None |
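A ratio split that combines all three ratio parameters might look like the following sketch. The transactions table, its customer_id column, and the output table names are hypothetical; only the parameter names are taken from the table above.

```sql
-- Sketch only. Hypothetical names: transactions, customer_id,
-- transactions_train, transactions_test.
PAI -name split -project algo_public
-DinputTableName=transactions
-Doutput1TableName=transactions_train
-Doutput2TableName=transactions_test
-Dfraction=0.8
-DrandomSeed=42
-DidColName=customer_id;
```

With a fixed randomSeed, re-running the command is expected to reproduce the same split, and idColName keeps every row for a given customer_id in a single output table.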
Split by threshold parameters
| Parameter | Required | Description | Default |
|---|---|---|---|
| thresholdColName | Yes | Threshold column name. STRING columns are not supported. | None |
| threshold | Yes | Split threshold. Rows with values less than the threshold go to Output Table 1; rows with values greater than or equal to the threshold go to Output Table 2. | None |
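For comparison with the ratio-based command shown earlier, a threshold-based split might be written as in the sketch below. The orders table, its numeric order_amount column, and the output table names are hypothetical; the parameter names come from the table above.

```sql
-- Sketch only. Hypothetical names: orders, order_amount,
-- orders_below_100, orders_100_and_above.
PAI -name split -project algo_public
-DinputTableName=orders
-Doutput1TableName=orders_below_100
-Doutput2TableName=orders_100_and_above
-DthresholdColName=order_amount
-Dthreshold=100;
```

Under the rule described above, rows with order_amount less than 100 would go to orders_below_100, and rows with order_amount equal to or greater than 100 would go to orders_100_and_above. The command deliberately omits fraction, randomSeed, and idColName, because ratio and threshold parameters cannot be combined in one command.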