PS linear regression

更新时间:
复制 MD 格式

PS Linear Regression combines a linear regression model with a parameter server (PS) architecture to handle large-scale training tasks — up to hundreds of billions of samples and billions of features. The PS architecture distributes both computation and model parameter storage across workers, improving training efficiency and scalability.

Use PS Linear Regression when your dataset is too large for single-machine training or when your feature space is extremely high-dimensional, including sparse formats.

How it works

  1. Load training data from a MaxCompute input table. Dense format supports DOUBLE and BIGINT columns; sparse format uses STRING columns in key-value (KV) format.

  2. Submit a training task. The PS architecture shards model parameters across server nodes and runs gradient updates in parallel across worker nodes.

  3. After training completes, the model is saved as an offline model (default) or written to a MaxCompute table.

  4. Run prediction using the saved model to generate output scores written to a separate output table.

Configure the component

Method 1: Visual interface

Add the PS Linear Regression component to your workflow in Designer, then configure its parameters in the right pane.

Parameter type Parameter Required Description
Fields setting Feature Columns Yes The feature columns from the input data source to use for training.
Label Column Yes The label column. Supports DOUBLE and BIGINT data types.
Is it a sparse format? No Select true if input data is in key-value (KV) format. Default: false.
KV pair separator No The separator between key-value pairs. Default: space. Takes effect only when sparse format is enabled.
Key-value separator No The separator between a key and its value. Default: colon (:). Takes effect only when sparse format is enabled.
Parameters setting L1 weight No The L1 regularization coefficient. A larger value produces a sparser model with fewer non-zero elements. Valid values: non-negative floating-point numbers. Default: 1.0.
L2 weight No The L2 regularization coefficient. A larger value indicates that the absolute values of the model parameters are smaller. Valid values: non-negative floating-point numbers. Default: 0.
Maximum iterations No The maximum number of training iterations. Set to 0 for no limit. Valid values: non-negative integers. Default: 100.
Minimum convergence deviance No The stopping threshold for the optimization algorithm. Training stops when the improvement falls below this value. Valid values: [0, 1]. Default: 0.000001.
Maximum feature ID No The maximum feature ID or feature dimension. Can be set higher than the actual value. If left blank, the system runs an SQL task to calculate it automatically. Valid values: non-negative integers.
Execution tuning Number of cores No The number of CPU cores to allocate. Default: system-allocated.
Memory size per core No The memory allocated per core, in MB. Default: system-allocated.
L1 regularization drives sparsity — useful when many features are irrelevant and you want a sparse model. L2 regularization shrinks all coefficients proportionally — useful when features are correlated. If you are unsure which to use, start with L2 (set L1 weight to 0) and tune from there.

Method 2: PAI command

Run PAI commands through the SQL Script component.

Training command:

PAI -name ps_linearregression
    -project algo_public
    -DinputTableName="lm_test_input"
    -DmodelName="linear_regression_model"
    -DlabelColName="label"
    -DfeatureColNames="features"
    -Dl1Weight=1.0
    -Dl2Weight=0.0
    -DmaxIter=100
    -Depsilon=1e-6
    -DenableSparse=true

Prediction command:

drop table if exists logistic_regression_predict;
PAI -name prediction
    -DmodelName="linear_regression_model"
    -DoutputTableName="linear_regression_predict"
    -DinputTableName="lm_test_input"
    -DappendColNames="label,features"
    -DfeatureColNames="features"
    -DenableSparse=true

Parameters:

Parameter Required Default value Description
inputTableName Yes The name of the input table.
modelName Yes The name of the output model.
labelColName Yes The label column in the input table. Supports DOUBLE and BIGINT data types.
featureColNames Yes The feature columns in the input table. Dense format: DOUBLE and BIGINT. Sparse format: STRING.
outputTableName No The output table for model evaluation results. Required when enableFitGoodness is true.
inputTablePartitions No The partitions of the input table to use for training.
enableSparse No false Whether input data is in sparse (KV) format. Valid values: true, false.
itemDelimiter No Space The separator between key-value pairs. Takes effect only when enableSparse is true.
kvDelimiter No : The separator between a key and its value. Takes effect only when enableSparse is true.
enableModelIo No true Whether to save the model as an offline model. If false, the model is written to a MaxCompute table. Valid values: true, false.
maxIter No 100 The maximum number of training iterations. Valid values: non-negative integers.
epsilon No 0.000001 The convergence threshold for the optimization algorithm. Training stops when improvement falls below this value. Valid values: [0, 1].
l1Weight No 1.0 The L1 regularization coefficient. A larger value produces a sparser model. Valid values: non-negative floating-point numbers.
l2Weight No 0 The L2 regularization coefficient. A larger value indicates that the absolute values of the model parameters are smaller. Valid values: non-negative floating-point numbers.
modelSize No 0 The maximum feature ID or feature dimension. Can be greater than the actual value. If set to 0, the system calculates it automatically. Valid values: non-negative integers.
coreNum No System-allocated The number of CPU cores per worker.
memSizePerCore No System-allocated The memory per core, in MB.

Example

This example trains a PS Linear Regression model on sparse KV-format data and runs prediction.

Step 1: Create input data

Run the following SQL statement in the SQL Script component to generate the input table.

drop table if exists lm_test_input;
create table lm_test_input as
select
*
from
(
select cast(2 as BIGINT) as label, '1:0.55 2:-0.15 3:0.82 4:-0.99 5:0.17' as features
    union all
select cast(1 as BIGINT) as label, '1:-1.26 2:1.36 3:-0.13 4:-2.82 5:-0.41' as features
    union all
select cast(1 as BIGINT) as label, '1:-0.77 2:0.91 3:-0.23 4:-4.46 5:0.91' as features
    union all
select cast(2 as BIGINT) as label, '1:0.86 2:-0.22 3:-0.46 4:0.08 5:-0.60' as features
    union all
select cast(1 as BIGINT) as label, '1:-0.76 2:0.89 3:1.02 4:-0.78 5:-0.86' as features
    union all
select cast(1 as BIGINT) as label, '1:2.22 2:-0.46 3:0.49 4:0.31 5:-1.84' as features
    union all
select cast(0 as BIGINT) as label, '1:-1.21 2:0.09 3:0.23 4:2.04 5:0.30' as features
    union all
select cast(1 as BIGINT) as label, '1:2.17 2:-0.45 3:-1.22 4:-0.48 5:-1.41' as features
    union all
select cast(0 as BIGINT) as label, '1:-0.40 2:0.63 3:0.56 4:0.74 5:-1.44' as features
    union all
select cast(1 as BIGINT) as label, '1:0.17 2:0.49 3:-1.50 4:-2.20 5:-0.35' as features
) tmp;

The table has two columns: label (BIGINT) and features (STRING in KV format). Each features value encodes feature IDs and values separated by colons, with pairs separated by spaces — for example, 1:0.55 2:-0.15.

Generated input data
In KV format, feature IDs must be positive integers and feature values must be real numbers. If feature IDs are strings, serialize the data first. If feature values are categorical strings, apply feature discretization before training.

Step 2: Build the workflow

Build a workflow in Designer using the components shown below. For details on workflow creation, see Algorithm modeling.

PS Linear Regression workflow

Step 3: Configure the components

Read Table-1:

On the Select Table tab, set Table Name to lm_test_input.

PS Linear Regression:

Use the settings below. Leave all other parameters at their defaults.

Parameter type Parameter Value
Fields setting Is it a sparse format? true
Feature Columns features
Label Column label
Execution tuning Number of cores 3
Memory size per core 1024 MB

Prediction:

Use the settings below. Leave all other parameters at their defaults.

Parameter type Parameter Value
Fields setting Feature Columns features
Verbatim Output Column label, features
Sparse Matrix Selected
Key-value separator : (colon)
KV pair separator (leave blank to use space)

Step 4: Run the workflow

Click the Run button image on the canvas to start the workflow.

Step 5: View prediction results

After the workflow completes, right-click the Prediction-1 component and choose View Data > Prediction Result Output.

PS Linear Regression prediction results

Run the same example with PAI commands

The following commands reproduce the workflow above using the PAI command interface. All commands use the same lm_test_input table.

Train:

PAI -name ps_linearregression
    -project algo_public
    -DinputTableName="lm_test_input"
    -DmodelName="linear_regression_model"
    -DlabelColName="label"
    -DfeatureColNames="features"
    -Dl1Weight=1.0
    -Dl2Weight=0.0
    -DmaxIter=100
    -Depsilon=1e-6
    -DenableSparse=true

Predict:

drop table if exists logistic_regression_predict;
PAI -name prediction
    -DmodelName="linear_regression_model"
    -DoutputTableName="linear_regression_predict"
    -DinputTableName="lm_test_input"
    -DappendColNames="label,features"
    -DfeatureColNames="features"
    -DenableSparse=true

The prediction output table (linear_regression_predict) contains the original label and features columns appended with the predicted scores.