Linear regression

The Linear Regression component trains a model that predicts a continuous numeric output from one or more input features. Use it when your target variable (such as sales revenue, temperature, or price) has a roughly linear relationship with the input features. For sparse, high-dimensional data—where most feature values are zero—enable sparse format input to represent data efficiently in KV format.

Configure the component

Method 1: Use the UI

Add the Linear Regression component to the workflow canvas in Designer, then configure its parameters in the right pane.

Fields setting

Parameter	Description	Default
Select feature columns	Feature columns from the input data source to use for training.	—
Select label column	The dependent variable. DOUBLE and BIGINT types are supported.	—
Is sparse format	Represent input data in KV format. Enable this for sparse, high-dimensional data where most feature values are zero.	Off
Separator between key-value pairs	Separator between key-value pairs when sparse format is enabled.	Comma (,)
Separator between key and value	Separator between a key and its value when sparse format is enabled.	Colon (:)
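
To make the sparse format concrete, the following minimal Python sketch expands a KV string into a dense feature vector. It assumes the default separators listed above (a comma between pairs, a colon between key and value) and zero-based integer keys. `parse_kv` and its `num_features` argument are hypothetical names used only for illustration; they are not part of PAI.

```python
def parse_kv(row, num_features, item_delimiter=",", kv_delimiter=":"):
    """Expand a sparse KV string such as "0:1.84,1:1" into a dense list."""
    dense = [0.0] * num_features  # features absent from the string stay zero
    if not row.strip():
        return dense
    for pair in row.split(item_delimiter):
        key, value = pair.split(kv_delimiter)
        dense[int(key)] = float(value)
    return dense


print(parse_kv("0:1.84,1:1", 2))  # [1.84, 1.0]
print(parse_kv("0:2.13", 2))      # [2.13, 0.0]
```

Only the non-zero entries are stored in the string, which is why the format is compact for high-dimensional data.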

Parameters setting

Parameter	Description	Default
Maximum iterations	Maximum number of iterations for the algorithm.	100
Minimum likelihood error	Log-likelihood convergence threshold. The algorithm stops when the difference in log-likelihood between consecutive iterations falls below this value.	0.000001
Regularization type	Regularization method to reduce overfitting. Options: L1, L2, None. Use L1 to produce sparse coefficients (useful when many features are irrelevant); use L2 to penalize large coefficients (useful when features are correlated). Set to None to disable regularization.	None
Regularization coefficient	Strength of the regularization penalty. Higher values apply stronger regularization. Ignored when Regularization type is set to None.	1
Generate model evaluation table	Generates a model evaluation table containing: R-Squared, Adjusted R-Squared, AIC, degrees of freedom, standard deviation of residuals, and deviance.	Off
Regression coefficient evaluation	Adds per-coefficient statistics—T-value, P-value, and confidence interval [2.5%, 97.5%]—to the evaluation table. Available only when Generate model evaluation table is selected.	Off
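
The per-coefficient statistics are related to each other, which the short Python sketch below illustrates. This page does not state how PAI derives the confidence interval. The figures in the evaluation table in the Example section are consistent with a normal-approximation interval (coefficient plus or minus 1.96 times the standard error, with the standard error recovered as coefficient divided by T-value). Both that formula and the helper name `confidence_interval` are assumptions made for illustration.

```python
def confidence_interval(coefficient, t_value, z=1.96):
    """Return (standard_error, lower, upper) for a normal-approximation interval."""
    standard_error = abs(coefficient / t_value)
    half_width = z * standard_error
    return standard_error, coefficient - half_width, coefficient + half_width


# x1 row of the example evaluation table: "value" and "tscore" columns.
se, lower, upper = confidence_interval(10.260063429838898, 23.270944360826963)
print(round(lower, 6), round(upper, 6))
```

For the x1 row, the computed bounds agree with the interval reported in the example table. An interval that spans zero, as for x2 in that table, indicates the coefficient cannot be distinguished from zero at this level.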

Execution tuning

Parameter	Description	Default
Number of computing cores	Number of cores to use.	System allocated
Memory size per core	Memory per core.	System allocated

Method 2: Use PAI commands

Use PAI commands to configure the Linear Regression component. Run them through the SQL Script component. For more information, see SQL Script.

PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input
    -DfeatureColNames=x1,x2
    -DlabelColName=y
    -DmodelName=lm_test_input_model_out;

Parameter	Required	Default	Description
inputTableName	Yes	—	Name of the input table.
modelName	Yes	—	Name of the output model.
outputTableName	No	—	Name of the output model evaluation table. Required when `enableFitGoodness` is `true`.
labelColName	Yes	—	Dependent variable column. DOUBLE and BIGINT types are supported. Only one column can be specified.
featureColNames	Yes	—	Independent variable columns. For dense input, DOUBLE and BIGINT types are supported. For sparse input, the STRING type is supported.
inputTablePartitions	No	—	Partitions of the input table to read.
enableSparse	No	false	Whether the input data is in sparse format. Valid values: `true`, `false`.
itemDelimiter	No	`,`	Separator between key-value pairs. Used when `enableSparse` is `true`.
kvDelimiter	No	`:`	Separator between a key and its value. Used when `enableSparse` is `true`.
maxIter	No	100	Maximum number of iterations.
epsilon	No	0.000001	Convergence threshold. The algorithm stops when the log-likelihood difference between consecutive iterations is less than this value.
regularizedType	No	None	Regularization method. Valid values: `l1`, `l2`, `None`. Use `l1` for sparse feature selection; use `l2` when features are correlated and you want to penalize large coefficients.
regularizedLevel	No	1	Regularization coefficient. Not used when `regularizedType` is `None`.
enableFitGoodness	No	false	Whether to generate a model evaluation table. When `true`, the output table includes: R-Squared, Adjusted R-Squared, AIC, degrees of freedom, standard deviation of residuals, and deviance. Valid values: `true`, `false`.
enableCoefficientEstimate	No	false	Whether to include per-coefficient statistics in the evaluation table. When `true`, adds T-value, P-value, and confidence interval [2.5%, 97.5%] for each coefficient. Requires `enableFitGoodness` to be `true`. Valid values: `true`, `false`.
lifecycle	No	-1	Lifecycle of the output model evaluation table, in days.
coreNum	No	System allocated	Number of computing cores.
memSizePerCore	No	System allocated	Memory per core.
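
Because some of these parameters depend on each other, it can help to assemble the command text programmatically and check the dependencies before you submit it. The Python sketch below is one way to do that. `build_linear_regression_command` is a hypothetical helper, not a PAI SDK function. It only formats the command text documented above and does not submit anything.

```python
REQUIRED = ("inputTableName", "modelName", "labelColName", "featureColNames")


def build_linear_regression_command(params):
    """Format a PAI linearregression command after checking parameter dependencies.

    `params` maps parameter names to string values, for example "true" or "x1,x2".
    """
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    fit_goodness = str(params.get("enableFitGoodness", "false")).lower() == "true"
    coef_estimate = str(params.get("enableCoefficientEstimate", "false")).lower() == "true"
    if fit_goodness and "outputTableName" not in params:
        raise ValueError("outputTableName is required when enableFitGoodness is true")
    if coef_estimate and not fit_goodness:
        raise ValueError("enableCoefficientEstimate requires enableFitGoodness to be true")

    lines = ["PAI -name linearregression", "    -project algo_public"]
    lines += ["    -D{}={}".format(key, value) for key, value in params.items()]
    return "\n".join(lines) + ";"


print(build_linear_regression_command({
    "inputTableName": "lm_test_input",
    "labelColName": "y",
    "featureColNames": "x1,x2",
    "modelName": "lm_test_input_model_out",
    "outputTableName": "lm_test_input_conf_out",
    "enableFitGoodness": "true",
}))
```

The two checks mirror the rules in the table: `outputTableName` is required when `enableFitGoodness` is `true`, and `enableCoefficientEstimate` requires `enableFitGoodness`.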

Example

This example walks through the full workflow: create training data, train a linear regression model, run predictions, and inspect both output tables.

Step 1: Create the training data

DROP TABLE IF EXISTS lm_test_input;
CREATE TABLE lm_test_input AS
SELECT * FROM
(
  SELECT 10 AS y, 1.84 AS x1, 1 AS x2, '0:1.84 1:1' AS sparsecol1
    UNION ALL
  SELECT 20 AS y, 2.13 AS x1, 0 AS x2, '0:2.13' AS sparsecol1
    UNION ALL
  SELECT 30 AS y, 3.89 AS x1, 0 AS x2, '0:3.89' AS sparsecol1
    UNION ALL
  SELECT 40 AS y, 4.19 AS x1, 0 AS x2, '0:4.19' AS sparsecol1
    UNION ALL
  SELECT 50 AS y, 5.76 AS x1, 0 AS x2, '0:5.76' AS sparsecol1
    UNION ALL
  SELECT 60 AS y, 6.68 AS x1, 2 AS x2, '0:6.68 1:2' AS sparsecol1
    UNION ALL
  SELECT 70 AS y, 7.58 AS x1, 0 AS x2, '0:7.58' AS sparsecol1
    UNION ALL
  SELECT 80 AS y, 8.01 AS x1, 0 AS x2, '0:8.01' AS sparsecol1
    UNION ALL
  SELECT 90 AS y, 9.02 AS x1, 3 AS x2, '0:9.02 1:3' AS sparsecol1
    UNION ALL
  SELECT 100 AS y, 10.56 AS x1, 0 AS x2, '0:10.56' AS sparsecol1
) tmp;

Step 2: Train the model

The following command trains the model on x1 and x2 with model evaluation and coefficient estimation enabled. The model is saved as lm_test_input_model_out, and the evaluation table is written to lm_test_input_conf_out.

PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input
    -DlabelColName=y
    -DfeatureColNames=x1,x2
    -DmodelName=lm_test_input_model_out
    -DoutputTableName=lm_test_input_conf_out
    -DenableCoefficientEstimate=true
    -DenableFitGoodness=true
    -Dlifecycle=1;
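
As an independent sanity check on the trained coefficients, you can fit the same data with closed-form ordinary least squares. The Python sketch below does this with the normal equations in pure Python. This is a swapped-in technique: PAI trains iteratively (see maxIter and epsilon) and its internal algorithm is not described on this page, so small differences from the reported coefficients are expected. The function names `solve` and `fit_ols` are hypothetical.

```python
# Rows of lm_test_input from Step 1: (y, x1, x2).
DATA = [
    (10, 1.84, 1), (20, 2.13, 0), (30, 3.89, 0), (40, 4.19, 0), (50, 5.76, 0),
    (60, 6.68, 2), (70, 7.58, 0), (80, 8.01, 0), (90, 9.02, 3), (100, 10.56, 0),
]


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(data):
    """Return [intercept, b1, b2] minimizing the sum of squared residuals."""
    rows = [(1.0, x1, x2) for _, x1, x2 in data]  # leading 1.0 is the intercept term
    ys = [y for y, _, _ in data]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(xtx, xty)


intercept, b1, b2 = fit_ols(DATA)
print(intercept, b1, b2)
```

The printed values should land very close to the Intercept, x1, and x2 rows of the evaluation table in Step 4.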

Step 3: Run predictions

Pass the trained model and the original input table to the prediction component. Setting appendColNames to y copies the label column into the output table, so each row shows the actual value next to the predicted value.

PAI -name prediction
    -project algo_public
    -DmodelName=lm_test_input_model_out
    -DinputTableName=lm_test_input
    -DoutputTableName=lm_test_input_predict_out
    -DappendColNames=y;

Step 4: View the model evaluation table

Query lm_test_input_conf_out to check goodness-of-fit metrics and regression coefficients.

+----------------------+---------------------+---------------------+--------+--------------------------------------------+-------------+
| colname              | value               | tscore              | pvalue | confidenceinterval                         | p           |
+----------------------+---------------------+---------------------+--------+--------------------------------------------+-------------+
| Intercept            | -6.42378496687763   | -2.2725755951390028 | 0.06   | {"2.5%": -11.964027, "97.5%": -0.883543}  | coefficient |
| x1                   | 10.260063429838898  | 23.270944360826963  | 0.0    | {"2.5%": 9.395908, "97.5%": 11.124219}    | coefficient |
| x2                   | 0.35374498323846265 | 0.2949247320997519  | 0.81   | {"2.5%": -1.997160, "97.5%": 2.704650}    | coefficient |
| rsquared             | 0.9879675667384592  | NULL                | NULL   | NULL                                       | goodness    |
| adjusted_rsquared    | 0.9845297286637332  | NULL                | NULL   | NULL                                       | goodness    |
| aic                  | 59.331109494251805  | NULL                | NULL   | NULL                                       | goodness    |
| degree_of_freedom    | 7.0                 | NULL                | NULL   | NULL                                       | goodness    |
| standardErr_residual | 3.765777749448906   | NULL                | NULL   | NULL                                       | goodness    |
| deviance             | 99.26757440771128   | NULL                | NULL   | NULL                                       | goodness    |
+----------------------+---------------------+---------------------+--------+--------------------------------------------+-------------+

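The R-Squared value of 0.988 means the model explains about 98.8% of the variance in y. The P-value for x2 (0.81) is far above the commonly used 0.05 threshold and its confidence interval spans zero, so x2 is not statistically significant in this dataset.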
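
The goodness-of-fit rows are related to each other by standard least-squares formulas. This page does not give the formulas PAI uses. The Python sketch below applies the usual textbook definitions, treating deviance as the residual sum of squares and counting the error variance as a parameter in AIC. These are assumptions, although they reproduce the reported values for this example.

```python
import math

# Inputs taken from the Step 1 data and the evaluation table above.
n = 10                         # rows in lm_test_input
p = 2                          # feature columns: x1, x2
r_squared = 0.9879675667384592
deviance = 99.26757440771128   # assumed to be the residual sum of squares

degrees_of_freedom = n - p - 1
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / degrees_of_freedom
residual_std = math.sqrt(deviance / degrees_of_freedom)

# Gaussian AIC = -2 * log-likelihood + 2k, with k = p + 2
# (feature coefficients, intercept, and the error variance).
k = p + 2
aic = n * (math.log(2 * math.pi) + math.log(deviance / n) + 1) + 2 * k

print(degrees_of_freedom, adjusted_r_squared, residual_std, aic)
```

Compare the output with the degree_of_freedom, adjusted_rsquared, standardErr_residual, and aic rows. Adjusted R-Squared is lower than R-Squared because it penalizes each additional feature.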

Step 5: View the prediction results

+-----+-------------------+---------------------+---------------------------+
| y   | prediction_result | prediction_score    | prediction_detail         |
+-----+-------------------+---------------------+---------------------------+
| 10  | NULL              | 12.808476727264404  | {"y": 12.8084767272644}   |
| 20  | NULL              | 15.43015013867922   | {"y": 15.43015013867922}  |
| 30  | NULL              | 33.48786177519568   | {"y": 33.48786177519568}  |
| 40  | NULL              | 36.565880804147355  | {"y": 36.56588080414735}  |
| 50  | NULL              | 52.674180388994415  | {"y": 52.67418038899442}  |
| 60  | NULL              | 62.82092871092313   | {"y": 62.82092871092313}  |
| 70  | NULL              | 71.34749583130122   | {"y": 71.34749583130122}  |
| 80  | NULL              | 75.75932310613193   | {"y": 75.75932310613193}  |
| 90  | NULL              | 87.1832221199846    | {"y": 87.18322211998461}  |
| 100 | NULL              | 101.92248485222113  | {"y": 101.9224848522211}  |
+-----+-------------------+---------------------+---------------------------+

The prediction_score column contains the model's predicted value for each row. Connect these results to downstream components for further analysis or reporting.
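
Each prediction_score is the intercept plus the weighted sum of the features, using the coefficients from the evaluation table in Step 4. The Python sketch below reproduces two rows by hand. The `predict` function is a hypothetical helper for illustration, not the prediction component itself.

```python
# Coefficients from the "value" column of the evaluation table in Step 4.
INTERCEPT = -6.42378496687763
COEF_X1 = 10.260063429838898
COEF_X2 = 0.35374498323846265


def predict(x1, x2):
    """Linear regression prediction: intercept plus the weighted sum of features."""
    return INTERCEPT + COEF_X1 * x1 + COEF_X2 * x2


print(predict(1.84, 1))  # first training row, actual y = 10
print(predict(6.68, 2))  # sixth training row, actual y = 60
```

The results match the prediction_score values for those rows, up to floating-point rounding.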

What's next

To reduce overfitting and improve accuracy on new data, set regularizedType to l1 or l2 and tune regularizedLevel. Start with l2 if your features are correlated; use l1 if you want the model to select a sparse subset of features.
To run predictions on new data, pass lm_test_input_model_out to the prediction component with a different input table.
To compare model versions, set enableFitGoodness to true for each run and compare the R-Squared and AIC values in the evaluation tables.
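
The difference between the two regularization types mentioned above can be seen in how each one penalizes coefficients. The Python sketch below uses the generic textbook definitions (sum of absolute values for l1, sum of squares for l2). The exact objective PAI optimizes, and how it scales regularizedLevel, are not documented on this page, so treat the `penalty` function as an illustration only.

```python
def penalty(coefficients, regularized_type="None", regularized_level=1.0):
    """Textbook penalty term that regularization adds to the training loss."""
    if regularized_type == "l1":
        return regularized_level * sum(abs(c) for c in coefficients)
    if regularized_type == "l2":
        return regularized_level * sum(c * c for c in coefficients)
    if regularized_type == "None":
        return 0.0
    raise ValueError("regularized_type must be 'l1', 'l2', or 'None'")


coefs = [10.26, 0.35]  # rounded x1 and x2 coefficients from the example
for kind in ("None", "l1", "l2"):
    print(kind, penalty(coefs, kind))
```

Under l2 the large x1 coefficient dominates the penalty because it is squared, so l2 mainly shrinks large coefficients. Under l1 every unit of coefficient size costs the same, which tends to push small, uninformative coefficients such as x2 to exactly zero.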