The Linear Regression component trains a model that predicts a continuous numeric output from one or more input features. Use it when your target variable (such as sales revenue, temperature, or price) has a roughly linear relationship with the input features. For sparse, high-dimensional data, where most feature values are zero, enable sparse format input so that each row is stored compactly as key-value (KV) pairs.
Configure the component
Method 1: Use the UI
Add the Linear Regression component to the workflow canvas in Designer, then configure its parameters in the right pane.
Fields setting
| Parameter | Description | Default |
|---|---|---|
| Select feature columns | Feature columns from the input data source to use for training. | — |
| Select label column | The dependent variable. DOUBLE and BIGINT types are supported. | — |
| Is sparse format | Represent input data in KV format. Enable this for sparse, high-dimensional data where most feature values are zero. | Off |
| Separator between key-value pairs | Separator between key-value pairs when sparse format is enabled. | Comma (,) |
| Separator between key and value | Separator between a key and its value when sparse format is enabled. | Colon (:) |
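To make the sparse KV layout concrete, the following minimal sketch parses one KV string into an index-to-value mapping and expands it into a dense row. It only illustrates the format and is not the component's own parser. The helper names parse_kv and to_dense are hypothetical, and treating keys as integer feature indices is an assumption.

```python
def parse_kv(row, item_delimiter=",", kv_delimiter=":"):
    """Parse a sparse KV string such as '0:1.84,1:1' into {index: value}.

    Assumption: keys are integer feature indices and values are numeric.
    """
    features = {}
    for item in row.split(item_delimiter):
        item = item.strip()
        if not item:
            continue  # tolerate empty fragments such as a trailing delimiter
        key, value = item.split(kv_delimiter, 1)
        features[int(key)] = float(value)
    return features


def to_dense(features, width):
    """Expand a sparse mapping into a dense list; missing indices become 0.0."""
    return [features.get(i, 0.0) for i in range(width)]


# Default separators from the table above: comma between pairs, colon inside a pair.
sparse_row = parse_kv("0:1.84,1:1")
dense_row = to_dense(parse_kv("0:2.13"), width=2)
```

A row that stores only its nonzero entries stays short no matter how many feature columns the full data set has, which is why the sparse format suits high-dimensional input.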
Parameters setting
| Parameter | Description | Default |
|---|---|---|
| Maximum iterations | Maximum number of iterations for the algorithm. | 100 |
| Minimum likelihood error | Log-likelihood convergence threshold. The algorithm stops when the difference in log-likelihood between consecutive iterations falls below this value. | 0.000001 |
| Regularization type | Regularization method to reduce overfitting. Options: L1, L2, None. Use L1 to produce sparse coefficients (useful when many features are irrelevant); use L2 to penalize large coefficients (useful when features are correlated). Set to None to disable regularization. | None |
| Regularization coefficient | Strength of the regularization penalty. Higher values apply stronger regularization. Ignored when Regularization type is set to None. | 1 |
| Generate model evaluation table | Generates a model evaluation table containing: R-Squared, Adjusted R-Squared, AIC, degrees of freedom, standard deviation of residuals, and deviance. | Off |
| Regression coefficient evaluation | Adds per-coefficient statistics—T-value, P-value, and confidence interval [2.5%, 97.5%]—to the evaluation table. Available only when Generate model evaluation table is selected. | Off |
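The difference between L1 and L2 regularization can be sketched on a one-feature model without an intercept, where both penalized solutions have a closed form. This is a textbook illustration under stated assumptions, not the component's implementation: the objective scaling, the function names fit_l1 and fit_l2, and the toy data are all hypothetical.

```python
def fit_l2(xs, ys, level):
    """Minimize 0.5 * sum((y - w*x)^2) + 0.5 * level * w^2 (ridge)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + level)


def fit_l1(xs, ys, level):
    """Minimize 0.5 * sum((y - w*x)^2) + level * |w| (lasso, soft-thresholding)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - level, 0.0)  # pull the correlation toward zero
    return (shrunk if sxy >= 0 else -shrunk) / sxx


xs = [1.0, 2.0, 3.0]
weak_ys = [0.1, -0.1, 0.1]    # made-up target that barely depends on x
strong_ys = [2.0, 4.1, 5.9]   # made-up target close to y = 2x

w_l1_weak = fit_l1(xs, weak_ys, level=1.0)      # driven exactly to zero
w_l2_weak = fit_l2(xs, weak_ys, level=1.0)      # shrunk, but still nonzero
w_l1_strong = fit_l1(xs, strong_ys, level=1.0)  # kept, slightly shrunk
```

In this sketch L1 removes the weak feature entirely while L2 only shrinks its coefficient, which matches the guidance in the Regularization type row.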
Execution tuning
| Parameter | Description | Default |
|---|---|---|
| Number of computing cores | Number of cores to use. | System allocated |
| Memory size per core | Memory per core. | System allocated |
Method 2: Use PAI commands
Use PAI commands to configure the Linear Regression component. Run them through the SQL Script component. For more information, see SQL Script.
PAI -name linearregression
-project algo_public
-DinputTableName=lm_test_input
-DfeatureColNames=x
-DlabelColName=y
-DmodelName=lm_test_input_model_out;
| Parameter | Required | Default | Description |
|---|---|---|---|
| inputTableName | Yes | — | Name of the input table. |
| modelName | Yes | — | Name of the output model. |
| outputTableName | No | — | Name of the output model evaluation table. Required when enableFitGoodness is true. |
| labelColName | Yes | — | Dependent variable column. DOUBLE and BIGINT types are supported. Only one column can be specified. |
| featureColNames | Yes | — | Independent variable columns. For dense input, DOUBLE and BIGINT types are supported. For sparse input, the STRING type is supported. |
| inputTablePartitions | No | — | Partitions of the input table to read. |
| enableSparse | No | false | Whether the input data is in sparse format. Valid values: true, false. |
| itemDelimiter | No | , | Separator between key-value pairs. Used when enableSparse is true. |
| kvDelimiter | No | : | Separator between a key and its value. Used when enableSparse is true. |
| maxIter | No | 100 | Maximum number of iterations. |
| epsilon | No | 0.000001 | Convergence threshold. The algorithm stops when the log-likelihood difference between consecutive iterations is less than this value. |
| regularizedType | No | None | Regularization method. Valid values: l1, l2, None. Use l1 for sparse feature selection; use l2 when features are correlated and you want to penalize large coefficients. |
| regularizedLevel | No | 1 | Regularization coefficient. Not used when regularizedType is None. |
| enableFitGoodness | No | false | Whether to generate a model evaluation table. When true, the output table includes: R-Squared, Adjusted R-Squared, AIC, degrees of freedom, standard deviation of residuals, and deviance. Valid values: true, false. |
| enableCoefficientEstimate | No | false | Whether to include per-coefficient statistics in the evaluation table. When true, adds T-value, P-value, and confidence interval [2.5%, 97.5%] for each coefficient. Requires enableFitGoodness to be true. Valid values: true, false. |
| lifecycle | No | -1 | Lifecycle of the output model evaluation table. |
| coreNum | No | System allocated | Number of computing cores. |
| memSizePerCore | No | System allocated | Memory per core. |
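How maxIter and epsilon interact can be sketched with a small iterative fit. The optimizer the component actually uses is not stated here, so plain gradient descent on mean squared error stands in for it, and the change in loss stands in for the change in log-likelihood. The function name fit_1d and the learning rate lr are hypothetical.

```python
def fit_1d(xs, ys, max_iter=100, epsilon=1e-6, lr=0.05):
    """Fit y = w*x + b by gradient descent.

    Stops when the loss change between consecutive iterations drops below
    epsilon, or when max_iter iterations have run. Returns (w, b, iterations).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    prev_loss = None
    for iteration in range(1, max_iter + 1):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        if prev_loss is not None and abs(prev_loss - loss) < epsilon:
            return w, b, iteration  # converged before the iteration cap
        prev_loss = loss
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, max_iter  # iteration cap reached without meeting epsilon


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # made-up data on the line y = 2x + 1

w, b, used = fit_1d(xs, ys, max_iter=1000)
w_capped, b_capped, used_capped = fit_1d(xs, ys, max_iter=5)
```

A smaller epsilon or a larger maxIter lets the fit run longer and land closer to the optimum; a low maxIter can stop the fit before the threshold is met.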
Example
This example walks through the full workflow: create training data, train a linear regression model, run predictions, and inspect both output tables.
Step 1: Create the training data
DROP TABLE IF EXISTS lm_test_input;
CREATE TABLE lm_test_input AS
SELECT * FROM
(
SELECT 10 AS y, 1.84 AS x1, 1 AS x2, '0:1.84 1:1' AS sparsecol1
UNION ALL
SELECT 20 AS y, 2.13 AS x1, 0 AS x2, '0:2.13' AS sparsecol1
UNION ALL
SELECT 30 AS y, 3.89 AS x1, 0 AS x2, '0:3.89' AS sparsecol1
UNION ALL
SELECT 40 AS y, 4.19 AS x1, 0 AS x2, '0:4.19' AS sparsecol1
UNION ALL
SELECT 50 AS y, 5.76 AS x1, 0 AS x2, '0:5.76' AS sparsecol1
UNION ALL
SELECT 60 AS y, 6.68 AS x1, 2 AS x2, '0:6.68 1:2' AS sparsecol1
UNION ALL
SELECT 70 AS y, 7.58 AS x1, 0 AS x2, '0:7.58' AS sparsecol1
UNION ALL
SELECT 80 AS y, 8.01 AS x1, 0 AS x2, '0:8.01' AS sparsecol1
UNION ALL
SELECT 90 AS y, 9.02 AS x1, 3 AS x2, '0:9.02 1:3' AS sparsecol1
UNION ALL
SELECT 100 AS y, 10.56 AS x1, 0 AS x2, '0:10.56' AS sparsecol1
) tmp;
Step 2: Train the model
The following command trains the model with model evaluation and coefficient estimation enabled. The model is saved as lm_test_input_model_out, and the evaluation table is written to lm_test_input_conf_out.
PAI -name linearregression
-project algo_public
-DinputTableName=lm_test_input
-DlabelColName=y
-DfeatureColNames=x1,x2
-DmodelName=lm_test_input_model_out
-DoutputTableName=lm_test_input_conf_out
-DenableCoefficientEstimate=true
-DenableFitGoodness=true
-Dlifecycle=1;
Step 3: Run predictions
Pass the trained model and the original input table to the prediction component. The output includes both actual and predicted values.
PAI -name prediction
-project algo_public
-DmodelName=lm_test_input_model_out
-DinputTableName=lm_test_input
-DoutputTableName=lm_test_input_predict_out
-DappendColNames=y;
Step 4: View the model evaluation table
Query lm_test_input_conf_out to check goodness-of-fit metrics and regression coefficients.
+----------------------+---------------------+---------------------+--------+--------------------------------------------+-------------+
| colname | value | tscore | pvalue | confidenceinterval | p |
+----------------------+---------------------+---------------------+--------+--------------------------------------------+-------------+
| Intercept | -6.42378496687763 | -2.2725755951390028 | 0.06 | {"2.5%": -11.964027, "97.5%": -0.883543} | coefficient |
| x1 | 10.260063429838898 | 23.270944360826963 | 0.0 | {"2.5%": 9.395908, "97.5%": 11.124219} | coefficient |
| x2 | 0.35374498323846265 | 0.2949247320997519 | 0.81 | {"2.5%": -1.997160, "97.5%": 2.704650} | coefficient |
| rsquared | 0.9879675667384592 | NULL | NULL | NULL | goodness |
| adjusted_rsquared | 0.9845297286637332 | NULL | NULL | NULL | goodness |
| aic | 59.331109494251805 | NULL | NULL | NULL | goodness |
| degree_of_freedom | 7.0 | NULL | NULL | NULL | goodness |
| standardErr_residual | 3.765777749448906 | NULL | NULL | NULL | goodness |
| deviance | 99.26757440771128 | NULL | NULL | NULL | goodness |
+----------------------+---------------------+---------------------+--------+--------------------------------------------+-------------+
The R-Squared value of 0.988 indicates that the model explains about 98.8% of the variance in y. The P-value for x2 (0.81) is well above common significance thresholds such as 0.05, so x2 has low statistical significance in this dataset.
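The goodness rows in the table are related to one another, and the relationships can be checked with a short calculation. The sketch below uses standard textbook definitions (adjusted R-Squared, residual standard error, Gaussian AIC, and a normal-approximation interval of coefficient ± 1.96 × standard error). These formulas are an assumption: the component's exact formulas are not stated here, although they agree with the sample output above. The helper name conf_interval is hypothetical.

```python
import math

n, p = 10, 2                   # rows and feature columns in lm_test_input
r2 = 0.9879675667384592        # rsquared from the table
deviance = 99.26757440771128   # deviance (residual sum of squares)

dof = n - p - 1                                # degree_of_freedom
adjusted_r2 = 1 - (1 - r2) * (n - 1) / dof     # adjusted_rsquared
residual_se = math.sqrt(deviance / dof)        # standardErr_residual

# Gaussian AIC with p + 2 estimated parameters
# (intercept, p coefficients, and the residual variance).
aic = n * math.log(2 * math.pi) + n * math.log(deviance / n) + n + 2 * (p + 2)


def conf_interval(value, tscore, z=1.96):
    """Return (low, high) for value ± z * standard error, with se = value / tscore."""
    se = abs(value / tscore)
    return value - z * se, value + z * se


x1_low, x1_high = conf_interval(10.260063429838898, 23.270944360826963)
```

The x1 interval stays far from zero, whereas the interval reported for x2 spans zero, which is consistent with the high P-value for x2.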
Step 5: View the prediction results
+-----+-------------------+---------------------+---------------------------+
| y | prediction_result | prediction_score | prediction_detail |
+-----+-------------------+---------------------+---------------------------+
| 10 | NULL | 12.808476727264404 | {"y": 12.8084767272644} |
| 20 | NULL | 15.43015013867922 | {"y": 15.43015013867922} |
| 30 | NULL | 33.48786177519568 | {"y": 33.48786177519568} |
| 40 | NULL | 36.565880804147355 | {"y": 36.56588080414735} |
| 50 | NULL | 52.674180388994415 | {"y": 52.67418038899442} |
| 60 | NULL | 62.82092871092313 | {"y": 62.82092871092313} |
| 70 | NULL | 71.34749583130122 | {"y": 71.34749583130122} |
| 80 | NULL | 75.75932310613193 | {"y": 75.75932310613193} |
| 90 | NULL | 87.1832221199846 | {"y": 87.18322211998461} |
| 100 | NULL | 101.92248485222113 | {"y": 101.9224848522211} |
+-----+-------------------+---------------------+---------------------------+
The prediction_score column contains the model's predicted value for each row. Connect these results to downstream components for further analysis or reporting.
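As a cross-check, each prediction_score can be recomputed by hand from the coefficients in the evaluation table: intercept plus each coefficient times its feature value. The sketch below applies that to the ten training rows and then derives the residual sum of squares and R-Squared from the results. The function name predict is hypothetical; the numbers are copied from the tables in this example.

```python
# Coefficients from the model evaluation table (Step 4).
intercept = -6.42378496687763
coef_x1 = 10.260063429838898
coef_x2 = 0.35374498323846265

# (y, x1, x2) rows from lm_test_input (Step 1).
rows = [
    (10, 1.84, 1), (20, 2.13, 0), (30, 3.89, 0), (40, 4.19, 0), (50, 5.76, 0),
    (60, 6.68, 2), (70, 7.58, 0), (80, 8.01, 0), (90, 9.02, 3), (100, 10.56, 0),
]


def predict(x1, x2):
    """Linear model: intercept + coef_x1 * x1 + coef_x2 * x2."""
    return intercept + coef_x1 * x1 + coef_x2 * x2


scores = [predict(x1, x2) for _, x1, x2 in rows]

# Residual sum of squares and R-Squared derived from the predictions.
rss = sum((y - s) ** 2 for (y, _, _), s in zip(rows, scores))
mean_y = sum(y for y, _, _ in rows) / len(rows)
tss = sum((y - mean_y) ** 2 for y, _, _ in rows)
r2 = 1 - rss / tss
```

The recomputed scores agree with the prediction_score column, and rss and r2 agree with the deviance and rsquared rows of the evaluation table.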
What's next
To improve model accuracy, set regularizedType to l1 or l2 and tune regularizedLevel to reduce overfitting. Start with l2 if your features are correlated; use l1 if you want the model to select a sparse subset of features.
To run predictions on new data, pass lm_test_input_model_out to the prediction component with a different input table.
To compare model versions, enable enableFitGoodness and compare R-Squared and AIC values across runs.