Linear regression

The Linear Regression component trains a model that predicts a continuous numeric output from one or more input features. Use it when your target variable (such as sales revenue, temperature, or price) has a roughly linear relationship with the input features. For sparse, high-dimensional data—where most feature values are zero—enable sparse format input to represent data efficiently in KV format.

Configure the component

Method 1: Use the UI

Add the Linear Regression component to the workflow canvas in Designer, then configure its parameters in the right pane.

Fields setting

| Parameter | Description | Default |
| --- | --- | --- |
| Select feature columns | Feature columns from the input data source to use for training. | |
| Select label column | The dependent variable. DOUBLE and BIGINT types are supported. | |
| Is sparse format | Represent input data in KV format. Enable this for sparse, high-dimensional data where most feature values are zero. | Off |
| Separator between key-value pairs | Separator between key-value pairs when sparse format is enabled. | Comma (,) |
| Separator between key and value | Separator between a key and its value when sparse format is enabled. | Colon (:) |

Parameters setting

| Parameter | Description | Default |
| --- | --- | --- |
| Maximum iterations | Maximum number of iterations for the algorithm. | 100 |
| Minimum likelihood error | Log-likelihood convergence threshold. The algorithm stops when the difference in log-likelihood between consecutive iterations falls below this value. | 0.000001 |
| Regularization type | Regularization method to reduce overfitting. Options: L1, L2, None. Use L1 to produce sparse coefficients (useful when many features are irrelevant); use L2 to penalize large coefficients (useful when features are correlated). Set to None to disable regularization. | None |
| Regularization coefficient | Strength of the regularization penalty. Higher values apply stronger regularization. Ignored when Regularization type is set to None. | 1 |
| Generate model evaluation table | Generates a model evaluation table containing: R-Squared, Adjusted R-Squared, AIC, degrees of freedom, standard deviation of residuals, and deviance. | Off |
| Regression coefficient evaluation | Adds per-coefficient statistics (T-value, P-value, and the [2.5%, 97.5%] confidence interval) to the evaluation table. Available only when Generate model evaluation table is selected. | Off |

Execution tuning

| Parameter | Description | Default |
| --- | --- | --- |
| Number of computing cores | Number of cores to use. | System allocated |
| Memory size per core | Memory per core. | System allocated |

Method 2: Use PAI commands

Use PAI commands to configure the Linear Regression component. Run them through the SQL Script component. For more information, see SQL Script.

PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input
    -DfeatureColNames=x
    -DlabelColName=y
    -DmodelName=lm_test_input_model_out;

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | | Name of the input table. |
| modelName | Yes | | Name of the output model. |
| outputTableName | No | | Name of the output model evaluation table. Required when enableFitGoodness is true. |
| labelColName | Yes | | Dependent variable column. DOUBLE and BIGINT types are supported. Only one column can be specified. |
| featureColNames | Yes | | Independent variable columns. For dense input, DOUBLE and BIGINT types are supported. For sparse input, the STRING type is supported. |
| inputTablePartitions | No | | Partitions of the input table to read. |
| enableSparse | No | false | Whether the input data is in sparse format. Valid values: true, false. |
| itemDelimiter | No | , | Separator between key-value pairs. Used when enableSparse is true. |
| kvDelimiter | No | : | Separator between a key and its value. Used when enableSparse is true. |
| maxIter | No | 100 | Maximum number of iterations. |
| epsilon | No | 0.000001 | Convergence threshold. The algorithm stops when the log-likelihood difference between consecutive iterations is less than this value. |
| regularizedType | No | None | Regularization method. Valid values: l1, l2, None. Use l1 for sparse feature selection; use l2 when features are correlated and you want to penalize large coefficients. |
| regularizedLevel | No | 1 | Regularization coefficient. Not used when regularizedType is None. |
| enableFitGoodness | No | false | Whether to generate a model evaluation table. When true, the output table includes: R-Squared, Adjusted R-Squared, AIC, degrees of freedom, standard deviation of residuals, and deviance. Valid values: true, false. |
| enableCoefficientEstimate | No | false | Whether to include per-coefficient statistics in the evaluation table. When true, adds T-value, P-value, and confidence interval [2.5%, 97.5%] for each coefficient. Requires enableFitGoodness to be true. Valid values: true, false. |
| lifecycle | No | -1 | Lifecycle of the output model evaluation table. |
| coreNum | No | System allocated | Number of computing cores. |
| memSizePerCore | No | System allocated | Memory per core. |
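
As an illustration of how the optional parameters combine, the following sketch adds L2 regularization and explicit convergence settings to the basic command shown earlier in this section. The model name lm_test_input_model_l2 and the regularizedLevel value of 0.1 are placeholders chosen for this example, not recommended settings.

```sql
-- Sketch only: the model name and the regularizedLevel value are
-- illustrative assumptions. All flags come from the parameter list.
PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input
    -DfeatureColNames=x
    -DlabelColName=y
    -DmodelName=lm_test_input_model_l2
    -DregularizedType=l2
    -DregularizedLevel=0.1
    -DmaxIter=100
    -Depsilon=0.000001;
```

Because regularizedLevel is ignored when regularizedType is None, the two flags only have an effect when set together.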

Example

This example walks through the full workflow: create training data, train a linear regression model, run predictions, and inspect both output tables.

Step 1: Create the training data

DROP TABLE IF EXISTS lm_test_input;
CREATE TABLE lm_test_input AS
SELECT * FROM
(
  SELECT 10 AS y, 1.84 AS x1, 1 AS x2, '0:1.84 1:1' AS sparsecol1
    UNION ALL
  SELECT 20 AS y, 2.13 AS x1, 0 AS x2, '0:2.13' AS sparsecol1
    UNION ALL
  SELECT 30 AS y, 3.89 AS x1, 0 AS x2, '0:3.89' AS sparsecol1
    UNION ALL
  SELECT 40 AS y, 4.19 AS x1, 0 AS x2, '0:4.19' AS sparsecol1
    UNION ALL
  SELECT 50 AS y, 5.76 AS x1, 0 AS x2, '0:5.76' AS sparsecol1
    UNION ALL
  SELECT 60 AS y, 6.68 AS x1, 2 AS x2, '0:6.68 1:2' AS sparsecol1
    UNION ALL
  SELECT 70 AS y, 7.58 AS x1, 0 AS x2, '0:7.58' AS sparsecol1
    UNION ALL
  SELECT 80 AS y, 8.01 AS x1, 0 AS x2, '0:8.01' AS sparsecol1
    UNION ALL
  SELECT 90 AS y, 9.02 AS x1, 3 AS x2, '0:9.02 1:3' AS sparsecol1
    UNION ALL
  SELECT 100 AS y, 10.56 AS x1, 0 AS x2, '0:10.56' AS sparsecol1
) tmp;

Step 2: Train the model

The following command trains the model with model evaluation and coefficient estimation enabled. The model is saved as lm_test_input_model_out, and the evaluation table is written to lm_test_input_conf_out with a lifecycle of 1.

PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input
    -DlabelColName=y
    -DfeatureColNames=x1,x2
    -DmodelName=lm_test_input_model_out
    -DoutputTableName=lm_test_input_conf_out
    -DenableCoefficientEstimate=true
    -DenableFitGoodness=true
    -Dlifecycle=1;
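
The same data could also be trained in sparse (KV) format. The sketch below rests on some assumptions: sparsecol1 in the sample table separates pairs with a space, which differs from the default itemDelimiter (,), so the sketch first derives a comma-separated column instead of passing a custom separator. The names lm_test_input_sparse, sparsecol2, and lm_test_input_sparse_model_out are hypothetical.

```sql
-- Hypothetical helper table: build a KV column that matches the default
-- separators (itemDelimiter "," and kvDelimiter ":"). Zero-valued x2
-- entries are omitted, which is the point of the sparse format.
DROP TABLE IF EXISTS lm_test_input_sparse;
CREATE TABLE lm_test_input_sparse AS
SELECT
  y,
  CASE
    WHEN x2 <> 0 THEN CONCAT('0:', CAST(x1 AS STRING), ',1:', CAST(x2 AS STRING))
    ELSE CONCAT('0:', CAST(x1 AS STRING))
  END AS sparsecol2
FROM lm_test_input;

-- Train on the STRING column with sparse input enabled.
-- itemDelimiter and kvDelimiter are left at their defaults.
PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input_sparse
    -DlabelColName=y
    -DfeatureColNames=sparsecol2
    -DenableSparse=true
    -DmodelName=lm_test_input_sparse_model_out;
```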

Step 3: Run predictions

Pass the trained model and the original input table to the prediction component. Because appendColNames is set to y, the output table carries the actual y values alongside the predictions.

PAI -name prediction
    -project algo_public
    -DmodelName=lm_test_input_model_out
    -DinputTableName=lm_test_input
    -DoutputTableName=lm_test_input_predict_out
    -DappendColNames=y;
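
To score rows other than the training data, the same command can point at a different input table. In this sketch, lm_test_new_input and lm_test_new_predict_out are hypothetical table names, and the input table is assumed to contain the feature columns x1 and x2 used in training.

```sql
-- Hypothetical tables: lm_test_new_input (assumed to contain x1 and x2)
-- and lm_test_new_predict_out. appendColNames lists the input columns to
-- carry into the output so each prediction can be traced to its features.
PAI -name prediction
    -project algo_public
    -DmodelName=lm_test_input_model_out
    -DinputTableName=lm_test_new_input
    -DoutputTableName=lm_test_new_predict_out
    -DappendColNames=x1,x2;
```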

Step 4: View the model evaluation table

Query lm_test_input_conf_out to check goodness-of-fit metrics and regression coefficients.
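
A minimal pair of queries for this step, assuming the table has not yet expired (it was created with lifecycle=1). The column names colname, value, pvalue, and p are taken from the output shown in this step.

```sql
-- Return every row: coefficient rows and goodness-of-fit rows.
SELECT * FROM lm_test_input_conf_out;

-- Return only the regression coefficients and their P-values.
SELECT colname, value, pvalue
FROM lm_test_input_conf_out
WHERE p = 'coefficient';
```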

+----------------------+---------------------+---------------------+--------+--------------------------------------------+-------------+
| colname              | value               | tscore              | pvalue | confidenceinterval                         | p           |
+----------------------+---------------------+---------------------+--------+--------------------------------------------+-------------+
| Intercept            | -6.42378496687763   | -2.2725755951390028 | 0.06   | {"2.5%": -11.964027, "97.5%": -0.883543}  | coefficient |
| x1                   | 10.260063429838898  | 23.270944360826963  | 0.0    | {"2.5%": 9.395908, "97.5%": 11.124219}    | coefficient |
| x2                   | 0.35374498323846265 | 0.2949247320997519  | 0.81   | {"2.5%": -1.997160, "97.5%": 2.704650}    | coefficient |
| rsquared             | 0.9879675667384592  | NULL                | NULL   | NULL                                       | goodness    |
| adjusted_rsquared    | 0.9845297286637332  | NULL                | NULL   | NULL                                       | goodness    |
| aic                  | 59.331109494251805  | NULL                | NULL   | NULL                                       | goodness    |
| degree_of_freedom    | 7.0                 | NULL                | NULL   | NULL                                       | goodness    |
| standardErr_residual | 3.765777749448906   | NULL                | NULL   | NULL                                       | goodness    |
| deviance             | 99.26757440771128   | NULL                | NULL   | NULL                                       | goodness    |
+----------------------+---------------------+---------------------+--------+--------------------------------------------+-------------+

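The R-Squared value of 0.988 means the model explains about 98.8% of the variance in y on the training data. The P-value for x2 (0.81) is far above the conventional 0.05 threshold, and its confidence interval spans zero, so x2 shows no statistically significant effect in this dataset.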
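
The goodness-of-fit rows can be cross-checked against the standard textbook definitions. The sketch below assumes n = 10 training rows and plugs in the reported R-Squared, degrees of freedom (7), and deviance. Whether the component computes these values the same way internally is not stated here, but the numbers agree.

```sql
SELECT
  -- Adjusted R-Squared = 1 - (1 - R2) * (n - 1) / degrees_of_freedom
  1 - (1 - 0.9879675667384592) * (10 - 1) / 7.0 AS adjusted_rsquared_check,
  -- Residual standard deviation = sqrt(deviance / degrees_of_freedom)
  SQRT(99.26757440771128 / 7.0) AS residual_std_check;
```

The two results come out at about 0.9845 and 3.7658, in line with the adjusted_rsquared and standardErr_residual rows of the table.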

Step 5: View the prediction results

+-----+-------------------+---------------------+---------------------------+
| y   | prediction_result | prediction_score    | prediction_detail         |
+-----+-------------------+---------------------+---------------------------+
| 10  | NULL              | 12.808476727264404  | {"y": 12.8084767272644}   |
| 20  | NULL              | 15.43015013867922   | {"y": 15.43015013867922}  |
| 30  | NULL              | 33.48786177519568   | {"y": 33.48786177519568}  |
| 40  | NULL              | 36.565880804147355  | {"y": 36.56588080414735}  |
| 50  | NULL              | 52.674180388994415  | {"y": 52.67418038899442}  |
| 60  | NULL              | 62.82092871092313   | {"y": 62.82092871092313}  |
| 70  | NULL              | 71.34749583130122   | {"y": 71.34749583130122}  |
| 80  | NULL              | 75.75932310613193   | {"y": 75.75932310613193}  |
| 90  | NULL              | 87.1832221199846    | {"y": 87.18322211998461}  |
| 100 | NULL              | 101.92248485222113  | {"y": 101.9224848522211}  |
+-----+-------------------+---------------------+---------------------------+

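The prediction_score column contains the model's predicted value for each row. In this regression output, prediction_result is NULL and prediction_detail repeats the score keyed by the label column. Connect these results to downstream components for further analysis or reporting.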
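
Each prediction_score is the fitted linear equation applied to that row. As a sketch, the first query recomputes the predictions directly from the coefficients reported in Step 4, and the second sums the squared residuals in the prediction output, which should come out close to the deviance (about 99.27) in the evaluation table.

```sql
-- Recompute predictions by hand: intercept + coef_x1 * x1 + coef_x2 * x2.
-- The coefficient values are copied from the evaluation table in Step 4.
SELECT
  y,
  -6.42378496687763
    + 10.260063429838898 * x1
    + 0.35374498323846265 * x2 AS manual_prediction
FROM lm_test_input;

-- Residual sum of squares over the prediction output table.
SELECT SUM(POW(y - prediction_score, 2)) AS residual_sum_of_squares
FROM lm_test_input_predict_out;
```

For the first row (x1 = 1.84, x2 = 1), the manual result is about 12.8085, matching the prediction_score shown for y = 10.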

What's next

  • To reduce overfitting and improve accuracy on new data, set regularizedType to l1 or l2 and tune regularizedLevel. Start with l2 if your features are correlated; use l1 if you want the model to select a sparse subset of features.

  • To run predictions on new data, pass lm_test_input_model_out to the prediction component with a different input table.

  • To compare model versions, enable enableFitGoodness and compare R-Squared and AIC values across runs.
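
For the comparison in the last item, one possible approach is to join the evaluation tables of two runs on colname. Here lm_test_input_conf_out_v2 is a hypothetical evaluation table written by a second training run with enableFitGoodness=true.

```sql
-- Side-by-side goodness-of-fit metrics for two training runs.
-- lm_test_input_conf_out_v2 is a hypothetical table from a second run.
SELECT
  a.colname,
  a.value AS run1_value,
  b.value AS run2_value
FROM lm_test_input_conf_out a
JOIN lm_test_input_conf_out_v2 b
  ON a.colname = b.colname
WHERE a.p = 'goodness';
```

As a general rule, a higher rsquared and a lower aic point to the better-fitting run, though both are measured on the training data.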