Regression evaluation

Regression Model Evaluation measures how well a regression model's predictions match actual outcomes. It computes a set of standard regression metrics—including mean squared error (MSE), mean absolute error (MAE), and R-squared (R²)—and generates a residual histogram to help you visualize prediction errors and identify areas for model improvement.

Prerequisites

Before you begin, ensure that you have:

  • A trained regression model with prediction output

  • An input table with two numeric columns: one for actual (observed) values and one for predicted values

Important

Both the actual-value column and the predicted-value column must use numeric data types. Non-numeric columns are not supported.

Configure the component

Method 1: Configure on the pipeline page

Add a Regression Model Evaluation component to your pipeline and configure the following parameters:

| Category | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Original Regression Value | The actual observed values of the target variable. Used as the ground truth for evaluating prediction accuracy. |
| Fields Setting | Predicted Regression Value | The values predicted by your regression model based on the input features. |
| Tuning | Worker number | Number of workers for distributed computation. For sizing guidance, see Appendix: How to estimate resource usage. |
| Tuning | Memory Size per Node | Memory allocated to each worker node. See the same appendix for sizing guidance. |

Method 2: Use PAI commands

Run the component by passing parameters to the regression_evaluation algorithm through a SQL Script component:

PAI -name regression_evaluation -project algo_public
    -DinputTableName=input_table
    -DyColName=y_col
    -DpredictionColName=prediction_col
    -DindexOutputTableName=index_output_table
    -DresidualOutputTableName=residual_output_table;

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | - | Name of the input table. |
| inputTablePartitions | No | Full table | Partitions to read from the input table. Omit to use the full table. |
| yColName | Yes | - | Name of the column containing the actual (observed) values. Must be numeric. |
| predictionColName | Yes | - | Name of the column containing the predicted values. Must be numeric. |
| indexOutputTableName | Yes | - | Name of the output table that stores regression metrics. |
| residualOutputTableName | Yes | - | Name of the output table that stores the residual histogram data. |
| intervalNum | No | 100 | Number of intervals (bins) in the residual histogram. |
| lifecycle | No | - | Retention period for the output tables, in days. Must be a positive integer. |
| coreNum | No | System default | Number of CPU cores. Valid values: 1–9,999. |
| memSizePerCore | No | System default | Memory per core, in MB. Valid values: 1,024–65,536. |

Output

The regression metrics output table (indexOutputTableName) is generated in JSON format and contains the following fields.

Regression metrics table

| Metric | Description | Interpretation |
| --- | --- | --- |
| MSE | Mean squared error | Lower is better. Penalizes large errors more than MAE because errors are squared. |
| RMSE | Root mean squared error | Lower is better. Expressed in the same unit as your target variable, making it easier to interpret than MSE. |
| MAE | Mean absolute error | Lower is better. The average magnitude of prediction errors; less sensitive to outliers than MSE. |
| MAPE | Mean absolute percentage error | Lower is better. Expresses error as a percentage of actual values; useful when comparing models across different scales. Undefined when an actual value is zero. |
| MAD | Mean absolute deviation | Lower is better. |
| R2 | R-squared (coefficient of determination) | Higher is better. Measures the proportion of variation in the actual values explained by the model; 1 indicates a perfect fit. |
| R | Multiple correlation coefficient | Measures the correlation between actual and predicted values. |
| SST | Total sum of squares | The total variation of the actual values around their mean. |
| SSE | Sum of squared errors | The variation left unexplained by the model. |
| SSR | Sum of squares due to regression | The variation explained by the model. For an ordinary least squares fit with an intercept, SST = SSE + SSR. |
| count | Row count | Number of rows used in the evaluation. |
| yMean | Mean of actual values | The arithmetic mean of the actual (observed) target values. |
| predictionMean | Mean of predicted values | The arithmetic mean of the model's predicted values. |
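
To make the metric definitions concrete, the following is a minimal Python sketch that computes most of them from two lists of numbers using the textbook formulas. It is an illustration only, not the component's implementation: the function name `regression_metrics` and the sample values are hypothetical, and MAD and R are omitted because this page does not state the exact formulas the component uses for them.

```python
import math

def regression_metrics(actual, predicted):
    """Compute common regression metrics from paired actual/predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    y_mean = sum(actual) / n
    pred_mean = sum(predicted) / n
    residuals = [y - p for y, p in zip(actual, predicted)]

    sse = sum(r * r for r in residuals)              # unexplained variation
    sst = sum((y - y_mean) ** 2 for y in actual)     # total variation
    ssr = sum((p - y_mean) ** 2 for p in predicted)  # explained variation
    mse = sse / n
    mae = sum(abs(r) for r in residuals) / n
    # MAPE is undefined when any actual value is zero, so report None then.
    if all(y != 0 for y in actual):
        mape = sum(abs(r / y) for r, y in zip(residuals, actual)) / n * 100
    else:
        mape = None
    # R2 is undefined when all actual values are identical (SST == 0).
    r2 = 1 - sse / sst if sst > 0 else None

    return {
        "MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "MAPE": mape,
        "R2": r2, "SST": sst, "SSE": sse, "SSR": ssr,
        "count": n, "yMean": y_mean, "predictionMean": pred_mean,
    }

# Hypothetical sample data for illustration.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.0])
for name, value in metrics.items():
    print(name, value)
```

Note that in this sample SST does not equal SSE + SSR, because the predictions do not come from a least squares fit on the same data.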

Residual histogram table

The residual histogram table (residualOutputTableName) stores the distribution of prediction errors (residuals = actual − predicted) across the number of intervals specified with intervalNum.
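
As a rough illustration of what binning residuals into `intervalNum` intervals means, the sketch below builds a histogram in plain Python. It assumes equal-width bins spanning the minimum to the maximum residual; this page does not document how the component chooses bin edges, so treat the binning rule, the function name `residual_histogram`, and the sample data as assumptions.

```python
def residual_histogram(actual, predicted, interval_num=100):
    """Bin residuals (actual - predicted) into equal-width intervals.

    Returns a list of (left_edge, right_edge, count) tuples, one per bin.
    """
    if interval_num < 1:
        raise ValueError("interval_num must be at least 1")
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")

    residuals = [y - p for y, p in zip(actual, predicted)]
    lo, hi = min(residuals), max(residuals)
    width = (hi - lo) / interval_num
    counts = [0] * interval_num
    for r in residuals:
        if width > 0:
            # Clamp so the maximum residual lands in the last bin.
            idx = min(int((r - lo) / width), interval_num - 1)
        else:
            # All residuals are identical: put everything in the first bin.
            idx = 0
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical sample data: residuals are -1, 0, 1, 2, 3.
for left, right, count in residual_histogram(
        [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 2.0, 2.0, 2.0, 2.0], interval_num=4):
    print(f"[{left}, {right}): {count}")
```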

How to read the residual histogram:

  • Symmetric, bell-shaped distribution centered near zero: The model's errors are random and well-distributed—a sign of a well-fitted model.

  • Skewed distribution: The model systematically over-predicts or under-predicts for a subset of inputs. Investigate whether feature engineering or model selection could reduce the bias.

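  • Wide spread: Large residuals mean that predictions are often far from the actual values. If the spread is much wider on evaluation data than on training data, the model may be overfitting; consider regularization or collecting more training data.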
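
The visual checks above can be complemented with a few summary numbers. The following sketch computes the mean, standard deviation, and moment-based skewness of the residuals: a mean far from zero suggests systematic bias, a large standard deviation corresponds to a wide spread, and a skewness far from zero corresponds to an asymmetric histogram. The function name `residual_summary` and the sample data are hypothetical, and the component itself does not output these values.

```python
import math

def residual_summary(actual, predicted):
    """Summarize residuals (actual - predicted) with mean, std, and skewness."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [y - p for y, p in zip(actual, predicted)]
    n = len(residuals)
    mean = sum(residuals) / n
    variance = sum((r - mean) ** 2 for r in residuals) / n  # population variance
    std = math.sqrt(variance)
    # Moment-based skewness. A positive value means a longer right tail,
    # that is, a few rows where the model under-predicts by a large amount.
    if std > 0:
        skewness = (sum((r - mean) ** 3 for r in residuals) / n) / std ** 3
    else:
        skewness = 0.0
    return {"mean": mean, "std": std, "skewness": skewness}

# Hypothetical sample data: one row is badly under-predicted.
print(residual_summary([1.0, 2.0, 3.0, 4.0, 10.0], [1.0, 2.0, 3.0, 4.0, 5.0]))
```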