Logistic regression for binary classification

Logistic Regression for Binary Classification trains a binary classifier using the logistic regression algorithm. The component supports both dense and sparse input data and uses the limited-memory BFGS (L-BFGS) optimizer.

How it works

Training runs in three stages:

  1. Feature processing — The algorithm reads the specified feature columns (DOUBLE or BIGINT types) and the label column from the input table. If the input is in sparse format, the component parses key-value pairs before processing.

  2. Training — L-BFGS iteratively minimizes the loss function. Each iteration updates the model weights. Training stops when the log-likelihood improvement between two consecutive iterations falls below epsilon, or when the maximum number of iterations is reached.

  3. Prediction — The trained model assigns each row a predicted class label (prediction_result), a confidence score (prediction_score), and a probability distribution across all classes (prediction_detail).
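
The stopping rule in stage 2 can be sketched in a few lines of Python. This is an illustration only, not the component's implementation: it swaps L-BFGS for plain gradient ascent on the log-likelihood to stay dependency-free, and the names `train`, `log_likelihood`, and `step` are hypothetical. The data is the six-row table from the Example section of this page.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_likelihood(weights, bias, rows, labels):
    # Sum of log-probabilities of the observed labels (1 = positive class).
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
        p = min(max(p, 1e-15), 1.0 - 1e-15)  # guard against log(0)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def train(rows, labels, max_iter=100, epsilon=1e-6, step=0.1):
    # Plain gradient ascent stands in for L-BFGS (assumption for illustration).
    weights = [0.0] * len(rows[0])
    bias = 0.0
    previous = log_likelihood(weights, bias, rows, labels)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = y - sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            grad_b += error
            for j, v in enumerate(x):
                grad_w[j] += error * v
        weights = [w + step * g for w, g in zip(weights, grad_w)]
        bias += step * grad_b
        current = log_likelihood(weights, bias, rows, labels)
        # Stop once the log-likelihood improvement drops below epsilon.
        if current - previous < epsilon:
            break
        previous = current
    return weights, bias, iterations


# Feature columns f0..f3 and labels from the example table lr_test_input.
rows = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
        [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
labels = [0, 0, 1, 1, 0, 0]
weights, bias, iterations = train(rows, labels)
```

The loop ends on whichever condition fires first: the improvement check or the `max_iter` cap. A larger `epsilon` trades accuracy for fewer iterations.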

Configure the component

Choose one of the following methods.

Method 1: Configure in Machine Learning Designer

Open the pipeline configuration tab in Machine Learning Designer, then set the parameters on the component panel.

Fields setting

| Parameter | Description |
| --- | --- |
| Training feature columns | Feature columns from the input table used for training. Supports DOUBLE and BIGINT types. Maximum 20 million features. |
| Target columns | The label column in the input table. |
| Positive class value | The label value that represents the positive class in binary classification. |
| Use sparse format | Enable if the input data is in sparse format. |

Parameters setting

| Parameter | Description | Default |
| --- | --- | --- |
| Regularization type | Regularization to apply during training. Valid values: None, L1, L2. See Choose a regularization type. | |
| Maximum iterations | Maximum number of L-BFGS iterations. | 100 |
| Regularization coefficient | Strength of regularization. Has no effect when Regularization type is None. | |
| Minimum convergence deviance | Convergence threshold (epsilon). Training stops when the log-likelihood improvement between two iterations falls below this value. | 0.000001 |

Tuning

| Parameter | Description |
| --- | --- |
| Cores | Automatically set by the system. |
| Memory size per core | Automatically set by the system. |

Method 2: Configure using PAI commands

Run the following PAI command via the SQL Script component.

PAI -name logisticregression_binary
    -project algo_public
    -DmodelName="xlab_m_logistic_regression_6096"
    -DregularizedLevel="1"
    -DmaxIter="100"
    -DregularizedType="l1"
    -Depsilon="0.000001"
    -DlabelColName="y"
    -DfeatureColNames="pdays,emp_var_rate"
    -DgoodValue="1"
    -DinputTableName="bank_data"

Required parameters

| Parameter | Description |
| --- | --- |
| inputTableName | Name of the input table. |
| labelColName | Label column in the input table. |
| modelName | Name of the output offline model. |

Optional parameters

| Parameter | Description | Default |
| --- | --- | --- |
| featureColNames | Feature columns used for training. Maximum 20 million features. | All numeric columns |
| inputTablePartitions | Partitions to use from the input table. Formats: partition_name=value or name1=value1/name2=value2. Separate multiple partitions with commas. | Full table |
| regularizedType | Regularization type. Valid values: l1, l2, None. | l1 |
| regularizedLevel | Regularization coefficient. Has no effect when regularizedType is None. | 1.0 |
| maxIter | Maximum number of L-BFGS iterations. | 100 |
| epsilon | Convergence threshold. Training stops when the log-likelihood improvement between two iterations falls below this value. | 1.0e-06 |
| goodValue | The label value corresponding to the positive class. Randomly assigned if not specified. | |
| enableSparse | Whether the input data is in sparse format. Valid values: true, false. | false |
| itemDelimiter | Delimiter between key-value pairs in sparse input. | , (comma) |
| kvDelimiter | Delimiter between keys and values in sparse input. | : (colon) |
| coreNum | Number of cores. | Automatically allocated |
| memSizePerCore | Memory size per core, in MB. | Automatically allocated |
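
When the command is generated from a script, it can help to check the three required parameters before submitting it. The helper below is a hypothetical sketch, not part of PAI: `build_pai_command` only assembles text in the `-Dname="value"` layout shown above and does not escape quotes inside values.

```python
REQUIRED = ("inputTableName", "labelColName", "modelName")


def build_pai_command(params, name="logisticregression_binary",
                      project="algo_public"):
    # Hypothetical helper: renders a PAI command string from a parameter dict.
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    lines = ["PAI -name " + name, "    -project " + project]
    for key, value in params.items():
        # Each parameter becomes one -Dkey="value" line.
        lines.append('    -D{}="{}"'.format(key, value))
    return "\n".join(lines) + ";"


command = build_pai_command({
    "modelName": "lr_test_model",
    "labelColName": "label",
    "inputTableName": "lr_test_input",
    "regularizedType": "l2",
})
```

Optional parameters that are left out of the dictionary are simply omitted from the command, so the defaults in the table apply.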

Choose a regularization type

| Regularization type | When to use | Effect |
| --- | --- | --- |
| L1 (default) | High-dimensional data with many irrelevant features | Drives coefficients of irrelevant features to zero, producing a sparse model |
| L2 | Most features are relevant; you want to prevent large coefficients | Shrinks all coefficients toward zero without eliminating them |
| None | Small datasets or when you want to understand the unregularized baseline | No regularization applied; regularizedLevel is ignored |
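
The difference between the L1 and L2 effects can be seen on a handful of coefficients. The sketch below uses the textbook shrinkage operators for each penalty (soft-thresholding for L1, uniform scaling for L2). It is an assumption for illustration and does not describe how the component applies regularization inside L-BFGS; `l1_shrink` and `l2_shrink` are hypothetical names.

```python
def l1_shrink(coefficients, strength):
    # Soft-thresholding: small coefficients become exactly zero.
    shrunk = []
    for w in coefficients:
        magnitude = max(abs(w) - strength, 0.0)
        if magnitude == 0.0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk


def l2_shrink(coefficients, strength):
    # Shrinkage for the penalty (strength / 2) * w**2: every coefficient
    # is scaled by the same factor and none reaches zero.
    return [w / (1.0 + strength) for w in coefficients]


coefficients = [0.05, -0.3, 2.0]
l1_result = l1_shrink(coefficients, 0.1)
l2_result = l2_shrink(coefficients, 1.0)
```

With L1, the coefficient 0.05 is removed entirely while the larger ones lose a fixed amount. With L2, all three stay nonzero and shrink proportionally.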

Sparse data format

When enableSparse is true, the component reads each cell as a key-value string. Keys are zero-based indexes; values must be numeric. Non-numeric key values cause an error.

  • itemDelimiter separates key-value pairs within a cell.

  • kvDelimiter separates a key from its value.

Example input with the default delimiters (, and :):

key_value
1:100,4:200,5:300
1:10,2:20,3:30
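
To check that a cell follows this format before loading it, a small parser is enough. This is a sketch under stated assumptions: `parse_sparse_cell` and `to_dense` are hypothetical helpers, and the choice to reject non-numeric keys or values with `ValueError` mirrors the rule above rather than the component's actual error handling.

```python
def parse_sparse_cell(cell, item_delimiter=",", kv_delimiter=":"):
    # Split a cell such as "1:100,4:200" into {feature index: value}.
    features = {}
    for item in cell.split(item_delimiter):
        key_text, separator, value_text = item.partition(kv_delimiter)
        if not separator:
            raise ValueError("missing key-value delimiter in " + repr(item))
        # int() and float() raise ValueError on non-numeric text.
        features[int(key_text)] = float(value_text)
    return features


def to_dense(features, width):
    # Expand the sparse mapping into a zero-filled row of the given width.
    row = [0.0] * width
    for index, value in features.items():
        row[index] = value
    return row


first = parse_sparse_cell("1:100,4:200,5:300")
second = to_dense(parse_sparse_cell("1:10,2:20,3:30"), 6)
```

Indexes that do not appear in a cell are treated as zero, which is what makes the format compact for high-dimensional data.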

Example

This example walks through a full training-to-prediction workflow using a dense dataset.

Step 1: Create the training table

DROP TABLE IF EXISTS lr_test_input;
CREATE TABLE lr_test_input AS
SELECT *
FROM (
    SELECT CAST(1 AS DOUBLE) AS f0, CAST(0 AS DOUBLE) AS f1, CAST(0 AS DOUBLE) AS f2, CAST(0 AS DOUBLE) AS f3, CAST(0 AS BIGINT) AS label
    UNION ALL
    SELECT CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS BIGINT)
    UNION ALL
    SELECT CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(1 AS BIGINT)
    UNION ALL
    SELECT CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(1 AS BIGINT)
    UNION ALL
    SELECT CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS BIGINT)
    UNION ALL
    SELECT CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS BIGINT)
) a;

The resulting table lr_test_input:

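| f0 | f1 | f2 | f3 | label |
| --- | --- | --- | --- | --- |
| 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 0.0 | 0.0 | 0.0 | 1.0 | 1 |
| 0.0 | 1.0 | 0.0 | 0.0 | 0 |
| 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 0.0 | 1.0 | 0.0 | 0.0 | 0 |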

Step 2: Train the model

DROP OFFLINEMODEL IF EXISTS lr_test_model;
PAI -name logisticregression_binary
    -project algo_public
    -DmodelName="lr_test_model"
    -DitemDelimiter=","
    -DregularizedLevel="1"
    -DmaxIter="100"
    -DregularizedType="None"
    -Depsilon="0.000001"
    -DkvDelimiter=":"
    -DlabelColName="label"
    -DfeatureColNames="f0,f1,f2,f3"
    -DenableSparse="false"
    -DgoodValue="1"
    -DinputTableName="lr_test_input";

Step 3: Run prediction

For details on the Prediction component parameters, see Prediction.

DROP TABLE IF EXISTS lr_test_prediction_result;
PAI -name prediction
    -project algo_public
    -DdetailColName="prediction_detail"
    -DmodelName="lr_test_model"
    -DitemDelimiter=","
    -DresultColName="prediction_result"
    -Dlifecycle="28"
    -DoutputTableName="lr_test_prediction_result"
    -DscoreColName="prediction_score"
    -DkvDelimiter=":"
    -DinputTableName="lr_test_input"
    -DenableSparse="false"
    -DappendColNames="label";

Step 4: Review results

The output table lr_test_prediction_result contains:

| label | prediction_result | prediction_score | prediction_detail |
| --- | --- | --- | --- |
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
| 1 | 1 | 0.999999799574135 | {"0": 2.004258650156743e-07, "1": 0.999999799574135} |
| 1 | 1 | 0.999999799574135 | {"0": 2.004258650156743e-07, "1": 0.999999799574135} |
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |

Output columns:

  • prediction_result — the predicted class label (0 or 1)

  • prediction_score — the probability of the predicted class

  • prediction_detail — probability distribution across all classes as a JSON object

All predictions match the true labels, with confidence scores above 0.9999.
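
The relationship between the three output columns can be checked by parsing prediction_detail. The sketch below assumes the rows have already been exported from lr_test_prediction_result into Python tuples; the `summarize` helper is hypothetical, and only the two distinct rows from the table above are included.

```python
import json

# (label, prediction_result, prediction_score, prediction_detail)
rows = [
    (0, 0, 0.9999998793434426,
     '{"0": 0.9999998793434426, "1": 1.206565574533681e-07}'),
    (1, 1, 0.999999799574135,
     '{"0": 2.004258650156743e-07, "1": 0.999999799574135}'),
]


def summarize(detail_json):
    # Return the most probable class and its probability.
    detail = json.loads(detail_json)
    best_class = max(detail, key=detail.get)
    return int(best_class), detail[best_class]


checks = []
for label, result, score, detail_json in rows:
    predicted, probability = summarize(detail_json)
    # prediction_result is the most probable class in prediction_detail,
    # and prediction_score is that class's probability.
    checks.append(predicted == result and probability == score)
```

For these rows, every entry in `checks` is `True`: the class with the highest probability in prediction_detail is the one reported in prediction_result, and its probability is the value in prediction_score.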