Logistic regression for binary classification

Logistic Regression for Binary Classification trains a binary classifier using the logistic regression algorithm. The component supports both dense and sparse input data and uses the limited-memory BFGS (L-BFGS) optimizer.

How it works

Training runs in three stages:

1. Feature processing: The algorithm reads the specified feature columns (DOUBLE or BIGINT types) and the label column from the input table. If the input is in sparse format, the component parses key-value pairs before processing.
2. Training: L-BFGS iteratively minimizes the loss function. Each iteration updates the model weights. Training stops when the log-likelihood improvement between two consecutive iterations falls below epsilon, or when the maximum number of iterations is reached.
3. Prediction: The trained model assigns each row a predicted class label (prediction_result), a confidence score (prediction_score), and a probability distribution across all classes (prediction_detail).
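The training stage and its two stopping rules can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: plain gradient ascent stands in for L-BFGS, no regularization is applied, and the names sigmoid, log_likelihood, train, and the step parameter are hypothetical. The sample rows mirror the dense table lr_test_input used in the Example section.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(weights, bias, rows, labels):
    """Sum of log p(y | x) over all rows, for labels in {0, 1}."""
    total = 0.0
    for x, y in zip(rows, labels):
        z = bias + sum(w * v for w, v in zip(weights, x))
        # y*z - log(1 + exp(z)), with the log term computed stably
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def train(rows, labels, max_iter=100, epsilon=1e-6, step=0.5):
    """Gradient ascent (a stand-in for L-BFGS) with the two stopping rules."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    previous = log_likelihood(weights, bias, rows, labels)
    iterations = 0
    while iterations < max_iter:  # rule 1: maximum number of iterations
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            error = y - sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            grad_b += error
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
        weights = [w + step * g for w, g in zip(weights, grad_w)]
        bias += step * grad_b
        iterations += 1
        current = log_likelihood(weights, bias, rows, labels)
        if current - previous < epsilon:  # rule 2: improvement below epsilon
            break
        previous = current
    return weights, bias, iterations

# Dense rows shaped like lr_test_input: features f0..f3 and a 0/1 label.
ROWS = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
        [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
LABELS = [0, 0, 1, 1, 0, 0]
weights, bias, iterations = train(ROWS, LABELS)
```

A larger epsilon ends training earlier at the cost of a less precise fit, while a smaller maxIter caps the run time regardless of convergence.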

Configure the component

Choose one of the following methods.

Method 1: Configure in Machine Learning Designer

Open the pipeline configuration tab in Machine Learning Designer, then set the parameters on the component panel.

Fields setting

Parameter	Description
Training feature columns	Feature columns from the input table used for training. Supports DOUBLE and BIGINT types. Maximum 20 million features.
Target columns	The label column in the input table.
Positive class value	The label value that represents the positive class in binary classification.
Use sparse format	Enable if the input data is in sparse format.

Parameters setting

Parameter	Description	Default
Regularization type	Regularization to apply during training. Valid values: None, L1, L2. See Choose a regularization type.	—
Maximum iterations	Maximum number of L-BFGS iterations.	100
Regularization coefficient	Strength of regularization. Has no effect when Regularization type is None.	—
Minimum convergence deviance	Convergence threshold (`epsilon`). Training stops when the log-likelihood improvement between two iterations falls below this value.	0.000001

Tuning

Parameter	Description
Cores	Automatically set by the system.
Memory size per core	Automatically set by the system.

Method 2: Configure using PAI commands

Run the following PAI command via the SQL Script component.

PAI -name logisticregression_binary
    -project algo_public
    -DmodelName="xlab_m_logistic_regression_6096"
    -DregularizedLevel="1"
    -DmaxIter="100"
    -DregularizedType="l1"
    -Depsilon="0.000001"
    -DlabelColName="y"
    -DfeatureColNames="pdays,emp_var_rate"
    -DgoodValue="1"
    -DinputTableName="bank_data";

Required parameters

Parameter	Description
`inputTableName`	Name of the input table.
`labelColName`	Label column in the input table.
`modelName`	Name of the output offline model.

Optional parameters

Parameter	Description	Default
`featureColNames`	Feature columns used for training. Maximum 20 million features.	All numeric columns
`inputTablePartitions`	Partitions to use from the input table. Formats: `partition_name=value` or `name1=value1/name2=value2`. Separate multiple partitions with commas.	Full table
`regularizedType`	Regularization type. Valid values: `l1`, `l2`, `None`.	`l1`
`regularizedLevel`	Regularization coefficient. Has no effect when `regularizedType` is `None`.	`1.0`
`maxIter`	Maximum number of L-BFGS iterations.	`100`
`epsilon`	Convergence threshold. Training stops when the log-likelihood improvement between two iterations falls below this value.	`1.0e-06`
`goodValue`	Label value that represents the positive class. If omitted, the component assigns the positive class at random, so set it explicitly when you need consistent results.	—
`enableSparse`	Whether the input data is in sparse format. Valid values: `true`, `false`.	`false`
`itemDelimiter`	Delimiter between key-value pairs in sparse input.	`,` (comma)
`kvDelimiter`	Delimiter between keys and values in sparse input.	`:` (colon)
`coreNum`	Number of cores.	Automatically allocated
`memSizePerCore`	Memory size per core, in MB.	Automatically allocated
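When commands are generated by a script, the required parameters and the valid regularizedType values listed above can be checked before the command is submitted. The helper below is a hypothetical sketch: build_pai_command is not part of PAI or any SDK, and it only renders the command text in the -Dname="value" form shown earlier.

```python
REQUIRED = ("inputTableName", "labelColName", "modelName")
REGULARIZED_TYPES = ("l1", "l2", "None")

def build_pai_command(params, name="logisticregression_binary", project="algo_public"):
    """Validate the parameters, then render them as a PAI command string."""
    missing = [key for key in REQUIRED if key not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    # regularizedType falls back to the documented default, l1.
    reg_type = params.get("regularizedType", "l1")
    if reg_type not in REGULARIZED_TYPES:
        raise ValueError("unsupported regularizedType: " + str(reg_type))
    lines = ["PAI -name " + name, "    -project " + project]
    lines += ['    -D{}="{}"'.format(key, value) for key, value in params.items()]
    return "\n".join(lines) + ";"

command = build_pai_command({
    "modelName": "xlab_m_logistic_regression_6096",
    "labelColName": "y",
    "featureColNames": "pdays,emp_var_rate",
    "regularizedType": "l1",
    "inputTableName": "bank_data",
})
print(command)
```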

Choose a regularization type

Regularization type	When to use	Effect
L1 (default)	High-dimensional data with many irrelevant features	Drives coefficients of irrelevant features to zero, producing a sparse model
L2	Most features are relevant; you want to prevent large coefficients	Shrinks all coefficients toward zero without eliminating them
None	Small datasets or when you want to understand the unregularized baseline	No regularization applied; `regularizedLevel` is ignored
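The difference between the two penalties can be seen in the textbook shrinkage step associated with each one: soft-thresholding for L1 and proportional scaling for L2. The sketch below is a generic illustration, not the update the component runs internally; the function names and sample coefficients are hypothetical.

```python
def l1_shrink(coef, strength):
    """Soft-thresholding: move coef toward zero by `strength`, stopping at zero."""
    if coef > strength:
        return coef - strength
    if coef < -strength:
        return coef + strength
    return 0.0

def l2_shrink(coef, strength):
    """Proportional shrinkage: scale coef toward zero without reaching it."""
    return coef / (1.0 + strength)

coefs = [2.0, -0.5, 0.25]
l1_result = [l1_shrink(c, 1.0) for c in coefs]  # [1.0, 0.0, 0.0]
l2_result = [l2_shrink(c, 1.0) for c in coefs]  # [1.0, -0.25, 0.125]
```

With the same strength, L1 sets the two small coefficients to exactly zero, while L2 leaves every coefficient nonzero but smaller.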

Sparse data format

When enableSparse is true, the component reads each cell as a string of key-value pairs. Keys are zero-based feature indexes, and values must be numeric. A non-numeric key or value causes an error.

itemDelimiter separates key-value pairs within a cell.
kvDelimiter separates a key from its value.

Example input with the default delimiters (, and :):

key_value
1:100,4:200,5:300
1:10,2:20,3:30
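To make the format concrete, the following sketch parses one sparse cell into an index-to-value mapping using the two delimiters. It illustrates the format only and is not the component's parser; parse_sparse_cell and to_dense are hypothetical names, and treating an empty cell as a row with no features is an assumption.

```python
def parse_sparse_cell(cell, item_delimiter=",", kv_delimiter=":"):
    """Parse a cell such as '1:100,4:200' into {feature_index: value}."""
    features = {}
    if not cell.strip():
        return features  # assumption: an empty cell means no non-zero features
    for item in cell.split(item_delimiter):
        key, value = item.split(kv_delimiter)
        # int() and float() raise ValueError for non-numeric input.
        features[int(key)] = float(value)
    return features

def to_dense(features, size):
    """Expand the mapping into a dense list; missing indexes become 0.0."""
    return [features.get(index, 0.0) for index in range(size)]

row = parse_sparse_cell("1:100,4:200,5:300")
dense = to_dense(row, 6)
```

Here row is {1: 100.0, 4: 200.0, 5: 300.0}, and dense has zeros at indexes 0, 2, and 3.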

Example

This example walks through a full training-to-prediction workflow using a dense dataset.

Step 1: Create the training table

DROP TABLE IF EXISTS lr_test_input;
CREATE TABLE lr_test_input AS
SELECT *
FROM (
    SELECT CAST(1 AS DOUBLE) AS f0, CAST(0 AS DOUBLE) AS f1, CAST(0 AS DOUBLE) AS f2, CAST(0 AS DOUBLE) AS f3, CAST(0 AS BIGINT) AS label
    UNION ALL
    SELECT CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS BIGINT)
    UNION ALL
    SELECT CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(1 AS BIGINT)
    UNION ALL
    SELECT CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(1 AS BIGINT)
    UNION ALL
    SELECT CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS BIGINT)
    UNION ALL
    SELECT CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS BIGINT)
) a;

The resulting table lr_test_input:

f0	f1	f2	f3	label
1.0	0.0	0.0	0.0	0
0.0	0.0	1.0	0.0	1
0.0	0.0	0.0	1.0	1
0.0	1.0	0.0	0.0	0
1.0	0.0	0.0	0.0	0
0.0	1.0	0.0	0.0	0

Step 2: Train the model

DROP OFFLINEMODEL IF EXISTS lr_test_model;
PAI -name logisticregression_binary
    -project algo_public
    -DmodelName="lr_test_model"
    -DitemDelimiter=","
    -DregularizedLevel="1"
    -DmaxIter="100"
    -DregularizedType="None"
    -Depsilon="0.000001"
    -DkvDelimiter=":"
    -DlabelColName="label"
    -DfeatureColNames="f0,f1,f2,f3"
    -DenableSparse="false"
    -DgoodValue="1"
    -DinputTableName="lr_test_input";

Step 3: Run prediction

For details on the Prediction component parameters, see Prediction.

DROP TABLE IF EXISTS lr_test_prediction_result;
PAI -name prediction
    -project algo_public
    -DdetailColName="prediction_detail"
    -DmodelName="lr_test_model"
    -DitemDelimiter=","
    -DresultColName="prediction_result"
    -Dlifecycle="28"
    -DoutputTableName="lr_test_prediction_result"
    -DscoreColName="prediction_score"
    -DkvDelimiter=":"
    -DinputTableName="lr_test_input"
    -DenableSparse="false"
    -DappendColNames="label";

Step 4: Review results

The output table lr_test_prediction_result contains:

label	prediction_result	prediction_score	prediction_detail
0	0	0.9999998793434426	{"0": 0.9999998793434426, "1": 1.206565574533681e-07}
1	1	0.999999799574135	{"0": 2.004258650156743e-07, "1": 0.999999799574135}
1	1	0.999999799574135	{"0": 2.004258650156743e-07, "1": 0.999999799574135}
0	0	0.9999998793434426	{"0": 0.9999998793434426, "1": 1.206565574533681e-07}
0	0	0.9999998793434426	{"0": 0.9999998793434426, "1": 1.206565574533681e-07}
0	0	0.9999998793434426	{"0": 0.9999998793434426, "1": 1.206565574533681e-07}

Output columns:

prediction_result — the predicted class label (0 or 1)
prediction_score — the probability of the predicted class
prediction_detail — probability distribution across all classes as a JSON object

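All six predictions match the true labels, with confidence scores above 0.9999. Because the prediction input is the training table itself, this shows that the model fits the training data; it does not measure performance on unseen data.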
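The relationship between the three output columns can be reproduced from a single positive-class probability. The sketch below is inferred from the rows shown above rather than taken from the Prediction component; format_prediction is a hypothetical name, and resolving an exact tie in favor of the negative class is an assumption.

```python
import json

def format_prediction(p_positive, positive_label="1", negative_label="0"):
    """Derive the three output columns from the positive-class probability."""
    detail = {negative_label: 1.0 - p_positive, positive_label: p_positive}
    # The predicted class is the most probable one; on an exact tie, max()
    # keeps the first key, which is the negative label (an assumption).
    result = max(detail, key=detail.get)
    return {
        "prediction_result": result,
        "prediction_score": detail[result],
        "prediction_detail": json.dumps(detail),
    }

positive_row = format_prediction(0.999999799574135)
negative_row = format_prediction(1.206565574533681e-07)
```

In both cases prediction_score equals the larger of the two probabilities in prediction_detail, which is why every score in the table is close to 1 regardless of the predicted class.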
