Logistic Regression for Binary Classification trains a binary classifier using the logistic regression algorithm. The component supports both dense and sparse input data and uses the limited-memory BFGS (L-BFGS) optimizer.
How it works
Training runs in three stages:
Feature processing: The algorithm reads the specified feature columns (DOUBLE or BIGINT types) and the label column from the input table. If the input is in sparse format, the component parses key-value pairs before processing.
Training: L-BFGS iteratively minimizes the loss function. Each iteration updates the model weights. Training stops when the log-likelihood improvement between two consecutive iterations falls below epsilon, or when the maximum number of iterations is reached.
Prediction: The trained model assigns each row a predicted class label (prediction_result), a confidence score (prediction_score), and a probability distribution across all classes (prediction_detail).
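The two stopping conditions in the training stage can be sketched in plain Python. This is a minimal illustration, not the component's implementation: it swaps in fixed-step gradient ascent for L-BFGS to stay short and dependency-free, and the names `train`, `score`, `step`, and `log_likelihood` are hypothetical. Only the roles of `epsilon` and `max_iter` mirror the description above.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def score(weights, bias, x):
    """Linear score w.x + b for one row."""
    return bias + sum(w * v for w, v in zip(weights, x))

def log_likelihood(weights, bias, rows, labels):
    """Sum of log p(y | x) over all rows, with labels in {0, 1}."""
    total = 0.0
    for x, y in zip(rows, labels):
        z = score(weights, bias, x)
        # log p = y*z - log(1 + e^z); the second term is computed stably.
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def train(rows, labels, max_iter=100, epsilon=1e-6, step=0.5):
    """Gradient ascent (a stand-in for L-BFGS) with the two stopping rules."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    previous = log_likelihood(weights, bias, rows, labels)
    for iteration in range(1, max_iter + 1):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = y - sigmoid(score(weights, bias, x))
            grad_b += error
            for j, v in enumerate(x):
                grad_w[j] += error * v
        weights = [w + step * g for w, g in zip(weights, grad_w)]
        bias += step * grad_b
        current = log_likelihood(weights, bias, rows, labels)
        # Stop when the log-likelihood improvement drops below epsilon.
        if current - previous < epsilon:
            return weights, bias, iteration, "converged"
        previous = current
    # Otherwise stop once the iteration budget is used up.
    return weights, bias, max_iter, "max_iter reached"

# Assumed toy data: four one-hot features, positive class = 1.
rows = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
        [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
labels = [0, 0, 1, 1, 0, 0]
weights, bias, iterations, status = train(rows, labels)
print(iterations, status)
```

On linearly separable data like this toy set, the log-likelihood can keep improving by small amounts, so either condition may be the one that ends the run. A larger `epsilon` stops training earlier at the cost of a less tightly fitted model.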
Configure the component
Choose one of the following methods.
Method 1: Configure in Machine Learning Designer
Open the pipeline configuration tab in Machine Learning Designer, then set the parameters on the component panel.
Fields setting
| Parameter | Description |
|---|---|
| Training feature columns | Feature columns from the input table used for training. Supports DOUBLE and BIGINT types. Maximum 20 million features. |
| Target columns | The label column in the input table. |
| Positive class value | The label value that represents the positive class in binary classification. |
| Use sparse format | Enable if the input data is in sparse format. |
Parameters setting
| Parameter | Description | Default |
|---|---|---|
| Regularization type | Regularization to apply during training. Valid values: None, L1, L2. See Choose a regularization type. | — |
| Maximum iterations | Maximum number of L-BFGS iterations. | 100 |
| Regularization coefficient | Strength of regularization. Has no effect when Regularization type is None. | — |
| Minimum convergence deviance | Convergence threshold (epsilon). Training stops when the log-likelihood improvement between two iterations falls below this value. | 0.000001 |
Tuning
| Parameter | Description |
|---|---|
| Cores | Automatically set by the system. |
| Memory size per core | Automatically set by the system. |
Method 2: Configure using PAI commands
Run the following PAI command via the SQL Script component.
PAI -name logisticregression_binary
-project algo_public
-DmodelName="xlab_m_logistic_regression_6096"
-DregularizedLevel="1"
-DmaxIter="100"
-DregularizedType="l1"
-Depsilon="0.000001"
-DlabelColName="y"
-DfeatureColNames="pdays,emp_var_rate"
-DgoodValue="1"
-DinputTableName="bank_data"
Required parameters
| Parameter | Description |
|---|---|
| inputTableName | Name of the input table. |
| labelColName | Label column in the input table. |
| modelName | Name of the output offline model. |
Optional parameters
| Parameter | Description | Default |
|---|---|---|
| featureColNames | Feature columns used for training. Maximum 20 million features. | All numeric columns |
| inputTablePartitions | Partitions to use from the input table. Formats: partition_name=value or name1=value1/name2=value2. Separate multiple partitions with commas. | Full table |
| regularizedType | Regularization type. Valid values: l1, l2, None. | l1 |
| regularizedLevel | Regularization coefficient. Has no effect when regularizedType is None. | 1.0 |
| maxIter | Maximum number of L-BFGS iterations. | 100 |
| epsilon | Convergence threshold. Training stops when the log-likelihood improvement between two iterations falls below this value. | 1.0e-06 |
| goodValue | The label value corresponding to the positive class. Randomly assigned if not specified. | — |
| enableSparse | Whether the input data is in sparse format. Valid values: true, false. | false |
| itemDelimiter | Delimiter between key-value pairs in sparse input. | , (comma) |
| kvDelimiter | Delimiter between keys and values in sparse input. | : (colon) |
| coreNum | Number of cores. | Automatically allocated |
| memSizePerCore | Memory size per core, in MB. | Automatically allocated |
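When commands are generated by a script rather than typed by hand, the parameter tables above map naturally onto a small helper. The following is a hedged sketch: `build_pai_command` is a hypothetical function, not part of any PAI SDK, and it only renders text in the same `-Dname="value"` form used on this page. It checks the three required parameters and leaves all value validation to the component.

```python
# Required parameters, as listed in the table above.
REQUIRED = ("inputTableName", "labelColName", "modelName")

def build_pai_command(algorithm, params, project="algo_public"):
    """Render a PAI command string from a dict of -D parameters."""
    missing = [name for name in REQUIRED if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    lines = [f"PAI -name {algorithm}", f"-project {project}"]
    for key, value in params.items():
        # Every parameter is passed as a quoted string.
        lines.append(f'-D{key}="{value}"')
    return "\n".join(lines) + ";"

command = build_pai_command(
    "logisticregression_binary",
    {
        "modelName": "lr_test_model",
        "inputTableName": "lr_test_input",
        "labelColName": "label",
        "featureColNames": "f0,f1,f2,f3",
        "regularizedType": "l1",
        "regularizedLevel": "1.0",
        "maxIter": "100",
        "goodValue": "1",
    },
)
print(command)
```

Optional parameters that are left out of the dict are simply not emitted, so the component falls back to the defaults listed in the table.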
Choose a regularization type
| Regularization type | When to use | Effect |
|---|---|---|
| L1 (default) | High-dimensional data with many irrelevant features | Drives coefficients of irrelevant features to zero, producing a sparse model |
| L2 | Most features are relevant; you want to prevent large coefficients | Shrinks all coefficients toward zero without eliminating them |
| None | Small datasets or when you want to understand the unregularized baseline | No regularization applied; regularizedLevel is ignored |
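The difference between the two effects in the table can be illustrated on single coefficients. The sketch below uses the textbook shrinkage updates associated with each penalty (soft-thresholding for L1, proportional scaling for L2). It is an assumption-laden illustration of the general behavior, not a description of how the component applies regularization internally, and the function names are hypothetical.

```python
def l1_shrink(weight, strength):
    """Soft-thresholding: small weights become exactly zero."""
    if weight > strength:
        return weight - strength
    if weight < -strength:
        return weight + strength
    return 0.0

def l2_shrink(weight, strength):
    """Proportional shrinkage: weights get smaller but stay nonzero."""
    return weight / (1.0 + strength)

weights = [2.5, 0.3, -0.1, -4.0]
after_l1 = [l1_shrink(w, 1.0) for w in weights]
after_l2 = [l2_shrink(w, 1.0) for w in weights]
print(after_l1)
print(after_l2)
```

With a strength of 1.0, the two small weights are removed entirely under the L1 update, while the L2 update halves every weight and keeps all four features in the model.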
Sparse data format
When enableSparse is true, the component reads each cell as a key-value string. Keys are zero-based indexes; values must be numeric. Non-numeric key values cause an error.
itemDelimiter separates key-value pairs within a cell.
kvDelimiter separates a key from its value.
Example input with the default delimiters (, and :):
| key_value |
|---|
| 1:100,4:200,5:300 |
| 1:10,2:20,3:30 |
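To check sparse cells before training, the format can be parsed locally. This is a minimal sketch under the rules stated above (integer keys starting at 0, numeric values); `parse_sparse_cell` is a hypothetical helper, and the component's own parser may differ in details such as whitespace handling.

```python
def parse_sparse_cell(cell, item_delimiter=",", kv_delimiter=":"):
    """Parse one sparse cell into a {feature_index: value} dict."""
    features = {}
    for item in cell.split(item_delimiter):
        key_text, separator, value_text = item.partition(kv_delimiter)
        if not separator:
            raise ValueError(f"missing {kv_delimiter!r} in item {item!r}")
        key = int(key_text)        # ValueError for a non-numeric key
        value = float(value_text)  # ValueError for a non-numeric value
        if key < 0:
            raise ValueError(f"negative feature index in item {item!r}")
        features[key] = value
    return features

print(parse_sparse_cell("1:100,4:200,5:300"))
print(parse_sparse_cell("1=10;2=20", item_delimiter=";", kv_delimiter="="))
```

Indexes that do not appear in a cell are treated as absent here; a dense vector can be built by filling those positions with 0.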
Example
This example walks through a full training-to-prediction workflow using a dense dataset.
Step 1: Create the training table
DROP TABLE IF EXISTS lr_test_input;
CREATE TABLE lr_test_input AS
SELECT *
FROM (
SELECT CAST(1 AS DOUBLE) AS f0, CAST(0 AS DOUBLE) AS f1, CAST(0 AS DOUBLE) AS f2, CAST(0 AS DOUBLE) AS f3, CAST(0 AS BIGINT) AS label
UNION ALL
SELECT CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS BIGINT)
UNION ALL
SELECT CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(1 AS BIGINT)
UNION ALL
SELECT CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(1 AS BIGINT)
UNION ALL
SELECT CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS BIGINT)
UNION ALL
SELECT CAST(0 AS DOUBLE), CAST(1 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS DOUBLE), CAST(0 AS BIGINT)
) a;
The resulting table lr_test_input:
| f0 | f1 | f2 | f3 | label |
|---|---|---|---|---|
| 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 0.0 | 0.0 | 0.0 | 1.0 | 1 |
| 0.0 | 1.0 | 0.0 | 0.0 | 0 |
| 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 0.0 | 1.0 | 0.0 | 0.0 | 0 |
Step 2: Train the model
DROP OFFLINEMODEL IF EXISTS lr_test_model;
PAI -name logisticregression_binary
-project algo_public
-DmodelName="lr_test_model"
-DitemDelimiter=","
-DregularizedLevel="1"
-DmaxIter="100"
-DregularizedType="None"
-Depsilon="0.000001"
-DkvDelimiter=":"
-DlabelColName="label"
-DfeatureColNames="f0,f1,f2,f3"
-DenableSparse="false"
-DgoodValue="1"
-DinputTableName="lr_test_input";
Step 3: Run prediction
For details on the Prediction component parameters, see Prediction.
DROP TABLE IF EXISTS lr_test_prediction_result;
PAI -name prediction
-project algo_public
-DdetailColName="prediction_detail"
-DmodelName="lr_test_model"
-DitemDelimiter=","
-DresultColName="prediction_result"
-Dlifecycle="28"
-DoutputTableName="lr_test_prediction_result"
-DscoreColName="prediction_score"
-DkvDelimiter=":"
-DinputTableName="lr_test_input"
-DenableSparse="false"
-DappendColNames="label";
Step 4: Review results
The output table lr_test_prediction_result contains:
| label | prediction_result | prediction_score | prediction_detail |
|---|---|---|---|
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
| 1 | 1 | 0.999999799574135 | {"0": 2.004258650156743e-07, "1": 0.999999799574135} |
| 1 | 1 | 0.999999799574135 | {"0": 2.004258650156743e-07, "1": 0.999999799574135} |
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
| 0 | 0 | 0.9999998793434426 | {"0": 0.9999998793434426, "1": 1.206565574533681e-07} |
Output columns:
prediction_result: the predicted class label (0 or 1)
prediction_score: the probability of the predicted class
prediction_detail: the probability distribution across all classes, as a JSON object
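All six predictions match the true labels, with confidence scores above 0.9999. Because the prediction step scores the same table that was used for training, this shows that the model fits the training data; it does not measure accuracy on unseen data.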
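The relationship between the three output columns can be checked from prediction_detail alone. The sketch below assumes the column is stored as a JSON string with class labels as keys, as shown in the results table; `summarize_detail` is a hypothetical helper name.

```python
import json

def summarize_detail(detail_json):
    """Return (predicted_label, score) from a prediction_detail string."""
    probabilities = json.loads(detail_json)
    # The predicted label is the class with the highest probability.
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]

# First row of the results table above.
detail = '{"0": 0.9999998793434426, "1": 1.206565574533681e-07}'
label, score = summarize_detail(detail)
print(label, score)
```

For this row, the label and score recovered from the JSON object agree with the prediction_result and prediction_score columns, and the two class probabilities sum to 1 up to floating-point rounding.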