Logistic regression for multiclass classification

Standard logistic regression handles binary classification. The Logistic Regression for Multiclass Classification component in Platform for AI (PAI) extends it to support multiclass classification using the L-BFGS optimization algorithm. The component accepts both sparse and dense input data formats.
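
As a rough illustration of how the binary model generalizes to several classes, the sketch below turns per-class linear scores into probabilities with a softmax. Softmax is a common multiclass formulation and is used here as an assumption; this page does not state which formulation the component uses internally. The weights, biases, and function names are hypothetical and do not come from a trained PAI model.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(features, weights, biases):
    """Compute one linear score per class, then normalize with softmax."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical parameters for 3 classes and 4 features (illustrative only).
weights = [
    [2.0, 2.0, -1.0, -1.0],   # class 0
    [-1.0, -1.0, -1.0, 2.0],  # class 1
    [-1.0, -1.0, 2.0, -1.0],  # class 2
]
biases = [0.0, 0.0, 0.0]

probs = predict_proba([1.0, 0.0, 0.0, 0.0], weights, biases)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

With these made-up weights, the class with the largest linear score receives the largest probability, which is the same relationship the prediction output later in this topic exposes.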

Configure the component

Two configuration methods are available. Use Machine Learning Designer for visual, no-code setup. Use PAI commands for scripted or pipeline-automated workflows.

Method 1: Configure in Machine Learning Designer

Open the Logistic Regression for Multiclass Classification component in Machine Learning Designer (formerly Machine Learning Studio) and set the following parameters.

Fields Setting tab

| Parameter | Description |
| --- | --- |
| Training feature columns | Feature columns selected from the input table for training. Supports the DOUBLE and BIGINT data types. A maximum of 20 million features are supported. |
| Target columns | Label columns in the input table. |
| Sparse format | Whether the input data is in sparse format. |

Parameters Setting tab

| Parameter | Description |
| --- | --- |
| Regularization type | The penalty applied to the model during training. Valid values: L1, L2, and None. |
| Maximum number of iterations | The maximum number of L-BFGS iterations. Default: 100. |
| Regularization coefficient | The strength of the regularization penalty. Not applicable when Regularization type is set to None. |
| Minimum convergence deviance | The convergence threshold for the L-BFGS algorithm. Training stops when the difference in log-likelihood between consecutive iterations falls below this value. Default: 0.000001. |
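
The interaction between these four settings can be sketched in a few lines. The example below is an illustration under assumptions, not the component's implementation: the penalty formulas are the textbook L1 and L2 forms (the exact scaling PAI applies is not documented on this page), and penalty and stopping_iteration are hypothetical helper names.

```python
def penalty(weights, reg_type, coefficient):
    """Textbook regularization penalty; the scaling used by PAI is an assumption."""
    if reg_type == "None":
        return 0.0  # the coefficient is ignored, as the table describes
    if reg_type == "L1":
        return coefficient * sum(abs(w) for w in weights)
    if reg_type == "L2":
        return coefficient * sum(w * w for w in weights)
    raise ValueError(f"unsupported regularization type: {reg_type}")

def stopping_iteration(log_likelihoods, max_iter=100, epsilon=1e-6):
    """Return the iteration at which training would stop.

    Stops early when the change in log-likelihood between consecutive
    iterations falls below epsilon; otherwise stops at max_iter.
    """
    previous = None
    for iteration, current in enumerate(log_likelihoods[:max_iter], start=1):
        if previous is not None and abs(current - previous) < epsilon:
            return iteration
        previous = current
    return min(len(log_likelihoods), max_iter)

# Made-up log-likelihood trace: the last step changes by about 5e-7.
trace = [-10.0, -5.0, -4.0, -3.9999995]
stopped_at = stopping_iteration(trace)
capped_at = stopping_iteration(trace, max_iter=2)
```

In this made-up trace, training stops at the fourth iteration because the improvement drops below the default threshold. Lowering the iteration cap to 2 ends training before convergence is reached.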

Method 2: Use PAI commands

Pass parameters directly to the logisticregression_multi algorithm using PAI commands. Run PAI commands through the SQL Script component. For more information, see SQL Script.

The following example shows the command syntax:

PAI -name logisticregression_multi
    -project algo_public
    -DmodelName="xlab_m_logistic_regression_6096"
    -DregularizedLevel="1"
    -DmaxIter="100"
    -DregularizedType="l1"
    -Depsilon="0.000001"
    -DlabelColName="y"
    -DfeatureColNames="pdays,emp_var_rate"
    -DgoodValue="1"
    -DinputTableName="bank_data"
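
For scripted workflows, it can be convenient to assemble the command text from a dictionary of parameters. The helper below is a hypothetical sketch that only formats a string in the layout shown above. It is not part of PAI, and submitting the resulting text (for example, through the SQL Script component) is outside its scope.

```python
def build_pai_command(name, project, params):
    """Format a PAI command string; build_pai_command is a hypothetical helper."""
    lines = [f"PAI -name {name}", f"    -project {project}"]
    # Each parameter becomes one -Dkey="value" line, in insertion order.
    lines += [f'    -D{key}="{value}"' for key, value in params.items()]
    return "\n".join(lines) + ";"

command = build_pai_command(
    "logisticregression_multi",
    "algo_public",
    {
        "modelName": "xlab_m_logistic_regression_6096",
        "regularizedType": "l1",
        "regularizedLevel": "1",
        "maxIter": "100",
        "epsilon": "0.000001",
        "labelColName": "y",
        "featureColNames": "pdays,emp_var_rate",
        "inputTableName": "bank_data",
    },
)
print(command)
```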

Parameters

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | | Name of the input table. |
| featureColNames | No | All numeric columns | Feature columns selected from the input table for training. A maximum of 20 million features are supported. |
| labelColName | Yes | | Name of the label column. |
| inputTablePartitions | No | Full table | Partitions selected from the input table. Use partition_name=value for single partitions and name1=value1/name2=value2 for multi-level partitions. Separate multiple partitions with commas (,). |
| modelName | Yes | | Name of the output model. |
| regularizedType | No | l1 | Regularization type. Valid values: l1, l2, and None. |
| regularizedLevel | No | 1.0 | Regularization coefficient. Not applicable when regularizedType is None. |
| maxIter | No | 100 | Maximum number of L-BFGS iterations. |
| epsilon | No | 1.0e-06 | Convergence threshold for the L-BFGS algorithm. Training stops when the difference in log-likelihood between consecutive iterations is less than this value. |
| enableSparse | No | false | Whether the input data is in sparse format. Valid values: true and false. |
| itemDelimiter | No | , | Delimiter between key-value pairs in sparse-format input. |
| kvDelimiter | No | : | Delimiter between keys and values in sparse-format input. |
| coreNum | No | System default | Number of cores. |
| memSizePerCore | No | System default | Memory allocated per core, in MB. |
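
To make the roles of itemDelimiter and kvDelimiter concrete, the sketch below splits one sparse-format value into feature keys and numeric values. The "key:value,key:value" layout is inferred from the delimiter descriptions above. The sample string, the handling of whitespace, and the parse_sparse_row helper are assumptions for illustration, not the component's parser.

```python
def parse_sparse_row(text, item_delimiter=",", kv_delimiter=":"):
    """Split a sparse-format string into a {feature_key: value} dictionary."""
    features = {}
    for item in text.split(item_delimiter):
        item = item.strip()
        if not item:
            continue  # tolerate empty items such as a trailing delimiter
        key, value = item.split(kv_delimiter, 1)
        features[key.strip()] = float(value)
    return features

# Hypothetical sparse value using the default delimiters.
default_row = parse_sparse_row("1:0.5,3:1.2")

# The same data with custom delimiters passed explicitly.
custom_row = parse_sparse_row("1=0.5;3=1.2", item_delimiter=";", kv_delimiter="=")
```

Features that do not appear in the string are simply absent from the dictionary, which is what makes the format compact for high-dimensional data.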

Example

This example trains a multiclass logistic regression model on a four-feature dataset and runs predictions. All commands are run through the SQL Script component.

Step 1: Create training data

Run the following SQL statements to create the multi_lr_test_input table:

-- Recreate the training table from scratch.
drop table if exists multi_lr_test_input;
create table multi_lr_test_input
as
select
    *
from
(
    -- Each select below contributes one training row.
    select
        cast(1 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    union all
    select
        cast(0 as double) as f0,
        cast(1 as double) as f1,
        cast(0 as double) as f2,
        cast(0 as double) as f3,
        cast(0 as bigint) as label
    union all
    select
        cast(0 as double) as f0,
        cast(0 as double) as f1,
        cast(1 as double) as f2,
        cast(0 as double) as f3,
        cast(2 as bigint) as label
    union all
    select
        cast(0 as double) as f0,
        cast(0 as double) as f1,
        cast(0 as double) as f2,
        cast(1 as double) as f3,
        cast(1 as bigint) as label
) a;

The table contains four DOUBLE feature columns (f0 through f3) and one BIGINT label column:

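| f0 | f1 | f2 | f3 | label |
| --- | --- | --- | --- | --- |
| 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 0.0 | 0.0 | 1.0 | 0.0 | 2 |
| 0.0 | 0.0 | 0.0 | 1.0 | 1 |
| 0.0 | 1.0 | 0.0 | 0.0 | 0 |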

Step 2: Train the model

Run the following PAI command to train the model and save it as multi_lr_test_model:

drop offlinemodel if exists multi_lr_test_model;
PAI -name logisticregression_multi
    -project algo_public
    -DmodelName="multi_lr_test_model"
    -DitemDelimiter=","
    -DregularizedLevel="1"
    -DmaxIter="100"
    -DregularizedType="None"
    -Depsilon="0.000001"
    -DkvDelimiter=":"
    -DlabelColName="label"
    -DfeatureColNames="f0,f1,f2,f3"
    -DenableSparse="false"
    -DinputTableName="multi_lr_test_input";

Step 3: Run predictions

Run the following PAI command to generate predictions and write results to multi_lr_test_prediction_result:

drop table if exists multi_lr_test_prediction_result;
PAI -name prediction
    -project algo_public
    -DdetailColName="prediction_detail"
    -DmodelName="multi_lr_test_model"
    -DitemDelimiter=","
    -DresultColName="prediction_result"
    -Dlifecycle="28"
    -DoutputTableName="multi_lr_test_prediction_result"
    -DscoreColName="prediction_score"
    -DkvDelimiter=":"
    -DinputTableName="multi_lr_test_input"
    -DenableSparse="false"
    -DappendColNames="label";

Step 4: View results

Query the multi_lr_test_prediction_result table to review the prediction output:

| label | prediction_result | prediction_score | prediction_detail |
| --- | --- | --- | --- |
| 0 | 0 | 0.9999997274902165 | {"0": 0.9999997274902165, "1": 2.324679066261573e-07, "2": 2.324679066261569e-07} |
| 0 | 0 | 0.9999997274902165 | {"0": 0.9999997274902165, "1": 2.324679066261573e-07, "2": 2.324679066261569e-07} |
| 2 | 2 | 0.9999999155958832 | {"0": 2.018833979850994e-07, "1": 2.324679066261573e-07, "2": 0.9999999155958832} |
| 1 | 1 | 0.9999999155958832 | {"0": 2.018833979850994e-07, "1": 0.9999999155958832, "2": 2.324679066261569e-07} |

The output columns contain the following information:

  • prediction_result: The predicted class label.

  • prediction_score: The probability assigned to the predicted class.

  • prediction_detail: A JSON object mapping each class label to its predicted probability. Each key is a class label and each value is the model's confidence for that class. For example, {"0": 0.999..., "1": 2.32e-07, "2": 2.32e-07} indicates that the model assigns near-certainty to class 0 and near-zero probability to classes 1 and 2.
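
The relationship between the three output columns can be checked with a short script. The sketch below parses the prediction_detail value from the first result row and recovers the label with the highest probability. The summarize_detail helper is a hypothetical name, and how the table contents reach Python (for example, an export to a file) is left out.

```python
import json

# prediction_detail value copied from the first row of the result table.
detail = (
    '{"0": 0.9999997274902165, '
    '"1": 2.324679066261573e-07, '
    '"2": 2.324679066261569e-07}'
)

def summarize_detail(detail_json):
    """Return (label, probability) for the most likely class in the JSON object."""
    probabilities = json.loads(detail_json)
    best_label = max(probabilities, key=probabilities.get)
    return best_label, probabilities[best_label]

label, score = summarize_detail(detail)
print(label, score)
```

For this row, the recovered label and probability match the prediction_result and prediction_score columns. Note that JSON keys are strings, so the label comes back as "0" rather than the BIGINT 0 stored in the table.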