Linear SVM

A support vector machine (SVM) is a machine learning method based on statistical learning theory. It improves a model's generalization ability through structural risk minimization, which balances the empirical risk and the confidence interval. This topic explains how to configure the Linear SVM component and includes a usage example.

Background

This Linear SVM algorithm is implemented without using a kernel function. For details on the implementation, see the "trust region method for L2-SVM" section in algorithm principles.
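
For orientation, the L2-loss (squared hinge) linear SVM objective that trust region methods are commonly applied to in the literature can be written as shown below. The symbols w, x_i, y_i, C_+, and C_- are notation introduced here, with y_i in {+1, -1}. Mapping C_+ and C_- to the positive and negative penalty factors of this component is an assumption, and the component's exact objective may differ.

```latex
\min_{w}\; \frac{1}{2} w^{\top} w
  + C_{+} \sum_{i:\, y_i = +1} \max\bigl(0,\; 1 - y_i\, w^{\top} x_i\bigr)^{2}
  + C_{-} \sum_{i:\, y_i = -1} \max\bigl(0,\; 1 - y_i\, w^{\top} x_i\bigr)^{2}
```

Because no kernel function is involved, a sample x would be classified from the sign of the linear score w^T x.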

Limitations

The Linear SVM component supports only binary classification.

Component configuration

You can configure the parameters for the Linear SVM component using one of the following methods.

Method 1: Visual configuration

  • Input port

    The Linear SVM component has one required input port that must be connected to a Read Table component.

  • Configure the component parameters on the workflow page.

    | Parameter type | Parameter | Required | Description |
    | --- | --- | --- | --- |
    | Field settings | Feature columns | Yes | The feature columns for model training. Supported data types: BIGINT and DOUBLE. |
    | Field settings | Label column | Yes | The label column for model training. Supported data types: BIGINT, DOUBLE, and STRING. |
    | Parameter settings | Positive sample label value | No | The value that represents a positive sample. If this parameter is not specified, a value is randomly selected from the label column. Specifying this parameter is recommended for imbalanced datasets. |
    | Parameter settings | Positive penalty factor | No | The weight for positive samples. The default value is 1.0, and the value range is (0, +∞). |
    | Parameter settings | Negative penalty factor | No | The weight for negative samples. The default value is 1.0, and the value range is (0, +∞). |
    | Parameter settings | Convergence coefficient | No | The convergence error. The default value is 0.001, and the value range is (0, 1). |
    | Execution tuning | Number of cores | No | The number of compute cores. If unspecified, the system automatically allocates resources. |
    | Execution tuning | Memory per core | No | The memory size for each core, in MB. If unspecified, the system automatically allocates resources. |

  • Output port

    Outputs a binary classification model in OfflineModel format. The output of this component connects to a Prediction component.
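
To make the roles of the positive penalty factor, negative penalty factor, and convergence coefficient more concrete, here is a minimal sketch of a linear L2-loss SVM trained with plain gradient descent. It is not the component's implementation: the component is described as using a trust region method, and the function names, the learning_rate and max_iter arguments, and the stopping rule (gradient norm reduced by the factor epsilon) are all assumptions made for illustration.

```python
def train_l2svm(rows, labels, positive_cost=1.0, negative_cost=1.0,
                epsilon=0.001, learning_rate=0.01, max_iter=10000):
    """Fit a linear L2-loss SVM by plain gradient descent (illustrative only).

    rows: list of feature lists; labels: list of +1 / -1 values.
    Returns the weight list, whose last entry is a bias term.
    """
    # Append a constant 1.0 so the last weight acts as a (regularized) bias.
    data = [[float(v) for v in r] + [1.0] for r in rows]
    # Per-sample penalty: positive_cost for +1 rows, negative_cost for -1 rows.
    costs = [positive_cost if y > 0 else negative_cost for y in labels]
    w = [0.0] * len(data[0])
    first_norm = None
    for _ in range(max_iter):
        # Gradient of 0.5*||w||^2 + sum_i cost_i * max(0, 1 - y_i * w.x_i)^2
        grad = list(w)
        for x, y, c in zip(data, labels, costs):
            violation = 1.0 - y * sum(wj * xj for wj, xj in zip(w, x))
            if violation > 0.0:  # only margin violators contribute
                for j, xj in enumerate(x):
                    grad[j] -= 2.0 * c * violation * y * xj
        norm = sum(g * g for g in grad) ** 0.5
        if first_norm is None:
            first_norm = norm
        # Assumed stopping rule: gradient norm shrunk by the factor epsilon.
        if norm <= epsilon * first_norm:
            break
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]
    return w


def predict(w, rows):
    """Return +1 or -1 for each row from the sign of the linear score."""
    scores = [sum(wj * xj for wj, xj in zip(w, list(r) + [1.0])) for r in rows]
    return [1 if s >= 0 else -1 for s in scores]
```

In this sketch, raising positive_cost relative to negative_cost makes margin violations on positive samples more expensive, which is one reason to set the positive label explicitly on imbalanced data. A smaller epsilon asks for a tighter solution at the cost of more iterations.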

Method 2: PAI command

You can configure the component parameters by running a PAI command in an SQL Script component. For more information, see SQL Script.

PAI -name LinearSVM -project algo_public
    -DinputTableName="bank_data"
    -DmodelName="xlab_m_LinearSVM_6143"
    -DfeatureColNames="pdays,emp_var_rate,cons_conf_idx"
    -DlabelColName="y"
    -DpositiveLabel="0"
    -DpositiveCost="1.0"
    -DnegativeCost="1.0"
    -Depsilon="0.001";
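
If the command is generated from a script, a small helper along the following lines could assemble it and catch out-of-range values early. This is a sketch only: build_linear_svm_command is a hypothetical function, not part of PAI. It formats text in the same layout as the command above, checks the ranges this topic states for positiveCost, negativeCost, and epsilon, and does not submit anything.

```python
def build_linear_svm_command(params, project="algo_public"):
    """Render a PAI LinearSVM command from a dict of parameter values.

    Hypothetical helper: it only formats text and validates a few values.
    """
    required = ["inputTableName", "modelName", "featureColNames", "labelColName"]
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    # Range checks taken from the parameter descriptions in this topic.
    for name in ("positiveCost", "negativeCost"):
        if name in params and not float(params[name]) > 0:
            raise ValueError(name + " must be in (0, +inf)")
    if "epsilon" in params and not 0 < float(params["epsilon"]) < 1:
        raise ValueError("epsilon must be in (0, 1)")

    lines = ["PAI -name LinearSVM -project " + project]
    for name, value in params.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(value)  # for example, feature column names
        lines.append('    -D{}="{}"'.format(name, value))
    return "\n".join(lines) + ";"


# Example call using the values from the command above.
command = build_linear_svm_command({
    "inputTableName": "bank_data",
    "modelName": "xlab_m_LinearSVM_6143",
    "featureColNames": ["pdays", "emp_var_rate", "cons_conf_idx"],
    "labelColName": "y",
    "epsilon": "0.001",
})
print(command)
```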

The following table describes the parameters in the PAI command.

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | N/A | The name of the input table. |
| inputTablePartitions | No | All partitions of the input table. | The partitions of the input table for training. Supported formats: Partition_name=value, or name1=value1/name2=value2 for a multi-level partition. Note: To specify multiple partitions, separate them with commas (,). |
| modelName | Yes | N/A | The name of the output model. |
| featureColNames | Yes | N/A | The names of the feature columns in the input table. |
| labelColName | Yes | N/A | The name of the label column in the input table. |
| positiveLabel | No | A value is randomly selected from the label column. | The value that represents a positive sample. |
| positiveCost | No | 1.0 | The weight of the positive class, also known as the positive penalty factor. The value must be in the range (0, +∞). |
| negativeCost | No | 1.0 | The weight of the negative class, also known as the negative penalty factor. The value must be in the range (0, +∞). |
| epsilon | No | 0.001 | The convergence coefficient. The value must be in the range (0, 1). |
| enableSparse | No | false | Specifies whether the input data is in sparse format. Valid values: true and false. |
| itemDelimiter | No | , (comma) | The delimiter between KV pairs when the input table data is in sparse format. |
| kvDelimiter | No | : (colon) | The delimiter between key and value when the input table data is in sparse format. |
| coreNum | No | System-allocated | The number of compute cores. The value must be a positive integer. |
| memSizePerCore | No | System-allocated | The memory size for each core, in MB. The value must be an integer in the range [1, 65536]. |
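
As a rough illustration of how itemDelimiter and kvDelimiter split a sparse-format cell, the following sketch parses one cell into key-value pairs. The function parse_sparse_row and the sample string are invented for this example. The source does not state whether keys are column indexes or names, so the sketch keeps them as strings.

```python
def parse_sparse_row(text, item_delimiter=",", kv_delimiter=":"):
    """Parse one sparse-format cell into a {key: value} dict.

    Keys are kept as strings; values are converted to float.
    """
    features = {}
    for item in text.split(item_delimiter):
        item = item.strip()
        if not item:
            continue  # tolerate empty cells and trailing delimiters
        key, sep, value = item.partition(kv_delimiter)
        if not sep:
            raise ValueError("missing key-value delimiter in: " + repr(item))
        features[key.strip()] = float(value)
    return features


# Made-up cell using the default delimiters: comma between pairs, colon inside a pair.
print(parse_sparse_row("1:0.5,3:-1,7:0.25"))
```

Features absent from a cell are simply missing from the returned dict. How the component treats missing keys is not described in this topic.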

Example

  1. Import the following training data.

    | id | y | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | 1 | -1 | -0.294118 | 0.487437 | 0.180328 | -0.292929 | -1 | 0.00149028 | -0.53117 | -0.0333333 |
    | 2 | +1 | -0.882353 | -0.145729 | 0.0819672 | -0.414141 | -1 | -0.207153 | -0.766866 | -0.666667 |
    | 3 | -1 | -0.0588235 | 0.839196 | 0.0491803 | -1 | -1 | -0.305514 | -0.492741 | -0.633333 |
    | 4 | +1 | -0.882353 | -0.105528 | 0.0819672 | -0.535354 | -0.777778 | -0.162444 | -0.923997 | -1 |
    | 5 | -1 | -1 | 0.376884 | -0.344262 | -0.292929 | -0.602837 | 0.28465 | 0.887276 | -0.6 |
    | 6 | +1 | -0.411765 | 0.165829 | 0.213115 | -1 | -1 | -0.23696 | -0.894962 | -0.7 |
    | 7 | -1 | -0.647059 | -0.21608 | -0.180328 | -0.353535 | -0.791962 | -0.0760059 | -0.854825 | -0.833333 |
    | 8 | +1 | 0.176471 | 0.155779 | -1 | -1 | -1 | 0.052161 | -0.952178 | -0.733333 |
    | 9 | -1 | -0.764706 | 0.979899 | 0.147541 | -0.0909091 | 0.283688 | -0.0909091 | -0.931682 | 0.0666667 |
    | 10 | -1 | -0.0588235 | 0.256281 | 0.57377 | -1 | -1 | -1 | -0.868488 | 0.1 |

  2. Import the following test data.

    | id | y | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | 1 | +1 | -0.882353 | 0.0854271 | 0.442623 | -0.616162 | -1 | -0.19225 | -0.725021 | -0.9 |
    | 2 | +1 | -0.294118 | -0.0351759 | -1 | -1 | -1 | -0.293592 | -0.904355 | -0.766667 |
    | 3 | +1 | -0.882353 | 0.246231 | 0.213115 | -0.272727 | -1 | -0.171386 | -0.981213 | -0.7 |
    | 4 | -1 | -0.176471 | 0.507538 | 0.278689 | -0.414141 | -0.702128 | 0.0491804 | -0.475662 | 0.1 |
    | 5 | -1 | -0.529412 | 0.839196 | -1 | -1 | -1 | -0.153502 | -0.885568 | -0.5 |
    | 6 | +1 | -0.882353 | 0.246231 | -0.0163934 | -0.353535 | -1 | 0.0670641 | -0.627669 | -1 |
    | 7 | -1 | -0.882353 | 0.819095 | 0.278689 | -0.151515 | -0.307329 | 0.19225 | 0.00768574 | -0.966667 |
    | 8 | +1 | -0.882353 | -0.0753769 | 0.0163934 | -0.494949 | -0.903073 | -0.418778 | -0.654996 | -0.866667 |
    | 9 | +1 | -1 | 0.527638 | 0.344262 | -0.212121 | -0.356974 | 0.23696 | -0.836038 | -0.8 |
    | 10 | +1 | -0.882353 | 0.115578 | 0.0163934 | -0.737374 | -0.56974 | -0.28465 | -0.948762 | -0.933333 |

  3. Create a pipeline as shown in the following figure. For more information, see algorithm modeling.

    [Figure: pipeline for the Linear SVM example]

  4. Configure the parameters for the Linear SVM component as shown in the following table. Keep the default values for all other parameters.

    | Parameter type | Parameter | Description |
    | --- | --- | --- |
    | Field settings | Feature columns | Select the f0, f1, f2, f3, f4, f5, f6, and f7 columns. |
    | Field settings | Label column | Select the y column. |

  5. Run the pipeline and view the prediction results. After the prediction completes, the output displays the binary classification prediction results for the 10 rows of test data in five columns: id, y (actual label), prediction_result (predicted label: +1 or -1), prediction_score, and prediction_detail (score details for the +1 and -1 classes, in JSON format).
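
Once the prediction output has been exported, the columns described in step 5 can be summarized with a few lines of code. The following is a minimal sketch: accuracy and best_label are hypothetical helpers, the rows are made-up values rather than real output of this example, and the key names inside prediction_detail ("+1" and "-1") are an assumption about the JSON layout.

```python
import json


def accuracy(rows):
    """Share of rows whose prediction_result equals the actual label y."""
    if not rows:
        raise ValueError("no prediction rows")
    correct = sum(1 for r in rows if int(r["y"]) == int(r["prediction_result"]))
    return correct / len(rows)


def best_label(prediction_detail):
    """Return the label with the highest score in a prediction_detail string."""
    scores = json.loads(prediction_detail)
    return max(scores, key=scores.get)


# Made-up rows for illustration only; they are not results from the pipeline.
sample_rows = [
    {"y": "+1", "prediction_result": "+1"},
    {"y": "-1", "prediction_result": "+1"},
    {"y": "-1", "prediction_result": "-1"},
    {"y": "+1", "prediction_result": "+1"},
]
print(accuracy(sample_rows))
print(best_label('{"+1": 0.8, "-1": 0.2}'))
```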