Principal component analysis (PCA)

Principal Component Analysis (PCA) is a multivariate statistical method for exploring the internal structure of multiple variables and examining their correlations. PCA performs dimensionality reduction by deriving a few uncorrelated principal components from the original variables. These components retain as much information as possible from the original data and serve as new composite indicators.
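
As an illustration of the "uncorrelated components" property, the following is a minimal NumPy sketch, not the implementation used by the PAI component. The helper pca_scores and the synthetic data are assumptions made for this example. It eigendecomposes the sample covariance matrix, projects the centered data onto the eigenvectors, and checks that the resulting components have near-zero covariance with one another.

```python
import numpy as np

def pca_scores(X):
    """Project rows of X onto the eigenvectors of its sample covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center each column
    cov = np.cov(Xc, rowvar=False)           # sample covariance (ddof=1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return Xc @ eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Three strongly correlated columns plus small noise (synthetic data for illustration)
X = np.hstack([base, 2 * base, -base]) + 0.1 * rng.normal(size=(200, 3))

scores, eigvals = pca_scores(X)
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))

print(np.abs(off_diag).max())    # close to zero: the components are uncorrelated
print(eigvals / eigvals.sum())   # share of total variance carried by each component
```

Because the three input columns are nearly multiples of one another, almost all of the variance ends up in the first component, which is why a few components can stand in for many original variables.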

Limitations

The Principal Component Analysis (PCA) component performs dimensionality reduction and noise reduction on dense data only. Sparse data is not supported.

Configuration

Method 1: Visual interface

In Machine Learning Designer, add the Principal Component Analysis (PCA) component to your pipeline and configure its parameters in the pane on the right.

| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | The columns from the input table to use for analysis. |
| Fields Setting | Appended Columns | The columns to append to the dimensionality-reduced output table. |
| Parameters Setting | Information Retention Ratio | The proportion of information from the original data to retain after dimensionality reduction. |
| Parameters Setting | Feature Decomposition Mode | The matrix used for eigendecomposition. Valid values: CORR (correlation matrix), COVAR_SAMP (sample covariance matrix), COVAR_POP (population covariance matrix). |
| Parameters Setting | Data Conversion Method | The method used to transform data into principal components. Valid values: Simple, Sub-Mean, Normalization. |
| Tuning | Lifecycle | The lifecycle of the output table. Must be a positive integer. |
| Tuning | Number of Cores | The number of cores to use. Used with the Memory Size per Core (MB) parameter. The value must be a positive integer in the range [1, 9999]. |
| Tuning | Memory Size per Core (MB) | The memory size for each core, in MB. The value must be a positive integer in the range [1024, 65536]. |
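
To make the Feature Decomposition Mode and Data Conversion Method options more concrete, here is a hedged NumPy sketch. The mapping of CORR, COVAR_SAMP, and COVAR_POP to the correlation, sample covariance, and population covariance matrices follows the usual meaning of those names. The interpretation of Simple (use raw values), Sub-Mean (subtract the column mean), and Normalization (z-score with the sample standard deviation) is an assumption; this page does not define them. The function names decomposition_matrix and convert are hypothetical.

```python
import numpy as np

def decomposition_matrix(X, calcu_type="CORR"):
    """Return the matrix that would be eigendecomposed for a given mode."""
    if calcu_type == "CORR":
        return np.corrcoef(X, rowvar=False)      # correlation matrix
    if calcu_type == "COVAR_SAMP":
        return np.cov(X, rowvar=False, ddof=1)   # divide by n - 1
    if calcu_type == "COVAR_POP":
        return np.cov(X, rowvar=False, ddof=0)   # divide by n
    raise ValueError(f"unknown calcuType: {calcu_type}")

def convert(X, trans_type="Simple"):
    """Assumed preprocessing applied before projecting onto the eigenvectors."""
    if trans_type == "Simple":
        return X                                 # assumption: raw values
    if trans_type == "Sub-Mean":
        return X - X.mean(axis=0)                # assumption: center only
    if trans_type == "Normalization":
        # assumption: z-score using the sample standard deviation
        return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    raise ValueError(f"unknown transType: {trans_type}")

# Small made-up data set with two columns on different scales
X = np.array([[1.0, 10.0], [2.0, 14.0], [3.0, 15.0], [4.0, 21.0]])
n = X.shape[0]

samp = decomposition_matrix(X, "COVAR_SAMP")
pop = decomposition_matrix(X, "COVAR_POP")
corr = decomposition_matrix(X, "CORR")
centered = convert(X, "Sub-Mean")
scaled = convert(X, "Normalization")

print(pop / samp)                      # every entry equals (n - 1) / n
print(np.cov(scaled, rowvar=False))    # matches corr under the z-score assumption
```

Under these assumptions, CORR is insensitive to the scale of each column, while the two covariance modes let columns with larger variance dominate the leading components.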

Method 2: PAI commands

Alternatively, run PAI (Machine Learning Platform for AI) commands in a SQL Script component to configure the Principal Component Analysis (PCA) component. For more information, see SQL Script.

PAI -name PrinCompAnalysis
    -project algo_public
    -DinputTableName=bank_data
    -DeigOutputTableName=pai_temp_2032_17900_2
    -DprincompOutputTableName=pai_temp_2032_17900_1
    -DselectedColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed
    -DtransType=Simple
    -DcalcuType=CORR
    -DcontriRate=0.9;

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | The input table for principal component analysis. |
| selectedColNames | Yes | None | The names of the columns in the input table to use for analysis. Separate multiple column names with commas. Only columns of the INT and DOUBLE data types are supported. |
| eigOutputTableName | Yes | None | The output table for the eigenvectors and eigenvalues. |
| princompOutputTableName | Yes | None | The output table containing the results after dimensionality reduction and noise reduction. |
| transType | No | Simple | The method used to transform data into principal components. Valid values: Simple, Sub-Mean, Normalization. |
| calcuType | No | CORR | The matrix used for eigendecomposition. Valid values: CORR (correlation matrix), COVAR_SAMP (sample covariance matrix), COVAR_POP (population covariance matrix). |
| contriRate | No | 0.9 | The proportion of information to retain after dimensionality reduction. The value must be in the range (0, 1). |
| remainColumns | No | None | The columns from the original table to include in the dimensionality-reduced output table. |
| coreNum | No | System-assigned | The number of cores to use. This parameter works in conjunction with the memSizePerCore parameter. The value must be a positive integer in the range [1, 9999]. |
| memSizePerCore | No | System-assigned | The memory size for each core, in MB. The value must be a positive integer in the range [1024, 65536]. |
| lifecycle | No | None | The lifecycle of the output table. Must be a positive integer. |
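
The parameter ranges above can be checked before a command is submitted. The sketch below is a hypothetical helper, not part of PAI: build_pca_command, its argument names, and the output table names in the usage example are assumptions made for illustration. It validates the documented value ranges and renders a command in the same layout as the example earlier in this section, leaving out optional parameters that are not set. The remainColumns parameter is omitted from the sketch for brevity.

```python
def build_pca_command(input_table, selected_cols, eig_output, princomp_output,
                      trans_type="Simple", calcu_type="CORR", contri_rate=0.9,
                      core_num=None, mem_size_per_core=None, lifecycle=None):
    """Hypothetical helper: validate arguments, then render a PrinCompAnalysis command."""
    if not selected_cols:
        raise ValueError("selectedColNames needs at least one column")
    if trans_type not in ("Simple", "Sub-Mean", "Normalization"):
        raise ValueError(f"invalid transType: {trans_type}")
    if calcu_type not in ("CORR", "COVAR_SAMP", "COVAR_POP"):
        raise ValueError(f"invalid calcuType: {calcu_type}")
    if not 0 < contri_rate < 1:
        raise ValueError("contriRate must be in the open range (0, 1)")
    if core_num is not None and not 1 <= core_num <= 9999:
        raise ValueError("coreNum must be in the range [1, 9999]")
    if mem_size_per_core is not None and not 1024 <= mem_size_per_core <= 65536:
        raise ValueError("memSizePerCore must be in the range [1024, 65536]")
    if lifecycle is not None and lifecycle < 1:
        raise ValueError("lifecycle must be a positive integer")

    args = [
        ("inputTableName", input_table),
        ("eigOutputTableName", eig_output),
        ("princompOutputTableName", princomp_output),
        ("selectedColNames", ",".join(selected_cols)),
        ("transType", trans_type),
        ("calcuType", calcu_type),
        ("contriRate", contri_rate),
        ("coreNum", core_num),
        ("memSizePerCore", mem_size_per_core),
        ("lifecycle", lifecycle),
    ]
    lines = ["PAI -name PrinCompAnalysis", "    -project algo_public"]
    # Optional parameters left as None are skipped so the system defaults apply
    lines += [f"    -D{name}={value}" for name, value in args if value is not None]
    return "\n".join(lines) + ";"

command = build_pca_command(
    "bank_data",
    ["pdays", "previous", "emp_var_rate"],
    "pca_eig_out",        # hypothetical output table name
    "pca_princomp_out",   # hypothetical output table name
)
print(command)
```

The returned string can be pasted into a SQL Script component. Failing early on an out-of-range value is a design choice of this sketch, not documented behavior of the platform.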

Examples

Example outputs

  • Dimensionality reduction output table (princompOutputTableName): This table contains four principal component columns: prin0, prin1, prin2, and prin3. Each row holds the principal component values for one sample after dimensionality reduction.

  • Eigenvalue and eigenvector table (eigOutputTableName): This table describes the four principal components (prin0 to prin3). Its columns hold the eigenvector components for each input variable (pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, and nr_employed), along with the corresponding eigenvalue, contributionrate (contribution rate), and sumcontributionrate (cumulative contribution rate).
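
The contribution rate columns can be related to the eigenvalues with a short calculation. The sketch below assumes the conventional definitions: each contribution rate is an eigenvalue divided by the sum of all eigenvalues, the cumulative rate is the running total, and the number of retained components is the smallest count whose cumulative rate reaches contriRate. This page does not state the exact rule the component uses, and both the function contribution_table and the eigenvalues are made up for illustration.

```python
import numpy as np

def contribution_table(eigenvalues, contri_rate=0.9):
    """Return per-component rates, cumulative rates, and the number of components kept."""
    eig = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # largest first
    rate = eig / eig.sum()                                      # contribution rate
    cum = np.cumsum(rate)                                       # cumulative contribution rate
    # First position where the cumulative rate reaches contri_rate, as a count
    k = int(np.searchsorted(cum, contri_rate)) + 1
    return rate, cum, min(k, len(eig))

# Hypothetical eigenvalues, not taken from the bank_data example
rate, cum, k = contribution_table([4.0, 2.0, 0.7, 0.2, 0.1], contri_rate=0.9)
for i, (r, c) in enumerate(zip(rate, cum)):
    print(f"prin{i}: contributionrate={r:.4f} sumcontributionrate={c:.4f}")
print("components kept:", k)
```

With these made-up eigenvalues, the cumulative rate first reaches 0.9 at the third component, so three components would be kept under the assumed rule.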