Principal Component Analysis (PCA) is a multivariate statistical method for exploring the internal structure of multiple variables and examining their correlations. PCA performs dimensionality reduction by deriving a few uncorrelated principal components from the original variables. These components retain as much information as possible from the original data and serve as new composite indicators.
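The idea described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the implementation used by the PCA component: the function name `pca_by_retention`, the choice of the correlation matrix, and the rule of keeping the smallest number of components whose cumulative eigenvalue share reaches the retention ratio are all assumptions made for the example.

```python
import numpy as np


def pca_by_retention(X, retention=0.9):
    """Illustrative PCA on the correlation matrix (hypothetical helper).

    Returns the component scores, the kept eigenvalues, and their
    cumulative share of the total variance.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column so that the covariance of Z equals the
    # correlation matrix of X. Constant columns would divide by zero here.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(X, rowvar=False)

    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Assumed selection rule: smallest k whose cumulative share >= retention.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cumulative, retention)) + 1, len(eigvals))

    scores = Z @ eigvecs[:, :k]  # columns of scores are mutually uncorrelated
    return scores, eigvals[:k], cumulative[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(200, 1))
    # Two strongly correlated columns plus one independent column (synthetic data).
    X = np.hstack([
        shared + 0.1 * rng.normal(size=(200, 1)),
        2 * shared + 0.1 * rng.normal(size=(200, 1)),
        rng.normal(size=(200, 1)),
    ])
    scores, eigvals, cumulative = pca_by_retention(X, retention=0.9)
    print(scores.shape, cumulative)
```

Because the scores are projections onto orthogonal eigenvectors of the correlation matrix, the sample variance of each score column equals the corresponding eigenvalue, which is what makes the eigenvalue share a measure of retained information.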
Limitations
The Principal Component Analysis (PCA) component performs dimensionality reduction and noise reduction. It supports only dense data.
Configuration
Method 1: Visual interface
In Machine Learning Designer, add the Principal Component Analysis (PCA) component to your pipeline and configure its parameters in the pane on the right.
| Tab | Parameter | Description |
| --- | --- | --- |
| Fields Setting | Feature Columns | The columns from the input table to use for analysis. |
| | Appended Columns | The columns to append to the dimensionally-reduced output table. |
| Parameters Setting | Information Retention Ratio | The proportion of information from the original data to retain after dimensionality reduction. |
| | Feature Decomposition Mode | The method used for feature decomposition. Valid values: |
| | Data Conversion Method | The method used to transform data into principal components. Valid values: |
| Tuning | Lifecycle | The lifecycle of the output table. Must be a positive integer. |
| | Number of Cores | Used with the Memory Size per Core (MB) parameter. The value must be a positive integer in the range [1, 9999]. |
| | Memory Size per Core (MB) | Unit: MB. The value must be a positive integer in the range [1024, 65536]. |
Method 2: PAI commands
Alternatively, run PAI (Machine Learning Platform for AI) commands in a SQL Script component to configure the Principal Component Analysis (PCA) component. For more information, see SQL Script.
PAI -name PrinCompAnalysis
-project algo_public
-DinputTableName=bank_data
-DeigOutputTableName=pai_temp_2032_17900_2
-DprincompOutputTableName=pai_temp_2032_17900_1
-DselectedColNames=pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed
-DtransType=Simple
-DcalcuType=CORR
-DcontriRate=0.9;
| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| inputTableName | Yes | None | The input table for principal component analysis. |
| selectedColNames | Yes | None | The names of the columns in the input table to be used for analysis. Separate multiple column names with a comma. Only columns of the INT and DOUBLE data types are supported. |
| eigOutputTableName | Yes | None | The output table for the feature vectors and eigenvalues. |
| princompOutputTableName | Yes | None | The output table containing the results after dimensionality reduction and noise reduction. |
| transType | No | Simple | The method used to transform data into principal components. Valid values: |
| calcuType | No | CORR | The method used for feature decomposition. Valid values: |
| contriRate | No | 0.9 | The proportion of information to retain after dimensionality reduction. The value must be in the range (0, 1). |
| remainColumns | No | None | The columns from the original table to include in the dimensionally-reduced output table. |
| coreNum | No | System-assigned | The number of cores to use. This parameter works in conjunction with the memSizePerCore parameter. The value must be a positive integer in the range [1, 9999]. |
| memSizePerCore | No | System-assigned | The memory size for each core, in MB. The value must be a positive integer in the range [1024, 65536]. |
| lifecycle | No | None | The lifecycle of the output table. Must be a positive integer. |
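When the command is generated from a script, the documented value ranges can be checked before the command is submitted. The following is a sketch only: `build_pca_command` is a hypothetical helper, not part of PAI, and it simply assembles the same command text shown earlier from the parameters listed in the table. The table names in the usage line are placeholders.

```python
def build_pca_command(input_table, selected_cols, eig_table, princomp_table,
                      trans_type="Simple", calcu_type="CORR", contri_rate=0.9,
                      core_num=None, mem_size_per_core=None, lifecycle=None):
    """Assemble a PrinCompAnalysis command string (hypothetical helper).

    Raises ValueError when a value falls outside the documented range.
    """
    if not selected_cols:
        raise ValueError("selectedColNames needs at least one column")
    if not 0 < contri_rate < 1:
        raise ValueError("contriRate must be in the range (0, 1)")
    if core_num is not None and not 1 <= core_num <= 9999:
        raise ValueError("coreNum must be in the range [1, 9999]")
    if mem_size_per_core is not None and not 1024 <= mem_size_per_core <= 65536:
        raise ValueError("memSizePerCore must be in the range [1024, 65536]")
    if lifecycle is not None and lifecycle < 1:
        raise ValueError("lifecycle must be a positive integer")

    params = {
        "inputTableName": input_table,
        "eigOutputTableName": eig_table,
        "princompOutputTableName": princomp_table,
        "selectedColNames": ",".join(selected_cols),
        "transType": trans_type,
        "calcuType": calcu_type,
        "contriRate": contri_rate,
    }
    # Optional parameters are emitted only when set, so the system defaults apply otherwise.
    optional = {
        "coreNum": core_num,
        "memSizePerCore": mem_size_per_core,
        "lifecycle": lifecycle,
    }
    params.update({name: value for name, value in optional.items() if value is not None})

    lines = ["PAI -name PrinCompAnalysis", "-project algo_public"]
    lines += [f"-D{name}={value}" for name, value in params.items()]
    return "\n    ".join(lines) + ";"


if __name__ == "__main__":
    # Placeholder table names; replace them with tables in your own project.
    print(build_pca_command("bank_data", ["pdays", "previous"], "pca_eig_out", "pca_result_out"))
```

The resulting string can be pasted into a SQL Script component. Validating locally only catches range errors; it does not confirm that the tables or columns exist.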
Examples
Example outputs
- Dimensionality reduction result table: This table contains four principal component columns: prin0, prin1, prin2, and prin3. Each row holds the principal component values for one sample after dimensionality reduction.
- Eigenvalue and feature vector table: This table describes the four principal components (prin0 to prin3). Its columns hold the feature vector components (pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, and nr_employed) together with the corresponding eigenvalue, contributionrate (contribution rate), and sumcontributionrate (cumulative contribution rate).
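To read the contributionrate and sumcontributionrate columns, it helps to see how such values are conventionally derived from eigenvalues. The sketch below assumes the standard definitions (each eigenvalue divided by the sum of all eigenvalues, then a running total); the source does not confirm that the component computes them exactly this way, and the eigenvalues used are made-up numbers, not values from the example table.

```python
from itertools import accumulate


def contribution_rates(eigenvalues):
    """Return (rates, cumulative_rates) for eigenvalues sorted in descending order."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    rates = [value / total for value in ordered]
    return rates, list(accumulate(rates))


def components_needed(eigenvalues, contri_rate=0.9):
    """Smallest number of components whose cumulative rate reaches contri_rate.

    Treating the threshold as inclusive is an assumption of this sketch.
    """
    _, cumulative = contribution_rates(eigenvalues)
    for count, share in enumerate(cumulative, start=1):
        if share >= contri_rate:
            return count
    return len(cumulative)


if __name__ == "__main__":
    # Hypothetical eigenvalues chosen so the arithmetic is easy to follow.
    rates, cumulative = contribution_rates([4.0, 2.0, 1.0, 1.0])
    print(rates)       # [0.5, 0.25, 0.125, 0.125]
    print(cumulative)  # [0.5, 0.75, 0.875, 1.0]
    print(components_needed([4.0, 2.0, 1.0, 1.0], contri_rate=0.9))  # 4
```

Under these assumptions, the number of rows in the eigenvalue table corresponds to the point at which the cumulative rate first reaches the configured contriRate.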