Clustering model evaluation-Platform For AI(PAI)-阿里云帮助中心

Limitations

This component's visualization reports are available only in the original Machine Learning Studio.

Background information

The Calinski-Harabasz index, also known as the Variance Ratio Criterion (VRC), is an evaluation metric calculated using the following formula. VRC公式

Parameter	Description
SS_B	The variance between clusters, defined as follows. Where: k: the number of cluster centers. m_i: the center of a cluster. m: the mean of the input data.
SS_W	The variance within clusters, defined as follows. Where: k: the number of cluster centers. x: a data point. c_i: the i-th cluster. m_i: the center of a cluster.
N	The total number of records.
k	The number of cluster centers.

Component configuration

You can configure the clustering model evaluation component by using either of the following methods.

Method 1: GUI

Configure the component parameters on the Machine Learning Designer pipeline page.

Tab	Parameter	Description
Field Settings	Evaluation Columns	The columns to evaluate.
	Input is Sparse Format	Specifies whether the input data is in a sparse key-value (KV) format.
	Key-Value Pair Delimiter	Defaults to a comma (,).
	Key-Value Inner Delimiter	Defaults to a colon (:).
Execution Tuning	Number of Cores	Use with Memory per Core. Must be a positive integer.
Execution Tuning	Memory per Core	Use with Number of Cores. Unit: MB.

Method 2: PAI command

Run a PAI command in the SQL Script component to configure the component parameters. For more information, see SQL Script.

PAI -name cluster_evaluation
    -project algo_public
    -DinputTableName=pai_cluster_evaluation_test_input
    -DselectedColNames=f0,f3
    -DmodelName=pai_kmeans_test_model
    -DoutputTableName=pai_ft_cluster_evaluation_out;

Parameter	Required	Description	Default
inputTableName	Yes	The name of the input table.	None
selectedColNames	No	The columns in the input table to evaluate. To specify multiple columns, separate their names with commas (,).	All columns
inputTablePartitions	No	The partitions of the input table to use for evaluation. The following formats are supported: partition_name=value name1=value1/name2=value2: multi-level format Note If you specify multiple partitions, separate them with commas (,).	The entire table
enableSparse	No	Specifies whether the input data is in sparse format. Valid values: `true` and `false`.	false
itemDelimiter	No	The delimiter between key-value pairs in sparse format.	A comma (,).
kvDelimiter	No	The delimiter between a key and a value in sparse format.	A colon (:).
modelName	Yes	The input clustering model.	None
outputTableName	Yes	The output table.	None
lifecycle	No	The lifecycle of the output table.	None

Example

Use an SQL statement to generate test data.

create table if not exists pai_cluster_evaluation_test_input as
select * from
(
  select 1 as id, 1 as f0,2 as f3
  union all
  select 2 as id, 1 as f0,3 as f3
  union all
  select 3 as id, 1 as f0,4 as f3
  union all
  select 4 as id, 0 as f0,3 as f3
  union all
  select 5 as id, 0 as f0,4 as f3
)tmp;

Use a PAI command to create a clustering model. This example uses k-means clustering.

PAI -name kmeans
    -project algo_public
    -DinputTableName=pai_cluster_evaluation_test_input
    -DselectedColNames=f0,f3
    -DcenterCount=3
    -Dloop=10
    -Daccuracy=0.00001
    -DdistanceType=euclidean
    -DinitCenterMethod=random
    -Dseed=1
    -DmodelName=pai_kmeans_test_model
    -DidxTableName=pai_kmeans_test_idx

Use a PAI command to run the clustering model evaluation component.

PAI -name cluster_evaluation
    -project algo_public
    -DinputTableName=pai_cluster_evaluation_test_input
    -DselectedColNames=f0,f3
    -DmodelName=pai_kmeans_test_model
    -DoutputTableName=pai_ft_cluster_evaluation_out;

View the output table pai_ft_cluster_evaluation_out. After the component runs, the output table contains a calinhara value of 3, a centerCount value of 3, and a count value of 5. The following table describes the fields in the output table.

Field	Description
count	The total number of records.
centerCount	The number of cluster centers.
calinhara	The Calinski-Harabasz index.