Clustering model evaluation

更新时间:
复制 MD 格式

Evaluates a clustering model's performance using raw data and clustering results to generate evaluation metrics.

Limitations

This component's visualization reports are available only in the original Machine Learning Studio.

Background information

The Calinski-Harabasz index, also known as the Variance Ratio Criterion (VRC), is an evaluation metric calculated using the following formula.VRC公式

Parameter

Description

SSB

The variance between clusters, defined as follows.SSB Where:

  • k: the number of cluster centers.

  • mi: the center of a cluster.

  • m: the mean of the input data.

SSW

The variance within clusters, defined as follows.SSW Where:

  • k: the number of cluster centers.

  • x: a data point.

  • ci: the i-th cluster.

  • mi: the center of a cluster.

N

The total number of records.

k

The number of cluster centers.

Component configuration

You can configure the clustering model evaluation component by using either of the following methods.

Method 1: GUI

Configure the component parameters on the Machine Learning Designer pipeline page.

Tab

Parameter

Description

Field Settings

Evaluation Columns

The columns to evaluate.

Input is Sparse Format

Specifies whether the input data is in a sparse key-value (KV) format.

Key-Value Pair Delimiter

Defaults to a comma (,).

Key-Value Inner Delimiter

Defaults to a colon (:).

Execution Tuning

Number of Cores

Use with Memory per Core. Must be a positive integer.

Memory per Core

Use with Number of Cores. Unit: MB.

Method 2: PAI command

Run a PAI command in the SQL Script component to configure the component parameters. For more information, see SQL Script.

PAI -name cluster_evaluation
    -project algo_public
    -DinputTableName=pai_cluster_evaluation_test_input
    -DselectedColNames=f0,f3
    -DmodelName=pai_kmeans_test_model
    -DoutputTableName=pai_ft_cluster_evaluation_out;

Parameter

Required

Description

Default

inputTableName

Yes

The name of the input table.

None

selectedColNames

No

The columns in the input table to evaluate. To specify multiple columns, separate their names with commas (,).

All columns

inputTablePartitions

No

The partitions of the input table to use for evaluation. The following formats are supported:

  • partition_name=value

  • name1=value1/name2=value2: multi-level format

Note

If you specify multiple partitions, separate them with commas (,).

The entire table

enableSparse

No

Specifies whether the input data is in sparse format. Valid values: true and false.

false

itemDelimiter

No

The delimiter between key-value pairs in sparse format.

A comma (,).

kvDelimiter

No

The delimiter between a key and a value in sparse format.

A colon (:).

modelName

Yes

The input clustering model.

None

outputTableName

Yes

The output table.

None

lifecycle

No

The lifecycle of the output table.

None

Example

  1. Use an SQL statement to generate test data.

    create table if not exists pai_cluster_evaluation_test_input as
    select * from
    (
      select 1 as id, 1 as f0,2 as f3
      union all
      select 2 as id, 1 as f0,3 as f3
      union all
      select 3 as id, 1 as f0,4 as f3
      union all
      select 4 as id, 0 as f0,3 as f3
      union all
      select 5 as id, 0 as f0,4 as f3
    )tmp;
  2. Use a PAI command to create a clustering model. This example uses k-means clustering.

    PAI -name kmeans
        -project algo_public
        -DinputTableName=pai_cluster_evaluation_test_input
        -DselectedColNames=f0,f3
        -DcenterCount=3
        -Dloop=10
        -Daccuracy=0.00001
        -DdistanceType=euclidean
        -DinitCenterMethod=random
        -Dseed=1
        -DmodelName=pai_kmeans_test_model
        -DidxTableName=pai_kmeans_test_idx
  3. Use a PAI command to run the clustering model evaluation component.

    PAI -name cluster_evaluation
        -project algo_public
        -DinputTableName=pai_cluster_evaluation_test_input
        -DselectedColNames=f0,f3
        -DmodelName=pai_kmeans_test_model
        -DoutputTableName=pai_ft_cluster_evaluation_out;
  4. View the output table pai_ft_cluster_evaluation_out. After the component runs, the output table contains a calinhara value of 3, a centerCount value of 3, and a count value of 5. The following table describes the fields in the output table.

    Field

    Description

    count

    The total number of records.

    centerCount

    The number of cluster centers.

    calinhara

    The Calinski-Harabasz index.