聚类分析函数

在商业运作中,聚类分析可以帮助我们发现相似的对象如用户、商品、市场等,从而进行针对性的营销策略,提高效率和收益。本文介绍聚合分类函数的基本语法和示例。

背景信息

  • 日志示例和索引示例(clustering_centroids函数和to_cluster_label函数)

    • 索引如图所示。更多信息,请参见创建索引

      image

    • 日志样例如下:

      {PushBack"entity_group":"A","entity_id":"0","x0":"9.43755123272417","x1":"10.9239183048056"}
      {PushBack"entity_group":"A","entity_id":"1","x0":"9.90934789781198","x1":"10.0410638670737"}
      {PushBack"entity_group":"A","entity_id":"2","x0":"10.0237761750608","x1":"11.5427618678773"}
      {PushBack"entity_group":"A","entity_id":"3","x0":"8.76413639285441","x1":"9.91100696065114"}
      {PushBack"entity_group":"A","entity_id":"4","x0":"9.72866019535702","x1":"10.9289903697879"}
      {PushBack"entity_group":"A","entity_id":"5","x0":"9.31732230036184","x1":"9.70753937768216"}
      {PushBack"entity_group":"A","entity_id":"6","x0":"10.25881289615","x1":"9.74090182514911"}
      {PushBack"entity_group":"A","entity_id":"7","x0":"10.7476586051507","x1":"9.68058142945478"}
      {PushBack"entity_group":"A","entity_id":"8","x0":"10.4225378902342","x1":"11.2940460054908"}
      {PushBack"entity_group":"A","entity_id":"9","x0":"9.60791396236535","x1":"9.74099863764865"}
      {PushBack"entity_group":"A","entity_id":"10","x0":"-9.88083328783069","x1":"9.96483490201365"}
      {PushBack"entity_group":"A","entity_id":"11","x0":"-9.2566893228908","x1":"10.4709996742353"}
      {PushBack"entity_group":"A","entity_id":"12","x0":"-9.42075011656993","x1":"10.2225398224866"}
      {PushBack"entity_group":"A","entity_id":"13","x0":"-11.0476084082138","x1":"9.6602175401245"}
      {PushBack"entity_group":"A","entity_id":"14","x0":"-10.4835290932454","x1":"10.0468761053371"}
      {PushBack"entity_group":"A","entity_id":"15","x0":"-10.0712229897491","x1":"10.7961735513211"}
      {PushBack"entity_group":"A","entity_id":"16","x0":"-9.20579985976076","x1":"11.3595617847332"}
      {PushBack"entity_group":"A","entity_id":"17","x0":"-10.2257991823914","x1":"11.5241122586671"}
      {PushBack"entity_group":"A","entity_id":"18","x0":"-9.53462858712671","x1":"10.4203828737972"}
      {PushBack"entity_group":"A","entity_id":"19","x0":"-9.32911152937254","x1":"8.64624234507702"}
      {PushBack"entity_group":"A","entity_id":"20","x0":"10.0383094216576","x1":"-10.8386382322694"}
  • 日志示例和索引示例(clustering函数)

    • 索引如图所示。更多信息,请参见创建索引

      image

    • 日志样例如下:

      1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126,83,77,4200,None
      2,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None
      3,Male,28,Doctor,6.2,6,60,8,Normal,125,80,75,10000,None
      4,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea
      5,Male,28,Sales Representative,5.9,4,30,8,Obese,140,90,85,3000,Sleep Apnea
      6,Male,28,Software Engineer,5.9,4,30,8,Obese,140,90,85,3000,Insomnia
      7,Male,29,Teacher,6.3,6,40,7,Obese,140,90,82,3500,Insomnia
      8,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      9,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      10,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      11,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None
      12,Male,29,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      13,Male,29,Doctor,6.1,6,30,8,Normal,120,80,70,8000,None
      14,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None
      15,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None
      16,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,None
      17,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Sleep Apnea
      18,Male,29,Doctor,6,6,30,8,Normal,120,80,70,8000,Sleep Apnea
      19,Female,29,Nurse,6.5,5,40,7,Normal Weight,132,87,80,4000,Insomnia
      20,Male,30,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      21,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      22,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      23,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      24,Male,30,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      25,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      26,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      27,Male,30,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      28,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      29,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      30,Male,30,Doctor,7.9,7,75,6,Normal,120,80,70,8000,None
      31,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Sleep Apnea
      32,Female,30,Nurse,6.4,5,35,7,Normal Weight,130,86,78,4100,Insomnia
      33,Female,31,Nurse,7.9,8,75,4,Normal Weight,117,76,69,6800,None
      34,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      35,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      36,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      37,Male,31,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      38,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      39,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      40,Male,31,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      41,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      42,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      43,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      44,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      45,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      46,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      47,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      48,Male,31,Doctor,7.8,7,75,6,Normal,120,80,70,8000,None
      49,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      50,Male,31,Doctor,7.7,7,75,6,Normal,120,80,70,8000,Sleep Apnea
      51,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None
      52,Male,32,Engineer,7.5,8,45,3,Normal,120,80,70,8000,None
      53,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      54,Male,32,Doctor,7.6,7,75,6,Normal,120,80,70,8000,None
      55,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      56,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      57,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      58,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      59,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      60,Male,32,Doctor,7.7,7,75,6,Normal,120,80,70,8000,None
      61,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      62,Male,32,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      63,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      64,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      65,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      66,Male,32,Doctor,6.2,6,30,8,Normal,125,80,72,5000,None
      67,Male,32,Accountant,7.2,8,50,6,Normal Weight,118,76,68,7000,None
      68,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,Insomnia
      69,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None
      70,Female,33,Scientist,6.2,6,50,6,Overweight,128,85,76,5500,None
      71,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      72,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      73,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      74,Male,33,Doctor,6.1,6,30,8,Normal,125,80,72,5000,None
      75,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      76,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      77,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      78,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      79,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      80,Male,33,Doctor,6,6,30,8,Normal,125,80,72,5000,None
      81,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea
      82,Female,34,Scientist,5.8,4,32,8,Overweight,131,86,81,5200,Sleep Apnea
      83,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None
      84,Male,35,Teacher,6.7,7,40,5,Overweight,128,84,70,5600,None
      85,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None
      86,Female,35,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None
      87,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None
      88,Male,35,Engineer,7.2,8,60,4,Normal,125,80,65,5000,None
      89,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      90,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      91,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      92,Male,35,Engineer,7.3,8,60,4,Normal,125,80,65,5000,None
      93,Male,35,Software Engineer,7.5,8,60,5,Normal Weight,120,80,70,8000,None
      94,Male,35,Lawyer,7.4,7,60,5,Obese,135,88,84,3300,Sleep Apnea
      95,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,Insomnia
      96,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None
      97,Female,36,Accountant,7.2,8,60,4,Normal,115,75,68,7000,None
      98,Female,36,Accountant,7.1,8,60,4,Normal,115,75,68,7000,None
      99,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None
      100,Female,36,Teacher,7.1,8,60,4,Normal,115,75,68,7000,None
      101,Female,36,Teacher,7.2,8,60,4,Normal,115,75,68,7000,None

聚类分析函数列表

聚类分析是一种数据挖掘技术,它可以将数据集中的对象自动地分为不同的组,使得同一组内的对象具有较高的相似性,而不同组之间的对象具有较低的相似性。

函数名称

语法

说明

返回值类型

clustering_centroids函数

clustering_centroids(array(array(double)) samples, integer num_of_clusters)

通过聚类分析对输入的对象属性样本数据进行分类,用户可以指定希望聚成的类组数量。在完成聚类处理后,函数将输出每个类组数据的平均值(类心)。

array(array(double))

to_cluster_label函数

to_cluster_label(array(array(double)) centroids, array(double) attribute_vector)

通过对象的属性向量及clustering_centroids函数计算的类中心,将对象分配至与其最接近的类组,并返回该类组的数字标识。

long

clustering函数

array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters)

聚合函数,通过对提供的样本进行聚类分析,并返回一个二维字符串数组,其中包含各对象及其对应的分类编号。

array(array(varchar))

clustering_centroids函数

聚类是一种无监督学习方法,将数据集分成若干个互不重叠的类,使得同一个类内的数据点相似度较高,而不同类的数据点相似度较低。clustering_centroids函数从输入的样本二维数组和聚类个数中,返回类组数据的平均值(类心)。

array(array(double)) clustering_centroids(array(array(double)) samples, long num_of_clusters)

参数

说明

samples

一个二维数组,可以看成一个二维表格,每一行数据构成一个一维数据,表示一个对象的属性向量。

num_of_clusters

期望聚和成类的个数。

参数返回各个聚类的类心。

使用示例

  • 查询和分析语句

    * | select entity_group,
        clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids
    from log
    group by entity_group
  • 返回结果

    返回由各个聚类的类心组成的二维数组,每个聚类的类心是描述一个属性向量的一维数组。

    entity_group

    cluster_centroids

    A

    [[-9.845597237715113,10.311194085779278],[-9.512707125089744,-9.890571794654335],[9.821771754807048,10.351180864562114],[10.66747750273295,-9.081622556974239],[9.361759158361564,-9.91604553609393]]

to_cluster_label函数

根据clustering_centroids函数获取类心后,使用to_cluster_label函数将对象归类到最接近的类中,to_cluster_label函数的返回值是对象的类组的数字编号。

long to_cluster_label(array(array(double)) centroids, array(double) attribute_vector)

参数

说明

centroids

函数clustering_centroids返回的类心(二维数组)中。

attribute_vector

被归类的数据集的属性向量(每个数据点的一组特征或属性)。

使用示例

  • 查询和分析语句

    * | with centroids as
    (
        select entity_group,
            clustering_centroids(array_agg(array[x0, x1]), 5) as cluster_centroids
        from log
        group by entity_group
    )
    select t1.entity_id,
        t1.entity_group,
        to_cluster_label(t2.cluster_centroids, array[t1.x0, t1.x1]) as cluster_label
    from log as t1
        join centroids as t2
        on t1.entity_group = t2.entity_group
    order by entity_id
    limit 100000
  • 返回结果

    cluster_label字段为类组的数字编号。

    entity_id

    entity_group

    cluster_label

    0

    A

    3

    1

    A

    3

    2

    A

    3

    ...

    ...

    ...

clustering函数

聚合函数,通过自动对提供的样本进行聚类分析,并返回一个二维字符串数组,其中包含各对象及其对应的分类编号。

array(array(varchar)) clustering(array(varchar) attribute_vector, array(varchar) attribute_name_vector, array(varchar) attribute_data_type_vector, integer num_of_clusters)

参数

说明

attribute_vector

将一个对象的不同属性转换为字符串后,拼装成一个一维数组。属性字段的原始类型可以是数字,也可以是离散文本。

attribute_name_vector

将属性字段名组合成一维数组,对象属性需统一转换为数字类型并进行归一化处理,以便在各阶段中使用。

attribute_data_type_vector

将属性字段类型整合为一维数组,其中对象属性统一转换为数字类型并进行归一化处理,这些步骤为后续阶段的分析提供必要信息。

属性字段类型:

  • ID_STR:对应字段是聚类对象的字符串类型的ID。

  • ID_NUM:对应字段是聚类对象的数字类型的ID。

  • X_STR_CATEGORICAL:对应字段是字符串类型的离散型数据,例如某一字段的值仅限于'Male'、'Female'或'Unknown'这几个固定选项。

  • X_NUM_CATEGORICAL:对应字段是离散型的数据,但是数据类型是数字离散型,例如某一字段的值仅限于1, 2, 3, 4, 5这几个固定选项。

  • X_NUMERIC:对应字段是连续的数字类型数据。

重要

attribute_vectorattribute_name_vectorattribute_data_type_vector三个数组的长度必须保持一致。例如有10个索引字段用于聚类分析,10个索引字段的值构成attribute_vector,10个索引字段的名字构成attribute_name_vector,10个索引字段对应的类型描述构成attribute_data_type_vector

num_of_clusters

可选参数。聚合的类数,如果不指定则会自动选择。

使用示例

  • 查询和分析语句

    clustering聚合函数进行分析,然后再用unnest函数展开结果。更多信息,请参见UNNEST子句

    * | with clustering_agg as
    (
      select group_id,
      clustering(
        array[cast(person_id as varchar),
              cast(gender as varchar),
              cast(age as varchar),
              cast(occupation as varchar),
              cast(sleep_duration as varchar),
              cast(quality_of_sleep as varchar),
              cast(physical_activity_level as varchar),
              cast(stress_level as varchar),
              cast(bmi_category as varchar),
              cast(blood_pressure_systolic as varchar),
              cast(blood_pressure_diastolic as varchar),
              cast(heart_rate as varchar),
              cast(daily_steps as varchar)], -- 对象属性内容数组,也包括对象ID字段在内,方便后面unnest函数展开以后再提取这个字段出来
        array['person_id', 'gender', 'age', 'occupation', 'sleep_duration',
              'quality_of_sleep', 'physical_activity_level', 'stress_level',
              'bmi_category', 'blood_pressure_systolic', 'blood_pressure_diastolic',
              'heart_rate', 'daily_steps'], -- 对象属性字段名称
        array['ID_NUM',
              'X_STR_CATEGORICAL',
              'X_NUMERIC',
              'X_STR_CATEGORICAL',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_STR_CATEGORICAL',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC',
              'X_NUMERIC'],  -- 对象属性字段的类型
        5  -- 要聚类成为多少个分组
      ) as clustering_outcome
      from (
        select 'G1' as group_id, -- 加一个group_id,为了展示这是一个聚合函数
            * from  log
      )as sleep_health_group_data
      group by group_id
    )
    
    select ca.group_id,
    	t. person_cluster[1] as person_id,
    	t. person_cluster[2] as cluster_id
    from clustering_agg as ca
    	cross join unnest(clustering_outcome) as t(person_cluster)
  • 返回结果

    group_id

    person_id

    cluster_id

    G1

    266

    1

    G1

    268

    1

    ...

    ...

    ...