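The Label Propagation Algorithm (LPA) is a graph-based semi-supervised learning method. Its basic idea is that the label (community) of a node depends on the labels of its neighboring nodes, with the degree of influence determined by node similarity. Labels are propagated and updated iteratively until they stabilize. This topic describes the Label Propagation Clustering component provided by PAI-Studio.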
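The propagation idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by the PAI component: edge weights stand in for node similarity, updates are applied asynchronously in sorted node order, ties are broken by the smallest label, and vertex weights are ignored. The function name `label_propagation` and the toy graph are hypothetical.

```python
from collections import defaultdict


def label_propagation(edges, max_iter=30):
    """Cluster an undirected weighted graph by label propagation.

    edges: iterable of (source, target, weight) tuples.
    Returns a dict mapping each node to its final label.
    """
    # Build a symmetric adjacency map: node -> {neighbor: weight}.
    neighbors = defaultdict(dict)
    for src, dst, weight in edges:
        neighbors[src][dst] = weight
        neighbors[dst][src] = weight

    # Every node starts in its own community.
    labels = {node: node for node in neighbors}

    for _ in range(max_iter):
        changed = False
        for node in sorted(neighbors):
            # Sum the edge weights contributed by each neighboring label.
            scores = defaultdict(float)
            for nbr, weight in neighbors[node].items():
                scores[labels[nbr]] += weight
            # Adopt the heaviest label; ties go to the smallest label
            # (an assumption that keeps this sketch deterministic).
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # labels are stable, stop early
    return labels


# Hypothetical toy graph: two triangles joined by one weak edge.
toy_edges = [
    ("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),
    ("d", "e", 1.0), ("d", "f", 1.0), ("e", "f", 1.0),
    ("c", "d", 0.1),
]
labels = label_propagation(toy_edges)
print(labels)
```

On this toy graph the two triangles end up with different labels, because the weak bridge contributes far less weight than the edges inside each triangle.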

Background information

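Graph clustering divides a graph into subgraphs based on its topology, so that nodes within a subgraph are densely connected and connections between subgraphs are sparse.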

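PAI-Studio lets you configure the parameters of the Label Propagation Clustering component either in the visualized interface or by running a PAI command.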

Visualized method

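| Tab | Parameter | Description |
| --- | --- | --- |
| Field settings | Vertex table: select vertex column | The column in the vertex table that contains the vertices. |
| | Vertex table: select weight column | The column in the vertex table that contains the vertex weights. |
| | Edge table: select source vertex column | The column in the edge table that contains the start vertices. |
| | Edge table: select target vertex column | The column in the edge table that contains the end vertices. |
| | Edge table: select weight column | The column in the edge table that contains the edge weights. |
| Parameter settings | Maximum iterations | Optional. Default value: 30. |
| Tuning | Number of processes | The number of nodes that run the job in parallel. A larger value gives a higher degree of parallelism but also increases the communication overhead of the framework. |
| | Memory per process | The maximum amount of memory that a single job can use. By default, the system allocates 4096 MB of memory to each job. If the memory actually used exceeds this value, an OutOfMemory exception is thrown. |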

PAI command method

PAI -name LabelPropagationClustering
    -project algo_public
    -DinputEdgeTableName=LabelPropagationClustering_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DinputVertexTableName=LabelPropagationClustering_func_test_node
    -DvertexCol=node
    -DoutputTableName=LabelPropagationClustering_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DhasVertexWeight=true
    -DvertexWeightCol=node_weight
    -DrandSelect=true
    -DmaxIter=100;
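| Parameter | Required | Description | Default value |
| --- | --- | --- | --- |
| inputEdgeTableName | | The name of the input edge table. | |
| inputEdgeTablePartitions | | The partitions of the input edge table. | The full table is read |
| fromVertexCol | | The column in the input edge table that contains the start vertices. | |
| toVertexCol | | The column in the input edge table that contains the end vertices. | |
| inputVertexTableName | | The name of the input vertex table. | |
| inputVertexTablePartitions | | The partitions of the input vertex table. | The full table is read |
| vertexCol | | The column in the input vertex table that contains the vertices. | |
| outputTableName | | The name of the output table. | |
| outputTablePartitions | | The partitions of the output table. | |
| lifecycle | | The lifecycle of the output table. | |
| workerNum | | The number of nodes that run the job in parallel. A larger value gives a higher degree of parallelism but also increases the communication overhead of the framework. | Not set |
| workerMem | | The maximum amount of memory that a single job can use. By default, the system allocates 4096 MB of memory to each job. If the memory actually used exceeds this value, an OutOfMemory exception is thrown. | 4096 |
| splitSize | | The size of each data split. | 64 |
| hasEdgeWeight | | Whether the edges in the input edge table have weights. | false |
| edgeWeightCol | | The column in the input edge table that contains the edge weights. | |
| hasVertexWeight | | Whether the vertices in the input vertex table have weights. | false |
| vertexWeightCol | | The column in the input vertex table that contains the vertex weights. | |
| randSelect | | Whether to select the maximum label at random. | false |
| maxIter | | The maximum number of iterations. | 30 |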
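When the component is invoked from a script, it can help to assemble the command text from a parameter mapping instead of editing the string by hand. The following is a minimal sketch: `build_pai_command` is a hypothetical helper, not part of any PAI SDK. It only produces the command text in the same `-Dkey=value` form as the command shown earlier, and it does not submit anything.

```python
def build_pai_command(name, project, params):
    """Assemble a PAI command string from a parameter dict (hypothetical helper)."""
    lines = [f"PAI -name {name}", f"    -project {project}"]
    for key, value in params.items():
        # Booleans are written in lowercase, matching the documented command.
        if isinstance(value, bool):
            value = str(value).lower()
        lines.append(f"    -D{key}={value}")
    # The documented command ends with a semicolon.
    return "\n".join(lines) + ";"


command = build_pai_command(
    "LabelPropagationClustering",
    "algo_public",
    {
        "inputEdgeTableName": "LabelPropagationClustering_func_test_edge",
        "fromVertexCol": "flow_out_id",
        "toVertexCol": "flow_in_id",
        "inputVertexTableName": "LabelPropagationClustering_func_test_node",
        "vertexCol": "node",
        "outputTableName": "LabelPropagationClustering_func_test_result",
        "hasEdgeWeight": True,
        "edgeWeightCol": "edge_weight",
        "hasVertexWeight": True,
        "vertexWeightCol": "node_weight",
        "randSelect": True,
        "maxIter": 100,
    },
)
print(command)
```

Parameters omitted from the mapping are simply left out of the command, so the defaults listed in the table apply to them.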

Example

  1. Generate training data.
    drop table if exists LabelPropagationClustering_func_test_edge;
    create table LabelPropagationClustering_func_test_edge as
    select * from
    (
        select '1' as flow_out_id,'2' as flow_in_id,0.7 as edge_weight from dual
        union all
        select '1' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight from dual
        union all
        select '1' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
        union all
        select '2' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight from dual
        union all
        select '2' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
        union all
        select '3' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
        union all
        select '4' as flow_out_id,'6' as flow_in_id,0.3 as edge_weight from dual
        union all
        select '5' as flow_out_id,'6' as flow_in_id,0.6 as edge_weight from dual
        union all
        select '5' as flow_out_id,'7' as flow_in_id,0.7 as edge_weight from dual
        union all
        select '5' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight from dual
        union all
        select '6' as flow_out_id,'7' as flow_in_id,0.6 as edge_weight from dual
        union all
        select '6' as flow_out_id,'8' as flow_in_id,0.6 as edge_weight from dual
        union all
        select '7' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight from dual
    )tmp
    ;
    drop table if exists LabelPropagationClustering_func_test_node;
    create table LabelPropagationClustering_func_test_node as
    select * from
    (
        select '1' as node,0.7 as node_weight from dual
        union all
        select '2' as node,0.7 as node_weight from dual
        union all
        select '3' as node,0.7 as node_weight from dual
        union all
        select '4' as node,0.5 as node_weight from dual
        union all
        select '5' as node,0.7 as node_weight from dual
        union all
        select '6' as node,0.5 as node_weight from dual
        union all
        select '7' as node,0.7 as node_weight from dual
        union all
        select '8' as node,0.7 as node_weight from dual
    )tmp;
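    The corresponding graph structure is shown in the following figure.
    [Figure: graph structure for label propagation clustering]
  2. Run the component on these tables, for example with the PAI command shown earlier, and view the training result.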
    +------+------------+
    | node | group_id   |
    +------+------------+
    | 1    | 1          |
    | 2    | 1          |
    | 3    | 1          |
    | 4    | 1          |
    | 5    | 5          |
    | 6    | 5          |
    | 7    | 5          |
    | 8    | 5          |
    +------+------------+
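As a sanity check on this result, the edge weights inside and between the two groups can be compared. The sketch below copies the edges and the `group_id` values from the example above. The variable names are illustrative, and the check is an informal reading of the background definition (many connections inside a subgraph, few between subgraphs), not a metric reported by the component.

```python
# Edges from the example edge table: (flow_out_id, flow_in_id, edge_weight).
edges = [
    ("1", "2", 0.7), ("1", "3", 0.7), ("1", "4", 0.6),
    ("2", "3", 0.7), ("2", "4", 0.6), ("3", "4", 0.6),
    ("4", "6", 0.3),
    ("5", "6", 0.6), ("5", "7", 0.7), ("5", "8", 0.7),
    ("6", "7", 0.6), ("6", "8", 0.6), ("7", "8", 0.7),
]

# node -> group_id, copied from the result table.
group_id = {
    "1": "1", "2": "1", "3": "1", "4": "1",
    "5": "5", "6": "5", "7": "5", "8": "5",
}

# Total weight of edges whose endpoints share a group.
intra = sum(w for s, t, w in edges if group_id[s] == group_id[t])
# Total weight of edges that cross from one group to the other.
inter = sum(w for s, t, w in edges if group_id[s] != group_id[t])
cross_edges = [(s, t) for s, t, w in edges if group_id[s] != group_id[t]]

print(f"intra-group weight: {intra:.1f}")
print(f"inter-group weight: {inter:.1f}")
print(f"cross-group edges: {cross_edges}")
```

The only edge that crosses the two groups is the one from node 4 to node 6, which also carries the smallest weight (0.3) in the edge table.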