Machine Learning PAI

Network Analysis

Last updated: 2017-06-07 13:26:11




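The algorithms in the Network Analysis panel all operate on graph data structures. The figure below shows an example analysis workflow built from the platform's network analysis components:

[Figure: example network analysis workflow]

Every algorithm component in the Network Analysis panel takes the following runtime parameters:

  • Number of workers (parameter key workerNum): the number of nodes that run the job in parallel. A larger value gives more parallelism but also increases the framework's communication overhead.
  • Worker memory (parameter key workerMem): the maximum amount of memory a single worker may use. Each worker is allocated 4096 by default (presumably MB; the source does not state the unit). If actual usage exceeds this value, an OutOfMemory exception is thrown.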

k-Core

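Function overview

  • The k-core of a graph is the subgraph that remains after repeatedly removing every node whose degree is less than or equal to k. If a node belongs to the k-core but is removed in the (k+1)-core, its coreness is k. Consequently, every node of degree 1 has coreness 0. The largest coreness among all nodes is called the coreness of the graph.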
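The pruning rule described above can be sketched in a few lines of Python. This is only an illustration and not the PAI implementation: the function name `k_core_edges` is hypothetical, and the edge list is copied from the sample table used in the example of this section. Note that it follows this document's rule (remove degree <= k), which is stricter than the common textbook convention (remove degree < k).

```python
from collections import defaultdict

def k_core_edges(edges, k):
    """Edges (both directions) left after repeatedly dropping nodes of degree <= k."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    while True:
        weak = [n for n, nbrs in adj.items() if len(nbrs) <= k]
        if not weak:
            break
        for node in weak:
            for nbr in adj.pop(node):
                if nbr in adj:  # the neighbor may already have been removed
                    adj[nbr].discard(node)
    return sorted((u, v) for u in adj for v in adj[u])

# Sample edges from KCore_func_test_edge.
EDGES = [("1", "2"), ("1", "3"), ("1", "4"), ("2", "3"), ("2", "4"),
         ("3", "4"), ("3", "5"), ("3", "6"), ("5", "6")]
core = k_core_edges(EDGES, k=2)
print(core)
```

With k = 2, nodes 5 and 6 (degree 2) are removed first, and nodes 1 to 4 remain as a fully connected subgraph.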

Parameter settings

k: the core number; required; default 3

Example

Test data

SQL to create the data:

    -- Edge table for the k-Core example, one row per edge.
    drop table if exists KCore_func_test_edge;
    create table KCore_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id from dual
    )tmp;

Graph structure of the data: [Figure: graph]

Results

With k = 2, the output is:

    +-------+-------+
    | node1 | node2 |
    +-------+-------+
    | 1 | 2 |
    | 1 | 3 |
    | 1 | 4 |
    | 2 | 1 |
    | 2 | 3 |
    | 2 | 4 |
    | 3 | 1 |
    | 3 | 2 |
    | 3 | 4 |
    | 4 | 1 |
    | 4 | 2 |
    | 4 | 3 |
    +-------+-------+

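PAI command example

    pai -name KCore
    -project algo_public
    -DinputEdgeTableName=KCore_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=KCore_func_test_result
    -Dk=2;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the edge table | Required | - |
    | toVertexCol | Column holding the destination vertex in the edge table | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    | k | Core number | Required | 3 |
    +-----------+-------------+----------+---------+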

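Single-Source Shortest Path

Function overview

  • The single-source shortest path component is based on Dijkstra's algorithm. Given a start node, it outputs the shortest path from that node to every other node.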
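As a rough illustration of the idea, the following Python sketch runs Dijkstra's algorithm on the sample edges of this section. Several points are assumptions rather than documented behavior: the function name `sssp` is hypothetical, edges are treated as directed from flow_out_id to flow_in_id, and the distance_cnt column is interpreted as the number of distinct shortest paths (reported as 0 for the start node), which is inferred from the sample output only.

```python
import heapq
from collections import defaultdict

def sssp(edges, start):
    """Dijkstra from `start`; returns {node: (distance, number_of_shortest_paths)}."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
    dist = {start: 0.0}
    paths = {start: 1}
    done = set()
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj[u]:
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd          # strictly better path found
                paths[v] = paths[u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                paths[v] += paths[u]  # another path of the same length
    result = {n: (dist[n], paths[n]) for n in dist}
    result[start] = (0.0, 0)          # mirror the sample output for the start node
    return result

# Sample edges from SSSP_func_test_edge.
EDGES = [("a", "b", 1.0), ("b", "c", 2.0), ("c", "d", 1.0), ("b", "e", 2.0),
         ("e", "d", 1.0), ("c", "e", 1.0), ("f", "g", 3.0), ("a", "d", 4.0)]
res = sssp(EDGES, "a")
for node in sorted(res):
    print("a", node, *res[node])
```

Nodes f and g are unreachable from a, so they do not appear in the result.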

Parameter settings

Start node ID: the node from which shortest paths are computed; required

Example

Test data

SQL to create the data:

    -- Weighted edge table for the SSSP example.
    drop table if exists SSSP_func_test_edge;
    create table SSSP_func_test_edge as
    select
      flow_out_id,flow_in_id,edge_weight
    from
    (
      select "a" as flow_out_id,"b" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "b" as flow_out_id,"c" as flow_in_id,2.0 as edge_weight from dual
      union all
      select "c" as flow_out_id,"d" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "b" as flow_out_id,"e" as flow_in_id,2.0 as edge_weight from dual
      union all
      select "e" as flow_out_id,"d" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "c" as flow_out_id,"e" as flow_in_id,1.0 as edge_weight from dual
      union all
      select "f" as flow_out_id,"g" as flow_in_id,3.0 as edge_weight from dual
      union all
      select "a" as flow_out_id,"d" as flow_in_id,4.0 as edge_weight from dual
    ) tmp
    ;

Graph structure of the data: [Figure: graph]

Results

The output is:

    +------------+------------+------------+--------------+
    | start_node | dest_node | distance | distance_cnt |
    +------------+------------+------------+--------------+
    | a | b | 1.0 | 1 |
    | a | c | 3.0 | 1 |
    | a | d | 4.0 | 3 |
    | a | a | 0.0 | 0 |
    | a | e | 3.0 | 1 |
    +------------+------------+------------+--------------+

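PAI command example

    pai -name SSSP
    -project algo_public
    -DinputEdgeTableName=SSSP_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=SSSP_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DstartVertex=a;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the input edge table | Required | - |
    | toVertexCol | Column holding the destination vertex in the input edge table | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    | startVertex | ID of the start node | Required | - |
    | hasEdgeWeight | Whether the edges in the input edge table have weights | Optional | false |
    | edgeWeightCol | Column holding the edge weight in the input edge table | Optional | - |
    +-----------+-------------+----------+---------+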

PageRank

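Function overview

PageRank originated in web search ranking: Google uses the link structure of the web to compute a rank for every page. The basic idea is that a page pointed to by many other pages is likely to be important or of high quality. Besides the number of links, the algorithm also takes into account the weight of the page itself and how many outgoing links the page has to other pages.

In a social network of users, edge weight matters in addition to each user's own influence. For example, a Sina Weibo user more easily influences followers with whom the relationship is close (family, classmates, colleagues) and has less influence on weakly connected strangers. In such a network, the edge weight corresponds to the strength of the user-to-user relationship. The PageRank formula with link weights is:

[Formula image]

where w(i) is the weight of node i, c(A,i) is the link weight, and d is the damping factor. Once the iteration stabilizes, the node weight W is the influence index of each user.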
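Because the formula itself is only available as an image, the sketch below is a guess at one form that is consistent with the symbols named above: W(A) = (1 - d) / N + d * sum over in-links i of w(i) * c(A,i). The constant d = 0.85, the (1 - d) / N base term, and the absence of any normalization by out-degree are all assumptions; they were chosen because they reproduce the sample output of this section, not because the source states them. The function name `weighted_pagerank` is hypothetical.

```python
def weighted_pagerank(edges, d=0.85, max_iter=30, tol=1e-9):
    """Iterate W(A) = (1 - d) / N + d * sum(w(i) * c(A, i)) over in-links i (assumed form)."""
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    base = (1.0 - d) / len(nodes)
    w = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        new = {n: base for n in nodes}
        for u, v, c in edges:
            new[v] += d * w[u] * c   # contribution of in-link u -> v
        delta = max(abs(new[n] - w[n]) for n in nodes)
        w = new
        if delta < tol:              # stop once the weights are stable
            break
    return w

# Sample edges from PageRankWithWeight_func_test_edge.
EDGES = [("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),
         ("b", "d", 1.0), ("c", "d", 1.0)]
ranks = weighted_pagerank(EDGES)
for node in sorted(ranks):
    print(node, round(ranks[node], 5))
```

On this small acyclic graph the values settle after a handful of iterations, well before the default limit of 30.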

Parameter settings

Maximum iterations: optional, default 30. The algorithm stops on its own once it converges.

Example

Test data

SQL to create the data:

    -- Weighted edge table for the PageRank example.
    drop table if exists PageRankWithWeight_func_test_edge;
    create table PageRankWithWeight_func_test_edge as
    select * from
    (
      select 'a' as flow_out_id,'b' as flow_in_id,1.0 as weight from dual
      union all
      select 'a' as flow_out_id,'c' as flow_in_id,1.0 as weight from dual
      union all
      select 'b' as flow_out_id,'c' as flow_in_id,1.0 as weight from dual
      union all
      select 'b' as flow_out_id,'d' as flow_in_id,1.0 as weight from dual
      union all
      select 'c' as flow_out_id,'d' as flow_in_id,1.0 as weight from dual
    )tmp
    ;

Graph structure of the data: [Figure: graph]

Results

The output is:

    +------+------------+
    | node | weight |
    +------+------------+
    | a | 0.0375 |
    | b | 0.06938 |
    | c | 0.12834 |
    | d | 0.20556 |
    +------+------------+

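PAI command example

    pai -name PageRankWithWeight
    -project algo_public
    -DinputEdgeTableName=PageRankWithWeight_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=PageRankWithWeight_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=weight
    -DmaxIter=100;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the input edge table | Required | - |
    | toVertexCol | Column holding the destination vertex in the input edge table | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    | hasEdgeWeight | Whether the edges in the input edge table have weights | Optional | false |
    | edgeWeightCol | Column holding the edge weight in the input edge table | Optional | - |
    | maxIter | Maximum number of iterations | Optional | 30 |
    +-----------+-------------+----------+---------+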

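Label Propagation Clustering

Function overview

Graph clustering partitions a graph into subgraphs according to its topology, so that nodes inside a subgraph are densely linked while links between subgraphs are sparse. The Label Propagation Algorithm (LPA) is a graph-based semi-supervised learning method. Its basic idea is that a node's label (community) depends on the labels of its neighbors, with the degree of influence determined by node similarity; labels are updated by iterative propagation until they stabilize.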
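The propagation loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the PAI implementation: the function name `label_propagation` is hypothetical, vertex weights and the randSelect option are ignored, nodes are updated one at a time in sorted order, and ties are broken by the smallest label. The group ids it produces are therefore arbitrary and need not match the ids in PAI's output, although the partition can be compared.

```python
from collections import defaultdict

def label_propagation(edges, max_iter=30):
    """Each node repeatedly adopts the label with the largest total edge weight among its neighbors."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    labels = {n: n for n in adj}  # every node starts in its own community
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):
            score = defaultdict(float)
            for nbr, w in adj[node].items():
                score[labels[nbr]] += w
            # Highest score wins; ties go to the smallest label (an assumption).
            best = min(score, key=lambda lab: (-score[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels

# Sample edges from LabelPropagationClustering_func_test_edge.
EDGES = [("1", "2", 0.7), ("1", "3", 0.7), ("1", "4", 0.6), ("2", "3", 0.7),
         ("2", "4", 0.6), ("3", "4", 0.6), ("4", "6", 0.3), ("5", "6", 0.6),
         ("5", "7", 0.7), ("5", "8", 0.7), ("6", "7", 0.6), ("6", "8", 0.6),
         ("7", "8", 0.7)]
groups = label_propagation(EDGES)
print(groups)
```

On the sample data the sketch separates nodes 1 to 4 from nodes 5 to 8, the same partition as in the result table of this section.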

Parameter settings

Maximum iterations: optional, default 30

Example

Test data

SQL to create the data:

    -- Weighted edge table for the label propagation clustering example.
    drop table if exists LabelPropagationClustering_func_test_edge;
    create table LabelPropagationClustering_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '2' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '4' as flow_out_id,'6' as flow_in_id,0.3 as edge_weight from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '5' as flow_out_id,'7' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '5' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight from dual
      union all
      select '6' as flow_out_id,'7' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '6' as flow_out_id,'8' as flow_in_id,0.6 as edge_weight from dual
      union all
      select '7' as flow_out_id,'8' as flow_in_id,0.7 as edge_weight from dual
    )tmp
    ;
    -- Vertex table with one weight per node.
    drop table if exists LabelPropagationClustering_func_test_node;
    create table LabelPropagationClustering_func_test_node as
    select * from
    (
      select '1' as node,0.7 as node_weight from dual
      union all
      select '2' as node,0.7 as node_weight from dual
      union all
      select '3' as node,0.7 as node_weight from dual
      union all
      select '4' as node,0.5 as node_weight from dual
      union all
      select '5' as node,0.7 as node_weight from dual
      union all
      select '6' as node,0.5 as node_weight from dual
      union all
      select '7' as node,0.7 as node_weight from dual
      union all
      select '8' as node,0.7 as node_weight from dual
    )tmp
    ;

Graph structure of the data: [Figure: graph]

Results

The output is:

    +------+------------+
    | node | group_id |
    +------+------------+
    | 1 | 1 |
    | 2 | 1 |
    | 3 | 1 |
    | 4 | 1 |
    | 5 | 5 |
    | 6 | 5 |
    | 7 | 5 |
    | 8 | 5 |
    +------+------------+

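PAI command example

    pai -name LabelPropagationClustering
    -project algo_public
    -DinputEdgeTableName=LabelPropagationClustering_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DinputVertexTableName=LabelPropagationClustering_func_test_node
    -DvertexCol=node
    -DoutputTableName=LabelPropagationClustering_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DhasVertexWeight=true
    -DvertexWeightCol=node_weight
    -DrandSelect=true
    -DmaxIter=100;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the input edge table | Required | - |
    | toVertexCol | Column holding the destination vertex in the input edge table | Required | - |
    | inputVertexTableName | Name of the input vertex table | Required | - |
    | inputVertexTablePartitions | Partitions of the input vertex table | Optional | Entire table |
    | vertexCol | Column holding the vertex in the input vertex table | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    | hasEdgeWeight | Whether the edges in the input edge table have weights | Optional | false |
    | edgeWeightCol | Column holding the edge weight in the input edge table | Optional | - |
    | hasVertexWeight | Whether the vertices in the input vertex table have weights | Optional | false |
    | vertexWeightCol | Column holding the vertex weight in the input vertex table | Optional | - |
    | randSelect | Whether to pick the maximum label at random | Optional | false |
    | maxIter | Maximum number of iterations | Optional | 30 |
    +-----------+-------------+----------+---------+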

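Label Propagation Classification

Function overview

This is a semi-supervised classification algorithm: it uses the labels of labeled nodes to predict the labels of unlabeled nodes.

During execution, each node's label is propagated to its neighbors according to similarity. At every propagation step, each node updates its own label from the labels of its neighbors: the more similar a neighbor is to the node, the larger that neighbor's influence on the node's label, so similar nodes tend toward the same label and labels spread more easily between them. Throughout propagation the labels of the labeled data are kept fixed, so that they act as sources pushing labels to the unlabeled data.

When the iteration ends, similar nodes have similar probability distributions and can be assigned to the same class, which completes the label propagation process.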
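A minimal sketch of this process is shown below. It rests on assumptions inferred from the sample output of this section rather than on documented behavior: labels flow along directed edges from flow_out_id to flow_in_id, each unlabeled node takes the in-weight-normalized average of its in-neighbors' label distributions, labeled nodes stay fixed at probability 1.0, and the damping factor alpha is not modeled. The function name `propagate_labels` is hypothetical.

```python
from collections import defaultdict

def propagate_labels(edges, seeds, max_iter=30, epsilon=1e-6):
    """edges: (src, dst, weight); seeds: {node: label}. Returns {node: {label: probability}}."""
    incoming = defaultdict(list)
    nodes = set(seeds)
    for u, v, w in edges:
        incoming[v].append((u, w))
        nodes.update((u, v))
    labels = sorted(set(seeds.values()))
    prob = {n: {lab: 0.0 for lab in labels} for n in nodes}
    for n, lab in seeds.items():
        prob[n][lab] = 1.0
    for _ in range(max_iter):
        delta = 0.0
        new = {}
        for n in nodes:
            if n in seeds or not incoming[n]:
                new[n] = prob[n]  # labeled nodes (and nodes without in-links) stay fixed
                continue
            total = sum(w for _, w in incoming[n])
            new[n] = {lab: sum(w * prob[u][lab] for u, w in incoming[n]) / total
                      for lab in labels}
            delta = max(delta, max(abs(new[n][lab] - prob[n][lab]) for lab in labels))
        prob = new
        if delta < epsilon:
            break
    return prob

# Sample data from the LabelPropagationClassification test tables.
EDGES = [("a", "b", 0.2), ("a", "c", 0.8), ("b", "c", 1.0), ("d", "b", 1.0)]
SEEDS = {"a": "X", "d": "Y"}
result = propagate_labels(EDGES, SEEDS)
for node in sorted(result):
    print(node, {lab: round(p, 5) for lab, p in result[node].items()})
```

Under these assumptions node b is pulled mostly toward Y by the heavier edge from d, while node c ends up split between X and Y.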

Parameter settings

Damping factor: default 0.8
Convergence threshold: default 0.000001

Example

Test data

SQL to create the data:

    -- Weighted edge table for the label propagation classification example.
    drop table if exists LabelPropagationClassification_func_test_edge;
    create table LabelPropagationClassification_func_test_edge as
    select * from
    (
      select 'a' as flow_out_id, 'b' as flow_in_id, 0.2 as edge_weight from dual
      union all
      select 'a' as flow_out_id, 'c' as flow_in_id, 0.8 as edge_weight from dual
      union all
      select 'b' as flow_out_id, 'c' as flow_in_id, 1.0 as edge_weight from dual
      union all
      select 'd' as flow_out_id, 'b' as flow_in_id, 1.0 as edge_weight from dual
    )tmp
    ;
    -- Vertex table holding the labeled nodes.
    drop table if exists LabelPropagationClassification_func_test_node;
    create table LabelPropagationClassification_func_test_node as
    select * from
    (
      select 'a' as node,'X' as label, 1.0 as label_weight from dual
      union all
      select 'd' as node,'Y' as label, 1.0 as label_weight from dual
    )tmp
    ;

Graph structure of the data: [Figure: graph]

Results

The output is:

    +------+-----+------------+
    | node | tag | weight |
    +------+-----+------------+
    | a | X | 1.0 |
    | b | X | 0.16667 |
    | b | Y | 0.83333 |
    | c | X | 0.53704 |
    | c | Y | 0.46296 |
    | d | Y | 1.0 |
    +------+-----+------------+

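PAI command example

    pai -name LabelPropagationClassification
    -project algo_public
    -DinputEdgeTableName=LabelPropagationClassification_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DinputVertexTableName=LabelPropagationClassification_func_test_node
    -DvertexCol=node
    -DvertexLabelCol=label
    -DoutputTableName=LabelPropagationClassification_func_test_result
    -DhasEdgeWeight=true
    -DedgeWeightCol=edge_weight
    -DhasVertexWeight=true
    -DvertexWeightCol=label_weight
    -Dalpha=0.8
    -Depsilon=0.000001;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the input edge table | Required | - |
    | toVertexCol | Column holding the destination vertex in the input edge table | Required | - |
    | inputVertexTableName | Name of the input vertex table | Required | - |
    | inputVertexTablePartitions | Partitions of the input vertex table | Optional | Entire table |
    | vertexCol | Column holding the vertex in the input vertex table | Required | - |
    | vertexLabelCol | Column holding the vertex label in the input vertex table | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    | hasEdgeWeight | Whether the edges in the input edge table have weights | Optional | false |
    | edgeWeightCol | Column holding the edge weight in the input edge table | Optional | - |
    | hasVertexWeight | Whether the vertices in the input vertex table have weights | Optional | false |
    | vertexWeightCol | Column holding the vertex weight in the input vertex table | Optional | - |
    | alpha | Damping factor | Optional | 0.8 |
    | epsilon | Convergence threshold | Optional | 0.000001 |
    | maxIter | Maximum number of iterations | Optional | 30 |
    +-----------+-------------+----------+---------+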

Modularity

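Function overview

  • Modularity is a metric for evaluating community structure in a network: it measures how tightly knit the communities produced by a partition are. A value above 0.3 usually indicates a fairly clear community structure.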
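For intuition, the sketch below computes the standard unweighted modularity Q = sum over communities c of (L_c / m - (D_c / 2m)^2), where m is the number of edges, L_c the number of edges inside community c, and D_c the total degree of its nodes. Whether PAI uses exactly this variant is an assumption; it does reproduce the sample result of this section when applied to the label propagation clustering edges with the two groups from that section's output. The function name `modularity` is hypothetical.

```python
from collections import defaultdict

def modularity(edges, group):
    """Unweighted modularity of an undirected graph for a given node -> group mapping."""
    m = len(edges)
    inside = defaultdict(int)  # edges with both endpoints in the same group
    degree = defaultdict(int)  # total degree per group
    for u, v in edges:
        degree[group[u]] += 1
        degree[group[v]] += 1
        if group[u] == group[v]:
            inside[group[u]] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Edges of the label propagation clustering example (weights dropped).
EDGES = [("1", "2"), ("1", "3"), ("1", "4"), ("2", "3"), ("2", "4"), ("3", "4"),
         ("4", "6"), ("5", "6"), ("5", "7"), ("5", "8"), ("6", "7"), ("6", "8"),
         ("7", "8")]
# Groups taken from the label propagation clustering result.
GROUP = {"1": "1", "2": "1", "3": "1", "4": "1",
         "5": "5", "6": "5", "7": "5", "8": "5"}
q = modularity(EDGES, GROUP)
print(round(q, 7))
```

Each group contains 6 of the 13 edges and half of the total degree, so Q = 2 * (6/13 - 0.25) = 11/26.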

Example

Test data

Omitted (same data as for the label propagation clustering algorithm)

Results

The output is:

    +--------------+
    | val |
    +--------------+
    | 0.4230769 |
    +--------------+

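PAI command example

    pai -name Modularity
    -project algo_public
    -DinputEdgeTableName=Modularity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DfromGroupCol=group_out_id
    -DtoVertexCol=flow_in_id
    -DtoGroupCol=group_in_id
    -DoutputTableName=Modularity_func_test_result;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the input edge table | Required | - |
    | fromGroupCol | Column holding the group of the source vertex | Required | - |
    | toVertexCol | Column holding the destination vertex in the input edge table | Required | - |
    | toGroupCol | Column holding the group of the destination vertex | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    +-----------+-------------+----------+---------+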

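Maximal Connected Component

Function overview

In an undirected graph G, vertices A and B are connected if there is a path between them. If G contains several subgraphs such that all vertices within each subgraph are connected to one another, while no vertex is connected to a vertex in a different subgraph, these subgraphs are called the maximal connected subgraphs (connected components) of G.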
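The definition above can be illustrated with a short traversal-based sketch. The function name `connected_components` is hypothetical, and labeling each component with its largest node ID is an assumption inferred from the sample output of this section (group ids 4 and c), not documented behavior.

```python
from collections import defaultdict

def connected_components(edges):
    """Map every node to a group id, here the largest node id in its component."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    group = {}
    for start in adj:
        if start in group:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nbr in adj[node]:
                if nbr not in component:
                    component.add(nbr)
                    stack.append(nbr)
        gid = max(component)
        for node in component:
            group[node] = gid
    return group

# Sample edges from MaximalConnectedComponent_func_test_edge.
EDGES = [("1", "2"), ("2", "3"), ("3", "4"), ("1", "4"), ("a", "b"), ("b", "c")]
groups = connected_components(EDGES)
for node in sorted(groups):
    print(node, groups[node])
```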

Parameter settings

Example

Test data

SQL to create the data:

    -- Edge table with two separate components.
    drop table if exists MaximalConnectedComponent_func_test_edge;
    create table MaximalConnectedComponent_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id from dual
      union all
      select 'a' as flow_out_id,'b' as flow_in_id from dual
      union all
      select 'b' as flow_out_id,'c' as flow_in_id from dual
    )tmp;
    -- Empty result table for the component's output.
    drop table if exists MaximalConnectedComponent_func_test_result;
    create table MaximalConnectedComponent_func_test_result
    (
      node string,
      grp_id string
    );

Graph structure of the data: [Figure: graph]

Results

The output is:

    +-------+-------+
    | node | grp_id|
    +-------+-------+
    | 1 | 4 |
    | 2 | 4 |
    | 3 | 4 |
    | 4 | 4 |
    | a | c |
    | b | c |
    | c | c |
    +-------+-------+

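PAI command example

    pai -name MaximalConnectedComponent
    -project algo_public
    -DinputEdgeTableName=MaximalConnectedComponent_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=MaximalConnectedComponent_func_test_result;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the input edge table | Required | - |
    | toVertexCol | Column holding the destination vertex in the input edge table | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    +-----------+-------------+----------+---------+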

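Node Clustering Coefficient

Function overview

For an undirected graph G, this component computes the density around each node. A star network has density 0 and a fully connected network has density 1.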
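One way to read "density around a node" is the local clustering coefficient: the number of edges among a node's neighbors divided by the number of possible neighbor pairs. The sketch below uses that reading, which matches the node_cnt, edge_cnt and density columns of the sample output in this section. The function name `node_density` is hypothetical, the log_density column is left out because its formula is not documented, and the maxEdgeCnt sampling is not modeled.

```python
from collections import defaultdict
from itertools import combinations

def node_density(edges):
    """Return {node: (neighbor_count, edges_among_neighbors, density)}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    result = {}
    for node, nbrs in adj.items():
        n = len(nbrs)
        links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        possible = n * (n - 1) / 2
        result[node] = (n, links, links / possible if possible else 0.0)
    return result

# Sample edges from NodeDensity_func_test_edge.
EDGES = [("1", "2"), ("1", "3"), ("1", "4"), ("1", "5"), ("1", "6"), ("2", "3"),
         ("3", "4"), ("4", "5"), ("5", "6"), ("5", "7"), ("6", "7")]
density = node_density(EDGES)
for node in sorted(density):
    n, links, value = density[node]
    print(node, n, links, round(value, 5))
```

For node 1, four of the ten possible pairs among its five neighbors are linked, which gives 0.4.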

Parameter settings

maxEdgeCnt: if a node's degree exceeds this value, its edges are sampled; optional; default 500.

Example

Test data

SQL to create the data:

    -- Edge table for the node clustering coefficient example.
    drop table if exists NodeDensity_func_test_edge;
    create table NodeDensity_func_test_edge as
    select * from
    (
      select '1' as flow_out_id, '2' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '3' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '5' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '6' as flow_in_id from dual
      union all
      select '2' as flow_out_id, '3' as flow_in_id from dual
      union all
      select '3' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '4' as flow_out_id, '5' as flow_in_id from dual
      union all
      select '5' as flow_out_id, '6' as flow_in_id from dual
      union all
      select '5' as flow_out_id, '7' as flow_in_id from dual
      union all
      select '6' as flow_out_id, '7' as flow_in_id from dual
    )tmp;
    -- Empty result table for the component's output.
    drop table if exists NodeDensity_func_test_result;
    create table NodeDensity_func_test_result
    (
      node string,
      node_cnt bigint,
      edge_cnt bigint,
      density double,
      log_density double
    );

Graph structure of the data: [Figure: graph]

Results

The output is (columns: node, node_cnt, edge_cnt, density, log_density):

    1,5,4,0.4,1.45657
    2,2,1,1.0,1.24696
    3,3,2,0.66667,1.35204
    4,3,2,0.66667,1.35204
    5,4,3,0.5,1.41189
    6,3,2,0.66667,1.35204
    7,2,1,1.0,1.24696

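PAI command example

    pai -name NodeDensity
    -project algo_public
    -DinputEdgeTableName=NodeDensity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=NodeDensity_func_test_result
    -DmaxEdgeCnt=500;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the input edge table | Required | - |
    | toVertexCol | Column holding the destination vertex in the input edge table | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | maxEdgeCnt | If a node's degree exceeds this value, its edges are sampled | Optional | 500 |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    +-----------+-------------+----------+---------+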

Edge Clustering Coefficient

Function overview

For an undirected graph G, this component computes the density around each edge.
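The source does not spell out the density formula. Judging from the sample output of this section, it appears to be the number of triangles through the edge divided by the smaller of the two endpoint degrees; the sketch below uses that inferred formula, so treat it as an assumption. The function name `edge_density` is hypothetical.

```python
from collections import defaultdict

def edge_density(edges):
    """Return {(u, v): (deg_u, deg_v, triangle_count, density)} for every input edge."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    result = {}
    for u, v in edges:
        triangles = len(adj[u] & adj[v])  # common neighbors close a triangle
        deg_u, deg_v = len(adj[u]), len(adj[v])
        # Assumed formula: triangles / min(degree of u, degree of v).
        result[(u, v)] = (deg_u, deg_v, triangles, triangles / min(deg_u, deg_v))
    return result

# Sample edges from EdgeDensity_func_test_edge.
EDGES = [("1", "2"), ("1", "3"), ("1", "5"), ("1", "7"), ("2", "5"), ("2", "4"),
         ("2", "3"), ("3", "5"), ("3", "4"), ("4", "5"), ("4", "8"), ("5", "6"),
         ("5", "7"), ("5", "8"), ("7", "6"), ("6", "8")]
density = edge_density(EDGES)
for (u, v), row in density.items():
    print(u, v, *row[:3], round(row[3], 5))
```

The sample output lists some edges with the endpoints swapped (for example 3,1 instead of 1,3); the sketch keeps the orientation of the input rows.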

Parameter settings

Example

Test data

SQL to create the data:

    -- Edge table for the edge clustering coefficient example.
    drop table if exists EdgeDensity_func_test_edge;
    create table EdgeDensity_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'7' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '4' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '4' as flow_out_id,'8' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'7' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'8' as flow_in_id from dual
      union all
      select '7' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '6' as flow_out_id,'8' as flow_in_id from dual
    )tmp;
    -- Empty result table for the component's output.
    drop table if exists EdgeDensity_func_test_result;
    create table EdgeDensity_func_test_result
    (
      node1 string,
      node2 string,
      node1_edge_cnt bigint,
      node2_edge_cnt bigint,
      triangle_cnt bigint,
      density double
    );

Graph structure of the data: [Figure: graph]

Results

The output is (columns: node1, node2, node1_edge_cnt, node2_edge_cnt, triangle_cnt, density):

    1,2,4,4,2,0.5
    2,3,4,4,3,0.75
    2,5,4,7,3,0.75
    3,1,4,4,2,0.5
    3,4,4,4,2,0.5
    4,2,4,4,2,0.5
    4,5,4,7,3,0.75
    5,1,7,4,3,0.75
    5,3,7,4,3,0.75
    5,6,7,3,2,0.66667
    5,8,7,3,2,0.66667
    6,7,3,3,1,0.33333
    7,1,3,4,1,0.33333
    7,5,3,7,2,0.66667
    8,4,3,4,1,0.33333
    8,6,3,3,1,0.33333

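PAI command example

    pai -name EdgeDensity
    -project algo_public
    -DinputEdgeTableName=EdgeDensity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=EdgeDensity_func_test_result;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the input edge table | Required | - |
    | toVertexCol | Column holding the destination vertex in the input edge table | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    +-----------+-------------+----------+---------+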

Triangle Count

Function overview

For an undirected graph G, this component outputs all triangles.
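A simple way to enumerate triangles is to look, for every edge (u, v), at the common neighbors of u and v, and to keep only triples with u < v < w so that each triangle is reported once. The sketch below does this on the sample edges of this section; the function name `list_triangles` is hypothetical and the maxEdgeCnt sampling is not modeled.

```python
from collections import defaultdict

def list_triangles(edges):
    """Return every triangle once, as a sorted tuple (u, v, w) with u < v < w."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    found = []
    for u in sorted(adj):
        for v in sorted(n for n in adj[u] if n > u):
            # Any common neighbor larger than v closes a not-yet-seen triangle.
            for w in sorted(n for n in adj[u] & adj[v] if n > v):
                found.append((u, v, w))
    return found

# Sample edges from TriangleCount_func_test_edge.
EDGES = [("1", "2"), ("1", "3"), ("1", "4"), ("1", "5"), ("1", "6"), ("2", "3"),
         ("3", "4"), ("4", "5"), ("5", "6"), ("5", "7"), ("6", "7")]
triangles = list_triangles(EDGES)
for t in triangles:
    print(",".join(t))
```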

Parameter settings

maxEdgeCnt: if a node's degree exceeds this value, its edges are sampled; optional; default 500.

Example

Test data

SQL to create the data:

    -- Edge table for the triangle count example.
    drop table if exists TriangleCount_func_test_edge;
    create table TriangleCount_func_test_edge as
    select * from
    (
      select '1' as flow_out_id,'2' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '1' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '2' as flow_out_id,'3' as flow_in_id from dual
      union all
      select '3' as flow_out_id,'4' as flow_in_id from dual
      union all
      select '4' as flow_out_id,'5' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'6' as flow_in_id from dual
      union all
      select '5' as flow_out_id,'7' as flow_in_id from dual
      union all
      select '6' as flow_out_id,'7' as flow_in_id from dual
    )tmp;
    -- Empty result table for the component's output.
    drop table if exists TriangleCount_func_test_result;
    create table TriangleCount_func_test_result
    (
      node1 string,
      node2 string,
      node3 string
    );

Graph structure of the data: [Figure: graph]

Results

The output is (columns: node1, node2, node3):

    1,2,3
    1,3,4
    1,4,5
    1,5,6
    5,6,7

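PAI command example

    pai -name TriangleCount
    -project algo_public
    -DinputEdgeTableName=TriangleCount_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=TriangleCount_func_test_result;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the input edge table | Required | - |
    | toVertexCol | Column holding the destination vertex in the input edge table | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | maxEdgeCnt | If a node's degree exceeds this value, its edges are sampled | Optional | 500 |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    +-----------+-------------+----------+---------+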

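Tree Depth

Function overview

For a collection of tree-shaped networks, this component outputs the depth of each node and the ID of the tree it belongs to.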
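The computation can be sketched as a breadth-first traversal from every root. The assumptions here are that edges point from parent (flow_out_id) to child (flow_in_id), that a root is any node without an incoming edge, and that the tree ID is the root's ID, as the root column of the result table in this section suggests. The function name `tree_depth` is hypothetical.

```python
from collections import defaultdict, deque

def tree_depth(edges):
    """Return {node: (root, depth)} using breadth-first search from each root."""
    children = defaultdict(list)
    has_parent = set()
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        has_parent.add(child)
        nodes.update((parent, child))
    result = {}
    for root in sorted(nodes - has_parent):
        result[root] = (root, 0)
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for child in children[node]:
                if child not in result:  # keep the first (shallowest) depth found
                    result[child] = (root, result[node][1] + 1)
                    queue.append(child)
    return result

# Sample edges from TreeDepth_func_test_edge.
EDGES = [("0", "1"), ("0", "2"), ("1", "3"), ("1", "4"), ("2", "4"), ("2", "5"),
         ("4", "6"), ("a", "b"), ("a", "c"), ("c", "d"), ("c", "e")]
depths = tree_depth(EDGES)
for node in sorted(depths):
    print(node, *depths[node])
```

Node 4 has two parents in the sample data; breadth-first search assigns it depth 2 through either of them.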

Parameter settings

Example

Test data

SQL to create the data:

    -- Edge table holding two trees, rooted at 0 and at a.
    drop table if exists TreeDepth_func_test_edge;
    create table TreeDepth_func_test_edge as
    select * from
    (
      select '0' as flow_out_id, '1' as flow_in_id from dual
      union all
      select '0' as flow_out_id, '2' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '3' as flow_in_id from dual
      union all
      select '1' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '2' as flow_out_id, '4' as flow_in_id from dual
      union all
      select '2' as flow_out_id, '5' as flow_in_id from dual
      union all
      select '4' as flow_out_id, '6' as flow_in_id from dual
      union all
      select 'a' as flow_out_id, 'b' as flow_in_id from dual
      union all
      select 'a' as flow_out_id, 'c' as flow_in_id from dual
      union all
      select 'c' as flow_out_id, 'd' as flow_in_id from dual
      union all
      select 'c' as flow_out_id, 'e' as flow_in_id from dual
    )tmp;
    -- Empty result table for the component's output.
    drop table if exists TreeDepth_func_test_result;
    create table TreeDepth_func_test_result
    (
      node string,
      root string,
      depth bigint
    );

Graph structure of the data: [Figure: graph]

Results

The output is (columns: node, root, depth):

    0,0,0
    1,0,1
    2,0,1
    3,0,2
    4,0,2
    5,0,2
    6,0,3
    a,a,0
    b,a,1
    c,a,1
    d,a,2
    e,a,2

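PAI command example

    pai -name TreeDepth
    -project algo_public
    -DinputEdgeTableName=TreeDepth_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=TreeDepth_func_test_result;

Algorithm parameters

    +-----------+-------------+----------+---------+
    | Parameter | Description | Required | Default |
    +-----------+-------------+----------+---------+
    | inputEdgeTableName | Name of the input edge table | Required | - |
    | inputEdgeTablePartitions | Partitions of the input edge table | Optional | Entire table |
    | fromVertexCol | Column holding the source vertex in the input edge table | Required | - |
    | toVertexCol | Column holding the destination vertex in the input edge table | Required | - |
    | outputTableName | Name of the output table | Required | - |
    | outputTablePartitions | Partitions of the output table | Optional | - |
    | lifecycle | Lifecycle of the output table | Optional | - |
    | workerNum | Number of workers | Optional | Not set |
    | workerMem | Worker memory | Optional | 4096 |
    | splitSize | Data split size | Optional | 64 |
    +-----------+-------------+----------+---------+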