Vertex clustering coefficient

更新时间:
复制 MD 格式

The Vertex Clustering Coefficient measures how tightly a node's neighbors are connected to each other — that is, the likelihood that any two of a node's neighbors are also directly connected. It is the ratio of actual edges among a node's neighbors to the maximum possible edges between them.

The coefficient ranges from 0 to 1:

  • 0: None of the node's neighbors are connected to each other (e.g., the hub of a star network).

  • 1: All of a node's neighbors are fully connected to each other (e.g., any node in a fully connected network).

How it works

For each node u in an undirected graph, the clustering coefficient is:

$$C_u = \frac{2T(u)}{deg(u)(deg(u)-1)}$$

Where:

  • T(u) — the number of triangles passing through node u (pairs of neighbors that are also directly connected to each other)

  • deg(u) — the degree of node u (number of direct neighbors)

Network topology examples

The algorithm produces boundary values on two canonical graph structures:

Network type Clustering coefficient Why
Star network 0 (for the central node) Peripheral nodes connect only to the center, not to each other.
Fully connected network 1 (for all nodes) Every possible connection between neighbors exists.

Configure the component

Method 1: Configure the component on the pipeline page

Add a Vertex Clustering Coefficient component on the pipeline page and configure the following parameters:

Category Parameter Description
Fields Setting Start Vertex The start vertex column in the edge table.
End Vertex The end vertex column in the edge table.
Parameters Setting Largest Vertex Degree Nodes with a vertex degree larger than this value are sampled instead of fully computed. Default: 500.
Tuning Workers Number of workers for parallel execution. Higher values increase parallelism but also increase framework communication overhead.
Memory Size per Worker (MB) Maximum memory per worker. Default: 4096 MB. If memory usage exceeds this value, an OutOfMemory error is reported.
Data Split Size (MB) Size of each data split. Default: 64 MB.

Method 2: Configure the component by using PAI commands

Run the Vertex Clustering Coefficient algorithm using the NodeDensity PAI command via the SQL Script component. For details on running PAI commands in SQL Script, see Scenario 4: Execute PAI commands within the SQL script component.

PAI -name NodeDensity
    -project algo_public
    -DinputEdgeTableName=NodeDensity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=NodeDensity_func_test_result
    -DmaxEdgeCnt=500;
Parameter Required Default Description
inputEdgeTableName Yes Name of the input edge table.
inputEdgeTablePartitions No Full table Partitions to read from the input edge table.
fromVertexCol Yes Start vertex column in the input edge table.
toVertexCol Yes End vertex column in the input edge table.
outputTableName Yes Name of the output table.
outputTablePartitions No Partitions in the output table.
lifecycle No Lifecycle of the output table.
maxEdgeCnt No 500 Nodes with a vertex degree greater than this value are sampled.
workerNum No Number of workers for parallel execution.
workerMem No 4096 Maximum memory per worker (MB). An OutOfMemory error is reported if exceeded.
splitSize No 64 Data split size (MB).

Example

This example runs the Vertex Clustering Coefficient algorithm on a small graph with 7 nodes and 11 edges, then reads the results.

The graph has the following edges (each row is one edge):

Start vertex End vertex
1 2
1 3
1 4
1 5
1 6
2 3
3 4
4 5
5 6
5 7
6 7

Node 1 connects to 5 neighbors (nodes 2–6), but only 4 of the 10 possible pairs among those neighbors are directly connected (2–3, 3–4, 4–5, 5–6). This gives node 1 a clustering coefficient of 0.4.

image

Step 1: Add a SQL Script component. Deselect Use Script Mode and Whether the system adds a create table statement. Enter the following SQL statements to create the input edge table and output table.

drop table if exists NodeDensity_func_test_edge;
create table NodeDensity_func_test_edge as
select * from
(
  select '1' as flow_out_id, '2' as flow_in_id
  union all
  select '1' as flow_out_id, '3' as flow_in_id
  union all
  select '1' as flow_out_id, '4' as flow_in_id
  union all
  select '1' as flow_out_id, '5' as flow_in_id
  union all
  select '1' as flow_out_id, '6' as flow_in_id
  union all
  select '2' as flow_out_id, '3' as flow_in_id
  union all
  select '3' as flow_out_id, '4' as flow_in_id
  union all
  select '4' as flow_out_id, '5' as flow_in_id
  union all
  select '5' as flow_out_id, '6' as flow_in_id
  union all
  select '5' as flow_out_id, '7' as flow_in_id
  union all
  select '6' as flow_out_id, '7' as flow_in_id
)tmp;
drop table if exists NodeDensity_func_test_result;
create table NodeDensity_func_test_result
(
  node string,
  node_cnt bigint,
  edge_cnt bigint,
  density double,
  log_density double
);

Step 2: Add a second SQL Script component. Deselect Use Script Mode and Whether the system adds a create table statement. Enter the following PAI commands and connect the two SQL Script components.

drop table if exists ${o1};
PAI -name NodeDensity
    -project algo_public
    -DinputEdgeTableName=NodeDensity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=${o1}
    -DmaxEdgeCnt=500;

Step 3: Click image in the upper-left corner to run the pipeline.

Step 4: Right-click the SQL Script component created in Step 2 and choose View Data > SQL Script Output to view the results.

node node_cnt edge_cnt density log_density
1 5 4 0.4 1.45657
2 2 1 1.0 1.24696
3 3 2 0.66667 1.35204
4 3 2 0.66667 1.35204
5 4 3 0.5 1.41189
6 3 2 0.66667 1.35204
7 2 1 1.0 1.24696

Output columns:

Column Type Description
node string Node identifier.
node_cnt bigint Number of direct neighbors (vertex degree).
edge_cnt bigint Number of edges that actually exist among those neighbors.
density double Clustering coefficient: edge_cnt / (node_cnt * (node_cnt - 1) / 2).
log_density double Log-transformed density value.

Nodes 2 and 7 both have a clustering coefficient of 1.0, meaning all their neighbors are directly connected to each other. Node 1 has the lowest coefficient (0.4), because only 4 of its 10 possible neighbor pairs are connected.