Vertex Clustering Coefficient-Platform For AI(PAI)-阿里云帮助中心

The Vertex Clustering Coefficient measures how tightly a node's neighbors are connected to each other — that is, the likelihood that any two of a node's neighbors are also directly connected. It is the ratio of actual edges among a node's neighbors to the maximum possible edges between them.

The coefficient ranges from 0 to 1:

0: None of the node's neighbors are connected to each other (e.g., the hub of a star network).
1: All of a node's neighbors are fully connected to each other (e.g., any node in a fully connected network).

How it works

For each node u in an undirected graph, the clustering coefficient is:

$$C_u = \frac{2T(u)}{deg(u)(deg(u)-1)}$$

Where:

T(u) — the number of triangles passing through node u (pairs of neighbors that are also directly connected to each other)
deg(u) — the degree of node u (number of direct neighbors)

Network topology examples

The algorithm produces boundary values on two canonical graph structures:

Network type	Clustering coefficient	Why
Star network	0 (for the central node)	Peripheral nodes connect only to the center, not to each other.
Fully connected network	1 (for all nodes)	Every possible connection between neighbors exists.

Configure the component

Method 1: Configure the component on the pipeline page

Add a Vertex Clustering Coefficient component on the pipeline page and configure the following parameters:

Category	Parameter	Description
Fields Setting	Start Vertex	The start vertex column in the edge table.
Fields Setting	End Vertex	The end vertex column in the edge table.
Parameters Setting	Largest Vertex Degree	Nodes with a vertex degree larger than this value are sampled instead of fully computed. Default: 500.
Tuning	Workers	Number of workers for parallel execution. Higher values increase parallelism but also increase framework communication overhead.
	Memory Size per Worker (MB)	Maximum memory per worker. Default: 4096 MB. If memory usage exceeds this value, an `OutOfMemory` error is reported.
	Data Split Size (MB)	Size of each data split. Default: 64 MB.

Method 2: Configure the component by using PAI commands

Run the Vertex Clustering Coefficient algorithm using the NodeDensity PAI command via the SQL Script component. For details on running PAI commands in SQL Script, see Scenario 4: Execute PAI commands within the SQL script component.

PAI -name NodeDensity
    -project algo_public
    -DinputEdgeTableName=NodeDensity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=NodeDensity_func_test_result
    -DmaxEdgeCnt=500;

Parameter	Required	Default	Description
`inputEdgeTableName`	Yes	—	Name of the input edge table.
`inputEdgeTablePartitions`	No	Full table	Partitions to read from the input edge table.
`fromVertexCol`	Yes	—	Start vertex column in the input edge table.
`toVertexCol`	Yes	—	End vertex column in the input edge table.
`outputTableName`	Yes	—	Name of the output table.
`outputTablePartitions`	No	—	Partitions in the output table.
`lifecycle`	No	—	Lifecycle of the output table.
`maxEdgeCnt`	No	500	Nodes with a vertex degree greater than this value are sampled.
`workerNum`	No	—	Number of workers for parallel execution.
`workerMem`	No	4096	Maximum memory per worker (MB). An `OutOfMemory` error is reported if exceeded.
`splitSize`	No	64	Data split size (MB).

Example

This example runs the Vertex Clustering Coefficient algorithm on a small graph with 7 nodes and 11 edges, then reads the results.

The graph has the following edges (each row is one edge):

Start vertex	End vertex
1	2
1	3
1	4
1	5
1	6
2	3
3	4
4	5
5	6
5	7
6	7

Node 1 connects to 5 neighbors (nodes 2–6), but only 4 of the 10 possible pairs among those neighbors are directly connected (2–3, 3–4, 4–5, 5–6). This gives node 1 a clustering coefficient of 0.4.

Step 1: Add a SQL Script component. Deselect Use Script Mode and Whether the system adds a create table statement. Enter the following SQL statements to create the input edge table and output table.

drop table if exists NodeDensity_func_test_edge;
create table NodeDensity_func_test_edge as
select * from
(
  select '1' as flow_out_id, '2' as flow_in_id
  union all
  select '1' as flow_out_id, '3' as flow_in_id
  union all
  select '1' as flow_out_id, '4' as flow_in_id
  union all
  select '1' as flow_out_id, '5' as flow_in_id
  union all
  select '1' as flow_out_id, '6' as flow_in_id
  union all
  select '2' as flow_out_id, '3' as flow_in_id
  union all
  select '3' as flow_out_id, '4' as flow_in_id
  union all
  select '4' as flow_out_id, '5' as flow_in_id
  union all
  select '5' as flow_out_id, '6' as flow_in_id
  union all
  select '5' as flow_out_id, '7' as flow_in_id
  union all
  select '6' as flow_out_id, '7' as flow_in_id
)tmp;
drop table if exists NodeDensity_func_test_result;
create table NodeDensity_func_test_result
(
  node string,
  node_cnt bigint,
  edge_cnt bigint,
  density double,
  log_density double
);

Step 2: Add a second SQL Script component. Deselect Use Script Mode and Whether the system adds a create table statement. Enter the following PAI commands and connect the two SQL Script components.

drop table if exists ${o1};
PAI -name NodeDensity
    -project algo_public
    -DinputEdgeTableName=NodeDensity_func_test_edge
    -DfromVertexCol=flow_out_id
    -DtoVertexCol=flow_in_id
    -DoutputTableName=${o1}
    -DmaxEdgeCnt=500;

Step 3: Click in the upper-left corner to run the pipeline.

Step 4: Right-click the SQL Script component created in Step 2 and choose View Data > SQL Script Output to view the results.

node	node_cnt	edge_cnt	density	log_density
1	5	4	0.4	1.45657
2	2	1	1.0	1.24696
3	3	2	0.66667	1.35204
4	3	2	0.66667	1.35204
5	4	3	0.5	1.41189
6	3	2	0.66667	1.35204
7	2	1	1.0	1.24696

Output columns:

Column	Type	Description
`node`	string	Node identifier.
`node_cnt`	bigint	Number of direct neighbors (vertex degree).
`edge_cnt`	bigint	Number of edges that actually exist among those neighbors.
`density`	double	Clustering coefficient: `edge_cnt / (node_cnt * (node_cnt - 1) / 2)`.
`log_density`	double	Log-transformed density value.

Nodes 2 and 7 both have a clustering coefficient of 1.0, meaning all their neighbors are directly connected to each other. Node 1 has the lowest coefficient (0.4), because only 4 of its 10 possible neighbor pairs are connected.