Connected Components of a Graph
Implementation in Adaptor High Performance Fortran
Aleksandar Donev, working under Dr. Phil Duxbury
Algorithm Description
Problem Formulation
A graph G=(N, E) can be thought of as a collection of points (hereafter nodes) N={i, i=1..n}, where n=|N| is the number of nodes, connected via directed or undirected links (arcs), E={e_{(i, j)} for some i and j in N}, where m=|E| is the number of arcs in the graph and e_{(i, j)} denotes an arc whose head is i and tail is j. We can thus represent a graph efficiently by specifying n and m and giving an array HT(m,2)
holding the heads and tails of each arc:
HT(e_{(i, j)}, :)=[i, j]
For example, assume that we have randomly generated the following graph:
Distribution of the graph
We will use coloring to denote ownership of the data on certain processors. The example from figure 1 was executed on 3 processors and is a 2D grid of size 7x7 with 70% of the arcs present. Processors P1 through P3 are colored red, green and blue respectively, and the color of each node/arc denotes the processor that owns it. Once again, here is the graph that I will use as an example throughout this section:
[hpf@gauss hpf2DNet]$ mpirun -np 3 Test2DDebug.x $SLIB
Enter the length L and the dilution:
7,.7
Enter the number of multiplications and connected components labelings:
0,1
# Here are node_lbound and node_ubound for different processors:
The distribution of nodes among processors is:
Procs: 1 2 3
From : 1 18 35
To : 17 34 49
Initial connectivity parenthood:
Fig 5. Initial labeling of the graph
The connected components labeling algorithm consists of assigning each node i a label c(i) such that two nodes have the same label if and only if there is a path in the graph connecting the two nodes. Our purpose at the end will be to select only those arcs and nodes that belong to a selected connected component (usually the one containing the source or sink in network optimization problems). Mathematically, we want to extract a single (large) connected component C from the diluted graph:
G ← ({i | c(i)=C}, {e_{(i, j)} | c(i)=c(j)=C})
For our specific example, the program will produce the output:
Fig 2. Final extracted largest connected component
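The extraction step can be sketched serially in Python (a sketch only; the function name is mine, and 0-based lists stand in for the report's 1-based Fortran arrays). Given the final labels c, we pick the most frequent label C and keep only the nodes and arcs carrying it:

```python
from collections import Counter

def extract_largest_component(c, arcs):
    """Keep only the nodes and arcs of the largest connected component."""
    C = Counter(c).most_common(1)[0][0]   # label of the largest component
    nodes = [i for i, label in enumerate(c) if label == C]
    kept = [(i, j) for (i, j) in arcs if c[i] == C and c[j] == C]
    return nodes, kept

# Labels for 5 nodes: component {0, 1, 2} and component {3, 4}
c = [0, 0, 0, 3, 3]
nodes, kept = extract_largest_component(c, [(0, 1), (1, 2), (3, 4)])
print(nodes, kept)  # [0, 1, 2] [(0, 1), (1, 2)]
```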
Parallel PRAM Algorithms for Connected-Components Labeling
The algorithms build the labeling by constructing a spanning forest of the graph. Namely, each connected component will be represented via a tree in the forest, with one node chosen as the root of the tree. To convert this spanning forest into a proper labeling, we simply collapse all the trees into stars, i.e. trees of height 1. See figure 3 for an illustration.
Fig 3. A tree collapses into a star: TreeCollapse()
We represent the trees in the forest in the usual way, via parenthood relations. So, let us make an array p (n), which stores the parent of each node, with p (node) = node for root nodes. Thus tree collapse can be represented via:
Procedure TreeCollapse (N)
For all nodes i in N do in parallel
p (i) ← FindRoot (i)
End do
End procedure TreeCollapse
where we find the root of the tree a node belongs to by traversing the tree upward until a root node is found:
Procedure child = FindRoot (i)
child = i
# Concurrent read:
Do while ( p (child) != child ) child ← p (child)
End procedure FindRoot
Finding the root via FindRoot (node) has a cost O(H), where H is the depth of node in the tree. Therefore, it is beneficial to have trees of small height (i.e. almost stars).
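The FindRoot and TreeCollapse procedures translate almost directly into a serial Python sketch (hypothetical names; the report's arrays are 1-based Fortran, here 0-based lists):

```python
def find_root(p, i):
    """Walk up the parenthood tree from node i until a root (p[r] == r)."""
    while p[i] != i:
        i = p[i]
    return i

def tree_collapse(p):
    """Collapse every tree in the forest into a star of height 1."""
    for i in range(len(p)):
        p[i] = find_root(p, i)

# A small forest: the chain 0 <- 1 <- 2 and the pair 3 <- 4
p = [0, 0, 1, 3, 3]
tree_collapse(p)
print(p)  # [0, 0, 0, 3, 3]
```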
Now, let us assume that two trees need to be merged together into a bigger tree because there is an arc e_{(i, j)} connecting one of the nodes i from the first tree to another node j in the second tree. This can be done easily by hooking one of the roots to any node in the other tree. Again, to save on cost and minimize the height of the parenthood trees, it is best to hook one of the roots to the other root. See figure 4 for an illustration.
Fig 4. Hooking two trees together: MergeTrees(e_{(4, 9)})
There are various ways to minimize the height of the trees formed by tree merging. The simplest one to understand (though not the most memory efficient) is to always hook a shorter tree to a taller tree (this guarantees that no tree will have height larger than log_{2}(n)). For this purpose we make another array, h (n), and store in it the height of the tree underneath a given node (so leaf nodes in the tree have minimal height 0, while roots carry the maximal height label). When a shorter tree merges with a taller tree, the heights of the two trees do not change. When two trees of equal height merge, the merged tree gets higher by 1 unit.
So we have the following tree merging procedure:
Procedure MergeTrees (e_{(i, j)})
r{i} = FindRoot (i)
r{j} = FindRoot (j)
# Find the taller tree and make it a parent of the shorter child:
If ( h (r{i}) > h (r{j}) ) then
parent ← r{i} ; child ← r{j}
Else if ( h (r{i}) < h (r{j}) ) then
parent ← r{j} ; child ← r{i}
Else
# In case of a tie, we can choose the smaller root label to be the parent:
parent ← min (r{i}, r{j}) ; child ← max (r{i}, r{j})
# Concurrent read and write:
h (parent) ← h (parent) + 1
End if
# Now merge the trees (concurrent write):
p (child) ← parent
End procedure MergeTrees
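A serial Python sketch of MergeTrees with the height bookkeeping (again with my own names, not the report's HPF code):

```python
def find_root(p, i):
    """Walk up the parenthood tree from node i until a root is found."""
    while p[i] != i:
        i = p[i]
    return i

def merge_trees(p, h, i, j):
    """Hook the shorter tree's root under the taller tree's root.
    On a height tie, the smaller root label becomes the parent and
    the merged tree grows by one unit of height."""
    ri, rj = find_root(p, i), find_root(p, j)
    if ri == rj:
        return                      # already in the same tree
    if h[ri] > h[rj]:
        parent, child = ri, rj
    elif h[ri] < h[rj]:
        parent, child = rj, ri
    else:
        parent, child = min(ri, rj), max(ri, rj)
        h[parent] += 1              # equal heights: merged tree grows by 1
    p[child] = parent

p = list(range(6)); h = [0] * 6
merge_trees(p, h, 0, 1)   # tie: root 0 becomes parent, height 1
merge_trees(p, h, 2, 3)   # tie: root 2 becomes parent, height 1
merge_trees(p, h, 1, 3)   # roots 0 and 2 tie again: 0 becomes parent
print(p, h)  # [0, 0, 0, 2, 4, 5] [2, 0, 1, 0, 0, 0]
```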
To shorten the trees in the spanning forest one can periodically perform pointer jumping, which simply means that we rehook a node higher up in its parenthood tree:
Procedure PointerJumping (N)
For all nodes i in N do in parallel
# Concurrent read and write:
p (i) ← p (p (i))
End do
End procedure PointerJumping
Edge condensation is a process of passing arcs higher up the parenthood tree, useful when some of the nodes become inactive and their arcs need to be passed to other nodes:
Procedure EdgeCondensation (E)
For all arcs e_{(i, j)} in E do in parallel
# Concurrent read:
e_{(i, j)} ← e_{(p (i), p (j))}
End do
End procedure EdgeCondensation
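Both operations are one-liners per node or arc; a serial Python sketch (hypothetical names, with a copy of the array emulating the PRAM's concurrent read):

```python
def pointer_jumping(p):
    """Rehook every node one level higher: p[i] <- p[p[i]].
    One sweep roughly halves the depth of every node."""
    old = list(p)                 # emulate the concurrent read of the PRAM step
    for i in range(len(p)):
        p[i] = old[old[i]]

def edge_condensation(p, arcs):
    """Replace each arc (i, j) by (p[i], p[j]), passing arcs up the trees."""
    return [(p[i], p[j]) for (i, j) in arcs]

p = [0, 0, 1, 2]                  # the chain 0 <- 1 <- 2 <- 3
pointer_jumping(p)
print(p)                          # [0, 0, 0, 1]
print(edge_condensation(p, [(3, 2)]))  # [(1, 0)]
```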
Now we can summarize an efficient parallel algorithm for computing the connected components of a graph, which accepts a graph G=(N, E) as input:
Procedure BuildConnectedComponents (G)
# Initially let each node be its own parent and mark all arcs as alive:
For all nodes i in N do in parallel
p (i) ← i
Mark all arcs as alive
Mark all nodes as nonleaf
Repeat until there are no alive arcs
For all alive arcs e_{(i, j)} in E do in parallel
If certain conditions are true
Merge the trees of nodes i and j
Kill arc e_{(i,j)}
End if
End for
Optionally
Mark leaf nodes and condense their arcs
Perform pointer jumping on nonleaf nodes
and / or
Condense the alive arcs
and / or
Remove duplicate alive arcs
End optionally
End repeat
TreeCollapse () # Not necessary if invoked above
End procedure BuildConnectedComponents
The main step in any such algorithm is MergeTrees. If executed for all arcs, it will guarantee a correct connected-components labeling so long as trees are merged acyclically (acyclicity of the parenthood relations is required in any tree). This can be guaranteed by using some acyclic property when choosing which tree merges into which, such as comparing the heights of the two trees or comparing the labels of the roots of the two trees.
There are two main approaches to the above algorithm.
In the first approach, if there are no additional conditions imposed on the merging, then all arcs will be used and killed in the first iteration. However, this requires traversing the trees to find their roots, which is a logarithmic operation O (log_{2} (n)). The total running time of such an approach is O (m log_{2} (n)).
In the second approach, trees are not always merged. The most popular choice is to only merge stars into other trees, or to merge only stars and perform a full tree collapse on each iteration so that all trees are either isolated vertices or stars. In this approach, each iteration of the Repeat loop has cost O (m), but there are likely to be O (log_{2} (n)) iterations until all arcs are dead. So the cost is still a logarithmic factor worse than the best serial algorithms for sparse graphs, which are O (n).
We will employ the first approach because of its simplicity:
Procedure BuildConnectedComponentsPRAM (G)
# Initially let each node be its own parent:
For all nodes i in N do in parallel
p (i) ← i
For all arcs e_{(i, j)} in E do in parallel
MergeTrees (e_{(i, j)})
Optionally perform some kind of pointer jumping on i and j
End for
TreeCollapse ()
End procedure BuildConnectedComponentsPRAM
In the initial implementation we will not use pointer jumping. This is because the height property of the trees cannot be maintained under such operations, and an O (m log_{2} (n)) bound cannot be guaranteed for some pathological graphs.
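Putting the pieces together, BuildConnectedComponentsPRAM can be sketched serially in Python (one MergeTrees per arc with union by height, then a full TreeCollapse; the names are mine and the loop is sequential, unlike the report's parallel HPF version):

```python
def find_root(p, i):
    """Walk up the parenthood tree from node i until a root is found."""
    while p[i] != i:
        i = p[i]
    return i

def connected_components_pram(n, arcs):
    p = list(range(n))      # each node starts as its own parent
    h = [0] * n             # tree heights, for union by height
    for (i, j) in arcs:     # MergeTrees (e_{(i, j)}) for every arc
        ri, rj = find_root(p, i), find_root(p, j)
        if ri == rj:
            continue
        if h[ri] < h[rj]:
            ri, rj = rj, ri                   # make ri the taller root
        elif h[ri] == h[rj]:
            ri, rj = min(ri, rj), max(ri, rj) # tie: smaller label wins
            h[ri] += 1
        p[rj] = ri                            # hook shorter root under taller
    for i in range(n):                        # TreeCollapse
        p[i] = find_root(p, i)
    return p

# Two components, {0, 1, 2} and {3, 4}; node 5 is isolated.
print(connected_components_pram(6, [(0, 1), (1, 2), (3, 4)]))
# [0, 0, 0, 3, 3, 5]
```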
A Simple Distributed Relaxation Connected-Components Algorithm
The above algorithms are well suited to the PRAM CRCW model (Parallel Random Access Machine with Concurrent Read and Concurrent Write capability). This model cannot be directly implemented even on most of today's SMPs, because of the concurrent read & write capability.
HPF is not a suitable language for implementing such runtime communication and/or job-distribution irregularities. So we need to look at another distributed-memory connected-components labeling algorithm.
The simplest kind of algorithm is a relaxation algorithm known as local diffusion [4]. Instead of the previous parenthood label p (node), we will use the label c (node), just to indicate that no parenthood trees are being created but a labeling of the nodes is built right away. The algorithm can be expressed in a few lines:
Procedure BuildConnectedComponentsLocalDiffusion (G)
# Initially let each node be its own label:
For all nodes i in N do in parallel
c (i) ← i
# Local diffusion relaxation:
Repeat until convergence to a stationary point
For all nodes i in N do in parallel
c (i) ← min (c (i), { c (j) for all e_{(i, j)} in E} )
End for
End repeat
End procedure BuildConnectedComponentsLocalDiffusion
In words, set the label of each node to the minimum of its own label and all of its neighbouring nodes, and iterate until convergence to a stationary labeling of the nodes.
We can easily implement the above algorithm using the heads & tails arrays, avoiding the use of a temporary array and thereby speeding up the relaxation:
Procedure BuildConnectedComponentsRelax (G)
For all nodes i in N do in parallel
c (i) ← i
Repeat until convergence to a stationary point
# Relaxation step:
For all arcs e_{(i, j)} in E do in parallel
c (i), c (j) ← min ( c (i), c (j) )
End for
End repeat
End procedure BuildConnectedComponentsRelax
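The arc-based relaxation can be sketched serially in Python (hypothetical names; the serial loop sweeps the arcs in order, whereas the HPF version updates them in parallel):

```python
def connected_components_relax(n, arcs):
    """Repeatedly set both endpoints of every arc to the smaller of
    their two labels, until a full sweep changes nothing."""
    c = list(range(n))         # each node starts as its own label
    changed = True
    iterations = 0
    while changed:             # repeat until a stationary labeling
        changed = False
        iterations += 1
        for (i, j) in arcs:
            m = min(c[i], c[j])
            if c[i] != m or c[j] != m:
                c[i] = c[j] = m
                changed = True
    return c, iterations

c, its = connected_components_relax(5, [(0, 1), (1, 2), (3, 4)])
print(c, its)  # [0, 0, 0, 3, 3] 2
```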
The relaxation step has to be repeated as many times as the (chemical) length of the biggest percolation cluster in the network, which can be of the order of the size of the system [4]. So this algorithm is not very efficient. Figure 5 shows the total label mismatch M across all arcs during the course of the relaxation for three points, one above, one at, and one below the percolation threshold:
# Total mismatch over all arcs:
M ← Sum of |c (i) - c (j)| over all arcs e_{(i, j)}
Fig 5. Convergence behaviour of the local diffusion relaxation algorithm
The grid graph was 2D of size [ 1000 x 1000 ]
A Mixed Approach
The optimal distributed-memory connected-components algorithm on most machines will consist of two phases:
 Local Phase, in which each processor performs a local connected-components algorithm on the arcs and nodes that it owns. Note that this phase needs to ignore remote arcs. Different algorithms can be used for this local phase. From an algorithmic point of view, it is best to use one of the O (n) algorithms, like breadth-first (or depth-first) search. We decided to use the above parallel PRAM arc-based algorithms for the local phase, though:
# Some statistics about the trees built in the local phase:
The maximum tree height is: 2
The average tree height is: 0.20
The number of treeclimbing steps is: 54
# Here are the parent nodes (with .TRUE. mask at the end):
Parent nodes and their ownership:
Nodes: 1 8 14 21 22 28 35 36 37
Procs: 1 1 1 2 2 2 3 3 3
After the local phase the connectivity is:
Fig 6. Labeling of the graph after the local phase
 Global Phase, in which only remote arcs are involved and a global connected-components algorithm is invoked on a reduced graph consisting only of remote arcs.
Before entering the global connectedcomponents algorithm, we need to collapse all the remote arcs, that is, make their head & tail pointers point to the parents of their head & tail. This will dramatically reduce the number of nodes involved in the global phase:
# After packing, the ends_remote array and its ownership is:
The remote arcs belong to processor:
Procs: 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Before the collapse the remote arcs have:
Heads: 19 21 22 23 18 24 35 36 37 32 38 33 39 40 41
Tails: 12 14 15 16 17 17 28 29 30 31 31 32 32 33 34
# Now, we collapse the remote edges to get a reduced graph with many duplicate edges:
The collapsed remote arcs have:
Heads: 22 21 22 22 22 22 35 36 37 22 37 22 37 37 37
Tails: 8 14 8 8 8 8 28 22 22 22 22 22 22 22 22
# Here is how that graph looks (this is an added drawing):
Fig 7. The reduced graph with duplicate edges
It might be beneficial to remove duplicate remote arc entries as well, to reduce the number of inner iterations in the global connected-components algorithm. This is costly to do exactly, because it requires a global sort.
But most duplicate arc entries will reside on one or two neighbouring processors. This is because remote arcs that are physically close in the network are also close index-wise in the array ends. So, all we need to do is remove duplicate remote arc entries locally on each processor. In most cases this will leave one or two copies of each remote arc in the reduced graph.
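The local duplicate removal each processor performs can be sketched in Python (a serial sketch of one processor's slice; the function name and the set-based filter are mine, not the report's HPF code):

```python
def remove_local_duplicates(arcs):
    """Keep one copy of each (head, tail) pair in this processor's slice.
    Arcs are treated as undirected, so (i, j) and (j, i) are duplicates."""
    seen = set()
    kept = []
    for (head, tail) in arcs:
        key = (min(head, tail), max(head, tail))
        if key not in seen:
            seen.add(key)
            kept.append((head, tail))
    return kept

# One processor's slice of collapsed remote arcs, with many duplicates:
local = [(22, 8), (21, 14), (22, 8), (22, 8), (22, 8)]
print(remove_local_duplicates(local))  # [(22, 8), (21, 14)]
```

Since each processor filters only its own slice, a globally duplicated arc whose copies live on two neighbouring processors can still survive twice, which is exactly the "one or two copies" behaviour described above.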
# After removing the local duplicate arc copies:
The non-duplicate remote arcs have:
Heads:  21   22 22 35 36 37      37
Tails:  14   8 8 28 22 22      22
There are 15 remote arcs, out of 60
There are 7 arcs entering the global phase
# And here is the reduced graph that will enter the global phase in the next section:
Fig 8. The reduced graph without local duplicate edges
We will use the local-diffusion algorithm for this global phase, operating on the reduced graph, because of its simplicity and its efficient HPF implementation in Adaptor. Implementation details will come later. After this phase completes, the reduced graph will be correctly labeled:
There are 7 arcs entering the global phase
The HALO collapsed global arcs have:
Heads: 19 20 21 36 37 38 37
Tails: 14 8 8 28 22 22 51
Now at iteration: 1 with total mismatch 86
Now at iteration: 2 with total mismatch 42
Now at iteration: 3 with total mismatch 0
Total number of global iterations: 3
After the global relaxation phase the node labels are:
Fig 9. The reduced graph after the global phase
The labeling of the graph at this stage reflects the changes in the labels of the parents that participated in the global phase:
Fig 10. The graph labeling after the global phase
After the global algorithm is done, all non-parent nodes need to update their labels from their parent nodes to complete the global labeling phase:
Fig 11. The graph labeling after the global update
In summary, here is the full distributed connected-components algorithm:
Procedure BuildConnectedComponentsDistributed (G)
# Select the non-remote arcs (P_{i} is the processor owning node i):
E^{(l)} ← {e_{(i, j)} | P_{i} = P_{j}}
# Local phase:
BuildConnectedComponentsPRAM(G^{(l)}=(E^{(l)}, N))
# Mark the nodes that are not parents (in TreeCollapse()):
N^{(np)} ← {i | p (i) != i}
# Select only the remote arcs:
E^{(g)} ← E \ E^{(l)}
# Condense the remote arcs:
EdgeCondensation (E^{(g)})
Remove duplicate entries from E^{(g)}
# Global phase:
BuildConnectedComponentsRelax (G^{(g)}=(E^{(g)}, N))
# Update of global labels:
PointerJumping (N^{(np)})
End procedure BuildConnectedComponentsDistributed
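The whole two-phase procedure can be simulated serially in Python, with an ownership array standing in for the processor distribution (a sketch under that assumption; all names are mine, the local phase uses a plain union without height bookkeeping for brevity, and real HPF/MPI communication is elided):

```python
def find_root(p, i):
    """Walk up the parenthood tree from node i until a root is found."""
    while p[i] != i:
        i = p[i]
    return i

def connected_components_distributed(n, arcs, owner):
    p = list(range(n))
    local = [(i, j) for (i, j) in arcs if owner[i] == owner[j]]
    remote = [(i, j) for (i, j) in arcs if owner[i] != owner[j]]
    # Local phase on the non-remote arcs:
    for (i, j) in local:
        ri, rj = find_root(p, i), find_root(p, j)
        if ri != rj:
            p[max(ri, rj)] = min(ri, rj)
    for i in range(n):
        p[i] = find_root(p, i)          # TreeCollapse: every tree is a star
    # Condense the remote arcs onto the parent nodes:
    reduced = [(p[i], p[j]) for (i, j) in remote]
    # Global phase: relaxation on the reduced graph:
    changed = True
    while changed:
        changed = False
        for (i, j) in reduced:
            m = min(p[i], p[j])
            if p[i] != m or p[j] != m:
                p[i] = p[j] = m
                changed = True
    # Non-parent nodes fetch their parents' new labels; one pointer-jumping
    # sweep suffices because the local trees were already stars:
    for i in range(n):
        p[i] = p[p[i]]
    return p

owner = [0, 0, 0, 1, 1, 1]              # two "processors"
arcs = [(0, 1), (3, 4), (2, 3)]         # (2, 3) is the only remote arc
print(connected_components_distributed(6, arcs, owner))
# [0, 0, 2, 2, 2, 5]
```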