Needed or proposed improvements to Adaptor
What this web-document is about
The problem of finding the connected components of a graph has been very well studied due to its wide applicability in such diverse applications as Monte Carlo simulations and image recognition. Very efficient serial and parallel algorithms have been developed for both general and specialized graphs. Most parallel algorithms are based on the PRAM CRCW model [2, 6], which can be implemented on modern shared memory architectures with only minor implementation difficulties for CREW and EREW models. Several distributed algorithms have also been proposed [1, 4].
Some of these algorithms are easier to implement on a given platform than others, and some work better on certain type of graphs than on others. The aim of this web-page is to present a very simple data-parallel distributed algorithm implemented in the Adaptor (version 7.0) High Performance Fortran (HPF) compilation system and executed on a Beowulf cluster. The algorithm is not the fastest available or possible, but it is efficient and illustrates some major principles/difficulties of implementing graph algorithms in data-parallel, array-based, languages like HPF.
Algorithm Description
Problem Formulation
A graph G=(V, E) can be thought of as a collection of points (here-after nodes) V={i, i=1..n}, where n=|V| is the number of nodes, connected via directed or undirected links (arcs), E={e(i, j) | i, j in V}, where m=|E| is the number of arcs in the graph and e(i, j) denotes an arc whose head is i and tail is j. We can thus represent a graph efficiently by specifying n and m and giving an array HT(m,2) that specifies the heads and tails of each arc:
HT(e(i, j), :)=[i, j]
In physics, the most commonly used type of graphs are diluted grid graphs (diluted hypercube lattices), which are made from regular 2D or 3D grids by removing some of the arcs/nodes. For the purpose of this document, we will use 2D graphs and use actual debugging printout from the HPF program to illustrate the algorithm. For example, assume that we have randomly generated the following graph:
Fig 1. Initial labeling of the graph
The coloring and the presence of arcs denoted by double lines in figure 1 is related to the distribution of the graph onto abstract processors and will be explained later. The connected components labeling algorithm consists of assigning each node i a label c(i) such that two nodes have the same label if and only if there is a path in the graph connecting the two nodes. Our purpose at the end will be to extract only for those arcs and nodes that belong to a selected (usually the one containing the source or sink in network optimization problems) connected component. Mathematically, we want to extract a single (large) connected component C from the diluted graph:
G <-- ({e(i, j), c(i)=c(j)=C}, {i, c(i)=C})
For our specific example, the program will label the largest connected component as 8 and extract it in a resulting pure graph:
Fig 2. Final extracted largest connected component
Parallel PRAM Algorithms for Connected-Components Labeling
There are several important points to make before discussing some PRAM/distributed algorithms for labeling connected components. First, the heads & tails array HT is not the only possible representation of the graph, nor is it the most commonly used one. Graphs are most frequently represented via node-based data structures such as adjacency matrices (for dense graphs) and adjacency lists (for sparse graphs). We chose to use the heads & tails representation because it is well suited for the more important part of our project (Convex Network Optimization), it fits the HPF_HALO_LIBRARY of Adaptor very well, and because it is most memory efficient for sparse graphs. This implies that some efficient serial algorithms like breath or depth first search are not good candidates. Also, we will look at data-parallel algorithms only, which can be easily implemented in HPF or OpenMP on both shared- and distributed-memory architectures.
Here we give a generic efficient arc-based parallel algorithm for finding the connected components of a graph for the PRAM CRCW model. It is a combination of the simpler and yet efficient parts of several algorithms found in literature [1, 3, 4].
The algorithms build the labeling of the graph by building a spanning forest of the graph. Namely, each connected component will be represented via a tree in the forest, with one node chosen as the root of the tree. To convert this spanning forest into a proper labeling, we simply collapse all the trees into stars---trees of height 1. See figure 3 for an illustration.
Fig 3. A tree collapses into a star--TreeCollapse()
We represent the trees in the forest in the usuall way, via parenthood relations. So, let us make the array p (n), which stores the parent (often called predecessor) of each node, and p (node) = node for root nodes node. Thus tree collapse can be represented via:
Procedure TreeCollapse (V)
For all nodes i in V do in parallel
p (i) <-- FindRoot (i)
End do
End procedure TreeCollapse
where we find the root of the tree a node belongs to by traversing the tree upward until a root node is found:
Procedure child = FindRoot (i)
child = i
# Concurrent read:
Do while ( p (child) != child ) child <-- p (child)
End procedure FindRoot
Note that reading from p (child) requires concurrent read capability. Also, note that finding the root FindRoot (node) has a cost O(H) , where H is the depth of node in the tree. Therefore, it is beneficial to have trees of small height (i.e. almost stars).
Now, let us assume that two trees need to be merged together into a bigger tree because there is an arc e(i, j) connecting one of the nodes i from the first tree to another node j in the second tree. This can be done easily by hooking one of the roots to any node in the other tree. Again, to save on cost and minimize the height of the parenthood trees, it is best to hook one of the roots to the other root. See figure 4 for an illustration.
Fig 4. Hooking two trees together--MergeTrees(e(4, 9))
There are various ways to minimize the height of the trees formed by tree merging. The simplest one to understand (though not most memory efficient) is to always hook a shorter tree to a taller tree (this gurantees that no tree will have height larger than ln2(n)). For this purpose we make another array, h (n), and store in it the height of the tree underneath a given node (so leaf nodes in the tree have minimal height 0, while roots have the maximal height label). When a shorter tree merges with a bigger tree the heights of the two trees do not change. When two trees of equal height merge, the taller tree gets higher by 1 unit.
So we have the following tree merging procedure:
Procedure MergeTrees (e(i, j))
r{i} = FindRoot (i)
r{j} = FindRoot (j)
# Find the taller tree and make it a parent of the shorter child:
If ( h (r{i}) > h (r{j}) ) then
parent <-- r{i} ; child <-- r{j}
Else if ( h (r{i}) < h (r{j}) ) then
parent <-- r{j} ; child <-- r{i}
Else
# In case of a tie, we can chose the lesser node to be the parent:
parent <-- min (i, j) ; child <-- max (i, j)
# Concurrent read and write:
h (parent) <-- h (parent) + 1
End if
# Now merge the trees (concurrent write):
p (child) <-- parent
End procedure MergeTrees
To shorten the trees in the spanning forest one can periodically perform pointer jumping, which simply means that we rehook a node higher up in its parenthood tree:
p (i) <-- p (p (...p (i)))
where usually only a single pointer jump is performed (i.e. the parent of a child is set to the parent of the parent--grandparent):
Procedure PointerJumping (V)
For all nodes i in V do in parallel
# Concurrent read and write:
p (i) <-- p (p (i))
End do
End procedure PointerJumping
This is a less costly version of TreeCollapse which can also be implemented in parallel. In general, there will be an optimal number of pointer jumps to be performed for each application.
One more optimization that can easily be implemented is edge condensation. In this optimization, the ends of an arc are moved up the tree. This kind of optimization is often used efficiently with other approaches, like marking the leaf nodes of the trees as inactive and condensing their arcs up the tree (to reduce the number of nodes used in pointer jumping routines). Self-loops (arcs connecting a node to itself) should be removed. Also, duplicate arcs (arcs between the same two nodes) can be removed by sorting the heads & tails array. These are more useful and much easier (and less costly) to implement with adjacency lists and on shared-memory models, so we won't use them much. Here is an edge condensation routine, which we will use:
Procedure EdgeCondensation (E)
For all arcs e(i, j) in E do in parallel
Concurrent read:
e(i, j) <-- e(p (i), p (j))
End do
End procedure EdgeCondensation
Now we can summarize an efficient parallel algorithm for computing the connected components of a graph in parallel that accepts a graph G=(V, E) as input:
Procedure BuildConnectedComponents (G)
# Initially let each node be its own parent and mark all arcs as alive:
For all nodes i in V do in parallel
p (i) <-- i
Mark all arcs as alive
Mark all nodes as non-leaf
Repeat until there are no alive arcs
For all alive arcs e(i, j) in E do in parallel
If certain conditions are true
Merge the trees of nodes i and j
Kill arc e(i, j)
End if
End for
Optionally
Mark leaf nodes and condense their arcs
Perform pointer jumping on non-leaf nodes
and / or
Condense the alive arcs and remove self-loops
and / or
Remove duplicate alive arcs
End optionally
End repeat
TreeCollapse () # Not neccessary if invoked above
End procedure BuildConnectedComponents
A few words are in order. The main step in any algorithm is MergeTrees. If executed for all arcs, it will guarantee a correct connected component labeling so long as trees are merged acyclically (acyclicity of the parenthood relations is required in any tree). This can be guaranteed by using some acyclic property when choosing which tree merges with which, such as comparing the heights of the two trees or comparing the values of the labels of the roots of the two trees.
There are two main approaches to the above algorithm.
-
In the first approach, if there are no additional certain conditions imposed on the merging, then all arcs will be used and killed in the first iteration. However, this requires traversing the trees to find their roots, which is a logarithmic operation O(ln(n)). Pointer jumping can still be perfomed more or less agressively, but edge condensation is not needed. The total running time of such an approach is O(mln(n)). For a nice HTML illustration of this approach see [3].
- List Item Two
In the second approach, trees are not always merged. The most popular choice is to only merge stars to other trees (finding the root of a star is O(1), and we can then merge the star to a node in the other tree), or merging only stars and performing full tree collapse on each iteration so that all trees are either isolated vertices or stars. In this approach, each iteration in to the
Repeat loop is of cost O(m), but then there are likely to be O(ln(n)) iterations until all arcs are dead [1, 2]. So the cost is still a logarithmic factor worse than the best serial algorithms for sparse graphs--O(mln*(n)) (the bound O(mln(n)) is optimal for parallel algorithms).
We will employ the first approach because of its simplicity. It is well known that the second approach allows many additonal optimizations to be employed that speed overall execution times significantly. But the two algorithms are similar, so that the rest of the discussion won't suffer from this possibly arbitrary choice. So we have:
Procedure BuildConnectedComponents-PRAM (G)
# Initially let each node be its own parent:
For all nodes i in V do in parallel
p (i) <-- i
For all arcs e(i, j) in E do in parallel
MergeTrees (e(i, j))
Optionally perform some kind of pointer jumping on i and j
End for
TreeCollapse ()
End procedure BuildConnectedComponents-PRAM
In the initial implementation we will not use pointer jumping. This is because the height property of trees can not be maintained under such operations and a O(mln(n)) bound can not be guaranteed for some pathological graphs (in the algorithms of the second approach, a logarithmic execution time is guaranteed by employing special hooking conditions, such as merging trees with larger root labels to trees with smaller root labels on odd iterations, and vice versa on even iterations). But for diluted 2D grid graphs the logarithmic bound is likely to be the average case, so the height of the trees does not have to be monitored.
A Simple Distributed Relaxation Connected-Components Algorithm
The above algorithm were well suited for the PRAM CRCW model. This model can not be directly implemented even on most of todays SMP's, because of the concurrent read & write capability. But these concurrent memory access operations can be emulated in software in languages like MPL [2].
The same is true for implementing the above algorithm on a distributed-memory architecture. The basic operation with trees is traversing the tree upward from a node. This may require skipping from one processor to the other if the parenthood relation (i.e. and edge) crosses across processors. Languages like Split C have emulation capabilities for global pointers, which can be used to represent remote arcs--arcs whose end nodes belong to different processors [1]. This however simply hides the inherent difficulties present in utilizing the above algorithm on a distributed-memory architecture and using global pointers usually has a very high cost associated with it.
HPF is not a suitable language for implementing such runtime communication and/or job distribution irregularities. It derives its main power from operations on arrays and (especially Adaptor) in reusing communication schedules based on arrays. Pointers in Fortran have a special treatment (they are aliases) and are not implemented too efficiently in most compilers. So we need to look at another distributed-memory connected-component labeling algorithm.
The simplest kind of algorithm is a relaxation algorithm known as local diffusion [4]. This algorithm itself can be used to implement efficient MPP connected-components multigrid algorithms, but we will use it in its simplest form. Instead of the previous parenthood label p (node) we will use the label c (node) just to indicate that there are no parenthood trees being created but a labeling of the nodes is built right away. The algorithm can be expressed in a few lines:
Procedure BuildConnectedComponents-LocalDiffusion (G)
# Initially let each node be its own parent:
For all nodes i in V do in parallel
p (i) <-- i
# Local diffusion relaxation:
Repeat until convergence to a stationary point
For all nodes i in V do in parallel
c (i) <-- min (c (i), { c (j) for all e (i, j) in E} )
End for
End repeat
End procedure BuildConnectedComponents-LocalDiffusion
In words, set the label of each node to the minimum of its own label and all of its neighbouring nodes, and iterate until convergence to a stationary labeling of the nodes. This kind of procedure implies a high data locality for grid graphs partitioned among processors based on physical domain decomposition and makes HPF a good implementation language. Later we will show how the Adaptor halo library can be used to do this extremely easily.
The above process requires that a temporary be made for c (n). It is well known for other relaxation procedures that this requirment can be relaxed and no temporary has to be created but updates can be made to c directly. This implies some non-determinacy, but the algorithm will still converge to the correct labeling, usually significantly faster. Also, we can easily implement the above algorithm using the heads & tails array:
Procedure BuildConnectedComponents-Relax (G)
For all nodes i in V do in parallel
p (i) <-- i
Repeat until convergence to a stationary point
# Relaxation step:
For all arcs e(i, j) in E do in parallel
c (i), c (j) <-- min ( c (i), c (j) )
End for
End repeat
End procedure BuildConnectedComponents-Relax
The local diffusion algorithm relaxes the label mismatches in the graph as in a diffusion process. The relaxation step has to be repeated as many times as the (chemical) length of the biggest percolation cluster in the network, which can be of the order of the size of the system [4]. So this algorithm is not very efficient. Figure 5 shows the absolute magnitude of the label mismatch M across all arcs during the course of the relaxation for several diluted grids, one above, one at, and one below the percolation point:
# Total mismatch over all arcs:
M <-- Sum of |c (i) - c (j)| over all arcs e(i, j)
Fig 4. Convergence behaviour of the local diffusion relaxation algorithm
The grid graph was 2D of size [ 1000 x 1000 ]
A Mixed Approach
The optimal distributed-memory connected-components algorithm on most machines will consist of two phases:
- Local Phase--in which each processor performs a local connected-components algorithm on the arcs and nodes that it owns. Note that this phase needs to ignore remote arcs. Different algorithms can be used for this local phase. From an algorithmic point of view, it is best to use one of the (approximately) O (n) algorithms, like breath (depth)-first search. We decided to use the above parallel PRAM arc-based algorithms for the local phase, though. This has several motivations:
- Our data structures are arc-based and conversion to adjacency lists takes logarithmic time itself.
- As shown later, this "local" part can be implemented very easily in HPF without the cumbersome
HPF_LOCAL model. More importantly, this implementation allows a good HPF compiler to produce code that can be executed on a cluster of SMP's, where only concurrent reads/writes need to be handled with some care.
- The algorithm is simple and does not require complicated data-structures like priority ques or balanced binary trees, which are relatively difficult to implement in array-based Fortran.
This local phase builds parenthood trees that do not involve nodes not residing on the same processor. It is important to understand that only some of the parent nodes will need to update their own parents to make a true global connected-components labeling, and that other nodes can later just update their labels from their parents locally.
- Global Phase--in which only remote arcs are involved and a global connected-components algorithm is invoked on a reduced graph consisting of only remote-arcs and nodes they are incident on, remote nodes.
Before entering the global connected-components algorithm, we need to collapse all the remote arcs, that is, make their head & tail pointers point to the parents of their head & tail. This will dramatically reduce the number of nodes involved in the global phase, since it is likely that many remote arcs have the same head and tail (reducing the number of nodes will reduce the halo size needed later on and thus dramatically reduce communication costs). It might be beneficial to remove duplicate remote arc entries as well to reduce the number of inner iterations in the global connected-components algorithm.
We will use the local-diffusion algorithm for this global phase operating on the reduced graph, because of its efficient HPF implementation in Adaptor and simplicity. After the global algorithm is done, all non-parent nodes need to update their labels from their parent nodes to complete the global labeling phase.
Both of these will be illustrated well on a concrete example and with actual code later on. For now, here is pseudo-code for this algorithm:
Procedure BuildConnectedComponents-Distributed (G)
# Select the non-remote (local) arcs (Pi is the processor owning node i)
E(l) <-- {e(i, j) | (Pi = Pj)}
# Local phase:
BuildConnectedComponents-PRAM(G(l)=(E(l), V))
# Mark the nodes that are not parents (in TreeCollapse()):
V(np) <-- {i | p (i) != i}
# Select only the remote arcs:
E(g) <-- E \ E(l)
# Condense the remote arcs:
EdgeCondensation (E(g))
Remove duplicate entries from E(g)
# Global phase
BuildConnectedComponents-Relax ({G(g)=(E(g), V))
# Update of global labels:
PointerJumping (V(np))
End procedure BuildConnectedComponents-Distributed
Implementation in ADP_HPF
We will now discuss an initial implementation of the above connected-components labeling distributed algorithm within Adaptor, using some of its extensions to HPF (in Adaptor's terminology, ADP_HPF). Altough pieces of code will be given, these are not final and some key details are missing. There are still several important issues that need to be resolved about how to implement certain operations efficiently, and any suggestions would be greatly appreciated on how to resolve the following issues:
- Marking the remote arcs and possibly packing the
ends array to contain only remote arcs (henceforth packing problem).
- Removing duplicate arcs (henceforth sorting problem)
- Passing
ALLOCATABLE arrays aligned with (:) to HPF_LOCAL routines (allocatables problem)
There are other problems that we will discuss, but these are the ones that stick out.
Representation of the Graph
To review what was said earlier: Let the number of nodes (after a possible dilution) be n_nodes and the number of arcs (again, after the dilution which will not be discussed here) be n_arcs. The graph connectivity is fully stored in the heads & tails array ends (n_arcs, 2)--we store the head and the tail of each arc in the location ends(arc,:):
ends (arc, : ) = [head, tail] ! [ and ] are array constructors in Adaptor (and F2K)
We introduce the array labels (n_nodes) to store the node labels and/or parents (the two are neccessarily the same for non-parent nodes).
In fact, throughout the program we will meet several node and arc arrays. We will have two logical array masks nodes_mask (n_nodes) and arcs_mask (n_arcs) that we will use as templates for the alignment of these arrays (since a template can not be allocatable in HPF 2, and we will also need these arrays frequently).
Distribution of the graph
Since we are interested in distributed-memory connectivity algorithms, something must be said about the distribution of the node and arc arrays onto (abstract) HPF processors. I will use coloring to denote ownership of the data on certain processors. The example from figure 1 was executed on 3 processors and was a 2D grid of size 7x7 with 70% of the arcs present. Processors P1-P3 will have red, green and blue color correspondingly, and the color of each node/arc will denote the processor that is its owner. Once again, here is the graph that we will use as an example throughout this section:
Fig 5. Initial labeling of the graph
A major choice to be made in an HPF implementation of the algorithm is whether to distribute the graph arrays BLOCK (according to an approximate physical domain decomposition of the plane region of size LxL into approximately rectangular blocks), or GEN_BLOCK (according to a precise rectangular domain decomposition). In both approaches, we need to number the nodes so that numbers that are close in the numbering correspond to nodes that are very close in the network physically (for example, nodes could be numbered consecutively in rows first, then consecutively in columns). Both BLOCK and GEN_BLOCK have some advantages and disadvantages in this context:
BLOCK offers better load balancing and is easy to implement. It is also more efficient and covered under HPF 1. It also relieves the programmer for worrying about the physical meaning of the graph and can thus be used efficiently for general graphs.
- The disadvantage of
BLOCK is that it makes the physical decomposition somewhat irregular (for example, corners will be "chipped" off of the rectangles) and thus one can not use special LOCAL speed-up techniques that use the physical representation of the graph. For example, the famous Hoshen-Kopelman percolation connected-components algorithm uses the fact that the graph has contigious rows/columns. Also, this distribution will require that we do a redistribution when, for example, removing some of the nodes/arcs.
- Simply reverse the advantages/disadvantages above when discussing
GEN_BLOCK--it offers a nice physical representation of the partitioning that can be used to speed certain parts (for example, making the HALO data structures). Also, it avoids certain redistributions.
- But
GEN_BLOCK can cause load imbalances when the number of processors is of the same order as the number of nodes (this will not be usually the case, but some phases, like the global phase above, will work on a small reduced subgraph of the whole network). More importantly, HPF does not have a mechanism to tell a certain procedure that a global, USE-associated argument is indeed distributed GEN_BLOCK, so all code that uses these arrays must be encapsulated into subroutines that declare their arguments to be GEN_BLOCK (all of this in fact stems from what Adaptor describes as, "every access to a distributed array has only one reaching distribution, and this is the last syntactical DISTRIBUTE or REDISTRIBUTE statement").
- Most importantly (this was the deciding factor),
GEN_BLOCK can not be used for general graphs and makes the design of a general-purpose network optimization HPF library more complicated.
Therefore, we decided to distribute all arrays BLOCK and do redistributions when necessary:
INTEGER, PRIVATE :: alpha ! Just a dummy index
!HPF$ DISTRIBUTE (BLOCK) :: arcs_mask ! Use BLOCK distribution
LOGICAL, DIMENSION (:), ALLOCATABLE :: arcs_mask ! Mask/template for arcs
! Adaptor does not support alignment with (:) for allocatables yet:
!HPF$ ALIGN (alpha,*) WITH arcs_mask (alpha) :: ends
INTEGER, DIMENSION (:, :), ALLOCATABLE :: ends ! The heads and tails array
!HPF$ DISTRIBUTE (BLOCK), DYNAMIC :: nodes_mask ! Does not have to be dynamic for now
LOGICAL, DIMENSION (:), ALLOCATABLE :: nodes_mask ! The nodes mask/template
!HPF$ ALIGN (alpha) WITH nodes_mask (alpha) :: heights, labels
!HPF$ DYNAMIC :: labels ! Needs to be DYNAMIC to use HALOs
INTEGER, DIMENSION (:), ALLOCATABLE :: labels, heights ! labels and heights of node trees
During the transition from local to global phase we will operate on a subgraph of G that will contain all of the remote arcs. This graph will have similar arrays to the ones above, but their names will have the suffix "_remote" after the names above (for example, ends_remote). Similarly, once we remove duplicate arc entries just before starting the relaxation process, the suffix will be "_global" (like ends_global).
Local Phase
The local phase can be implemented very easily in HPF. The only tricky point is that remote arcs (i.e. arcs whose head & tail belong to different processors) need to be ignored in this phase, so we need to find the remote arcs and set their mask (temporarily) to .TRUE.. There is no clean way of doing this in HPF without the HPF_LOCAL model, and in Adaptor many of the HPF_LOCAL_LIBRARY routines are not supported, so that we couldn't really think of a standard-conforming way for doing this. So we will assume that somehow, each processor marks the arcs that have an end that belongs to a different processor. A pseudo-HPF code for doing this would be:
EXTRINSIC (HPF_LOCAL) SUBROUTINE MarkNonLocalArcs (ends, nodes_mask, arcs_mask)
! Declare the argument types and distributions:
!HPF$ DISTRIBUTE (BLOCK) :: nodes
!HPF$ DISTRIBUTE (BLOCK) :: arc_mask
! Adaptor has problems with this (use wrapper routines to trick it):
!HPF$ ALIGN (alpha, *) WITH arc_mask (alpha, *) :: ends
INTEGER, DIMENSION (:, :), INTENT (IN) :: ends
LOGICAL, DIMENSION (:), INTENT (IN) :: nodes_mask
LOGICAL, DIMENSION (:), INTENT (OUT) :: arcs_mask
!
INTEGER :: arc, node, node_lbound, node_ubound, indx
!
! Assume that Lower- and UpperBlockBound(array) return the upper and lower
! global indexes of the block of a BLOCK-distributed array
! that belongs to this processor.
! In Adaptor, LBOUND and UBOUND have that behaviour:
node_lbound = LowerBlockBound (nodes_mask)
! In Adaptor: node_lbound = LBOUND (nodes_mask,1)
node_ubound = LowerBlockBound (nodes_mask)
! In Adaptor: node_ubound = UBOUND (nodes_mask,1)
!
! Now mark the remote arcs as .TRUE.:
arcs_mask = .FALSE.
DO arc = LBOUND (arcs_mask, 1), UBOUND(arcs_mask, 1)
DO indx = LBOUND (ends, 2), UBOUND (ends, 2) ! indx = 1, 2
! If node is out of range, the arc is remote:
node = ends (arc, indx)
IF( (node < node_lbound) .OR. (node > node_ubound)) THEN
arc_mask (arc)=.TRUE.
END IF
END DO
END DO
!
END SUBROUTINE MarkNonLocalArcs
We encounter a few problems even in this first routine. First, because Adaptor does not support LOCAL_TO_GLOBAL, we used L(U)BOUND. The standard does not really specify the behaviour of these Fortran 90 intrinsics for HPF_LOCAL routines.
More importantly, the routine won't work with Adaptor because the array ends is aligned with:
ALIGN (i) WITH ... (i)
and can not as such be passed to a local routine. The origin of the problem is really that Adaptor does not support allocatables aligned with (:) with another allocatable. This is the allocatables problem. It is possible to work around it by using double interfaces and hiding some of the routines from Adaptor, so that one tricks its IPA analysis routines, but this is very clumsy. For now we made some workarounds by emulating an HPF style using ON (P (i)) and INDEPENDENT.
Since the MarkNonLocal... phase is always present when making a halo in HPF_HALO_LIBRARY, we propose that this functionality (marking which indirection indexes are non-local in a mask aligned with the indexes or returning a pointer array to these indexes) become a part of this library.
Here is the HPF code that implements the local phase:
! Initialization:
FORALL (node = 1 : n_nodes) labels (node) = node
heights = 0
!
! Marking remote arcs
CALL MarkNonLocalArcs ()
! The number of remote arcs is:
n_remote = COUNT (arcs_mask)
!
!HPF$ INDEPENDENT, ON HOME (arcs_mask (arc)), &
!HPF$& NEW (head, tail, head_label, tail_label, child, parent, diff_heights), &
!HPF$& RESIDENT ! Very important to state resident condition
DO arc = 1, n_arcs
! For all non-remote arcs:
IF ( .NOT. arcs_mask (arc)) THEN
head = ends (arc, 1)
tail = ends (arc, 2)
!
! This is FindRoot (head) inlined:
! (it can not be implemented as a separate procedure easily
! because the above RESIDENT clause can not be propagated)
child = head
DO WHILE (labels (child) /= child)
child = labels (child)
END DO
head_label = child
!
! This is the inlined FindRoot (tail):
child = tail
DO WHILE (labels (child) /= child)
child = labels (child)
END DO
tail_label = child
!
! We compare the labels of the two trees and merge if different:
IF (head_label /= tail_label) THEN
diff_heights = heights (head_label) - heights (tail_label)
IF (diff_heights > 0) THEN
parent = head_label
child = tail_label
! It is not always neccessary to consider the equal case separately:
ELSE
parent = tail_label
child = head_label
END IF
! Hook the trees:
labels (child) = parent
! Increase the height of the taller tree when merging equal-height trees:
IF (diff_heights == 0) heights (parent) = heights (parent) + 1
END IF
END IF
END DO
!
! Now we implement TreeCollapse()
! We will use nodes_mask to mark the parent nodes as .TRUE.:
nodes_mask = .FALSE.
!HPF$ INDEPENDENT, ON HOME (nodes_mask (node)), NEW (parent, child), RESIDENT
DO node = 1, n_nodes
child = node
DO WHILE (labels (child) /= child)
child = labels (child)
END DO
labels (node) = child
nodes_mask (child) = .TRUE.
END DO
An Example
Here is the output from the program (colored for denoting ownership) that illustrates how the local phase above works:
[hpf@gauss hpf-2DNet]$ mpirun -np 3 Test2D-Debug.x $SLIB
Enter the length L and the dilution:
7,.7
Enter the number of multiplications and connected components labelings:
0,1
# Here are node_lbound and node_ubound for different processors:
The distribution of nodes among processors is:
Procs: 1 2 3
From : 1 18 35
To : 17 34 49
Initial connectivity parenthood: [See Figure 5]
# Some statistics about the trees built in the local phase:
The maximum tree height is: 2
The average tree height is: 0.20
The number of tree-climbing steps is: 54
# Here are the parent nodes (with .TRUE. mask at the end):
Parent nodes and their ownership:
Nodes: 1 8 14 21 22 28 35 36 37
Procs: 1 1 1 2 2 2 3 3 3
# From here on we will leave the nodes and the array labels alone
# and work on the remote arcs only. Here are some statistics before we proceed:
The remote arcs belong to processor:
Procs: 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Before the collapse the remote arcs have:
Heads: 19 21 22 23 18 24 35 36 37 32 38 33 39 40 41
Tails: 12 14 15 16 17 17 28 29 30 31 31 32 32 33 34
After the local phase the connectivity is:
Fig 6. Labeling of the graph after the local phase
Preparation for the Global Phase
The transition from local to global phase is actually most difficult to handle and takes about a quarter to half of the total execution time. Therefore input on possible improvements is welcome.
PackNonLocalArcs
The first and according to my measurements most time-critical routine is PackNonLocalArcs (). This routine makes the reduced graph by taking only the edges in the heads & tails array ends for which arcs_mask is .TRUE. (i.e. only the remote edges) and packing them into a new BLOCK distributed array ends_remote.
In fact, the main implementation difficulty comes from the packing problem. Assume that we have a multidimensional array A (:, :, ..., :) of rank r and a mask A_MASK (SIZE (A, DIM)) for some dimension argument 1 <= DIM <= r, and that we want to elimininate those hyperplanes of A along DIM for which the mask is .FALSE.. This is very similar to the Fortran 90 intrinsic PACK, only that that routine always packs multidimensional arrays into one dimensional ones, here we want to preserve the rank of the array and just pack along one dimension. The interface for this routine would be something like:
FUNCTION PackAlongDim ( array , mask, dim, vector) RESULT (packed_array)
INTEGER, INTENT (IN) :: dim
<TYPE>, DIMENSION (:, ..., :, ..., :), INTENT (IN) :: array
LOGICAL, DIMENSION (SIZE (array, dim)), INTENT (IN) :: mask
INTEGER :: rank = SIZE (SHAPE (array)) ! Rank of array
<TYPE>, DIMENSION (SIZE (array,1), ..., COUNT (mask), ...,SIZE(array, rank)) :: packed_array
END FUNCTION PackAlongDim
How to implement this kind of function efficiently in Adaptor, in particular, for GEN_BLOCK (*) distributed arrays? A standard conforming version (it is not going to work with Adaptor) for the one-dimensional case (a real PACK), which is not at all memory efficient for block-distributed arrays, would be:
INTEGER , DIMENSION (SIZE (mask)) :: count_array
count_array = COUNT_PREFIX (mask)
packed_array = COPY_SCATTER ( ARRAY = array, BASE = packed_array, INDX1= count_array, MASK = mask)
Unfortunately, this can not be adopted in the multidimensional case because COPY_SCATTER (like PACK) can not scatter along a dimension (another problem with HPF that we have encountered at times). The solution that we implemented for the particular case of packing a BLOCK distributed array of rank 1 along the first dimension into another such array uses REDISTRIBUTE, RESIDENT and GEN_BLOCK to achieve this with two redistribution from GEN_BLOCK to GEN_BLOCK and a localized computation. It may not be the most efficient in Adaptor, but it is simple (suggestions are welcome--for example, one could dispose of the additional memory in packed_array_tmp by just declaring packed_array locally as DYNAMIC and working with it directly, but this for some reason seemed to slow the routine slighly):
SUBROUTINE PackFirstDim (mask_array, array, packed_array)
!HPF$ DISTRIBUTE (BLOCK) :: mask_array
!!HPF$ ALIGN (:, *) WITH mask_array (:) :: array ! Adaptor will complain because of CountLocal
!HPF$ DISTRIBUTE (BLOCK,*) :: array ! A workaround for CountLocal
!HPF$ DISTRIBUTE (BLOCK,*) :: packed_array ! Maybe DYNAMIC
LOGICAL, DIMENSION (:), INTENT (IN) :: mask_array
INTEGER, DIMENSION (:, :), INTENT (IN) :: array
INTEGER, DIMENSION (:, :) :: packed_array
!
!HPF$ DISTRIBUTE (BLOCK) :: block_sizes_local ! The local block sizes for array
!ADP$ REPLICATED :: block_sizes ! The block size array in GEN_BLOCK must be replicated
INTEGER, DIMENSION (NUMBER_OF_PROCESSORS()) :: block_sizes, block_sizes_local
!HPF$ DISTRIBUTE (BLOCK,*), DYNAMIC :: packed_array_tmp
INTEGER, ALLOCATABLE, DIMENSION (:, :) :: packed_array_tmp ! Extra memory
INTEGER :: row, col, n_rows, n_cols ! Just counters
!
! CountLocal () returns the local number of .TRUE. elements in a mask:
CALL CountLocal (mask_array, block_sizes_local)
block_sizes = block_sizes_local ! The replicated version needed for GEN_BLOCK
!
n_rows = COUNT (mask_array)
n_cols = SIZE (array, 2)
!
! Will Adaptor recognize this idom of ALLOCATE + immediate REDISTRIBUTE?
! Judging from the generated 77 code, no, but internally in DALIB, maybe...
ALLOCATE (packed_array_tmp (n_rows, n_cols))
! This could be a REDISTRIBUTE on packed_array, thus avoiding the temporary
!HPF$ REDISTRIBUTE (GEN_BLOCK (block_sizes), *) :: packed_array_tmp ! *Communication*
!
! This is a RESIDENT local packing:
CALL PackFirstDimLocal (mask_array, array, packed_array_tmp) ! *Computation*
! This can be a REDISTRIBUTE on packed_array
packed_array = packed_array_tmp ! *Communication*, implicit redistribution
!
END SUBROUTINE PackFirstDim
!
! Local resident packing:
EXTRINSIC (HPF_LOCAL) SUBROUTINE PackFirstDimLocal (mask_array, array, packed_array)
!HPF$ DISTRIBUTE (BLOCK) :: mask_array
!HPF$ DISTRIBUTE (BLOCK, *) :: array ! In fact, it should be aligned with mask (Adaptor will complain)
!HPF$ DISTRIBUTE (GEN_BLOCK(), *) :: packed_array
LOGICAL, DIMENSION (:), INTENT (IN) :: mask_array
INTEGER, DIMENSION (:, :), INTENT (IN) :: array
INTEGER, DIMENSION (:, :) :: packed_array
!
INTEGER :: indx, count
count = 0
DO indx = LBOUND (mask_array, 1), UBOUND (mask_array, 1)
IF (mask_array (indx)) THEN
count = count + 1
! Array is ALIGNED with mask in this case and conforms it:
packed_array (LBOUND(packed_array, 1)+count-1, :) = array (indx, :)
END IF
END DO
!
END SUBROUTINE PackFirstDimLocal
Now we are ready to pack the remote arcs into the array ends_remote and then collapse them as in:
ends_remote = labels (ends_remote)
However, vector subsripts must be one-dimensional arrays, so we need to rewrite this as a FORALL (same internal implementation). Notice that Adaptor will use halos with a temporary for this, since labels has the DYNAMIC attribute. In the actual program we did this manually. The shadow is best allocated on the heap in this case, because it will not be used again:
SUBROUTINE PackNonLocalArcs (...)
...
ALLOCATE (arcs_mask_remote (n_remote), ends_remote (n_remote, 2))
!
! ends_remote = PackAlongDim (array = ends, mask = arcs_mask, dim = 1)
CALL PackFirstDim (mask_array = arcs_mask, array = ends, packed_array = ends_remote)
!
! We collapse the remote edges (Adaptor will use halos):
FORALL (arc = 1 : n_remote) ends_remote (arc, :) = labels (ends_remote (arc, :))
!
END SUBROUTINE PackNonLocalArcs
RemoveDuplicateArcs
The second important routine, which my preliminary timings indicate is not time-critical, is removing duplicate entries from ends_remote (we will also remove self-cycles, that is edges of the form e(i, i)). As we will see in a moment, this array will be used as an indirection array in making a HALO for labels. In the current implementation of Adaptor, such duplicate entries create duplicate shadow nodes, so that the communication requirements of the program will be much bigger than needed. In the next version of Adaptor, a sorting algorithm will be provided that will eliminate such duplications. However, this sorting will itself be very expensive. Therefore we decided to implement a specific kind of duplicate-entry removal that uses the high data-locality property of the network.
In esence, most duplicate arc entries will reside on one or two neighbouring processors. This is because remote arcs are close physically in the network and thus close index-wise in the array ends. So, all we need to do is remove remote arc entries locally on each processor. In most cases this will leave two or one copies of each remote arc in the reduced graph. We believe that the cost of this approximate doubling in communication costs is less costly than sorting the arcs globally (of course, one could try to do a nearest-neighbour kind of exchange among processors to see if both own the same arc, further removing duplicates). There are some interesting possibilities here. For example, it is possible to make a halo with the unsorted ends_remote, and then use this halo to find and remove duplicate entries (by setting a mask for already present arcs to .FALSE.). These require some extensions to HPF_HALO_LIBRARY though.
For now, here is a simple procedure that removes duplicate arc entries on each processor. Notice that this routine uses the simplest kind of bubble-sort algorithm, which does not have the best-case logarithmic time bound. However, implementing a sorting algorithm in HPF from scratch is not an easy task. We can also implement a ready-made F77_SERIAL sorting algorithm as well, or look for a suitable F90_SERIAL routine:
SUBROUTINE RemoveDuplicateArcs (...)
...
INTEGER :: proc, arc, test_arc, indx, head, tail
INTEGER :: n_procs = NUMBER_OF_PROCESSORS ()
!
! We are going to emulate an HPF_LOCAL routine here.
! Assume we have a BLOCK-distributed array local_procs of size n_procs
! First find the local block boundaries for each processor:
CALL LocalBounds (arcs_mask_remote, arcs_lbounds, arcs_ubounds)
!
! At first all arcs are marked as non-duplicate:
arcs_mask_remote = .TRUE.
!HPF$ INDEPENDENT, NEW (head, tail), ON HOME (local_procs (proc)), RESIDENT
DO proc = 1, n_procs
! Loop over all local arcs on this processor:
DO arc = arcs_lbounds (proc), arcs_ubounds (proc)
head = ends_remote (arc, 1) ; tail = ends_remote (arc, 2)
! Remove self-cycles:
IF (head == tail) arcs_mask_remote (arc) = .FALSE.
! For each non-duplicate arc, mark all identical arcs that follow as duplicate:
! (note that we do not care about the direction of the arc)
IF (arcs_mask_remote (arc)) THEN
DO test_arc = arc + 1, arcs_ubounds (proc)
IF (((ends_remote (test_arc, 1) == head) .AND. (ends_remote (test_arc, 2) == tail))&
.OR. ((ends_remote (test_arc, 2) == head) .AND. (ends_remote (test_arc, 1) == tail))) &
arcs_mask_remote (arc) = .FALSE.
END DO
END IF
END DO
END DO
!
END SUBROUTINE RemoveDuplicateArcs
An Example
Here we continue our example from earlier:
# After packing, the ends_remote array and its ownership is:
The remote arcs belong to processor:
Procs: 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Before the collapse the remote arcs have:
Heads: 19 21 22 23 18 24 35 36 37 32 38 33 39 40 41
Tails: 12 14 15 16 17 17 28 29 30 31 31 32 32 33 34
# Now, we collapse the remote edges to get a reduced graph with many duplicate edges:
The collapsed remote arcs have:
Heads: 22 21 22 22 22 22 35 36 37 22 37 22 37 37 37
Tails: 8 14 8 8 8 8 28 22 22 22 22 22 22 22 22
# Here is how that graph looks like (this is an added drawing):
Fig 7. The reduced graph with duplicate edges
# After removing the local duplicate arc copies:
The non-duplicate remote arcs have:
Heads: -- 21 -- -- 22 22 35 36 37 -- -- -- -- -- 37
Tails: -- 14 -- -- 8 8 28 22 22 -- -- -- -- -- 22
There are 15 remote arcs, out of 60
There are 7 arcs entering the global phase
At this point we are going to again pack the array ends_remote into an array ends_global containing only the non-duplicate arcs. This is necessary for the moment because Adaptor does not yet support the MASK argument to the HPF_HALO_LIBRARY routines:
n_global = COUNT (arcs_mask_remote)
ALLOCATE (arcs_mask_global (n_global))
arcs_mask_global = .TRUE.
ALLOCATE (ends_global (n_global, 2))
CALL PackFirstDim (mask_array = arcs_mask_remote, array = ends_remote, packed_array = ends_global)
After this packing, the reduced graph that enters the global phase is:
Fig 8. The reduced graph without local duplicate edges
Global Phase
The efficient HPF implementation of the global phase will hinge entirely on the HPF_HALO_LIBRARY. This library and the use of halos are discussed at length in papers available from the Adaptor webpage. In essence, a shadow area is appended to the end of labels that stores local copies of the remote nodes--nodes that are ends of a local arc but which are not local themselves. Although it is preferable to allocate this shadow on the HEAP, this can not be done in the current release of Adaptor. Therefore we use:
CALL HALO_ALL_DEFINE (HALO_ID = labels_halo_global, ARRAY = labels, INDEXES = ends_global)
which appends a shadow of sufficient size to the right of the DYNAMIC array labels and changes the entries in ends_global to point to the local shadow nodes instead of remote ones. During the relaxation iterations, the labels of the shadow copies will be updated from the labels of their parents (HALO_UPDATE), and then the minimum label will be found from the labels of all copies of a given physical node (HALO_REDUCE):
There are 7 arcs entering the global phase
The global arcs belong to processor:
Procs: 1 1 1 2 2 2 3
The HALO collapsed global arcs have:
Heads: 19 20 21 36 37 38 37
Tails: 14 8 8 28 22 22 51
Fig 9. The graph as it enters the global phase
Shadow nodes are enclosed in dashed rectangles and connected to their parents via dashed communication links
The global phase is now easy to implement in Adaptor:
iteration = 0
! Relaxation loop:
DO
iteration = iteration + 1
!
! Relaxation step:
! First communication step is to update the values of the shadow nodes
! from their parents:
CALL HALO_UPDATE (HALO_ID = labels_halo_global, ARRAY = labels)
mismatched = 0 ! Number of arcs with mismatched labels for head & tail
!HPF$ INDEPENDENT, ON HOME (ends_global (arc, :)), NEW (head, tail, head_label, tail_label, diff_label), &
!HPF$ RESIDENT, REDUCTION (mismatched) ! RESIDENT is required here
DO arc = 1, n_global
head = ends_global (arc, 1)
tail = ends_global (arc, 2)
head_label = labels (head)
tail_label = labels (tail)
diff_label = head_label - tail_label
! Now we efficently implement:
! head_label, tail_label <-- MIN (head_label, tail_label)
IF (diff_label /= 0) THEN
IF (diff_label > 0) THEN
labels (head) = tail_label
ELSE
labels (tail) = head_label
END IF
mismatched = mismatched + 1
END IF
END DO
! Now the values of all node copies are reduced:
CALL HALO_REDUCE_MINVAL (HALO_ID = labels_halo_global, ARRAY = labels)
!
! Convergence check:
IF (mismatched == 0) EXIT
!
END DO
During the global phase, it is likely that the number of iterations needed for convergence will be of the order of the number of processors p, since the maximal percolation cluster will span across several processors:
Now at iteration: 1 with total mismatch 86
Now at iteration: 2 with total mismatch 42
Now at iteration: 3 with total mismatch 0
Total number of global iterations: 3
After the global relaxation phase the node labels are:
Fig 10. The reduced graph after the global phase
The labeling of the original graph at this stage reflects the changes in the labels of the parents that participated in the global phase, while non-parent nodes still have their old (incorrect) labels:
Fig 11. The graph labeling after the global phase
Parent nodes that were updated in the global phase are highlighted
After the global algorithm is done, all non-parent nodes need to update their labels from their parent nodes to complete the global labeling phase:
SUBROUTINE GlobalUpdate (...)
...
!HPF$ INDEPENDENT, ON HOME (nodes_mask (node)), RESIDENT
DO node = 1, n_nodes
IF ( .NOT. nodes_mask (node)) THEN
! Pointer jumping:
labels (node) = labels (labels (node))
END IF
END DO
!
END SUBROUTINE GlobalUpdate
Fig 11. The graph labeling after the global update
This completes our algorithm. It is now trivial to extract the desired connected component C and make a pure graph from it by again packing all the arcs and nodes which belong to this connected component.
Extraction of ST connected component
For the purposes of our project, we needed this whole algorithm in order to extract the single connected component connecting the source S to the sink (target) T (if such an "infinite" percolating cluster exists). In a serial implementation, this whole phase would have been a rather trivial breath-first search from S.
However, a full connected-components labeling is the easiest way to do this in a distributed-memory parallel setting. This document illustrates well the numerous complexities involved in parallelizing such a seamingly mindless piece of serial code.
The extraction phase consists of several stages. The first is marking the arcs and nodes which belong to the desired ST component. We will assume that the label of the source source_label and the sink_label are equal and known:
! Label the nodes that belong to the ST component as .TRUE.:
!HPF$ INDEPENDENT, ON HOME (nodes_mask (node)), RESIDENT
DO node = 1, n_nodes
IF (labels (node) == source_label) THEN
nodes_mask (node) = .TRUE.
ELSE
nodes_mask (node) = .FALSE.
END IF
END DO
!
! Marking the ST arcs requires a HALO on labels--this is a big halo:
CALL HALO_ALL_DEFINE (HALO_ID = labels_halo, ARRAY = labels, INDEXES = ends)
CALL HALO_UPDATE (HALO_ID = labels_halo, ARRAY = labels)
!
! Label the arcs that belong to the ST component as .TRUE.:
!HPF$ INDEPENDENT, NEW (head_label, tail_label), ON HOME (arcs_mask (arc)), RESIDENT
DO arc = 1, n_arcs
head_label = labels (ends (arc, 1))
tail_label = labels (ends (arc, 2))
! The second test on tail_label is redundant:
IF (head_label == source_label .AND. tail_label == source_label) THEN
arcs_mask (arc) = .TRUE.
ELSE
arcs_mask (arc) = .FALSE.
END IF
END DO
!
Then we need to renumber the nodes so that we don't count the ones that don't belong to the ST component. Notice that this is a COUNT_PREFIX operation. Since Adaptor does not support this intrinsic, we made a replacement routine CountPrefixBLOCK. Also, the heads and tails of all arcs need to be updated to point to the correct nodes with the new numbering. We can fortunately do this easily by using the halo we already made:
labels = CountPrefixBLOCK (nodes_mask) ! New numbering of nodes
!
! Now we must make the heads & tails arrays point to the right locations:
CALL HALO_UPDATE (HALO_ID = labels_halo, ARRAY = labels)
!HPF$ INDEPENDENT, NEW (head_label, tail_label), ON HOME (arcs_mask (arc)), RESIDENT
DO arc = 1, n_arcs
IF (arcs_mask (arc)) THEN
ends (arc, 1) = labels (ends (arc, 1))
ends (arc, 2) = labels (ends (arc, 2))
END IF
END DO
!
Finally, we need to reallocate the heads and tails arrays with only the ST arcs present. We stress the fact that it is this portion of the code that takes the longest because of the call to PackFirstDim with a very large array:
n_nodes = COUNT (nodes_mask)
n_arcs = COUNT (arcs_mask)
ALLOCATE (ends_selected (n_arcs, 2)) ! This is a temporary array
CALL PackFirstDim (mask_array = arcs_mask, ARRAY = ends, packed_array = ends_selected)
! These are REALLOCATEs:
! (not really standard conforming--ends_selected is aligned with arcs_mask)
DEALLOCATE (ends, arcs_mask, nodes_mask)
ALLOCATE (arcs_mask (n_arcs), nodes_mask (n_nodes))
ALLOCATE (ends (n_arcs, 2))
arcs_mask = .TRUE.
nodes_mask = .TRUE.
ends = ends_selected ! Aligned, but not standard conforming
DEALLOCATE (ends_selected)
Performance Measurements
These timing measurements documented here were done on a Linux Pentium II (350 MHz) cluster of 11 PC's connected with a 20 channel Fast Ethernet switch. The most important thing to notice is that these are not final results. The new version of Adaptor is expected to be released in the next month or so, and new solutions to some of the problems documented above will become available. Please check again soon to see if results have changed.
All results will be plotted in the form of LogLog scaling plots. In an ideal scaling world, all curves (for different number of processors p) would overlap together. There are many other important plots that need to be plotted to fully investigate performance, but these will not be done until the coding is finalized:
- In 2D:
- Local phase
- Intermediate (preparatory) phase
- Global phase
- Relation between the three phases for p=11
The most important thing to look at is the scaling and composition of the whole connected-components algorithm. The most striking thing that can be said is that the extraction phase takes longest by most accounts. So improving this phase is very important. Also, the intermediate phase of the CC algorithm takes too long. It needs improvement as well.
- In 3D:
- Local phase
- Intermediate (preparatory) phase
- Global phase
- Relation between the three phases for p=11
In the 3D case, the surface-to-volume ration is much larger for identical memory requirments, so that there are a lot more remote arcs that enter the intermediate phase. Consequently, the intermediate phase takes over and dominates the execution time. It should be stressed that the PackFirstDim routine is the most time critical routine.
References
- Connected Components on Distributed Memory Machines, A. Krishnamurthy et al., downloaded from the Internet
- Parallel Implementation of Algorithms for Finding Connected Components in Graphs, T. S. Hsu, V. Ramachandran, N. Dean, DIMACS Series in Discrete Mathematics, vol. 30, 1997
- Foundations of Computer Science Section 9.4, A. Aho, J. Ullman, see website http://www.cs.byu.edu/courses/cs236/javalect/GRAPHS3/Graphs3.html.
- A Parallel Cluster Labeling Method For Monte Carlo Dynamics, M. Flanigan, P. Tamayo, International Journal of Modern Physics C, vol.3 (6), 1992, pp. 1235-1249.
- A Parallel Multigrid Algorithm for Percolation Clusters, R. C. Brower, P. Tamayo, B. York, Journal of Statistical Physics, vol. 6 (1/2), 1991, pp. 73-88
- A Comparison of Data-Parallel Algorithms for Connected Components, J. Greiner, downloaded from the Internet