# Graph Algorithms

## Abstract

Many problems can be abstracted into vertices and edges, where the vertices represents objects, and the edges relations between objects. How can we efficiently analyze and manipulate these graphs in an efficient scalable way?

## Context

The graph abstraction allows standard analysis techniques to be performed on problems in various domains. It also allows efficient algorithms to be developed for an abstracted problem description. The abstraction increases the scale of the problems that can be solved within reasonable time and resource constraints.

Many applications have underlying structures that can be abstracted into graphs: online social networks for example can be represented a graph where people are the vertices and links to friends are the edges; airlines’ flight service map is a graph, where the cities they serve are the vertices and the scheduled flights between cities are the edges; the logic on a computer chip can be represented by a graph, where the logic gates are the vertices, and the wires connecting them are the edges; and in finite state machines (FSMs) that can be used to govern the operations of complex systems, the states are the vertices and the state transitions are the edges.

Information can be associated with vertices, edges. Properties of a vertex include gender and hobbies of a person in an online social network, population of a city on an airlines flight service map, area and power consumption of a gate on a computer chip, and the output value of a FSM state. Properties on the edge include frequency of contacts between friends in an online social network, flight time between cities in an airline flight service map, wire length between logic gates on a computer chip, and input-based state transitions between FSM states.

The graph abstraction allows us to apply well-studied graph algorithms to answer important questions with correctness and running time guarantees. For example, one can apply breadth-first-search to find the closest connection to another person you would like to meet, which is an O(V) running time algorithm for V people in the network; an airline can apply max flow algorithm to maximize crew and equipment efficiency to commit to the daily flight schedules, various flavors of the algorithm have O(VE2) (Edmunds-Karp) to O(V2E) (push-relabel) running time; a chip logic design..qer can use Dijkstra’s shortest path algorithm to find the critical logic path in a circuit, which has O(V2) running time; and a system designer can use breadth first search based reachability analysis to determine if there is a possibility for a state machine to reach an error state.

With a repertoire of well-studied graph algorithm, the graph abstraction allows us to solve problems of tremendous scales, on the order of millions of vertices on the limited computation capabilities of existing computing platforms.

## Forces

#### Problem specific optimizations

Solutions which exploit problem-specific characteristics of graphs (e.g. with a low upper bound for the number of out-degrees) can be orders of magnitude more efficient than a general solution but brittle with respect to changing underlying assumptions. A highly flexible and general solutions which ignore these special graph characteristics may be unbearably slow.

#### Graph representation

Graph data structures can be stored in sparse or dense format. A sparse data representation reduces memory foot-print, allowing larger portion of the problem to fit in cache. However, sparse representations also use more levels of indirection, which increase the latency of retrieving data from a data structure.

## Solution

A problem that can be solved using graph algorithm can be approached in four steps:

1. Recognizing problem-specific graph structure
2. Determining the data structure in which to represent the graph structure
3. Defining temporary data structures in which to store traversal variables
4. Designing partitioning and parallel traversal techniques

The list is ordered in the magnitude of influence on the final performance of the implementation. Each step is explained below.

### Step 1: Recognizing Problem-Specific Graph Structure

The most important factor affecting performance is the assumptions that can be made about the structure of a problem abstraction. Knowing the problem-specific graph structure, we can leverage its characteristics to eliminate redundant checks, partition workload into different pieces, and minimize synchronizations required between the partitioned pieces. We list some graph properties, and their parallelization opportunities. (Note: the parallelization challenge of each type of graph could be a pattern all by itself)

• Bipartite Graph
• Definition: A graph G is bipartite if V(G) is the union of two disjoint independent sets called partite sets of G.
• Parallelization opportunities: The definition of bipartite graph precludes any edges between vertices within each independent set. Although the edges between the two sets might be enormous, traversal on edges from one set to another could be done in a data-parallel fashion.
• Connected Graph
• Definition: A graph G is connected if each pair of vertices in G belongs to a path.
• Parallelization opportunities: If a graph is known to be unconnected, the unconnected components can be processed in parallel.
• Complete Graph
• Definition: A complete graph is a graph whose vertices are pair-wise adjacent.
• Parallelization opportunities: Vertices in a complete graph of any significant size has a large number of degrees. A traversal could be parallelized over the number of degrees.
• k-regular Graph
• Definition: A graph is k-regular if all vertices in G have degree k.
• Parallelization opportunities: The regularity can be very beneficial to efficient graph storage and traversal. A fixed width traversal maps well to SIMD parallelism, where each SIMD lane can be responsible for an edge on a vertex. For graph partitioning, however, k-regular graphs sets a lower bound on cut size to be at least k.
• Tree
• Definition: A tree is a connected graph where number of edges E is V+1, where V is the number of vertices
• Parallelization opportunities: Subtrees can be explored in parallel. Every cut on the tree has size 1.
• Forest
• Definition: A forest is an set of disjoint trees, undirected acyclic graph
• Parallelization opportunities: Each tree in the forest is independent. They can be processed in parallel.
• DAG
• Definition: A directed acyclic graph G is a graph containing directed edges, but does not contain any cycles. It can also be seen a directed tree where sub-trees can be share among multiple branches.
• Parallelization opportunities: shared sub-trees in a DAG can be replicated to trade-off communication with computation.

The graph structure can be inferred from the problem abstraction, or detected by doing specific graph analysis on representative benchmark set.

With even more insight into the problem, a graph can be partitioned based on its structure, making the graph structure more amenable for parallelization. Take timing optimization in CAD as an example, registers in circuits are good endpoints for timing propagation. Therefore, we can partition the circuit graph on register boundaries, and compute the partitions separately.

### Step 2: Determining Data Structure to Store Graph Structure

In order to analyze a graph or manipulate a graph, we need to select an appropriate data structure to represent the graph.

• Vertex List
• Use a list to store all vertices
• Edge information stored within the vertex data structure
• Advantage: If traversal always start at a vertex, such as in Moore FSM traversal, storing edge info at the source provides better data locality
• Disadvantage: If traversal requires accessing a specific edge, such information is not readily available in a vertex list.
• Edge List
• Use a list to store all edges.
• Advantage: If the traversal requires information about specific edges, edge list provides the ability to quickly find it.
• Disadvantage: If vertex information is stored in the edge list, there will be a lot of duplication in vertex information stored.
• For each vertex, use a list to store all other vertices that it connects to.
• Advantage: This is a compact way of representing the adjacency information
• Disadvantage: Traversal through such a data structure requires levels of indirections and longer latency
• For a |V(G)| by |V(G)| matrix, the value of entry aij is zero if there is no edges from vi to vj, the value of aij is non-zero if there is an edge from vi to vj. A sparse representation is a adjacency list.
• Advantage: The matrix can be accessed with fixed data access patterns over a dense matrix, making parallelization and partitioning only dependent on problem size, and not on input graph structure.
• Disadvantage: For sparsely connected graphs, this representation is wasteful in storage.

### Step 3: Defining Temporary Data Structures to Store Traversal Variables

Other than the data structure to store the graph, there exist the need to store temporary traversal variables. For example, a breadth-first search involves a queue to store currently active vertices; A shortest path algorithm requires a priority queue to expand the vertex with the lowest weight.

The temporary data structures can also be used to prefetch data from a variety of locations to a regular data structure. Computations can then be operated on the temporary data in a memory coalesced manner. The final results can then be scattered back into the graph structure.

### Step 4: Designing Partitioning and Parallel Traversal Techiques

There are two types of operations on graphs: 1) traversal operations preserve all graph structures and only analysis the properties of the graph; 2) partitioning operations modifies the graph structure by cutting or eliminating some of the edges for more effective parallelization.

#### Traversal operations

There are a plethora of algorithms that are used analyze graphs. The most common ones are described here.

##### Elementary Graph Algorithms:
• Input: A graph G=(V, E) and a start vertex u.
• Iteration: Maintain a queue Q and a set S. Q stores vertices that will be traversed. S stores vertices that have been traversed. Initially, Q = {u}, S = Φ, level(u) = 1. As long as Q ≠ Φ, we search from the first vertex v of Q. The neighbors of v not in S∪Q are pushed into Q, their levels are assigned as level(v)+1, and then v is removed from Q and put in S.
• Concurrent Computations: All vertices in Q with the same level value can be traversed concurrently.
• Depth-First-Search
• Input: G=(V, E) and a start vertex u.
• Iteration: Maintain a stack R and a set S. R stores vertices that will be traversed. S stores vertices that have been traversed. Initially, R = {u}, S = Φ, level(u) = 1. As long as R ≠ Φ, we search from the topmost vertex v in R. If v has a neighbor w not in S∪R, assign level(w) as level(v)+1, and push w into R. Otherwise, pop v from R.
• Concurrent Computations: For tree graph structures with no shared sub-trees, depth-first-search is equivalent to breadth-first-search. For the general graph, it is a very sequential process.
• Topological Sort
• Input: A DAG G=(V, E)
• Iteration: Maintain two lists, R and S. R stores vertices that do not have any incoming edges. S stores the sorted sequence. Initially, R = vertices that do not have any incoming edges. S = Φ. The level values of all vertices in R are assigned as 1. As long as R ≠ Φ, we search from the first vertex v of R. Remove all ongoing edges from v. By doing so, if any of v’s neighbor y becomes vertex with no incoming edges, push y into R, and assign level(y) as level(v) + 1. Push v into S.
• Concurrent Computations: All vertices in Q with the same level value can be traversed concurrently.
##### Minimum Spanning Trees:
• Kruskal
• Input: A weighted connected graph G=(V, E)
• Iteration: Maintain an acyclic spanning subgraph H. Initially, V(H) = V(G), E(H) = Φ. Sort E(G) by ascending order. Traverse edges by the sorted order. If an edge joins two components of H together, add this edge into H. Otherwise, discard it. Terminate while H is connected.
• Concurrent Computations: The sorting process can be done in parallel. When adding edges into H, if the components joined by the edges are independent, the joining process on different edges can also be done concurrently.
• Prim
• Input: A weighted connected graph G=(V, E)
• Iteration: Maintain an acyclic spanning subgraph H. Initially, we arbitrarily pick up one vertex from G and put it into H. Examine edges that have one endpoint in H and one endpoint in G – H, pick up the edge with minimum weight, and add the edge with the vertex into H. Terminate while H spans the whole graph.
##### Single Source Shortest Paths:
• Bellman-Ford
• Input: A weighted directed graph G=(V, E), a source vertex s.
• Iteration: Initially, d(v) = ∞ if v ≠ s. d(s) = 0. Relax all edges (|V| – 1) times. When relaxing an edge e(u, v), update d(v) according to the formula d(v) = min(d(v), d(u) + w(e)). w(e) is the weight of e. d(u) is the distance between s and u.
• Concurrent Computations: When relaxing edges, edges with different heads can be relaxed concurrently.
• Dijkstra
• Input: A nonnegative weighted directed graph G=(V, E), a source vertex s.
• Iteration: Maintain a set S of vertices to which a shortest path from s is known. Initially S={s}. d(s) = 0. d(s, v) = w(e(s, v)) if v is a neighbor of s. d(s, v) = ∞ otherwise. For all vertices outside S, pick up the vertex u with the minimum distance value, add u into S. Assuming y is a neighbor of u, if it is not in S, update its distance value by d(s, y) = min(d(s, y), d(s, u)+w(e(u, y)). Terminate while S = V(G) or d(s, v) = ∞ for all v outside S.
• Concurrent Computations: After we have picked up a vertex v and added it into set S, we can update the distance value of all its neighbors concurrently.
##### All-Pairs Shortest Paths:
• Floyd-Warshall
• Johnson
• Input: A weighted directed graph G=(V, E).
• Iteration: First, insert a source vertex s, connecting all nodes in V(G). Second, apply the Bellman-Ford method to find the single source shortest path with respect to s. Second, update all edge weights as w(e(u, v)) = w(e(u, v)) + d(s, u) – d(s, v). Finally, apply the Dijkstra’s algorithm on all nodes to find all-pair shortest paths.
• Concurrent Computations: When applying the Bellman-Ford and Dijkstra algorithm, the parallel potential is the same as that in these two algorithms. Moreover, in the final stage, we can apply the Dijkstra algorithm on all vertices concurrently.
##### Single Source Longest Paths in Directed Acyclic Graph:
• Floss
• Input: A weighted DAG G=(V, E). A source node s.
• Iteration: Maintain a list R. R stores vertices that do not have any incoming edges. Initially, R = {s}. As long as R ≠ Φ, we search from the first vertex v of R. Assuming y is v’s neighbor, updating d(s, y) by d(s, y) = max(d(s, y), d(s, v) + w(e(v, y))). Remove all ongoing edges from v. If any of v’s neighbor y becomes vertex with no incoming edges, push y into R. Push v into S.
• Concurrent Computations: When picking up a vertex v, the updating process among all v’s neighbors can be done concurrently.
##### Maximum Flow:
• Edmonds-Karp
• Input: A connected directed graph G=(V, E). Capacity c(e) on each edge e. A source node s, a sink node t.
• Iteration: Maintain a residual network Gf of G. Assuming the distance on each edge is 1, use BFS to find a shortest path p from s to t. Augment through path p and update the residual network Gf. Repeat the procedure until BFS cannot find a path p from s to t.
• Concurrent Computations: The same as the concurrent computations in BFS.
• Push-Relabel
• Input: A connected directed graph G=(V, E). Capacity c(e) on each edge e. A source node s, a sink node t.
• Iteration: Maintain an excess flow and a height on each vertex. Initially, the excess flows of source vertex’s neighbors are assigned as the capacity between the source node and them, and the excess flows of other vertices are assigned 0. The height of source vertex is assigned as |V(G)|, and the height of other vertices are assigned 0. For vertices that have excess flow larger than 0, apply the push operation or relabel operation according to the relationship between them and their neighbors. For a vertex u with excess flow larger than 0, if u has a neighbor v with height less than u by one and the capacity from u to v is larger than 0, apply the push operation to push flow from u to v. If all neighbor of u have height larger than or equal to u, apply the relabel operation to increase the height of u. Repeat the process until there are no valid push and relabel operations exist on the residual network.
• Concurrent Computations: The push operation on different vertices can be applied concurrently. However, if more than one vertices are pushing flows to the same vertex, the lock-unlock mechanism should be applied on that vertex to maintain the correctness of the operation. The relabel operation on independent vertices can be applied concurrently.
• Relabel-to-Front
• The concept of the relabel-to-front is similar to the push-relabel method. However, it bundles all possible operations on a vertex together. As a result, if a vertex u should apply the push operation, it will push the excess flow from u to all valid neighbors until no further push operation can be applied. If a vertex u should apply the relabel operation, we will apply the push operation on u after the relabel operation immediately.
##### Graph Coloring:
• Chaitin
• Input: An undirected graph G=(V, E).
• Iteration: When finding a k-coloring on the graph, remove vertices on the graph with degree less than k iteratively until the graph is reduced to an empty graph. Reconstruct the graph by reverse order, and draw the vertices during the reconstruction process.
• Bomen
• Input: An undirected graph G=(V, E). A predefined step parameter s.
• Iteration: Divide V(G) into p subsets V1, …, Vp. Put each subset into one processor. Phase 1: Pick up s vertices in each processor, assign them colors. Phase 2: If vertices in different processors result in some confliction, randomly pick one of them as uncolored. Repeat phase 1 and phase 2 until all vertices are colored.

Different algorithms can have different parallel implementations. The Parallel BGL library [2] implements some of the algorithms listed above using distributed graphs and distributed queues.

#### Partitioning operations

There is a limited number of parallelization opportunities for many graph structures. Given domain knowledge, we can choose to eliminate some edges and scope the traversal operation into independent localized partitions such that the partitions can be solved in parallel.

Graph partitioning is a very important step for parallelizing graph algorithms. Communications between different processors are very expensive. Good partitions can minimize the communications among processors, and improve the overall performance. Most graphs have unpredictable structures, which makes the estimation of communication difficult. In order to define good partitions, we need to rely on the properties of the graphs and the problems. For example, if we are analyzing a tree, we know that each edge on the tree defines a cut. By examine all edges, we can definitely find the best cut of the tree. If we are partitioning a circuit, usually the pipeline stages form good partitions.

Here are some common graph partitioning algorithms

• Kernighan-Lin
• Input: A graph G=(V, E), and |V(G)| is an even number.
• Iteration: Find a pair of unlocked vertices va in part a and vb in part b such that the exchange of these two vertices makes the largest decrease or smallest increase in cut cost. Lock va and vb after the exchanger operation. After all the vertices are locked, find the exchange subsequence that makes the largest decrease in cut cost. Repeat such procedure until there is no improvement on cut cost.
• Fiduccia-Matheyses
• Input: A graph G=(V, E).
• Iteration: Maintain a bucket of linked lists. Each linked list represents a specific gain for moving one unlocked vertex to the other partition. Pick one unlocked vertex u from nonempty linked list with highest gain, and try to move it to the other partition. If it is valid, move it, update the gain of vertices that connect to u, update the corresponding linked lists, and lock u. If all vertices are locked or all movements are invalid due to the balance constraint, find the movement subsequence that makes the largest decrease in cut cost. Repeat such procedure until there is no improvement on cut cost.
• ParMetis
• Input: A graph G=(V, E).
• Iteration:
1. Coarsening Phase: Recursively combine nodes in the graph to reduce the size of the graph. Each processing element will merge nodes inside its storage space independently.
2. Partitioning Phase: Broadcast all information inside each processing element to one another. Therefore, each processing element can reconstruct the shrunk small graph inside its storage space. Execute the same partition algorithm on each processing element, and then update the partition information of nodes belongs to it. Finally remove additional nodes and edges to restore the distributed graph topology at the beginning of this phase.
3. Uncoarsening Phase: Recursively restore the shrunk graph into its original topology. After the each recovery operation, refine the partition inside each processing element in parallel.

With the four steps described above ordered in the magnitude of influence on the final performance of the implementation, one can begin to formulate an efficient implementation strategy for a problem with a graph abstraction. Since the abstracted graph structure is of great importance, a domain expert should explore several abstractions and examine their graph structures according to this pattern and consider the implementation implications before committing to the final graph abstraction.

## Invariant

The invariant will be relative to the particular algorithm. Here are some examples:

• Breadth-First-Search: All vertices in the same level cannot be traversed until all vertices in the upper level have been traversed.
• Kruskal: All edges will be accessed in non-decreasing order.
• Prim: Among all outgoing edges from the set of traversed vertices, the one with minimum weight will be traversed next.
• Dijkstra: For the vertices in the determined set, the shortest distance between the source node and these vertices are finalized.
• Floss: Before traversing an edge from vertex u to vertex v, the longest path to vertex u is finalized.

## Examples

Suppose that we are trying to transfer goods from one city to another as many as possible. We use the airline route map for the transportation. Each airline route has its unique capacity of carrying goods, which limits the number of goods we can transfer using one route. Now we will apply the graph algorithm to solve the problem.

1. Use a graph to represent the problem, and verify the characteristics of the graph. We use a source vertex and a sink vertex to represent the starting city and the target city. All other cities that are between the starting city and the target city are also represented by vertices. If there is an airline route from city a to city b with capacity c, we will use a directed edge from a to b to reflect the route. The directed edge will have two properties. One is its capacity; the other is the number of goods we are transmitting through the route. The graph structure does not fit in any specific graph categories that we introduced in the solution section. However, the routes between cities have good locality properties in general. That is, there are more airline routes between cities that are close together, and less airline routes between cities that are far away from each other.
2. Determine the data structure to store the graph. Usually the airline route map is a sparse graph. Therefore, we choose the adjacency list data structure to store the graph.
3. Partition the graph. In step 1, we recognize that the airline route map has good locality properties that we can rely on. Moreover, we are using the adjancency list to store the graph. Therefore, we can partition vertices by their physical positions. For example, we can put all cities inside California in one partition, all cities inside Seattle in another partition, and so on.
4. Analyze the graph by specific graph algorithms. It is a maximum flow by nature. Therefore, we can apply the Edmond-Karp algorithm, the push-relabel algorithm, or the relabel-to-front algorithm to solve it.

## Known Uses

##### Online social networks
• Abstraction: people are abstracted as vertices, with gender and hobbies as vertex properties; friend connections are abstracted as edges, with frequency of contacts as a possible edge property
• Question: How to find the closest connection to person B a person A would like to meet?
• Solution: Breadth first search with person A as a starting node, and stop when you have reached person B, back-track to find the shortest path to person A. This is an O(V) running time algorithm for V people in the network
##### Airline flight/crew scheduling
• Abstraction: cities are abstracted as vertices, with flight crew’s home base as a vertex property; flight connections between cities are abstracted as edges, with crew requirements and size of planes used on routes as edge properties
• Question: How to maximize crew and equipment efficiency to fulfill the daily flight schedules?
• Solution: Use the max flow algorithm to maximize crew and equipment utilization, where various flavors of the algorithm have O(VE2) (Edmunds-Karp) to O(V2E) (push-relabel) running time
##### Circuit timing analysis
• Abstraction: logic gates are the vertices, with silicon area and power consumption of a gate are vertex properties; the wires connecting them are the edges, with their length, width and pitch as edge properties
• Question: What is the timing critical path in the logic?
• Solution: use Dijkstra’s shortest path algorithm to find the critical logic path in a circuit, which has O(V2) running time
##### Finite state machine traversal
• Abstraction: states are the vertices, with the output value of a FSM state as a state property; the state transitions are abstracted as the edges, with input symbols as edge property.
• Question: Can one reach an error state with sequences of inputs starting from known good start state?
• Solution: Use breadth first search based reachability analysis to determine if there is a possibility for a state machine to reach an error state, O(V) for V states.

## Related Patterns

• Dense Linear Algebra
• Sparse Linear Algebra
• Graphical Models
• Backtrack and Branch-and-Bound
• Finite State Machines

## Authors

Bor-Yiing Su, Jike Chong

## References

[1] Douglas B. West, Introduction to Graph Theory, Prentice Hall Book Co., 2nd Ed., 2001.

[2] Douglas Gregor and Andrew Lumsdaine, “The Parallel BGL: A Generic Library for Distributed Graph Computations.” In Parallel Object-Oriented Scientific Computing (POOSC), July 2005.

[3] Gregor, D. and Lumsdaine, A., “Lifting Sequential Graph Algorithms for Distributed-memory Parallel Computation.” In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications (OOPSLA), October 2005.

[4] Kirk Schloegel, George Karypis, and Vipin Kumar, “Parallel Static and Dynamic Multi-constraint Graph Partitioning.” In Concurrency and Computation: Practice and Experience. Volume 14, Issue 3, pages 219 – 240, 2002.

[5] Alex Breuer, Peter Gottschling, Douglas Gregor, and Andrew Lumsdaine, “Effecting Parallel Graph Eigensolvers Through Library Composition.” In Performance Optimization for High-Level Languages and Libraries (POHLL), April 2006.