The problem of finding the strongly connected components of a directed graph has been solved in an algorithmically optimal manner for over 40 years. In this article we look at an algorithm which is efficient not only algorithmically, but also from a memory usage perspective.
One of the fundamental data structures in computer science is the directed graph. It finds applications in everything from quantum mechanics to physical mail delivery.
A directed graph can be thought of as a collection of circles connected by arrows. An arrow tells us that it is possible to get from one circle to another. If we can get from one circle to another circle by following a series of arrows (in the correct direction) then we say that these circles are connected.
In many cases we might be able to get from circle "A" to circle "B" by following arrows, but we might not be able to get from B back to A. In this case we say that A and B are weakly connected. If its possible to get from A to B and also from B back to A then we say that they are strongly connected.
If A is strongly connected to both B and C then we know that B and C are also strongly connected to each other, since we can get between them by going via A.
By grouping together circles which are strongly connected to each other we can break a graph up into strongly connected components (SCCs).
It turns out the problem of breaking up a graph into strongly connected components is one which pops up all the time in areas including disease transmission and ocean current analysis.
In the rest of this article we'll look at algorithms to solve this problem and how they're implemented in Scipy to be super fast and efficient.
The canonical SCC algorithm was published by Robert Tarjan in 1972 [PDF]. His paper describes a recursive algorithm which is based on a depth first search. Many implementations of this algorithm can be found in a variety of languages.
Tarjan's algorithm is algorithmically optimal, since the time taken to compute the SCCs is linear in both the number of circles (nodes) and arrows (edges). One of its drawbacks however is that it is a recursive algorithm. When implemented in most languages recursion is both slow and memory intensive due to the need to maintain a complete call stack.
One way to mitigate this problem is to convert the recursive algorithm into an iterative one. It can be shown that this allows the problem to be solved with a per-node memory usage of 5 words plus 2 bits, or 162 bits per node when working with 4 byte words. This equates to approximately 19.3MB per million nodes. This of course assumes that the algorithm has been implemented with memory optimisation in mind, which is often not the case!
Now 5 words per node sounds pretty good, but if we could shave this down by a word or two we could make significant savings! Is such an improvement possible?
In 2005 David Pearce analysed this exact question and came to the conclusion that with a bit of tweaking, Tarjan's original algorithm could be modified to only require 3 words per node (or 11.4MB per million nodes) [PDF]. So far this is the most efficient (in terms of memory usage) algorithm known to solve the SCC problem.
Pearce's paper presented pseudocode for his algorithm in recursive format in order to contrast it with Tarjan's original algorithm. This recursive algorithm was implemented within Scipy in version 0.11.0.
This implementation suffered from being recursive and used orders of magnitude more memory than the theoretically possible 3 words per node. In version 0.12.0 of Scipy I implemented the iterative version with optimal memory usage. This version was able to handle graphs around 7,500 times larger than the original before hitting the memory limits of a standard PC.
Lets have a look at how this algorithm works and how we are able to run it with such a low memory overhead.
We start with two counters,
label. We initialise these to
N respectively, where
N is the number of nodes in our graph.
We now pick an arbitrary node as a starting point and perform a depth first search (DFS) traversal of our graph. At each node we assign the current value of
index to the node and increment
index by one. In this way we assign incrementing numbers to each node which is weakly connected to our starting node. We call the number assigned to each node its link number.
The clever part of the algorithm comes when we have processed all the children of a node and are ready to backtrack up the tree to continue the DFS traversal. When backtracking past a node, we consider all of its children and find the smallest link number of these child nodes. If this is smaller than the link number of the current node, we update the link number of the current node and put the node onto a stack.
If none of the children have a smaller link number then we have found a root node. None of the children of this node point to a node with a smaller link number than the current one, which means that they can't possibly be strongly connected to any nodes "lower" in the graph.
In this case we assign the value of
label to the root node. We also assign
label to all the nodes in the stack with a link number greater than or equal to the root's link number, removing them from the stack at the same time. Once we've done all this we decrease
1 and decrease
index by the number of nodes we just updated.
Once we have completed this backtracking step for all the nodes in our depth first traversal from the original node we will have assigned a label to all the nodes (weakly) connected to the original node. There might still be nodes which are completely disconnected from the original node, so we repeat the process for every node in our graph, skipping nodes which have been assigned a label already.
At the end of all this each node will have a label assigned to it, and all nodes which are strongly connected to each other will have the same label. As such, we\'ve solved the problem of determining the SCCs of the graph.
The following tool allows your to step through this algorithm for an example directed graph.
We can now look at how much memory we need for this algorithm.
For each node we use one word of memory to store the link number/label.
To perform the DFS we need to keep a stack of nodes. We only need to visit each node once. As such, if we're trying to push a node which is already on the stack we would like to be able to extract the node from inside the stack and move it to the top of the stack. We can implement this data structure with a doubly linked list in a fixed sized array, which requires 2 words per node. The operation of looking for an element within the stack and moving it to the top can be performed in constant time, maintaining the linear algorithmic complexity of the complete algorithm.
We also need to keep a stack of backtracked nodes, which would usually require another one word per node. As we can observe from the interactive tool above, nodes on the backtrack stack have already been removed from the DFS stack. This means we can share the memory used by the DFS stack and implement the backtrack stack as a singly linked list in a fixed size array, requiring no extra memory.
This gives us a total of 3 words per node, as claimed by Pearce. As such we can claim that the SCC algorithm implemented within Scipy is optimal from a memory usage perspective, whilst maintaining linear computational complexity.