In this article I’m going to share how I optimized a graph algorithms interview question down from 58s down to 1.2s, nearly a 50x improvement! TLDR: Use the right data structures, exploit cache efficiency, and do less work.
For those unaware, F# is a functionalfirst language that runs on the .NET platform (think C#). Despite the title of the article, very little of the optimization techniques applies solely to F#, and so I have intentionally wrote this article to make it comprehensible even if you don’t understand a line of code in the article.
Why F#? All significant optimizations come from careful profiling and significant refactoring, and F# has very good support for both. F#’s strong type system makes refactoring safe, and Visual Studio integration makes profiling easy. Also, I love that I get highly tuned data structures via .NET to speed up my algorithms.
Bear in mind that I am by no means an F# expert, so there may be better ways to optimize this code.
The question
As with all interview questions, they usually come with a good ‘ol blurb which attempts to give some context to the horridly abstract question you’re expected to comprehend. Here’s my attempt at paraphrasing it (note that all this is original writing, nothing copied):
The question: Applications in 2020 typically have hundreds of dependencies, all of which need to be compiled. The downside is, changing one module might cause all the downstream modules to be recompiled. The task is as follows: given a set of 2000 modules, all of which depend on each other in some way, print the number of modules downstream of each module.
The input will look like the following. In this representation, module A depends on B, E, and F. As a result, module A is downstream of module B (and E,F too).
A B E F
B D
C B D E
D E
E
F D
This corresponds to the following dependency graph:
E

D
/ \
B F
/ \ /
C A
Notice that not all edges are drawn, in particular any transitive edges are not drawn. This will come in useful later.
This program will produce the following output, corresponding to the number of downstream modules (including itself):
A 1
B 3
C 1
D 5
E 6
F 2
The original solution
The exact solution isn’t super critical, so I won’t go into too much depth (as well as not to spoil the fun for those new to the question). Basically, it involves building a Directed Acyclic Graph (DAG) out of the list of lists, and then running a DepthFirst Search in order to identify the number of children under each node.
Before a recursive call to the DFS on some node returns, it updates a shared hashtable, to indicate which children are under the node. For instance, in the tree:
E

D
/ \
B F
/ \ /
C A
When the call to node B returns, the map will contain {B: {B, C, A}}.
With this information, finding the child nodes under a node is as simple as calling the DFS on each child node, then merging the set of all its child nodes.
When I first implemented this algorithm in F# a year back, it worked very well, but took 55s for a large dataset. And honestly, that was fine. It solved the problem! Problem solving is always the priority.
But that’s not enough for you, dear reader. You want to know how to squeeze that last drop of performance, and that’s what the rest of this article is about.
Performance with Data Structures
As Chandler Carruth puts it, performance with data structures, efficiency with algorithms. The thing that made the most difference in my F# code was Data structures, Data structures, Data structures.
When I first wrote my F# code, it was filled with Maps and Lists. Now, that was very good F# code: idiomatic and easy to read, but it also meant that it was really slow. As you may be aware, F# maps are implemented as Binary Trees, which means that insert and search are O(log n) time operations. The hot path of my application involved a lot of Map lookups, and so I changed the F# maps to .NET dictionaries, which have O(1) lookup time.
Similarly, I changed F#’s sets to HashSets, and changed the F#’s lists (which are linked lists) to .NET arrays, which have much better cache locality. Since I was mostly looping over my lists, having them as arrays made them looping faster.
Changing immutable collections to mutable ones for performance is not surprising in itself, but it’s important to remember that mutablity makes the code harder to reason about.
So, I kept the Maps and Lists around in the parsing section of the code, which my benchmarking showed only took less than half a second  recall that the merge step took almost a full minute!
We’ll get around to optimizing that later on, but as always, profile your code and only make life more complex if you have to.
Here’s a snippet to demonstrate what I mean. The old code is presented first, which uses F#’s sets, maps and lists, and the new code is below.
This does a DepthFirst Search (DFS) to obtain the set of child nodes that are under each parent node. There’s obviously a lot to explain in this code, which I won’t go into, but here’s a summary of the Data Structures changed:
 Adjacency lists (Map from string to Set<string>) got changed to a 2D adjacency matrix of bools
 The cost Map, which used to be a Map from string to string, got changed to a Dictionary from int to int[]. Note that ChildNodes is an alias for int[], I did this so I could change the implementation of ChildNodes without rewriting a lot of code.
1// old code
2type Digraph =
3 {nodes: string list;
4 adjacency: Map<string, Set<string>>;}
5
6// Note: graph is a Digraph in this function,
7// because this function is actually a closure
8let countNumChildren node (costMap: Map<string, Set<string>>) =
9 let mutable costMap = costMap
10 let rec dfs node =
11 match Map.tryFind node costMap with
12  Some (s) > s
13  None >
14 match Map.tryFind node graph.adjacency with
15  None >
16 let s = Set.add node Set.empty
17 costMap < Map.add node s costMap
18 s
19  Some(lst) >
20 let s =
21 lst
22 > List.map dfs
23 > List.reduce (Set.union)
24 > Set.add node
25 costMap < Map.add node s costMap
26 s
27 dfs node > ignore
28 costMap
1// New code  uses Dictionaries and 2D arrays
2type Digraph =
3 {size: int;
4 labels: Map<string, int>;
5 adjacency: bool[,];
6 costMap: Dictionary<int, ChildNodes>}
7
8let costOfModules (graph: Digraph) =
9 let rec dfs (node: int) =
10 let found, value = graph.costMap.TryGetValue node
11 if found then value else
12 let adj = genAdj graph.adjacency node
13 let s = ChildNodes.init node
14 match adj.Length with
15  0 >
16 // Node is a leaf, so add itself to costMap
17 graph.costMap.Add(node, s)
18 s
19  _ >
20 let s =
21 adj
22 > Array.map dfs
23 > Array.append [s]
24 > ChildNodes.build
25 graph.costMap.Add(node, s)
26 s
27 for i in 0..graph.size1 do
28 let found = graph.costMap.ContainsKey i
29 if found then () else dfs i > ignore
30 graph
If you’ve been following closely, then you’ll notice one change which I didn’t mention yet. How come the Maps used to go from string to string, but now go from int to int[]?
That leads me to my second major perf improvement  Not using strings. Strings are basically blobs of memory that have to be yanked out from all over the heap, that the GC has to care about, and they have to be hashed (that takes time!) If you can avoid using strings, then don’t.
In my application, I was using strings as the names of each of the nodes in the graph, and I changed every use of a string to an int. Now, the strings as names for the nodes still had to be kept around, since I needed to print the nodes at the end, so I simply kept a Map called “labels” around that would let me know which strings corresponded to which ints.
Notice that I kept this as an F# map, despite everything I said earlier, since it isn’t involved in the performance critical sections. Remember, mutability means more work for the reader.
With these two perf improvements, I was able to get my total time down from 58s to 5.5s, an 10x improvement!
Efficiency with Algorithms
Now at this point I had spent about three days writing the improved code, and I was feeling really good. All that stuff about exploiting cache efficiency was paying off, and I felt that I could get it under a second by Monday. The next step, though, wasn’t so easy. The bottleneck proved to be calculating the set of child nodes under each parent.
Right now, I was storing each set of child nodes as an int[]. To calculate the set of all nodes under a particular parent node, I would recursively compute the set of all nodes under each of the parent’s children, then use a HashSet to merge them together. Repeated efforts to improve the code proved futile, shaving only milliseconds off.
1module ChildNodes =
2 let build (lst: int[][]): int[] =
3 let d = HashSet()
4 for s in lst do
5 d.UnionWith s
6 d > Seq.toArray
Now what? Well, after some testing, I noticed a big problem. Most edges in the graph are useless. To show you what I mean, let’s take a very simple example of a graph, where A depends on B, which depends on C.
A < B < C
// represented as:
A B C
B C
C
Looking at this graph, to calculate the nodes under C, it suffices to calculate the nodes under B. But under my current scheme, I would calculate the nodes under B and nodes under A, then merge those two together! Obviously, this is unnecessary work, since whatever is under A will also be under B.
Let’s take a more complex example to see this in action:
As you can see, in this graph there are many redundant edges, marked in red. In particular, we have something called transitive redundancy, i.e. if I can get from A to B in two or more steps, then I shouldn’t have an edge from A to B at all!
This problem is closely related to the problem of a transitive closure. In a transitive closure, we take some DAG and add in all the transitive edges. That is, if I can go from A to B and B to C, then I should also be able to go from A to C.
We need to do what’s called a Transitive Reduction, going in the opposite direction.
If you’ve taken a Discrete Mathematics class (spoiler: I used to teach one), you’ll know that there exists an O(n^3) algorithm for computing Transitive Closures, called Warshall’s algorithm. I won’t go too much into depth as the linked PDF explains it well.
Instead, we’re going to use a much lesser known algorithm called Gries’ algorithm[1] to compute the transitive reduction of a graph. It basically runs Warshall’s algorithm in reverse, removing transitive edges from the graph as it finds them. Again, read the paper for more details.
Here is my implementation of Gries' algorithm:
1// Implements the transitve reduction algorithm
2// minor perf improvements have been made
3let reduce (graph: Digraph): Digraph =
4 let adj = graph.adjacency
5 let n = Array2D.length1 adj
6 for k = n1 downto 0 do
7 Parallel.For(0, n,
8 (fun i >
9 if adj.[i, k] then
10 for j in 0..n1 do
11 if adj.[k, j] then
12 adj.[i, j] < false
13 )
14 ) > ignore
15 graph
A few words on this algorithm. When I first implemented it asis from the paper, I found that the total runtime shot up from 5.5s to 12s! That was really disheartening, since I was sure that implementing a better algorithm was the key to speeding this algorithm up.
Nonetheless, with a few performance improvements, I managed to get the program to run at 1.4s, another order of magnitude improvement! Firstly, I implemented the middle loop as a parallel for loop, which helped to speed things up a little[2].
But the main improvement came by avoiding work (again). Notice that I do an if true check on adj.[i,k] before I run the innermost loop. If that is false, we skip a whole inner loop of work. That’s 2000 iterations saved  No wonder the speed up was an order of magnitude!
Compare the contrast to the old version (some omissions):
1let reduceOld (graph: Digraph): Digraph =
2 for k = n1 downto 0 do
3 for i in 0..n1 do
4 for j in 0..n1 do
5 if adj.[i, k] && adj.[k, j] then
6 adj.[i, j] < false
7
8let reduce (graph: Digraph): Digraph =
9 for k = n1 downto 0 do
10 Parallel.For(0, n,
11 (fun i >
12 if adj.[i, k] then
13 for j in 0..n1 do
14 if adj.[k, j] then
15 adj.[i, j] < false
16 )
17 ) > ignore
The proof in the pudding (aka the code)
If you want to see the code, here’s a link to the Github repo: Link to Github Repo
Squeezing the last drop of performance
By and large, I was pretty happy with the performance as is, so I stopped here.
There were some other things that helped increase performance by little bits and pieces here and there, mentioning them for completeness:
 Preallocating capacity for HashSets and Dictionaries helped increase performance by 0.1s.
 For HashSets, I first preallocated the capacity of the sum of all the sizes of the arrays it was going to merge together. However, that causes the hashsets to become too big. With some profiling, I found that doing sumOfSizes / 4 worked best.
Failed ideas
More than knowing what worked, it’s also useful to note what didn’t work:
 I tried to beat String.Split() by writing my custom lex and parse phase but failed. As it turns out, String.Split() isn’t too slow for this application (~1.8MB file).
 Also, on the CLR memory allocations are really cheap, so having 4000 strings in memory was a hassle for the GC, but didn’t affect runtime significantly. Plus, it gets GCed pretty quickly as it’s only used in Graph creation.
 As part of the above, I tried to speed up the graph creation by incrementally building it during the string parsing phase, but I couldn’t get it to beat the simple implementation. At any rate, it turns out that the parsing and graph creation is not the bottleneck.
 Merging nodes is the bottleneck of this algorithm. I tried doing parallel merge and sorted merge using plain sorted arrays, but that didn’t help either. Turns out HashSets are too darn optimized.
Footnotes

An algorithm for transitive reduction of an acyclic graph (1987). Gries, D., Martin, A. J. et al

You may notice that I’m performing parallel reads and writes to a shared mutable data structure, namely the 2D adjacency matrix. While this is normally highly discouraged, in this case it is fine, since the reads and writes are guaranteed to never overlap. That said, although conceptually it makes sense (each thread has a different row), it is not obvious that it holds at the processor level.
To verify this, I dug into the System.Corelib.Private source code to check how they perform writes on a 2D array. The fear would be that if arr.[i, 2000] and arr.[i+1, 0] overlap. This can only happen if the CLR isn’t addressing each boolean by itself, but rather batching boolean reads and writes. Thankfully, the CLR uses regular unsafe pointer arithmetic to get and set booleans. Since most modern processors are byteaddressable, this means that there is no overlap.
Phew! These are the details one has to worry about when performing lockfree reads and writes to shared mutable state.