# Paper Reading: Efficient Path Profiling

Recently I’ve been going through CS 6120 from Cornell (compilers), and one of the papers listed in the course was quite interesting, namely Efficient Path Profiling. Once in a while you see a solution so neat that it almost feels like the problem was created in order to make such a solution useful; this paper gave me that feeling. This blog post will give a high level understanding of the problem, the algorithm and some intuitions, while leaving out all the technical details.

The problem setting is that you have a control flow graph (CFG), where each node is a block of code that always executes together (no branches), and each edge is a branch/jump instruction. With huge loss of generality, we assume there will be an `ENTRY`

node and an `EXIT`

node, and the CFG will be a directed acyclic graph (DAG) always going from `ENTRY`

to `EXIT`

. This is clearly unrealistic for normal programs due to the lack of loops, but the paper provides workarounds that aren’t very interesting. The task is to record the paths taken in each execution (from `ENTRY`

to `EXIT`

), so that we can compute statistics about which paths are the most common and make compiler optimizations accordingly.

In other words, we’re doing something like the following. Say we give each node a unique identifier (e.g. 0, 1, 2 …). Each time the program runs, we maintain a list of these identifiers, appending to it every time we visit a new node. And by the end we can add the resulting list to some sort of aggregating data structure.

But that’s a horribly inefficient way to do it. Both appending to the list in each node and aggregating the resulting lists at the end of each execution are going to be expensive. Here, the authors propose: what if we could instead somehow give integer weights to the edges in the CFG such that each path from `ENTRY`

to `EXIT`

has a unique sum, to replace the list of node identifiers? What if those path sums are small enough numbers that you could just make an array and increment the element at the index equal to the path sum?? What if you can pick the edges that are triggered the most and set those weights to 0, so you don’t even need to do anything in the hot paths???

## Compact Unique Path Sums

It turns out all of those are possible. First, it’s actually easy to weight the edges such that each `ENTRY`

->`EXIT`

path gives a different sum. You could just pick unique powers of 2, which could give you really large weights. But you can actually do much, much better and make the sums “compact”, meaning they form a range between 0 and the number of unique paths - 1, so the path sums are as small as possible. It’s also really simple to do so.

First, we define `NumPaths(v)`

as the number of unique paths from node v to `EXIT`

. This takes linear time to compute. Then, for each node, say we have a list of outgoing edges. For the ith edge, we simply take all edges from 0 to `i - 1`

, find their destinations, add up their `NumPaths`

, and use that as the weight. The intuition behind this is that for each outgoing edge, there are `NumPaths`

way to get to the `EXIT`

, and each path has a unique sum between 0 and `NumPaths - 1`

(by induction). Since the first edge already claimed sums 0 to `NumPaths`

- 1, the second edge has to start counting from `NumPaths`

, and we can achieve that by adding `NumPaths`

to the second edge’s weight.

If the maximum `NumPaths`

of a CFG is small enough, we can just maintain an array of length `NumPaths`

to count the frequency each path is taken across many runs. Otherwise, we can still maintain a hash table, incurring a larger overhead.

## Choosing Weights to Zero Out

So far it’s been very simple. Notice that for each node, one of the outgoing edges has weight 0. Of course, at those edges, we don’t actually need to add an instruction to add the weight to the path sum. So we can actually just pick the most frequent edge and assign it 0, to minimize the overhead we’re adding to the program.

But we can actually have more flexibility than picking one outgoing edge per node to zero out. This part is a bit more involved to understand, and this paper basically says “just look at that other paper”. While that other paper has a proof, it still didn’t quite explain why it works. I think I have an intuition, which I will lay out below, but I’m Not a Computer Scientist, so it might be wrong, etc.

The way it works is that: you start off with some estimations for the relative frequencies of each edge being taken. You add an edge from `EXIT`

to `ENTRY`

with frequency 1 (fires every time the program runs), and compute the spanning tree of the resulting graph with the maximum weight (sum of edge frequencies), ignoring directions. All edges in that spanning tree will have an updated weight 0. Note that this is never worse than zeroing out the most frequent edge of each node as described above, because taking one outgoing edge per node also forms a spanning tree (n - 1 edges in total, all nodes are linked to `EXIT`

).

For any edge not in the spanning tree, we call it a “chord”. Each chord `f`

, when added to the spanning tree, forms a cycle `C(f)`

, also ignoring direction. Now, we have the weight assignments for any edge `e`

, `W(e)`

, from the previous section’s algorithm (for the added edge, `W(EXIT->ENTRY) = 0`

). The new weight of any chord `f`

, `W'(f)`

, is the sum of `W`

over `C(f)`

, but we negate `W`

for edges that are in the opposite direction of `f`

in the cycle. For example, say the chord `A -> B`

has `C(A -> B) = A -> B <- C -> A`

, then `W'(A -> B) = W(A -> B) - W(C -> B) + W(C -> A)`

. The claim is that, for any given path from `ENTRY`

to `EXIT`

, `W`

and `W'`

yield the same path sum.

But why is that? Here’s the handwavy part. The intuition is that every program execution, when appended with the edge `EXIT->ENTRY`

, becomes a loop. A directed program execution loop `D`

must contains chords, since all loops contain edges not in the spanning tree, and `D`

is really just a “sum” of all `C(f)`

for chords `f`

in `D`

. So the sum of `W`

over `D`

is equal to the sum of `W`

over all `C(f)`

for `f`

in `D`

. The sum of `W`

over any `C(f)`

is by definition equal to the sum of `W'`

over `C(f)`

.

Hence:

```
sum W over D
= sum W over (C(f) for chord f in D)
= sum W' over (C(f) for chord f in D)
= sum W' over D
```

Here’s a simple example, not a proof, since I don’t have one.

```
program spanning tree execution loop
ENTRY ENTRY ENTRY
/ \ \ / \
A | A | A |
| \ | | \ | | |
B C B C B |
\ / / \ /
EXIT EXIT EXIT
```

In our execution loop, we have 3 chords, `ENTRY->A`

, `B->EXIT`

, and `EXIT->ENTRY`

.

```
C(ENTRY->A) = ENTRY - A - C - ENTRY
C(B->EXIT) = B - EXIT - C - A - B
C(EXIT->ENTRY) = EXIT - ENTRY - C - EXIT
```

Joining all three together (at `A`

and `EXIT`

), we have:

```
ENTRY - A - B - EXIT - ENTRY - C - EXIT - C - A - C - ENTRY
```

Cancelling out opposite edges, this simplifies to:

```
ENTRY - A - B - EXIT - ENTRY - C - EXIT - C - A - C - ENTRY
ENTRY - A - B - EXIT - ENTRY - C - ENTRY
ENTRY - A - B - EXIT - ENTRY
```

Which is exactly the execution loop.

With these two steps - compute weights, optimize the locations of the zero weight edges - we can insert instructions at the edges in the given program to efficiently compute the unique sum of the path taken each time a program finishes executing. In retrospect, the algorithm to assign weights to give compact path sums seems almost obvious, but that’s more a sign that we’ve asked the right question than that the problem is really trivial.