There are certain tasks that I want to do, but there aren’t clear deadlines, for example working on a hobby project, reading a book, or learning how to play a new song. I might write those down in my notes, thinking to myself: I’ll get to them when I get some free time.

But when I finally have some free time, I start to find excuses. I’m a bit tired right now. Maybe in an hour. Why not tomorrow?

The thing is, if at time t, I have the chance to do a thing, but I delay that to t+1, then by induction I am never going to do it. In other words, if I’m ever going to do it, it might as well be now.

It is always now or never.

In this post I’ll share my favorite Sudoku trick, called the Unique Rectangle. It is really clever and comes up in actual puzzles. I’ve been solving the New York Times hard Sudoku every night for years, and I’ve used this trick on maybe 10-20% of them.

Imagine you’re solving a Sudoku and you’ve penciled in:

```
1 2 3
+===+===+===+
A |12 |12 | ? |
+---+---+---+
B | ? | ? | ? |
+---+---+---+
C | ? | ? | ? |
+===+===+===+
D |12 |123| ? |
+---+---+---+
```

You’ve eliminated all possibilities except `1` and `2` for grids `A1`, `A2` and `D1`, and you know `D2` can only be `1`, `2` or `3`.

The Unique Rectangle rule says that `D2` must be `3`.

“Why?” you object. “There are only 3 rules in Sudoku - all 9 3x3 blocks, 9 rows and 9 columns must each contain the numbers `1-9`. If `D2` is `1` or `2`, we can still fill in `A1`, `A2` and `D1` without violating the rules!”

You are right that the normal constraints of Sudoku aren’t enough. This trick relies on a leap of faith: we assume that *the puzzle has a unique solution*. This is true for all valid Sudokus – if a puzzle has more than one solution, it cannot be solved by logic alone, and is therefore invalid. (Using the fact that the puzzle is solvable to solve the puzzle might feel like cheating, but I have no personal issues with it.)

In the above example, if `D2` isn’t `3`, we have 2 possibilities:

```
(I) 1 2 3 (II) 1 2 3
+===+===+===+ +===+===+===+
A | 1 | 2 | ? | A | 2 | 1 | ? |
+---+---+---+ +---+---+---+
... ...
+===+===+===+ +===+===+===+
D | 2 | 1 | ? | D | 1 | 2 | ? |
+---+---+---+ +---+---+---+
```

The only constraints that can be used to solve these 4 grids are columns `1` and `2`, rows `A` and `D`, and the 3x3 blocks, but in all these constraints, `(I)` and `(II)` are indistinguishable, and we are left with an unsolvable puzzle. So `D2` has to be `3`.

More generally, the Unique Rectangle rule states that if you have 4 grids in 2 3x3 blocks forming a rectangle, and 3 of them have the same 2 candidates, then the fourth grid cannot be either of those candidates.
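As a sketch, the rule can be mechanized over a candidate map (the representation below is mine, not from any particular solver):

```python
def unique_rectangle_elim(cands, corners):
    # cands: cell name -> set of candidate digits.
    # corners: 4 cells forming a rectangle that spans exactly two
    # 3x3 blocks (this precondition is assumed, not checked here).
    for fourth in corners:
        others = [c for c in corners if c != fourth]
        pair = cands[others[0]]
        if all(len(cands[c]) == 2 and cands[c] == pair for c in others):
            # If the fourth cell took either digit of the pair, the
            # rectangle could be swapped, giving two valid solutions.
            reduced = cands[fourth] - pair
            if reduced != cands[fourth]:
                cands[fourth] = reduced
                return fourth
    return None

# The example from above: three corners are {1,2}, so D2 must be 3.
cands = {"A1": {1, 2}, "A2": {1, 2}, "D1": {1, 2}, "D2": {1, 2, 3}}
print(unique_rectangle_elim(cands, ["A1", "A2", "D1", "D2"]))  # D2
print(cands["D2"])  # {3}
```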

There are some extensions to this trick. One is chaining which I have also used:

```
1 2 3
+===+===+===+
A |12 |12 | ? |
+---+---+---+
...
+===+===+===+
D |23 |23 | ? |
+---+---+---+
...
+===+===+===+
G |13 |134| ? | G2 cannot be 1 or 3
+---+---+---+
```

Another one that I recently figured out during solves:

```
1 2 3
+===+===+===+
A |12 |12 | ? |
+---+---+---+
B | ? | ? | ? |
+---+---+---+
C | ? | ? | ? |
+===+===+===+
D |123|123|34 |
+---+---+---+
Either D1 or D2 must be 3, so we know D3 must be 4.
```

Here is a website with other advanced Sudoku techniques. I have not found the other tricks to be as interesting or as applicable to NYT puzzles.

This sort of *“if there is a solution, it must be this”* thinking comes up in other occasions as well. Here are a few that I can immediately think of.

In minesweeper, it is very common to run into unsolvable boards, but the same logic can still apply. In this example:

```
1 2 3 4
+===+===+===+===+ ...
A | | | 1 | 0 |
+---+---+---+---+
B | | | 2 | 1 |
+---+---+---+---+
C | 1 | 2 | ⚑ | 1 |
+---+---+---+---+
D | 0 | 1 | 1 | 1 |
+---+---+---+---+
...
```

There are three possibilities:

```
(I) (II) (III)
+===+===+ ... +===+===+ ... +===+===+ ...
| X | | | | X | | | |
+---+---+ +---+---+ +---+---+
| | X | | X | | | | X |
+---+---+ +---+---+ +---+---+
... ... ...
```

When you finish the rest of the board, you can see how many mines are left. If it’s 1, you know it must be `(III)`; but you can’t tell between `(I)` and `(II)`. Either way, clicking on `A2` or `B1` is never wrong, so you can do it without solving the rest of the board. (In minesweeper, if you must guess, it is better to guess earlier so that you don’t waste time in an unsolvable game.)

Another example is Alice at the Convention of Logicians. Everything on this wiki page is worth a read, so I won’t bother explaining the puzzle here.

I’ve also found that this line of thinking is useful in solving math puzzles in general, or even problems that aren’t necessarily designed to be solvable. It is like a less blasphemous form of Pascal’s wager – if you’re right, then great, otherwise it doesn’t matter.

There are three steps in this game. First, both players play rock paper scissors with one hand. Second, both players play rock paper scissors with the other hand. Third, both players take back one hand, and the winner is determined by comparing the remaining hands using normal rock paper scissors rules.

Let’s run through a quick example. Say players A/B play ✊/🖐, then 🖐/✌. Now since A can only pick between ✊ and 🖐, if B keeps 🖐 and retracts ✌, B will never lose. If A anticipates B to play 🖐 and also plays 🖐, then they will tie.

But if B anticipates that, B can sometimes pick ✌ which beats 🖐. But if A anticipates that, maybe A will also sometimes play ✊… In true rock paper scissors spirit, there always seems to be a better strategy.

But of course, like all finite zero-sum games, this one has an optimal strategy once we allow strategies to incorporate randomness. The optimal play is:

- Pick anything for the first hand with equal probability;
- follow your opponent’s first hand for your second hand with 2/3 probability, unless that’s the same as your first hand, in which case pick the one that beats it;
- keep the hand that both players have in common with 2/3 probability, unless both players have the same two choices, in which case pick the one that doesn’t lose.

Below, let’s go through an outline of the math involved. First, we have to define what both players are maximizing.

It might not be immediately obvious that defining the goal of the game is nontrivial - of course you want to win instead of lose! But this is not always the case: for example, some people might care more about avoiding a loss than about winning.

Rock paper scissors is commonly played as a way to fairly decide a binary outcome between two people (e.g. who gets to pick the restaurant). It is reasonable to apply the same to this game, meaning players will repeat the game until a winner is determined, allowing no ties. Playing optimally then means maximizing the probability of winning in a repeated game.

Since the game is fair, in the event of a tie, your chance of winning is still 1/2. So, we’re maximizing `P(win) + P(tie)*1/2`, which is the same as minimizing `P(lose) + P(tie)*1/2`. Equivalently, we can put those together and maximize `P(win) - P(lose)` for each game. In other words, we can pretend the loser pays $x to the winner, and maximize the expected value of winnings.

First we can analyze step 3 - both players have to pick one out of two given choices.

Let’s run through the boring cases quickly. Boring case 1: either player has the same choice for both hands (dumb). Boring case 2: both players have the same two choices (the optimal pick is obvious).

Now to analyze the remaining case where both players have different choices. Let’s say A has ✊, 🖐 and B has 🖐, ✌ to pick from. Let’s say the winner of the game wins $3, the loser loses $3. We have four outcomes after both players take back one hand:

A \ B | 🖐 | ✌ |
---|---|---|
✊ | -3 \ 3 | 3 \ -3 |
🖐 | 0 \ 0 | -3 \ 3 |

Since this is a zero sum game, to maximize your winning, you want to make your opponent’s best option as bad as possible. This happens when both of their options are equally bad.

When A (B) plays 🖐 with 2/3 probability, B (A) gets $1 (-1) in expectation no matter which hand they take back. Hence, the optimal strategy for step 3 is to pick the option that can lead to a tie 2/3 of the time (🖐 in this case).
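The claim can be double-checked with a quick calculation, using the payoffs from the table above (the dictionary encoding is mine):

```python
from fractions import Fraction

# Payoff to A, read off the table above. A keeps ✊ or 🖐; B keeps 🖐 or ✌.
payoff = {("✊", "🖐"): -3, ("✊", "✌"): 3,
          ("🖐", "🖐"): 0, ("🖐", "✌"): -3}

p = Fraction(2, 3)  # probability A keeps 🖐, the hand that can tie

# A's expected payoff against each of B's pure options.
vs_paper = (1 - p) * payoff[("✊", "🖐")] + p * payoff[("🖐", "🖐")]
vs_scissors = (1 - p) * payoff[("✊", "✌")] + p * payoff[("🖐", "✌")]

# Both equal -1: whatever B keeps, B wins exactly $1 in expectation.
print(vs_paper, vs_scissors)  # -1 -1
```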

Now that we’ve established step 3’s strategy, let’s do the same to step 2. Say A picked ✊ and B picked 🖐, and both have to pick their second hand. Since we already worked out the expected value for all possibilities at step 3, we can again tabulate the expected value of all scenarios for A and B here.

A \ B | 🖐 , ✊ | 🖐 , ✌ |
---|---|---|
✊ , ✌ | -1 \ 1 | 1 \ -1 |
✊ , 🖐 | 0 \ 0 | -1 \ 1 |

If you’ve been paying attention, you’ll notice that this is just the table from step 3 with winnings scaled down to 1/3. This means that at step 2, we’re playing the exact same game! So the optimal strategy must also be the same - we play the option that leads to a tie with probability 2/3.

As a bonus fact, we can calculate the probability of a tie. If both players end up with the same two choices after step 2, the game must end in a tie; otherwise there’s still a 2/3 * 2/3 chance of a tie. This yields 1/3 + 2/3 * (4/9 + 5/9 * 4/9) = 193/243 ≈ 79.4% chance of a tie.
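The arithmetic can be verified with exact fractions:

```python
from fractions import Fraction as F

# 1/3: both players hold the same two shapes after step 2 (forced tie);
# otherwise the tie chances follow the nested terms quoted above.
p_tie = F(1, 3) + F(2, 3) * (F(4, 9) + F(5, 9) * F(4, 9))
print(p_tie)  # 193/243, i.e. about 79.4%
```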

This concludes the analysis of the game. In Cantonese, both players would say something like “Xxx, Xxx, xxxxXxx” (inscrutable Chinese), where the uppercase letters indicate when the steps happen. I wonder how exactly this gameplay could be translated into English.

We have a system that basically subscribes to a whole bunch of data, does a bunch of computations on it, and publishes output in real time. The computations are split into tasks identifiable by unique names. For both CPU & memory reasons, the system spawns up to a few dozen workers (Linux processes) across a bunch of computers, and assigns each task to one process by the hash of its name, modulo the number of workers.

For redundancy and latency, we run a few replicas of the whole thing across multiple data centers, all doing roughly the same computations.

One day, during a routine system upgrade, all replicas crashed one after another. This got us into panic mode.

To be clear, this is bad - it is exactly what we tried to prevent by running replicas and staggering their upgrade schedules.

To find out the root cause, the first thing as always is to inspect the logs. There was a single error message saying which worker crashed first, and the exception that crashed it. The exception suggested that one of the values that came from a data subscription was too large and caused a buffer overflow.

OK, that’s something, but we have hundreds of thousands of data subscriptions, so we need more cleverness to narrow it down so we can find the problematic data.

Well, we know from the logs that the worker number is X out of N total workers. If we get a list of all data subscriptions and their task names (a data subscription is also considered a task), then we can compute the hash of each name and see which ones run on worker X. This narrows it down by a factor of N, which is on the order of 50 - not nearly enough.

But as it happens, for whatever reason, some replicas run with a different number of workers. That means we can gather a bunch of (X_i, N_i) pairs, and narrow down the set of suspected tasks further.

If you have a few equations of the form `A mod N_i = X_i`, you can merge them together into `A mod N* = X*`, where N* is the LCM of all N_i, using the Chinese Remainder Theorem. The larger we can make N*, the fewer tasks will satisfy the equation, and the better chance we have to pin down exactly the one task we’re looking for.
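The merging step fits in a few lines of Python; the worker counts and remainders below are made up for illustration:

```python
from math import gcd

def merge_congruence(a1, n1, a2, n2):
    # Merge A % n1 == a1 and A % n2 == a2 into one congruence
    # A % lcm(n1, n2) == a. This is the generalized Chinese Remainder
    # Theorem: the moduli need not be coprime, but the larger the
    # LCM, the more the suspect set shrinks.
    g = gcd(n1, n2)
    if (a2 - a1) % g != 0:
        raise ValueError("congruences are inconsistent")
    lcm = n1 * n2 // g
    # pow(x, -1, m) computes a modular inverse (Python 3.8+).
    t = ((a2 - a1) // g * pow(n1 // g, -1, n2 // g)) % (n2 // g)
    return (a1 + n1 * t) % lcm, lcm

# e.g. a task hashing to worker 17 of 48 in one replica and to
# worker 41 of 50 in another satisfies one congruence mod 1200:
print(merge_congruence(17, 48, 41, 50))  # (641, 1200)
```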

So we gathered 4 pairs of X and N, and ended up shrinking the number of suspected tasks by a factor of a few thousand, leaving us with only a few dozen options. Poring over the task names one by one, we finally found the one subscription that caused the crashes.

The bug itself is fairly simple, but the mechanism in which it crashed all replicas is a bit subtle.

There was a recent code change to the behavior when the system receives erroneous values from data subscriptions. In the past, when a worker got an error, it just passed the error value on to downstream computations. The change appended metadata to these errors to help track down where they came from.

This change seems innocent enough, but the issue manifests when the system is configured to subscribe to data published by itself. Let’s say such a self loop exists in a task. When this task first computes, it subscribes to data that hasn’t been published yet, which results in an error. In the old code, this error is fed back into the task, but the output doesn’t change. In the new code, however, each round of feedback produces a bigger error value due to the additional metadata, eventually overflowing buffers and crashing the process, along with clients consuming the value.

This also explains why all replicas crashed even though the new executable was rolled out to only a subset of them. When subscribing to data, you have to specify some sort of URL, and this URL points to one of the replicas. In other words, only one replica has the self loop, and the other replicas are consumers of the loop’s output. So when the loop was completed during the rollout, all replicas crashed upon receiving the large value, regardless of whether they contained the code change.

This incident didn’t end up causing too much headache, because it was fixable with a rollback. Still, it’s a good exercise to think it through and extract the maximum number of lessons from it.

Pinning down the issue this time required some amount of luck. In particular, we had replicas running with different numbers of workers. This did not have to be the case, and it even seems undesirable to have differing environments. One change we could make here is to the scheme for hashing task names to workers: we could hash the task name and the replica ID together (i.e. use the ID to salt the hash). This way, we don’t have to rely on the numbers of workers being coprime with each other to narrow down tasks using worker IDs.
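A sketch of what salted assignment could look like (the hash function, task names and replica IDs here are my assumptions, not the real system’s):

```python
import hashlib

def worker_for_task(task_name, replica_id, n_workers):
    # Salt the hash with the replica ID so each replica shards tasks
    # differently, making each replica's crash log narrow the suspect
    # set independently of the others.
    key = f"{replica_id}:{task_name}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_workers

# The same task typically lands on different workers in different replicas.
print(worker_for_task("my/data/subscription", "replica-a", 48))
print(worker_for_task("my/data/subscription", "replica-b", 48))
```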

This incident is not the first time that snowballing error values have caused hiccups. I’ve also heard stories where parsing and appending to error values led to accidentally quadratic time complexity. Perhaps we should be a bit careful when dealing with error values, because they can sometimes be unexpectedly large. (I’m not sure how much this makes sense in various programming languages; some languages might not have the concept of an error value object that can be manipulated at runtime.)

Another thing that is less clear: perhaps we could just outright ban self loops, as these are probably just footguns. But this might or might not be reasonable depending on the actual situation.

One might also be tempted to think that instead of relying on clever filtering based on hashes, we should just improve the error message in the logs to show exactly what caused the crash. I think practically this would not have helped in this case. Sometimes you just don’t know where the system could crash - if we had anticipated it, we would’ve fixed it. Wrapping every single part of code with error tagging just seems excessive and infeasible.

In the end, I don’t think anyone made a mistake in the process, and there was not much we could’ve done to avoid it. Testing couldn’t have caught it, because the loop would only exist in production - all testing systems also subscribe to the URL that points to production. The bug was hard to anticipate in code review, and careful deployment wouldn’t have prevented it.

Sometimes, incidents are just a cost of business, even if they happen in production.

```
let sign__old f =
assert (Float.is_finite f);
if String.is_prefix (Float.to_string f) ~prefix:"-"
then `Negative
else `Nonnegative
```

Naturally, I replaced it with the following:

```
let sign__new f =
assert (Float.is_finite f);
if Float.is_negative f then `Negative else `Nonnegative
```

A few days later, this caused an issue in prod. How is it possible? These two functions are obviously equivalent, right?

It turns out there is exactly one edge case for which these two functions behave differently: -0., aka negative float zero.

In the IEEE floating point standard, numbers are represented as sign and magnitude. This means it is technically possible to have both a positive and a negative zero. While these two values are numerically equal, both are treated as valid floats, and they behave differently when passed into different functions.
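The post’s code is OCaml, but the behavior is the same in any IEEE-754 language; a quick Python illustration:

```python
import math

neg_zero = -0.0
print(neg_zero == 0.0)               # True: numerically equal
print(math.copysign(1.0, neg_zero))  # -1.0: the sign bit is set
print(str(neg_zero))                 # -0.0: printing shows the sign
print(1.0 / float("-inf"))           # -0.0: one way a -0 can arise
```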

In this case, `sign__new` sees negative zero as ``Nonnegative`, because it is not numerically smaller than zero, despite having a negative sign. On the other hand, `Float.to_string (-0.)` produces `"-0."`, so `sign__old` thinks it is ``Negative`.

I think it’s likely uncommon for the existence of negative zero to lead to bugs, because programs typically treat the two values identically. In my case, the code constructs an AST that represents the float value. The old code produces a unary negation applied to a positive zero immediate, while the new code produces a negative zero immediate. This change caused an exception in prod because the code generation and parsing process no longer round trips.

The first time I learned about negative zero, I thought it was a misfeature. But it actually makes sense, considering infinities are also signed. With all four values representable, we can have `1/-inf = -0`, `1/-0 = -inf` and so on, which is nice.

With the bug identified, the fix is easy: use `Float.ieee_negative` instead of `Float.is_negative`.

If you enjoyed this post, try another: Precisely Compare Ints and Floats

Overall, it was easy if you know how to build it, but knowing how was not easy.

The project is to build a web game. The gameplay is like regular minesweeper, except that there are 2 players mining the same map, and the goal is to race to flag 50 mines out of 99 instead of clearing the whole map. When a player clicks on a mine, it is a free point for the opponent; and when a player flags a grid that isn’t a mine, they get frozen for a few seconds. When a flag is correctly placed, it shows up on both clients, but cleared grids are only visible locally.

Project objectives in descending order of importance:

- Finish the project
- Do it right

Ultimately, the deliverables are a front end web app and a websocket back end to handle the real time messages between front end clients. For both front and back end, there are many ways to build them.

After some research, I settled on vue.js (Vue 3) for the client, and NestJS for the server. None of the decisions here, or below, are obvious by any means. I picked the stack based on the following criteria.

- Popularity. Solutions to common problems can be googled, and tooling is probably better.
- Nice APIs. Static typing, concise syntax, modularity, etc.
- Easy (for me) to learn. This is important for actually finishing the project.

Some metrics that I did not care about at all: performance, security and bloat.

Here are some of the other choices I considered:

- Native apps instead of web. This is a no go because few people want to install apps these days, especially on the desktop, and performance is extremely unimportant given how simple the game is.
- OCaml client+server, using bonsai (incr_dom, js_of_ocaml) for front end. I am most proficient in OCaml and I like the language. But frankly, the popularity checkbox is extremely unchecked, and the syntax isn’t even nice. Unless you already work in an OCaml repo, using OCaml for web clients is pretty hard to justify.
- React client. I mean, you can’t go wrong with react, since it’s the most popular choice. But I find Vue simpler to use and more opinionated. In this case, Vue having a designated way to do things is a plus, since I’m here to build a game, not to form opinions about how to build a game.
- Other client frameworks like angular, svelte, solid, etc. I didn’t seriously consider these, since they seem to be less popular than React and Vue.
- Other languages for the server, e.g. python, go. For simplicity, I think using the same language for both the client and the server is good, and in this case that language is TypeScript. So I didn’t seriously consider these.
- Not using a back end framework. NestJS is designed to have opinions on how to do things, which again is good when I don’t have opinions.

Here is a grab bag of things that I used.

- VS Code
- socket.io for websocket communication in both client and server
- Tailwind css
- TypeScript for both client and server
- Font awesome for icons
- Google font
- Vuex (to persist player name)

Things that weren’t obvious to me at first:

- For servers to be able to push updates to clients, either you need a websocket connection between the two, or you use server-sent events. I didn’t look closely at the latter, as I believe it is much less used.
- There are actually 2 web servers: one to serve the front end through a GET request, and one to serve the websocket back end. It is possible to serve both from the same server, but as far as I can tell people don’t do that.
- When you click on a “link” that changes the url, e.g. from https://onemistakes.click/foo to https://onemistakes.click/bar, it is not actually going to a new website. Instead, it just runs whatever javascript you want it to run to show another view, load resources, etc. This url change can also be programmatically triggered. This is how single page apps, or SPAs, work.

The server handles:

- Creating/joining game rooms
- Generating new game map
- Message relaying between clients
- Tracking which client flagged/bombed a grid first

And the client does the rest.

- The client tracks the state of each grid, which can be one of the following: unclicked, clicked, flagged by local player before server ack, flagged by either player after server ack, bombed by either player.
- Flags before server ack are displayed as flagged by the local user, but ignored in score counting. This ensures the user sees no lag and there is no race for deciding the winner.
- The client requests a new game automatically a few seconds after the game ends.

This design is similar to rollback netcode, although greatly simplified because the game is very simple. To explain what that is, we first have to explore a few alternatives.

A multiplayer game is like a distributed state machine. You have a state, and all players provide inputs to change it. There are a few designs that achieve this.

1. Server as the only master: clients send user inputs to the server; the server sends back the current state for display.
2. Designated client as the only master: the server only relays messages between the master client and other clients. Only the master client can change the state and disseminate it to the others.
3. All clients run the state machine, and proceed only when all inputs from all clients arrive. (This is called deterministic lockstep.)
4. All clients run the state machine, and guess the inputs from other clients before they arrive. When they do arrive, replay the state machine to correct wrong guesses. (This is called rollback netcode.)

1 is perhaps the simplest. But it suffers from two issues: players experience a lag between making an action and getting a response due to waiting for the server, and the server, which is a shared resource, has to do more work. 2 and 3 also have the issue of lag.

Even though there is no explicit state machine, the design explained above is like rollback netcode, because by displaying local flags immediately, we are basically guessing that the opponent did nothing unless we later learn otherwise. Since the game is simple, using the above state tracking logic greatly simplifies the code, while achieving the same behavior as rollback netcode.

After a few weekends, I got the game working on localhost. Excluding boilerplate, the client took around 800 lines and the server under 200. But deployment is another inevitable battle.

I ended up pushing both the client and server code to Github private repos. I linked the server repo to AWS CodePipeline and Elastic Beanstalk, linked the client repo to AWS Amplify and got a domain name from Route 53 to sign SSL certs. Now I just have to git push my code and everything works automatically, which is nice.

For my previous hosting experience, I’ve been renting an EC2 instance and doing things like ssh, scp, cron jobs, Let’s Encrypt and so on. This time, once I figured out what AWS services I needed, setting them up was much less painful than what I did in the past manually. Again, easy once you know how, because there are so many AWS products and just knowing which ones are relevant is already a bunch of work.

There were a few more issues I had to deal with at this stage.

- Client needs to talk to different servers depending on prod/dev mode. This was solved using environment variables.
- AWS instance ran out of memory building the docker image for the server, because building takes more memory than running it. This was solved by adding a prebuild hook to add swap space to the instance.
- Client couldn’t talk to server unless the server has https. This was solved by buying a domain name from AWS and signing a cert for the load balancer of the back end. For this reason I have to use a load balancer even though there can only be one server (otherwise clients may not see each other).

I did not look into AWS competitors, as I was already an AWS customer, and the popular sentiment seems to be that AWS is still the leader in cloud services. Self hosting was a nonstarter, because there is no concern about data integrity (the server persists nothing on disk), and I only care about building the game, not everything else that runs it.

When I set up the repo and deployment flow, I made sure that I would not have to do this again for the next game. When I build the next little thing, I should be able to just write some more code, test it locally, push to Github, and then I’m done.

Now I just need an idea…

To summarize the article in an extremely lossy fashion, the problem is that we want to take an audio clip, and find similar clips in a database. To do so, we can convert each clip into low dimensional “fingerprints”, and utilize nearest neighbor algorithms to find clips with the most similar fingerprints. Given any clip, one can cut it into small segments, run FFT to get spectrograms, then binarize the resulting images to get bit vectors. But even with, say, 128 x 32 = 4096 bits per fingerprint, it’s still too many to run nearest neighbors. How can we further reduce dimensions to facilitate faster matching?

It is important to note that in this context, the 0s and the 1s in the bit vectors aren’t symmetric. Instead, it is far more useful to think of the bit vectors as sets of 1s. Imagine if clip A has a C note and clip B has a D note. You would think they’re completely different because they don’t share any common notes; it would be silly to say they’re mostly the same because they have a lot of common missing notes.

If you think about it, finding similar sets is a general problem: you have a universe of unique items (pixels in the spectrogram), you have many sparse sets of them (spectrograms), and you wish to reduce each set’s dimensionality while preserving pairwise similarity. It’s just like comparing people’s book lists, movie lists, or interests.

First, we need to define a metric of similarity between two sets. It seems like a natural definition would be: the size of the intersection divided by the size of the union. If both sets are equal, you get 1; if both sets share no common elements, you get 0. It turns out this is called the Jaccard Index.
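In code the definition is nearly a one-liner; a minimal sketch:

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; treat two empty sets as fully similar.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
print(jaccard({1, 2}, {3, 4}))        # 0.0
```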

Here’s the interesting part: say you have two n-bit vectors. If you randomly permute both vectors in the same way, and then find the index of the first 1 bit (this is called the MinHash), then the probability that the two indices are equal is exactly equal to their Jaccard Index. Let’s go through a quick example before explaining why this works.

```
A = 0101 0001
B = 0011 0001
Random permutation 1:
abcdefgh becomes -> daefbcgh (1st position -> 2nd position, 2nd -> 5th, etc)
A becomes -> 1000 1001, first 1: 1
B becomes -> 1000 0101, first 1: 1, equal to A's
Random permutation 2:
abcdefgh becomes -> bcadfehg
A becomes -> 1001 0010, first 1: 1
B becomes -> 0101 0010, first 1: 2, different from A's
```

In the above example, the Jaccard Index is 2/4 (intersection / union), which means we should expect the test to return “equal” 50% of the time. So, why is this true?

First, note that matching zeros in both vectors do not affect the result of the test: wherever they end up after the permutation, they are not the first 1 of either vector, and they do not change the relative order of the remaining positions. Therefore, we can safely drop them from the inputs without changing the result.

```
A drops matching 0s -> 1011
B drops matching 0s -> 0111
```

After dropping matching zeros, we are left with matching 1s or differing bits. It is now easy to see that the first 1’s index will be equal iff an index with matching 1s is selected as the first digit in the random permutation, hence the probability will be equal to the Jaccard Index.

In a universe of n things, any n-permutation will yield a MinHash function. Now, we just have to precompute a bunch of these functions, and apply them to each bit vector to get the fingerprints. Given a pair of fingerprints, we just have to count the number of matching MinHash outputs to get an estimate of the similarity. One cool observation is that more hashes only gives you more accurate similarity estimates, which means even when the universe becomes larger, or more sets are added, you still don’t really have to linearly scale up the total fingerprint size.
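Putting it together, here is a sketch in Python using the 8-bit example vectors from earlier; with enough random permutations, the match rate of the MinHash outputs approaches the Jaccard index:

```python
import random

def minhash_signature(bits, perms):
    # For each permutation, record the new index of the first 1 bit.
    return [min(perm[i] for i, b in enumerate(bits) if b) for perm in perms]

rng = random.Random(0)
n = 8
perms = [rng.sample(range(n), n) for _ in range(2000)]

A = [0, 1, 0, 1, 0, 0, 0, 1]  # the example vectors from above
B = [0, 0, 1, 1, 0, 0, 0, 1]
sig_a = minhash_signature(A, perms)
sig_b = minhash_signature(B, perms)

# The fraction of matching MinHash outputs estimates the similarity.
est = sum(x == y for x, y in zip(sig_a, sig_b)) / len(perms)
print(est)  # close to the true Jaccard index, 2/4 = 0.5
```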

Given the fingerprints, we still have to figure out how to search efficiently, but that’s another complicated subject for another day.

Say we have a 64 bit integer `i` and a 64 bit floating point `f`, how can we tell which one is larger?

Well, duh! In a language with implicit type casting, this is hardly even a question. Just do `i < f` and it magically works, right?

This is almost correct, but the issue is that by turning the int into a float, we are dropping precision for large integers. This is because only 52 bits of the float are used for representing the “mantissa”, i.e. the binary digits after the “1.”; in other words, we can only keep 53 binary significant figures in a float. So if you have the int `2**53 + 1`, the closest float will be `2**53`, and naively your code will think the values are equal, while one is actually numerically larger than the other. How hard is it to do this comparison exactly?

Let’s say we’re writing a compare function that returns a positive number when the int is larger, a negative number if the float is larger, and 0 if they’re equal. And for simplicity let’s assume we already have a compare function for ints and floats respectively. Here’s a seemingly clever way to do it.

```
function compare_int_float(int i, float f)
f_cmp = compare_float(int_to_float(i), f)
if f_cmp != 0
return f_cmp
else
return compare_int(i, float_to_int(f))
```

There are a few observations here. I’ll just state them for now, and will explain them in comments in the final version of the pseudocode.

- If the float comparison doesn’t say the numbers are equal, then we can trust the result.
- If they are “float equal”, then
`f`

must be numerically an integer. Therefore, we can compare them as if they were ints.

But actually this code has a bug. Can you spot it?

The bug is that for certain inputs, this function can raise, specifically through calling `float_to_int(f)`

on an out of bounds `f`

. This happens when `f = 2**63`

and `i`

rounds to `f`

when converted to a float. Below is the final pseudocode:

```
function compare_int_float(int i, float f)
  f_cmp = compare_float(int_to_float(i), f)
  if f_cmp != 0
    (* If i rounds to a float less/greater than f,
       i must be less/greater than f, because otherwise
       f would be a float that is closer to i,
       and i should have rounded to f. *)
    return f_cmp
  else if f = 2**63 then
    (* Large integers round up to 2**63, which is larger than
       max int, 2**63-1. We need to handle this case, otherwise
       float_to_int will raise. *)
    return -1
  else
    (* When i is converted to a float, its significant digits
       can be dropped. Regardless, it will still be an integer,
       so f (which is equal to i rounded) is also an integer.
       Therefore we can turn f into an int and compare exactly. *)
    return compare_int(i, float_to_int(f))
```
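As a sanity check, here is a rough Python translation of the pseudocode for a signed 64-bit `i`. This is purely illustrative (Python’s own ints are unbounded and its mixed comparisons are already exact), and it ignores NaN, as does the pseudocode:

```python
def compare_int_float(i: int, f: float) -> int:
    # i is assumed to fit in a signed 64-bit integer.
    fi = float(i)  # rounds i to the nearest double
    if fi < f:
        return -1
    if fi > f:
        return 1
    if f >= 2.0**63:
        # i rounded up to 2**63, which exceeds the max int 2**63 - 1,
        # so converting f back to an int would overflow.
        return -1
    # f equals i rounded, hence f is numerically an integer:
    # convert it back and compare exactly in integer arithmetic.
    return (i > int(f)) - (i < int(f))
```

For example, `compare_int_float(2**53 + 1, float(2**53))` returns `1`, even though the naive float comparison calls the two equal.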

The problem setting is that you have a control flow graph (CFG), where each node is a block of code that always executes together (no branches), and each edge is a branch/jump instruction. With huge loss of generality, we assume there will be an `ENTRY` node and an `EXIT` node, and the CFG will be a directed acyclic graph (DAG) always going from `ENTRY` to `EXIT`. This is clearly unrealistic for normal programs due to the lack of loops, but the paper provides workarounds that aren’t very interesting. The task is to record the paths taken in each execution (from `ENTRY` to `EXIT`), so that we can compute statistics about which paths are the most common and make compiler optimizations accordingly.

In other words, we’re doing something like the following. Say we give each node a unique identifier (e.g. 0, 1, 2 …). Each time the program runs, we maintain a list of these identifiers, appending to it every time we visit a new node. And by the end we can add the resulting list to some sort of aggregating data structure.

But that’s a horribly inefficient way to do it. Both appending to the list in each node and aggregating the resulting lists at the end of each execution are going to be expensive. Here, the authors propose: what if we could instead somehow give integer weights to the edges in the CFG such that each path from `ENTRY` to `EXIT` has a unique sum, to replace the list of node identifiers? What if those path sums are small enough numbers that you could just make an array and increment the element at the index equal to the path sum?? What if you can pick the edges that are triggered the most and set those weights to 0, so you don’t even need to do anything in the hot paths???

It turns out all of those are possible. First, it’s actually easy to weight the edges such that each `ENTRY`->`EXIT` path gives a different sum. You could just pick unique powers of 2, which could give you really large weights. But you can actually do much, much better and make the sums “compact”, meaning they form a range between 0 and the number of unique paths - 1, so the path sums are as small as possible. It’s also really simple to do so.

First, we define `NumPaths(v)` as the number of unique paths from node v to `EXIT`. This takes linear time to compute. Then, for each node, say we have a list of outgoing edges. For the ith edge, we simply take all edges from 0 to `i - 1`, find their destinations, add up their `NumPaths`, and use that as the weight. The intuition behind this is that for each outgoing edge, there are `NumPaths` ways to get to the `EXIT` through it, and each path has a unique sum between 0 and `NumPaths - 1` (by induction). Since the first edge already claimed sums 0 to `NumPaths - 1`, the second edge has to start counting from `NumPaths`, and we can achieve that by adding `NumPaths` to the second edge’s weight.

If the maximum `NumPaths` of a CFG is small enough, we can just maintain an array of length `NumPaths` to count the frequency each path is taken across many runs. Otherwise, we can still maintain a hash table, incurring a larger overhead.

So far it’s been very simple. Notice that for each node, one of the outgoing edges has weight 0. Of course, at those edges, we don’t actually need to add an instruction to add the weight to the path sum. So we can actually just pick the most frequent edge and assign it 0, to minimize the overhead we’re adding to the program.

But we can actually have more flexibility than picking one outgoing edge per node to zero out. This part is a bit more involved to understand, and this paper basically says “just look at that other paper”. While that other paper has a proof, it still didn’t quite explain why it works. I think I have an intuition, which I will lay out below, but I’m Not a Computer Scientist, so it might be wrong, etc.

The way it works is: you start off with some estimations for the relative frequencies of each edge being taken. You add an edge from `EXIT` to `ENTRY` with frequency 1 (it fires every time the program runs), and compute the spanning tree of the resulting graph with the maximum weight (sum of edge frequencies), ignoring directions. All edges in that spanning tree will have an updated weight 0. Note that this is never worse than zeroing out the most frequent edge of each node as described above, because taking one outgoing edge per node also forms a spanning tree (n - 1 edges in total, all nodes are linked to `EXIT`).
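Finding that maximum spanning tree is standard; here is a sketch using Kruskal’s algorithm with union-find (the edge frequencies are assumed to be given, e.g. from profiling estimates):

```python
def max_spanning_tree(edges):
    # edges: list of (freq, u, v) tuples; direction is ignored.
    # Greedily keep the most frequent edges that don't close a cycle.
    parent = {}

    def find(x):  # union-find root, with path compression
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for freq, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:          # u and v not yet connected
            parent[ru] = rv   # union the two components
            tree.append((u, v))
    return tree
```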

For any edge not in the spanning tree, we call it a “chord”. Each chord `f`, when added to the spanning tree, forms a cycle `C(f)`, also ignoring direction. Now, we have the weight assignments for any edge `e`, `W(e)`, from the previous section’s algorithm (for the added edge, `W(EXIT->ENTRY) = 0`). The new weight of any chord `f`, `W'(f)`, is the sum of `W` over `C(f)`, but we negate `W` for edges that are in the opposite direction of `f` in the cycle. For example, say the chord `A -> B` has `C(A -> B) = A -> B <- C -> A`, then `W'(A -> B) = W(A -> B) - W(C -> B) + W(C -> A)`. The claim is that, for any given path from `ENTRY` to `EXIT`, `W` and `W'` yield the same path sum.

But why is that? Here’s the handwavy part. The intuition is that every program execution, when appended with the edge `EXIT->ENTRY`, becomes a loop. A directed program execution loop `D` must contain chords, since all loops contain edges not in the spanning tree, and `D` is really just a “sum” of all `C(f)` for chords `f` in `D`. So the sum of `W` over `D` is equal to the sum of `W` over all `C(f)` for `f` in `D`. The sum of `W` over any `C(f)` is by definition equal to the sum of `W'` over `C(f)`.

Hence:

```
sum W over D
= sum W over (C(f) for chord f in D)
= sum W' over (C(f) for chord f in D)
= sum W' over D
```

Here’s a simple example, not a proof, since I don’t have one.

```
 program          spanning tree      execution loop

  ENTRY              ENTRY              ENTRY
  /   \                  \              /   \
 A     |             A    |            A     |
 | \  /              | \  /            |     |
 B  C                B  C              B     |
  \ |                   |               \   /
  EXIT                EXIT              EXIT
```

In our execution loop, we have 3 chords: `ENTRY->A`, `B->EXIT`, and `EXIT->ENTRY`.

```
C(ENTRY->A) = ENTRY - A - C - ENTRY
C(B->EXIT) = B - EXIT - C - A - B
C(EXIT->ENTRY) = EXIT - ENTRY - C - EXIT
```

Joining all three together (at `A` and `EXIT`), we have:

```
ENTRY - A - B - EXIT - ENTRY - C - EXIT - C - A - C - ENTRY
```

Cancelling out opposite edges, this simplifies to:

```
ENTRY - A - B - EXIT - ENTRY - C - EXIT - C - A - C - ENTRY
ENTRY - A - B - EXIT - ENTRY - C - ENTRY
ENTRY - A - B - EXIT - ENTRY
```

Which is exactly the execution loop.
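To make the claim concrete, here is a numeric check on this example. The specific numbers are my own working: `W` comes from the compact path-numbering scheme earlier, and `W'` zeroes the spanning tree and reweights the three chords via their cycles:

```python
# Edges of the example CFG, plus the added EXIT->ENTRY edge.
W = {('ENTRY', 'A'): 0, ('ENTRY', 'C'): 2, ('A', 'B'): 0,
     ('A', 'C'): 1, ('B', 'EXIT'): 0, ('C', 'EXIT'): 0,
     ('EXIT', 'ENTRY'): 0}
# W' is 0 on the spanning tree {ENTRY-C, A-C, A-B, C-EXIT};
# each chord gets the signed sum of W around its cycle.
Wp = {('ENTRY', 'A'): -1, ('ENTRY', 'C'): 0, ('A', 'B'): 0,
      ('A', 'C'): 0, ('B', 'EXIT'): -1, ('C', 'EXIT'): 0,
      ('EXIT', 'ENTRY'): 2}

loops = [  # every execution, closed with the EXIT->ENTRY edge
    ['ENTRY', 'A', 'B', 'EXIT', 'ENTRY'],
    ['ENTRY', 'A', 'C', 'EXIT', 'ENTRY'],
    ['ENTRY', 'C', 'EXIT', 'ENTRY'],
]
for path in loops:
    edges = list(zip(path, path[1:]))
    # W and W' agree on every execution loop's sum
    assert sum(W[e] for e in edges) == sum(Wp[e] for e in edges)
```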

With these two steps - compute weights, optimize the locations of the zero weight edges - we can insert instructions at the edges in the given program to efficiently compute the unique sum of the path taken each time a program finishes executing. In retrospect, the algorithm to assign weights to give compact path sums seems almost obvious, but that’s more a sign that we’ve asked the right question than that the problem is really trivial.

]]>Consider an alternative approach, where we put all numbers in an unordered array. To update, we just overwrite the old number, and to pop min we scan the array. Then, if we make n updates followed by one pop min, the total cost is now just O(n). The problem with that is that the worst case could grow to O(n²) when we pop min a lot more than expected in production. Hence the question: is there a way to do roughly O(1) work per update, but still end up with O(log n) worst case for pop min?
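A minimal sketch of that unordered-array alternative (keyed by element id, a hypothetical interface of my own):

```python
class UnorderedHeap:
    # O(1) update by overwriting; O(n) pop_min by linear scan.
    def __init__(self):
        self.values = {}

    def update(self, key, value):
        self.values[key] = value  # just overwrite in place

    def pop_min(self):
        key = min(self.values, key=self.values.get)  # full scan
        return key, self.values.pop(key)
```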

In general, this is impossible. No heap supports O(1) update, because update is strictly harder than pop min: updating the min element to infinity achieves the same effect as popping. Perhaps we can further relax the requirements to make progress. One way to do so is to assume that the updates are “random”.

It’s not entirely clear what the definition of random updates ought to be. To start, one reasonable definition would be that for each update, (A) an existing element is chosen uniformly at random, and (B) its updated rank is also independently chosen uniformly at random.

When I got to this point, I dived in and devised some complicated data structures which achieved the desired behaviors. But I later figured out that in fact the existing data structures I knew already satisfy the above requirements. Let’s take a look.

The simplest heap in existence is the binary heap, where we have a binary tree embedded in an array. The element at index i has children at indices 2i+1 and 2i+2, and we maintain the heap property that a child must be no less than its parent. To update an element in the heap, we can just overwrite the old element in the array, and simply recursively swap elements until the heap property holds. The time complexity of update is just how many swaps we need. In the worst case, we need to make O(log n) swaps, e.g. when the min element is updated to become the max.
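An in-place update for an array-backed min-heap might look like this (a hypothetical standalone `update` on a plain Python list, not from any library):

```python
def update(heap, i, new):
    # Overwrite heap[i], then restore the heap property by sifting.
    old = heap[i]
    heap[i] = new
    if new < old:
        # Sift up: swap with the parent while it is larger.
        while i > 0 and heap[(i - 1) // 2] > heap[i]:
            p = (i - 1) // 2
            heap[i], heap[p] = heap[p], heap[i]
            i = p
    else:
        # Sift down: swap with the smaller child while we are larger.
        n = len(heap)
        while True:
            c = 2 * i + 1
            if c + 1 < n and heap[c + 1] < heap[c]:
                c += 1  # pick the smaller of the two children
            if c >= n or heap[i] <= heap[c]:
                break
            heap[i], heap[c] = heap[c], heap[i]
            i = c
```

The cost of a call is exactly the number of swaps, which is what the averaging argument below counts.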

What about the “average” case given our assumptions of randomness? First, for updates that increase an element, we have to swap it with its children recursively. The worst case is that we have to swap it all the way down. In that case, assuming the element is randomly picked, the expected number of swaps is roughly:

```
0 * (1/2) + 1 * (1/4) + 2 * (1/8) + 3 * (1/16) + ...
= (1/4 + 1/8 + 1/16 + ...) + (1/8 + 1/16 + ...) + (1/16 + ...) + ...
= 1/2 + 1/4 + 1/8 + ...
= 1
```

This is because roughly half the elements are already at the bottom so they never need to be swapped down, then the remaining half are one level up, and so on.

Then, for updates that decrease the element, it takes some reasoning to see that it’s symmetric with the previous case. Say in heap H1, we’re decreasing an element at rank R1 to rank R2. After that’s done, we have H2, and if we were to change the rank back to R1, we actually have to do the exact same swaps to move it back to its original position (this might not be very obvious, but you can work out an example to convince yourself). Now, we claim that H1 and H2 are equally probable configurations, since the probability distribution from which we drew H1 should be invariant under random updates. Hence, the expected number of swaps needed to decrease a rank is the same as that to increase a rank, which is 1. (By the way, I feel like there ought to be a better argument. This argument relies on H1 and H2 being in the same probability distribution, which might not hold when other heap operations are carried out.)

All in all, randomly updating in place for a binary heap is actually O(1). In other words, binary heaps support O(1) random updates and O(log n) worst case for everything, which is exactly what we desire.

That’s great, except that I only had access to a pairing heap implementation. Pairing heap is this cool data structure where we have a tree (no limit on number of children per node) that lazily rebalances itself on pop min.

Here’s an extremely simplified description. We start with a tree with (only) the heap property. To “meld” (combine) two trees, we just take the tree with the smaller root, and stick the other tree under that root as an immediate child. Inserting an element is melding with a tree of size 1. To pop min, we first remove the root, and now we have to merge a whole bunch of trees, which were the immediate children of the root. The naive way of melding all of them in one go will result in a bad time complexity, since we might have to go through all of them again for the next pop min. The trick is to first meld the trees in pairs, then meld all those results in (reverse) order. This cuts down the number of immediate children for the next round by at least half. Lastly, removing any given node is just: cut it out from its parent, pop min from the detached branch, then meld the rest of it back.
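Here is a toy sketch of those operations (my own simplification in Python; real implementations also track parent pointers so that arbitrary nodes can be cut out and removed):

```python
class Node:
    __slots__ = ("key", "children")
    def __init__(self, key):
        self.key = key
        self.children = []

def meld(a, b):
    # Stick the tree with the larger root under the smaller root.
    if a is None:
        return b
    if b is None:
        return a
    if b.key < a.key:
        a, b = b, a
    a.children.append(b)
    return a

def pop_min(root):
    # Remove the root, meld its children in pairs, then meld the
    # pair results together in reverse order.
    kids = root.children
    paired = [meld(kids[i], kids[i + 1] if i + 1 < len(kids) else None)
              for i in range(0, len(kids), 2)]
    new_root = None
    for t in reversed(paired):
        new_root = meld(t, new_root)
    return root.key, new_root
```

Insert is just `meld(heap, Node(x))`, and repeated `pop_min` yields the keys in sorted order.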

The exact time complexity of all operations of pairing heap is still an open problem, but for our purposes, let’s just say insert takes O(1), and removing any node has amortized worst case O(log n). The naive way to update an element would be to remove the old value and then insert the new value. To remove a randomly picked element, the expected amount of work is proportional to the expected number of children, which is less than 1. Insert is also O(1), so in total, a random update is O(1). Note that this analysis only assumes (A).

Again, we get what we want: O(1) for random updates, O(log n) for amortized worst case pop min and updates.

While I was figuring this out, I learned that there are quite a variety of these data structures out there. Fibonacci heap used to be the poster child of being theoretically great but not practical, but these days we have rank pairing heap that achieves the same asymptotic bounds and claims to be competitive in practice as well. As an aside, there are a bunch of variants of pairing heap. I’m not sure whether all these different heaps have similar properties as discussed here, but at this point I don’t care enough to find out, since most of these heaps are probably never used in real life anyway.

]]>