<h1>Not a Computer Scientist, But…</h1>
<p>Blog about math and code, often with almost rigorous proofs.</p>
<h1>A Tale of Two Zeros</h1>
<p>2022-04-07 · inacsb.com/a-tale-of-two-zeros</p>
<p>One day I came across some code that looked like this (paraphrased):</p>
<div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">sign__old</span> <span class="n">f</span> <span class="o">=</span>
<span class="k">assert</span> <span class="p">(</span><span class="nn">Float</span><span class="p">.</span><span class="n">is_finite</span> <span class="n">f</span><span class="p">);</span>
<span class="k">if</span> <span class="nn">String</span><span class="p">.</span><span class="n">is_prefix</span> <span class="p">(</span><span class="nn">Float</span><span class="p">.</span><span class="n">to_string</span> <span class="n">f</span><span class="p">)</span> <span class="o">~</span><span class="n">prefix</span><span class="o">:</span><span class="s2">"-"</span>
<span class="k">then</span> <span class="nt">`Negative</span>
<span class="k">else</span> <span class="nt">`Nonnegative</span>
</code></pre></div></div>
<p>Naturally, I replaced it with the following:</p>
<div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">sign__new</span> <span class="n">f</span> <span class="o">=</span>
<span class="k">assert</span> <span class="p">(</span><span class="nn">Float</span><span class="p">.</span><span class="n">is_finite</span> <span class="n">f</span><span class="p">);</span>
<span class="k">if</span> <span class="nn">Float</span><span class="p">.</span><span class="n">is_negative</span> <span class="n">f</span> <span class="k">then</span> <span class="nt">`Negative</span> <span class="k">else</span> <span class="nt">`Nonnegative</span>
</code></pre></div></div>
<p>A few days later, this caused an issue in prod. How is that possible? These two functions are obviously equivalent, right?</p>
<p>It turns out there is exactly one edge case for which these two functions behave differently: -0., aka negative float zero.</p>
<h2 id="positive-vs-negative-zero">Positive vs Negative Zero</h2>
<p>In the <a href="https://en.wikipedia.org/wiki/IEEE_754">IEEE floating point standard</a>, numbers are represented as a sign and a magnitude. This makes it possible to have both a positive and a negative zero. While the two values are numerically equal, both are valid floats, and some operations treat them differently.</p>
<p>In this case, <code class="language-plaintext highlighter-rouge">sign__new</code> sees negative zero as <code class="language-plaintext highlighter-rouge">`Nonnegative</code>, because it is not numerically smaller than zero, despite having a negative sign. On the other hand, <code class="language-plaintext highlighter-rouge">Float.to_string (-0.)</code> produces <code class="language-plaintext highlighter-rouge">"-0."</code>, so <code class="language-plaintext highlighter-rouge">sign__old</code> thinks it is <code class="language-plaintext highlighter-rouge">`Negative</code>.</p>
<p>I think it’s fairly uncommon for negative zero to cause bugs, because programs typically treat the two values as interchangeable. In my case, the code constructs an AST that represents the float value. The old code produces a unary negation applied to a positive zero immediate, while the new code produces a negative zero immediate. This change caused an exception in prod because the code generation and parsing process no longer round trips.</p>
<p>The first time I learned about negative zero, I thought it was a misfeature. But it actually makes sense, considering infinities are also signed. With all four values representable, we can have <code class="language-plaintext highlighter-rouge">1/-inf = -0</code>, <code class="language-plaintext highlighter-rouge">1/-0 = -inf</code> and so on, which is nice.</p>
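<p>The same pair of zeros shows up in any language with IEEE floats. Below is a quick sketch of the behavior in Python (the post’s code is OCaml); here <code class="language-plaintext highlighter-rouge">math.copysign</code> plays the role of <code class="language-plaintext highlighter-rouge">Float.ieee_negative</code> by reading the sign bit:</p>

```python
import math

# The two zeros are numerically equal...
assert -0.0 == 0.0
assert not (-0.0 < 0.0)   # ...so a "less than zero" check calls -0.0 nonnegative,

# ...but the sign bit survives printing and copysign:
assert str(-0.0) == "-0.0"               # a string-based check calls it negative
assert math.copysign(1.0, -0.0) == -1.0  # sign-bit check: negative

# Signed zeros pair up nicely with signed infinities:
assert 1.0 / -math.inf == -0.0                       # true, but only because -0.0 == 0.0
assert math.copysign(1.0, 1.0 / -math.inf) == -1.0   # the result really is -0.0
```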
<p>With the bug identified, the fix is easy: use <code class="language-plaintext highlighter-rouge">Float.ieee_negative</code> instead of <code class="language-plaintext highlighter-rouge">Float.is_negative</code>.</p>
<p>If you enjoyed this post, try another: <a href="/precisely-compare-ints-and-floats">Precisely Compare Ints and Floats</a></p>
<h1>Building Multiplayer Minesweeper: Easy If You Know How</h1>
<p>2022-03-20 · inacsb.com/building-multiplayer-minesweeper-easy-if-you-know-how</p>
<p>I spent the past few weekends building a multiplayer minesweeper game on the web. It’s <a href="https://www.onemistakes.click/">now live</a>, so get a nerdy friend and try it out! In this post, I’ll list out the things that I picked up along the way. I started roughly from scratch, so if you also have little background but want to do something similar, you can use this post as a starting point.</p>
<p>Overall, it was easy if you know how to build it, but knowing how was not easy.</p>
<h2 id="objectives">Objectives</h2>
<p>The project is to build a web game. The game play is like regular minesweeper, except that there are 2 players mining the same map, and the goal is to race to flag 50 mines out of 99 instead of clearing the whole map. When a player clicks on a mine, it is a free point for the opponent; and when a player flags a grid that isn’t a mine, they get frozen for a few seconds. When a flag is correctly placed, it shows up on both clients, but cleared grids are only visible locally.</p>
<p>Project objectives in descending order of importance:</p>
<ol>
<li>Finish the project</li>
<li>Do it right</li>
</ol>
<h2 id="stack">Stack</h2>
<p>Ultimately, the deliverables are a front end web app and a websocket back end to handle the real time messages between front end clients. For both front and back end, there are many ways to build them.</p>
<p>After some research, I settled on vue.js (Vue 3) for the client, and NestJS for the server. None of the decisions here, or below, are obvious by any means. I picked the stack based on the following criteria.</p>
<ol>
<li>Popularity. Solutions to common problems can be googled, and tooling is probably better.</li>
<li>Nice APIs. Static typing, concise syntax, modularity, etc.</li>
<li>Easy (for me) to learn. This is important for actually finishing the project.</li>
</ol>
<p>Some metrics that I did not care about at all: performance, security and bloat.</p>
<p>Here are some of the other choices I considered:</p>
<ul>
<li>Native apps instead of web. This is a no go because few people want to install apps these days, especially on the desktop, and performance is extremely unimportant given how simple the game is.</li>
<li>OCaml client+server, using bonsai (incr_dom, js_of_ocaml) for front end. I am most proficient in OCaml and I like the language. But frankly, the popularity checkbox is extremely unchecked, and the syntax isn’t even nice. Unless you already work in an OCaml repo, using OCaml for web clients is pretty hard to justify.</li>
<li>React client. I mean, you can’t go wrong with react, since it’s the most popular choice. But I find Vue simpler to use and more opinionated. In this case, Vue having a designated way to do things is a plus, since I’m here to build a game, not to form opinions about how to build a game.</li>
<li>Other client frameworks like angular, svelte, solid, etc. I didn’t seriously consider these, since they seem to be less popular than React and Vue.</li>
<li>Other languages for the server, e.g. python, go. For simplicity, I think using the same language for both the client and the server is good, and in this case that language is TypeScript. So I didn’t seriously consider these.</li>
<li>Not using a back end framework. NestJS is designed to have opinions on how to do things, which again is good when I don’t have opinions.</li>
</ul>
<p>Here is a grab bag of things that I used.</p>
<ul>
<li>VS Code</li>
<li>socket.io for websocket communication in both client and server</li>
<li>Tailwind css</li>
<li>TypeScript for both client and server</li>
<li>Font awesome for icons</li>
<li>Google Fonts</li>
<li>Vuex (to persist player name)</li>
</ul>
<p>Things that weren’t obvious to me at first:</p>
<ul>
<li>For servers to be able to push updates to clients, you either need a websocket connection between the two, or you use server-sent events (SSE). I didn’t look closely at the latter, as I believe it is much less used.</li>
<li>There are actually 2 web servers: one to serve the front end through a GET request, and one to serve the websocket back end. It is possible to serve both from the same server, but as far as I can tell people don’t do that.</li>
<li>When you click on a “link” that changes the url, e.g. from https://onemistakes.click/foo to https://onemistakes.click/bar, it is not actually going to a new website. Instead, it just runs whatever javascript you want it to run to show another view, load resources, etc. This url change can also be programmatically triggered. This is how single page apps, or SPAs, work.</li>
</ul>
<h2 id="design">Design</h2>
<p>The server handles:</p>
<ol>
<li>Creating/joining game rooms</li>
<li>Generating new game maps</li>
<li>Message relaying between clients</li>
<li>Tracking which client flagged/bombed a grid first</li>
</ol>
<p>And the client does the rest.</p>
<ol>
<li>The client tracks the state of each grid, which can be one of the following: unclicked, clicked, flagged by local player before server ack, flagged by either player after server ack, bombed by either player.</li>
<li>Flags before server ack are displayed as flagged by the local user, but ignored in score counting. This ensures the user sees no lag and there is no race for deciding the winner.</li>
<li>The client requests a new game automatically a few seconds after the game ends.</li>
</ol>
<p>This design is similar to rollback netcode, although greatly simplified because the game is very simple. To explain what that is, we first have to explore a few alternatives.</p>
<p>A multiplayer game is like a distributed state machine. You have a state, and all players provide inputs to change it. There are a few designs that achieve this.</p>
<ol>
<li>Server as the only master: Clients send user inputs to the server, server sends back the current state for display.</li>
<li>Designated client as the only master: Server only relays messages between the master client and other clients. Only the master client can change the state and disseminate it to other clients.</li>
<li>All clients run the state machine, and proceed only when all inputs from all clients arrive. (This is called deterministic lockstep.)</li>
<li>All clients run the state machine, and guess the inputs from other clients before they arrive. When they arrive, replay the state machine to correct wrong guesses. (This is called rollback netcode.)</li>
</ol>
<p>Option 1 is perhaps the simplest. But it suffers from two issues: players experience a lag between making an action and seeing a response, because every action waits on the server; and the server, which is a shared resource, has to do more work. Options 2 and 3 also suffer from lag.</p>
<p>Even though there is no explicit state machine, the design explained above is like rollback netcode, because by displaying local flags immediately, we are basically guessing that the opponent did nothing unless we later learn otherwise. Since the game is simple, using the above state tracking logic greatly simplifies the code, while achieving the same behavior as rollback netcode.</p>
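<p>To make the grid-state tracking concrete, here is a minimal sketch in Python (the real client is in TypeScript, and all names here are hypothetical): flags are displayed optimistically, and the server ack either confirms or rolls back the guess.</p>

```python
from enum import Enum, auto

class Grid(Enum):
    UNCLICKED = auto()
    CLICKED = auto()
    FLAGGED_PENDING = auto()  # flagged locally, before server ack
    FLAGGED_ACKED = auto()    # flagged by either player, after server ack
    BOMBED = auto()           # bombed by either player

class Board:
    def __init__(self, n_grids):
        self.grids = [Grid.UNCLICKED] * n_grids
        self.score = 0  # only acked flags count, so there is no race for the winner

    def flag_locally(self, i):
        # Optimistic update: show the flag immediately so the user sees no lag.
        if self.grids[i] is Grid.UNCLICKED:
            self.grids[i] = Grid.FLAGGED_PENDING

    def on_server_ack(self, i, was_mine, flagged_by_me):
        # The server is authoritative about what is a mine and who flagged first.
        if was_mine:
            self.grids[i] = Grid.FLAGGED_ACKED
            if flagged_by_me:
                self.score += 1
        else:
            # Wrong guess: roll the optimistic flag back (the freeze penalty
            # is not shown here).
            self.grids[i] = Grid.UNCLICKED
```

<p>This is the “guess that nothing conflicting happened” half of rollback netcode: the pending state is shown immediately, and the server message later replaces it with the true state.</p>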
<h2 id="deployment">Deployment</h2>
<p>After a few weekends, I got the game working on localhost. Excluding boilerplate, the client took around 800 lines and the server under 200. But deployment is another inevitable battle.</p>
<p>I ended up pushing both the client and server code to Github private repos. I linked the server repo to AWS CodePipeline and Elastic Beanstalk, linked the client repo to AWS Amplify and got a domain name from Route 53 to sign SSL certs. Now I just have to git push my code and everything works automatically, which is nice.</p>
<p>In my previous hosting experience, I rented an EC2 instance and did things like ssh, scp, cron jobs, Let’s Encrypt and so on. This time, once I figured out which AWS services I needed, setting them up was much less painful than the manual work I did in the past. Again, easy once you know how: there are so many AWS products that just knowing which ones are relevant is already a bunch of work.</p>
<p>There were a few more issues I had to deal with at this stage.</p>
<ul>
<li>Client needs to talk to different servers depending on prod/dev mode. This was solved using environment variables.</li>
<li>AWS instance ran out of memory building the docker image for the server, because building takes more memory than running it. This was solved by adding a prebuild hook to add swap space to the instance.</li>
<li>The client couldn’t talk to the server unless the server uses HTTPS. This was solved by buying a domain name from AWS and signing a cert for the load balancer of the back end. For this reason I had to use a load balancer even though there can only be one server (otherwise clients may not see each other).</li>
</ul>
<p>I did not look into AWS competitors, as I was already an AWS customer, and the popular sentiment seems to be that AWS is still the leader in cloud services. Self hosting was a nonstarter, because there is no concern about data integrity (the server persists nothing on disk), and I only care about building the game, not everything else that runs it.</p>
<h2 id="future">Future</h2>
<p>When I set up the repo and deployment flow, I made sure that I would not have to do this again for the next game. When I build the next little thing, I should be able to just write some more code, test it locally, push to Github, and then I’m done.</p>
<p>Now I just need an idea…</p>
<h1>Searching for Sets: Jaccard Index and MinHash</h1>
<p>2022-01-02 · inacsb.com/searching-for-sets-jaccard-index-and-minhash</p>
<p><a href="https://emysound.com/blog/open-source/2020/06/12/how-audio-fingerprinting-works.html">Here’s</a> a well-written intro to audio fingerprinting. One part of the article contains a clever trick that seems generally useful and interesting to think about. I will attempt to quickly describe the problem below and explain the solution.</p>
<h2 id="the-problem-fingerprints-of-sparse-sets">The Problem: Fingerprints of Sparse Sets</h2>
<p>To summarize the article in an extremely lossy fashion, the problem is that we want to take an audio clip, and find similar clips in a database. To do so, we can convert each clip into low dimensional “fingerprints”, and utilize nearest neighbor algorithms to find clips with the most similar fingerprints. Given any clip, one can cut it into small segments, run FFT to get spectrograms, then binarize the resulting images to get bit vectors. But even with, say, 128 x 32 = 4096 bits per fingerprint, it’s still too many to run nearest neighbors. How can we further reduce dimensions to facilitate faster matching?</p>
<p>It is important to note that in this context, the 0s and the 1s in the bit vectors aren’t symmetric. Instead, it is far more useful to think of the bit vectors as sets of 1s. Imagine if clip A has a C note and clip B has a D note. You would think they’re completely different because they don’t share any common notes; it would be silly to say they’re mostly the same because they have a lot of common missing notes.</p>
<p>If you think about it, finding similar sets is a general problem: you have a universe of unique items (pixels in the spectrogram), you have many sparse sets of them (spectrograms), and you wish to reduce each set’s dimensionality while preserving pairwise similarity. It’s just like comparing people’s book lists, movie lists, or interests.</p>
<h2 id="the-trick-minhash">The Trick: MinHash</h2>
<p>First, we need to define a metric of similarity between two sets. It seems like a natural definition would be: the size of the intersection divided by the size of the union. If both sets are equal, you get 1; if both sets share no common elements, you get 0. It turns out this is called the Jaccard Index.</p>
<p>Here’s the interesting part: say you have two n-bit vectors. If you randomly permute both vectors in the same way, and then find the index of the first 1 bit (this is called the MinHash), then the probability that the two indices are equal is exactly equal to their Jaccard Index. Let’s go through a quick example before explaining why this works.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A = 0101 0001
B = 0011 0001
Random permutation 1:
abcdefgh becomes -> daefbcgh (1st position -> 2nd position, 2nd -> 5th, etc)
A becomes -> 1000 1001, first 1: 1
B becomes -> 1000 0101, first 1: 1, equal to A's
Random permutation 2:
abcdefgh becomes -> bcadfehg
A becomes -> 1001 0010, first 1: 1
B becomes -> 0101 0010, first 1: 2, different from A's
</code></pre></div></div>
<p>In the above example, the Jaccard Index is 2/4 (intersection / union), which means we should expect the test to return “equal” 50% of the time. So, why is this true?</p>
<p>First, note that positions where both vectors have a 0 cannot affect the result of the test: no matter where they end up after the permutation, neither vector has a 1 there, so they can never be either vector’s first 1. Therefore, we can safely drop them from the inputs without changing the result.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A drops matching 0s -> 1011
B drops matching 0s -> 0111
</code></pre></div></div>
<p>After dropping matching zeros, every remaining position is either a matching 1 or a differing bit. It is now easy to see that the first 1’s index is equal in both vectors iff the position the permutation moves to the front (among the remaining positions) is a matching 1. That happens with probability (number of matching 1s) / (number of remaining positions), which is exactly the Jaccard Index.</p>
<p>In a universe of n things, any permutation of n elements yields a MinHash function. Now, we just have to precompute a bunch of these functions, and apply them to each bit vector to get the fingerprints. Given a pair of fingerprints, we just count the number of matching MinHash outputs to get an estimate of the similarity. One cool observation is that more hashes only give you more accurate similarity estimates, which means even when the universe becomes larger, or more sets are added, you still don’t really have to linearly scale up the total fingerprint size.</p>
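<p>Here is a small self-contained sketch of the whole trick in Python, reusing the 8-bit vectors from the example above (each bit vector is stored as the set of indices of its 1s):</p>

```python
import random

def jaccard(a, b):
    return len(a & b) / len(a | b)

A = {1, 3, 7}  # 0101 0001 from the example above
B = {2, 3, 7}  # 0011 0001

random.seed(0)
n_hashes = 400

def random_perm(n=8):
    order = list(range(n))
    random.shuffle(order)
    return {x: i for i, x in enumerate(order)}  # element -> permuted position

perms = [random_perm() for _ in range(n_hashes)]

def minhash(s, perm):
    # Position of the first 1 bit after applying the permutation.
    return min(perm[x] for x in s)

# The fingerprints are the lists of MinHash values...
fp_a = [minhash(A, p) for p in perms]
fp_b = [minhash(B, p) for p in perms]

# ...and the fraction of matching values estimates the Jaccard Index.
estimate = sum(x == y for x, y in zip(fp_a, fp_b)) / n_hashes
assert abs(estimate - jaccard(A, B)) < 0.15  # true value is 2/4 = 0.5
```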
<p>Given the fingerprints, we still have to figure out how to search efficiently, but that’s another complicated subject for another day.</p>
<h1>Precisely Compare Ints and Floats</h1>
<p>2021-12-21 · inacsb.com/precisely-compare-ints-and-floats</p>
<p>Here’s a seemingly trivial task that I ran into recently - given a 64 bit int <code class="language-plaintext highlighter-rouge">i</code> and a 64 bit floating point <code class="language-plaintext highlighter-rouge">f</code>, how can we tell which one is larger?</p>
<p>Well, duh! In a language with implicit type casting, this is hardly even a question. Just do <code class="language-plaintext highlighter-rouge">i < f</code> and it magically works, right?</p>
<p>This is almost correct, but the issue is that by turning the int into a float, we are dropping precision for large integers. Only 52 bits of the float are used to represent the “mantissa”, i.e. the binary digits after the “1.”; in other words, a float can only keep 53 binary significant figures. So if you have the int 2<sup>53</sup>+1, the closest float is 2<sup>53</sup>, and naively your code will think the values are equal, while one is actually numerically larger than the other. How hard is it to do this comparison exactly?</p>
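<p>Python is a convenient place to see the precision loss, because its ints are arbitrary precision and its mixed int/float comparison happens to be exact, which is precisely the behavior we want to reproduce for 64 bit ints:</p>

```python
# 2**53 is the last point where consecutive integers are all representable:
assert float(2**53) == 2.0**53
assert float(2**53 + 1) == 2.0**53      # rounds down: precision is lost

# So a comparison that casts the int to a float calls these equal...
assert not (float(2**53 + 1) > 2.0**53)

# ...while an exact comparison (Python's built-in) knows better:
assert 2**53 + 1 > 2.0**53
```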
<p>Let’s say we’re writing a compare function that returns a positive number when the int is larger, a negative number if the float is larger, and 0 if they’re equal. And for simplicity let’s assume we already have a compare function for ints and floats respectively. Here’s a seemingly clever way to do it.</p>
<div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="n">compare_int_float</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">,</span> <span class="kt">float</span> <span class="n">f</span><span class="p">)</span>
<span class="n">f_cmp</span> <span class="o">=</span> <span class="n">compare_float</span><span class="p">(</span><span class="n">int_to_float</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">,</span> <span class="n">f</span><span class="p">)</span>
<span class="k">if</span> <span class="n">f_cmp</span> <span class="o">!=</span> <span class="mi">0</span>
<span class="n">return</span> <span class="n">f_cmp</span>
<span class="k">else</span>
<span class="n">return</span> <span class="n">compare_int</span><span class="p">(</span><span class="n">i</span><span class="o">,</span> <span class="n">float_to_int</span><span class="p">(</span><span class="n">f</span><span class="p">))</span>
</code></pre></div></div>
<p>There are a few observations here. I’ll just state them for now, and will explain them in comments in the final version of the pseudocode.</p>
<ul>
<li>If the float comparison doesn’t say the numbers are equal, then we can trust the result.</li>
<li>If they are “float equal”, then <code class="language-plaintext highlighter-rouge">f</code> must be numerically an integer. Therefore, we can compare them as if they were ints.</li>
</ul>
<p>But actually this code has a bug. Can you spot it?</p>
<hr />
<p>The bug is that for certain inputs, this function can raise, specifically through calling <code class="language-plaintext highlighter-rouge">float_to_int(f)</code> on an out of bounds <code class="language-plaintext highlighter-rouge">f</code>. This happens when <code class="language-plaintext highlighter-rouge">f = 2**63</code> and <code class="language-plaintext highlighter-rouge">i</code> rounds to <code class="language-plaintext highlighter-rouge">f</code> when converted to a float. Below is the final pseudocode:</p>
<div class="language-ocaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function</span> <span class="n">compare_int_float</span><span class="p">(</span><span class="kt">int</span> <span class="n">i</span><span class="o">,</span> <span class="kt">float</span> <span class="n">f</span><span class="p">)</span>
<span class="n">f_cmp</span> <span class="o">=</span> <span class="n">compare_float</span><span class="p">(</span><span class="n">int_to_float</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="o">,</span> <span class="n">f</span><span class="p">)</span>
<span class="k">if</span> <span class="n">f_cmp</span> <span class="o">!=</span> <span class="mi">0</span>
<span class="c">(* If i rounds to a float less/greater than f,
i must be less/greater than f, because otherwise
f would be a float that is closer to i,
and i should have rounded to f. *)</span>
<span class="n">return</span> <span class="n">f_cmp</span>
<span class="k">else</span> <span class="k">if</span> <span class="n">f</span> <span class="o">=</span> <span class="mi">2</span><span class="o">**</span><span class="mi">63</span>
<span class="k">then</span>
<span class="c">(* Large integers round up to 2**63, which is larger than
max int, 2**63-1. We need to handle this case, otherwise
float_to_int will raise. *)</span>
<span class="n">return</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">else</span>
<span class="c">(* When i is converted to a float, its significant digits
can be dropped. Regardless, it will still be an integer,
so f (which is equal to i rounded) is also an integer.
Therefore we can turn f into an int and compare exactly. *)</span>
<span class="n">return</span> <span class="n">compare_int</span><span class="p">(</span><span class="n">i</span><span class="o">,</span> <span class="n">float_to_int</span><span class="p">(</span><span class="n">f</span><span class="p">))</span>
</code></pre></div></div>
<h1>Paper Reading: Efficient Path Profiling</h1>
<p>2021-10-23 · inacsb.com/paper-reading-efficient-path-profiling</p>
<p>Recently I’ve been going through <a href="https://www.cs.cornell.edu/courses/cs6120/2020fa/self-guided/">CS 6120 from Cornell</a> (compilers), and one of the papers listed in the course was quite interesting, namely <a href="https://www.cs.purdue.edu/homes/xyzhang/spring10/epp.pdf">Efficient Path Profiling</a>. Once in a while you see a solution so neat that it almost feels like the problem was created in order to make such a solution useful; this paper gave me that feeling. This blog post will give a high level understanding of the problem, the algorithm and some intuitions, while leaving out all the technical details.</p>
<p>The problem setting is that you have a control flow graph (CFG), where each node is a block of code that always executes together (no branches), and each edge is a branch/jump instruction. With huge loss of generality, we assume there will be an <code class="language-plaintext highlighter-rouge">ENTRY</code> node and an <code class="language-plaintext highlighter-rouge">EXIT</code> node, and the CFG will be a directed acyclic graph (DAG) always going from <code class="language-plaintext highlighter-rouge">ENTRY</code> to <code class="language-plaintext highlighter-rouge">EXIT</code>. This is clearly unrealistic for normal programs due to the lack of loops, but the paper provides workarounds that aren’t very interesting. The task is to record the paths taken in each execution (from <code class="language-plaintext highlighter-rouge">ENTRY</code> to <code class="language-plaintext highlighter-rouge">EXIT</code>), so that we can compute statistics about which paths are the most common and make compiler optimizations accordingly.</p>
<p>In other words, we’re doing something like the following. Say we give each node a unique identifier (e.g. 0, 1, 2 …). Each time the program runs, we maintain a list of these identifiers, appending to it every time we visit a new node. And by the end we can add the resulting list to some sort of aggregating data structure.</p>
<p>But that’s a horribly inefficient way to do it. Both appending to the list in each node and aggregating the resulting lists at the end of each execution are going to be expensive. Here, the authors propose: what if we could instead somehow give integer weights to the edges in the CFG such that each path from <code class="language-plaintext highlighter-rouge">ENTRY</code> to <code class="language-plaintext highlighter-rouge">EXIT</code> has a unique sum, to replace the list of node identifiers? What if those path sums are small enough numbers that you could just make an array and increment the element at the index equal to the path sum?? What if you can pick the edges that are triggered the most and set those weights to 0, so you don’t even need to do anything in the hot paths???</p>
<h2 id="compact-unique-path-sums">Compact Unique Path Sums</h2>
<p>It turns out all of those are possible. First, it’s actually easy to weight the edges such that each <code class="language-plaintext highlighter-rouge">ENTRY</code>-><code class="language-plaintext highlighter-rouge">EXIT</code> path gives a different sum. You could just pick unique powers of 2, which could give you really large weights. But you can actually do much, much better and make the sums “compact”, meaning they form a range between 0 and the number of unique paths - 1, so the path sums are as small as possible. It’s also really simple to do so.</p>
<p>First, we define <code class="language-plaintext highlighter-rouge">NumPaths(v)</code> as the number of unique paths from node v to <code class="language-plaintext highlighter-rouge">EXIT</code>. This takes linear time to compute. Then, for each node, say we have a list of outgoing edges. For the ith edge, we simply take all edges from 0 to <code class="language-plaintext highlighter-rouge">i - 1</code>, find their destinations, add up their <code class="language-plaintext highlighter-rouge">NumPaths</code>, and use that sum as the weight. The intuition is that each outgoing edge leads to <code class="language-plaintext highlighter-rouge">NumPaths</code> (of its destination) ways to reach <code class="language-plaintext highlighter-rouge">EXIT</code>, and each of those paths has a unique sum between 0 and <code class="language-plaintext highlighter-rouge">NumPaths - 1</code> (by induction). Since the first edge already claimed sums 0 to <code class="language-plaintext highlighter-rouge">NumPaths - 1</code>, the second edge has to start counting from <code class="language-plaintext highlighter-rouge">NumPaths</code>, which we achieve by adding <code class="language-plaintext highlighter-rouge">NumPaths</code> to the second edge’s weight.</p>
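<p>The weight assignment is short enough to sketch in full. Here is a Python version on a small hypothetical DAG, including a brute-force enumeration to check that the path sums come out compact and unique:</p>

```python
# A hypothetical CFG: ENTRY -> {A, B}, A -> {C, EXIT}, B -> C, C -> EXIT.
edges = {
    "ENTRY": ["A", "B"],
    "A": ["C", "EXIT"],
    "B": ["C"],
    "C": ["EXIT"],
    "EXIT": [],
}

def num_paths(v):
    # Number of unique paths from v to EXIT (linear time with memoization,
    # omitted here for brevity).
    return 1 if v == "EXIT" else sum(num_paths(w) for w in edges[v])

def assign_weights():
    # The ith outgoing edge gets the sum of NumPaths over the
    # destinations of edges 0 .. i-1; the 0th edge gets weight 0.
    weights = {}
    for v, outs in edges.items():
        acc = 0
        for u in outs:
            weights[(v, u)] = acc
            acc += num_paths(u)
    return weights

def path_sums(v, weights, acc=0):
    # Brute force: the path sum of every v -> EXIT path.
    if v == "EXIT":
        return [acc]
    return [s for u in edges[v] for s in path_sums(u, weights, acc + weights[(v, u)])]

weights = assign_weights()
assert num_paths("ENTRY") == 3
assert sorted(path_sums("ENTRY", weights)) == [0, 1, 2]  # compact and unique
```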
<p>If the maximum <code class="language-plaintext highlighter-rouge">NumPaths</code> of a CFG is small enough, we can just maintain an array of length <code class="language-plaintext highlighter-rouge">NumPaths</code> to count the frequency each path is taken across many runs. Otherwise, we can still maintain a hash table, incurring a larger overhead.</p>
<h2 id="choosing-weights-to-zero-out">Choosing Weights to Zero Out</h2>
<p>So far it’s been very simple. Notice that for each node, one of the outgoing edges has weight 0. Of course, at those edges, we don’t actually need to add an instruction to add the weight to the path sum. So we can actually just pick the most frequent edge and assign it 0, to minimize the overhead we’re adding to the program.</p>
<p>But we can actually have more flexibility than picking one outgoing edge per node to zero out. This part is a bit more involved to understand, and this paper basically says “just look at that other paper”. While that other paper has a proof, it still didn’t quite explain why it works. I think I have an intuition, which I will lay out below, but I’m Not a Computer Scientist, so it might be wrong, etc.</p>
<p>The way it works is: start off with some estimate of the relative frequency with which each edge is taken. Add an edge from <code class="language-plaintext highlighter-rouge">EXIT</code> to <code class="language-plaintext highlighter-rouge">ENTRY</code> with frequency 1 (it fires every time the program runs), and compute the maximum-weight spanning tree of the resulting graph (weight being the sum of edge frequencies), ignoring edge directions. All edges in that spanning tree get an updated weight of 0. Note that this is never worse than zeroing out the most frequent outgoing edge of each node as described above, because picking one outgoing edge per node also forms a spanning tree (n - 1 edges in total, and every node is linked to <code class="language-plaintext highlighter-rouge">EXIT</code>).</p>
<p>Any edge not in the spanning tree is called a “chord”. Each chord <code class="language-plaintext highlighter-rouge">f</code>, when added to the spanning tree, forms a cycle <code class="language-plaintext highlighter-rouge">C(f)</code>, again ignoring direction. Now, we have the weight assignment <code class="language-plaintext highlighter-rouge">W(e)</code> for every edge <code class="language-plaintext highlighter-rouge">e</code> from the previous section’s algorithm (for the added edge, <code class="language-plaintext highlighter-rouge">W(EXIT->ENTRY) = 0</code>). The new weight of a chord <code class="language-plaintext highlighter-rouge">f</code>, <code class="language-plaintext highlighter-rouge">W'(f)</code>, is the sum of <code class="language-plaintext highlighter-rouge">W</code> over <code class="language-plaintext highlighter-rouge">C(f)</code>, except that we negate <code class="language-plaintext highlighter-rouge">W</code> for edges that point in the opposite direction of <code class="language-plaintext highlighter-rouge">f</code> in the cycle. For example, say the chord <code class="language-plaintext highlighter-rouge">A -> B</code> has <code class="language-plaintext highlighter-rouge">C(A -> B) = A -> B <- C -> A</code>; then <code class="language-plaintext highlighter-rouge">W'(A -> B) = W(A -> B) - W(C -> B) + W(C -> A)</code>. The claim is that any path from <code class="language-plaintext highlighter-rouge">ENTRY</code> to <code class="language-plaintext highlighter-rouge">EXIT</code>, closed off with the <code class="language-plaintext highlighter-rouge">EXIT->ENTRY</code> edge, gets the same path sum under <code class="language-plaintext highlighter-rouge">W</code> and <code class="language-plaintext highlighter-rouge">W'</code> (the closing edge matters: its <code class="language-plaintext highlighter-rouge">W</code> is 0, but its <code class="language-plaintext highlighter-rouge">W'</code> generally isn’t).</p>
<p>But why is that? Here’s the handwavy part. The intuition is that every program execution, when appended with the edge <code class="language-plaintext highlighter-rouge">EXIT->ENTRY</code>, becomes a loop. A directed program execution loop <code class="language-plaintext highlighter-rouge">D</code> must contain chords, since every loop contains edges not in the spanning tree, and <code class="language-plaintext highlighter-rouge">D</code> is really just a “sum” of all <code class="language-plaintext highlighter-rouge">C(f)</code> for chords <code class="language-plaintext highlighter-rouge">f</code> in <code class="language-plaintext highlighter-rouge">D</code>. So the sum of <code class="language-plaintext highlighter-rouge">W</code> over <code class="language-plaintext highlighter-rouge">D</code> is equal to the sum of <code class="language-plaintext highlighter-rouge">W</code> over all <code class="language-plaintext highlighter-rouge">C(f)</code> for <code class="language-plaintext highlighter-rouge">f</code> in <code class="language-plaintext highlighter-rouge">D</code>. And the sum of <code class="language-plaintext highlighter-rouge">W</code> over any <code class="language-plaintext highlighter-rouge">C(f)</code> is, by the definition of <code class="language-plaintext highlighter-rouge">W'(f)</code>, equal to the sum of <code class="language-plaintext highlighter-rouge">W'</code> over <code class="language-plaintext highlighter-rouge">C(f)</code>.</p>
<p>Hence:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sum W over D
= sum W over (C(f) for chord f in D)
= sum W' over (C(f) for chord f in D)
= sum W' over D
</code></pre></div></div>
<p>Here’s a simple example, not a proof, since I don’t have one.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>program           spanning tree     execution loop
  ENTRY               ENTRY             ENTRY
  /   \                   \             /   \
 A     |             A     |           A     |
 |  \  |             |  \  |           |     |
 B    C              B    C            B     |
  \   /                  /              \   /
   EXIT                EXIT              EXIT
</code></pre></div></div>
<p>In our execution loop, we have 3 chords, <code class="language-plaintext highlighter-rouge">ENTRY->A</code>, <code class="language-plaintext highlighter-rouge">B->EXIT</code>, and <code class="language-plaintext highlighter-rouge">EXIT->ENTRY</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>C(ENTRY->A) = ENTRY - A - C - ENTRY
C(B->EXIT) = B - EXIT - C - A - B
C(EXIT->ENTRY) = EXIT - ENTRY - C - EXIT
</code></pre></div></div>
<p>Joining all three together (at <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">EXIT</code>), we have:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ENTRY - A - B - EXIT - ENTRY - C - EXIT - C - A - C - ENTRY
</code></pre></div></div>
<p>Cancelling out opposite edges, this simplifies to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ENTRY - A - B - EXIT - ENTRY - C - EXIT - C - A - C - ENTRY
ENTRY - A - B - EXIT - ENTRY - C - ENTRY
ENTRY - A - B - EXIT - ENTRY
</code></pre></div></div>
<p>Which is exactly the execution loop.</p>
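<p>If you want to check the example mechanically, here’s a small Python sketch (the edge set and weights are my reconstruction of the diagrams above, so take it as illustrative):</p>

```python
# My reconstruction of the example graph, checking that W and W' agree on
# every execution loop. Spanning tree: {ENTRY->C, A->C, A->B, C->EXIT};
# chords: ENTRY->A, B->EXIT, EXIT->ENTRY.

# Ball-Larus weights W from the first section's algorithm.
W = {
    ("ENTRY", "A"): 0, ("ENTRY", "C"): 2,
    ("A", "B"): 0, ("A", "C"): 1,
    ("B", "EXIT"): 0, ("C", "EXIT"): 0,
    ("EXIT", "ENTRY"): 0,
}

# W': spanning tree edges become 0; each chord gets the signed sum of W
# around its cycle (+ along the chord's direction, - against it).
W2 = {e: 0 for e in W}
W2[("ENTRY", "A")] = W[("ENTRY", "A")] + W[("A", "C")] - W[("ENTRY", "C")]               # -1
W2[("B", "EXIT")] = W[("B", "EXIT")] - W[("C", "EXIT")] - W[("A", "C")] + W[("A", "B")]  # -1
W2[("EXIT", "ENTRY")] = W[("EXIT", "ENTRY")] + W[("ENTRY", "C")] + W[("C", "EXIT")]      # 2

# Every execution loop (an ENTRY-to-EXIT path closed by EXIT->ENTRY) gets
# the same sum under W and W'.
loops = [["ENTRY", "A", "B", "EXIT", "ENTRY"],
         ["ENTRY", "A", "C", "EXIT", "ENTRY"],
         ["ENTRY", "C", "EXIT", "ENTRY"]]
for p in loops:
    edges = list(zip(p, p[1:]))
    assert sum(W[e] for e in edges) == sum(W2[e] for e in edges)
```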
<p>With these two steps - compute weights, optimize the locations of the zero weight edges - we can insert instructions at the edges in the given program to efficiently compute the unique sum of the path taken each time a program finishes executing. In retrospect, the algorithm to assign weights to give compact path sums seems almost obvious, but that’s more a sign that we’ve asked the right question than that the problem is really trivial.</p>Recently I’ve been going through CS 6120 from Cornell (compilers), and one of the papers listed in the course was quite interesting, namely Efficient Path Profiling. Once in a while you see a solution so neat that it almost feels like the problem was created in order to make such a solution useful; this paper gave me that feeling. This blog post will give a high level understanding of the problem, the algorithm and some intuitions, while leaving out all the technical details.Random Heap Updates Are Cheap2021-05-27T02:58:55+00:002021-05-27T02:58:55+00:00inacsb.com/random-heap-updates-are-cheap<p>A while ago I encountered an algorithmic challenge at work. Basically, the idea is that we have a bag of numbers, and we’d like to be able to update each number as well as insert and remove, and also occasionally pop the smallest number. All of these are simple and typical heap operations. But in our use case, we’re going to be updating numbers much more frequently than popping the smallest number. Recall from your data structure classes that removing from a heap costs O(log n), and updating is just removing followed by inserting, so logically if we make n updates followed by one pop min, we’re going to pay O(n log n).</p>
<p>Consider an alternative approach, where we put all numbers in an unordered array. To update, we just overwrite the old number, and to pop min we scan the array. Then, if we make n updates followed by one pop min, the total cost is now just O(n). The problem with that is that the worst case could grow to O(n²) when we pop min a lot more than expected in production. Hence the question: is there a way to do roughly O(1) work per update, but still end up with O(log n) worst case for pop min?</p>
<p>In general, this is impossible. No heap supports O(1) update, because update is strictly harder than pop min: updating the min element to infinity achieves the same effect as popping. Perhaps we can relax the requirements further to make progress. One way to do so is to assume that the updates are “random”.</p>
<p>It’s not entirely clear what the definition of random updates ought to be. To start, one reasonable definition would be that for each update, (A) an existing element is chosen uniformly at random, and (B) its updated rank is also independently chosen uniformly at random.</p>
<p>When I got to this point, I dived in and devised some complicated data structures which achieved the desired behaviors. But I later figured out that in fact the existing data structures I knew already satisfy the above requirements. Let’s take a look.</p>
<h2 id="binary-heap">Binary Heap</h2>
<p>The simplest heap in existence is the binary heap, where we have a binary tree embedded in an array. The element at index i has children at indices 2i+1 and 2i+2, and we maintain the heap property that a child must be no less than its parent. To update an element in the heap, we can just overwrite the old element in the array, and simply recursively swap elements until the heap property holds. The time complexity of update is just how many swaps we need. In the worst case, we need to make O(log n) swaps, e.g. when the min element is updated to become the max.</p>
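<p>As a rough Python sketch (names mine), an in-place update is just an overwrite followed by a sift in whichever direction the heap property is violated:</p>

```python
# Minimal sketch of updating an element of a binary min-heap in place
# (array layout as described: children of index i live at 2i+1 and 2i+2).

def update(heap, i, value):
    heap[i] = value
    # Sift up while smaller than the parent...
    while i > 0 and heap[i] < heap[(i - 1) // 2]:
        parent = (i - 1) // 2
        heap[i], heap[parent] = heap[parent], heap[i]
        i = parent
    # ...then sift down while larger than the smaller child.
    n = len(heap)
    while True:
        smallest = i
        for c in (2 * i + 1, 2 * i + 2):
            if c < n and heap[c] < heap[smallest]:
                smallest = c
        if smallest == i:
            break
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest

h = [1, 3, 2, 7, 4]
update(h, 3, 0)      # the element 7 becomes the new minimum
assert h[0] == 0
```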
<p>What about the “average” case given our assumptions of randomness? First, for updates that increase an element, we have to swap it with its children recursively. The worst case is that we have to swap it all the way down. In that case, assuming the element is randomly picked, the expected number of swaps is roughly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0 * (1/2) + 1 * (1/4) + 2 * (1/8) + 3 * (1/16) + ...
= (1/4 + 1/8 + 1/16 + ...) + (1/8 + 1/16 + ...) + (1/16 + ...) + ...
= 1/2 + 1/4 + 1/8 + ...
= 1
</code></pre></div></div>
<p>This is because roughly half the elements are already at the bottom so they never need to be swapped down, then the remaining half are one level up, and so on.</p>
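<p>A quick simulation (illustrative, not a proof) agrees: raising a random element to infinity sinks it all the way to a leaf, and the average number of swaps comes out close to 1:</p>

```python
import random

# Illustrative simulation: raise a random element of a min-heap to +infinity
# and count how many swaps it takes to sift down. Per the series above, the
# average should be about 1.

def sift_down_swaps(heap, i):
    swaps, n = 0, len(heap)
    while True:
        smallest = i
        for c in (2 * i + 1, 2 * i + 2):
            if c < n and heap[c] < heap[smallest]:
                smallest = c
        if smallest == i:
            return swaps
        heap[i], heap[smallest] = heap[smallest], heap[i]
        i = smallest
        swaps += 1

rng = random.Random(0)
n, trials, total = 1023, 2000, 0
for _ in range(trials):
    heap = sorted(rng.random() for _ in range(n))  # a sorted array is a valid heap
    i = rng.randrange(n)                           # assumption (A): uniform pick
    heap[i] = float("inf")
    total += sift_down_swaps(heap, i)
avg = total / trials
assert 0.7 < avg < 1.3  # close to the predicted expectation of ~1
```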
<p>Then, for updates that decrease the element, it takes some reasoning to see that it’s symmetric with the previous case. Say in heap H1, we’re decreasing an element at rank R1 to rank R2. After that’s done, we have H2, and if we were to change the rank back to R1, we have to do the exact same swaps to move it back to its original position (this might not be very obvious, but you can work out an example to convince yourself). Now, we claim that H1 and H2 are equally probable configurations, since the probability distribution from which we drew H1 should be invariant under random updates. Hence, the expected number of swaps needed to decrease a rank is the same as that to increase a rank, which is 1. (By the way, I feel like there ought to be a better argument. This argument relies on H1 and H2 being in the same probability distribution, which might not hold when other heap operations are carried out.)</p>
<p>All in all, randomly updating in place for a binary heap is actually O(1). In other words, binary heaps support O(1) random updates and O(log n) worst case for everything, which is exactly what we desire.</p>
<h2 id="pairing-heap">Pairing Heap</h2>
<p>That’s great, except that I only had access to a pairing heap implementation. Pairing heap is this cool data structure where we have a tree (no limit on number of children per node) that lazily rebalances itself on pop min.</p>
<p>Here’s an extremely simplified description. We start with a tree with (only) the heap property. To “meld” (combine) two trees, we just take the tree with the smaller root, and stick the other tree under that root as an immediate child. Inserting an element is melding with a tree of size 1. To pop min, we first remove the root, and now we have to merge a whole bunch of trees, which were the immediate children of the root. The naive way of melding all of them in one go will result in a bad time complexity, since we might have to go through all of them again for the next pop min. The trick is to first meld the trees in pairs, then meld all those results in (reverse) order. This cuts down the number of immediate children for the next round by at least half. Lastly, removing any given node is just: cut it out from its parent, pop min from the detached branch, then meld the rest of it back.</p>
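<p>Here’s a toy Python sketch of those operations (the representation is mine; a real implementation would be more careful about allocation and sharing):</p>

```python
# Toy pairing heap following the description above: a node is
# [value, children]; meld sticks the tree with the larger root under
# the root of the other.

def meld(a, b):
    if a is None:
        return b
    if b is None:
        return a
    if b[0] < a[0]:
        a, b = b, a
    a[1].append(b)
    return a

def insert(heap, value):
    # Inserting is melding with a tree of size 1.
    return meld(heap, [value, []])

def pop_min(heap):
    value, kids = heap
    # Two-pass pairing: meld children in pairs, then fold the pair
    # results together in reverse order.
    paired = [meld(kids[i], kids[i + 1] if i + 1 < len(kids) else None)
              for i in range(0, len(kids), 2)]
    new = None
    for t in reversed(paired):
        new = meld(t, new)
    return value, new

h = None
for x in [5, 1, 4, 2, 3]:
    h = insert(h, x)
out = []
while h is not None:
    v, h = pop_min(h)
    out.append(v)
assert out == [1, 2, 3, 4, 5]
```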
<p>The exact time complexity of all operations of pairing heap is still an open problem, but for our purposes, let’s just say insert takes O(1), and removing any node has amortized worst case O(log n). The naive way to update an element would be to remove the old value and then insert the new value. To remove a randomly picked element, the expected amount of work is proportional to the expected number of children, which is less than 1. Insert is also O(1), so in total, a random update is O(1). Note that this analysis only assumes (A).</p>
<p>Again, we get what we want: O(1) for random updates, O(log n) for amortized worst case pop min and updates.</p>
<h2 id="others">Others?</h2>
<p>While I was figuring this out, I learned that there are quite a variety of these data structures out there. Fibonacci heap used to be the poster child of being theoretically great but not practical, but these days we have rank pairing heap that achieves the same asymptotic bounds and claims to be competitive in practice as well. Aside, there are a bunch of variants of pairing heap. I’m not sure whether all these different heaps have similar properties as discussed here, but at this point I don’t care enough to find out, since most of these heaps are probably never used in real life anyway.</p>A while ago I encountered an algorithmic challenge at work. Basically, the idea is that we have a bag of numbers, and we’d like to be able to update each number as well as insert and remove, and also occasionally pop the smallest number. All of these are simple and typical heap operations. But in our use case, we’re going to be updating numbers much more frequently than popping the smallest number. Recall from your data structure classes that removing from a heap costs O(log n), and updating is just removing followed by inserting, so logically if we make n updates followed by one pop min, we’re going to pay O(n log n).Building an AVL Tree From a Sorted Sequence in One Pass2020-10-25T01:10:07+00:002020-10-25T01:10:07+00:00inacsb.com/building-an-avl-tree-from-a-sorted-sequence-in-one-pass<p>Recently I came across the function Map.of_increasing_sequence in the <a href="https://github.com/janestreet/base/blob/master/src/map.ml">base library</a> of OCaml. It might sound like a very simple and common function, but the implementation is actually quite cool. Let’s dive in. (Spoiler: it’s related to a weird number system.)</p>
<h2 id="first-impressions">First Impressions</h2>
<p>A sequence is like an iterator - you can either get the next value or reach the end. A map is an immutable AVL tree of key-value pairs. An AVL tree is a binary tree with two properties - each node is larger than all nodes in its left subtree and smaller than all nodes in its right, and at every node, the heights of the two subtrees differ by at most 1. From an algorithmic standpoint, making a BST from a sorted array is trivial. You can achieve <code class="language-plaintext highlighter-rouge">O(n)</code> time complexity, and <code class="language-plaintext highlighter-rouge">O(log(n))</code> space excluding the return value, just by simple recursion. But you can’t do that with a sequence, since sequences don’t permit random access. Now, we can always turn the sequence into an array first, but that would require <code class="language-plaintext highlighter-rouge">O(n)</code> extra space. Can we do better?</p>
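<p>For reference, the “trivial” array version might look like this (a Python sketch, not the OCaml library’s code):</p>

```python
# The trivial sorted-array construction, for contrast: simple recursion,
# O(n) time and O(log(n)) stack space.

def of_sorted_array(xs, lo=0, hi=None):
    if hi is None:
        hi = len(xs)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    # Node = (left, key, right). The two halves differ in size by at most 1,
    # so their heights do too, giving a valid AVL tree at every node.
    return (of_sorted_array(xs, lo, mid), xs[mid], of_sorted_array(xs, mid + 1, hi))

def height(t):
    return 0 if t is None else 1 + max(height(t[0]), height(t[2]))

t = of_sorted_array(list(range(100)))
assert height(t) == 7  # the minimum possible: ceil(log2(101))
```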
<p>It turns out the library function is implemented with only one pass through the sequence, using only <code class="language-plaintext highlighter-rouge">O(log(n))</code> extra space. The code had almost no documentation, and I couldn’t find any description online (although I didn’t try very hard), so here I’ll attempt to motivate and derive the algorithm.</p>
<h2 id="first-attempt-try-to-build-a-tree">First Attempt: Try to Build a Tree</h2>
<p>Imagine yourself with the task of building a balanced tree and are handed one number at a time, and you need to incrementally build a BST as quickly as possible. That might look like this.</p>
<p>I get a 1 - that’s easy. I’ll have a tree with one node.</p>
<p>2 - OK, that’s bigger than 1, I can make that the new root, and make 1 the left child.</p>
<p>3 - Let’s put that as the right child. So far so good.</p>
<p>4 - Hmm, maybe we could make that the new root, and have the tree rooted at 2 as the left child?</p>
<p>5, 6, 7 - That looks like 1, 2, 3 all over again, we can put those in the right subtree. Now we have a complete BST, looking good!</p>
<p>8 - That’s awkward again, let’s just say it’s the new root again.</p>
<p>9-15 - Looking like 1-7 again…</p>
<p>Maybe you can see the recursive pattern here. This looks like a procedure that produces reasonably balanced trees, and the height is always bounded by <code class="language-plaintext highlighter-rouge">O(log(n))</code>. That seems good, right?</p>
<p>It would be acceptable, but only if your map library had just two operations - build the tree, then look up values and never change it again. The problem is, the BST has to support adds and deletes as well. The most common BST types, like red-black trees and AVL trees, all have their own invariants, and unfortunately we’re not meeting those standards with our almost-balanced trees. For example, at step 8, we have a root (8), a left subtree of size 7, and an empty right subtree. That’s not a valid red-black tree or AVL tree, and you can’t just return that, since that would break the rest of your library. How can we fix this?</p>
<h2 id="second-attempt-build-branches-instead">Second Attempt: Build Branches Instead</h2>
<p>One might think that we could make this work somehow with some clever ordering of insertions to the tree. But the fundamental issue here is that with only one tree, we can never change the root - at the moment we make the newest element the root, the tree must become heavily imbalanced. So perhaps we can instead maintain a bag of branches, and quickly assemble them into a tree when we hit the end of the sequence.</p>
<p>Here, the defining characteristic of a branch would be its composability. Let’s define a branch (called a fragment in the source code) like the trees we had in step 4 and 8 in the last attempt. In other words, <strong>a branch would be a tree with a complete left subtree and an empty right subtree</strong>. To merge two branches into a tree, you could put one branch as another branch’s right child. A branch of height n has 2^n nodes.</p>
<p>You could also merge two branches of the same size into a new branch. Here’s one way to do it: to merge X with Y, take the left subtree of branch Y and move it to the right subtree of branch X. Now X becomes a complete binary tree, and you can set it to be the left subtree of Y. This fits our branch definition, while also preserving order in the tree.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Merge branches as tree:
  X     Y        X
 /   + /    =   / \
A     B        A   Y
                  /
                 B

Merge branches as branch:
  X     Y        Y
 /   + /    =   /
A     B        X
              / \
             A   B
</code></pre></div></div>
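<p>In code, the two merges are just a bit of pointer surgery. A Python sketch with my own representation - a branch is <code class="language-plaintext highlighter-rouge">(root, left_subtree)</code>, and a tree is <code class="language-plaintext highlighter-rouge">(key, left, right)</code> or <code class="language-plaintext highlighter-rouge">None</code>:</p>

```python
# Sketch of the two merge operations, using my own representation:
# branch = (root, left_subtree); tree = (key, left, right) or None.

def merge_as_tree(x, y):
    # Branch y fills branch x's empty right slot (all of x < all of y).
    (xr, xl), (yr, yl) = x, y
    return (xr, xl, (yr, yl, None))

def merge_as_branch(x, y):
    # Move y's left subtree under x's empty right slot; the now-complete
    # tree x becomes the left subtree of y.
    (xr, xl), (yr, yl) = x, y
    return (yr, (xr, xl, yl))

def inorder(t):
    return [] if t is None else inorder(t[1]) + [t[0]] + inorder(t[2])

b12 = merge_as_branch((1, None), (2, None))  # branches {1}, {2} -> branch of {1,2}
b34 = merge_as_branch((3, None), (4, None))  # branches {3}, {4} -> branch of {3,4}
b = merge_as_branch(b12, b34)                # two size-2 branches -> branch of {1..4}
t = merge_as_tree(b, (5, None))              # attach branch {5} to finish a tree
assert inorder(t) == [1, 2, 3, 4, 5]         # order is preserved throughout
```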
<p>Now let’s try again:</p>
<p>1 - One node by itself is a branch.</p>
<p>2 - We could make 2 a branch and merge with 1.</p>
<p>3 - We could have 3 be its own branch. Now we have branches of size 2 and 1.</p>
<p>4 - Let’s merge 4 with 3, and then we have 2 branches of size 2. We could merge those two again into one branch.</p>
<p>5 - That’s a new branch.</p>
<p>6 - Add that to 5’s branch…</p>
<p>That starts to look recursive again. Now we always have a bunch of branches. And when we need to generate the final tree, we could just iteratively merge them together, from small to large. Is that good enough?</p>
<h2 id="we-are-building-a-binary-number">We Are Building a Binary Number</h2>
<p>It’s not. To see this, we can frame this algorithm a bit more abstractly.</p>
<p>Consider the heights of our branches. For each integer <code class="language-plaintext highlighter-rouge">n</code>, each time we can only have either <code class="language-plaintext highlighter-rouge">0</code> or <code class="language-plaintext highlighter-rouge">1</code> branch of height <code class="language-plaintext highlighter-rouge">n</code>, because once we have <code class="language-plaintext highlighter-rouge">2</code>, we merge them together. We can visualize the branch building in this table.</p>
<table>
<thead>
<tr>
<th># branches of size 4</th>
<th># branches of size 2</th>
<th># branches of size 1</th>
<th># Nodes</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>3</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>4</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>5</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>6</td>
</tr>
</tbody>
</table>
<p>Our branch sizes correspond to the binary representation of the total tree size.</p>
<p>From here, we can see that we’re abstractly incrementing a binary counter. Now the problem is we can have a lot of gaps, or 0s, in the binary number. For example, for 17 nodes, we’ll have a large branch of size 16 and a small branch of 1 node. Now if the sequence terminates, we’ll have to merge those branches into a tree - but that again will be a heavily one-sided tree.</p>
<h2 id="third-and-final-attempt-keep-2-branches-at-each-level">Third and Final Attempt: Keep 2 Branches at Each Level</h2>
<p>Since gaps are causing us problems, maybe we could just, like, not have them. And in fact we could. This is hinted at in the code - “using skew binary encoding”. (Although from Wikipedia, skew binary system actually refers to a slightly different definition.)</p>
<p>In this new “binary” encoding, we could use the digits <code class="language-plaintext highlighter-rouge">0</code>, <code class="language-plaintext highlighter-rouge">1</code> and <code class="language-plaintext highlighter-rouge">2</code>, as opposed to just <code class="language-plaintext highlighter-rouge">0</code> and <code class="language-plaintext highlighter-rouge">1</code>. Each position in the number would still have the same weight. So for example, <code class="language-plaintext highlighter-rouge">212 = 2*4 + 1*2 + 2 = 12</code>. Here’s how to count in this new number system.</p>
<table>
<thead>
<tr>
<th># branches of size 4</th>
<th># branches of size 2</th>
<th># branches of size 1</th>
<th># Nodes</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>1</td>
<td>3</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>2</td>
<td>4</td>
</tr>
<tr>
<td>0</td>
<td>2</td>
<td>1</td>
<td>5</td>
</tr>
<tr>
<td>0</td>
<td>2</td>
<td>2</td>
<td>6</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>7</td>
</tr>
</tbody>
</table>
<h3 id="counting-in-the-new-system">Counting in the new system</h3>
<p>Basically, to add one, we flip <code class="language-plaintext highlighter-rouge">0</code> to <code class="language-plaintext highlighter-rouge">1</code> and <code class="language-plaintext highlighter-rouge">1</code> to <code class="language-plaintext highlighter-rouge">2</code>, but we flip <code class="language-plaintext highlighter-rouge">2</code> back to <code class="language-plaintext highlighter-rouge">1</code> and carry forward. There are never any <code class="language-plaintext highlighter-rouge">0</code>s in any number.</p>
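<p>A small Python sketch of the counter (digits stored least significant first; my representation, not the library’s):</p>

```python
# Counting in the 1/2 digit system. To add one: each 2 turns into a 1 and
# carries into the next position (two branches merging a level up), then
# the first non-2 digit is bumped.

def increment(digits):
    digits = list(digits)
    i = 0
    while i < len(digits) and digits[i] == 2:
        digits[i] = 1
        i += 1
    if i == len(digits):
        digits.append(1)
    else:
        digits[i] += 1
    return digits

def value(digits):
    return sum(d * 2 ** i for i, d in enumerate(digits))

digits = []
for n in range(1, 101):
    digits = increment(digits)
    assert value(digits) == n
    assert all(d in (1, 2) for d in digits)  # never any 0s: no gaps
```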
<p>Translating this back to our branch building, that means we don’t merge when we have two branches. We merge when we have three - and we merge the older two branches, “carrying” it forward to the next level, and always keep one branch for each height.</p>
<p>Let’s convince ourselves that at every point in the process, we can merge all branches and end up with a valid AVL tree (i.e. the algorithm is correct).</p>
<p>Say we are n steps into the branch building process, and we have to make a tree. We can convert n into a string of 1 and 2 in this number system. Starting from the least significant digit, we either have 1 or 2. Together, that gives us either a tree of height 1 or 2. Moving onto the next digit, we again have either 1 or 2 branches of size 2. At the max, we have 3 branches, each of size 2. If we first merge the left two branches into a branch, then merge that with the right branch to a tree, that leaves us with a maximum tree height of 3 (while minimum is 2). At the next level, we can at max have 3 branches of height 3. Similarly, we end up with a tree with height between 3 and 4.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Illustrating the 222 case, with 14 nodes.

Level 1:
   13 + 14
   -------
     14
    13

Level 2:
   10  +  12  +  14
   09     11     13
   ----------------
         12
      10    14
      09 11 13

Level 3:
     04       08         12
     02   +   06   +  10    14
    01 03    05 07    09 11 13
    --------------------------
               08
        04            12
     02    06      10    14
    01 03 05 07    09 11 13
</code></pre></div></div>
<p>In this process, we always create trees that preserve order. And after level <code class="language-plaintext highlighter-rouge">n</code>, the tree that we end up with always have height <code class="language-plaintext highlighter-rouge">n</code> or <code class="language-plaintext highlighter-rouge">n+1</code>. That satisfies the AVL tree invariant that two subtrees should have heights differ by at most <code class="language-plaintext highlighter-rouge">1</code>.</p>
<h2 id="thats-it">That’s It!</h2>
<p>Now we should be reasonably convinced that this algorithm produces valid BST. But there were a lot of details that were glossed over. To be completely rigorous, we would need to formalize the observations into claims, and prove them by induction. But I believe this process captures the key ideas already.</p>
<p>We also skipped the time/space complexity discussion. There are two slightly nontrivial details here. First, each insertion could lead to a cascade of branch merges (or carries), so we need to argue that insertion has an amortized cost of <code class="language-plaintext highlighter-rouge">O(1)</code>. Then, we need to realize that the final tree merging takes <code class="language-plaintext highlighter-rouge">O(log(n))</code> branches, and each tree merge is <code class="language-plaintext highlighter-rouge">O(1)</code> as well. As an aside, this number system has a unique representation for each number, which is perhaps not totally obvious.</p>
<p>I am not able to identify the inventor of this algorithm. It doesn’t seem particularly likely that the author of this code was also the inventor.</p>
<p>There are still some aspects of that source file that I don’t quite understand, but is perhaps not closely related. In particular, the invariant is that subtrees have heights differing by at most <code class="language-plaintext highlighter-rouge">2</code>, not <code class="language-plaintext highlighter-rouge">1</code>, like normal AVL trees. Maybe I’ll find out why another day.</p>Recently I came across the function Map.of_increasing_sequence in the base library of OCaml. It might sound like a very simple and common function, but the implementation is actually quite cool. Let’s dive in. (Spoiler: it’s related to a weird number system.)What I Learned in Two Years’ Tech Work in Finance2020-10-11T14:30:13+00:002020-10-11T14:30:13+00:00inacsb.com/what-i-learned-in-two-years-tech-work-in-finance<p>It’s been two years since I started working full time as a software engineer. As I accumulated experience, I became a lot more hesitant to write, because I start to feel that I can’t contribute anything new on top of what everyone else already knows. And I even felt bad for having written some of the old posts, since they now seem quite silly and naive.</p>
<p>In some sense those feelings must reflect a lot of truth, but that still shouldn’t stop me from writing. Perfect is the enemy of good, and if I wait until I know everything, I’ll never write again; hence this post. Random thoughts will be laid out in no particular order.</p>
<h2 id="over-engineering">Over-engineering</h2>
<p>One mistake that I’ve repeated is to optimize prematurely. As a recent college grad having done a lot of brain teasers, it’s really tempting to work clever algorithms into the job. But a lot of the time, it’s just unnecessary. In coding competitions, we are only rewarded for writing correct, fast and short programs. Nothing else matters. In professional work, we need to add a few more terms to the equation - the cost of human effort (to write, test and review the code), flexibility for future modifications, and simplicity of the solution. A simple solution that gets 90% of the cases right is perhaps even better than a complicated solution that gets 99.99%, if humans can much more easily understand the failure modes and manually fix things in the former case. After all, the alternative is to spend a lot of time debugging when the one edge case happens and breaks the system.</p>
<p>I think this is important enough to deserve an emphasis - simplicity is valuable. A prediction system that gives you a slightly inaccurate number in a predictable way is much better than one that gives you a more accurate magic number that no one understands. In financial markets, complexity creeps in wherever competition is fierce, but the simplicity of many models would probably still be surprising to outsiders (hint: it’s not all machine learning).</p>
<p>There’s another form of short-sighted cleverness, which is to tweak the system very slightly to achieve what I want without learning about the whole system or understanding the full consequences of those tweaks. The smallest diff isn’t necessarily the best fix. Adding a patch that isn’t well thought out and that doesn’t fit in with the rest of the code is just incurring tech debt. Perhaps we could call this under-engineering.</p>
<h2 id="dont-pretend-you-understand">Don’t Pretend You Understand</h2>
<p>One thing that is common, perhaps more so among newer folks, is to pretend to understand, whether listening to a conversation or getting answers from teammates. Having been on both sides, I believe that this is a really bad habit. Of course it is only human nature to hide your inexperience, and I’ve also heard criticism about people asking for help before spending enough time on the issues themselves. But no, I think a lot of the time, asking questions eagerly is way more productive overall, especially for new teammates.</p>
<p>I say this multiple times to my interns: if a problem can possibly take you half an hour to figure out, and I already know the answer, then the decision here is between one minute of my time versus thirty minutes of yours. Is my time 30x more valuable? I wish! If it’s a tough question and the mentor doesn’t already know the answer, then it seems even sillier for the intern to struggle alone for a long time.</p>
<p>There are times when I’m answering questions from newer folks, and I know that there’s no way they understood some of the statements I made. Yet they still nod and react as if they did. Invariably they return in a few days with the same questions. This is counterproductive.</p>
<p>This problem is less common, although perhaps more serious for more experienced people, since the pride has built up. I’ve told myself “this is something I should know by now” multiple times and stopped myself from asking my coworkers. But the truth is no one knows every corner of the system, and everyone knows that.</p>
<p>There is no shame in not knowing things. People expect that. Just ask.</p>
<h2 id="aligned-incentives">Aligned Incentives</h2>
<p>One idiosyncrasy about the finance industry is that a lot of compensation comes in the form of variable and unpredictable year end bonuses, as compared to stock offerings in tech companies. Of course people like certainty. But I think there’s a case for an opaque process of reward in the form of bonuses.</p>
<p>From first principles, employees have their own incentives, and those usually aren’t the same as the company’s goals. Whenever incentives diverge or even conflict, we can get serious issues.</p>
<p>There are countless examples in real life. One example in finance is the reward curve for some hedge funds. Hedge funds are roughly companies that take investor money and help them pick investments. Some hedge funds collect a fixed fee, plus a significant fraction in additional return. That means when the investments increase in value, they make more money. That’s good incentive, right? The problem is when the investments lose money, they aren’t affected - they still take the fixed fee, regardless of how much was lost. (This is not entirely accurate, since they will lose clients if they keep losing money.) Therefore the funds will tend to make riskier investments. If your investment is going to make on average 10% a year, you’d rather make 100% this year and lose ~80% next year, as you can collect a much larger fee. This is worse for the clients.</p>
<p>Now let’s look at employee compensation. We want to reward employees in a way that encourages them to help the company make more money (assuming making money is the goal of the company). One thing we can do is to measure everyone’s contribution, and reward accordingly. For example, we could measure hours spent in the office, or number of lines of code written, or survey people for their estimates of their teammates’ contributions, etc. The problem is that these are only proxy measurements, and once you start measuring them, people will optimize for the proxies instead of the actual goal. If you measure lines of code written, you’ll encourage verbose and redundant code; if you measure hours in the office, people will stay longer but not necessarily work at the same speed, and so on.</p>
<p>But the problem is fundamental - you can’t measure the actual contribution and hard work, and by measuring proxies you’ll encourage cynical behavior. A fix, if not a complete solution, is to obfuscate the reward function. If I tell you that I’ll give you an unknown amount of money by the end of the year based on How Well You Did™, and let you fill in the rest, then you won’t be encouraged to write bad code, or to only focus on projects that had Impact, or other things bad for the firm.</p>
<p>I feel that this has worked quite well in my company. But there are a lot of assumptions for this to work. One is that employees have to be OK with not knowing how much they’ll make. There also needs to be a lot of trust between employees and managers, so that employees can trust that they’ll be evaluated fairly by the end of the year.</p>
<h2 id="and-more">And More…</h2>
<p>This post is getting long and messy, so maybe I’ll call it a day for now. There are a lot of smaller lessons that come from trading and recruiting. Trading is arguably the best arena to hone one’s rational decision making skills, and interesting stories come up every now and then. Maybe I’ll write a follow up some day.</p>It’s been two years since I started working full time as a software engineer. As I accumulated experience, I became a lot more hesitant to write, because I start to feel that I can’t contribute anything new on top of what everyone else already knows. And I even felt bad for having written some of the old posts, since they now seem quite silly and naive.Thoughts on Fooled by Randomness2019-10-05T16:57:40+00:002019-10-05T16:57:40+00:00inacsb.com/thoughts-on-fooled-by-randomness<p>Just finished Nassim Nicholas Taleb’s well-known book, Fooled by Randomness. Here are some brief thoughts, in no particular order.</p>
<h2 id="the-birthday-irony">The Birthday Irony</h2>
<p>Despite the author’s years working in trading and writing a book on probability, in one of the few cases where he did actual math, he did it wrong. Here’s the original:</p>
<blockquote>
<p>If you meet someone randomly, there is a one in 365.25 chance of your sharing their birthday, and a considerably smaller one of having the exact birthday of the same year.</p>
<p>Nassim Nicholas Taleb, Fooled by Randomness</p>
</blockquote>
<p>It seems like he was trying to say - on average, there are <code class="language-plaintext highlighter-rouge">365.25</code> days in a year (a first order approximation accounting for leap years), so you have a <code class="language-plaintext highlighter-rouge">1/365.25</code> chance of sharing a birthday with a random person.</p>
<p>If you do the math though, here’s the actual probability: every four years (<code class="language-plaintext highlighter-rouge">365 * 4 + 1 = 1461</code> days), there are <code class="language-plaintext highlighter-rouge">1460</code> days on which your probability of sharing a birthday is <code class="language-plaintext highlighter-rouge">4/1461</code>, and <code class="language-plaintext highlighter-rouge">1</code> day on which it is <code class="language-plaintext highlighter-rouge">1/1461</code>. So the probability is <code class="language-plaintext highlighter-rouge">1460/1461 * 4/1461 + 1/1461 * 1/1461 = 1/365.44</code>. That’s far enough off from <code class="language-plaintext highlighter-rouge">365.25</code> that you can’t really say “I just made a first order approximation”.</p>
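<p>The exact value is quick to double-check with rational arithmetic. Here is a small Python sketch of mine (not from the book), using the same four-year cycle:</p>

```python
from fractions import Fraction

# In a 4-year cycle of 365 * 4 + 1 = 1461 days, a random person's birthday
# falls on a given regular date with probability 4/1461, and on the leap
# day with probability 1/1461.
p_regular = Fraction(4, 1461)
p_leap = Fraction(1, 1461)

# Two people share a birthday if they collide on one of the
# 365 regular dates, or both land on the leap day.
p_share = 365 * p_regular ** 2 + p_leap ** 2

print(float(1 / p_share))  # ~365.44, noticeably off from 365.25
```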
<p>To fully understand this error, let’s say there is one extra day every <code class="language-plaintext highlighter-rouge">n</code> years, instead of every <code class="language-plaintext highlighter-rouge">4</code>. Then the number, instead of <code class="language-plaintext highlighter-rouge">365.25</code> or <code class="language-plaintext highlighter-rouge">365.44</code>, will be <code class="language-plaintext highlighter-rouge">((365n^2 + 1)/(365n+1)^2)^-1</code>. After taking Taylor series expansions, we get <code class="language-plaintext highlighter-rouge">365 + 2/n - 364/(365n^2) + O(n^-3)</code>, or <code class="language-plaintext highlighter-rouge">365 + 2/n + O(n^-2)</code>, instead of the <code class="language-plaintext highlighter-rouge">365 + 1/n</code> that the author had guessed.</p>
<p>Let’s spend a little time building intuition for why it’s <code class="language-plaintext highlighter-rouge">365 + 2/n</code> instead of <code class="language-plaintext highlighter-rouge">365 + 1/n</code>. Consider Alice and Bob, and a year of exactly <code class="language-plaintext highlighter-rouge">365</code> days. Then the chance of sharing a birthday is <code class="language-plaintext highlighter-rouge">1</code> in <code class="language-plaintext highlighter-rouge">365</code>. Now say we add <code class="language-plaintext highlighter-rouge">x</code> days to Bob’s calendar only, so Bob’s birthday has <code class="language-plaintext highlighter-rouge">365+x</code> possible choices while Alice still has <code class="language-plaintext highlighter-rouge">365</code>. Then, the probability that they have the same birthday is <code class="language-plaintext highlighter-rouge">1</code> in <code class="language-plaintext highlighter-rouge">365+x</code>. At this point, it is clear that if we also add <code class="language-plaintext highlighter-rouge">x</code> days to Alice’s calendar, the chance of sharing a birthday goes down further, so we already know the author’s estimate is too high. Now add <code class="language-plaintext highlighter-rouge">x</code> to Alice’s calendar. If <code class="language-plaintext highlighter-rouge">x</code> is small, we can ignore the probability that their shared birthday falls on one of the <code class="language-plaintext highlighter-rouge">x</code> added days (that probability is second order). Then, approximately, the probability of sharing a birthday is <code class="language-plaintext highlighter-rouge">365/(365+x)^2</code>, which is close to <code class="language-plaintext highlighter-rouge">1/(365 + 2x)</code>, again ignoring the second order term. Substituting <code class="language-plaintext highlighter-rouge">1/n</code> for <code class="language-plaintext highlighter-rouge">x</code>, we arrive at the desired result. The factor of <code class="language-plaintext highlighter-rouge">2</code> comes from the fact that we added a leap day not only to Bob’s calendar, but also to Alice’s.</p>
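<p>The general formula is also easy to sanity-check numerically. A quick sketch of mine (the helper name is made up, not from the book):</p>

```python
from fractions import Fraction

def effective_days(n):
    """Reciprocal of the probability that two random people share a
    birthday, given one leap day every n years (a cycle of 365n + 1 days)."""
    regular = Fraction(n, 365 * n + 1)  # probability of a given regular date
    leap = Fraction(1, 365 * n + 1)     # probability of the leap day
    return 1 / (365 * regular ** 2 + leap ** 2)

# The exact value tracks 365 + 2/n, not the guessed 365 + 1/n.
for n in [4, 10, 100]:
    print(n, float(effective_days(n)), 365 + 2 / n)
```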
<p>Anyway, on a higher level, the lesson is that you should fully justify your simplifying assumptions, instead of jumping to conclusions.</p>
<h2 id="wittgensteins-ruler">Wittgenstein’s Ruler</h2>
<p>This idea had never explicitly come to my mind, so I thought it was interesting. It says something like: if you don’t have a reliable ruler, then when you use it to measure a table, you might just as well be measuring your ruler with the table. One example he mentioned was that some people in finance claimed a ten sigma event had happened. Applying the principle - if you measured a ten sigma event, your ruler (mathematical model) is probably seriously flawed.</p>
<p>One takeaway from this is that statistics is merely a language to simplify and describe the real world; the world does not run according to its rules. It would be ridiculous to plot data points under a bell curve, and then say the world is wrong when a new data point doesn’t fit under it.</p>
<p>Another way of saying the same thing is conditional probability. Relevant xkcd: <a href="https://www.xkcd.com/1132/">https://www.xkcd.com/1132/</a></p>
<p>One way I’ve seen it play out in real life is the current political situation in Hong Kong. Say there’s a certain probability that a citizen goes nuts and riots in the street, and a certain probability that the government has done something terribly wrong. If you have very few people rioting, then the ruler tells you that those few are probably at fault. But if you have a majority of citizens supporting the riots or rioting, then those people become the ruler, and you’re measuring the government.</p>
<h2 id="think-about-all-possibilities">Think about All Possibilities</h2>
<p>One very valid point in the book is that you should think of the world as taking one sample path in infinitely many possibilities. When you evaluate an outcome, you should think of all the things that could have happened. For example, if your friend did a thing and made a huge success, it doesn’t mean he made a good decision or that you should’ve done the same, or even that you should follow suit. We have only one data point, you don’t know what the probability distribution looks like. Maybe he could have lost it all. When you think about all that could have happened, you will have less jealousy to the lucky and more sympathy to the unfortunate.</p>
<h2 id="happiness-is-relative">Happiness is Relative</h2>
<p>This is a tangential point to randomness, but still important to keep in mind. Given that your basic human needs are fulfilled, your happiness often doesn’t depend on how much you have, but on how much more you have compared to those around you. More generally, it’s not the absolute level of well-being that matters, but the changes. So to be happy, don’t be the medium fish in the big pond; go to the small pond and be king. If you start out at the top, tough luck, because chances are your status will revert to the mean over time.</p>
<h2 id="limit-your-loss">Limit Your Loss</h2>
<p>If there’s one actionable item from the book, it’s to always remember to limit your worst case scenario. Between a steady increase in personal well-being with no risk of going bankrupt, and more income but also a chance of losing everything, you should prefer the former, because eventually the unfortunate event will happen. That’s ergodicity - any event with a nonzero probability will eventually happen, mathematically speaking.</p>
<h2 id="the-authors-conspicuous-faults">The Author’s Conspicuous Faults</h2>
<p>I believe most readers will often find the author’s comments controversial and provocative, if not arrogant and overgeneralizing. There’s a bunch of stuff he said that is just plain wrong.</p>
<p>He said in the beginning of the book that he didn’t rewrite according to his editor’s suggestions, because he didn’t want to hide his personal shortcomings. But the point of a nonfiction book that is non-autobiographical is not to convey who you are, but to give readers inspiration and positive influence. If you say a bad thing in the book that you believe in, you’re not “being true”, you’re a bad influence! I don’t know exactly what he was referring to, but I suspect it includes the following points.</p>
<p>He’s exceptionally arrogant, way off the charts. You’ll see him saying things like “I know nothing about this, despite having read a lot into it” and “I know nothing, but I am the person that knows the most about knowing nothing”. He just couldn’t write one sentence that ends in a defeated tone. Before he puts a period down, he must add another clause to the sentence to remind the readers that he’s just being humble, he didn’t mean it. It’s quite funny when you look for it.</p>
<p>He also loves stereotyping people to the extreme. He would say things like “journalists are born to be fooled by randomness”, “MBAs don’t know what they’re doing”, “company executives don’t have visible skills” and “economists don’t understand this whatever concept”. One thing he said in the beginning of the book was that he didn’t need data to back up his claims, because he’s only doing “thought experiments”. I think he mistook that for “unfounded personal opinions”. When you make claims about journalists and economists being dumb, that’s hardly a thought experiment. You absolutely need to back up your claims.</p>
<p>Overall, this book has some good ideas, but not that many. If you already have a decent background in math, maybe you can skip this book without harm.</p>Just finished Nassim Nicholas Taleb’s well-known book, Fooled by Randomness. Here are some brief thoughts, in no particular order.Fast RNG in an Interval2019-10-03T00:14:55+00:002019-10-03T00:14:55+00:00inacsb.com/fast-rng-in-an-interval<p><a href="https://arxiv.org/abs/1805.10941">https://arxiv.org/abs/1805.10941</a> - Fast Random Integer Generation in an Interval</p>
<p>Just read this interesting little paper recently. The original paper is already quite readable, but perhaps I can give a more readable write-up.</p>
<h2 id="problem-statement">Problem Statement</h2>
<p>Say you want to generate a random number in the range <code class="language-plaintext highlighter-rouge">[0, s)</code>, but you only have a random number generator that gives you a random number in the range <code class="language-plaintext highlighter-rouge">[0, 2^L)</code> (inclusive, exclusive). One simple thing you can do is to generate a number <code class="language-plaintext highlighter-rouge">x</code>, divide it by <code class="language-plaintext highlighter-rouge">s</code> and take the remainder. Another thing you can do is to scale the range down, computing something like <code class="language-plaintext highlighter-rouge">x * (s / 2^L)</code> with floating point math, casting, whatever works. Both ways will give you a resulting integer in the specified range.</p>
<p>But these are not “correct”, in a sense that they don’t generate random integers with a uniform distribution. Say, <code class="language-plaintext highlighter-rouge">s = 3</code> and <code class="language-plaintext highlighter-rouge">2^L = 4</code>, then you will always end up with one number being generated with probability 1/2, the other two numbers 1/4. Given 4 equally likely inputs, you just cannot convert that to 3 cases with equal probability. More generally, these simple approaches cannot work when s is not a power of 2.</p>
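<p>The bias in the toy case is easy to see by brute force. Here is a small sketch of mine (not from the paper) enumerating all inputs for <code class="language-plaintext highlighter-rouge">s = 3</code>, <code class="language-plaintext highlighter-rouge">2^L = 4</code>:</p>

```python
from collections import Counter

s, L = 3, 2  # target range [0, 3), generator range [0, 4)

# Both naive approaches map 4 equally likely inputs onto 3 outputs,
# so one output is necessarily hit twice as often as the others.
mod_counts = Counter(x % s for x in range(2 ** L))
scale_counts = Counter((x * s) >> L for x in range(2 ** L))

print(mod_counts)    # Counter({0: 2, 1: 1, 2: 1})
print(scale_counts)  # Counter({0: 2, 1: 1, 2: 1})
```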
<h2 id="first-attempt-at-fixing-statistical-biases">First Attempt at Fixing Statistical Biases</h2>
<p>To fix that, you will need to reject some numbers and try again. Like in the above example, when you get the number 3, you shuffle again, until you get any number from 0 to 2. Then, all outcomes are equally likely.</p>
<p>More generally, you need to throw away <code class="language-plaintext highlighter-rouge">2^L mod s</code> numbers, so that the rest will be divisible by <code class="language-plaintext highlighter-rouge">s</code>. Let’s call that number <code class="language-plaintext highlighter-rouge">r</code>, for remainder. So you can throw away the first <code class="language-plaintext highlighter-rouge">r</code> numbers and use the first approach of taking remainder, as shown in this first attempt (pseudocode):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>r = (2^L - s) mod s // 2^L is too large, so we subtract s
x = rand()
while x < r do
    x = rand()
return x mod s
</code></pre></div></div>
<p>That’s a perfectly fine solution, and in fact it has been used in some popular standard libraries (e.g. GNU C++). However, division is a slow operation compared to others like multiplication, addition and branching, and in this function we are always doing two divisions (mod). If we can somehow cut down on our divisions, our function may run a lot faster.</p>
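<p>For concreteness, here is how that first attempt might look in Python (my own sketch, not from the paper; <code class="language-plaintext highlighter-rouge">rand</code> is assumed to return a uniform integer in <code class="language-plaintext highlighter-rouge">[0, 2^L)</code>):</p>

```python
import random

def bounded_rand_classic(rand, L, s):
    """Uniform integer in [0, s): reject the first 2**L mod s outputs
    of rand() so that the remaining count is divisible by s."""
    r = (2 ** L - s) % s  # equals 2**L mod s, but avoids needing L+1 bits
    x = rand()
    while x < r:
        x = rand()
    return x % s

# toy example: a 2-bit generator mapped onto [0, 3)
rand4 = lambda: random.randrange(2 ** 2)
sample = [bounded_rand_classic(rand4, 2, 3) for _ in range(9)]
print(sample)  # nine uniform draws from {0, 1, 2}
```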
<h2 id="reducing-number-of-divisions">Reducing number of divisions</h2>
<p>It turns out we can do just that, with just a simple twist. Instead of getting rid of the first <code class="language-plaintext highlighter-rouge">r</code> numbers, we get rid of the last <code class="language-plaintext highlighter-rouge">r</code> numbers. And we can verify whether <code class="language-plaintext highlighter-rouge">x</code> is in the last <code class="language-plaintext highlighter-rouge">r</code> numbers like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x = rand ()
x_mod_s = x mod s
while x - x_mod_s > 2^L - s do
    x = rand ()
    x_mod_s = x mod s
return x_mod_s
</code></pre></div></div>
<p>The greater-than comparison on line 3 is a little tricky. It’s mathematically the same as comparing <code class="language-plaintext highlighter-rouge">x - x_mod_s + s</code> with <code class="language-plaintext highlighter-rouge">2^L</code>, but we do this instead because you can’t express <code class="language-plaintext highlighter-rouge">2^L</code> with <code class="language-plaintext highlighter-rouge">L</code> number of bits. So basically, the check is saying if the next multiple of <code class="language-plaintext highlighter-rouge">s</code> after <code class="language-plaintext highlighter-rouge">x</code> is larger than <code class="language-plaintext highlighter-rouge">2^L</code>, then <code class="language-plaintext highlighter-rouge">x</code> is in the last <code class="language-plaintext highlighter-rouge">r</code> numbers and must be thrown away. We never actually calculate <code class="language-plaintext highlighter-rouge">r</code>, but with a little cleverness we manage to do the same check.</p>
<p>How many divisions are we doing here? Well, at least one on line 2, and possibly more, depending on how many times the loop runs. Since we’re rejecting fewer than half of the possible outcomes (we’re at least keeping <code class="language-plaintext highlighter-rouge">s</code> and at most rejecting <code class="language-plaintext highlighter-rouge">s - 1</code>), we have at least a 1/2 chance of breaking out of the loop each time, which means the expected number of extra loops is at most 1 (<code class="language-plaintext highlighter-rouge">0 * 1/2 + 1 * 1/4 + 2 * 1/8 + ... = 1</code>). So we know the expected number of divisions is at worst 2, equal to that of the previous attempt. But most of the time, the expected number is a lot closer to 1 (e.g. when <code class="language-plaintext highlighter-rouge">s</code> is small), so this can theoretically be almost a 2x speedup.</p>
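<p>In Python, the same trick might look like this (again a sketch of mine; the comparison mirrors line 3 of the pseudocode):</p>

```python
def bounded_rand_fewer_divs(rand, L, s):
    """Uniform integer in [0, s): reject the *last* 2**L mod s values,
    reusing x mod s both for the rejection test and for the result."""
    x = rand()
    x_mod_s = x % s
    # x - x_mod_s + s is the next multiple of s strictly above x;
    # the comparison is rearranged so nothing exceeds L bits.
    while x - x_mod_s > 2 ** L - s:
        x = rand()
        x_mod_s = x % s
    return x_mod_s

# deterministic trace: with s = 3 and 2**L = 4, the value 3 is rejected
stream = iter([3, 1])
print(bounded_rand_fewer_divs(lambda: next(stream), 2, 3))  # → 1
```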
<p>So that’s pretty cool. But can we do even better?</p>
<h2 id="finally-fast-random-integer">Finally, Fast Random Integer</h2>
<p>Remember other than taking remainders, there’s also the scaling approach <code class="language-plaintext highlighter-rouge">x * (s / 2^L)</code>? It turns out if you rewrite that as <code class="language-plaintext highlighter-rouge">(x * s) / 2^L</code>, it becomes quite efficient to compute, because computers can “divide” by a power of two by just chopping off bits from the right. Plus, a lot of hardware has support for getting the full multiplication results, so we don’t have to worry about <code class="language-plaintext highlighter-rouge">x * s</code> overflowing. In the approach using mod, we inevitably need one expensive division, but here we don’t anymore, due to quirks of having a denominator of power of 2. So this direction seems promising, but again we have to fix the statistical biases.</p>
<p>So let’s investigate how to do that with our toy example of <code class="language-plaintext highlighter-rouge">s</code> = 3, <code class="language-plaintext highlighter-rouge">2^L</code> = 4. Let’s look at what happens to all possible values of <code class="language-plaintext highlighter-rouge">x</code>.</p>
<table>
<thead>
<tr>
<th><code class="language-plaintext highlighter-rouge">x</code></th>
<th><code class="language-plaintext highlighter-rouge">s * x</code></th>
<th><code class="language-plaintext highlighter-rouge">(s * x) / 2^L</code></th>
<th><code class="language-plaintext highlighter-rouge">(s * x) mod 2^L</code></th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>3</td>
<td>0</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>6</td>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>9</td>
<td>2</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>Essentially we have <code class="language-plaintext highlighter-rouge">s</code> intervals of size <code class="language-plaintext highlighter-rouge">2^L</code>, and each interval maps to one single unique outcome. In this case, <code class="language-plaintext highlighter-rouge">[0,4)</code> maps to 0, <code class="language-plaintext highlighter-rouge">[4, 8)</code> maps to 1, and <code class="language-plaintext highlighter-rouge">[8, 12)</code> maps to 2. From the third column, we have two cases mapping to 0, and we’d like to get rid of one of them.</p>
<p>Note that the fundamental reason behind this uneven distribution is that <code class="language-plaintext highlighter-rouge">2^L</code> is not divisible by <code class="language-plaintext highlighter-rouge">s</code>, so any contiguous range of <code class="language-plaintext highlighter-rouge">2^L</code> numbers will contain a variable number of multiples of <code class="language-plaintext highlighter-rouge">s</code>. That means we can fix it by rejecting <code class="language-plaintext highlighter-rouge">r</code> numbers in each range! More specifically, if we reject the first <code class="language-plaintext highlighter-rouge">r</code> numbers in each interval, then each interval will contain the same number of multiples of <code class="language-plaintext highlighter-rouge">s</code>. In the above example, the mapping becomes <code class="language-plaintext highlighter-rouge">[1, 4)</code> maps to 0, <code class="language-plaintext highlighter-rouge">[5, 8)</code> maps to 1, and <code class="language-plaintext highlighter-rouge">[9, 12)</code> maps to 2. Fair and square!</p>
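<p>We can verify by brute force that rejecting the first <code class="language-plaintext highlighter-rouge">r</code> values of <code class="language-plaintext highlighter-rouge">(s * x) mod 2^L</code> in each interval makes every outcome equally likely (a quick sketch of mine, not from the paper):</p>

```python
from collections import Counter

# Keep x only if (x * s) mod 2**L >= r; the surviving outcomes
# (x * s) >> L are then exactly uniform over [0, s).
for L, s in [(2, 3), (4, 5), (5, 9)]:
    r = (2 ** L - s) % s
    counts = Counter((x * s) >> L
                     for x in range(2 ** L)
                     if (x * s) % (2 ** L) >= r)
    print(L, s, counts)

# e.g. for L = 2, s = 3, the kept inputs are x = 1, 2, 3,
# mapping to outcomes 0, 1, 2 respectively
```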
<p>Let’s put that in pseudocode:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>r = (2^L - s) mod s
x = rand ()
x_s = x * s
x_s_mod = lowest_n_bits x_s L // equivalent to x_s mod 2^L
while x_s_mod < r do
    x = rand ()
    x_s = x * s
    x_s_mod = lowest_n_bits x_s L
return shift_right x_s L // equivalent to x_s / 2^L
</code></pre></div></div>
<p>Now that would work, and it would take exactly 1 expensive division on line 1 to compute <code class="language-plaintext highlighter-rouge">r</code> every single time. That beats both of the above algorithms! But wait, we can do even better! Since <code class="language-plaintext highlighter-rouge">r < s</code>, we can first check <code class="language-plaintext highlighter-rouge">x_s_mod</code> against <code class="language-plaintext highlighter-rouge">s</code>, and only compute <code class="language-plaintext highlighter-rouge">r</code> if that check fails. This is the algorithm proposed in the paper. It looks something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x = rand ()
x_s = x * s
x_s_mod = lowest_n_bits x_s L
if x_s_mod < s then
    r = (2^L - s) mod s
    while x_s_mod < r do
        x = rand ()
        x_s = x * s
        x_s_mod = lowest_n_bits x_s L
return shift_right x_s L
</code></pre></div></div>
<p>Now the number of expensive divisions is either 0 or 1, with some probability depending on <code class="language-plaintext highlighter-rouge">s</code> and <code class="language-plaintext highlighter-rouge">2^L</code>. This looks clearly faster than the other algorithms, and experiments in the paper confirmed it. But as is often the case, performance comes at the cost of less readable code. We’re also relying on hardware support for full multiplication results, so the code is less portable, and in reality it looks pretty low level and messy. According to the author’s blog (<a href="https://lemire.me/blog/2019/09/28/doubling-the-speed-of-stduniform_int_distribution-in-the-gnu-c-library/">https://lemire.me/blog/2019/09/28/doubling-the-speed-of-stduniform_int_distribution-in-the-gnu-c-library/</a>), Go and Swift have adopted this, deeming the tradeoff worthwhile, and C++ may follow soon.</p>
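<p>Putting the whole thing together, a Python rendering of the paper’s algorithm might look like this (my sketch; the bit operations stand in for the hardware full-multiplication and masking tricks):</p>

```python
import random

def lemire_bounded_rand(rand, L, s):
    """Nearly division-free uniform integer in [0, s), following the
    structure of the paper's algorithm. rand() must return a uniform
    integer in [0, 2**L)."""
    mask = (1 << L) - 1
    x = rand()
    x_s = x * s
    x_s_mod = x_s & mask          # x_s mod 2**L, just bit masking
    if x_s_mod < s:               # only in this rare case...
        r = ((1 << L) - s) % s    # ...do we pay for a real division
        while x_s_mod < r:
            x = rand()
            x_s = x * s
            x_s_mod = x_s & mask
    return x_s >> L               # x_s / 2**L, just a shift

# sanity check against a toy 8-bit generator
random.seed(1)
rand8 = lambda: random.randrange(1 << 8)
counts = [0] * 5
for _ in range(50000):
    counts[lemire_bounded_rand(rand8, 8, 5)] += 1
print(counts)  # all five buckets close to 10000
```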
<h2 id="how-many-divisions-exactly">How Many Divisions Exactly?</h2>
<p>There’s still one last part we haven’t figured out - we know the expected number of divisions is between 0 and 1, but what exactly is it? In other words, how many multiples of <code class="language-plaintext highlighter-rouge">s</code> in the range <code class="language-plaintext highlighter-rouge">[0, s * 2^L)</code> have a remainder less than <code class="language-plaintext highlighter-rouge">s</code> when divided by <code class="language-plaintext highlighter-rouge">2^L</code>? To people with more number theory background, this is probably obvious. But starting from scratch, it can take quite a lot of work to prove, so I’ll just sketch the intuitions.</p>
<p>It’s a well known fact that if <code class="language-plaintext highlighter-rouge">p</code> and <code class="language-plaintext highlighter-rouge">q</code> are co-prime (no common factors other than 1), then the numbers <code class="language-plaintext highlighter-rouge">{ 0, p mod q, 2p mod q, 3p mod q ... (q-1) p mod q }</code> will be exactly <code class="language-plaintext highlighter-rouge">0</code> to <code class="language-plaintext highlighter-rouge">q-1</code>. This is because if there is any repeated number, then we have <code class="language-plaintext highlighter-rouge">a * p mod q = b * p mod q</code> (assuming <code class="language-plaintext highlighter-rouge">a > b</code>), which indicates <code class="language-plaintext highlighter-rouge">(a - b) * p mod q = 0</code>. But we know that <code class="language-plaintext highlighter-rouge">0 < a - b < q</code>, and <code class="language-plaintext highlighter-rouge">p</code> has no common factor with <code class="language-plaintext highlighter-rouge">q</code>, so if we multiply those two together, it cannot be a multiple of <code class="language-plaintext highlighter-rouge">q</code>. So it’s impossible to have duplicates, and multiples of <code class="language-plaintext highlighter-rouge">p</code> will evenly distribute among <code class="language-plaintext highlighter-rouge">[0, q)</code> when taken mod <code class="language-plaintext highlighter-rouge">q</code>.</p>
<p>Now if <code class="language-plaintext highlighter-rouge">s</code> and <code class="language-plaintext highlighter-rouge">2^L</code> are co-prime, there will be exactly <code class="language-plaintext highlighter-rouge">s</code> number of multiples of <code class="language-plaintext highlighter-rouge">s</code> that has a remainder ranging from 0 to <code class="language-plaintext highlighter-rouge">s - 1</code>. That means the expected number of divisions in this case is <code class="language-plaintext highlighter-rouge">s / 2^L</code>.</p>
<p>If they aren’t co-prime, that means <code class="language-plaintext highlighter-rouge">s</code> is divisible by some power of 2. Say <code class="language-plaintext highlighter-rouge">s = s' * 2^k</code>, where <code class="language-plaintext highlighter-rouge">s'</code> is odd. Then <code class="language-plaintext highlighter-rouge">s * 2^(L-k) = s' * 2^L</code> will be 0 mod <code class="language-plaintext highlighter-rouge">2^L</code>. So the multiples of <code class="language-plaintext highlighter-rouge">s</code> mod <code class="language-plaintext highlighter-rouge">2^L</code> come back to 0 after <code class="language-plaintext highlighter-rouge">2^(L-k)</code> steps, and you have <code class="language-plaintext highlighter-rouge">2^k</code> iterations of that cycle. So if you go through the final count of each remainder, it goes <code class="language-plaintext highlighter-rouge">2^k</code>, followed by <code class="language-plaintext highlighter-rouge">2^k - 1</code> zeros, rinse and repeat. How many of these counts fall below <code class="language-plaintext highlighter-rouge">s</code>? You have <code class="language-plaintext highlighter-rouge">s'</code> nonzero counts there, each equal to <code class="language-plaintext highlighter-rouge">2^k</code> - the total is again, unsurprisingly, <code class="language-plaintext highlighter-rouge">s</code>. So the expected number of divisions is still indeed <code class="language-plaintext highlighter-rouge">s / 2^L</code>.</p>
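<p>Both cases can be confirmed by brute force - counting, for each possible <code class="language-plaintext highlighter-rouge">x</code>, whether <code class="language-plaintext highlighter-rouge">x * s mod 2^L</code> falls below <code class="language-plaintext highlighter-rouge">s</code> (a quick sketch of mine):</p>

```python
def slow_path_count(L, s):
    """Number of x in [0, 2**L) for which the algorithm hits the
    division branch, i.e. (x * s) mod 2**L < s."""
    return sum(1 for x in range(2 ** L) if (x * s) % (2 ** L) < s)

# Both co-prime (odd) and non-co-prime (even) values of s give exactly s,
# so the probability of hitting the division is s / 2**L.
for s in [3, 5, 6, 8, 12]:
    print(s, slow_path_count(10, s))
```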
<h2 id="final-thoughts">Final Thoughts</h2>
<p>Earlier I said you need to throw away <code class="language-plaintext highlighter-rouge">2^L mod s</code> numbers each time to make an even distribution, but that’s not strictly necessary. For example, if <code class="language-plaintext highlighter-rouge">s = 5</code> and <code class="language-plaintext highlighter-rouge">2^L = 8</code>, you don’t have to fully reject the 3 extra cases. In fact, you can save up that leftover randomness for the next iteration. Say in the next iteration you land in 1 of those 3 cases again. Then, combined with the 3 cases you saved up last time, you are now in one of 9 equally likely events. If you are in the first 5, you can safely return that value without introducing bias. However, this is only useful when generating random bits is really expensive, which is totally not the case in non-cryptographic use cases.</p>
<p>One last note - we have established that the expected number of divisions is <code class="language-plaintext highlighter-rouge">s / 2^L</code>. As <code class="language-plaintext highlighter-rouge">s</code> gets close to <code class="language-plaintext highlighter-rouge">2^L</code>, it seems like our code can become slower. But I think that’s not necessarily the case, because the time division takes is probably variable as well, if the hardware component uses any sort of short-circuiting at all. When <code class="language-plaintext highlighter-rouge">s</code> is close to <code class="language-plaintext highlighter-rouge">2^L</code>, <code class="language-plaintext highlighter-rouge">2^L mod s</code> is essentially one or two subtractions plus some branching, which can theoretically be done really fast. So, given my educated guess/pure speculation, <code class="language-plaintext highlighter-rouge">s / 2^L</code> growing isn’t a real concern.</p>https://arxiv.org/abs/1805.10941 - Fast Random Integer Generation in an Interval