Building an AVL Tree From a Sorted Sequence in One Pass

2020-10-25 9 minute read

Recently I came across the function Map.of_increasing_sequence in the base library of OCaml. It might sound like a very simple and common function, but the implementation is actually quite cool. Let’s dive in. (Spoiler: it’s related to a weird number system.)

First Impressions

A sequence is like an iterator - you can either get the next value or reach the end. A map is an immutable AVL tree of key value pairs. An AVL tree is a binary tree with two properties - each node is larger than all nodes in its left subtree and smaller than all nodes in its right, and both subtrees can only have heights differing by at most 1. From an algorithmic standpoint, making a BST from a sorted array is trivial. You can achieve O(n) time complexity, and O(log(n)) space excluding the return value, just by simple recursion. But you can’t do that to a sequence, since sequences don’t permit random access. Now, we can always turn the sequence into an array first, but that would require O(n) extra space. Can we do better?

It turns out that library function is implemented with only one pass through the sequence, using only log(n) extra space. The code had almost no documentation, and I couldn’t find any description online (although I didn’t try very hard), so here I’ll attempt to motivate and derive the algorithm.

First Attempt: Try to Build a Tree

Imagine yourself with the task of building a balanced tree and are handed one number at a time, and you need to incrementally build a BST as quickly as possible. That might look like this.

I get a 1 - that’s easy. I’ll have a tree with one node.

2 - OK, that’s bigger than 1, I can make that the new root, and make 1 the left child.

3 - Let’s put that as the right child. So far so good.

4 - Hmm, maybe we could make that the new root, and have the tree rooted at 2 as the left child?

5, 6, 7 - That looks like 1, 2, 3 all over again, we can put those in the right subtree. Now we have a complete BST, looking good!

8 - That’s awkward again, let’s just say it’s the new root again.

9-15 - Looking like 1-7 again…

Maybe you can see the recursive pattern here. This looks like a procedure that produces reasonably balanced trees, and the height is always bounded by O(log(n)). That seems good, right?

It would be acceptable, but only if your map library only has two functions - build the tree, then look up values and never change it again. The problem is, the BST also has to support adds and deletes as well. The most common BST types, like red black trees and AVL trees, all have their own invariant, and unfortunately we’re not meeting those standards with our almost-balanced trees. For example, At step 8, we have a root (8), a left subtree of size 7, and an empty right subtree. That’s not a valid red black tree or AVL tree, and you can’t just return that, since that would break the rest of your library. How can we fix this?

Second Attempt: Build Branches Instead

One might think that we could make this work somehow with some clever ordering of insertions to the tree. But the fundamental issue here is that with only one tree, we can never change the root - at the moment we make the newest element the root, the tree must become heavily imbalanced. So perhaps we can instead maintain a bag of branches, and quickly assemble them into a tree when we hit the end of the sequence.

Here, the defining characteristic of a branch would be its composability. Let’s define a branch (called a fragment in the source code) like the trees we had in step 4 and 8 in the last attempt. In other words, a branch would be a tree with a complete left subtree and an empty right subtree. To merge two branches into a tree, you could put one branch as another branch’s right child. A branch of height n has 2^n nodes.

You could also merge two branches of the same size into a new branch. Here’s one way to do it: to merge X with Y, take the left subtree of branch Y and move it to the right subtree of branch X. Now X becomes a complete binary tree, and you can set it to be the left subtree of Y. This fits our branch definition, while also preserving order in the tree.

Merge branches as tree:
    X       Y        X
   /   +   /   =    / \
  A       B        A   Y
                      /
                     B

Merge branches as branch:
    X       Y        Y
   /   +   /   =    /
  A       B        X
                  / \
                 A   B

Now let’s try again:

1 - One node by itself is a branch.

2 - We could make 2 a branch and merge with 1.

3 - We could have 3 be its own branch. Now we have branches of size 2 and 1.

4 - Let’s merge 4 with 3, and then we have 2 branches of size 2. We could merge those two again into one branch.

5 - That’s a new branch.

6 - Add that to 5’s branch…

That starts to look recursive again. Now we always have a bunch of branches. And when we need to generate the final tree, we could just iteratively merge them together, from small to large. Is that good enough?

We Are Building a Binary Number

It’s not. To see this, we can frame this algorithm a bit more abstractly.

Consider the heights of our branches. For each integer n, each time we can only have either 0 or 1 branch of height n, because once we have 2, we merge them together. We can visualize the branch building in this table.

# branches of size 4	# branches of size 2	# branches of size 1	# Nodes
0	0	1	1
0	1	0	2
0	1	1	3
1	0	0	4
1	0	1	5
1	1	0	6

Our branches sizes correspond to the binary representation of total tree size.

From here, we can see that we’re abstractly incrementing a binary counter. Now the problem is we can have a lot of gaps, or 0s, in the binary number. For example, for 17 nodes, we’ll have a large branch of size 16 and a small branch of 1 node. Now if the sequence terminates, we’ll have to merge those branches into a tree - but that again will be a heavily one-sided tree.

Third and Final Attempt: Keep 2 Branches at Each Level

Since gaps are causing us problems, maybe we could just, like, not have them. And in fact we could. This is hinted at in the code - “using skew binary encoding”. (Although from Wikipedia, skew binary system actually refers to a slightly different definition.)

In this new “binary” encoding, we could use the digits 0, 1 and 2, as opposed to just 0 and 1. Each position in the number would still have the same weight. So for example, 212 = 2*4 + 1*2 + 2 = 12. Here’s how to count in this new number system.

# branches of size 4	# branches of size 2	# branches of size 1	# Nodes
0	0	1	1
0	0	2	2
0	1	1	3
0	1	2	4
0	2	1	5
0	2	2	6
1	1	1	7

Counting in the new system

Basically, to add one, we flip 0 to 1 and 1 to 2, but we flip 2 back to 1 and carry forward. There are never any 0s in any number.

Translating this back to our branch building, that means we don’t merge when we have two branches. We merge when we have three - and we merge the older two branches, “carrying” it forward to the next level, and always keep one branch for each height.

Let’s convince ourselves that at every point in the process, we can merge all branches and end up with a valid AVL tree (i.e. the algorithm is correct).

Say we are n steps into the branch building process, and we have to make a tree. We can convert n into a string of 1 and 2 in this number system. Starting from the least significant digit, we either have 1 or 2. Together, that gives us either a tree of height 1 or 2. Moving onto the next digit, we again have either 1 or 2 branches of size 2. At the max, we have 3 branches, each of size 2. If we first merge the left two branches into a branch, then merge that with the right branch to a tree, that leaves us with a maximum tree height of 3 (while minimum is 2). At the next level, we can at max have 3 branches of height 3. Similarly, we end up with a tree with height between 3 and 4.

Illustrating the 222 case, with 14 nodes.
Level 1:
13 + 14 
-------
  14
13

Level 2:
 10  +  12  +  14
09     11     13
-----------------
      12
  10      14
09  11  13

Level 3:
      04          08          12
  02      +   06      +   10      14
01  03      05  07      09  11  13
------------------------------------
              08
      04              12
  02      06      10      14
01  03  05  07  09  11  13  

In this process, we always create trees that preserve order. And after level n, the tree that we end up with always have height n or n+1. That satisfies the AVL tree invariant that two subtrees should have heights differ by at most 1.

That’s It!

Now we should be reasonably convinced that this algorithm produces valid BST. But there were a lot of details that were glossed over. To be completely rigorous, we would need to formalize the observations into claims, and prove them by induction. But I believe this process captures the key ideas already.

We also skipped the time/space complexity discussion. There are two slightly nontrivial details here. First, each insertion could lead to a cascade of branch merges (or carries), so we need to argue that insertion has an amortized cost of O(1). Then, we need to realize that the final tree merging takes O(log(n)) branches, and each tree merge is O(1) as well. As an aside, this number system has a unique representation for each number, which is perhaps not totally obvious.

I am not able to identify the inventor of this algorithm. It doesn’t seem particularly likely that the author of this code was also the inventor.

There are still some aspects of that source file that I don’t quite understand, but is perhaps not closely related. In particular, the invariant is that subtrees have heights differing by at most 2, not 1, like normal AVL trees. Maybe I’ll find out why another day.