Prefix-free codes and ordinals

Consider the problem of representing a number in computer memory, which is idealized as a sequence of zeros and ones. The binary number system is a well-known solution to this problem — for example, the sequence “01101” represents $0 \cdot 2^4 + 1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 = 11$. But there’s a problem: You don’t expect the entire computer to just be used to represent one number; you expect it to have other things stored afterwards. So how do you tell where the number ends? If the sequence begins $01101001\dots$ does this represent the number $011_2$ or $01101_2$ or $0110100_2$?

The solution to this problem most commonly used in practice is to declare in advance a fixed number of bits that will be used to represent the number, usually 32 bits or 64 bits. For example, if we fix a 5-bit representation the $01101001\dots$ always means that the number is $01101_2 = 11$ and $001\dots$ represents other stuff in memory. This works well enough in practice, but it has a problem: The number of bits you set aside for storing the number forces an upper limit on how big the number can be. For example, you cannot store any number bigger than $2^64$ in 64 bits. If the computer ever needs to store a bigger number than can be represented with space set aside then the computer fails.

I’ll introduce some terminology. The condition that a system of representing numbers is unambiguous can be phrased formally by saying the this method is a prefix-free code. A prefix-free code consists of a set of codewords, which are sequences of bits, such that no codeword is a prefix of another. A continuing stream of bits can be interpreted as a codeword by taking an initial segment that is a codeword, and by prefix-free property at most one such interpretation is possible. Since we want this code to represent numbers we also want a map from the codewords to the natural numbers (by which I mean including zero, naturally)

The fixed-width codes I mentioned earlier are examples of prefix-free codes whose code words map to a finite subset of the natural numbers. The simplest prefix-free code that can represent every natural number is the unary number system: A number is represented by that number of repeating ones followed by a zero. So “0” is 0, “10” is 1, “110” is 2, et cetera.

This system is extremely inefficient: It takes $n+1$ bits to represent the number $n$. In contrast, the simple (non-prefix-free) binary representation takes $\log_2 n$ bits. A more efficient system is what I’ll call the level 1 binary system: A number is represented by a number $k$ in unary concatenated by a $k$-bit representation of the number. This takes $2 \log_2 n$ bits to represent the number $n$. That’s much more efficient than the unary system, but is still a factor of two less efficient than the simple binary system. A more efficient system is the level 2 binary system: First represent a number $k_0$ in unary, then a $k_0$-bit number $k_1$, and then the number $n$ is represented as a $k_1$-bit number.

Asymptotically this should be efficient enough for any purpose, but the way it is constructed naturally inspires the generalization to a level $k$ binary system for all $k$. The level 3 system shaves off $\log \log n$ bits, the level 4 system another $\log \log \log n$ bits, and so on. After all of these comes the level $\omega$ binary system: A number $k$ is given in unary, followed by a representation of a number in the level $k$ binary system. Here’s a particularly elegant reformulation of this system: “0” is a codeword representing 0. If $x$ is a codeword representing $n$ and $y$ is a $n$-bit string, then $1 x y$ is a codeword representing the number which has $1y$ as its binary representation. Within this formulation the system has the property that every number has a unique code for it. This trick where the bit-string $y$ represents the binary number $1y$ can also be applied to the level $k$ binary systems for finite $k$, which results in a modified system where number has a unique code.

All of the prefix-free codes I described have the following additional property: If we restrict only to codewords that represent a number in the shortest possible way than one codeword is lexicographically less than another codeword if and only if the corresponding number is smaller. It follows that for any bit-string $a$, the set of numbers whose shortest codeword starts with $a$ is either an interval or a ray. For example, in the unary system the string $1^n$ of $n$ consecutive ones is a prefix for all numbers in the ray of numbers greater or equal to $n$. In the level 1 binary system, the prefix $1^9 0110$ is used to represent any 9-bit number whose first three bits are $110$, i.e., numbers in the range from $1.10_2 \times 2^8$ to $1.11_2 \times 2^8 - 1$. In other words, the prefix $1^9 0110$ gives a scientific-notation-like approximation for the corresponding number as $1.10_2 \times 2^8$.

Similarly, in the level $\omega$ binary system, if the sequence $x$ denotes the number $n$ and $b_0, \dots, b_{k-1}$ are bits, then the prefix $1 x b_0 \dots b_{k-1}$ begins the representation of any number that is represented as $(1.b_0 \dots b_{k-1})_2 \times 2^n$ in binary scientific notation. What about the prefix $11x b_0 \dots b_{k-1}$? If the representation of a number $m$ begins with this sequence, then the fragment $1xb_0 \dots b_{k-1}$ must denote a truncation of a representation of the number $k$ of bits of $m$. Then $k$ is approximately $(1.b_0 \dots b_{k-1})_2 \times 2^n$, and so $m$ is approximately $2^{1.b_0 \dots b_{k-1} \times 2^n}$. Similarly the prefix $1 \dots 11x b_0 \dots b_{k-1}$ represents a number approximately equal to $2^{\cdot^{\cdot^{\cdot^{2^{ 1.b_0 \dots b_{k-1} \times 2^n}}}}}$. Every bit-sequence that is a prefix of a valid level $\omega$ codeword takes one of these forms.

In conclusion, truncating the level $\omega$ binary system representation of a number provides information on the approximate value of the number in a manner generalizing scientific notation. Moreover, this information remains efficient and useful as long as the number is no longer than a small tetration. In contrast, plain scientific notation ceases to be useful for numbers larger than an exponentiall whose exponent is too large to be written in full.

The suggestive terms “level $k$” and “level $\omega$” should inspire you to wonder whether there are level $\alpha$ systems for larger ordinals $\alpha$. For example, the level $\omega+1$ system would represent a number using a unary representation of a number $k$, followed by a $k$-bit number $n$, followed by the number represented in the level $n$ binary number system. Truncations of such representations could describe the approximate value of a number like truncations of the level $\omega$ system except with bigger ordinals there will be an even larger range that these truncations can usefully describe. Hopefully this can combine the power of ordinal-based function hierarchies for describing large numbers with the precision of scientific notation. Indeed such a system can be defined. I will describe a particularly elegant formulation:

Let $\alpha$ be a countable ordinal. Let $X = \{\beta \leq \alpha| \beta \text{ is a limit ordinal}\}$. Define a cofinal sequence system on $\alpha$ to be a sequence $(p_n)_{n \in \omega}$ of functions $p_n : X \to \alpha$ such that $\beta = \sup_n p_n (\beta)$ for all $\beta \in X$. The name for this concept is my own invention; I’m sure this concept has been studied before but I have not bothered to look up the standard terminology. Given such a system define $p : \alpha \setminus \{0\} \to \alpha$ by $p (\beta) = \beta'$ if $\beta = \beta' + 1$ is a successor ordinal and $p (\beta) = p_0 (\beta)$ if $\beta$ is a limit ordinal. Define a partial order $\lesssim$ by $\alpha \lesssim \beta$ is $\alpha$ can be obtained from $\beta$ by iterating $p$ a finite number of times. Then define the cofinal sequence system $(p_n)_{n \in \omega}$ to be progressive if $p_n (\beta) \lesssim p_{n+1} (\beta)$ for all $n \in \omega$, $\beta \in X$.

Let $(p_n)_{n \in \omega}$ be a progressive cofinal sequence system, with associated order $\lesssim$. A function $f : \alpha+1 \to \omega$ is said to have good support if $\mathrm{supp} f := \{x \leq \alpha | f (x) \neq 0\}$ is finite and linearly ordered by $\lesssim$. For example, the Kronecker delta function $\delta_\beta (\beta') = 1$ if $\beta = \beta'$ and $=0$ if $\beta \neq \beta'$ has good support. For each function $f$ with good support I will define an associated prefix-free code.

To define this prefix-free code, first decompose $f$ as $f = g + n \delta_\beta$ where $\beta$ is the largest element of $\mathrm{supp} f$ and $n = f (\beta)$ (so that every element of $\mathrm{supp} g$ is strictly $\lesssim$-less than $\beta$). If $f = 0$ set $\beta = n = g = 0$. The definition of the prefix-free code associated to $f$ depends on the cofinality of $\beta$:

• If $\beta = 0$ (and so $g = 0$) the empty string $e$ is the only valid codeword and is associated to the number $n$.
• If $\beta = \beta' + 1$ is a successor ordinal then the codewords are $bx$ where $b$ is a bit and $x$ is a codeword for the prefix-free code of $g + (2n-1+b) \delta_{\beta'}$ with the same associated number.
• If $\beta$ is a limit ordinal then the codewords of $f$ are the codewords of $g+\sum_{k, with the same associated numbers.

These codes satisfy the following properties:

• The set of numbers that can be represented in each code is a closed interval.
• Each number is represented by at most one codeword in each code.
• Each code maps the lexicographic ordering of the codewords into the standard order of the corresponding numbers.
• If $f \not\equiv 0$ has good support and $\beta$ is the largest element of $\mathrm{supp} f$ then the smallest element that can be represented with the code associated to $f$ is one greater than the largest number that can be represented with the code associated to $f - \delta_{\beta}$.

These properties can be proven by induction on the largest element of $\mathrm{supp} f$.

Finally define the level $\alpha$ prefix-free code to have codewords $1^n 0 x$ where $x$ is a codeword for $n \delta_\alpha$. Every number can be represented uniquely by this code, and lexicographically greater codewords map to bigger numbers.

In particular, take $\alpha = \omega^2$. Define a progressive cofinal sequence system on $\omega^2$ by $p_n ((m+1) \omega) = m \omega + n$, $p_n (\omega^2) = n \omega$. Then the level $\omega^2$ prefix-free code is exactly identical to the level $\omega$ binary number system I defined earlier.

2 thoughts on “Prefix-free codes and ordinals”

1. Reblogged this on Gentzen translated and commented:
Very nice. I like that the ordinal based construction allows for quite some freedom, while still ensuring that every number can be represented uniquely, and that lexicographically greater codewords map to bigger numbers.

Allowing non-binary digits would provide even more freedom. I guess the only thing which would change in the ordinal based construction is that (2n-1+b) would be replaced by (rn-r+1+d), where d is an r-ary digit. And the encoding of $1^n0x$ would have to be adapted too.

I wonder what motivated you to write this post. Was it just the obvious motivation to construct a connection between ordinals and natural number representations?

2. Thanks. My inspiration was seeing other discussions of what I called the level $\omega$ binary system. I believe this was in a paper implementing it in binary lambda calculus, but I can’t bother to find it again. I was thinking about nice properties of this representation system and trying to generalize it.