Thinking in Bits

As a big proponent of Tonsky’s Performance first philosophy, I believe every software engineer should pay attention to performance from the start. If used in the right places, bitwise operators can increase the speed of your programs by several orders of magnitude — without compromising on readability.

However, unlike some of the performance low-hanging fruit found in complex frameworks, code that works at the bit level should be introduced early on in a project. Serialization protocols are mostly set in stone from day one, as updating them in a production environment normally requires downtime to perform the necessary migrations.

It’s good to practice fundamentals from time to time. You never know when these will come in handy. Sure, if you’re a frontend UI engineer, the & (different from &&), ^ or ~ operators may be outright new to you. This is not to say that they are useless or “too primitive”: the Apollo Guidance Computer was entirely built from NOR gates.

Armstrong training in the lunar module simulator at Kennedy Space Center (Source)

To explore some use cases of these operators, we’re going to design a program to parse Minecraft’s Chunk Data packet. Video games rely on all sorts of data structures to minimize the number of bytes sent over the wire and keep latency low. In particular, this packet contains the game’s block data serialized as a compacted array of values. This data structure must be designed with performance in mind, since a Minecraft server can send up to 120 packets per player per second.[1]

[1] Alstad, T. et al. (2015). Minecraft Computer Game Performance Analysis and Network Traffic Emulation by a Custom Bot. Science and Information Conference (SAI), 227–236.

A compacted array holds a fixed number (length) of integers of bitsPerValue bits each. It has the following API:

public class CompactedArray {

    public CompactedArray(final int length, final int bitsPerValue) {
        // ...
    }

    public int get(final int index) {
        // ...
    }

    public void set(final int index, final int value) {
        // ...
    }
}

The main difference between CompactedArray and a regular int[] is that it supports words of any size less than or equal to 32 bits. Indeed, if bitsPerValue were 8, 16 or 32, the primitive array counterpart (byte[], short[] or int[]) would outperform the CompactedArray implementation: method invocation and object memory overhead come to mind. In fact, Minecraft used to store block IDs in a byte array, but they quickly ran out of IDs for new blocks and overhauled the game’s storage and protocol format.

It seems natural that CompactedArray should be backed by a primitive array given their similarities: they both hold a fixed number of values of a certain bit length. But what type should we employ?

In the worst-case scenario (bitsPerValue is 32), retrieving a CompactedArray value from a byte array requires 4 loads. Assuming the values are in the processor’s L1 cache, this should only take 4 cycles (see the Intel Optimization Reference Manual). However, we also need to take the JVM’s bounds checking into account: every array access carries some overhead, so minimizing memory accesses yields better performance. A short array might still need 2 loads if bitsPerValue is greater than 16. See the pattern? As the backing array element size increases, the number of required load instructions decreases. This means a long array is the best candidate:

public class CompactedArray {

    private final long[] data;
    private final int bitsPerValue;

    public CompactedArray(final int length, final int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        // TODO Initialize data array
    }

    // ...
}

A new question arises: how big should the data array be? For brevity, let $l$ and $b$ be length and bitsPerValue respectively. The problem is equivalent to finding the least multiple of 64 which can store $lb$ bits. That is,

$$\text{data array size} = \left\lceil \frac{lb}{64} \right\rceil,$$

where $\left\lceil \cdot \right\rceil$ denotes the ceiling function. For example, if we wanted to store 32 values of 5 bits each, the data array length would be $\left\lceil \frac{32 \times 5}{64} \right\rceil = 3$.

public class CompactedArray {

    public CompactedArray(final int length, final int bitsPerValue) {
        // Using Math.ceil is suboptimal, see the appendix
        this.data = new long[(int) Math.ceil(length * bitsPerValue / 64D)];
        // ...
    }

    // ...
}

A CompactedArray can now be constructed, but it’s pretty useless in its current form. Let’s first work on value retrieval. The choice of a long array as the backing structure means a value can be found within a single long, or overlapping between 2 longs, where the least significant bits (LSBs) can be found in the first long and the remaining ones at the beginning of the second. More formally, given a value starting at position $0 \leq s < 64$ of a long, the number of bits of that value stored in that long is $1 \leq \min\{b, 64 - s\} \leq b$, meaning $\max\{b - (64 - s), 0\}$ bits are stored in the next long. As an exercise, find how many bytes a $b$-bit value can overlap.
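For instance, a 5-bit value starting at position $s = 62$ keeps $\min\{5, 64 - 62\} = 2$ bits in the first long, leaving $\max\{5 - 2, 0\} = 3$ bits for the next one.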

Given a CompactedArray element at index $i$, we know its first bit is stored in position $bi$, which lies within the $\left\lfloor \frac{bi}{64} \right\rfloor$-th long. The notation is making this seem worse than it is. In code,

public int get(final int index) {
    int bitOffset = index * bitsPerValue;
    // Integer division truncates the result
    int offset = bitOffset / 64;
    // ...
}

The right-shift operator

It’s time to introduce the first bitwise operator: the right shift (>>). Imagine we want to compute $a / b$. If $b$ can be expressed as a power of 2, i.e. $b = 2^x$ for some $x$, right shifting $a$ by $x$, a >> x, is equivalent to dividing $a$ by $b$.

In Algebra class you probably learnt that dividing powers with the same base is equivalent to subtracting their exponents. Notice that every bit of an int represents a power of 2. For example, letting bitOffset be 197, offset is

$$\begin{aligned} \left\lfloor \frac{197}{64} \right\rfloor = \left\lfloor \frac{2^7 + 2^6 + 2^2 + 2^0}{2^6} \right\rfloor &= \left\lfloor 2^1 + 2^0 + 2^{-4} + 2^{-6} \right\rfloor \\ &= 2^1 + 2^0 = 3. \end{aligned}$$

As 64 is equal to $2^6$, bitOffset / 64 is the same as bitOffset >> 6. We’re not interested in the fractional part (the negative powers of 2), so it is discarded.
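As a quick sanity check, here is a minimal standalone snippet (not part of CompactedArray) showing that both forms agree. Note that the equivalence only holds for non-negative operands, since >> rounds toward negative infinity while / truncates toward zero:

public class RightShiftDemo {

    public static void main(String[] args) {
        int bitOffset = 197;
        // For a non-negative int, dividing by 64 = 2^6 and right
        // shifting by 6 produce the same result.
        System.out.println(bitOffset / 64); // 3
        System.out.println(bitOffset >> 6); // 3
    }
}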

The bitwise AND operator

Having computed the data array index where the LSBs of the value can be found, it’s time to figure out which bits we need to extract from it. The starting bit position within the long is $s = bi \bmod 64$. In code,

public int get(final int index) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset % 64;
    // ...
}

This works, but a modulo operation involves a division. Divisions are much more expensive than bitwise operations (see “A fast alternative to the modulo reduction”).

We introduce the AND (&) operator. Given two binary numbers $a, b$, we want to obtain which 1-bits they have in common. For each power of 2, this operator compares the bits from each argument: if both are 1, the resultant bit is 1. Otherwise, it is 0. For instance, $1001 \And 0101 = 0001$.

Representing bitOffset in binary, i.e. as $\sum_{j=0}^{31} b_j 2^j$ where $b_j$ is the $j$-th bit of bitOffset, one can factorize and represent this value as $2^6(b_{31} 2^{31-6} + \dots + b_6 2^0) + b_5 2^5 + \dots + b_0 2^0$. Reducing this value modulo $64 = 2^6$ leaves the non-factorized part, $b_5 2^5 + \dots + b_0 2^0$. This is the same as extracting the first 6 bits using bitOffset & 0b111111, which is much faster. This technique is known as bit masking.
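As before, a minimal standalone check (not part of CompactedArray) that the mask and the modulo agree for non-negative values:

public class BitMaskDemo {

    public static void main(String[] args) {
        int bitOffset = 197; // 0b11000101
        // Both expressions extract the low 6 bits, 0b000101 = 5
        System.out.println(bitOffset % 64);       // 5
        System.out.println(bitOffset & 0b111111); // 5
    }
}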

The problem is now divided in two: if the value fits in the long (i.e. $b \leq 64 - (bi \bmod 64)$) we need to extract $b$ bits by applying a bit mask. However, if the value overlaps two longs, we need to join the least and most significant bits (MSBs).

The first case can be expressed as a “power-of-2 filter”. The value is equal to the sum of all bits that lie in positions $s$ through $s + b - 1$. That is,

$$\text{value} = \sum_{j=s}^{s + b - 1} b_j 2^{j - s}.$$

The inequality $s \leq j < s + b$ is equivalent to $0 \leq j - s < b$. Applying this shift by $s$ in code is expressed as data[offset] >>> startBit (the extra > tells Java to also shift the sign bit; see the JLS, Section 15.19), moving bits in positions $[s, s + b)$ to $[0, b)$. However, this right shift is insufficient, as it also shifts all bits in positions greater than or equal to $s + b$. To ignore these, we apply a bitmask that only leaves the first $b$ bits:

public class CompactedArray {

    // ...
    private final long bitmask;

    public CompactedArray(final int length, final int bitsPerValue) {
        // ...
        // Left shifting 0b1 by bitsPerValue results in 2^bitsPerValue.
        // If bitsPerValue were 5, 2^5 - 1 = 0b100000 - 0b1 = 0b011111,
        // the bitmask we're interested in.
        this.bitmask = (1L << bitsPerValue) - 1;
    }

    public int get(final int index) {
        int bitOffset = index * bitsPerValue;
        int offset = bitOffset >> 6;
        int startBit = bitOffset & 0b111111;

        long lsbs = (data[offset] >>> startBit) & bitmask;
        // ...
    }
}

The left-shift operator

What about elements contained within two longs? Since the LSBs of the value are stored in the uppermost bits $s, \dots, 63$ of the first long, Java fills the MSBs of data[offset] >>> startBit with zeros after shifting. The first long contains $64 - s$ bits of the final value, meaning the next long contains the remaining $b - (64 - s)$ bits. We need to somehow join the LSBs from the first long (which we already have) with the MSBs stored in the LSBs of the second long, data[offset + 1].

Directly adding these bits would yield an incorrect result since the MSBs are misplaced: when bitsPerValue is 18, the value at index 3 is stored in bits 54 through 63 of the first long and bits 0 through 7 of the second long. If we shift the latter $64 - 54 = 10$ positions to the left, both values fit like puzzle pieces.

More rigorously, the number of positions we have to shift the MSBs by is exactly the number of bits we already have from the first long, $64 - s$.

In contrast to its sibling, left shifting $a$ by $b$, a << b, is equivalent to multiplying $a$ by $2^b$. To sum up, given the bits $b_0, \dots, b_{b - (64 - s) - 1}$ from the second long,

$$\text{value} = \text{LSBs} + \sum_{j=0}^{b - (64 - s) - 1} 2^{j + 64 - s} b_j,$$

which in code is

int shiftBy = 64 - startBit;
// Apply the bitmask to zero out the upper bits we're not interested in
long msbs = (data[offset + 1] << shiftBy) & bitmask;

The bitwise OR operator

Given two binary numbers $a, b$, the OR (|) operator tells us which bits of $a$ or $b$ are 1. For each position, the operator compares the bits from each argument: if either or both of them are 1, the resultant bit is 1. Otherwise, it is 0. For example, $1100 \mid 1010 = 1110$.

Since the LSBs and MSBs of value don’t overlap (all the bits to the left of the LSBs are zero, and all the bits to the right of the MSBs are zero), $\text{MSBs} \mid \text{LSBs}$ yields the value we’re looking for.

public int get(final int index) {
    return (int) (msbs | lsbs);
}

Given an index, all that’s left is to check which case we’re in: does the value fit in a long or does it overlap? This is easy: we only need to check whether the offset of the last bit, $s + b$, is greater than 64. Another alternative is comparing the offsets of the initial and last bits:

int lastBitOffset = (bitOffset + bitsPerValue - 1) >> 6;

if (offset != lastBitOffset) {
    // The starting bit and ending bit are in different longs
}

On my machine, this is 1.7x slower than the greater-than comparison (JMH benchmark). Generally, integer equality and inequality comparisons take the same number of CPU cycles to run (this SO answer explains it in more detail). The second method introduces a right shift, which takes an extra clock cycle and confuses the branch predictor. I challenge you to create a (faster) branchless version; I’ve tried and failed.

Putting it all together,

public int get(final int index) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset & 0b111111;

    // Get least significant bits
    long value = (data[offset] >>> startBit) & bitmask;

    if (startBit + bitsPerValue > 64) {
        int shiftBy = 64 - startBit;
        // OR shifted most significant bits
        value |= (data[offset + 1] << shiftBy) & bitmask;
    }

    return (int) value;
}

The set method involves the same offsets. Computing the updated data[offset] and data[offset + 1] values is a tad trickier though. Given the previous data, this method shall replace the relevant bits with the new integer value of bitsPerValue bits. Its LSBs will go in positions $s$ to $\min\{s + b - 1, 63\}$ of the updated data[offset].

We already know how to align these bits. Namely, we apply the bitmask using the AND operator (to ensure the given value has at most bitsPerValue bits; never trust the caller!) and left shift the LSBs startBit positions. The resultant value is of the form $0 \dots 0\, b_{b-1} \dots b_0\, 0 \dots 0$. However, a new problem arises: how do we update these bits while leaving others untouched?

The bitwise complement operator

By zeroing the bits of data[offset] corresponding to the given index, we get two non-overlapping longs we can OR to get the new value. For this, we construct a bitmask such that, when applied to the old value (by using the AND operator), only the bits we don’t want to update remain. If we left shift bitmask by startBit positions we get a long of the form $0 \dots 0 \underbrace{1 \dots 1}_{b \text{ times}} 0 \dots 0$. This is the opposite of what we want.

The bitwise complement (~) operator inverts the bits of the operand. It is a unary operator: it only takes one operand. It makes every 0 bit a 1 bit and every 1 bit a 0 bit, e.g. $\backsim 1011 = 0100$.
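A small standalone example (not part of CompactedArray) makes this concrete. Java ints are 32 bits wide, so ~ flips all 32 of them; masking the result down to 4 bits recovers the 4-bit complement from the example above:

public class ComplementDemo {

    public static void main(String[] args) {
        // ~0b1011 = 0b111...10100; keep only the low 4 bits
        int complement = ~0b1011 & 0b1111;
        System.out.println(Integer.toBinaryString(complement)); // 100
    }
}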

We apply the inverted mask and OR the result with the shifted LSBs to get the new value.

public void set(final int index, final int value) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset & 0b111111;
    long maskedValue = value & bitmask; // Ensure value has bitsPerValue bits

    data[offset] =
        // The previous long bits with b zero bits in positions s to min(s + b - 1, 63)
        (data[offset] & ~(bitmask << startBit))
        // The new value shifted to replace the b zeros
        | (maskedValue << startBit);
}

Similarly, the mask for updating the second long is of the form $1 \dots 1 \underbrace{0 \dots 0}_{(s + b - 64) \text{ times}}$, since the second long holds the $s + b - 64$ most significant bits of the value. We could construct a bitmask and invert it as in the previous example. However, right shifting by $s + b - 64$ and then left shifting by this same value effectively clears the first $s + b - 64$ bits. Note that shifting by $64 - s$ instead would clear too many bits, zeroing part of the next element. The performance of both methods is comparable, but the former requires a bitmask construction (JMH benchmark).
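As a standalone illustration of the shift trick (the values here are arbitrary), clearing the lowest n bits of a long by shifting down and back up:

public class ShiftClearDemo {

    public static void main(String[] args) {
        long x = 0b1011_0110L;
        int n = 3;
        // The right shift discards the low n bits; the left shift
        // reintroduces them as zeros, leaving the upper bits untouched.
        System.out.println(Long.toBinaryString(x >>> n << n)); // 10110000
    }
}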

We have assembled the complete set method:

public void set(final int index, final int value) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset & 0b111111;
    long maskedValue = value & bitmask;

    data[offset] = (data[offset] & ~(bitmask << startBit)) | (maskedValue << startBit);

    if (startBit + bitsPerValue > 64) {
        int shiftBy = 64 - startBit;
        // Number of bits of the value stored in the second long
        int overflow = startBit + bitsPerValue - 64;
        data[offset + 1] = (data[offset + 1] >>> overflow << overflow) | (maskedValue >> shiftBy);
    }
}
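To see the full API in action, here is a hypothetical round trip (the constructor zero-initializes the backing long array, so unset indices read as 0):

public class CompactedArrayDemo {

    public static void main(String[] args) {
        // 32 values of 5 bits each, backed by ceil(32 * 5 / 64) = 3 longs
        CompactedArray array = new CompactedArray(32, 5);

        array.set(0, 22);  // fits within the first long
        array.set(12, 13); // bits 60 to 64: overlaps two longs

        System.out.println(array.get(0));  // 22
        System.out.println(array.get(12)); // 13
        System.out.println(array.get(1));  // 0, untouched
    }
}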

For better readability, one should replace 6 and 0b111111 by named constants. The documented class (with bounds checks and additional methods) and associated tests are available in this gist.

Appendix: Fast ceiling division by a power of 2

Let $a, k$ be non-negative integers, and let $b = 2^k$. We shall compute $\left\lceil \frac{a}{b} \right\rceil$. Note that this is different from $\left\lfloor \frac{a}{b} \right\rfloor + 1$, which can be implemented using integer division. For $a = b$, the former yields 1 while the latter gives 2. In fact, the result is off by 1 iff $b$ divides $a$. We construct a bitmask to obtain the remainder of the division:

int b = 1 << k; // 0b100...00
int mask = b - 1; // 0b011...11
int remainder = a & mask;

If $b$ divides $a$, the remainder is 0 and the expected result is $a / b$. Otherwise, the result is $\left\lfloor \frac{a}{b} \right\rfloor + 1$. Recall that dividing by $b$ is equivalent to right shifting by $k$, so we can express this in code as

int result = (a >>> k) + (remainder > 0 ? 1 : 0);

The second term won’t compile to a branch instruction on most modern JVMs and architectures.
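Putting the appendix together as a single helper (a sketch; the name ceilDivPow2 is mine, and a is assumed non-negative):

public final class BitMath {

    /** Computes ceil(a / 2^k) without floating point. */
    static int ceilDivPow2(int a, int k) {
        int mask = (1 << k) - 1; // 0b011...11
        int remainder = a & mask;
        return (a >>> k) + (remainder > 0 ? 1 : 0);
    }

    public static void main(String[] args) {
        System.out.println(ceilDivPow2(160, 6)); // ceil(160 / 64) = 3
        System.out.println(ceilDivPow2(128, 6)); // ceil(128 / 64) = 2
    }
}

The CompactedArray constructor can use this helper in place of Math.ceil.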