Thinking in Bits
As a big proponent of Tonsky’s Performance first philosophy, I believe every software engineer should pay attention to performance from the start. If used in the right places, bitwise operators can increase the speed of your programs by several orders of magnitude — without compromising on readability.
However, unlike the performance low-hanging fruit found in complex frameworks, code that works at the bit level should be introduced early on in a project. Serialization protocols are mostly set in stone from day one, as updating them in a production environment normally requires downtime to perform the necessary migrations.
It’s good to practice fundamentals from time to time. You never know when these will come in handy. Sure, if you’re a frontend UI engineer, the & (different from &&), ^ or ~ operators may be outright new to you. This is not to say that they are useless or “too primitive”: the Apollo Guidance Computer was entirely built from NOR gates.
To explore some use cases of these operators, we’re going to design a program to parse Minecraft’s Chunk Data packet. Video games rely on various data structures to minimize the number of bytes sent over the wire and reduce latency. In particular, this packet contains the game’s block data serialized as a compacted array of values. This data structure must be designed with performance in mind, since a Minecraft server can send up to 120 packets per player per second (Alstad, T. et al. (2015). Minecraft Computer Game Performance Analysis and Network Traffic Emulation by a Custom Bot. Science and Information Conference (SAI), 227-236).
A compacted array holds a fixed number length of integers with bitsPerValue bits each. It has the following API:
public class CompactedArray {

    public CompactedArray(final int length, final int bitsPerValue) {
        // ...
    }

    public int get(final int index) {
        // ...
    }

    public void set(final int index, final int value) {
        // ...
    }
}
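To get a feel for the intended usage, here is a hypothetical sketch (the constructor arguments are arbitrary):

// 4096 values of 5 bits each: enough for values in [0, 31]
CompactedArray blocks = new CompactedArray(4096, 5);
blocks.set(0, 20);
int stored = blocks.get(0); // 20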
The main difference between CompactedArray and a regular int[] is that it supports words of any size less than or equal to 32. Indeed, if bitsPerValue were 8, 16, 32 or 64, the primitive array counterpart would outperform the CompactedArray implementation: method invocation and object memory overhead come to mind. In fact, Minecraft used to store block IDs in a byte array, but they quickly ran out of IDs for new blocks and overhauled the game’s storage and protocol format.
It seems natural that CompactedArray should be backed by a primitive array given their similarities: both hold a fixed number of values of a certain bit length. But what type should we employ?
In the worst-case scenario (bitsPerValue is 32), retrieving a CompactedArray value from a byte array requires 4 loads. Assuming the values are in the processor’s L1 cache, this should only take 4 cycles (see the Intel Optimization Reference Manual). Moreover, the JVM bounds-checks every array access, so minimizing memory accesses yields better performance. A short array might still need 2 loads if bitsPerValue is greater than 16.
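For intuition, here is what assembling one 32-bit value out of a byte[] backing array would look like in the aligned case (an illustrative sketch; bytes and j are hypothetical, and little-endian packing is assumed):

// Four loads (each bounds-checked by the JVM), plus masks and shifts
int value = (bytes[j] & 0xFF)
        | (bytes[j + 1] & 0xFF) << 8
        | (bytes[j + 2] & 0xFF) << 16
        | (bytes[j + 3] & 0xFF) << 24;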
See the pattern? As the backing array element size increases, the number of required load instructions decreases. This means a long array is the best candidate:
public class CompactedArray {

    private final long[] data;
    private final int bitsPerValue;

    public CompactedArray(final int length, final int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        // TODO Initialize data array
    }

    // ...
}
A new question arises: how big should the data array be? For brevity, let $n$ and $b$ be length and bitsPerValue respectively. The problem is equivalent to finding the least multiple of 64 which can store $nb$ bits. That is, the data array length must be

$$\left\lceil \frac{nb}{64} \right\rceil,$$

where $\lceil \cdot \rceil$ denotes the ceiling function. For example, if we wanted to store 32 values of 5 bits each, the data array length would be $\lceil 32 \cdot 5 / 64 \rceil = \lceil 2.5 \rceil = 3$.
public class CompactedArray {

    public CompactedArray(final int length, final int bitsPerValue) {
        // Using Math.ceil is suboptimal, see the appendix
        this.data = new long[(int) Math.ceil(length * bitsPerValue / 64D)];
        // ...
    }

    // ...
}
A CompactedArray can now be constructed, but it’s pretty useless in its current form. Let’s first work on value retrieval. The choice of a long array as the backing structure means a value can be found within a single long, or overlapping between 2 longs, where the least significant bits (LSBs) can be found in the first long and the remaining ones at the beginning of the second. More formally, given a value starting at bit position $s$ of a long, the number of bits of that value stored in that long is $\min(b, 64 - s)$, meaning $\max(0, s + b - 64)$ bits are stored in the next long. As an exercise, find how many bytes a $b$-bit value can overlap.
Given a CompactedArray element at index $i$, we know its first bit is stored in position $ib$, which lies within the $\lfloor ib / 64 \rfloor$-th long. The notation is making this seem worse than it is. In code,
public int get(final int index) {
    int bitOffset = index * bitsPerValue;
    // Integer division truncates the result
    int offset = bitOffset / 64;
    // ...
}
The right-shift operator
It’s time to introduce the first bitwise operator: the right shift (>>). Imagine we want to compute $a / b$. If $b$ can be expressed as a power of 2, i.e. $b = 2^x$ for some $x$, right shifting $a$ by $x$, a >> x, is equivalent to dividing by $b$.
In Algebra class you probably learnt that dividing exponentials with the same base is equivalent to subtracting their exponents. Notice that every bit of an int represents a power of 2. For example, letting bitOffset be 197, offset is

$$\frac{197}{64} = \frac{11000101_2}{2^6} = 11_2 + \frac{000101_2}{2^6} = 3 + \frac{5}{64}.$$

As 64 is equal to $2^6$, bitOffset / 64 is the same as bitOffset >> 6. We’re not interested in the fractional part (the negative powers of 2), so it is discarded.
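A quick check of the equivalence (illustrative snippet; note the two differ for negative operands, where / truncates toward zero while >> rounds toward negative infinity, so we rely on offsets being non-negative):

int bitOffset = 197;
System.out.println(bitOffset / 64); // 3
System.out.println(bitOffset >> 6); // 3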
The bitwise AND operator
Having computed the data array index where the LSBs of the value can be found, it’s time to figure out which bits we need to extract from it. The starting bit position within the long is $ib \bmod 64$. In code,
public int get(final int index) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset % 64;
    // ...
}
This works, but a modulo operation involves a division. Divisions are much more expensive than bitwise operations (see “A fast alternative to the modulo reduction”).
We introduce the AND (&) operator. Given two binary numbers $x$ and $y$, we want to obtain what 1-bits they have in common. For each power of 2, this operator compares the bits from each argument: if both are 1, the resultant bit is 1. Otherwise, it is 0. For instance, $1100_2 \,\&\, 1010_2 = 1000_2$.

Representing bitOffset in binary, i.e. as $a_{31} a_{30} \cdots a_1 a_0$ where $a_i$ is the $i$-th bit of bitOffset, one can factorize and represent this value as $2^6 \cdot (a_{31} \cdots a_6) + a_5 a_4 a_3 a_2 a_1 a_0$. Reducing this value modulo $2^6$ results in the non-factorized part $a_5 a_4 a_3 a_2 a_1 a_0$. This is the same as extracting the first 6 bits using bitOffset & 0b111111, which is much faster. This technique is known as bit masking.
The problem is now divided in two: if the value fits in the long (i.e. $s + b \leq 64$) we need to extract $b$ bits by applying a bit mask. However, if the value is overlapped between two longs, we need to join the least and most significant bits (MSBs).
The first case can be expressed as a “power-of-2 filter”. The value is equal to the sum of all bits that lie between positions $s$ and $s + b - 1$. That is,

$$\text{value} = \sum_{i=s}^{s+b-1} d_i \, 2^{i-s},$$

where $d_i$ is the $i$-th bit of data[offset]. The inequality $s \leq i \leq s + b - 1$ is equivalent to $0 \leq i - s \leq b - 1$. Applying this shift by $s$ in code is expressed as data[offset] >>> startBit (the extra > tells Java to also shift the sign bit; see the JLS, Section 15.19), moving bits in positions $s$ to $63$ down to positions $0$ to $63 - s$. However, this right shift is insufficient, as it also keeps all bits originally in positions greater than or equal to $s + b$. To ignore these, we apply a bitmask that only leaves the first $b$ bits:
public class CompactedArray {

    // ...

    private final long bitmask;

    public CompactedArray(final int length, final int bitsPerValue) {
        // ...
        // Left shifting 0b1 by bitsPerValue results in 2^bitsPerValue.
        // If bitsPerValue were 5, 2^5 - 1 = 0b100000 - 0b1 = 0b011111,
        // the bitmask we're interested in.
        this.bitmask = (1L << bitsPerValue) - 1;
    }

    public int get(final int index) {
        int bitOffset = index * bitsPerValue;
        int offset = bitOffset >> 6;
        int startBit = bitOffset & 0b111111;
        long lsbs = (data[offset] >>> startBit) & bitmask;
        // ...
    }
}
The left-shift operator
What about elements contained within two longs? Since the LSBs of the value are stored in the upper-most bits of the first long, Java fills the MSBs of data[offset] >>> startBit with zeros after shifting. The first long contains $64 - s$ bits of the final value, meaning the next long contains the remaining $b - (64 - s)$ bits. We need to somehow join the LSBs from the first long (that we already have) with the MSBs stored in the LSBs of the second long, data[offset + 1].
Directly adding these bits would yield an incorrect result since the MSBs are misplaced: when bitsPerValue is 18, the value at index 3 is stored in bits 54 through 63 of the first long and bits 0 through 7 of the second long. If we shift the latter positions to the left, both values fit together like puzzle pieces.
More rigorously, the number of positions we have to shift the MSBs by is exactly the number of bits we have from the first long, $64 - s$.
In contrast to its sibling, left shifting $a$ by $x$, a << x, is equivalent to multiplying $a$ by $2^x$. To sum up, given the bits of the second long,

$$\text{msbs} = \left( \text{data}[\text{offset} + 1] \cdot 2^{64 - s} \right) \bmod 2^b,$$

which in code is
int shiftBy = 64 - startBit;
// Apply the bitmask to zero out the upper bits we're not interested in
long msbs = (data[offset + 1] << shiftBy) & bitmask;
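Tracing the bitsPerValue = 18, index 3 example from above (the array contents are made up for illustration):

long[] data = {0b1010110011L << 54, 0b10111101L};   // hypothetical block data
long bitmask = (1L << 18) - 1;
int startBit = 54;                                  // (3 * 18) & 0b111111
long lsbs = (data[0] >>> startBit) & bitmask;       // bits 54..63 -> positions 0..9
long msbs = (data[1] << (64 - startBit)) & bitmask; // bits 0..7 -> positions 10..17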
The bitwise OR operator
Given the binary numbers $x$ and $y$, the OR (|) operator tells us what bits of $x$ or $y$ are 1. For each position, the operator compares the bits from each argument: if either or both of them are 1, the resultant bit is 1. Otherwise, it is 0. For example, $1100_2 \mid 1010_2 = 1110_2$.
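In code (illustrative):

System.out.println(0b1100 | 0b1010);     // 14 (0b1110)
// When the operands share no 1-bits, OR combines them without interference
System.out.println(0b110000 | 0b000011); // 51 (0b110011)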
Since the LSBs and MSBs of value don’t overlap (all the bits to the left of the LSBs are zero, and all the bits to the right of the MSBs are zero), msbs | lsbs yields the value we’re looking for.
public int get(final int index) {
    // ...
    return (int) (msbs | lsbs);
}
Given an index, all that is left is to check which case we’re in: does the value fit in a single long or does it overlap? This is easy: we only need to check whether the last bit offset, startBit + bitsPerValue, is greater than 64. Another alternative is comparing the long offsets of the first and last bits:
int lastBitOffset = (bitOffset + bitsPerValue - 1) >> 6;
if (offset != lastBitOffset) {
    // The starting bit and ending bit are in different longs
}
On my machine, this is 1.7x slower than the greater-than comparison (JMH benchmark). Generally, integer equality and inequality comparisons take the same number of CPU cycles to run (this SO answer explains it in more detail). The second method introduces a right shift, which takes an extra clock cycle and confuses the branch predictor. I challenge you to create a (faster) branchless version — I’ve tried and failed.
Putting it all together,
public int get(final int index) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset & 0b111111;
    // Get least significant bits
    long value = (data[offset] >>> startBit) & bitmask;
    if (startBit + bitsPerValue > 64) {
        int shiftBy = 64 - startBit;
        // OR shifted most significant bits
        value |= (data[offset + 1] << shiftBy) & bitmask;
    }
    return (int) value;
}
The set method involves the same offsets. Computing the updated data[offset] and data[offset + 1] values is a tad more tricky though. Given the previous data, this method shall replace the relevant bits with the new integer value of bitsPerValue bits. Its LSBs will go in positions $s$ to $\min(s + b - 1, 63)$ of the updated data[offset].
We already know how to align these bits. Namely, we apply the bitmask (to ensure the given value has exactly bitsPerValue bits; never trust the caller!) using the AND operator and left shift the LSBs startBit positions. The resultant value is of the form

$$\underbrace{0 \cdots 0}_{64 - s - b} \; v_{b-1} \cdots v_0 \; \underbrace{0 \cdots 0}_{s}.$$

However, a new problem arises: how do we update these bits while leaving others untouched?
The bitwise complement operator
By zeroing the bits of data[offset] corresponding to the given index, we get two non-overlapping longs we can OR to get the new value. For this, we construct a bitmask such that, when applied to the old value (by using the AND operator), only the bits we don’t want to update remain. If we left shift bitmask by startBit positions we get a long of the form

$$\underbrace{0 \cdots 0}_{64 - s - b} \; \underbrace{1 \cdots 1}_{b} \; \underbrace{0 \cdots 0}_{s}.$$

This is the opposite of what we want.
The bitwise complement (~) operator inverts the bits of its operand. It is a unary operator: it only takes one operand. It makes every 0 bit a 1 bit and every 1 bit a 0 bit, e.g. $\sim\! 1100_2 = 0011_2$ (restricting ourselves to 4 bits).
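For instance, with bitsPerValue = 5 and startBit = 3 (numbers chosen purely for illustration), the inverted mask keeps every bit except the 5 we are about to overwrite:

long bitmask = (1L << 5) - 1;  // 0b11111
long keep = ~(bitmask << 3);   // ones everywhere except bits 3..7
// oldValue & keep zeroes bits 3..7, ready to be OR'd with the shifted new value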
We apply the inverted mask and OR the result with the shifted LSBs to get the new value.
public void set(final int index, final int value) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset & 0b111111;
    long maskedValue = value & bitmask; // Ensure value has bitsPerValue bits
    data[offset] =
            // The previous long's bits, with zeros in positions startBit to min(startBit + bitsPerValue - 1, 63)
            (data[offset] & ~(bitmask << startBit))
            // The new value shifted to replace the zeros
            | (maskedValue << startBit);
}
Similarly, the mask for updating the second long is of the form

$$\underbrace{1 \cdots 1}_{128 - s - b} \; \underbrace{0 \cdots 0}_{s + b - 64}.$$

We could construct a bitmask and invert it as in the previous example. However, right shifting by $s + b - 64$ and then left shifting by this same value effectively clears the first $s + b - 64$ bits. The performance of both methods is comparable, but the former requires a bitmask construction (JMH benchmark).
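A quick illustration of the double-shift trick on a made-up value, clearing its lowest 3 bits:

long x = 0b10110110;
System.out.println(Long.toBinaryString(x >>> 3 << 3)); // 10110000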
We have assembled the complete set method:
public void set(final int index, final int value) {
    int bitOffset = index * bitsPerValue;
    int offset = bitOffset >> 6;
    int startBit = bitOffset & 0b111111;
    long maskedValue = value & bitmask;
    data[offset] = (data[offset] & ~(bitmask << startBit))
            | (maskedValue << startBit);
    if (startBit + bitsPerValue > 64) {
        int shiftBy = 64 - startBit;
        // Number of bits of the value stored in the second long
        int msbCount = startBit + bitsPerValue - 64;
        data[offset + 1] = (data[offset + 1] >>> msbCount << msbCount)
                | (maskedValue >> shiftBy);
    }
}
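A minimal round-trip sketch (the parameters are arbitrary; prefer the bounds-checked gist version in real code):

CompactedArray array = new CompactedArray(32, 18);
for (int i = 0; i < 32; i++) {
    array.set(i, i * 1000); // 18 bits fit values up to 2^18 - 1 = 262143
}
System.out.println(array.get(3)); // 3000: this value straddles two longs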
For better readability, one should replace 6 and 0b111111 with named constants. The documented class (with bounds checks and additional methods) and associated tests are available in this gist.
Appendix: Fast ceiling division by a power of 2
Let $a$ and $k$ be non-negative integers, and let $b = 2^k$. We shall compute $\lceil a / b \rceil$. Note that this is different from $\lfloor a / b \rfloor$, which can be implemented using integer division: for $a = 3$ and $b = 2$, the former yields 2 while the latter gives 1. In fact, the two results differ by 1 iff $b$ does not divide $a$. We construct a bitmask to obtain the remainder of the division:
int b = 1 << k;       // 0b100...00
int mask = b - 1;     // 0b011...11
int remainder = a & mask;
If $b$ divides $a$, the remainder is 0 and the expected result is $a / b$. Otherwise, the result is $\lfloor a / b \rfloor + 1$. Recall that dividing by $2^k$ is equivalent to right shifting by $k$, letting us express this in code as
int result = (a >>> k) + (remainder > 0 ? 1 : 0);
The second term won’t compile to a branch instruction on most modern JVMs and architectures.
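Putting the appendix together as a single helper (a sketch; ceilDivPow2 is a made-up name, a is assumed non-negative and 0 <= k < 32):

// Computes ceil(a / 2^k) without resorting to floating point
static int ceilDivPow2(final int a, final int k) {
    int mask = (1 << k) - 1;  // 0b011...11, the low k bits
    int remainder = a & mask; // a mod 2^k
    return (a >>> k) + (remainder > 0 ? 1 : 0);
}

For example, ceilDivPow2(32 * 5, 6) yields 3, matching the constructor example from earlier.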