December 28, 2018

Varints

While coding Jacobin, a binary stream reader library, I learned about low-level details of data serialization such as endianness, character encodings…, and one interesting compression scheme: variable length encoding.

Variable length encoding is a technique used to compress down integers into a smaller number of bytes than is normally needed. Generally, computers use fixed length types, meaning data is stored in a fixed number of bytes in memory, which makes performing arithmetic and casting require less CPU cycles. However, when transmitting data through the network or storing it on disk, it’s important to compress it down in order to save bandwidth and decrease disk usage, generally resulting in hefty performance gains and reduced infrastructure costs.

Let’s imagine we are in charge of designing a thermometer app that periodically stores temperature delta values on disk. If our app stored these values as a fixed length integer type, say a 64-bit signed integer, we would be wasting space most of the time since delta values tend to have a much smaller range. Specifically, if most delta values fit in a single byte, they would follow a Poisson distribution with $\lambda\approx1$ , where $\lambda$ is the average non-zero bytes needed to represent a delta value.

Distribution of byte lengths

Varints are based on this same premise: smaller numbers in computing are more common than larger ones. The trade-off variable length encoding makes is to spend more bytes to represent larger numbers (while using less to represent smaller ones). In this post we explore one of the most common ways of encoding varints: continuation bits.

Little Endian Base-128 encoding

RPC libraries like Google Protocol buffers or network protocols from games such as Minecraft use this technique, where the main idea is to split the number we want to encode into 7-bit blocks and use a byte (8 bits) to encode each one of them. The 7 lower bits of each byte are used to store the binary representation of the block, and the most significant bit, called the continuation bit, indicates whether there are further bytes to come, i.e. if its value is 1 it means the byte is a continuation block, whereas a 0 value means that we reached the end of the varint, the termination block.

Coming back to the thermometer app example, all delta values in the range 0 through 127 would only need one byte to be encoded (this is why this encoding is base-128). Let’s encode the number 23:

$23_{10} = 2^4+2^2+2^1+2^0 = \underbrace{10111_2}_{\mathrm{binary\ representation}}$

This number only has 5 bits, so it fits in a single 7-bit block. This number would be padded with zeros and sent as 0001 0111, where the top (leftmost) bit is 0, indicating there are no more bytes coming. We now know how to encode the first 128 integers, but what about all the other ones? Let’s encode the number 62129 step by step:

$62129_{10} = 1111001010110001_2$

We first split up the number into 7-bit blocks (starting from the least-significant block):

Then, we set the most significant bit of each block to 1, except for the leftmost block, the termination block, which we will pad with zeros instead:

You might think this doesn’t make sense, because 1 indicates a continuation block and 0 the termination block, but remember this encoding is little endian, meaning the least significant group comes first, so we encode the number by reversing the order of the blocks:

That’s it! The number 62129 can be encoded using 3 bytes. Notice how the top bit of each byte now tells if there’s more blocks coming: if it’s a 1, we know we need to keep listening to receive the full number, whereas a 0 value means that the block is the termination block.

Decoding happens in reverse order: for each byte we receive, we need to take the 7 least-significant bits by using a bit mask, specifically block & 0x7F, multiply them by $2^{7n}$ , where $n$ is the current (0-based) index of the input type, and break out of the loop when we encounter a byte that has a continuation bit equal to 0.

Varints are really powerful since they can encode a number of arbitrary size! With a fixed length representation you are limited by a maximum value, but with varints you can encode big and small numbers alike. Now, this doesn’t mean you should change all your integer types to varints, because size reduction comes at a performance cost. Before doing any type change you should ask yourself these questions:

Do the values I send/store follow a distribution varints would be beneficial for? If the probability of sending a larger integer is the same as of a smaller one, variable length encoding won’t really decrease, and can even increase (remember larger integers take a larger number of bytes than their binary representation, specifically $8/7$ ths more) the space you need to represent your data.
Does the performance gained by decreasing network delay or disk read times outweight the varint encoding/decoding overhead?

Pros

Base-128 encoding has several advantages over other variable length encoding methods, here’s some of them:

Varints are byte aligned, which means there’s no need to pad encoded numbers to a byte boundary.
Highly space-efficient. Any 64-bit number can be encoded in at most 10 bytes.
Varints are self-delimiting, which means you can concatenate them. Because it’s always clear where one value ends and another begins, it’s possible to write them out in order. Note that you should still check for overflows in case you miss a UDP packet or a disk read is erroneous.

Java implementation

Here’s a Java implementation of little endian base-128 encoding and decoding:

static final int VALUE_MASK = 0b01111111;
static final int CONTINUATION_BIT_MASK = 0b10000000;

public static int readVarInt(InputStream stream) throws IOException {
    int readBytes = 0;
    int result = 0;

    byte current;

    do {
        current = (byte) stream.read();

        // Get the block value
        byte value = (byte) (current & VALUE_MASK);

        // Shift block by 7 * readBytes and sum it to result
        result |= (value << (7 * readBytes));

        readBytes++;

        // Check if the value overflows
        if (readBytes > 5) {
            throw new RuntimeException("VarInt is too big");
        }
    } while ((current & CONTINUATION_BIT_MASK) != 0); // Break if `current` is a termination byte

    return result;
}

public static void writeVarInt(OutputStream stream, int value) throws IOException {
    do {
        byte current = (byte) (value & VALUE_MASK);

        // Remove the block (7 bits) read in `current`
        // The >>> operator also shifts the sign bit
        value >>>= 7;

        if (value != 0) {
            // We still have more blocks
            current |= CONTINUATION_BIT_MASK;
        }

        stream.write(current);
    } while (value != 0);
}

There is a large amount of other really cool encoding schemes with diverse trade-offs. Systematically analyzing the distribution of your numbers is the key to choosing or even creating your own encoding scheme.

Little Endian Base-128 encoding

Pros

Java implementation

Further reading