Sorted integer compression

In the database field and information retrieval in general, there is a very common scenario. I have a list of (sorted) integers that I want to store, and I want to do that in an as efficient a manner as possible. There are dozens of methods to do this and this is a hot topic for research. This is so useful because there are so many places where you can operate on a sorted integer list and gain massive benefits. Unlike generic compression routines, we can usually take advantage of the fact that we understand the data we are trying to work with and get better results.

The reason I need to compress integers (actually, int64 values) is that I’m trying to keep track of matches for some data, so the integers that I’m tracking are actually file offsets for user’s data inside of Voron. That lead to a few different scenarios that I have to deal with:

There is a single result
There is a reasonable number of results
There is a boatload of results

I’m trying to figure out what is the best way to store the later two options in as efficient manner as possible.

The first stop was Daniel Lemire’s blog, naturally, since he has wrote about this extensively. I looked at the following schemes: FastPFor and StreamVByte. They have somewhat different purposes, but basically, FastPFor is using a bits stream while StreamVByte is using byte oriented mode. Theoretically speaking, you can get better compression rate from FastPFor, but StreamVByte is supposed to be faster. Another integer compression system come from the Gorilla paper from Facebook, that is a bigger scheme, which include time series values compression. However, part of that scheme talks about how you can compress integers (they use that to store the ticks of a particular operation). We are actually using that for the time series support inside of RavenDB.

I’m not going to cover that in depth, here is the paper on Gorilla compression, the relevant description is at section 4.1.1. Suffice to say that they are using a bit stream and delta of deltas computation. Basically, if you keep getting values that are the same distance apart, you don’t need to record all the value, you can compute that naturally. In the best case scenario, Gorilla compression needs a single bit per value, assuming the results are spaced similarly.

For my purpose, I want to get as high a compression rate as possible, and I need to store just the list of integers. The problem with Gorilla compression is that if we aren’t getting numbers that are the same distance apart, we need to record the amount that they are different. That means that at a minimum, we’ll need a minimum of 9 bits per value. That adds up quickly, sadly.

On the other hand, with PFor, there is a different system. PFor computes the maximum number of bits required for a batch of integer, and then record just those values. I found the Binary Packing section (2.6) in this paper to be the clearest explanation on how that works exactly. The problem with PFor, however, is that if you have a single large outlier, you’ll waste a lot of bits unnecessarily.

I decided to see if I can do something about that and created an encoder that works on batches of 128 integers at a time. This encoder will:

Check the maximum number of bits required to record the deltas of the integers. That along already saves us a lot.
Then we check the top and bottom halves of the batch, to see if we’ll get a benefit from recording them separately. A single large value (or a group of them) that is localized to a part of the batch will be recorded independently in this case.
Finally, instead of only recording the meaningful bit ranges, we’ll also analyze the batch we get further. The idea is to try to find ranges within the batch that have the same distance from one another. We can encode those as repetitions instead of each independent value. That can end up saving a surprisingly amount of space.

You can look at the results of my research here. I’ll caution you that this is raw, but the results are promising. I’m able to beat (in terms of compression rate) the standard PFor implementation by a bit, with a lot less code.

I’m also looking at a compression rate of 30% – 40% for realistic data sets. And if the data is setup just right and I’m able to take advantage of the repeated delta optimization, I can pack things real tight.

Currently numbers say that I can push upward of 10,000 int64 values in an 8KB buffer without any repeated deltas. It goes to just under 500,000 int64 values in an 8KB buffer if I can take full advantage of the deltas.

The reason I mention the delta so often, it is very likely that I’ll record values that are roughly the same size, so we’ll get offsets that are the same space from one another. In that case, my encoder goes to town and the compression rate is basically crazy.

This is a small piece of a much larger work, but this is the first time in a while that I got to code at Voron’s level. This is fun.

RavenDB

RavenDB Cloud

Try

Experience interactive demos and playground server

RavenDB Docs

RavenDB Cloud Docs

Documentation Guide

Download

Features

Performance

Comparison

What’s New

Demo

Bootcamp

Webinars

Workshops

Inside RavenDB Book

GitHub

StackOverflow

Articles

Whitepapers

Events

Promotional Materials

Unlock your business potential

Use Cases

Articles

Whitepapers

Press Releases

Industry Reports

Performance

Comparison

Proof of Concept Program

Academic Program

Events

What’s New

Roadmap

On-premise Pricing

Cloud Pricing

Support

Proof of Concept Program

Academic Program

Sorted integer compression

Woah, already finished? 🤯

Related Articles

CollabTalk Podcast | Episode 123 with Oren Eini–Building a business with Open Source foundations

RavenDB’s storage engine: Voron–unlocking the secret

Certificates from the Ground Up

Watch Live Demo

RavenDB

RavenDB Cloud

Try

RavenDB Docs

RavenDB Cloud Docs

Documentation Guide

Download

Features

Performance

Comparison

What’s New

Demo

Bootcamp

Webinars

Workshops

Inside RavenDB Book

GitHub

StackOverflow

Articles

Whitepapers

Events

Promotional Materials

Use Cases

Articles

Whitepapers

Press Releases

Industry Reports

Performance

Comparison

Proof of Concept Program

Academic Program

Events