Importing the Stack Overflow dataset into RavenDB

Around 2017 we needed to test RavenDB with realistic datasets. That was the time that we were working hard on the 4.0 release, and we wanted to have some common dataset that was production quality (for all the benefits and complications that this brings) to play with.

A serious issue was that we needed that dataset to also be public, because we wanted to discuss its details. The default dataset people usually talk about in such a scenario is the Enron emails, but that is around half a million documents and quite small, all things considered.

Luckily for us, Stack Overflow has made their dataset publicly available in a machine readable format. That means that we could take that, adapt that to RavenDB and use that to test various aspects of our behaviors with realistic data.

The data is distributed as a set of XML files, so I quickly wrote something that would convert the data to a JSON format and adapt the model to a more relational one. The end result was a dataset with 18 million documents and with a hefty size of 52 GB. I remember that at the time, working with this data was a lengthy process. Importing the data took a long time and indexing even longer.

A few years later, this is still our go-to dataset for anything involving non-trivial amount of data, but we have gotten to the point where the full process of working with it has shrunk significantly. It used to take 45+ minutes to import the data, now it takes less than 10, for example. Basically, we made RavenDB good enough that it wasn’t that much of a challenge.

Of course… Stack Overflow continues to publish their dataset… so I decided it was time to update their data again. I no longer have the code that I used to do the initial import, but the entire process was fairly simple.

You can look at the code that is used to do the import here. This is meant to be quick & dirty code, mind you. It is about 500 lines of code and handles a surprisingly large number of edge cases.

You can find the actual data dump here.

And the explanation about the schema is here.

There is also a database diagram here.

In case you missed the point, the idea is that I want to remember how I did it for the next time I’ll want to refresh our dataset.

So far, I imported a bunch of Stack Exchange communities:

World Building – Just over 100K documents and 1 GB in size. Small enough to play with seamlessly.
Super User – 1.85 million documents and weighing 4 GB in size. I think we’ll use that as the default database for showing things off on the Raspberry Pi edition.
Stack Overflow – 40.5 million documents and exceeding 150 GB in size. This is a great scenario for working with a significant amount of data. That is likely to be our new default benchmarking database.

The other advantage is that everyone is familiar with Stack Overflow. It makes for a great demo when we can pull up realistic data on the fly.

It already gave me some interesting details to explore. For example, enabling documents compression mode for the Super User community reduced the disk utilization to under 2 GB. That is a great space-saving, and it means that we can easily fit the entire database on a small SD card and have a “RavenDB Server + Database in a box” as a Raspberry Pi.

The Stackoverflow dataset is 150GB without compression, with documents compression, it dropped to just 57GB, which is all kinds of amazing.

They make for great demos Smile .

RavenDB

RavenDB Cloud

Try

Experience interactive demos and playground server

RavenDB Docs

RavenDB Cloud Docs

Documentation Guide

Download

Features

Performance

Comparison

What’s New

Demo

Bootcamp

Webinars

Workshops

Inside RavenDB Book

GitHub

StackOverflow

Articles

Whitepapers

Events

Promotional Materials

Unlock your business potential

Use Cases

Articles

Whitepapers

Press Releases

Industry Reports

Performance

Comparison

Proof of Concept Program

Academic Program

Events

What’s New

Roadmap

On-premise Pricing

Cloud Pricing

Support

Proof of Concept Program

Academic Program

Importing the Stack Overflow dataset into RavenDB

Woah, already finished? 🤯

Related Articles

CollabTalk Podcast | Episode 123 with Oren Eini–Building a business with Open Source foundations

RavenDB’s storage engine: Voron–unlocking the secret

Certificates from the Ground Up

Watch Live Demo

RavenDB

RavenDB Cloud

Try

RavenDB Docs

RavenDB Cloud Docs

Documentation Guide

Download

Features

Performance

Comparison

What’s New

Demo

Bootcamp

Webinars

Workshops

Inside RavenDB Book

GitHub

StackOverflow

Articles

Whitepapers

Events

Promotional Materials

Use Cases

Articles

Whitepapers

Press Releases

Industry Reports

Performance

Comparison

Proof of Concept Program

Academic Program

Events