Playing With the StackOverflow Datasets: All the Wrong Ways

DZone 's Guide to

Playing With the StackOverflow Datasets: All the Wrong Ways

Ayende Rahien continues to toy around with StackOverflow data, running it through the Voron storage engine to see what works and what doesn't work.

· Database Zone ·
Free Resource

As I mentioned, given a 45 GB file containing all the StackOverflow questions since 2008, I wanted to transform items like this:

Into this:

This turns out to be rather tricky to do. The file is sorted by the row ID, but there are tricks here:

  • You can’t assume that all the answers to a question are located near the question itself.

Ideally, we could keep a dictionary of the questions and their answers until we have all the answers to the question, they write it all out. The problem here is that a question might have answered years after it was asked, which means that the answer id would place it very far from the question in the file, which means that we have to keep the question in memory for a very long time. In practice, this happens quite frequently, and it plays havoc with trying to limit the memory utilization for this approach.

  • While you would expect that answers will always have a row id that is higher than their questions, that isn’t always the case.

Let us take this question, about Haskell monads, which has the following answer.

The answer ID is: 2388.

The question ID is: 44965.

That means that if you scan the file sequentially, you are going to find the answers before their questions, sometimes a lot before (I’m guessing that this happens if you have edits to a question, so a popular question might get edited long after it was asked and had answers).

The next thing I tried was to group all the questions together, then process them. Because I can’t do that in memory, I tried writing them to disk directly:

  • /questions/44965/q.
  • /questions/44965/2388.

The idea is that I can write them out to the file system, and then traverse the file system to gather all of them. The problem is that we are talking about over 32 million questions and answers, and that isn’t really going to work for us using most normal file systems.

So I tried writing it all to a ZIP file, which was successful, but resulted in a zip file that was 21GB in size and was very good in ensuring that nothing could open it before I run out of patience.

Eventually, I just gave us and used Voron to handle it. Here are the relevant sections:


Remember that Voron stores everything in a B+Tree and allows ordered operations, so once this is done, all I need to do is traverse the tree, get all the records that have the same parent id, and voila, I have the grouping I wanted.

If I wanted to do it without the use of a database (or Voron), I would probably go with the following manner:

  • Read each entry from the source data, and write it to another file, noting its location.
  • Write “parent id, id, location” to a separate index file
  • Sort the index file.
  • Start reading from the index file as long as I have the same parent id, and merge those questions.

Why a separate file? Because XmlReader does buffering. By writing it out, we have the proper position to seek to.

Note that this will require a lot of random I/O, but that is pretty much just what it has to be.

b+tree, database, memory, stackoverflow

Published at DZone with permission of Oren Eini , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}