
The Importance of a Data Format Part 3 — The Solution


In this post, I'm going to focus primarily on the write side of things—taking JSON as an input and producing a block of memory that represents this JSON. Read on to learn more.


In my previous two posts, I explained the problems we may run into and discussed the requirements for the solution. To quickly reiterate, we don't want to use JSON as the internal storage format in RavenDB because it requires parsing whenever we need to do anything to the document. We want to have a format that is ready to be used without any parsing.

In order to do that, we split the behavior into two distinct responsibilities. We'll have a writer that takes JSON as an input and produces a block of memory that represents this JSON, and a reader that can read from this block of memory and, if needed, restore the original JSON.

For now, I'm going to focus primarily on the write side of things. Because we want to reduce the amount of allocation and retained memory, we cannot use any format that requires reading the whole document before writing it. So, we need to use a forward-only mode for writing. This plays nicely with memory allocation strategies, because we don't need to move data once we've written it. In other words, whenever we read a value from the JSON stream, we need to write it to our own format. That doesn't seem very useful, until you realize that we can delay some decisions.
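The forward-only idea can be sketched as follows. This is a hypothetical illustration, not the RavenDB code: every value is appended to the output buffer as soon as it is read, and we only remember where it landed so the metadata can be written afterwards.

```python
from io import BytesIO

class ForwardOnlyWriter:
    """Sketch: values are written the moment they are read; only their
    positions are remembered, so metadata decisions can be delayed."""

    def __init__(self):
        self.buf = BytesIO()

    def write_string(self, value: str) -> int:
        """Write a length-prefixed string and return its start position."""
        pos = self.buf.tell()
        data = value.encode("utf-8")
        self.buf.write(bytes([len(data)]))  # single-byte length prefix (assumes < 256 bytes)
        self.buf.write(data)
        return pos

    def write_array_metadata(self, positions) -> int:
        """After all values are written, emit the count and backward offsets."""
        meta_start = self.buf.tell()
        self.buf.write(bytes([len(positions)]))
        for pos in positions:
            self.buf.write(bytes([meta_start - pos]))  # offset back from metadata start
        return meta_start

w = ForwardOnlyWriter()
positions = [w.write_string(s) for s in ["Arava", "Oscar", "Phoebe"]]
meta = w.write_array_metadata(positions)
```

Nothing written earlier ever has to move, which is exactly what makes the single-pass, allocation-friendly approach work.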

It will probably be easier to understand if we use a real example:

{
   "Name":"Oren",
   "Dogs":[
      "Arava",
      "Oscar",
      "Phoebe"
   ],
   "Age":34,
   "Office":{
      "Name":"Hibernating Rhinos",
      "Street":"Hanashi 21",
      "City":"Hadera"
   }
}

This document shows quite a lot of features from JSON, so it is very suitable for demonstration. We read a token at a time from the document and write it in the new format.

Let us look at how we'll deal with the Dogs array value (just this: ["Arava","Oscar","Phoebe"] ). Loosely speaking, this will look like this:

(Image not included: the raw byte layout of the array in the new format.)

And, here it is with explanations:

  • Len 5 – Arava
  • Len 5 – Oscar
  • Len 6 – Phoebe
  • Array size – 3
  • Offsets from metadata start – [19,13,7]
  • Types – [5,5,5]

First, we have the values, as length prefixed strings; then we have the number of elements in the array; and finally, we have a run of offsets. So far, it is obvious, but what are the offsets? They indicate how far back (from the array metadata start position) we need to go to get to a particular value. In other words, in order to read this array we actually need to get a pointer to the array metadata start, and from there we can tell how many elements we have in the array. Then we can jump directly to the relevant value.

For example, going to the second element will mean going 13 bytes backward from the metadata beginning, which will give us a byte whose value is 5 (length of the string) and then the actual string bytes.

The array after the offsets contains the types of each value. In this case, this is an array of 3 strings, so we have [5,5,5]. Splitting the offsets and the types in this manner means that we have better alignment for the values, and it is simpler to deal with in general. By the way, type 5 is a string.
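Putting the bullet list above into actual bytes makes the reading side concrete. This is a sketch under the assumptions stated so far (single-byte lengths and offsets); the exact encoding in the real format differs:

```python
# The Dogs array from the bullet list, written out by hand:
# length-prefixed values, then the count, then backward offsets, then types.
blob = (
    b"\x05Arava"      # offset 0:  len 5 + "Arava"
    b"\x05Oscar"      # offset 6:  len 5 + "Oscar"
    b"\x06Phoebe"     # offset 12: len 6 + "Phoebe"
    b"\x03"           # offset 19 (metadata start): array size = 3
    b"\x13\x0d\x07"   # backward offsets [19, 13, 7]
    b"\x05\x05\x05"   # types: 5 = string
)

def read_element(blob: bytes, meta_start: int, index: int) -> str:
    """Read one array element without touching any of the others."""
    count = blob[meta_start]
    assert index < count
    back = blob[meta_start + 1 + index]   # backward offset for this element
    pos = meta_start - back               # jump back to the value
    length = blob[pos]                    # length prefix
    return blob[pos + 1 : pos + 1 + length].decode("utf-8")

print(read_element(blob, 19, 1))  # second element: 13 bytes back -> "Oscar"
```

Note that reading the second element never looks at the first or third values at all; that is the whole point of the offset run.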

Why are we using backward references in this manner? Why not hold the absolute position in the document? That would certainly be much simpler, no? We use relative offsets because a large JSON document is typically composed of many small objects and arrays. By using an offset from the metadata start, we can ensure that most of the time this data will be small enough to fit in a single byte, which saves quite a significant amount of memory. Of course, we also handle larger objects, including those that need offsets spanning the full 32 bits range. We have scoped the decision of the offset size to the level of a single array/object, because even in big documents, we have a lot of places where a byte or short offset would be more than good enough.
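That per-object decision about offset width can be sketched like this (a hypothetical helper, not the actual RavenDB code):

```python
import struct

def offset_width(max_offset: int) -> int:
    """Pick the smallest offset size (in bytes) that can hold every
    backward offset within a single object or array."""
    if max_offset <= 0xFF:
        return 1    # byte offsets: the common case for small objects
    if max_offset <= 0xFFFF:
        return 2    # short offsets
    return 4        # full 32-bit offsets for very large documents

def write_offsets(offsets):
    """Encode all offsets of one object using the chosen width."""
    width = offset_width(max(offsets))
    fmt = {1: "B", 2: "<H", 4: "<I"}[width]
    return b"".join(struct.pack(fmt, o) for o in offsets), width

data, width = write_offsets([19, 13, 7])  # the Dogs array fits in byte offsets
```

Because the choice is scoped per object, one huge array deep in a document doesn't force every small object around it to pay for 4-byte offsets.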

One thing to note: in JSON format, the array takes 27 bytes; in this format, it takes 26. We saved a whole whopping byte, cue the party noise.

But, that isn't really interesting yet. Let us see how we deal with objects, in particular, the Office value (just this: {"Name":"Hibernating Rhinos","Street":"Hanashi 21","City":"Hadera"} ). This looks like the following:

(Image not included: the raw byte layout of the Office object in the new format.)

Again, changing to textual format gives:

  • Len 18 – Hibernating Rhinos
  • Len 10 – Hanashi 21
  • Len 6 – Hadera
  • Properties count – 3
    • Offset – 7, PropertyId – 5, Type – 5
  • Offset – 37, PropertyId – 0, Type – 5
    • Offset – 18, PropertyId – 4, Type – 5

(To the sharp eyed among you, the last byte (00), is not really there, and shouldn't be accounted for. It is an artifact of how I printed those values.)

In JSON, this object is 67 bytes long, in this format, it is merely 48 bytes. Yep, maidens swoon when they hear about how much space this saved.

But, we are still missing things. To start with, where are the actual property names and what are those property ids that we have been seeing?

One piece of data that we aren't writing as we read the source document is the property names. Indeed, so far, the biggest saving we have seen is from not writing the properties. So, the question is: where are they? The answer is simple: we wait until we finish with the whole document before we write the property names. Instead of writing a property name, we write only the property id. A property id is just the index of the property in the list of properties we have seen so far.

In other words, we have a small benefit from not having to repeat property names. In the document above, we saved the need to write "Name" twice, because of this feature. At the end of the document, we write the property names just like we write values, but because we need to reference them by id, we also write their offsets into an array.
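The property id assignment can be sketched as a simple first-seen cache (hypothetical names; the real implementation differs):

```python
class PropertyNameCache:
    """Assign each distinct property name an id in order of first appearance,
    so a repeated name like "Name" is stored only once."""

    def __init__(self):
        self.ids = {}     # name -> id
        self.names = []   # id -> name, written out at the end of the document

    def get_id(self, name: str) -> int:
        if name not in self.ids:
            self.ids[name] = len(self.names)
            self.names.append(name)
        return self.ids[name]

cache = PropertyNameCache()
# Property names in the order they appear in the example document:
doc_order = ["Name", "Dogs", "Age", "Office", "Name", "Street", "City"]
ids = [cache.get_id(p) for p in doc_order]
# "Name" reuses id 0 the second time it appears, inside the Office object.
```

This is why the Office object's entries above reference PropertyIds 5, 0, and 4: "Name" keeps the id 0 it got from the root object, while "Street" and "City" get the next fresh ids.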

Finally, we store the root metadata offset, the offset of the property names offset, and the size of offsets that we'll encounter in this document, and we are done.

Almost. There is another subtle issue to deal with. Look at the properties of the Office object that we have above. Notice that they don't look like they are stored in order of arrival. Instead, the property entries of the object are actually stored sorted by the value of the property name.

In other words, given that I want to access a particular property in an object, what I'll do is run a binary search over the properties of this object. I'll look up the property name for each property id and compare it to the property name I want to find. So, that means searching for a property in O(log N), where N is the number of properties in a particular object.

We have taken things a bit further: when we encounter a large string value (currently, above 128 bytes), we'll also try to compress it. We're using LZ4 to compress the data, and for large values, it generally represents quite a nice saving. But, if the savings aren't good enough (doesn't save at least 10% of the value size), we'll just emit the value as is.
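The compress-or-not decision looks roughly like this. Note the hedges: the threshold and savings numbers come from the text above, but this sketch uses zlib from the standard library as a stand-in for LZ4, which is what the post actually describes:

```python
import zlib

COMPRESSION_THRESHOLD = 128   # only try to compress values above this size
MIN_SAVINGS = 0.10            # keep the compressed form only if it saves at least 10%

def maybe_compress(value: bytes):
    """Return (compressed?, data). zlib stands in for LZ4 here."""
    if len(value) <= COMPRESSION_THRESHOLD:
        return False, value                 # small value: not worth trying
    compressed = zlib.compress(value)
    if len(compressed) <= len(value) * (1 - MIN_SAVINGS):
        return True, compressed             # good enough savings: keep it
    return False, value                     # not worth it: emit the value as-is

flag, data = maybe_compress(b"repetitive text " * 64)  # highly compressible
```

The fallback path matters: incompressible data (already-compressed blobs, random strings) would otherwise pay decompression cost on every read for no space benefit.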

So overall, we take the document, only store the property names once, arrange things so accessing any property would be very cheap, and compress large values. We get some space saving out of that (more on this in my next post), but while that is nice, and we certainly paid attention to this, it isn't our main goal.

Let us see what sequence of operations we'll need to access the Office.City property, shall we?

  1. O(log 4) search in the properties of the root object:
    • This search is done on the four property ids that are stored in the root object metadata, sorted so we can search them.
    • The actual property names are stored elsewhere, and we have to go through an array specifying their offsets to get to the actual values.
  2. The search results in the offset of the metadata of the Office object.
  3. O(log 3) search in the properties of the Office object.
  4. The search results in the offset of the City property.
  5. We read the city property and do something with it.
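The steps above can be sketched with a simplified in-memory model. Here each object just holds its property entries pre-sorted by name, standing in for the (offset, property id, type) entries of the real format; the names and structure are illustrative assumptions:

```python
from bisect import bisect_left

# Each object stores its properties sorted by name, mirroring how the
# real format sorts entries so a binary search works.
root = {
    "props": sorted([
        ("Name", "Oren"),
        ("Dogs", ["Arava", "Oscar", "Phoebe"]),
        ("Age", 34),
        ("Office", {
            "props": sorted([
                ("Name", "Hibernating Rhinos"),
                ("Street", "Hanashi 21"),
                ("City", "Hadera"),
            ]),
        }),
    ]),
}

def get_property(obj, name):
    """O(log N) lookup: binary search over the sorted property entries."""
    names = [p[0] for p in obj["props"]]
    i = bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return obj["props"][i][1]
    raise KeyError(name)

office = get_property(root, "Office")   # steps 1-2: search the root object
city = get_property(office, "City")     # steps 3-4: search the Office object
print(city)                             # step 5 -> "Hadera"
```

At no point did we look at Dogs, Age, or Street; everything that wasn't on the path to the value stayed untouched.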

Compare that to parsing the entire document and then accessing the relevant value, or parsing the entire document in a streaming fashion to extract the value you need. There is also another issue, which I almost forgot to mention, but is quite important.

The blittable format (and please help us find another name) actually writes all the data into unmanaged memory. This means that it is entirely invisible to the GC, which means that we won't have to wait for the GC to clean it all up. Instead, we can allocate a pool of unmanaged memory and just use that. It makes it drastically simpler to manage memory usage and growth in our systems.

Because we use memory mapped files for our storage, actually moving around in the document using this format is trivial and quite fast. But, I'm getting ahead of myself. Benchmarks and actual results are for the next post…



Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

