The Importance of a Data Format Part 2 — The Environment Matters

So, here's the setup: the previous post talked about the problems we may encounter with JSON, and this one is about what kinds of solutions we have available. Read on to learn more.

When designing a new data format, it is important to remember what environment we are operating in, what the requirements are, and what type of scenarios we’ll face.

With RavenDB, we are talking about the internal storage format, so it isn’t something that is available externally. That means that we don’t have to worry about interchanging with anything, which frees up the design quite a bit. We want to reduce parsing costs, size on disk, and managed allocations.

That leads us to a packed binary format, but not all binary formats are born equal. In particular, we need to consider whether we’ll have a streaming format or a complete format. What is the meaning of that?

A streaming format means that you read it one byte at a time to construct the full details. JSON is a streaming format, for example. That is not something that we want, because a streaming format requires us to build an in-memory representation before we can work with the object. And even if we wanted just one known value from the document, we would still need to parse through the whole document to reach the relevant field.

So, we want a complete format. A complete format means that we don’t need to parse the document to get to a particular value. Internally, we refer to such a format as Blittable. I’m not fond of this term, and I would appreciate suggestions to replace it.
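To make the distinction concrete, here is a minimal sketch in Python (chosen for brevity; it is not RavenDB's language or its actual layout, which the next post covers). The toy layout, `pack_fields`, and `read_field` are all hypothetical names invented for this illustration. The layout is a field count, a table of offsets, and then length-prefixed UTF-8 values.

```python
import json
import struct

def pack_fields(values):
    """Toy layout (an assumption): [count:u32][offset:u32 * count][len:u32 + utf8]..."""
    encoded = [v.encode("utf-8") for v in values]
    header_size = 4 + 4 * len(encoded)
    offsets, pos = [], header_size
    for raw in encoded:
        offsets.append(pos)
        pos += 4 + len(raw)
    out = bytearray(struct.pack("<I", len(encoded)))
    out += struct.pack(f"<{len(offsets)}I", *offsets)
    for raw in encoded:
        out += struct.pack("<I", len(raw)) + raw
    return bytes(out)

def read_field(buf, index):
    """Read one value by index without decoding any of the others."""
    (count,) = struct.unpack_from("<I", buf, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    (offset,) = struct.unpack_from("<I", buf, 4 + 4 * index)
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return buf[start:start + length].decode("utf-8")

doc = ["alice", "Haifa", "x" * 10_000]

# Streaming: the whole JSON text is parsed before any field is available.
city_from_json = json.loads(json.dumps(doc))[1]

# Complete: two small reads locate the value, a third reads it.
city_from_packed = read_field(pack_fields(doc), 1)
```

In the packed version, the 10,000 character value is never looked at when we ask for the second field, which is the property we are after.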

I’ll get to the details about how this is actually laid out in my next post. In this post, I want to outline the overall design for it.

We want a format that can be read in a single call (or, more commonly for us, mmap in its entirety), and once that is done, we can start working with it without additional work. Traversing this format should be a cheap operation. Consider this code:

foreach (var child in doc.children)
    Console.WriteLine(child.name); // assumed loop body: only each child's name is read

This should materialize only the strings for the children’s names (which we accessed), and incur no further cost for the rest of the document.
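A rough sketch of that behavior, again in Python and again under invented assumptions: the record layout, `LazyChild`, `children`, `build`, and `names_via_mmap` are hypothetical and only illustrate the idea of mapping a file and decoding a string when, and only when, it is accessed.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: name offset, name length, payload offset, payload length.
RECORD = struct.Struct("<IIII")

class LazyChild:
    def __init__(self, buf, index):
        self._buf = buf
        self._fields = RECORD.unpack_from(buf, 4 + index * RECORD.size)
        self.decoded = 0  # how many strings this child has materialized

    def _text(self, offset, length):
        self.decoded += 1
        return self._buf[offset:offset + length].decode("utf-8")

    @property
    def name(self):
        return self._text(self._fields[0], self._fields[1])

    @property
    def payload(self):
        return self._text(self._fields[2], self._fields[3])

def children(buf):
    """Yield lazy views over each child; nothing is decoded here."""
    (count,) = struct.unpack_from("<I", buf, 0)
    for i in range(count):
        yield LazyChild(buf, i)

def build(pairs):
    """Serialize (name, payload) pairs as [count][records][string bytes]."""
    data_start = 4 + RECORD.size * len(pairs)
    records, blob = bytearray(), bytearray()
    for name, payload in pairs:
        n, p = name.encode("utf-8"), payload.encode("utf-8")
        name_off = data_start + len(blob)
        blob += n
        payload_off = data_start + len(blob)
        blob += p
        records += RECORD.pack(name_off, len(n), payload_off, len(p))
    return struct.pack("<I", len(pairs)) + bytes(records) + bytes(blob)

def names_via_mmap(pairs):
    """Write the document to a temp file, map it, and read only the names."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(build(pairs))
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return [child.name for child in children(view)]
    finally:
        os.remove(path)
```

Iterating with `children(...)` and touching only `.name` decodes one string per child; the payload bytes stay untouched in the mapped file.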

Because we assume that the full document will reside in memory (either by loading it all from disk or by mmapping it), we don’t need to worry about costs of traversing through the document. We can simply and cheaply jump around inside the document.

In other words, we don’t have to put related information close together if we have reason to place it elsewhere.

In order to reduce memory consumption during the write phase, we need to make sure that we are mostly forward-only writers. That is, the process of writing the document in the new format should not require us to hold the entire document in memory.

We should also take the time to reduce the size of the document as much as possible. At the same time, just compressing the whole thing isn’t going to be good for us, because we’d lose the ability to go to any location in the document cheaply.
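One way these two constraints could fit together is sketched below in Python. This is an illustration under stated assumptions, not the format the next post describes: `ForwardWriter`, `read_value`, the trailer layout, and the 64-byte compression threshold are all made up here. Values are appended as they arrive, only a small offset entry per value is kept in memory, and the offset table plus a fixed-size trailer are written last. Large values are compressed individually with zlib, so any single value can still be reached directly.

```python
import io
import struct
import zlib

ENTRY = struct.Struct("<IIB")   # offset, stored length, compressed flag (hypothetical)
TRAILER = struct.Struct("<II")  # offset of the entry table, number of entries

class ForwardWriter:
    """Appends values to a stream; never seeks back or buffers the document."""

    def __init__(self, stream, compress_over=64):
        self._stream = stream
        self._entries = []
        self._pos = 0
        self._compress_over = compress_over

    def write_value(self, text):
        raw = text.encode("utf-8")
        compressed = len(raw) > self._compress_over
        body = zlib.compress(raw) if compressed else raw
        self._entries.append((self._pos, len(body), int(compressed)))
        self._stream.write(body)
        self._pos += len(body)

    def finish(self):
        table_offset = self._pos
        for entry in self._entries:
            self._stream.write(ENTRY.pack(*entry))
        # The trailer sits at a known place (the end), so a reader can find the table.
        self._stream.write(TRAILER.pack(table_offset, len(self._entries)))

def read_value(buf, index):
    """Jump straight to one value; only that value is decompressed."""
    table_offset, count = TRAILER.unpack_from(buf, len(buf) - TRAILER.size)
    if not 0 <= index < count:
        raise IndexError(index)
    offset, length, compressed = ENTRY.unpack_from(
        buf, table_offset + index * ENTRY.size)
    body = bytes(buf[offset:offset + length])
    return (zlib.decompress(body) if compressed else body).decode("utf-8")

out = io.BytesIO()
writer = ForwardWriter(out)
for value in ["short", "z" * 5000, "tail"]:
    writer.write_value(value)
writer.finish()
packed = out.getvalue()
last = read_value(packed, 2)
```

Compressing per value trades some compression ratio for random access: compressing the whole buffer would likely be smaller, but reading `"tail"` would then require decompressing everything before it.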

Note that for the purpose of this work, we are interested in reducing work only for a single document. There are additional optimizations that we can apply across multiple documents, but they are complex to manage in a dynamic system.

So this is the setup: the previous post talked about the problems we may encounter with JSON, and this one was about what kinds of solutions we have available. The next post will discuss the actual format.

Published at DZone with permission of Ayende Rahien, DZone MVB. See the original article here.
