The Importance of a Data Format Part 2 — The Environment Matters

DZone 's Guide to

The Importance of a Data Format Part 2 — The Environment Matters

So, here's the setup... the previous post talked about the problems we may encounter with JSON, and this one is about what kind of a solutions we have available. Read on to learn more.

· Database Zone ·
Free Resource

When designing a new data format, it is important to remember what environment we are operating in, what the requirements are, and what type of scenarios we’ll face.

With RavenDB, we are talking about the internal storage format, so it isn’t something that is available externally. That means that we don’t have to worry about interchanging with anything, which frees up the design quite a bit. We want to reduce parsing costs, size on disk, and managed allocations.

That leads us to a packed binary format, but not all binary formats are born equal. In particular, we need to consider whatever we’ll have a streaming format or a complete format. What is the meaning of that?

A streaming format means that you read it one byte at a time to construct the full details. JSON is a streaming format, for example. That is not something that we want to do, because a streaming format requires us to have an in memory representation to deal with the object. And, even if we wanted a known particular value from the document, we would still need to parse through all the document to get all the relevant fields.

So, we want a complete format. A complete format means that we don’t need to parse the document to get to a particular value.  Internally, we refer to such a format as Blittable. I’m not fond of this term, and I would appreciate suggestions to replace it.

I’ll get to the details about how this is actually laid out in my next post. In this post, I want to outline the overall design for it.

We want a format that can be read in a single call (or, more commonly for us, mmap in its entirety), and once that is done, we can start working with it without additional work. Traversing through this format should be a cheap operation, check out this code:

foreach(var child in doc.children)

It should only materialize the strings for the children’s names (which we accessed), but will have no further costs regarding the rest of the document.

Because we assume that the full document will reside in memory (either by loading it all from disk or by mmapping it), we don’t need to worry about costs of traversing through the document. We can simply and cheaply jump around inside the document.

In other words, we don’t have to put related information close, if we have reason to place it elsewhere. In order to reduce memory consumption during the write phase, we need to make sure that we are mostly forward only writers. That is, the process of writing the document in the new format should not require us to hold the entire document in memory. We should also take the time to reduce the size of the document as much as possible. At the same time, just compressing the whole thing isn’t going to be good for us, we’ll lose the ability to just go to any location on the document cheaply.

Note that for the purpose of this work, we are interested in reducing work only for a single document. There are additional optimizations that we can apply across multiple documents, but they are complex to manage in a dynamic system.

So this is the setup, the previous post talked about the problems we may encounter with JSON, and this one was about what kind of a solutions we have available. Next post will discuss the actual format.

data access object, data format, json

Published at DZone with permission of Oren Eini , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}