
Reducing Parsing Costs in RavenDB

RavenDB works well with JSON for document storage, but parsing JSON is costly. Here's how to reduce the cost of parsing by limiting the number of small objects.


Note: this is something that we are considering for 4.0.

RavenDB uses JSON natively to handle pretty much everything, which makes sense for a JSON document database. JSON is easy to parse, human readable, and, for the most part, has no real complexity associated with it.

However, while JSON is easy to parse, there is still computational overhead in parsing it. In fact, in our perf testing, we spend quite a lot of time just serializing and deserializing JSON. It is one of the major costs we have to deal with, mostly because it happens all the time. We previously used BSON to store documents internally, but profiling showed that it is actually cheaper to hold the data as JSON text and parse it, so that is what we are currently doing.

However, even fast parsing is still parsing, and that is something we would like to avoid. There is also an issue with how the parsed document is represented in memory. Let us consider the following JSON:

{"FirstName":"John","LastName":"Smith"}

In memory, this is represented as (highly simplified):

  • Dictionary<string,object> instance
    • string[] for the keys
      • "FirstName" string
      • "LastName" string
    • object[] for the values
      • "John" string
      • "Smith" string

In other words, especially for large documents, there are a lot of small objects being created. That doesn't impact the immediate parsing cost, but those objects do need to be collected after the fact, and that is something we would like to avoid.
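To make that concrete, here is a small Python sketch (illustrative, not RavenDB code) that counts the objects a parsed JSON document implies, using a simplified rule of one object per container, per key string, and per leaf value:

```python
import json

def count_allocations(value):
    """Roughly count the objects a parsed JSON value allocates:
    one per container, one per key string, one per leaf value."""
    if isinstance(value, dict):
        # the dictionary itself, plus a string per key, plus each value
        return 1 + sum(1 + count_allocations(v) for v in value.values())
    if isinstance(value, list):
        return 1 + sum(count_allocations(v) for v in value)
    return 1  # string, number, bool, or null

doc = json.loads('{"FirstName":"John","LastName":"Smith"}')
print(count_allocations(doc))  # dict + 2 key strings + 2 value strings = 5
```

For a document with thousands of fields, that count grows proportionally, and every one of those objects has to be tracked and eventually collected.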

We are currently considering using FlatBuffers for internal document storage. The good thing about FlatBuffers is that there is no intermediate parsing step: you get a block of memory that you can immediately access. That has two major advantages. Loading a document into memory would mean just reading a buffer from disk, with no extra cost. And it would also mean that the act of releasing a document would be much cheaper; we would only need to collect the buffer again, not potentially tens of thousands of small objects.

Another advantage is that we usually need to load documents in order to index them, and indexing usually requires only a few fields from each document. By avoiding the parsing cost and paying only for the objects we actually touch, we are in a much better position to reduce indexing costs.
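As an illustration of that access pattern, here is a Python sketch of a toy flat layout (not the real FlatBuffers wire format): names and values are stored length-prefixed in a single buffer, and reading one field decodes only the bytes it needs, leaving everything else untouched:

```python
import struct

def encode(doc):
    """Serialize a flat string->string document: a field count,
    then a length-prefixed UTF-8 name and value per field."""
    out = [struct.pack("<I", len(doc))]
    for name, value in doc.items():
        for s in (name, value):
            data = s.encode("utf-8")
            out.append(struct.pack("<I", len(data)) + data)
    return b"".join(out)

def get_field(buf, wanted):
    """Walk the buffer and decode only the requested field's value."""
    count, = struct.unpack_from("<I", buf, 0)
    pos = 4
    target = wanted.encode("utf-8")
    for _ in range(count):
        nlen, = struct.unpack_from("<I", buf, pos); pos += 4
        name = buf[pos:pos + nlen]; pos += nlen
        vlen, = struct.unpack_from("<I", buf, pos); pos += 4
        if name == target:
            return buf[pos:pos + vlen].decode("utf-8")
        pos += vlen  # skip the value without decoding it
    return None

buf = encode({"FirstName": "John", "LastName": "Smith"})
print(get_field(buf, "LastName"))  # → Smith
```

The buffer could come straight off disk; nothing beyond the requested field is ever materialized as an object.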

A rough draft schema using FlatBuffers would be:

union AnyValueUnion { Document, Array, Value }

table AnyValue
{
    Value: AnyValueUnion;
}

table Value
{
    Type: byte;
    String: string;
    Bytes: [byte];
}

table Array
{
    Values: [AnyValue];
}

table FieldValue
{
    Name: string;
    Value: AnyValue;
}

table Document
{
    Values: [FieldValue];
}

The Values inside a Document are sorted by field name, so we can search for a field using binary search.
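That lookup can be sketched in Python with the standard `bisect` module (the field names and values here are made up for illustration):

```python
from bisect import bisect_left

# Fields stored sorted by name, as the schema above requires;
# a parallel list holds the corresponding values.
names = ["FirstName", "LastName"]
values = ["John", "Smith"]

def find_field(name):
    # O(log n) lookup instead of scanning every FieldValue.
    i = bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return values[i]
    return None

print(find_field("LastName"))  # → Smith
```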

Nitpicker corner: Yes, we would probably want to use a hash here, but this is a rough draft to test things out; that will come later.

We still need to optimize the schema, and using it will not be a lot of fun, but the speed and memory improvements should be significant.

Another alternative would be a schema such as this:

table FieldValue
{
    Name: int;
    Value: AnyValue;
}

table Document
{
    Values: [FieldValue];
}

table Root
{
    Fields: [string];
    Document: Document;
}

Here we store all the field names once, in the root object, and each FieldValue refers to its name by index. This has the advantage of not repeating the field name strings over and over again.
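A Python sketch of that deduplication (the documents and the `intern_fields` helper are hypothetical, just to show the shape of the transform):

```python
def intern_fields(docs):
    """Build a shared, sorted field-name table (like Root.Fields) and
    rewrite each document's fields as (name index, value) pairs."""
    fields = sorted({name for d in docs for name in d})
    index = {name: i for i, name in enumerate(fields)}
    encoded = [[(index[k], v) for k, v in d.items()] for d in docs]
    return fields, encoded

docs = [{"FirstName": "John"}, {"FirstName": "Jane", "LastName": "Doe"}]
fields, encoded = intern_fields(docs)
print(fields)   # ['FirstName', 'LastName']
print(encoded)  # [[(0, 'John')], [(0, 'Jane'), (1, 'Doe')]]
```

Each field name is now stored once no matter how many documents use it; the per-field cost drops to a small integer.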


Published at DZone with permission of Ayende Rahien, DZone MVB. See the original article here.
