The Importance of a Data Format Part 6 — When Two Orders of Magnitude Aren't Enough

So, check out these results. Our current version can index 18,000 documents in 22 ms vs. the JSON approach, which takes 1.8 seconds. Read on to learn more.

So, our current version can index 18,000 documents in 22 ms vs. the JSON approach, which takes 1.8 seconds. That is an amazing improvement, but I have to tell you, I have a good feeling that we can go further.

Without running a profiler, I'm guessing that a major part of the 22 ms cost is going to be comparisons during the binary search process. We store the property names as a sorted array, and when you need to get a property by name, we have to do a binary search over the property names to find a match.
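To make that cost concrete, here is a minimal sketch of such a lookup. It is not the actual implementation: the function name `find_property`, the use of plain Python strings, and the comparison counter are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def find_property(names: Sequence[str], wanted: str) -> Tuple[Optional[int], int]:
    """Binary search over a sorted sequence of property names.

    Returns (index, probes): the index of `wanted` (or None if absent)
    and the number of names that had to be compared against it.
    """
    lo, hi = 0, len(names) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1  # every probe is a string comparison we pay for
        if names[mid] == wanted:
            return mid, probes
        if names[mid] < wanted:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes


# Hypothetical document with seven properties, stored sorted by name.
names = ["Age", "City", "Email", "Name", "Phone", "Tags", "Zip"]
print(find_property(names, "Name"))  # (3, 1)
print(find_property(names, "Zip"))   # (6, 3)
```

Each lookup costs up to roughly log2(n) string comparisons, and that cost is paid again for every property of every document.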

Our test case gets 3 properties from a total of 18,801 documents. We do this 30 times, to reduce the chance of jitter getting in the way. Here are the profiler results:

[Image: profiler results before the change]

As you can see, the cost is indeed where I expected it to be. Basically, the time spent in ReadNumber and GetComparerFor is a direct reflection of the number of binary searches we perform.

But here we can take advantage of the fact that we know that most of those documents are going to have the same structure. In other words, once we have found the position of a particular property, there is a strong chance that the next time we look for a property with the same name, we'll find it in the same position. Caching that position and using it as the starting point for the binary search means that most of the time, we hit the right property right on the nose, and in the worst case, we have a single extra comparison to make.
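The idea can be sketched as follows. This is an illustration under stated assumptions, not the actual code: the class name `CachedPropertyLookup`, the per-name dictionary cache, and the `comparisons` counter are all hypothetical.

```python
from typing import Dict, Optional, Sequence


class CachedPropertyLookup:
    """Binary search that starts from the position where a property
    name was last found, instead of always starting from the middle."""

    def __init__(self) -> None:
        self._last_position: Dict[str, int] = {}  # name -> index of last hit
        self.comparisons = 0  # total string comparisons, for measurement

    def find(self, names: Sequence[str], wanted: str) -> Optional[int]:
        lo, hi = 0, len(names) - 1
        mid = self._last_position.get(wanted, (lo + hi) // 2)
        if mid > hi:
            # Cached position does not exist in this (shorter) document.
            mid = (lo + hi) // 2
        while lo <= hi:
            self.comparisons += 1
            if names[mid] == wanted:
                self._last_position[wanted] = mid
                return mid
            # A miss still narrows the range, so the search continues normally.
            if names[mid] < wanted:
                lo = mid + 1
            else:
                hi = mid - 1
            mid = (lo + hi) // 2
        return None


# 1,000 hypothetical documents that all share the same sorted property names.
names = ["Age", "City", "Email", "Name", "Phone", "Tags", "Zip"]
lookup = CachedPropertyLookup()
for _ in range(1000):
    assert lookup.find(names, "Zip") == 6
print(lookup.comparisons)  # 1002: three probes the first time, one per document after that
```

When the documents share a structure, every lookup after the first is a single comparison. When a document differs, the cached guess costs one probe and the search then proceeds over the narrowed range.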

This single change gives us a really nice impact:

[Image: profiler results after caching the property position]

That is about a 40% improvement in speed from this single change.

We have been following the same pattern throughout the development of this feature: write the feature, make sure it passes all its tests, then profile it. Usually that points to a problem that we need to solve, either by optimizing code or by changing the way we do things in the code... sometimes radically. But that is a topic for the next blog post.

Topics:
data format, data access object, database performance

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.
