
Field Tokenization: A Problem Analysis


I was asked about the problem of repeating property names in RavenDB recently, so let's consider the following document:

{
  "FirstName": "John",
  "LastName": "Doe",
  "BirthDay": "1980-01-01",
  "LastLoggedInAt": "2013-08-10"
}

There are 40 chars in this document that are dedicated solely to field names, and just 28 chars dedicated to actual data. Now, 40 vs. 28 means that if we have 1 million of those documents, we are losing a lot of space for no good reason. Surely the database can do better on this, right? It can tokenize those things so instead of storing 40 bytes for fields, it would store only 16 (assuming int32 for token ids). And for larger documents, you would see massive space savings.
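To make the arithmetic concrete, here is a back-of-the-envelope sketch of the overhead described above. The exact character counts may differ by a char or two from the article's round numbers, depending on what you count; the point is the ratio, not the precise figure.

```python
# The sample document from the post: roughly 40 chars of field names
# versus roughly 28 chars of actual data.
doc = {
    "FirstName": "John",
    "LastName": "Doe",
    "BirthDay": "1980-01-01",
    "LastLoggedInAt": "2013-08-10",
}

field_chars = sum(len(k) for k in doc)          # chars spent on names
data_chars = sum(len(v) for v in doc.values())  # chars spent on data

# Tokenizing each name to an int32 would cost 4 bytes per field.
tokenized = 4 * len(doc)

docs = 1_000_000
saved_mb = (field_chars - tokenized) * docs / (1024 * 1024)
print(field_chars, data_chars, tokenized, round(saved_mb, 1))
```

Run over a million documents, the difference between full names and int32 tokens comes out to a couple of dozen megabytes, which matches the "not really a major issue" figure later in the post.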

I think that I talked about this before, oh yes, here it is. I outlined a situation in which you manually (or via config at the client side) change the document to look like this:

{
  "FN": "John",
  "LN": "Doe",
  "BD": "1980-01-01",
  "LLD": "2013-08-10"
}

This is pretty horrible, and I cover exactly why this is bad in the blog post above. Now, for 1 million documents, those 40 chars represent about 38MB of disk space. Not really a major issue, I think.

The actual question I was asked was (by Justin):

Repeating keys is so silly that MongoDB has a highly voted issue, marked major, with a plan to fix it. And many client libraries have code built in that tries to shorten names for you:

https://jira.mongodb.org/browse/SERVER-863

I personally think reading and writing 2x or more data than necessary is not a silly issue; it impacts all aspects of performance and database management. Disks are cheap, sure, but then again, if your database is smaller, you might have been able to put it on an SSD or a faster but smaller drive, or just transferred fewer bytes from that cheap disk to accomplish the same thing. Are you really going to tell me every byte doesn't make a difference, after examining these C code bases that do their best to save every byte for performance reasons?

The ugly solution shown above is a manual approach, but surely the DB can do better, right? Interning property names is relatively simple conceptually, but a lot more complex in practice.

For example, if you want to do interning, you probably want to do it end-to-end, which means that the client libraries need to understand the server tokens. That means that you need some way of informing the client when there is a new interned property added, otherwise they wouldn't be able to figure out the real property name. Then there is the issue of how you are going to handle sharding or replication. Is each server going to have its own unique copy of the intern table? Is each client going to have to navigate that? Or are you going to try to have a single intern table, and somehow manage to agree on all the positions in that table among a distributed set of nodes?
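A minimal sketch of the intern table the paragraph above describes may help show where the pain starts. Everything here is hypothetical (the class and its methods are illustrative, not any real RavenDB or MongoDB API): the server assigns a token to each newly seen name, and every client must learn that mapping before it can decode a document.

```python
# Hypothetical property-name intern table. Every mutation of this table
# is state that has to be propagated to every client and every replica.
class InternTable:
    def __init__(self):
        self.name_to_id = {}
        self.id_to_name = {}

    def intern(self, name):
        # Assign a new token on first sight. This is exactly the event
        # that all other nodes and clients need to hear about.
        if name not in self.name_to_id:
            token = len(self.name_to_id)
            self.name_to_id[name] = token
            self.id_to_name[token] = name
        return self.name_to_id[name]

    def resolve(self, token):
        # A client holding a stale copy of the table fails here.
        return self.id_to_name[token]

server = InternTable()
doc = {"FirstName": "John", "LastName": "Doe"}
encoded = {server.intern(k): v for k, v in doc.items()}

stale_client = InternTable()  # never received the new tokens
try:
    stale_client.resolve(0)
except KeyError:
    print("stale client cannot decode token 0")
```

The failure mode in the last few lines is the whole problem in miniature: the document itself is no longer self-describing, so a reader with the wrong version of the table cannot recover the property names at all.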

Imagine what's going to happen during failover: you suddenly need to get the intern table for each of the replicas, and making sure that you aren't using the wrong intern table is not going to be fun. Not to mention what happens when you have different clients talking to different servers, especially if you had a network split.

Even if you don't care about that and only want this for saving disk I/O on the server, you are still going to run into issues with concurrency management regarding the intern list. If you have two transactions, each adding values that require modifying the intern list, you are creating an interesting problem. Not least because if you save just the property name ID to the data store, instead of the full key, you have to be able to guarantee that the intern entry is present only if the transaction is committed; otherwise you have an unknown (or wrong) property name when you read the document back.

So now you have concurrency issues, and the potential for conflicts because two transactions may be modifying the values of the intern table at the same time. For fun, assume that they both have documents with the same property name that needs to be added, as well as other property names.

And that doesn't even take into account that it is actually pretty common to have dynamic property names; for example, dates & times. You have 1 million documents, and each of them has a property with a name like: "2013-08-03T19:57:32.7060000Z"

A common scenario where this occurs is tracking things by time. For instance, tracking the number of hours you worked on a bug. The document might look like this:


{
  "2013-08-03T19:57:32.7060000Z": {
    "Project": "RavenDB",
    "RavenDB-1248": "Fixed",
    "RavenDB-1212": "Progress made"
  }
}

As you can see, we actually have multiple properties with dynamic names here. It would be a waste to try to intern them. In fact, it is likely going to cause us to fall over and die, because the intern table would grow without bound. And if the client then has to wait for that ever-growing table to arrive before it can read anything back... then it's fun & joy all around.

But you know what? I can still probably go and figure out what the 10,000 most common field names are, and have a fixed table of those in place. This would give me the ability to handle things like Name, FirstName, Date, etc. Because it is a fixed list, a lot of the problems listed above don't matter. It would also mean that if you wanted to call your property FirstNames, you might need to call it Names instead, because only the latter would be on the list and get interned.
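The fixed-list variant can be sketched in a few lines. Again, the names and functions here are hypothetical, just to show the shape of the idea: names on a pre-agreed list get a token, everything else is stored verbatim, and because the list never changes, there is no table to synchronize.

```python
# A fixed, pre-agreed list of common property names (imagine 10,000
# entries rather than four). It ships with the client and server and
# is never mutated at runtime.
COMMON_NAMES = ["Name", "FirstName", "LastName", "Date"]
TOKEN = {name: i for i, name in enumerate(COMMON_NAMES)}

def encode_key(name):
    # Token for names on the list, the raw string otherwise.
    return TOKEN.get(name, name)

def decode_key(key):
    return COMMON_NAMES[key] if isinstance(key, int) else key

doc = {"FirstName": "John", "FirstNames": ["John", "J."]}
encoded = {encode_key(k): v for k, v in doc.items()}
print(encoded)  # "FirstName" becomes a token; "FirstNames" stays a string
```

Note how "FirstNames" silently misses the optimization, which is precisely the compatibility trap described in the next paragraph: the benefit depends on a list you cannot grow without breaking older clients.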

That leads to a lot of interesting potential issues with backward and forward compatibility. How do I add more terms to the list? Older clients wouldn’t be able to understand the tokens, and that leads to additional complications just managing the versioning story. And I can assure you that you’ll have a lot of people clamoring to add their own unique properties to that list. (‘What do you mean ‘PensionFundTarget’ isn’t on the intern list? It’s a really common term and we could really use the performance boost of that!’)

And then we have the issue of debugging. Right now, you can watch what is going on over the wire using Fiddler very easily, and the values are easily understood. Do you really want to try to figure out data like this?

{
  "231": "John",
  "432": "Doe",
  "127": "1980-01-01",
  "5841": "2013-08-10"
}

In pretty much every aspect you can think of, this is a really crazy hard feature, and the benefits that it brings aren’t really that important.

Sure, you can reduce the amount of data that you read from the disk. But that data is already sitting in the same place, and there is absolutely no difference between reading a 30-byte value and a 200-byte value (reads/writes are always done at a minimum of 512 bytes). So reading a few extra bytes means pretty much nothing to us. We don’t have to seek to them; they are already there. Also, we have caches to alleviate that problem for us.

Reading from the disk isn’t really our major problem.  That is relatively cheap compared to sending the data to the user over the network. But wait, we actually have found a solution for that: we use gzip compression on the wire, a supported and well known method that you can easily see through with tools like Fiddler. That usually gives you the best of all worlds. You get small response size, you still use readable format, you don’t have to deal with all of the issues mentioned above. Happy happy joy joy!

Hm… can we not apply the same method internally on the server side? Sure we can. We can compress & decompress data on the fly when we write to / read from the disk, further reducing the amount of data that we write.
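The reason compression sidesteps the whole interning debate is that a dictionary coder deduplicates the repeated property names on its own, with no shared table to manage. A quick illustration (the sizes are illustrative, not a benchmark):

```python
import gzip
import json

# The sample document, repeated the way a server response or a disk
# page full of similar documents would repeat it.
doc = {
    "FirstName": "John",
    "LastName": "Doe",
    "BirthDay": "1980-01-01",
    "LastLoggedInAt": "2013-08-10",
}
payload = json.dumps([doc] * 1000).encode("utf-8")

# gzip's back-reference dictionary collapses the repeated field names
# (and repeated values) automatically; the data stays plain JSON.
compressed = gzip.compress(payload)
print(len(payload), len(compressed), f"{len(compressed) / len(payload):.1%}")
```

The repeated keys compress to a tiny fraction of their raw size, the wire format remains human-readable once decompressed, and nothing about the documents themselves had to change.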

In RavenDB, we do both. Data is compressed on the wire, and can be compressed to the disk as well.

As for the MongoDB issue, it may be highly voted, and it may be marked as major, but it has been sitting in the issue tracker for the past 3+ years. I would say that they aren’t in any rush to resolve it, probably for much the same reasons as I outlined here. It is cheap to vote for an issue, and usually the people setting the issue severity are the people creating it. I wouldn’t put too much stock in anything like that.

Put simply, trying to solve this is a really hard problem, with a lot of cascading implications and dangerous side effects. Conversely, there are more effective solutions that are cheaper, simpler, and orthogonal to everything else.

I think you can guess what I would pick.


