I have been doing a comparative study on different ways to analyse a large JSON in memory, and wanted to share the results.
I intend to do a wider set of comparisons and this post is going to be the first part. In this I post will discuss the use case and the test results while analysing in a simple row vs. columnar fashion.
Let’s say you got a big chunk of JSON data from a source like twitter API and you want to do some analysis on it by firing multiple and random queries on this large JSON. To speed up the queries it’s preferred to load the JSON in memory and answer the queries instead of reaching out to the disk every time (assuming it fits in memory). Please note, I did not want to bother throwing distributed computing or a real time requirement at this point.
The test data I took was 100000 records from twitter firehose that made up a 264MB JSON array on disk. The test query was to pull up a tweet that came from a particular timezone and from a particular user name – select text where user.screen_name = <> and user.timezone = <>.
1. Row by row: Parse the twitter stream data in a traditional row by row fashion and filter by both user name and time zone parameters, select the tweet text.
2. NoSQL DB: Post the test data into a NoSQL DB like mongodb and run the query. This is not really an in-memory analysis but with mongodb nearing cache speeds, I would like to get some stats here. I will do this test by my next post.
3. Columnar: Convert the JSON to columnar format so that it’s better for analytics. (Later I intend to add compression as well). There are known techniques to convert nested JSON to columnar like Parquet format. I found that Parquet intended to solve more problems than I needed and could potentially perform better, I will revisit Parquet probably later. Another new player - MemSQL converts JSON to columnar and allows SQL querying on the same. MemSQL appears a good fit for our use case but it’s not a free software. Also, it’s a much bigger product that includes real-time and scalability solutions. I also found treasure data has a technology called Plazma, again it did not look like an open sourced one.
I decided to code my own algorithm to flatten the JSON and query in an SQL fashion. My approach is a little different to Parquet in that I did not go by repetition level and definition level. Instead, I traversed the JSON assigning rowids to columns values based on the level and iteration.
With this basic set up, I ran my test query and obtained below stats:
MilliSecs for file load: 30432
MilliSecs for query exec: 33
Increase in heap after file load: 303 MB
Total memory: 1517
Free memory after program: 431
Used (Total – Free): 1086 MB
Row by row
MilliSecs for file load: 34494
MilliSecs for query exec: 192
Increase in heap after file load: 241 MB
Total memory: 1652
Free memory after program: 303
Used (Total – Free): 1349 MB
We can see that right away with this basic columnar implementation, the query execution time is more than 5x faster.
I am unable to conclude well on the memory footprint due to other variables, but looks like the columnar format is lighter on memory by a small margin – Not bad considering I have not done any compression yet.
Columnar will be faster and lighter than Row is not a new finding. This test is to affirm the same and apply to JSON which is a nested data format.