Analyzing a large JSON in memory
I have been doing a comparative study of different ways to analyze a large JSON document in memory, and I wanted to share the results.
I intend to do a wider set of comparisons, and this post is the first part. In this post I will discuss the use case and the test results from analyzing the data in a simple row-based vs. columnar fashion.
Use Case
Let’s say you get a big chunk of JSON data from a source like the Twitter API and you want to analyze it by firing multiple ad-hoc queries against this large JSON. To speed up the queries, it’s preferable to load the JSON into memory and answer the queries from there instead of reaching out to disk every time (assuming the data fits in memory). Please note that I did not want to bring distributed computing or real-time requirements into the picture at this point.
The test data was 100,000 records from the Twitter firehose, which made up a 264 MB JSON array on disk. The test query was to pull up the tweets posted by a particular user name from a particular time zone: select text where user.screen_name = <> and user.timezone = <>.
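For reference, each record in the array is a full tweet object. A heavily simplified sketch of just the fields the test query touches (field names follow the query above; a real firehose record carries many more attributes) looks roughly like this:

```json
{
  "text": "the tweet text to select",
  "user": {
    "screen_name": "some_user",
    "timezone": "Pacific Time (US & Canada)"
  }
}
```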
Approaches
1. Row by row: Parse the Twitter stream data in a traditional row-by-row fashion, filter by both the user name and time zone parameters, and select the tweet text (a minimal sketch of this approach appears after this list).
2. NoSQL DB: Load the test data into a NoSQL database like MongoDB and run the query. This is not really in-memory analysis, but with MongoDB approaching cache speeds, I would like to gather some stats here. I will cover this test in my next post.
3. Columnar: Convert the JSON to a columnar format, which is better suited for analytics. (Later I intend to add compression as well.) There are known techniques for converting nested JSON to a columnar representation, such as the Parquet format. I found that Parquet aims to solve more problems than I needed (and could potentially perform better), so I will probably revisit Parquet later. Another new player, MemSQL, converts JSON to a columnar format and allows SQL querying on it. MemSQL appears to be a good fit for this use case, but it is not free software; it is also a much bigger product that includes real-time and scalability features. I also found that Treasure Data has a technology called Plazma, but again it did not appear to be open source.
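To make approach 1 concrete, here is a minimal sketch of the row-by-row scan, assuming a Java implementation on top of the Jackson library (the class and field-access code below are illustrative, not the exact code used for the benchmark):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class RowByRowQuery {

    // Loads the whole JSON array into memory and scans it record by record.
    public static List<String> findTweets(File jsonFile, String screenName, String timezone) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode tweets = mapper.readTree(jsonFile); // the 264 MB array of tweet objects

        List<String> results = new ArrayList<>();
        for (JsonNode tweet : tweets) {
            JsonNode user = tweet.path("user");
            // select text where user.screen_name = <> and user.timezone = <>
            if (screenName.equals(user.path("screen_name").asText())
                    && timezone.equals(user.path("timezone").asText())) {
                results.add(tweet.path("text").asText());
            }
        }
        return results;
    }
}
```

Every query has to walk all 100,000 records and descend into the nested user object each time, which is the cost the columnar layout below tries to cut.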
I decided to code my own algorithm to flatten the JSON and query it in an SQL-like fashion. My approach is a little different from Parquet's in that I did not use repetition and definition levels. Instead, I traversed the JSON and assigned row IDs to column values based on the level and iteration.
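To give a feel for the idea, here is a simplified sketch of such a flattened store and its query path: each leaf path (e.g. user.screen_name) becomes its own column of (rowId, value) pairs, a query intersects the row IDs that satisfy each predicate, and the requested column is then projected. The actual rowId assignment for nested arrays (driven by level and iteration) is more involved and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified sketch of a flattened, column-oriented store for nested JSON.
public class ColumnStore {

    // One (rowId, value) pair per occurrence of a leaf path in a record.
    private static class Cell {
        final int rowId;
        final String value;
        Cell(int rowId, String value) { this.rowId = rowId; this.value = value; }
    }

    // Column name (a leaf path such as "user.screen_name") -> its cells.
    private final Map<String, List<Cell>> columns = new HashMap<>();

    // Called while traversing the parsed JSON: record a leaf value for a row.
    public void add(String columnPath, int rowId, String value) {
        columns.computeIfAbsent(columnPath, k -> new ArrayList<>()).add(new Cell(rowId, value));
    }

    // Row IDs where the given column equals the given value.
    private Set<Integer> matchingRows(String columnPath, String value) {
        Set<Integer> rows = new HashSet<>();
        for (Cell cell : columns.getOrDefault(columnPath, Collections.emptyList())) {
            if (cell.value.equals(value)) {
                rows.add(cell.rowId);
            }
        }
        return rows;
    }

    // select <projectColumn> where <col1> = <val1> and <col2> = <val2>
    public List<String> select(String projectColumn,
                               String col1, String val1,
                               String col2, String val2) {
        Set<Integer> rows = matchingRows(col1, val1);
        rows.retainAll(matchingRows(col2, val2)); // AND of the two predicates

        List<String> result = new ArrayList<>();
        for (Cell cell : columns.getOrDefault(projectColumn, Collections.emptyList())) {
            if (rows.contains(cell.rowId)) {
                result.add(cell.value);
            }
        }
        return result;
    }
}
```

With this layout, the test query only touches the three columns it needs instead of every field of every tweet, e.g. store.select("text", "user.screen_name", "someUser", "user.timezone", "Pacific Time (US & Canada)").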
With this basic setup, I ran my test query and obtained the stats below:
Columnar

Time stats:
File load time: 30432 ms
Query execution time: 33 ms

Space stats:
Increase in heap after file load: 303 MB
Total memory: 1517 MB
Free memory after program: 431 MB
Used (total - free): 1086 MB

Row by row

Time stats:
File load time: 34494 ms
Query execution time: 192 ms

Space stats:
Increase in heap after file load: 241 MB
Total memory: 1652 MB
Free memory after program: 303 MB
Used (total - free): 1349 MB
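For reference, memory figures like the ones above can be collected from the JVM with the Runtime API. The snippet below is only a minimal sketch of that kind of instrumentation; the loadJson() call is just a placeholder for the actual file-loading code, which is not shown here.

```java
public class MemoryStats {

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long usedBefore = rt.totalMemory() - rt.freeMemory();

        long loadStart = System.currentTimeMillis();
        // loadJson(...);  // placeholder: parse the 264 MB JSON file here
        long loadMillis = System.currentTimeMillis() - loadStart;

        long usedAfter = rt.totalMemory() - rt.freeMemory();

        System.out.println("File load time: " + loadMillis + " ms");
        System.out.println("Increase in heap after file load: "
                + (usedAfter - usedBefore) / (1024 * 1024) + " MB");
        System.out.println("Total memory: " + rt.totalMemory() / (1024 * 1024) + " MB");
        System.out.println("Free memory: " + rt.freeMemory() / (1024 * 1024) + " MB");
    }
}
```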
Inference
We can see right away that, with this basic columnar implementation, query execution is more than 5x faster (33 ms vs. 192 ms).
I am unable to draw a firm conclusion on the memory footprint due to other variables, but it looks like the columnar format is lighter on memory by a small margin, which is not bad considering I have not applied any compression yet.
That columnar storage is faster and lighter than row-based storage is not a new finding. This test simply affirms it and applies it to JSON, which is a nested data format.