Analyzing a large JSON in memory

By Abishek Baskaran · May 05, 2014

I have been doing a comparative study of different ways to analyze a large JSON document in memory, and I want to share the results.

I intend to do a wider set of comparisons, and this post is the first part. Here I will discuss the use case and the test results of analyzing the data in a simple row-based vs. columnar fashion.

Use Case

Let’s say you have a big chunk of JSON data from a source like the Twitter API and you want to analyze it by firing multiple, random queries at this large JSON. To speed up the queries, it is preferable to load the JSON into memory and answer the queries from there instead of reaching out to the disk every time (assuming the data fits in memory). Note that I am not bringing distributed computing or real-time requirements into the picture at this point.

The test data I used was 100,000 records from the Twitter firehose, which made up a 264 MB JSON array on disk. The test query was to pull up the tweets that came from a particular time zone and a particular user name: select text where user.screen_name = <> and user.timezone = <>.

Approaches

1. Row by row: Parse the Twitter stream data in a traditional row-by-row fashion, filter by both the user name and time zone parameters, and select the tweet text.

2. NoSQL DB: Load the test data into a NoSQL DB like MongoDB and run the query. This is not really an in-memory analysis, but with MongoDB nearing cache speeds, I would like to get some stats here. I will run this test for my next post.

3. Columnar: Convert the JSON to a columnar format, which is better suited for analytics. (Later I intend to add compression as well.) There are known techniques for converting nested JSON to a columnar layout, such as the Parquet format. I found that Parquet is intended to solve more problems than I needed, though it could potentially perform better, so I will probably revisit it later. Another, newer player, MemSQL, converts JSON to columnar form and allows SQL querying on it. MemSQL appears to be a good fit for this use case, but it is not free software, and it is a much bigger product that includes real-time and scalability solutions. Treasure Data also has a technology called Plazma, but again it does not appear to be open source.

I decided to code my own algorithm to flatten the JSON and query it in an SQL fashion. My approach differs a little from Parquet’s in that I did not use repetition levels and definition levels. Instead, I traversed the JSON, assigning row IDs to column values based on the nesting level and iteration.
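To make the idea concrete, here is a minimal sketch of this flattening approach (not the original implementation; it handles nested objects but not arrays, and the column names simply follow the pseudo-SQL above). Each nested leaf becomes a column holding (rowid, value) pairs, so a query can scan only the columns it needs:

```python
from collections import defaultdict

def flatten(record, columns, rowid, prefix=""):
    """Walk a nested JSON object, appending each leaf value to its
    column as a (rowid, value) pair so columns scan independently."""
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flatten(value, columns, rowid, prefix=name + ".")
        else:
            columns[name].append((rowid, value))

def load_columnar(records):
    """Assign a row ID per record and flatten all records into columns."""
    columns = defaultdict(list)
    for rowid, record in enumerate(records):
        flatten(record, columns, rowid)
    return columns

def query_columnar(columns, screen_name, timezone):
    """select text where user.screen_name = ? and user.timezone = ?
    Scans only the three columns involved, intersecting row-ID sets."""
    a = {r for r, v in columns["user.screen_name"] if v == screen_name}
    b = {r for r, v in columns["user.timezone"] if v == timezone}
    hits = a & b
    return [v for r, v in columns["text"] if r in hits]

def query_row_by_row(records, screen_name, timezone):
    """The row-by-row baseline: scan every full record and filter."""
    return [r["text"] for r in records
            if r.get("user", {}).get("screen_name") == screen_name
            and r.get("user", {}).get("timezone") == timezone]
```

The columnar version touches only the two filter columns and the projected text column, which is the intuition behind the faster query times below; the row-by-row baseline must visit every record in full.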

With this basic setup, I ran my test query and obtained the stats below:

Columnar

Time stats:

Milliseconds for file load: 30,432

Milliseconds for query exec: 33

Space stats:

Increase in heap after file load: 303 MB

Total memory: 1,517 MB

Free memory after program: 431 MB

Used (total – free): 1,086 MB

Row by row

Time stats:

Milliseconds for file load: 34,494

Milliseconds for query exec: 192

Space stats:

Increase in heap after file load: 241 MB

Total memory: 1,652 MB

Free memory after program: 303 MB

Used (total – free): 1,349 MB
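For context, a harness that produces stats of this shape can be sketched as follows. The original figures came from a JVM run (where Runtime.getRuntime().totalMemory() and freeMemory() give the space numbers); this Python sketch uses tracemalloc instead, so absolute values will differ:

```python
import time
import tracemalloc

def measure(load, query):
    """Time a load step and a query step, and report the peak heap
    growth (in MB) that tracemalloc observed during the load."""
    tracemalloc.start()
    t0 = time.perf_counter()
    data = load()
    load_ms = (time.perf_counter() - t0) * 1000
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    t0 = time.perf_counter()
    result = query(data)
    query_ms = (time.perf_counter() - t0) * 1000
    return load_ms, query_ms, peak / 1e6, result
```

Running the same query through this harness against both the row-by-row and columnar loaders gives directly comparable load-time, query-time, and heap-growth numbers.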

Inference

We can see right away that even with this basic columnar implementation, query execution is more than 5x faster (33 ms vs. 192 ms).

I am unable to draw a firm conclusion on the memory footprint due to other variables, but it looks like the columnar format is lighter on memory by a small margin (1,086 MB vs. 1,349 MB used) – not bad considering I have not done any compression yet.

That columnar storage is faster and lighter than row storage is not a new finding. This test affirms the same and applies it to JSON, which is a nested data format.
