Java Says, Your Data's Not That Big

We all know Java's a great language, but it's not for data science, right? Wrong! The speed of Java coupled with hardware advancements makes it great for big data.

by Larry White · Aug. 15, 18 · Analysis

Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because it was "easy," which it was, if you ignore all the work needed to make it go fast. It seemed pretty clear to me that it could have been written in Java to run on a single machine with a much smaller staff.

One definition of "big data" is "Data that is too big to fit on one machine." By that definition, what is "big data" for one language is plain-old "data" for another. Java, with its efficient memory management, high performance, and multi-threading, can get a lot done on one machine. To do data science in Java, however, you need data science tools. Tablesaw is an open-source (Apache 2) Java data science platform that lets users work with data on a single machine. It's a dataframe and visualization framework. Most data science currently done in clusters could be done on a single machine using Tablesaw paired with a Java machine learning library like Smile.
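To give a feel for the single-machine claim, here is a minimal sketch that uses only the JDK, not Tablesaw or Smile. It fills a primitive `double[]` column and averages it across all available cores with a parallel stream. The class and method names (`SingleMachineSum`, `makeColumn`, `parallelMean`) are made up for illustration, and the row count is an arbitrary assumption.

```java
import java.util.Arrays;

// Hypothetical example class; nothing here comes from Tablesaw or Smile.
final class SingleMachineSum {

    // Build one column of doubles with a repeating 0..99 pattern,
    // so the mean is easy to check by hand.
    static double[] makeColumn(int rows) {
        double[] column = new double[rows];
        for (int i = 0; i < rows; i++) {
            column[i] = i % 100;
        }
        return column;
    }

    // Average the column, letting the JDK split the work across cores.
    // Returns NaN for an empty column.
    static double parallelMean(double[] column) {
        return Arrays.stream(column)
                .parallel()
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // 10 million doubles is roughly 80 MB of raw data (assumed size).
        double[] column = makeColumn(10_000_000);
        System.out.println("rows = " + column.length);
        System.out.println("mean = " + parallelMean(column));
    }
}
```

A primitive array keeps the data compact and contiguous in memory, which is a large part of why one machine can get through so many rows.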

But you don't have to take my word for that.

A KDnuggets poll showed that most analytics don’t require “Big Data” tools. The poll asked data scientists about the largest data sets they work with, and found that these were often not "big data."

In a post summarizing the findings from that poll, they note:

A majority of data scientists (56%) work in Gigabyte dataset range.

In other words, most data scientists can do their work on a laptop.

[Figure: poll, largest dataset analyzed, 2013-2015]

A similar result comes from Facebook, where a study showed that 96% of their data center jobs could fit in 32 GB of RAM.

Another finding from KDnuggets was that RAM is growing faster than data. By their estimate, RAM is growing at 50% per year, while the largest data sets are growing at 20% per year.
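Taking those two growth rates at face value, the gap compounds: each year, RAM gains a factor of 1.5 / 1.2 = 1.25 on the largest data sets. The sketch below works that out for a few years. The class and method names are hypothetical, and assuming both rates stay constant is a simplification of mine, not a claim from the poll.

```java
// Hypothetical helper for the growth arithmetic; it assumes constant yearly rates.
final class GrowthGap {

    // Factor by which RAM capacity outgrows the largest data sets after `years`,
    // given yearly growth rates expressed as fractions (0.5 means 50%).
    static double relativeHeadroom(double ramGrowth, double dataGrowth, int years) {
        return Math.pow((1 + ramGrowth) / (1 + dataGrowth), years);
    }

    public static void main(String[] args) {
        double ramGrowth = 0.50;  // estimate quoted in the article
        double dataGrowth = 0.20; // estimate quoted in the article
        for (int years = 1; years <= 5; years++) {
            System.out.printf("after %d year(s): RAM headroom x%.2f%n",
                    years, relativeHeadroom(ramGrowth, dataGrowth, years));
        }
    }
}
```

At those rates the headroom nearly doubles in three years (1.25 cubed is about 1.95).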

If your laptop is too small, you can probably do your work faster, easier, and cheaper by leasing a server in the cloud. This is basically the finding of "Nobody ever got fired for using Hadoop on a cluster," a paper from Microsoft Research that examines the total cost of using distributed “big data” tools like Spark and Hadoop. Their summary:

Should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.

In a blog post titled "The emperor's new clothes: distributed machine learning," another Microsoft researcher, Paul Mineiro, states:

Since this is my day job, I’m of course paranoid that the need for distributed learning is diminishing as individual computing nodes… become increasingly powerful.

When he wrote that, Mineiro was taking notes at a talk by Stanford professor Jure Leskovec. Leskovec is co-author of the wonderful (free!) textbook Mining of Massive Datasets, so he understands large-scale data crunching. What he said was: "Get your own 1TB RAM server!"

[Slide: "Bottom line: get your own 1TB RAM server." Jure Leskovec’s take on the best way to mine large datasets.]

"Jure said every grad student in his lab has one of these machines, and that almost every data set of interest fits in RAM."

You can have one, too. Amazon offers EC2 instances with up to 4 TB of RAM. You can get a 1 TB instance for less than $4 per hour (reserved), or less than $7 per hour (on-demand). This is less than the minimum wage in Massachusetts, and a lot less than the engineering wage. Once you have one, you can make the most of it by using RAM-optimized data science tools like Tablesaw.

Tablesaw is not for every job. You're limited to 2.1 billion rows in a single table, but if your work fits, it can save you time. If you use its indexing capability, you can execute searches that are literally as fast as lightning (a few milliseconds) on half a billion rows.
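The 2.1 billion figure matches Java's `Integer.MAX_VALUE` (2,147,483,647). My assumption, not something stated above, is that the limit comes from `int`-based row indexing. The sketch below checks a row count against that ceiling and makes a rough estimate of raw column size. The class, its methods, and the ten-column, 8-bytes-per-value layout are hypothetical, and the estimate ignores indexes, strings, and JVM overhead.

```java
// Hypothetical sizing helper; it is not part of Tablesaw.
final class TableSizing {

    // Assumed ceiling: the largest value a Java int can hold (about 2.1 billion).
    static final long MAX_ROWS = Integer.MAX_VALUE;

    // True if a table with this many rows stays within the assumed int-based limit.
    static boolean fitsRowLimit(long rows) {
        return rows >= 0 && rows <= MAX_ROWS;
    }

    // Raw bytes for fixed-width columns: rows * columns * bytesPerValue.
    // multiplyExact throws ArithmeticException on overflow instead of wrapping silently.
    static long rawBytes(long rows, int columns, int bytesPerValue) {
        return Math.multiplyExact(Math.multiplyExact(rows, (long) columns), (long) bytesPerValue);
    }

    public static void main(String[] args) {
        long rows = 500_000_000L; // the half-billion-row case from the text
        int columns = 10;         // assumed column count
        int bytesPerValue = 8;    // assumed width, e.g. a double or long

        System.out.println("fits row limit: " + fitsRowLimit(rows));
        System.out.println("raw bytes:      " + rawBytes(rows, columns, bytesPerValue));
    }
}
```

Under those assumptions, half a billion rows of ten 8-byte columns comes to 40,000,000,000 bytes, or about 40 GB of raw data, which is well within a 1 TB machine.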

Of course, it works even better on a million-row dataset on your laptop.
