Java Says, Your Data's Not That Big
We all know Java's a great language, but it's not for data science, right? Wrong! The speed of Java coupled with hardware advancements makes it great for big data.
Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because it was "easy," which it was, if you ignore all the work needed to make it go fast. It seemed pretty clear to me that it could have been written in Java to run on a single machine with a much smaller staff.
One definition of "big data" is "data that is too big to fit on one machine." By that definition, what is "big data" for one language is plain old "data" for another. Java, with its efficient memory management, high performance, and multithreading, can get a lot done on one machine. To do data science in Java, however, you need data science tools. Tablesaw is an open-source (Apache 2) Java data science platform that lets users work with data on a single machine: it's a dataframe and visualization framework. Most data science currently done on clusters could be done on a single machine using Tablesaw paired with a Java machine learning library like Smile.
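To give a flavor of the dataframe style, here is a minimal sketch using Tablesaw's API (method names are from Tablesaw's published documentation, but exact signatures can vary by version; in real use you would load a file with `Table.read().csv(...)` rather than build columns by hand, and the column names here are invented for illustration):

```java
import tech.tablesaw.api.DoubleColumn;
import tech.tablesaw.api.StringColumn;
import tech.tablesaw.api.Table;
import static tech.tablesaw.aggregate.AggregateFunctions.mean;

public class TablesawSketch {
    public static void main(String[] args) {
        // Build a tiny table in memory; a real job would use Table.read().csv("sales.csv")
        Table sales = Table.create("sales",
                StringColumn.create("region", "east", "west", "east"),
                DoubleColumn.create("amount", 100.0, 250.0, 75.0));

        // Filter rows where amount > 80 (returns a new Table)
        Table large = sales.where(sales.doubleColumn("amount").isGreaterThan(80));
        System.out.println(large.rowCount()); // 2 of the 3 rows pass the filter

        // Group-by aggregation: mean amount per region
        Table byRegion = sales.summarize("amount", mean).by("region");
        System.out.println(byRegion);
    }
}
```

The whole pipeline runs in one JVM process, which is the point: filtering and aggregating a table that fits in RAM needs no cluster at all.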
But you don't have to take my word for that.
A KDNuggets poll showed that most analytics don’t require “Big Data” tools. The poll asked data scientists about the largest data sets they work with, and found they were often not "big data."
In a post summarizing the findings from that poll, they note:
A majority of data scientists (56%) work in Gigabyte dataset range.
In other words, most data scientists can do their work on a laptop.
A similar result comes from Facebook, where a study showed that 96% of their data center jobs could fit in 32 GB of RAM.
Another finding from KDNuggets was that RAM is growing faster than data. By their estimate, RAM is growing at 50% per year, while the trend for the largest data sets is increasing at 20% per year.
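Compounding those two growth rates makes the gap vivid. A few lines of arithmetic (using the article's estimates of 50% and 20% annual growth as assumptions) show how quickly RAM pulls ahead:

```java
public class GrowthGap {
    public static void main(String[] args) {
        // KDnuggets estimates quoted above: RAM capacity +50%/yr, largest datasets +20%/yr
        double ram = 1.0;
        double data = 1.0;
        for (int year = 1; year <= 10; year++) {
            ram *= 1.5;
            data *= 1.2;
            System.out.printf("year %2d: RAM x%5.1f, data x%4.1f, headroom x%.1f%n",
                    year, ram, data, ram / data);
        }
    }
}
```

After a decade at those rates, RAM has grown roughly nine times faster than the largest data sets, so a data set that strains a machine today is comfortable on a machine bought a few years from now.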
If your laptop is too small, you can probably do your work faster, easier, and cheaper by leasing a server in the cloud. This is essentially the finding of "Nobody Ever Got Fired for Buying a Cluster," a Microsoft Research paper that examines the total cost of using distributed "big data" tools like Spark and Hadoop. From their summary:
Should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.
In a blog post titled "The emperor's new clothes: distributed machine learning" another Microsoft Researcher, Paul Mineiro, states:
Since this is my day job, I’m of course paranoid that the need for distributed learning is diminishing as individual computing nodes… become increasingly powerful.
When he wrote that, Mineiro was taking notes at a talk by Stanford professor Jure Leskovec. Leskovec is a co-author of the wonderful (and free) textbook Mining of Massive Datasets, so he understands large-scale data crunching. His advice: "Get your own 1TB RAM server!"
Jure Leskovec’s take on the best way to mine large datasets.
"Jure said every grad student is his lab has one of these machines, and that almost every data set of interest fits in RAM."
You can have one, too. Amazon offers EC2 instances with up to 4 TB of RAM. You can get a 1TB instance for less than $4 per hour (reserved), or less than $7 per hour (on-demand). That's less than the minimum wage in Massachusetts, and far less than an engineer's wage. Once you have one, you can make the most of it by using RAM-optimized data science tools like Tablesaw.
Tablesaw is not for every job. You're limited to about 2.1 billion rows in a single table, but if your work fits, it can save you time. Using its indexing capability, you can execute searches on half a billion rows in a few milliseconds.
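Why an index turns a search into a millisecond operation is easy to demonstrate in plain Java. The sketch below is a scaled-down stand-in for the idea (a hash map from value to row number), not Tablesaw's actual index implementation, and it uses a million rows rather than half a billion so it runs anywhere:

```java
import java.util.HashMap;
import java.util.Map;

public class IndexDemo {
    public static void main(String[] args) {
        int n = 1_000_000; // scaled-down stand-in for a half-billion-row column
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) ids[i] = i * 2; // synthetic id column

        // Build the index once: value -> row number. This is the work an
        // index trades up front for fast lookups later.
        Map<Integer, Integer> index = new HashMap<>(n * 2);
        for (int row = 0; row < n; row++) index.put(ids[row], row);

        int target = ids[n - 1]; // worst case for a linear scan

        long t0 = System.nanoTime();
        int scanHit = -1;
        for (int row = 0; row < n; row++) {
            if (ids[row] == target) { scanHit = row; break; }
        }
        long scanNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        int indexHit = index.get(target);
        long indexNs = System.nanoTime() - t0;

        System.out.printf("scan: row %d in %.2f ms; index: row %d in %.4f ms%n",
                scanHit, scanNs / 1e6, indexHit, indexNs / 1e6);
    }
}
```

The scan cost grows with the table; the indexed lookup stays effectively constant, which is what makes millisecond queries on huge in-memory tables possible.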
Of course, it works even better on a million row dataset on your laptop.