Big Data - Isn't (well, almost)



As data sets continue to become larger and larger, and the tech around them more advanced, our definition of big data should shift as well.


Back in ancient history (2004), Google's Jeff Dean and Sanjay Ghemawat presented their innovative idea for dealing with huge data sets: MapReduce.

Jeff and Sanjay reported that a typical cluster was made of hundreds to thousands of machines with 2 CPUs and 2-4 GB of RAM each. They also reported that in the whole of August 2004, Google processed ~3.3 PB of data across 29,423 jobs, i.e. the average job processed around 110 GB of data.
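A quick back-of-the-envelope check of those 2004 figures (the totals are from the paper as cited above; the per-job average falls out of simple division):

```python
# Reproducing the "average job processed ~110 GB" figure from the
# August 2004 totals reported by Dean and Ghemawat.
total_bytes = 3.3e15   # ~3.3 PB processed that month
jobs = 29_423          # MapReduce jobs run that month

avg_gb_per_job = total_bytes / jobs / 1e9
print(f"average job size: ~{avg_gb_per_job:.0f} GB")  # ~112 GB
```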

[Figure: Google's MapReduce jobs run in August 2004]

How does that compare to today's systems and workloads?

I couldn't find current numbers from Google, but others estimate that by 2017 Google was processing over 20 PB a day (not to mention answering 40K search queries per second), so Google is definitely in the big data game. The numbers drop off fast after that, even for companies that really are big data companies: Facebook reported back in 2017 that it handles 500+ TB of new data daily, the whole of Twitter's data as of May 2018 was around 300 PB, and Uber reported that its data warehouse is in the 100+ PB range.

OK, but what about the rest of us? Let's take a look at an example.

Mark Litwintschik took a data set of 1.1 billion taxi and Uber rides published by Todd W. Schneider, 500 GB of uncompressed CSV (i.e. five times larger than the average job Google ran in 2004), and benchmarked it on modern big data infrastructure. For instance, he ran his benchmark with Spark 2.4.0 on a 21-node m3.xlarge cluster, i.e. 4 vCPUs and 15 GB of RAM per node (twice the CPUs and roughly four times the RAM of the 2004 machines).

His benchmark includes several queries. The one below took 20.412 seconds to complete, so, overall, it seems like decent progress in 14 years (I know this isn't a real comparison, but the next one is!).

SELECT passenger_count,
       year(pickup_datetime) trip_year,
       count(*) trips
FROM trips_orc
GROUP BY passenger_count,
         year(pickup_datetime)
ORDER BY trip_year,
         trips desc;
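To make the query's logic concrete, here is a toy reproduction of it on a handful of made-up rows, using Python's built-in SQLite (where `strftime('%Y', ...)` stands in for Spark SQL's `year()`; the table name matches the benchmark, the data does not):

```python
import sqlite3

# Miniature, in-memory version of the benchmark query: count trips
# per (passenger_count, pickup year), ordered by year and trip count.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips_orc (passenger_count INT, pickup_datetime TEXT)")
con.executemany(
    "INSERT INTO trips_orc VALUES (?, ?)",
    [(1, "2015-06-01 08:00:00"),
     (1, "2015-06-02 09:30:00"),
     (2, "2015-06-01 10:00:00"),
     (1, "2016-01-05 11:15:00")],
)

rows = con.execute("""
    SELECT passenger_count,
           CAST(strftime('%Y', pickup_datetime) AS INT) AS trip_year,
           COUNT(*) AS trips
    FROM trips_orc
    GROUP BY passenger_count, trip_year
    ORDER BY trip_year, trips DESC
""").fetchall()

for row in rows:
    print(row)  # (passenger_count, trip_year, trips)
```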

However, the truth is that this "1.1 billion rides" data set isn't a big data problem. As it happens, Mark also ran the same benchmark on a single Core i5 laptop with 16 GB of RAM using Yandex's ClickHouse, and the same query took only 12.748 seconds, meaning the laptop was almost 40% faster than the cluster.
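The "almost 40%" figure follows directly from the two reported timings:

```python
# Checking the speedup claim from the two benchmark timings above.
cluster_secs = 20.412  # Spark 2.4.0, 21-node m3.xlarge cluster
laptop_secs = 12.748   # ClickHouse, Core i5 laptop with 16 GB RAM

speedup_pct = (cluster_secs - laptop_secs) / cluster_secs * 100
print(f"laptop was ~{speedup_pct:.0f}% faster")  # ~38%
```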

Where I work today, one of our biggest data sets holds millions of hours of patient hospitalizations at minute resolution, with hundreds of different features: vital signs, medicines, labs, and a slew of clinical features built on top of them. This still fits comfortably in the community edition of Vertica (i.e. < 1 TB of data).

In fact, for less than $7 per hour, Amazon will rent you a machine with 1 TB of RAM and 64 cores, and they have machines that go up to 12 TB of RAM. So a huge number of data sets and problems are actually "fits in memory" data.
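A trivial sanity check makes the point: even the 500 GB taxi data set fits in RAM on such machines. (The instance names and RAM sizes below are illustrative figures matching the article's numbers, not a live AWS quote.)

```python
# Rough "does it fit in RAM?" check against the instance sizes
# mentioned above (sizes illustrative, in GiB of RAM).
instances = {
    "x1.16xlarge (~1 TB RAM, 64 vCPUs)": 976,
    "u-12tb1.metal (~12 TB RAM)": 12_288,
}

dataset_gib = 500  # e.g. the 1.1 billion ride CSV, uncompressed

for name, ram_gib in instances.items():
    verdict = "fits" if dataset_gib < ram_gib else "does not fit"
    print(f"{dataset_gib} GiB {verdict} in memory on {name}")
```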

Looking at another aspect: one of the basic ideas behind MapReduce and the slew of "big data" technologies that followed is that, since the data is big, it is much more efficient to move the computation to where the data is. The documentation for Hadoop, the poster child of big data systems, explains this nicely in the HDFS architecture section:

"A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located."

(From the HDFS Architecture documentation.)

At one of the previous companies I worked for, we handled 10 billion events per day. We were still able to store our data in S3 and read it into a Spark cluster to build reports like "weekly retention" that looked back at 6 months of data; i.e. all of the data, which was in the terabytes, was read from remote storage. It was still what you'd call a big data job, running on 150 servers with lots of RAM (something like 36 TB cluster-wide), but, again, we could and did bring the data to the computation and not the other way around.
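For readers unfamiliar with that kind of report, here is a minimal sketch of what a "weekly retention" computation does: for a cohort of users first seen in some week, measure what fraction are still active N weeks later. The event data below is made up; the real job read terabytes from S3 into Spark rather than a Python list:

```python
from datetime import date
from collections import defaultdict

# Toy weekly-retention computation on hand-made (user_id, activity_date) events.
events = [
    ("a", date(2018, 1, 1)), ("a", date(2018, 1, 8)),
    ("b", date(2018, 1, 2)), ("b", date(2018, 1, 16)),
    ("c", date(2018, 1, 3)),
]

def week_of(d):
    """Week index relative to the start of the reporting window."""
    return (d - date(2018, 1, 1)).days // 7

first_week = {}                 # user -> week of first activity
active = defaultdict(set)       # week -> set of users active that week
for user, d in sorted(events, key=lambda e: e[1]):
    first_week.setdefault(user, week_of(d))
    active[week_of(d)].add(user)

cohort0 = {u for u, w in first_week.items() if w == 0}
week1_retained = len(cohort0 & active[1]) / len(cohort0)
print(f"week-1 retention of the week-0 cohort: {week1_retained:.0%}")  # 33%
```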

Hadoop itself now supports RAID-like erasure coding, not just 3x replication (which was very helpful in getting data locality), as well as "provided storage," i.e. storage not managed by HDFS.

As a side note, this inability to provide a real competitive edge over cloud storage (and thus over cloud offerings of Hadoop), along with the rise of Kubernetes, is probably what led Cloudera and Hortonworks to consolidate and merge, but that's for another post.

Anyway, the point I am trying to make is that while data is getting bigger, the threshold for big data is also moving further away. I often see questions on Stack Overflow where someone runs a 3-node Spark cluster and complains about bad performance compared with whatever they were using before. Of course that's what happens: big data tools are built for big data (duh!), and when you use them on small data (which, as we've seen, can be rather big), you pay all of the overhead and get none of the benefits. The examples I gave here are for static data, but the same holds true for streaming solutions. Solutions such as Kafka are complex; they solve certain types of problems really well, and when you need them they are invaluable.

In both cases (streaming and batch), when you don't need big data tools they're just a burden. In a lot of cases "big data" isn't really all that big. 


Published at DZone with permission of Arnon Rotem-Gal-Oz, DZone MVB. See the original article here.

