
Big Data - Isn't (well, almost)

As data sets continue to become larger and larger, and the tech around them more advanced, our definition of big data should shift as well.

By Arnon Rotem-gal-oz · Mar. 28, 2019 · Opinion

Back in ancient history (2004), Google's Jeff Dean and Sanjay Ghemawat presented their novel idea for dealing with huge data sets: MapReduce.

Jeff and Sanjay reported that a typical cluster was made of hundreds to thousands of machines with 2 CPUs and 2-4 GB of RAM each. They also reported that in the whole of August 2004, Google processed ~3.3 PB of data across 29,423 jobs, i.e. the average job processed around 110 GB of data (3.3 PB / 29,423 jobs ≈ 112 GB per job).

[Figure: Google's MapReduce jobs run in August 2004]

How does that compare to today's systems and workloads?

I couldn't find current numbers from Google, but others say that by 2017 Google processed over 20 PB a day (not to mention answering 40K search queries per second), so Google is definitely in the big data game. The numbers drop off fast after that, even for companies that really are big data companies: Facebook reported back in 2017 that it handles 500+ TB of new data daily, the whole of Twitter's data as of May 2018 was around 300 PB, and Uber reported that its data warehouse is in the 100+ PB range.

Ok, but what about the rest of us? Let's take a look at an example.

Mark Litwintschik took a data set of 1.1 billion taxi and Uber rides published by Todd W. Schneider, 500 GB of uncompressed CSV (i.e. about five times larger than the average job Google ran in 2004), and benchmarked it on modern big data infrastructure. For instance, he ran his benchmark with Spark 2.4.0 on a 21-node m3.xlarge cluster, i.e. 4 vCPUs and 15 GB of RAM per node (that's 2x the CPUs and roughly 4x the RAM of the 2004 machines).

His benchmark includes several queries. The one below took 20.412 seconds to complete, so, overall, it seems like decent progress in 14 years (I know this isn't a real comparison, but the next one is!).

SELECT passenger_count,
       year(pickup_datetime) AS trip_year,
       round(trip_distance),
       count(*) AS trips
FROM trips_orc
GROUP BY passenger_count,
         year(pickup_datetime),
         round(trip_distance)
ORDER BY trip_year,
         trips DESC;

However, the truth is that this "1.1 billion rides" data set isn't a big data problem. As it happens, Mark also ran the same benchmark on a single laptop with a Core i5 CPU and 16 GB of RAM using Yandex's ClickHouse, and the same query took only 12.748 seconds, i.e. almost 40% faster than the 21-node Spark cluster.
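
Out of curiosity, a ClickHouse version of the same aggregation would look roughly like the sketch below. This is my approximation, not Mark's exact query; the trips_mergetree table name is an assumption, and toYear() is ClickHouse's year-extraction function.

-- Rough ClickHouse equivalent of the Spark query above (assumed table name).
SELECT passenger_count,
       toYear(pickup_datetime) AS trip_year,
       round(trip_distance) AS distance,
       count(*) AS trips
FROM trips_mergetree
GROUP BY passenger_count,
         trip_year,
         distance
ORDER BY trip_year,
         trips DESC;

Nearly the same SQL, but executed by a single columnar engine on one machine, with none of the cluster coordination overhead. That is exactly the point of the comparison.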

Where I work today, one of our biggest data sets is millions of hours of patient hospitalizations at minute resolution, with hundreds of different features: vital signs, medications, labs, and a slew of clinical features built on top of them. This still fits comfortably in the community edition of Vertica (i.e. < 1 TB of data).

In fact, for less than $7 per hour, Amazon will rent you a machine with 1 TB of RAM and 64 cores, and they have machines that go up to 12 TB of RAM. So a huge number of data sets and problems are actually "fits in memory" data.

Looking at another aspect: one of the basic ideas behind MapReduce and the slew of "big data" technologies that followed is that, since the data is big, it is much more efficient to move the computation to where the data is. The documentation for Hadoop, the poster child of big data systems, explains this nicely in the HDFS architecture section:

"A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located."


In one of the previous companies I worked for, we handled 10 billion events per day. We were still able to store our data in S3 and read it into a Spark cluster to build reports like "weekly retention" that looked back at 6 months of data, i.e. all the data, which was in the terabytes, was read from remote storage. It was still what you'd call a big data job, running on 150 servers with lots of RAM (something like 36 TB cluster-wide), but, again, we could and did bring the data to the computation and not the other way around.
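
To make "weekly retention" concrete, a minimal Spark SQL sketch of that kind of report might look like the query below. The events table and its user_id and event_date columns are hypothetical stand-ins, not our actual schema.

-- Hypothetical sketch: for each pair of weeks in the last ~6 months,
-- count users active in the earlier week who appear again in the later one.
WITH weekly_users AS (
  SELECT DISTINCT user_id,
         date_trunc('week', event_date) AS week
  FROM events
  WHERE event_date >= date_sub(current_date(), 180)
)
SELECT w1.week AS cohort_week,
       w2.week AS active_week,
       count(DISTINCT w2.user_id) AS retained_users
FROM weekly_users w1
JOIN weekly_users w2
  ON w2.user_id = w1.user_id
 AND w2.week > w1.week
GROUP BY w1.week, w2.week
ORDER BY cohort_week, active_week;

The point is not the query itself, but that every byte it touches is read over the network from S3 rather than from local disks, and it still works fine.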

Hadoop itself now supports RAID-like erasure coding, not just 3x replication (which was very helpful for getting data locality), as well as "provided storage" (i.e. storage not managed by HDFS).

As a side note, this inability to provide a real competitive edge over cloud storage (and thus over cloud offerings of Hadoop), along with the rise of Kubernetes, is probably what led Cloudera and Hortonworks to consolidate and merge, but that's for another post.

Anyway, the point I am trying to make is that while data is getting bigger, the threshold for big data is also moving further away. I often see questions on Stack Overflow where a person runs a 3-node Spark cluster and complains about bad performance compared with whatever they were doing before. Of course that's what's going to happen: big data tools are built for big data (duh!), and when you use them with small data (which, as we've seen, can be rather big), you pay all of the overhead and get none of the benefits. The examples I gave here are for static data, but the same holds true for streaming solutions. Solutions such as Kafka are complex; they solve certain types of problems really well, and when you need them they are invaluable.

In both cases (streaming and batch), when you don't need big data tools they're just a burden. In a lot of cases "big data" isn't really all that big. 


Published at DZone with permission of Arnon Rotem-gal-oz, DZone MVB. See the original article here.

