Hadoop Vs. Spark — Choosing the Right Big Data Framework
Which framework is best for you?
We are surrounded by data on all sides. With data doubling in size every two years, the digital universe is catching up with the physical universe at a fast pace. It is estimated that by 2020, the digital universe will reach 44 zettabytes, as many digital bits as there are stars in the universe.
The data keeps growing, and we are not getting rid of it any time soon. To digest all this data, a growing number of distributed systems have come on the market. Among these systems, Hadoop and Spark are often pitted against one another as direct competitors.
When deciding which of these two frameworks is right for you, it’s important to compare them on a few essential parameters.
Performance
Spark is lightning-fast and has been found to outperform the Hadoop framework. It is reported to run up to 100 times faster in memory and up to 10 times faster on disk than Hadoop MapReduce. Moreover, it has been shown to sort 100 TB of data 3 times faster than Hadoop using 10 times fewer machines.
Spark is so fast because it keeps intermediate data in memory instead of writing it to disk between processing steps. Thanks to this in-memory processing, it can deliver near-real-time analytics for data from marketing campaigns, IoT sensors, machine learning, and social media sites.
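To make the in-memory idea concrete, here is a minimal plain-Python sketch. It does not use the Spark or Hadoop APIs; it only illustrates the access pattern. The class and function names are hypothetical. One job re-reads its input from disk on every pass, while the other reads it once and reuses the copy held in memory.

```python
import os
import tempfile


class CountingSource:
    """Wraps a file on disk and counts how many times it is fully read."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0

    def read_numbers(self):
        self.disk_reads += 1
        with open(self.path) as f:
            return [int(line) for line in f]


def disk_bound_passes(source, passes):
    """Re-read the input from disk on every pass (disk-bound style)."""
    total = 0
    for _ in range(passes):
        total += sum(source.read_numbers())
    return total


def in_memory_passes(source, passes):
    """Read the input once, keep it in memory, and reuse it on every pass."""
    cached = source.read_numbers()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


def run_demo(passes=5):
    """Return (disk_reads_when_rereading, disk_reads_when_cached, results_match)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "numbers.txt")
        with open(path, "w") as f:
            f.write("\n".join(str(n) for n in range(1000)))
        disk_source = CountingSource(path)
        mem_source = CountingSource(path)
        disk_total = disk_bound_passes(disk_source, passes)
        mem_total = in_memory_passes(mem_source, passes)
        return disk_source.disk_reads, mem_source.disk_reads, disk_total == mem_total
```

Both variants compute the same result; the difference is how often the disk is touched. That is why iterative workloads, which revisit the same data many times, tend to benefit most from keeping data in memory.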
However, if Spark runs on YARN alongside other shared services, its performance might degrade, because it has to compete with them for RAM and may run into memory overhead problems. Hadoop, on the other hand, handles this situation easily, as it does not depend on holding data in memory. If a user leans toward batch processing, Hadoop can be the more efficient choice.
Bottom Line: Hadoop and Spark process data in different ways. Whether to go with Hadoop or Spark in the Hadoop vs Spark performance battle therefore depends entirely on the requirements of the project.
Facebook and its Transitional Journey With Spark Framework
Data on Facebook increases with each passing second. To handle this data and use it to make intelligent decisions, Facebook uses analytics. For that, it makes use of a number of platforms:
- Hive platform to execute some of Facebook’s batch analytics.
- Corona platform for the custom MapReduce implementation.
- Presto footprint for ANSI-SQL-based queries.
The Hive platform mentioned above was resource-intensive, so maintaining it was a huge challenge. Facebook therefore decided to switch to the Apache Spark framework to manage its data. Today, Facebook has deployed a faster, more manageable pipeline for its entity ranking systems by integrating Spark.
Security
Spark’s security is still evolving, as it currently supports authentication only via a shared secret (password authentication). Even Apache Spark’s official website states that “there are many different types of security concerns. Spark does not necessarily protect against all things.”
Hadoop, on the other hand, has the following security features: Hadoop Authentication, Hadoop Authorization, Hadoop Auditing, and Hadoop Encryption. All of these are integrated with Hadoop security projects like Knox Gateway and Sentry.
Bottom Line: In the Hadoop vs Spark security battle, Spark is a little less secure than Hadoop. However, when Spark is integrated with Hadoop, it can use Hadoop’s security features.
Cost
First of all, both Hadoop and Spark are open-source frameworks and thus come for free. Both use commodity servers, run on the cloud, and seem to have somewhat similar hardware requirements.
So, how do we evaluate them on the basis of cost?
Note that Spark makes use of huge amounts of RAM to run everything in memory. This could impact cost, given that RAM is more expensive than hard disks.
On the other hand, Hadoop is disk-bound, so it saves you the cost of buying expensive RAM. However, Hadoop needs more systems to distribute the disk I/O.
Therefore, when comparing the Spark and Hadoop frameworks on cost, organizations will have to weigh their requirements.
If the requirement tilts toward processing large amounts of historical data, Hadoop is the better choice because hard disk space is much cheaper than memory.
On the other hand, Spark can be cost-effective when dealing with real-time data, as it uses less hardware to perform the same tasks at a much faster rate.
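The trade-off above can be sketched as a back-of-the-envelope calculation. Every number below (node capacities, throughputs, prices) is a made-up assumption for illustration, not a measurement or a vendor quote. The model is deliberately simple: a cluster needs enough nodes to hold the data and enough nodes to scan it at the required rate, whichever is larger.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    capacity_gb: int      # data one node can hold (hypothetical)
    throughput_gb_s: int  # data one node can scan per second (hypothetical)
    price_usd: int        # purchase price of one node (hypothetical)


def nodes_needed(node, data_gb, required_gb_s):
    """Nodes required to both store the data and meet the scan rate."""
    by_capacity = math.ceil(data_gb / node.capacity_gb)
    by_throughput = math.ceil(required_gb_s / node.throughput_gb_s)
    return max(by_capacity, by_throughput)


def cluster_cost(node, data_gb, required_gb_s):
    """Total hardware cost in USD under this simplified model."""
    return nodes_needed(node, data_gb, required_gb_s) * node.price_usd


# Illustrative node profiles: cheap bulk storage vs. fast but small memory.
DISK_NODE = NodeType("disk-bound", 16_000, 1, 2_480)
MEMORY_NODE = NodeType("in-memory", 256, 20, 3_280)

# Large historical batch job: 500 TB of data, modest scan rate.
batch_disk = cluster_cost(DISK_NODE, 500_000, 10)
batch_memory = cluster_cost(MEMORY_NODE, 500_000, 10)

# Small real-time workload: 200 GB working set, high scan rate.
realtime_disk = cluster_cost(DISK_NODE, 200, 40)
realtime_memory = cluster_cost(MEMORY_NODE, 200, 40)
```

Under these assumed figures, the disk-bound cluster is far cheaper for the large batch job, while the in-memory cluster is cheaper for the small, fast workload because the disk-bound one needs many machines just to spread the I/O. Real sizing depends on actual hardware prices and workload profiles.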
Bottom Line: In the Hadoop vs Spark cost battle, Hadoop definitely costs less, but Spark is cost-effective when an organization has to deal with smaller amounts of real-time data.
Ease of Use
One of the biggest selling points of the Spark framework is its ease of use. Spark has user-friendly APIs for Scala, Java, and Python, as well as Spark SQL (the successor to the earlier Shark project).
The simple building blocks of Spark make it easy to write user-defined functions. Moreover, since Spark supports both batch processing and machine learning, it helps simplify the data processing infrastructure. It even includes an interactive mode for running commands with immediate feedback.
Hadoop MapReduce is written in Java and has a reputation for being difficult to program, and it offers no interactive mode. Although Pig (an add-on tool) makes programming easier, it takes some time to learn the syntax.
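To give a feel for the difference in programming style, here is a plain-Python sketch of a word count. It uses neither the Hadoop nor the Spark API, and the function names are hypothetical. The first version spells out the map, shuffle, and reduce phases that the MapReduce model asks the programmer to think in; the second expresses the same computation as one short expression, closer to how a high-level API reads.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count_mapreduce(lines):
    """Word count written as three explicit phases."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


def word_count_concise(lines):
    """The same word count as a single expression."""
    return dict(Counter(word for line in lines for word in line.lower().split()))
```

Both functions return the same dictionary for the same input. The explicit version is not wrong, only more verbose, which is the kind of overhead that higher-level tools aim to hide.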
Bottom Line: In the ‘Ease of Use’ Hadoop vs Spark battle, each has its own ways of being user-friendly. However, if we have to choose one, Spark is easier to program and includes an interactive mode.
Is it Possible for Apache Hadoop and Spark to Have a Synergistic Relationship?
Yes, it is very much possible, and we recommend it. Let’s get into the details of how they can work in tandem.
The Apache Hadoop ecosystem includes components such as HDFS and Apache Hive, along with other query tools. Let’s see how Apache Spark can make use of them.
An Amalgamation of Apache Spark and HDFS
The purpose of Apache Spark is to process data. However, to process data, the engine needs to read it from storage, and for this purpose Spark commonly uses HDFS. (This is not the only option, but it is the most popular one, since both are Apache projects.)
A Blend of Apache Hive and Apache Spark
Apache Spark and Apache Hive are highly compatible, and together they can solve many business problems.
For instance, let's say that a business analyzes consumer behavior. For this, the company will need to gather data from various sources such as social media, comments, clickstream data, customer mobile apps, and many more.
The organization could use HDFS to store the data and Apache Hive as a bridge between HDFS and Spark.
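A minimal sketch of what that bridge can look like from the Spark side, assuming PySpark and a `SparkSession` created with Hive support. The table name `clickstream_events` and its `page` column are hypothetical; the function only builds a SQL query and runs it through `spark.sql`, so Hive resolves the table and Spark does the processing.

```python
def top_pages(spark, table="clickstream_events", limit=10):
    """Return the most-visited pages from a Hive table as (page, visits) pairs.

    `spark` is expected to be a SparkSession with Hive support enabled.
    The table and column names are hypothetical placeholders.
    """
    query = (
        f"SELECT page, COUNT(*) AS visits FROM {table} "
        f"GROUP BY page ORDER BY visits DESC LIMIT {int(limit)}"
    )
    # spark.sql returns a DataFrame; collect() brings the rows to the driver.
    rows = spark.sql(query).collect()
    return [(row["page"], row["visits"]) for row in rows]
```

In a real job, the session would typically be created with `SparkSession.builder.appName("consumer-behavior").enableHiveSupport().getOrCreate()` and passed in as `spark`. Taking the session as a parameter keeps the function easy to try out with a stand-in object when no cluster is available.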
Uber and its Amalgamated Approach
To process consumer data, Uber uses a combination of Spark and Hadoop. It uses real-time traffic conditions to make drivers available at a particular time and location. To make this possible, Uber uses HDFS for uploading raw data into Hive and Spark for processing billions of events.
Hadoop vs Spark: And the Winner Is…
While Spark is fast and easy to use, Hadoop comes with robust security, mammoth storage capacity, and low-cost batch processing capabilities. Choosing one of the two depends entirely on the requirements of your project. Combining the two would make for an invincible pairing.
“Between two evils, choose neither; between two goods, choose both.”
Mix some attributes of Spark and some of Hadoop to come up with a brand new framework: Spoop.
Published at DZone with permission of Sunil Goyal. See the original article here.