DZone
Hadoop and Spark: Synergy Is Possible

They are different and often look like rivals. Still, it’s not always an either-or choice, as they can coexist perfectly.

By Alexander Bekker · Updated Aug. 02, 22 · Opinion
When people mention Hadoop and Spark together, they usually contrast these two popular big data frameworks. According to Ahrefs, 1,200 Google users search for Spark vs. Hadoop each month, while only 90 search for Spark and Hadoop. The frameworks seem to have gradually gained a reputation for being mutually exclusive, but this is not always the case. There are multiple ways for businesses to benefit from their synergy. Let’s take a closer look at Hadoop and Spark and discover scenarios where they can work together.

Apache Hadoop Defined

Apache Hadoop is an open-source framework for data storage and parallel processing. First released in 2006, with version 1.0 following in 2011, Hadoop spurred the evolution of big data. Distributed data storage allowed companies to cope with big data volumes: they no longer needed to buy extremely expensive custom hardware and could instead store data across multiple affordable computers. This approach also provided much-needed scalability. When the amount of data to be stored and processed increased, companies could add extra computers. Because the data was spread across many machines, it had to be processed in parallel, and parallel processing became another distinctive feature of Hadoop.

Apache Spark Defined

Apache Spark is an open-source framework for parallel processing. Spark, whose version 1.0 was released in 2014, was designed to address the shortcomings of Hadoop MapReduce, chiefly its processing speed. Unlike Hadoop MapReduce, which has to write interim results back to disk and then read the data again at each step, Spark processes data in memory. As a result, it can be up to 100 times faster than Hadoop MapReduce.

Note that the real alternatives here are Apache Spark and Hadoop MapReduce, not Spark and Hadoop as a whole. This is evident from Spark’s definition.

Using Hadoop and Spark Together

One important remark: the Hadoop ecosystem consists of several components, including the Hadoop Distributed File System (or HDFS for short), Apache Hive (a query engine), Hadoop MapReduce (a framework for the parallel processing of large distributed datasets), and more. With this in mind, let’s take a look at possible synergy scenarios.

HDFS + Apache Spark

We have already clarified that Apache Spark’s intended purpose is data processing. But to process data, the engine first needs to read it from some storage. HDFS is not the only option available, but it is a frequent one. The reason is simple: both are Apache Software Foundation projects, and HDFS and Spark are highly compatible.

An illustrative example of such synergy is a word count. The sequence of operations is as follows: Apache Spark reads a text file from HDFS, divides each line into separate words, assigns the value 1 to each word, sums the values for each word, and writes the result back to HDFS.
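The word-count steps can be sketched in plain Python. This is a minimal stand-in that runs without a cluster, not the article’s original example: the function names are made up for illustration, and the HDFS paths in the closing comment are hypothetical. Each function mirrors one Spark transformation, named in its docstring.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_into_words(lines: Iterable[str]) -> List[str]:
    """Divide each line into separate words (Spark: flatMap)."""
    return [word for line in lines for word in line.split()]


def pair_with_one(words: Iterable[str]) -> List[Tuple[str, int]]:
    """Assign the value 1 to each word (Spark: map)."""
    return [(word, 1) for word in words]


def sum_by_word(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the values for each word (Spark: reduceByKey)."""
    totals: Dict[str, int] = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three steps, as Spark would chain its transformations."""
    return sum_by_word(pair_with_one(split_into_words(lines)))


# A roughly equivalent PySpark pipeline (paths are hypothetical):
#   sc.textFile("hdfs:///data/input.txt")
#     .flatMap(lambda line: line.split())
#     .map(lambda word: (word, 1))
#     .reduceByKey(lambda a, b: a + b)
#     .saveAsTextFile("hdfs:///data/word_counts")
```

In Spark, the same three steps run in parallel across the partitions of the file, and only the read and the final write touch HDFS.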

Apache Hive + Apache Spark

The combination of Apache Spark and Apache Hive (which is built on top of HDFS) makes it possible to solve many business tasks, such as customer behavior analytics. Imagine a company that accumulates data from multiple sources: clickstream data, comments and posts on social media, data from customer mobile apps, and so on. Let’s say that the company has chosen HDFS to store its data and Apache Hive to act as an intermediary between HDFS and Spark. Apache Hive makes it possible to query the data using a SQL-like language. As a result, Spark, which has built-in support for Hive, can easily access the data and process it. In the end, the company can understand the preferences and behavior patterns of each customer.
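A minimal sketch of this scenario in PySpark might look as follows. The table name `customer_events` and its columns `customer_id` and `category` are assumptions made up for illustration, as are the helper names. The Spark call is wrapped in a function because it needs a Spark installation configured with a Hive metastore.

```python
def build_behavior_query(table: str) -> str:
    """Return a SQL query counting events per customer and category.

    The column names are hypothetical; adjust them to the real Hive schema.
    """
    return (
        "SELECT customer_id, category, COUNT(*) AS events "
        f"FROM {table} "
        "GROUP BY customer_id, category "
        "ORDER BY customer_id, events DESC"
    )


def run_on_spark(query: str):
    """Run the query through Spark's Hive integration and return a DataFrame.

    Requires pyspark and a reachable Hive metastore, so the import is done
    lazily and nothing here executes until the function is called.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("customer-behavior")
        .enableHiveSupport()  # lets Spark read tables registered in Hive
        .getOrCreate()
    )
    return spark.sql(query)
```

Calling `run_on_spark(build_behavior_query("customer_events"))` on a suitably configured cluster would return a DataFrame that Spark can process further, for example by feeding it into a recommendation model.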

Real-Life Examples of Spark and Hadoop Duets

Real-life examples of using Hadoop and Spark together are not rare in big data consulting practice, and many well-known companies have adopted this approach. Their solutions differ in complexity, which is understandable, as these companies are solving different business tasks. Still, one thing unites them: their big data technology stack includes both Hadoop and Spark. Let’s look at two examples.

  1. TripAdvisor uses Hadoop and Spark together to deliver a seamless customer experience. They introduced auto-tagging, which is based on the analysis of visitors’ reviews and tags. This feature allows TripAdvisor to predict whether a visitor’s impression of a particular location will be the same as that of the other visitors. Another interesting feature is improved photo selection. Now, a website visitor can get a more precise picture of any location thanks to a better choice of visuals. For instance, if a hotel has a pool, machine learning algorithms will pick the photo of the pool and show it to the visitor. 

  2. Uber is doing a great job of managing its big data to improve its service. The company knows the typical behavior of each customer (starting and destination points, usual day and time of their journeys, etc.). It also uses real-time traffic conditions to adjust the number of drivers needed at a particular time and in a particular location. To make this possible, Uber loads raw data from HDFS into Hive for SQL-powered analysis and uses Spark to process millions of events.

Conclusion

Now you can see that Hadoop and Spark can work together smoothly. The real-life examples above show that the synergy of these big data frameworks is possible not only in theory but also in practice. When making a choice, just remember one of the maxims of big data: your big data technology stack should suit your business goals.

hadoop Big data Apache Spark Apache Hive Database Open source

Opinions expressed by DZone contributors are their own.
