
How Does Spark Use MapReduce?

Apache Spark does use MapReduce — but only the idea of it, not the exact implementation. Confused? Let's talk about an example.

By Anubhav Tarar · Jan. 04, 18 · Opinion


In this article, we will look at an interesting question: does Spark use MapReduce or not? The answer is yes, but only the idea of it, not the exact implementation. Let's walk through an example. To read a text file in Spark, all we do is:

spark.sparkContext.textFile("fileName")
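
For context, here is a minimal, self-contained sketch of the kind of program that line lives in. The SparkSession setup and the word-count step are illustrative additions, not from the original article, but the word count shows the MapReduce idea surviving in Spark's API: map each record to a key-value pair, then reduce by key.

import org.apache.spark.sql.SparkSession

object TextFileExample {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; on a cluster, the master would differ.
    val spark = SparkSession.builder()
      .appName("TextFileExample")
      .master("local[*]")
      .getOrCreate()

    // textFile returns an RDD[String] with one element per line of the file.
    val lines = spark.sparkContext.textFile("fileName")

    // The MapReduce idea in Spark's API: map each word to (word, 1),
    // then reduce by key to sum the counts.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}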

But do you know how it actually works? Try Ctrl+clicking on the textFile method. You will find this code:

/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

As you can see, textFile delegates to the hadoopFile method of the Hadoop API, passing the file path, the input format to use (TextInputFormat), the key class (LongWritable), the value class (Text), and the minimum number of partitions.

So it doesn't matter whether you are reading a text file from the local file system, from S3, or from HDFS: Spark will always use the Hadoop API to read it.
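
To make that concrete, the same textFile call works for any Hadoop-supported URI scheme. The paths below are hypothetical placeholders (and the s3a:// one assumes the hadoop-aws connector is on the classpath):

// All of these go through the same hadoopFile code path shown above.
val fromLocal = spark.sparkContext.textFile("file:///tmp/data.txt")
val fromHdfs  = spark.sparkContext.textFile("hdfs://namenode:8020/data/data.txt")
val fromS3    = spark.sparkContext.textFile("s3a://my-bucket/data.txt")

With that in mind, look again at the hadoopFile call: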

hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions).map(pair => pair._2.toString).setName(path)

Can you see what this code is doing? The first parameter is the path of the file, the second is the input format that should be used to read it, and the third and fourth are the key and value classes the record reader produces: the key (LongWritable) is the byte offset of each line within the file, and the value (Text) is the line itself.

Now you might be thinking: why don't we get this offset back when we read a file? The reason is the last part of that call:

.map(pair => pair._2.toString)

This maps over all of the key-value pairs but keeps only the values (the lines), discarding the keys (the offsets).
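
If you actually wanted those offsets, you could call hadoopFile yourself, exactly as textFile does internally, and keep both halves of the pair. This is a sketch of that idea, not from the original article; note the immediate conversion to plain Scala types, since Hadoop record readers reuse the same Writable objects across records:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// The same call textFile makes internally, but keeping the key (the byte offset).
val withOffsets = spark.sparkContext
  .hadoopFile("fileName", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (offset, line) => (offset.get, line.toString) }

// Each element is (byte offset of the line within the file, line contents).
withOffsets.take(5).foreach(println)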


Published at DZone with permission of Anubhav Tarar, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
