
How Does Spark Use MapReduce?


Apache Spark does use MapReduce — but only the idea of it, not the exact implementation. Confused? Let's talk about an example.


In this article, we will talk about an interesting question: does Spark use MapReduce or not? The answer is yes, but only the idea of it, not the exact implementation. Let's walk through an example. To read a text file in Spark, all we do is:

spark.sparkContext.textFile("fileName")
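For a fully self-contained version of that one-liner, here is a minimal sketch; the application name and the local master are illustrative, and "fileName" is the same placeholder path as above:

import org.apache.spark.sql.SparkSession

// Build a local SparkSession (the app name and master are illustrative)
val spark = SparkSession.builder()
  .appName("TextFileExample")
  .master("local[*]")
  .getOrCreate()

// textFile returns an RDD[String] with one element per line
val lines = spark.sparkContext.textFile("fileName")
println(lines.count())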

But do you know how it actually works? Try Ctrl+clicking on the textFile method. You will find this code:

/**
 * Read a text file from HDFS, a local file system (available on all nodes), or any
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

As you can see, it is calling the hadoopFile method of the Hadoop API with five parameters: the file path, the input format to use (TextInputFormat), the key class (LongWritable), the value class (Text), and the minimum number of partitions.

So, it doesn't matter whether you are reading a text file from the local file system, S3, or HDFS; Spark will always use the Hadoop API to read it.

hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions).map(pair => pair._2.toString).setName(path)

Can you tell what this code is doing? The first parameter is the path of the file, and the second is the input format that should be used to read it. The third and fourth parameters are the key and value types the record reader emits for each record: the key (LongWritable) is the byte offset at which a line starts, and the value (Text) is the line itself.
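To convince yourself that textFile is only a thin wrapper, you can make the equivalent hadoopFile call yourself. A minimal sketch, assuming a live SparkContext named sc (as in the spark-shell) and the same placeholder file name:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Equivalent to sc.textFile("fileName"): read (offset, line) pairs
// through the Hadoop API, then keep only the line text
val lines = sc.hadoopFile("fileName", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
  .map(pair => pair._2.toString)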

Now, you might be thinking, "Why don't we get this offset back when we read a file?" The reason is the last part of the call:

.map(pair => pair._2.toString)

This maps over all of the key-value pairs but keeps only the values, so the offsets are discarded.
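And if you ever do want those offsets, nothing stops you from keeping the keys too. A small sketch, again assuming a SparkContext named sc:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Keep both the byte offset (key) and the line itself (value)
val withOffsets = sc.hadoopFile("fileName", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
  .map(pair => (pair._1.get(), pair._2.toString))

// Print the first few (offset, line) pairs
withOffsets.take(3).foreach { case (offset, line) =>
  println(s"$offset: $line")
}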

