Hadoop Revisited, Part II: 10 Key Concepts of Hadoop MapReduce
Learn the main building blocks and components that compose Hadoop MapReduce jobs, and the different text objects we use to write their outputs.
The Big Data world has seen many new projects over the years, such as Apache Storm, Kafka, Samza, and Spark. However, the basics remain largely the same: we have data files residing on disk (or, for speed, in memory) and we query them. Hadoop is the base platform that hosts these files and gives us simple concepts for running MapReduce over them. In this post, we shall revisit the inner workings of MapReduce, as it's always a good idea to have the basics right!
Following are the ten key concepts of Hadoop MapReduce.
1. FileInputFormat Splits Your Data
This class takes an input file and, if it spans multiple blocks, splits it across multiple mappers. This matters because it is how we divide work and data between multiple workers: if our workers are mappers, each should get its own split to process.
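As a minimal sketch, this is how a driver points FileInputFormat at the input and, optionally, bounds the split size; the path and sizes here are illustrative, not from the article. By default, one HDFS block becomes one split, and each split is handed to its own mapper:

```java
// Inside a driver's main (a fuller sketch appears under concept 2).
Job job = Job.getInstance(new Configuration(), "split-demo");
FileInputFormat.addInputPath(job, new Path("/data/input")); // illustrative path

// Optional: bound split sizes in bytes (values here are illustrative).
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
```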
2. Main Is Neither Mapper Nor Reducer
You should have a main that is neither a mapper nor a reducer; it receives the command-line parameters and wires the job together. FileInputFormat splits the input file (which usually spans multiple blocks) between multiple mappers, reading from the input path. Similarly, FileOutputFormat handles the output path.
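Here is a minimal driver sketch, assuming a word-count-style job; the class names (WordCountDriver, TokenizerMapper, IntSumReducer) and the argument layout are illustrative, not from the original article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // main only wires the job together; it is neither mapper nor reducer.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(TokenizerMapper.class); // defined under concept 8
    job.setReducerClass(IntSumReducer.class);  // defined under concept 10

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output paths come from the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```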
3. Use Text.class When Using Strings
When referring to Strings, for example in your output, you refer to job.setOutputKeyClass(Text.class); Hadoop's Text is the Writable equivalent of a Java String.
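A short sketch (the IntWritable value class is an assumption for a count-style job):

```java
// Text stands in for String; IntWritable for int.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// If the map output types differ from the final output types, set them separately:
// job.setMapOutputKeyClass(...); job.setMapOutputValueClass(...);
```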
4. Use * to Scan Multiple Files
You can give the inputPath a glob with * to match all files in a directory (or all directories), or just use the FileInputFormat API to add multiple paths.
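For example (the paths are illustrative):

```java
// One glob can match many files and directories at once...
FileInputFormat.addInputPath(job, new Path("/logs/*/part-*"));

// ...or add several paths explicitly, comma-separated.
FileInputFormat.addInputPaths(job, "/logs/2015,/logs/2016");
```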
5. Access Your Job From Main
You have access to the Job object, which controls the whole job, from your main.
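For instance, the same Job handle that you configure also submits the job and reports its outcome (a sketch continuing the driver above):

```java
// Block until the job finishes; 'true' prints progress to the console.
if (!job.waitForCompletion(true)) {
  System.err.println("Job " + job.getJobName() + " failed");
  System.exit(1);
}
```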
6. Offset Is Given to Mapper as ID
The key object handed to the mapper is the byte offset of the record within the input file; with the default TextInputFormat, the line itself arrives as the value (see the mapper sketch under concept 8).
7. Write Results With Context.write
In the mapper, you use context.write to write the result (context is a parameter passed to the map method).
8. Use new Text for Output
If you write a string as an output from the mapper, you write it with new Text(somestr), since outputs must be Writable types.
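Putting concepts 6, 7, and 8 together, here is a minimal word-count-style mapper sketch (TokenizerMapper is an illustrative name, matching the driver above):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // 'key' is the byte offset of this line in the file (concept 6);
    // 'value' is the line itself.
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      // Strings go out wrapped in Text (concept 8), via context.write (concept 7).
      context.write(new Text(tokens.nextToken()), ONE);
    }
  }
}
```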
9. Mapper Is a Pure Function
The mapper is a pure function: it gets an input key and value and emits a key and value (or multiple keys and values). Because it should stay as pure as possible, it is not intended to hold state, meaning it will not combine results internally while mapping (see combiners if you need that).
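If you do want map-side pre-aggregation, register a combiner; a reducer with matching types often doubles as one, as in this sketch:

```java
// Runs the reduce logic on each mapper's local output before the shuffle.
// Only safe when the operation is commutative and associative (e.g., summing counts).
job.setCombinerClass(IntSumReducer.class);
```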
10. The Format Reducers Receive
Reducers receive a key -> values mapping: the reducer gets called multiple times, once per key, with the keys arriving in sorted order.
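To complete the picture, here is a matching reducer sketch (IntSumReducer, as referenced in the driver above):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // One call per key, keys in sorted order; all of the key's values arrive together.
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```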
Summary
In this tutorial, we scanned the main building blocks and components that compose Hadoop MapReduce jobs. We have seen the input to mappers and reducers, using * to read multiple files, and the different text objects that we use to write outputs.