The Big Data world has seen many new projects over the years, such as Apache Storm, Kafka, Samza, and Spark. However, the basics remain largely the same: we have data files residing on disk (or, for speed, in memory) and we query them. Hadoop is the basic platform that hosts these files and gives us simple concepts for running MapReduce over them. In this post, we revisit the inner workings of MapReduce, as it's always a good idea to have the basics right!
The following are ten key concepts of Hadoop MapReduce.
1. FileInputFormat Splits Your Data
This class takes a file and, if it spans multiple blocks, splits it across multiple mappers. This is important because it lets us divide the work and the data between multiple workers: each mapper gets its own split of the data.
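To make the idea concrete, here is a minimal plain-Java sketch (not the Hadoop API) of how FileInputFormat-style splitting cuts a file into roughly block-sized splits, each handed to its own mapper:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of FileInputFormat-style splitting: a file is cut into
// splits of roughly one HDFS block each, and each split's start offset marks
// the chunk one mapper will process. Plain Java for illustration only.
public class SplitSketch {

    // Returns the start offset of each split for a file of the given size.
    public static List<Long> splitOffsets(long fileSize, long blockSize) {
        List<Long> offsets = new ArrayList<>();
        for (long start = 0; start < fileSize; start += blockSize) {
            offsets.add(start);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // A 250-byte file with 100-byte blocks yields three splits -> three mappers.
        System.out.println(splitOffsets(250, 100)); // [0, 100, 200]
    }
}
```

The real Hadoop logic also adjusts split boundaries to record boundaries, but the offsets-per-block idea is the core of it.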
2. Main Is Neither Mapper Nor Reducer
You should have a main method that is neither the mapper nor the reducer; its job is to receive the command-line parameters and configure the job. FileInputFormat splits the file (which usually spans multiple blocks) across multiple mappers and uses the InputPath. Similarly, we have an OutputPath for the results.
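A minimal sketch of such a driver main, in the shape of the classic WordCount example and assuming the standard org.apache.hadoop.mapreduce API; WordCountDriver, TokenizerMapper, and IntSumReducer are illustrative class names, not code you already have:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: wires the job together but contains no map/reduce logic.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);   // illustrative mapper class
        job.setReducerClass(IntSumReducer.class);    // illustrative reducer class
        job.setOutputKeyClass(Text.class);           // Text.class for String-like keys
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from CLI
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path from CLI
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note how the command-line arguments flow straight into the input and output paths: that is the main method's whole responsibility.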
3. Use Text.class When Using Strings
When referring to Strings, for example when setting the job's output key or value types, you refer to Text.class rather than String.class.
4. Use * to Scan Multiple Files
You can give * to match all files in a directory (or all directories), or just use the Job API to add multiple input paths.
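Hadoop expands such globs itself; the pattern semantics are the familiar ones, which the following self-contained sketch demonstrates with the JDK's own PathMatcher (illustrative paths):

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Demonstrates the glob semantics behind passing a path like "logs/*" as an
// input path: '*' matches every entry in a directory but does not cross
// directory boundaries. Uses the JDK's PathMatcher purely for illustration.
public class GlobSketch {

    public static boolean matches(String glob, String path) {
        PathMatcher matcher =
                FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        System.out.println(matches("logs/*", "logs/part-00000"));    // true
        System.out.println(matches("logs/*", "archive/part-00000")); // false
    }
}
```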
5. Access Your Job From Main
You have access to the Job object, which configures and controls the job, from your main method.
6. Offset Is Given to Mapper as ID
The key object passed to the mapper is the byte offset of the record within the input file; the value holds the actual line of text.
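A small self-contained sketch (plain Java, not the Hadoop reader) of the offset-keyed pairs a TextInputFormat-style reader produces from a file's content:

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the key/value pairs a TextInputFormat-style reader produces:
// the key is the byte offset of each line, the value is the line itself.
public class OffsetSketch {

    public static Map<Long, String> lineOffsets(String content) {
        Map<Long, String> pairs = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            pairs.put(offset, line);
            // Advance past the line's bytes plus the '\n' separator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(lineOffsets("hello\nworld\n")); // {0=hello, 6=world}
    }
}
```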
7. Write Results With Context.write
Inside the mapper, you use context.write to write a result (context is a parameter passed to the map method).
8. Use new Text for Output
If you write a string as an output from the mapper, you wrap it with new Text(...), since Hadoop expects Writable types rather than plain Strings.
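Both points show up in a WordCount-style mapper, sketched here assuming the standard org.apache.hadoop.mapreduce API (the class name is illustrative): results go out through context.write, and each string is wrapped as a Text.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative WordCount-style mapper: emits (word, 1) for every word in a line.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key: byte offset of this line in the file; value: the line itself.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            // Strings are wrapped in new Text(...) before being written out.
            context.write(new Text(tokens.nextToken()), ONE);
        }
    }
}
```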
9. Mapper Is a Pure Function
The mapper is a pure function: it gets an input key and value and emits one key/value pair, or multiple pairs. Because it is meant to stay as pure as possible, it is not intended to hold state, meaning it will not combine results internally while mapping (see combiners if you need that).
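The statelessness is worth seeing concretely. In this plain-Java sketch (not the Hadoop API), the mapper emits ("the", 1) twice rather than merging it into ("the", 2); merging map-side output is the combiner's job, not the mapper's:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

// A mapper as a pure function: the same input always yields the same emitted
// pairs, and duplicates are NOT merged by the mapper itself -- "the" below is
// emitted twice as ("the", 1), never once as ("the", 2).
public class PureMapper {

    // Mirrors the mapper's (offset, line) input; emits (word, 1) pairs.
    public static List<Entry<String, Integer>> map(long offset, String line) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0, "the cat the hat"));
        // [the=1, cat=1, the=1, hat=1]
    }
}
```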
10. The Format Reducers Receive
Reducers receive a key together with all the values emitted for that key. The reducer gets called multiple times, once per key, with the keys arriving in sorted order.
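This grouping-and-sorting step between map and reduce (the shuffle) can be sketched in plain Java (not the Hadoop framework itself):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of what the framework does between map and reduce: values are
// grouped per key, keys reach the reducer in sorted order, and reduce is
// called once per key with all of that key's values.
public class ReduceSketch {

    // Group (key, value) pairs into key -> list of values, keys sorted.
    public static Map<String, List<Integer>> shuffle(
            List<? extends Map.Entry<String, Integer>> mapped) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // sorted keys
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(e.getValue());
        }
        return grouped;
    }

    // The "reducer": called once per key with all values for that key.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        var mapped = List.of(
                new AbstractMap.SimpleEntry<>("the", 1),
                new AbstractMap.SimpleEntry<>("cat", 1),
                new AbstractMap.SimpleEntry<>("the", 1));
        Map<String, List<Integer>> grouped = shuffle(mapped);
        System.out.println(grouped); // {cat=[1], the=[1, 1]}
        grouped.forEach((k, v) -> System.out.println(k + " -> " + reduce(k, v)));
    }
}
```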
In this tutorial, we scanned the main building blocks that compose Hadoop MapReduce jobs: the input to mappers and reducers, using * for multiple files, and the Text objects we use to write outputs.