Hadoop Revisited, Part II: 10 Key Concepts of Hadoop MapReduce

Learn the main building blocks and components that compose Hadoop MapReduce jobs, as well as the different text objects we use to write outputs.

The Big Data world has seen many new projects over the years, such as Apache Storm, Kafka, Samza, and Spark. However, the basics remain largely the same: we have data files residing on disk (or, for speed, in memory) and we query them. Hadoop is the basic platform that hosts these files and gives us simple concepts for running MapReduce over them. In this post, we shall revisit the ground workings of MapReduce, as it's always a good idea to have the basics right!

Following are the ten key concepts of Hadoop MapReduce. 

1. FileInputFormat Splits Your Data

This class takes a file and, if it spans multiple blocks, splits it across multiple mappers. This is important so that we can divide the work and data between multiple workers. If our workers are mappers, each should get its own slice of the data.

2. Main Is Neither Mapper Nor Reducer

You should have a main method, which is neither a mapper nor a reducer, that receives the command-line parameters. There, FileInputFormat splits the input file (which usually spans multiple blocks) across multiple mappers, using the input path you provide. Similarly, we have FileOutputFormat for the output path.
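
To make this concrete, here is a minimal driver sketch. It assumes the Hadoop 2.x "new" API and hypothetical MyMapper and MyReducer classes (sketched under concepts 6 and 10 below); it is an illustration, not the only way to wire a job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // main() is neither a mapper nor a reducer; it only wires the job together
        // from the command-line parameters.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(MyMapper.class);    // hypothetical mapper class
        job.setReducerClass(MyReducer.class);  // hypothetical reducer class

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input: split across mappers
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output: must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```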

3. Use Text.class When Using Strings

When referring to Strings (for example, in the job output), you use Hadoop's Text class: job.setOutputKeyClass(Text.class).
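
A small sketch continuing the hypothetical driver above; the map-output setters are only needed when the map output types differ from the final (reduce) output types.

```java
// Final (reducer) output types: Strings are declared as Text, ints as IntWritable.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// If the map output types differ from the final output types, declare them separately.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
```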

4. Use * to Scan Multiple Files

You can give the input path one or more * wildcards to match all files in a directory (or all directories), or simply use the Job API to add multiple paths.
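
For example, continuing the driver sketch (the paths here are purely hypothetical):

```java
// A glob matches every file under the matching sub-directories.
FileInputFormat.addInputPath(job, new Path("/logs/2017/*/*"));

// Or add several paths explicitly as a comma-separated list.
FileInputFormat.addInputPaths(job, "/data/part1,/data/part2");
```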

5. Access Your Job From Main

From your main, you have access to the Job object, which configures, submits, and monitors the job.
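
A sketch of what you might do with that Job object from main. The reduce-task count of 2 is just an example, and the counter shown is the standard TaskCounter from org.apache.hadoop.mapreduce.

```java
job.setNumReduceTasks(2);                  // tune reduce-side parallelism
boolean ok = job.waitForCompletion(true);  // submit and block, printing progress

// After completion, the same Job object exposes counters and status.
long mapRecords = job.getCounters()
        .findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
System.out.println("Map input records: " + mapRecords + ", success: " + ok);
```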

6. Offset Is Given to Mapper as ID

With the default text input, the key handed to the mapper is the record's byte offset within the file, acting as an ID; the actual content of the record arrives as the value.
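
A sketch of the mapper's type signature under the default text input (the class name is hypothetical):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: with the default TextInputFormat, KEYIN is
// the byte offset of the line within the file (a LongWritable) and VALUEIN is the line.
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // the map() method itself is sketched under concept 7 below
}
```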

7. Write Results With Context.write

In the mapper, you use context.write to emit results (the context is passed as a parameter to the map method).
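
For example, a map() body for the hypothetical MyMapper sketched above, emitting (word, 1) pairs; java.io.IOException is the only extra import it needs.

```java
@Override
protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
    // 'offset' is the byte offset of this line (concept 6); results go out via context.write.
    for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
            context.write(new Text(token), new IntWritable(1)); // emit a (word, 1) pair
        }
    }
}
```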

8. Use new Text for Output

If you write a String as an output from the mapper, you wrap it with new Text(somestr), as in the map() sketch above.

9. Mapper Is a Pure Function

The mapper is a pure function: it receives an input key and value and emits a key and value (or multiple keys and values). Because it should stay as pure as possible, it is not meant to keep state, meaning it will not combine results internally while mapping (see combiners if you need that).
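
If you do need map-side aggregation, register a combiner on the job instead of keeping state in the mapper. A one-line sketch, reusing the hypothetical reducer as the combiner (this works for word count because summing is associative and commutative):

```java
// Runs the reducer's logic locally on each mapper's output before the shuffle.
job.setCombinerClass(MyReducer.class);
```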

10. The Format Reducers Receive

Reducers receive a key mapped to all of its values. The reduce method is called multiple times, once per key, and the keys arrive in sorted order.
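
A reducer sketch matching the hypothetical mapper above: reduce() is called once per key, with all of that key's values grouped together, and the keys arrive sorted.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {  // all values for this key, grouped by the framework
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // one output record per key
    }
}
```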

Summary

In this tutorial, we reviewed the main building blocks and components that compose Hadoop MapReduce jobs. We have seen the input to mappers and reducers, using * for multiple input files, and the different text objects we use to write outputs.
