Hadoop Revisited, Part II: 10 Key Concepts of Hadoop MapReduce

Learn the main building blocks and components that compose Hadoop MapReduce jobs, as well as the different text objects used to write outputs.

By Tomer Ben David · Apr. 17, 2017 · Tutorial

The Big Data world has seen many new projects over the years, such as Apache Storm, Kafka, Samza, and Spark. However, the basics remain the same: we have data files residing on disk (or, when speed matters, in memory) and we query them. Hadoop is the basic platform that hosts these files and gives us simple concepts for running MapReduce over them. In this post, we revisit the inner workings of MapReduce, as it's always a good idea to have the basics right!

Following are the ten key concepts of Hadoop MapReduce. 

1. FileInputFormat Splits Your Data

This class takes a file and, if it spans multiple blocks, splits it across multiple mappers. This matters because it lets us divide work and data between multiple workers: if our workers are mappers, each gets its own slice of the input.
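
To make the splitting idea concrete, here is a minimal, self-contained sketch (plain Java, no Hadoop dependency; the block size and file length are made-up numbers) of how an input file's byte range is carved into splits, one split per block by default:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Carve [0, fileLength) into chunks of at most blockSize bytes,
    // the way FileInputFormat assigns one split per block by default.
    static List<long[]> computeSplits(long fileLength, long blockSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += blockSize) {
            long length = Math.min(blockSize, fileLength - start);
            splits.add(new long[] {start, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300 MB file with a 128 MB block size yields 3 splits (3 mappers).
        List<long[]> splits = computeSplits(300L << 20, 128L << 20);
        System.out.println(splits.size());    // 3
        System.out.println(splits.get(2)[1]); // 46137344 (the last split is only ~44 MB)
    }
}
```

Each of those three splits would be handed to its own mapper, which is exactly the "each worker gets its own data" behavior described above.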

2. Main Is Neither Mapper Nor Reducer

You should have a main method, separate from the mapper and the reducer, that receives the command-line parameters. FileInputFormat splits the input file (which usually spans multiple blocks) across multiple mappers and uses the input path. Similarly, FileOutputFormat handles the output path.

3. Use Text.class When Using Strings

When your keys or values are Strings, use Hadoop's Text wrapper. For example, to declare String output keys, you call job.setOutputKeyClass(Text.class).
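
Putting the driver pieces together, here is a sketch of the classic WordCount driver against the Hadoop 2.x `org.apache.hadoop.mapreduce` API (the `TokenizerMapper` and `IntSumReducer` class names stand in for your own mapper and reducer classes):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // main is neither a mapper nor a reducer: it only wires the job together.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);   // your Mapper subclass
        job.setReducerClass(IntSumReducer.class);    // your Reducer subclass
        job.setOutputKeyClass(Text.class);           // Text.class for String keys
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```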

4. Use * to Scan Multiple Files

You can put wildcards (*) in the input path to match all files in a directory, or all directories, or simply use the Job API to add multiple paths.
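
For instance (a sketch against the Hadoop API; the paths here are hypothetical):

```java
// Match every part file under every 2017 directory via a glob pattern...
FileInputFormat.addInputPath(job, new Path("/data/2017-*/part-*"));
// ...and/or register additional distinct paths through repeated calls.
FileInputFormat.addInputPath(job, new Path("/data/extra"));
```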

5. Access Your Job From Main

From your main method, you have access to the Job object, which configures and controls the whole job.

6. Offset Is Given to the Mapper as the Key

With the default text input format, the key passed to the mapper is the byte offset of the line within the file; the value carries the actual line content.
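
A self-contained illustration (plain Java, no Hadoop; the input text is made up) of how those byte offsets are derived for each line of a file:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OffsetSketch {
    // Map each line to the byte offset at which it starts, the way the
    // default text input format hands (offset, line) pairs to the mapper.
    static Map<Long, String> lineOffsets(String file) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : file.split("\n", -1)) {
            records.put(offset, line);
            offset += line.getBytes().length + 1; // +1 for the newline byte
        }
        return records;
    }

    public static void main(String[] args) {
        Map<Long, String> records = lineOffsets("hello world\nhadoop mapreduce");
        // "hello world" is 11 bytes, so the second line starts at offset 12.
        System.out.println(records); // {0=hello world, 12=hadoop mapreduce}
    }
}
```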

7. Write Results With Context.write

In the mapper, you use context.write to emit results (the Context object is passed into the map method as a parameter).

8. Use new Text for Output

If you write a String as output from the mapper, you wrap it with new Text(someStr).
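
Concepts 7 and 8 come together in a word-count mapper, sketched here against the Hadoop 2.x API:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line; value is the line itself.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // Strings go out wrapped in Text; results are emitted via context.write.
            context.write(new Text(itr.nextToken()), ONE);
        }
    }
}
```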

9. Mapper Is a Pure Function

The mapper is a pure function: it receives an input key and value and emits one or more key-value pairs. Because it is meant to be as pure as possible, it should not hold state; in particular, it will not combine results internally while mapping (see combiners if you need that).
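
Statelessness is easy to see if you model the mapper as a plain function (self-contained Java, no Hadoop): the same input line always yields the same pairs, and nothing accumulates between calls.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PureMapper {
    // A word-count mapper as a pure function: a line in, (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .<Map.Entry<String, Integer>>map(word -> new SimpleEntry<>(word, 1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(map("to be or not to be"));
        // Calling it twice with the same input gives the same output: no hidden state,
        // and no combining of the two (to, 1) pairs -- that is the combiner's job.
        System.out.println(map("to be or not to be").equals(map("to be or not to be"))); // true
    }
}
```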

10. The Format Reducers Receive

Reducers receive a key mapped to a list of values, and the reduce method gets called multiple times, with keys arriving in sorted order.
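
A self-contained sketch of that shuffle behavior (plain Java, no Hadoop): pairs emitted by mappers are grouped by key, keys are sorted, and a summing reducer runs once per key over all of that key's values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    // Group (word, count) pairs by key with sorted keys, then reduce each group
    // by summing, mimicking what a reducer sees: reduce(key, [v1, v2, ...]).
    static Map<String, Integer> shuffleAndReduce(List<String[]> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // TreeMap => sorted keys
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }
        Map<String, Integer> reduced = new TreeMap<>();
        grouped.forEach((key, values) ->
                reduced.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return reduced;
    }

    public static void main(String[] args) {
        List<String[]> emitted = Arrays.asList(
                new String[]{"be", "1"}, new String[]{"to", "1"},
                new String[]{"be", "1"}, new String[]{"or", "1"});
        System.out.println(shuffleAndReduce(emitted)); // {be=2, or=1, to=1}
    }
}
```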

Summary

In this tutorial, we covered the main building blocks and components that compose Hadoop MapReduce jobs. We have seen the input to mappers and reducers, using * to scan multiple files, and the different Text objects we use to write outputs.


Opinions expressed by DZone contributors are their own.
