Hadoop Revisited, Part IV: Statefulness vs. Statelessness and ToolRunner

Let's revisit a couple of fundamental concepts in Hadoop: ToolRunner and statefulness versus statelessness.

by Tomer Ben David · Apr. 19, 17 · Tutorial

In Part III of this series, we coded a Mapper and a Reducer for the Hadoop platform. We saw that the Mappers transformed lines from one notation to another, that the results then went through a shuffling process, and that the Reducers produced the final answer for our word count job. In this post, we are going to revisit two key concepts: ToolRunner and statefulness vs. statelessness.

ToolRunner

ToolRunner standardizes how command-line arguments are passed to Hadoop jobs, so you don't need to parse them yourself. Every job invocation then follows the same pattern:

hadoop jar somejar.jar YourMain -D mapred.reduce.tasks=2 \
  someinputdir someoutputdir

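When we use ToolRunner, generic options such as -D are parsed for us and applied to the job's Configuration. Only the remaining arguments (here, the input and output directories) are passed on to our own code.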
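To make the split between generic options and job arguments concrete, here is a minimal plain-Java sketch. It is not Hadoop's parser: GenericOptionsSketch is a hypothetical stand-in that only understands "-D key=value", whereas the real ToolRunner handles more options and stores the results in a Configuration object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT Hadoop code: it only illustrates how "-D key=value"
// pairs can be separated from the arguments that belong to the job itself.
final class GenericOptionsSketch {
    // Properties collected from "-D key=value" pairs, in the order they appeared.
    final Map<String, String> properties = new LinkedHashMap<>();
    // Everything else, i.e. what the job's run method would receive.
    final List<String> remainingArgs = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // Consume the next argument and split it at the first '='.
                String[] keyValue = args[++i].split("=", 2);
                properties.put(keyValue[0], keyValue.length > 1 ? keyValue[1] : "");
            } else {
                remainingArgs.add(args[i]);
            }
        }
    }
}
```

Given the arguments from the command above (-D mapred.reduce.tasks=2 someinputdir someoutputdir), this sketch ends up with one property, mapred.reduce.tasks set to 2, and two remaining arguments, someinputdir and someoutputdir.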

Statefulness vs. Statelessness

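MapReduce is built around inherently stateless, pure functions. That is a nice property, and you should stick to it as much as possible. However, if a task needs state, you initialize it in the standard setup method, protected void setup(Context context), which the framework calls once at the start of each task, before any records are processed.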

Likewise, there is a cleanup method, which is called once at the end of each task, after the last record has been processed. If you keep state, cleanup is commonly where you write that state out, for example to disk.
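The setup and cleanup lifecycle can be sketched in plain Java. This is an illustration under stated assumptions, not Hadoop code: StatefulWordCountSketch and runTask are hypothetical names, the class does not extend Hadoop's Mapper, and an ordinary Map stands in for the Context that a real task would write to.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in that mirrors the order setup -> map (per record) -> cleanup.
final class StatefulWordCountSketch {
    // State that survives across map() calls within one task.
    private Map<String, Integer> counts;

    // Called once before any record: create the state.
    void setup() {
        counts = new HashMap<>();
    }

    // Called once per input line: update the state instead of emitting right away.
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Called once after the last record: write the state out, then release it.
    void cleanup(Map<String, Integer> output) {
        output.putAll(counts);
        counts = null;
    }

    // Drives the lifecycle the way a framework would for a single task.
    static Map<String, Integer> runTask(List<String> lines) {
        StatefulWordCountSketch task = new StatefulWordCountSketch();
        Map<String, Integer> output = new TreeMap<>();
        task.setup();
        for (String line : lines) {
            task.map(line);
        }
        task.cleanup(output);
        return output;
    }
}
```

For the two input lines "a b a" and "b c", runTask returns a=2, b=2, c=1. The trade-off is the one this section describes: the task emits less data, but it now owns state that has to be created in setup and written out in cleanup.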

Note: In order to use ToolRunner, your job's main class should extend Configured and implement Tool. Then, instead of putting the job logic in the main method, you override the run method declared by the Tool interface.

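import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

public class YourJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // args holds only what is left after ToolRunner has parsed
        // generic options such as -D; getConf() returns the parsed configuration.
        // Configure and submit the job here, and return 0 on success.
        return 0;
    }
    // ...
}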

You still need a main method, because it is the entry point that hands your Tool to ToolRunner.run, which parses the generic options and then calls your run method.

public static void main(String[] args) throws Exception {
  // Pass an instance of the class that implements Tool (YourJob in this post).
  int exitCode = ToolRunner.run(new Configuration(), new YourJob(), args);
  System.exit(exitCode);
}

Summary

In this post, we saw that ToolRunner helps us deal with command-line arguments. In addition, if your Hadoop job keeps state, you need to make sure that state is cleaned up in case it is reused; that is why the cleanup method exists. Staying stateless is generally the better choice because it is simpler, but for performance optimization you may find yourself needing state. In those cases, Hadoop provides the setup and cleanup callbacks, which you implement and the framework calls, so that you can manage this state.
