
Hadoop Revisited, Part IV: Statefulness vs. Statelessness and ToolRunner


Let's revisit a couple of fundamental concepts in Hadoop: ToolRunner and the concepts of statefulness versus statelessness.


In Part III of this series, we coded a Mapper and a Reducer for the Hadoop platform. We saw that the Mapper transforms each input line into key-value pairs, that the framework shuffles those pairs so that all values for a key reach the same Reducer, and that the Reducer produces the final answer for our word count job. In this post, we are going to revisit two key concepts: ToolRunner and statefulness vs. statelessness.

ToolRunner

ToolRunner standardizes how command-line arguments are passed to Hadoop jobs, so you don't need to parse them yourself. Every job launched through it accepts the same pattern:

hadoop jar somejar.jar YourMain -D mapred.reduce.tasks=2 \
  someinputdir someoutputdir

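When we use ToolRunner, the generic options (such as -D mapred.reduce.tasks=2 above) are parsed for us and applied to the job's Configuration, and our code receives only the remaining arguments: here, the input and output directories.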
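
To make the split between generic options and job arguments concrete, here is a minimal plain-Java sketch of the idea. It is not Hadoop's actual parser (ToolRunner delegates to GenericOptionsParser, which understands more than -D); the class GenericOptionsSketch and its parse method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the argument handling that ToolRunner performs.
// It only understands "-D key=value" and is NOT Hadoop's real parser.
public class GenericOptionsSketch {

    // Copies every "-D key=value" pair into conf and returns the remaining arguments.
    public static String[] parse(String[] args, Map<String, String> conf) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if ("-D".equals(args[i]) && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                conf.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining.toArray(new String[0]);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        String[] jobArgs = parse(new String[] {
            "-D", "mapred.reduce.tasks=2", "someinputdir", "someoutputdir"
        }, conf);
        System.out.println(conf);                      // {mapred.reduce.tasks=2}
        System.out.println(String.join(" ", jobArgs)); // someinputdir someoutputdir
    }
}
```

In a real job, the configuration side of this split is what getConf() returns inside run, and the leftover array is what run receives as args.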

Statefulness vs. Statelessness

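MapReduce is built around inherently stateless, pure functions. That is a nice property, and you should stick to it as much as possible. However, if you need state that persists across calls to map or reduce within a task, you can initialize it in the standard protected void setup(Context context) method, which is called once before the task processes its first record.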

Likewise, there is a cleanup method, which is called once at the end of each task, after its last record has been processed (not after the whole MapReduce job finishes). If you keep state, cleanup is commonly where you write that state out to disk.
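
The lifecycle described above can be sketched in plain Java without any Hadoop dependency. This is a simulation under stated assumptions: StatefulCounter and runTask are hypothetical names, and a real implementation would extend org.apache.hadoop.mapreduce.Mapper and emit its results through the Context passed to cleanup rather than returning a map.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of one task's setup/map/cleanup lifecycle.
// StatefulCounter and runTask are hypothetical; this is not the Hadoop API.
public class StatefulCounter {
    private Map<String, Integer> counts; // state kept across map() calls

    // Called once, before the first record.
    void setup() {
        counts = new TreeMap<>();
    }

    // Called once per input record; only updates the in-memory state.
    void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Called once, after the last record; hands the state over and drops it.
    Map<String, Integer> cleanup() {
        Map<String, Integer> result = counts;
        counts = null;
        return result;
    }

    // Drives the lifecycle the way the framework would for a single task.
    public static Map<String, Integer> runTask(String... lines) {
        StatefulCounter task = new StatefulCounter();
        task.setup();
        for (String line : lines) {
            task.map(line);
        }
        return task.cleanup();
    }

    public static void main(String[] args) {
        System.out.println(runTask("to be or", "not to be")); // {be=2, not=1, or=1, to=2}
    }
}
```

Keeping the counts in memory and emitting them once at the end trades statelessness for fewer intermediate records, which is the kind of performance optimization that usually motivates state in the first place.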

Note: In order to use ToolRunner, your job's main class should extend Configured and implement Tool. Then, instead of putting the job logic in the main method, you implement the run method declared by the Tool interface.

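import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

public class YourJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // args holds only what is left after the generic options were consumed.
        // Build and submit the job here using getConf(); return 0 on success.
        return 0;
    }

    // The rest of the class (Mapper, Reducer, main) goes here.
}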

You still need a main method, because something has to hand the job to ToolRunner, which then calls run for you.

public static void main(String[] args) throws Exception {
  // Pass an instance of the job class that implements Tool (YourJob above).
  int exitCode = ToolRunner.run(new Configuration(), new YourJob(), args);
  System.exit(exitCode);
}

Summary

In this post, we have explained that ToolRunner helps us deal with command-line arguments. In addition, if your Hadoop job keeps state, you need to make sure that state is written out or released when the task ends; that's why the cleanup method exists. Staying stateless is usually the better choice because it is simpler, but for performance optimization you may find that you need state. In these cases, Hadoop provides the setup and cleanup callbacks, which you implement and the framework calls, to manage this state.


