
Hadoop Revisited, Part IV: Statefulness vs. Statelessness and ToolRunner


Let's revisit a couple of fundamental concepts in Hadoop: ToolRunner and the concepts of statefulness versus statelessness.


In Part III of this series, we coded a Mapper and a Reducer for the Hadoop platform. We saw that Mappers transform each input line from one representation to another, that the shuffle phase groups the Mappers' output by key, and that the Reducers then compute the final answer for our word count job. In this post, we are going to revisit two key concepts: ToolRunner and statefulness vs. statelessness.

ToolRunner

ToolRunner standardizes how command-line arguments are passed into Hadoop jobs so that you don't need to parse them yourself. Every ToolRunner-based job accepts the same argument pattern:

hadoop jar somejar.jar YourMain -D mapred.reduce.tasks=2 \
  someinputdir someoutputdir

When we use ToolRunner, generic options such as -D key=value pairs are parsed for us and placed into the job's Configuration, from which we can read them inside the job.
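To make the argument pattern concrete, here is a minimal, self-contained sketch of what this kind of generic-option parsing does conceptually: it separates `-D key=value` pairs (which go into the configuration) from the remaining positional arguments (the input and output directories). This is plain Java for illustration only, not Hadoop's actual parser, and the class name `GenericOptionsSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GenericOptionsSketch {
    // Configuration pairs collected from "-D key=value" arguments.
    static Map<String, String> conf = new HashMap<>();
    // Everything else: the job's positional arguments (input/output dirs).
    static List<String> remaining = new ArrayList<>();

    static void parse(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if ("-D".equals(args[i]) && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                conf.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                remaining.add(args[i]);
            }
        }
    }

    public static void main(String[] args) {
        parse(new String[] {"-D", "mapred.reduce.tasks=2",
                            "someinputdir", "someoutputdir"});
        System.out.println(conf.get("mapred.reduce.tasks")); // prints 2
        System.out.println(remaining); // prints [someinputdir, someoutputdir]
    }
}
```

The point of this split is that tuning options like the number of reduce tasks never have to be handled by your job code; only the remaining arguments do.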

Statefulness vs. Statelessness

MapReduce is built around inherently stateless pure functions, and you should stick to that model as much as possible. However, if you do need per-task state, you can initialize it in the standard public void setup(Context context) method, which Hadoop calls once before the first record is processed.

Likewise, there is a cleanup method, which is called once after the task has processed its last record. If you accumulate state during the task, cleanup is where you typically write that state out.
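The setup/map/cleanup lifecycle can be illustrated with a small self-contained sketch. Note that `StatefulMapperSketch` is a hypothetical plain-Java class, not the real org.apache.hadoop.mapreduce.Mapper: setup() initializes in-memory state once, map() updates it per record, and cleanup() flushes the accumulated state once at the end.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the Mapper lifecycle, not Hadoop code.
public class StatefulMapperSketch {
    private Map<String, Integer> counts;                     // per-task state
    public final Map<String, Integer> emitted = new HashMap<>();

    void setup() {                    // called once, before any map() call
        counts = new HashMap<>();
    }

    void map(String word) {           // called once per input record
        counts.merge(word, 1, Integer::sum);
    }

    void cleanup() {                  // called once, after the last map() call
        emitted.putAll(counts);       // flush state (in Hadoop: context.write)
    }

    public static void main(String[] args) {
        StatefulMapperSketch m = new StatefulMapperSketch();
        m.setup();
        for (String w : new String[] {"hadoop", "spark", "hadoop"}) {
            m.map(w);
        }
        m.cleanup();
        System.out.println(m.emitted.get("hadoop")); // prints 2
    }
}
```

Buffering counts in memory like this and emitting them once in cleanup is a common performance optimization (similar in spirit to an in-mapper combiner), but it trades the simplicity of stateless map calls for extra bookkeeping.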

Note: In order to use ToolRunner, your job's main class should extend Configured and implement Tool. Then, instead of putting your job logic in the main method, you override the run method defined by the Tool interface.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

public class YourJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Configure and submit the job here; return 0 on success,
        // non-zero on failure.
        return 0;
    }
    // ...
}

You still need a main method, because something has to hand control to ToolRunner, which in turn parses the generic options and invokes your run method.

public static void main(String[] args) throws Exception {
  int exitCode = ToolRunner.run(new Configuration(), new YourJob(), args);
  System.exit(exitCode);
}

Summary

In this post, we have explained how ToolRunner helps us deal with command-line arguments. We also saw that if your Hadoop job holds state, you need to be able to flush and clean it up, which is why the cleanup method exists. Staying stateless is usually the simpler and better choice, but for performance optimization you may find yourself needing state. In those cases, Hadoop provides the setup and cleanup callbacks you should implement in order to manage it.


Topics:
big data ,hadoop ,mapreduce ,toolrunner ,tutorial
