Hadoop Revisited, Part IV: Statefulness vs. Statelessness and ToolRunner
Let's revisit a couple of fundamental concepts in Hadoop: ToolRunner, and statefulness versus statelessness.
In Part III of this series, we coded a Mapper and a Reducer for the Hadoop platform. We also saw that the Mappers transformed lines from one notation to another, with a shuffling step between them and the Reducers, and that the Reducers did the hard work of producing the final answer for our word count job. In this post, we are going to revisit two key concepts: ToolRunner and statefulness vs. statelessness.
ToolRunner
ToolRunner standardizes how command-line arguments are passed to Hadoop jobs, so you don't need to parse them yourself. Every job launched this way follows the same pattern:
hadoop jar somejar.jar YourMain -D mapred.reduce.tasks=2 \
someinputdir someoutputdir
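When we use ToolRunner, generic options such as -D mapred.reduce.tasks=2 are parsed for us and applied to the job's configuration. Only the remaining arguments (here, the input and output directories) are passed on to our own code.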
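To illustrate how a command line like the one above is divided into configuration properties and job arguments, here is a small, Hadoop-free sketch. The class GenericOptionsSketch is hypothetical and only models the -D key=value form; it is not Hadoop's actual parser, which supports more options.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT Hadoop's parser: it only models "-D key=value".
class GenericOptionsSketch {
    // Configuration properties collected from -D options.
    public final Map<String, String> properties = new LinkedHashMap<>();
    // Everything else: the arguments the job's own code would receive.
    public final List<String> remainingArgs = new ArrayList<>();

    public GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // Split "key=value" at the first '=' only.
                String[] kv = args[++i].split("=", 2);
                properties.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                remainingArgs.add(args[i]);
            }
        }
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch(new String[] {
            "-D", "mapred.reduce.tasks=2", "someinputdir", "someoutputdir"});
        System.out.println(parsed.properties);    // {mapred.reduce.tasks=2}
        System.out.println(parsed.remainingArgs); // [someinputdir, someoutputdir]
    }
}
```

In a real job you would not write this yourself; the point is only that the job's code sees the input and output directories, while the -D setting ends up in the configuration.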
Statefulness vs. Statelessness
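MapReduce is built around inherently stateless, pure functions. That's a very nice property, and you should stick to it as much as possible. However, if you need state that persists across calls, you can initialize it in the standard setup(Context context) method (declared protected in Mapper and Reducer), which is called once at the start of each task.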
Likewise, you have a cleanup method, which is called once at the end of each task, after the last record has been processed. If you keep some state, it's quite common to use cleanup to write that state out to disk.
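The lifecycle described here (initialize state once, update it per record, write it out at the end) can be sketched without Hadoop on the classpath. The class below is a hypothetical model, not a real Mapper: its setup, map, and cleanup methods only mimic the order in which a task would invoke the callbacks, and the output list stands in for writing through the Context.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free model of a stateful mapper; all names here are hypothetical.
class StatefulMapperSketch {
    // State that survives across map() calls within one task.
    private Map<String, Integer> counts;
    // Stands in for records written through the Context.
    public final List<String> output = new ArrayList<>();

    // Called once before the first record: create the state.
    public void setup() {
        counts = new HashMap<>();
    }

    // Called once per input line: update the state instead of emitting.
    public void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Called once after the last record: write the accumulated state out.
    public void cleanup() {
        // Sorted copy so the emitted order is deterministic.
        for (Map.Entry<String, Integer> e : new TreeMap<>(counts).entrySet()) {
            output.add(e.getKey() + "\t" + e.getValue());
        }
    }

    // Drives the callbacks in the order a task would.
    public static List<String> runTask(List<String> lines) {
        StatefulMapperSketch mapper = new StatefulMapperSketch();
        mapper.setup();
        for (String line : lines) {
            mapper.map(line);
        }
        mapper.cleanup();
        return mapper.output;
    }

    public static void main(String[] args) {
        runTask(List.of("to be or", "not to be")).forEach(System.out::println);
    }
}
```

The trade-off is the one the text names: the mapper is no longer a pure function of a single record, and the state has to fit in memory for the duration of the task.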
Note: In order to use ToolRunner, your job's main class should extend Configured and implement Tool. Then, instead of putting the job logic in the main method, you override the run method declared by the Tool interface.
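import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

public class YourJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // args holds only the job's own arguments; generic options such as
        // -D have already been applied to the configuration from getConf().
        // TODO: set up and submit the job here; return 0 on success.
        return 0;
    }
    // ...
}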
You still need a main method, because something has to hand your Tool to ToolRunner, which in turn calls the run method.
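// Needs org.apache.hadoop.conf.Configuration and org.apache.hadoop.util.ToolRunner.
public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options, then calls YourJob.run with the rest.
    int exitCode = ToolRunner.run(new Configuration(), new YourJob(), args);
    System.exit(exitCode);
}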
Summary
In this post, we explained that ToolRunner helps us deal with command-line arguments. In addition, if your Hadoop job keeps state, you need to make sure you can write it out or clean it up when the task is done; that's why the cleanup method exists. Staying stateless is generally the better choice because it is usually simpler, but for performance optimization you may find that you need state. In these cases, Hadoop provides callbacks (setup and cleanup) that you implement and the framework calls, so that you can manage this state.