
Hadoop Revisited, Part III: MapReduce Tutorial


Mappers take many pieces of data and transform them so that they are digestible by a Reducer, which sees the whole picture and runs holistically on the data.


In Part I and Part II of this series, we went through 10 key Hadoop concepts and explored the command-line interface for the Hadoop file system, namely hadoop fs. We examined how to ingest text, output text, and read directories. In this part, we are going to create a MapReduce job: a Mapper and a Reducer. While there are many advanced tools for Hadoop manipulation, getting back to basics reminds us what Big Data manipulation is all about. It's about files, sometimes in memory but mostly in the file system, and the operations that we run on them at scale.
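For reference, these are the kinds of hadoop fs commands that cover those operations (the paths and file names here are just placeholders):

hadoop fs -put words.txt /user/you/input/   # copy a local file into HDFS
hadoop fs -ls /user/you/input               # list the contents of a directory
hadoop fs -cat /user/you/input/words.txt    # print a file to the console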


With that, let's move on and create our MapReduce job, which can serve as a skeleton for future work.

The Big Picture

Mappers take many pieces of data and transform them so that they can be digested by our Reducer, which in general sees the whole picture and operates on the data as a holistic unit.

To build a job, you need three files:

  1. Job manager.
  2. Mapper.
  3. Reducer.

Let's see each of them in an example.

Job Manager

Below is an example job manager. As you can see, it has a main method that takes two arguments: in our case, the input path and the output path. It then wires up the Mapper and the Reducer and submits the job; that's the essence of the job manager.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class YourJob {

  /**
   * Set up the job.
   */
  public static void main(String[] args) throws Exception {

    // Inputs: an input path and an output path.
    if (args.length != 2) {
      System.out.printf("Usage: YourJob <input dir> <output dir>\n");
      System.exit(-1);
    }

    // Set the job's inputs, outputs, and classes.
    Job job = Job.getInstance(); // replaces the deprecated new Job().
    job.setJarByClass(YourJob.class);
    job.setJobName("Your Job");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(YourMapper.class);
    job.setReducerClass(YourReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class); // must match the reducer's output value type.

    // Run the job and exit with its status.
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
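Once the three classes are compiled and packaged into a jar (the jar and class names below are placeholders), the job is submitted with the standard hadoop jar command. Note that the output directory must not already exist, or the job will fail at startup:

hadoop jar your-job.jar YourJob /user/you/input /user/you/output

When the job finishes, the results are written to files named part-r-00000, part-r-00001, and so on under the output directory.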

Mapper

The Mapper takes a line of data, together with its key, and transforms it. It doesn't see the big picture; it consumes individual pieces of data, runs distributed across the cluster, and transforms each piece on its own. Here is an example from a word-count job: for every word in the line, it emits the pair (word, 1).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class YourMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

      final String line = value.toString();
      for (String word : line.split("\\W+")) {
        if (!word.isEmpty()) { // splitting can produce empty tokens; skip them.
          context.write(new Text(word), new IntWritable(1)); // emit (word, 1) as the mapper output.
        }
      }
    }
}
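To make that concrete: given the input line to be or not to be, the Mapper emits the pairs (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1). The framework then shuffles these pairs and groups them by key before handing them to the Reducer.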

Reducer

The Reducer sees a more holistic picture. It receives the pieces of information emitted by the Mappers and combines them. In our case, it builds on the (word, 1) pairs it receives from the Mappers and outputs the total count for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class YourReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  /**
   * Note that you get a key and then an iterable of all values which were emitted for that key.
   * The keys handed to reducers are sorted.
   */
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

      int count = 0;
      for (IntWritable value : values) {
          count += value.get(); // sum the counts emitted by the mappers for this word.
      }
      context.write(key, new IntWritable(count)); // use context to write (word, total) to the output.
  }
}
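Continuing the same example, the Reducer receives the key be with the values [1, 1] and writes (be, 2), receives to with [1, 1] and writes (to, 2), and so on for every word that appeared in the input.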

Summary

In this third post about Hadoop, we have seen the three main components of a job: the job manager, the Mapper, and the Reducer. The job manager, as its name suggests, is just a Java main method that receives the user's input and wires up the Mappers and Reducers. The Mappers transform each line of the data; in general, they take something in and transform it. The Reducers see the whole picture and, in our case, output the total count for each word.


Topics:
data science, big data, hadoop, map reduce, bigdata, machine learning
