
Amazon EMR Tutorial: Running a Hadoop MapReduce Job Using Custom JAR

By Muhammad Ali Khojaye
Apr. 23, 2012 · Tutorial

See original post at https://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html

Introduction

Amazon EMR is a web service that makes it easy to process enormous amounts of data efficiently. It uses a hosted Hadoop framework running on the web-scale infrastructure of Amazon EC2 and Amazon S3.

Amazon EMR removes most of the cumbersome details of Hadoop: it takes care of provisioning the Hadoop cluster, running and terminating the job flow, moving data between Amazon EC2 and Amazon S3, and optimizing Hadoop.

In this tutorial, we will first develop a WordCount Java example using Hadoop and then execute the program on Amazon Elastic MapReduce.

Prerequisites

You must have valid AWS account credentials, and you should have a general familiarity with the Eclipse IDE before you begin. Any other IDE of your choice will work as well.

Step 1 – Develop MapReduce WordCount Java Program

In this section, we are first going to develop a WordCount application. A WordCount program determines how many times each distinct word appears in a set of input files.
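For example (the input line below is a hypothetical illustration, not from the original post), given an input file containing the single line "the quick brown fox jumps over the lazy dog the end", the finished job writes one tab-separated word/count pair per line, sorted by word:

    brown   1
    dog     1
    end     1
    fox     1
    jumps   1
    lazy    1
    over    1
    quick   1
    the     3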

  1. In Eclipse (or whichever IDE you are using), create a simple Java project named "WordCount".
  2. Create a Java class named Map and override the map method as follows:
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Tokenize the input line and emit (word, 1) for every token.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, one);
        }
      }
    }
  3. Create a Java class named Reduce and override the reduce method as shown below:
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        // Sum all counts emitted for this word by the mappers.
        int sum = 0;
        for (IntWritable value : values) {
          sum += value.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }
  4. Create a Java class named WordCount and define the main method as below:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        // Output key/value types produced by the job.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Input and output locations come from the command-line arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
      }
    }
  5. Export the WordCount program to a JAR using Eclipse and save it to some location on disk. Make sure that you specify the main class (WordCount) when exporting the JAR, as shown below. [Image: exporting the JAR from Eclipse]
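If you build the JAR outside Eclipse, the equivalent of the Eclipse main-class setting is a Main-Class entry in the JAR's MANIFEST.MF; a minimal sketch, assuming WordCount is in the default package:

    Main-Class: WordCount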



               
Our JAR is ready!

Step 2 – Upload the WordCount JAR and Input Files to Amazon S3

Now we are going to upload the WordCount JAR to Amazon S3. First, go to the following URL: https://console.aws.amazon.com/s3/home. Next, click "Create Bucket", give your bucket a name, and click the "Create" button. Select your new S3 bucket in the left-hand pane, then upload the WordCount JAR and a sample input file for counting the words.
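If you prefer to script this step rather than use the console, the sketch below does the same upload with the AWS SDK for Java. The credentials, bucket name, key names, and local file paths are placeholder assumptions, not values from the original post.

    import java.io.File;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;

    public class UploadWordCountJar {
      public static void main(String[] args) {
        // Assumption: credentials passed in directly for brevity.
        AmazonS3Client s3 = new AmazonS3Client(
            new BasicAWSCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"));

        String bucket = "my-wordcount-bucket"; // hypothetical bucket name

        // Equivalent of "Create Bucket" in the console, then the two uploads.
        s3.createBucket(bucket);
        s3.putObject(bucket, "WordCount.jar", new File("WordCount.jar"));
        s3.putObject(bucket, "input/sample.txt", new File("sample.txt"));
      }
    }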

Step 3 – Running an Elastic MapReduce job

Now that the JAR is uploaded to S3, all we need to do is create a new job flow. Let's execute the steps below. (I encourage readers to check out the following link for details on each step: How to Create a Job Flow Using a Custom JAR.)

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/
  2. Click Create New Job Flow.
  3. On the DEFINE JOB FLOW page, enter the following details:
    a) Job Flow Name = WordCountJob
    b) Select "Run your own application"
    c) Select "Custom JAR" in the drop-down list
    d) Click Continue
  4. On the SPECIFY PARAMETERS page, enter values in the boxes using the following as a guide, and then click Continue:
    JAR Location = bucketName/jarFileLocation
    JAR Arguments = s3n://bucketName/inputFileLocation s3n://bucketName/outputPath

Please note that the output path must be unique each time we execute the job: Hadoop always creates the output folder itself using the name specified here, and the job will fail if that folder already exists.
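If you would rather start the job flow programmatically than click through the console, here is a minimal sketch using the AWS SDK for Java. The instance types, instance count, and S3 paths are illustrative assumptions; the JAR location and arguments mirror the SPECIFY PARAMETERS page above.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class RunWordCountJobFlow {
      public static void main(String[] args) {
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
            new BasicAWSCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"));

        // The custom JAR step: same JAR location and arguments as in the console.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
            .withJar("s3n://my-wordcount-bucket/WordCount.jar")
            .withArgs("s3n://my-wordcount-bucket/input/",
                      "s3n://my-wordcount-bucket/output/run-1/");

        StepConfig step = new StepConfig()
            .withName("WordCountStep")
            .withHadoopJarStep(jarStep)
            .withActionOnFailure("TERMINATE_JOB_FLOW");

        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("WordCountJob")
            .withLogUri("s3n://my-wordcount-bucket/logs/") // hypothetical log path
            .withSteps(step)
            .withInstances(new JobFlowInstancesConfig()
                .withInstanceCount(3)                      // assumed cluster size
                .withMasterInstanceType("m1.small")
                .withSlaveInstanceType("m1.small"));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
      }
    }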

After executing the job, just wait and monitor it as it runs through the Hadoop flow. You can also look for errors by using the Debug button. The job should complete within 10 to 15 minutes (depending also on the size of the input). Once the job completes, you can view the results in the S3 browser panel. You can also download the output files from S3 and analyze the outcome of the job.
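Downloading the output can likewise be scripted with the AWS SDK for Java; a small sketch, again with placeholder credentials, bucket, and key names:

    import java.io.File;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GetObjectRequest;

    public class DownloadWordCountOutput {
      public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client(
            new BasicAWSCredentials("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"));

        // Reducer output appears as part-r-NNNNN files under the output path.
        s3.getObject(
            new GetObjectRequest("my-wordcount-bucket", "output/run-1/part-r-00000"),
            new File("part-r-00000"));
      }
    }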

Amazon Elastic MapReduce Resources

  1. Amazon Elastic MapReduce Documentation, http://aws.amazon.com/documentation/elasticmapreduce/
  2. Amazon Elastic MapReduce Getting Started Guide, http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
  3. Amazon Elastic MapReduce Developer Guide, http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/
  4. Apache Hadoop, http://hadoop.apache.org/



