
Getting Started Quickly with Hadoop and MapReduce



So here’s the problem: You’ve finally found a block of time to sit down and get your head around Hadoop and MapReduce. You do a quick Google search for a tutorial to get you started, and immediately your problems are twofold:

  1. You are a 23-step process and a cloud deployment away from having your first Hadoop cluster spun up.
  2. The most interesting thing you will be able to do once you get your cluster up and running is to count all the words in the complete works of Shakespeare. Ho…hum.

Well, if this is your situation, you’ll be pleased to find that the first problem goes away immediately upon downloading Hadoop. Doug Cutting, in his infinite wisdom, understood that it was intimidating to spin up an entire cluster just to get started learning the platform, so he built in a little feature that allows you to get started immediately. As an example, let’s say you have a giant 137-core cluster in the cloud and you’ve stored the complete and unabridged works of all the classic authors on HDFS in the books directory. You can run your WordCount MapReduce job on the corpus and send the results to the words directory with the following command:

${HADOOP_HOME}/bin/hadoop jar WordCount.jar org.myorg.WordCount books words

On the other hand, if you have no such cluster, but you have Macbeth and Romeo and Juliet stored in the books directory on your local machine, then you can still run your WordCount MapReduce job on your measly, wimpy corpus and send the results to the words directory (again, on your local machine) by issuing the exact same command.

${HADOOP_HOME}/bin/hadoop jar WordCount.jar org.myorg.WordCount books words

Pretty easy way to get started, eh?

Issue number 2 is a bit more nefarious. Why? Because word counting is easy to understand, and it really is probably the most straightforward application of MapReduce.
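To see just how straightforward, here is a minimal sketch of the WordCount logic in plain Java — no Hadoop dependency, just the map and reduce steps simulated in-process. The class and method names here are my own invention for illustration; Hadoop's real Mapper and Reducer classes carry the same idea, only distributed across a cluster with a shuffle phase in between.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // "Map" phase: split a line into words and emit a (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // "Shuffle + reduce" phase: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A stand-in for the input splits Hadoop would hand each mapper.
        String[] corpus = {
            "to be or not to be",
            "that is the question"
        };
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : corpus) {
            pairs.addAll(map(line));
        }
        System.out.println(reduce(pairs));
        // prints {be=2, is=1, not=1, or=1, question=1, that=1, the=1, to=2}
    }
}
```

The whole algorithm is a split, a group-by, and a sum — which is exactly why every tutorial reaches for it, and exactly why it gets old fast.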

However, I got bored of the old WordCount Hello World, and being a fairly mathy person, I decided to make my own Hello World with a mathematical twist! Take a look!



Published at DZone with permission of John Berryman, DZone MVB.

Opinions expressed by DZone contributors are their own.
