Over a million developers have joined DZone.

SpringData-Hadoop: Jumpstart Hadoop with Spring

· Big Data Zone

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

These days there are lot of hype around jargons like HadoopHBaseHivePig and BigData. I was itching to learn what are these terms and how I can see them in the real world. I had 2 goals setup up for me,

  1. Create Hadoop Single Node instance
  2. Of course figure out how it is integrated with Spring/Spring Batch

As usual, I googled how to quickly set up and learn these tools. The journey was not smooth. For a Windows user there there are 2 ways you can setup Hadoop Single node cluster on your machine.

  1. Cygwin: The first approach is not easy to setup, I took few days to struggle thru this without much results on my Windows 7
  2. Open source and Commercial VM: EMC-GreenPlum (commercial), Cloudera / Yahoo (opensource) have created VMware instances with Hadoop, Hive, bundled into the VM and and they claim it works out of the box. Yahoo VM partially worked in my machine but it is outdated, it does not integrate with Spring. Cloudera VM did not work in my machine because of some 64bit conflicts.
  3. I got another VM instance from Cloudera for 32bit and it worked. This is a Ubuntu VM instance with all the above tools installed and preconfigured.

I started with Option 3, you can start the VM and do some quick tests as described in the tutorial. If you are in a real hurry, you can open the terminal and run this commands,

cd /usr/lib/hadoop
hadoop jar hadoop-examples.jar pi 10 1000000

Good luck, you ran your 1st Hadoop job.

Now in the same VM download Gradle and SpringData-Hadoop Installation. Unzip both of these in your Cloudera home directory. Go to your .profile file and Add the below line in the end,

export PATH=$PATH:/user/cloudera/gradle-1.0-rc-3/bin

Note your Gradle version maybe different and you should change it accordingly.

Now go to <SpringData-Hadoop Home>/samples/batch-wordcount and open build.gradle file and remove the repositories entries and add the following lines,

repositories {
// Public Spring artefacts
maven { url "http://repo.springsource.org/libs-release" }
maven { url "http://repo.springsource.org/libs-milestone" }
maven { url "http://repo.springsource.org/libs-snapshot" }
maven { url "http://www.datanucleus.org/downloads/maven2/" }
maven { url "http://oss.sonatype.org/content/repositories/snapshots" }
maven { url "http://people.apache.org/~rawson/repo" }
maven { url "https://repository.cloudera.com/artifactory/cloudera-repos/" }

Open <SpringData-Hadoop Home>/samples/batch-wordcount/gradle.properties and modify

hadoopVersion = 0.20.2-cdh3u3

Open <SpringData-Hadoop Home>/samples/batch-wordcount/src/main/resources/hadoop.properties and edit the below lines


Now go to command prompt and run gradle test, the test will be successful. Here is the documentation/tutorial on Spring Hadoop integration

If you want to learn more about Hadoop, there are good tutorials from Cloudera and YDN, please go thru it.

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks


Published at DZone with permission of Krishna Prasad, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}