
Spring Data - Apache Hadoop


Spring for Apache Hadoop is a Spring project that supports writing applications benefiting from the integration of the Spring Framework and Hadoop. This post describes how to use Spring Data Apache Hadoop in an Amazon EC2 environment, using the “Hello World” equivalent of Hadoop programming: a wordcount application.

1./ Launch an Amazon Web Services EC2 instance.

- Navigate to the AWS EC2 console (https://console.aws.amazon.com/ec2/home):

- Select Launch Instance, then Classic Wizard, and click Continue. My test environment was a “Basic Amazon Linux AMI 2011.09” 32-bit instance, instance type Micro (t1.micro, 613 MB), with the quick-start-1 security group that allows login via ssh. Select your existing key pair (or create a new one). Obviously you can select another AMI and instance type, depending on your favourite flavour. (Should you vote for a Windows 2008 based instance, you also need Cygwin installed as an additional Hadoop prerequisite, besides the Java JDK and ssh; see the “Install Apache Hadoop” section.)

2./ Download Apache Hadoop - as of writing this article, 1.0.0 is the latest stable version of Apache Hadoop, and that is what was used for testing. I downloaded hadoop-1.0.0.tar.gz and copied it into the /home/ec2-user directory using the pscp command from my PC running Windows:

c:\downloads>pscp -i mykey.ppk hadoop-1.0.0.tar.gz  ec2-user@ec2-176-34-201-185.eu-west-1.compute.amazonaws.com:/home/ec2-user

(the computer name above – ec2-ipaddress-region-compute.amazonaws.com – can be found on AWS EC2 console, Instance Description, public DNS field)

3./ Install Apache Hadoop:

As prerequisites, you need to have Java JDK 1.6 and ssh installed; see the Apache Single-Node Setup Guide. (ssh is installed by default on the Basic Amazon AMI.) Then install Hadoop itself:

$ cd  ~   # change directory to ec2-user home (/home/ec2-user)

$ tar xvzf hadoop-1.0.0.tar.gz

$ ln -s hadoop-1.0.0  hadoop

$ cd hadoop/conf

$ vi hadoop-env.sh   # edit as below

export JAVA_HOME=/opt/jdk1.6.0_29

$ vi core-site.xml    # edit as below – this defines the namenode to be running on localhost and listening to port 9000.
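The snippet itself was lost from the article; a standard Hadoop 1.x pseudo-distributed core-site.xml matching the description above would be:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- namenode on localhost, port 9000 -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```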



$ vi hdfs-site.xml  # edit as below – this sets the replication factor to 1 (in a production environment it defaults to 3)
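Again, the snippet was lost from the article; a standard hdfs-site.xml matching the description above would be:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- single-node setup: keep only one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```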



$ vi mapred-site.xml  # edit as below – this defines the jobtracker to be running on localhost and listening to port 9001.
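The corresponding mapred-site.xml, reconstructed from the description above, would be:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- jobtracker on localhost, port 9001 -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```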



$ cd ~/hadoop

$ bin/hadoop namenode -format

$ bin/start-all.sh

At this stage all Hadoop daemons are running in pseudo-distributed mode; you can verify this by running:

$ ps -ef | grep java

You should see five Java processes: namenode, secondarynamenode, datanode, jobtracker, and tasktracker.

4./ Install Spring Data Hadoop

Download the Spring Data Hadoop package from the SpringSource community download site. As of writing this article, the latest stable version is spring-data-hadoop-1.0.0.M1.zip.

$ cd ~

$ unzip spring-data-hadoop-1.0.0.M1.zip

$ ln -s spring-data-hadoop-1.0.0.M1 spring-data-hadoop

5./ Build and Run Spring Data Hadoop Wordcount example

$ cd spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount

Spring Data Hadoop uses Gradle as its build tool. Check the build.gradle build file. The original version packaged in the distribution does not compile; it complains about thrift, version 0.2.0, and jdo2-api, version 2.3-ec.

Add the datanucleus.org Maven repository to the build.gradle file to resolve jdo2-api (http://www.datanucleus.org/downloads/maven2/).

Unfortunately, there seems to be no Maven repo for thrift 0.2.0. You should download the thrift-0.2.0.jar and thrift-0.2.0.pom files, e.g. from this repo: http://people.apache.org/~rawson/repo, and then add the jar to your local Maven repo:

$ mvn install:install-file -DgroupId=org.apache.thrift  -DartifactId=thrift  -Dversion=0.2.0 -Dfile=thrift-0.2.0.jar  -Dpackaging=jar

$ vi build.gradle  # modify the build file to refer to the datanucleus Maven repo for jdo2-api and the local repo for thrift

repositories {
    // Public Spring artefacts
    maven { url "http://repo.springsource.org/libs-release" }
    maven { url "http://repo.springsource.org/libs-milestone" }
    maven { url "http://repo.springsource.org/libs-snapshot" }
    maven { url "http://www.datanucleus.org/downloads/maven2/" }
    maven { url "file:///home/ec2-user/.m2/repository" }
}

I also modified the META-INF/spring/context.xml file in order to run hadoop file system commands manually:

$ cd /home/ec2-user/spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount/src/main/resources

$ vi META-INF/spring/context.xml   # remove the clean-script and also the JobRunner's dependency on it.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xmlns:p="http://www.springframework.org/schema/p"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
        http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <context:property-placeholder location="hadoop.properties"/>

    <hdp:job id="wordcount-job" validate-paths="false"
        input-path="${wordcount.input.path}" output-path="${wordcount.output.path}"/>

    <!-- simple job runner -->
    <bean id="runner" class="org.springframework.data.hadoop.mapreduce.JobRunner" p:jobs-ref="wordcount-job"/>

</beans>


Copy the sample file – nietzsche-chapter-1.txt – to the Hadoop file system (/user/ec2-user/input directory):

$ cd src/main/resources/data

$ hadoop fs -mkdir /user/ec2-user/input

$ hadoop fs -put nietzsche-chapter-1.txt /user/ec2-user/input/data

$ cd ../../../..   # go back to samples/wordcount directory

$ ../gradlew

Verify the result:

$ hadoop fs -cat /user/ec2-user/output/part-r-00000 | more

“BY 1
“Beyond 1
“By 2
“Cheers 1
“DE 1
“Everywhere 1
“FROM” 1
“Flatterers 1
“Freedom 1
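As a quick sanity check of what the MapReduce job computes, the same word counting can be reproduced locally with standard shell tools (the sample text and file path below are illustrative, not from the Nietzsche input):

```shell
# Build a tiny sample and count word frequencies, mimicking the wordcount job:
# split on spaces, sort so identical words are adjacent, then count duplicates.
printf 'the quick brown fox\nthe lazy dog\n' > /tmp/wc-sample.txt
counts=$(tr -s ' ' '\n' < /tmp/wc-sample.txt | sort | uniq -c | awk '{print $2, $1}')
echo "$counts"
```

Each output line is a word followed by its count, e.g. "the 2", which is the same word/count pairing you see in part-r-00000 above.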




Published at DZone with permission of Istvan Szegedi, DZone MVB. See the original article here.
