
Spring Data - Apache Hadoop

By Istvan Szegedi · Jul. 19, 2012

Spring for Apache Hadoop is a Spring project that supports writing applications which benefit from the integration of the Spring Framework and Hadoop. This post describes how to use Spring Data Apache Hadoop in an Amazon EC2 environment using the "Hello World" equivalent of Hadoop programming, a WordCount application.

1./ Launch an Amazon Web Services EC2 instance.

- Navigate to AWS EC2 Console (“https://console.aws.amazon.com/ec2/home”):

- Select Launch Instance, then Classic Wizard, and click on Continue. My test environment was a "Basic Amazon Linux AMI 2011.09" 32-bit, instance type Micro (t1.micro, 613 MB), with the quick-start-1 security group that allows ssh logins. Select your existing key pair (or create a new one). Obviously you can select another AMI and instance type depending on your favourite flavour. (Should you opt for a Windows 2008 based instance, you also need to have cygwin installed as an additional Hadoop prerequisite besides Java JDK and ssh; see the "Install Apache Hadoop" section.)

2./ Download Apache Hadoop - as of writing this article, 1.0.0 is the latest stable version of Apache Hadoop, and that is what was used for testing purposes. I downloaded hadoop-1.0.0.tar.gz and copied it into the /home/ec2-user directory using the pscp command from my PC running Windows:

c:\downloads>pscp -i mykey.ppk hadoop-1.0.0.tar.gz  ec2-user@ec2-176-34-201-185.eu-west-1.compute.amazonaws.com:/home/ec2-user

(the computer name above – ec2-ipaddress-region-compute.amazonaws.com – can be found on AWS EC2 console, Instance Description, public DNS field)
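(From a Linux or Mac machine, plain scp with your key pair's .pem file does the same job; the key file name below is just an example:)

$ scp -i mykey.pem hadoop-1.0.0.tar.gz ec2-user@ec2-176-34-201-185.eu-west-1.compute.amazonaws.com:/home/ec2-user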

3./ Install Apache Hadoop:

As prerequisites, you need to have Java JDK 1.6 and ssh installed; see the Apache Single-Node Setup Guide. (ssh is automatically installed with the Basic Amazon AMI.) Then install Hadoop itself:

$ cd  ~   # change directory to ec2-user home (/home/ec2-user)

$ tar xvzf hadoop-1.0.0.tar.gz

$ ln -s hadoop-1.0.0  hadoop

$ cd hadoop/conf

$ vi hadoop-env.sh   # edit as below

export JAVA_HOME=/opt/jdk1.6.0_29
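(The JDK path above is environment specific; point JAVA_HOME at wherever your JDK is actually installed. If you are unsure, the following prints the real java binary location, and JAVA_HOME is the directory above its jre/bin or bin component:)

$ readlink -f $(which java)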

$ vi core-site.xml    # edit as below – this defines the namenode to be running on localhost and listening on port 9000.

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

 

$ vi hdfs-site.xml  # edit as below – this sets the file system replication factor to 1 (in a production environment the default is supposed to be 3)

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

 

$ vi mapred-site.xml  # edit as below – this defines the jobtracker to be running on localhost and listening on port 9001.

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

 

$ cd ~/hadoop

$ bin/hadoop namenode -format
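start-all.sh launches the Hadoop daemons over ssh to localhost, so passwordless ssh has to work. If it is not configured yet, the key-based setup from the Apache Single-Node Setup Guide is roughly:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

$ chmod 600 ~/.ssh/authorized_keys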

$ bin/start-all.sh

At this stage all Hadoop daemons are running in pseudo-distributed mode; you can verify this by running:

$ ps -ef | grep java

You should see 5 java processes: namenode, secondarynamenode, datanode, jobtracker and tasktracker.
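(Alternatively, the JDK's jps utility lists the same daemons by class name:)

$ jps   # should show NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker (plus Jps itself)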

4./ Install Spring Data Hadoop

Download the Spring Data Hadoop package from the SpringSource community download site. As of writing this article, the latest available version is spring-data-hadoop-1.0.0.M1.zip.

$ cd ~

$ unzip spring-data-hadoop-1.0.0.M1.zip   # the distribution is a zip archive, not a tarball

$ ln -s spring-data-hadoop-1.0.0.M1 spring-data-hadoop

5./ Build and Run Spring Data Hadoop Wordcount example

$ cd spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount

Spring Data Hadoop uses Gradle as its build tool. Check the build.gradle build file. The original version packaged in the distribution does not compile; it complains about thrift, version 0.2.0, and jdo2-api, version 2.3-ec.

Add the datanucleus.org Maven repository to the build.gradle file to resolve jdo2-api (http://www.datanucleus.org/downloads/maven2/).

Unfortunately, there seems to be no Maven repo for thrift 0.2.0. You should download the thrift-0.2.0.jar and thrift-0.2.0.pom files, e.g. from this repo: http://people.apache.org/~rawson/repo, and then add them to your local Maven repo.
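(Assuming that repo follows the standard Maven directory layout implied by the groupId and artifactId used below, the two files can be fetched with something like:)

$ wget http://people.apache.org/~rawson/repo/org/apache/thrift/thrift/0.2.0/thrift-0.2.0.jar

$ wget http://people.apache.org/~rawson/repo/org/apache/thrift/thrift/0.2.0/thrift-0.2.0.pom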

$ mvn install:install-file -DgroupId=org.apache.thrift  -DartifactId=thrift  -Dversion=0.2.0 -Dfile=thrift-0.2.0.jar  -Dpackaging=jar
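(If you grabbed the pom as well, the install plugin's -DpomFile option installs the jar together with its original dependency metadata instead of the minimal pom that -Dpackaging=jar generates:)

$ mvn install:install-file -Dfile=thrift-0.2.0.jar -DpomFile=thrift-0.2.0.pom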

$ vi build.gradle  # modify the build file to refer to the datanucleus Maven repo for jdo2-api and the local repo for thrift

repositories {
    // Public Spring artefacts
    mavenCentral()
    maven { url "http://repo.springsource.org/libs-release" }
    maven { url "http://repo.springsource.org/libs-milestone" }
    maven { url "http://repo.springsource.org/libs-snapshot" }
    maven { url "http://www.datanucleus.org/downloads/maven2/" }
    maven { url "file:///home/ec2-user/.m2/repository" }
}

I also modified the META-INF/spring/context.xml file in order to run hadoop file system commands manually:

$ cd /home/ec2-user/spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount/src/main/resources

$ vi META-INF/spring/context.xml   # remove the clean-script and also JobRunner's dependency on it.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:hdp="http://www.springframework.org/schema/hadoop"
xmlns:p="http://www.springframework.org/schema/p"
xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

<context:property-placeholder location="hadoop.properties"/>

<hdp:configuration>
fs.default.name=${hd.fs}
</hdp:configuration>

<hdp:job id="wordcount-job" validate-paths="false"
input-path="${wordcount.input.path}" output-path="${wordcount.output.path}"
mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>

<!-- simple job runner -->
<bean id="runner" class="org.springframework.data.hadoop.mapreduce.JobRunner" p:jobs-ref="wordcount-job"/>

</beans>

Copy the sample file – nietzsche-chapter-1.txt – to the Hadoop file system (the /user/ec2-user/input directory):

$ cd src/main/resources/data

$ hadoop fs -mkdir /user/ec2-user/input

$ hadoop fs -put nietzsche-chapter-1.txt /user/ec2-user/input/data
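(A quick listing confirms the file has landed in HDFS:)

$ hadoop fs -ls /user/ec2-user/input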

$ cd ../../../..   # go back to samples/wordcount directory

$ ../gradlew
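Once the run finishes, the output directory (the path comes from hadoop.properties) should contain a part-r-00000 file and typically a _SUCCESS marker:

$ hadoop fs -ls /user/ec2-user/output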

Verify the result (the WordCount example tokenizes purely on whitespace, which is why leading quotation marks stay attached to the words in the output):

$ hadoop fs -cat /user/ec2-user/output/part-r-00000 | more

“AWAY 1
“BY 1
“Beyond 1
“By 2
“Cheers 1
“DE 1
“Everywhere 1
“FROM” 1
“Flatterers 1
“Freedom 1

 


Published at DZone with permission of Istvan Szegedi, DZone MVB. See the original article here.
