
How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 2

In Part 1 we successfully created, launched, and connected to Amazon Ubuntu instances. In Part 2 I will show how to install and set up a Hadoop cluster. If you are seeing this series for the first time, I strongly advise you to go over Part 1 first.

In this article

  • HadoopNameNode will be referred to as the master,
  • HadoopSecondaryNameNode will be referred to as the SecondaryNameNode or SNN,
  • HadoopSlave1 and HadoopSlave2 will be referred to as slaves (where the DataNodes will reside).

So, let’s begin.

1. Apache Hadoop Installation and Cluster Setup

1.1 Update the packages and dependencies.

Let’s update the packages. I will start with the master; repeat this for the SNN and the two slaves.

$ sudo apt-get update

Once it’s complete, let’s install Java.

1.2 Install Java

Add the following PPA and install the latest Oracle Java (JDK) 7 on Ubuntu:

$ sudo add-apt-repository ppa:webupd8team/java

$ sudo apt-get update && sudo apt-get install oracle-java7-installer

Check that Ubuntu now uses JDK 7.
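A quick way to verify is java -version; the exact version and build numbers will vary with the current Oracle JDK 7 release, but the output should look roughly like this:

$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)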


Repeat this for the SNN and the two slaves.

1.3 Download Hadoop

I am going to use the Hadoop 1.2.1 stable version from the Apache download page; here is the 1.2.1 mirror.

Issue the wget command from the shell:

$ wget http://apache.mirror.gtcomm.net/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz


Unzip the files and review the package content and configuration files.

$ tar -xzvf hadoop-1.2.1.tar.gz


Rename the ‘hadoop-1.2.1’ directory to ‘hadoop’ for ease of operation and maintenance.

$ mv hadoop-1.2.1 hadoop


1.4 Setup Environment Variables

Set up the environment variables for the ‘ubuntu’ user.

Update the .bashrc file to add important Hadoop paths and directories.

Navigate to the home directory:

$ cd

Open the .bashrc file in the vi editor:

$ vi .bashrc

Add the following at the end of the file:

export HADOOP_CONF=/home/ubuntu/hadoop/conf

export HADOOP_PREFIX=/home/ubuntu/hadoop

#Set JAVA_HOME

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

# Add Hadoop bin/ directory to path
export PATH=$PATH:$HADOOP_PREFIX/bin

Save and Exit.

To check whether it has been updated correctly, reload the bash profile and use the following commands:

$ source ~/.bashrc
$ echo $HADOOP_PREFIX
$ echo $HADOOP_CONF

Repeat 1.3 and 1.4 for the remaining 3 machines (SNN and 2 slaves).

1.5 Setup Password-less SSH on Servers

The master server remotely starts services on the slave nodes, which requires password-less access to the slave servers. The AWS Ubuntu server comes with the OpenSSH server pre-installed.

Quick Note:
The public part of the key loaded into the agent must be put on the target system in ~/.ssh/authorized_keys. This has been taken care of by the AWS server creation process.

Now we need to add the AWS EC2 key pair identity hadoopec2cluster.pem to the SSH profile. In order to do that, we will need to use the following SSH utilities:
  • ‘ssh-agent’ is a background program that handles passwords for SSH private keys.
  •  ‘ssh-add’ command prompts the user for a private key password and adds it to the list maintained by ssh-agent. Once you add a password to ssh-agent, you will not be asked to provide the key when using SSH or SCP to connect to hosts with your public key.

The Amazon EC2 instance creation has already taken care of ‘authorized_keys’ on the master server. Execute the following commands to allow password-less SSH access to the slave servers.

First of all, we need to protect our key pair files. If the file permissions are too open, SSH will refuse to use the key and you will get an error.


To fix this problem, we need to issue the following commands:

$ chmod 644 authorized_keys

Quick Tip: If you set the permissions with ‘chmod 644’, you get a file that can be written only by you, but can be read by the rest of the world.

$ chmod 400 hadoopec2cluster.pem

Quick Tip: chmod 400 is a very restrictive setting, giving only the file owner read-only access: no write/execute capabilities for the owner, and no permissions whatsoever for anyone else.

To use ssh-agent and ssh-add, follow the steps below:

  1. At the Unix prompt, enter: eval `ssh-agent`
     Note: Make sure you use the backquote ( ` ), located under the tilde ( ~ ), rather than the single quote ( ' ).
  2. Enter the command: ssh-add hadoopec2cluster.pem

If you notice, the .pem file has “read-only” permission now, and this time it works for us.

Keep in mind that the SSH agent session will be lost upon shell exit, and you will have to repeat the ssh-agent and ssh-add commands.

Remote SSH

Let’s verify that we can connect to the SNN and the slave nodes from the master:

$ ssh ubuntu@<your-amazon-ec2-public URL>

On successful login, the IP address shown on the shell prompt will change.

1.6 Hadoop Cluster Setup

This section will cover the Hadoop cluster configuration. We will have to modify:

  • hadoop-env.sh – This file contains some environment variable settings used by Hadoop. You can use these to affect some aspects of Hadoop daemon behavior, such as where log files are stored, the maximum amount of heap used, etc. The only variable you should need to change in this file at this point is JAVA_HOME, which specifies the path to the Java 1.7.x installation used by Hadoop.
  • core-site.xml – key property fs.default.name, for the NameNode configuration, e.g. hdfs://namenode/
  • hdfs-site.xml – key property dfs.replication, 3 by default
  • mapred-site.xml – key property mapred.job.tracker, for the JobTracker configuration, e.g. jobtracker:8021

We will first start with the master (NameNode) and then copy the above XML changes to the remaining 3 nodes (SNN and slaves).

Finally, in section 1.6.2 we will have to configure conf/masters and conf/slaves.

  • masters – defines the machines on which Hadoop will start SecondaryNameNodes in our multi-node cluster.
  • slaves – defines the list of hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will run.

Let’s go over them one by one, starting with the master (NameNode).

hadoop-env.sh

$ vi $HADOOP_CONF/hadoop-env.sh

Add JAVA_HOME as shown below and save the changes.
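Based on the JAVA_HOME path we set in .bashrc earlier, the line to set in hadoop-env.sh (typically there is a commented-out JAVA_HOME export you can uncomment and edit) is:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle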


core-site.xml

This file contains configuration settings for Hadoop Core (e.g. I/O) that are common to HDFS and MapReduce. The default file system configuration property, fs.default.name, goes here; it could be, for example, hdfs or s3, and it will be used by clients.

$ vi $HADOOP_CONF/core-site.xml

We are going to add two properties

  • fs.default.name – will point to the NameNode URL and port (usually 8020).
  • hadoop.tmp.dir – a base for other temporary directories. It is important to note that every node needs a Hadoop tmp directory. I am going to create a new directory “hdfstmp” as below on all 4 nodes. Ideally you could write a small shell script to do this for you (a sketch follows the commands below), but for now I am going the manual way.

$ cd

$ mkdir hdfstmp
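A minimal sketch of such a script, assuming password-less SSH is already set up (section 1.5) and using placeholder hostnames that you would replace with your own EC2 public DNS names:

#!/bin/bash
# Create the Hadoop temp directory on every node of the cluster.
# The hostnames below are placeholders; replace them with your own EC2 public DNS names.
for host in ec2-master-public-dns ec2-snn-public-dns ec2-slave1-public-dns ec2-slave2-public-dns
do
  ssh ubuntu@$host "mkdir -p /home/ubuntu/hdfstmp"
done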

Quick Tip: Some of the important directories are dfs.name.dir and dfs.data.dir in hdfs-site.xml. The default value for dfs.name.dir is ${hadoop.tmp.dir}/dfs/name and for dfs.data.dir it is ${hadoop.tmp.dir}/dfs/data. It is critical that you choose your directory locations wisely in a production environment.

<configuration>

<property>
<name>fs.default.name</name>
<value>hdfs://ec2-54-209-221-112.compute-1.amazonaws.com:8020</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/home/ubuntu/hdfstmp</value>
</property>

</configuration>

hdfs-site.xml

This file contains the configuration for the HDFS daemons: the NameNode, the SecondaryNameNode and the DataNodes.

We are going to add 2 properties:

  • dfs.permissions with value false. This means that any user, not just the “hdfs” user, can do anything they want to HDFS, so do not do this in production unless you have a very good reason. If “true”, permission checking in HDFS is enabled; if “false”, permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories. Be very careful before you set this.
  • dfs.replication – the default block replication is 3. The actual number of replications can be specified when the file is created; the default is used if replication is not specified at create time. Since we have 2 slave nodes, we will set this value to 2.

<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>


mapred-site.xml

This file contains the configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers.

The mapred.job.tracker parameter is a hostname (or IP address) and port pair on which the JobTracker listens for RPC communication. This parameter specifies the location of the JobTracker for the TaskTrackers and MapReduce clients.

The JobTracker will be running on the master (NameNode):

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://ec2-54-209-221-112.compute-1.amazonaws.com:8021</value>
</property>
</configuration>

1.6.1 Move configuration files to Slaves

Now that we are done with the Hadoop XML file configuration on the master, let’s copy the files to the remaining 3 nodes using secure copy (scp).

Start with the SNN. If you are starting a new session, follow the ssh-add steps from section 1.5.

From the master’s Unix shell (in the $HADOOP_CONF directory), issue the command below:

$ scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml ubuntu@ec2-54-209-221-47.compute-1.amazonaws.com:/home/ubuntu/hadoop/conf

Repeat this for the slave nodes, substituting each slave’s public DNS name.
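If you would rather not repeat the command by hand, a small loop from the master does the same thing; the hostnames below are placeholders for your own slave public DNS names:

for host in ec2-slave1-public-dns ec2-slave2-public-dns
do
  scp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml ubuntu@$host:/home/ubuntu/hadoop/conf
done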


1.6.2 Configure Master and Slaves

Every Hadoop distribution comes with masters and slaves files. By default they contain one entry for localhost. We have to modify these 2 files on both the “masters” (HadoopNameNode) and “slaves” (HadoopSlave1 and HadoopSlave2) machines – we have a dedicated machine for HadoopSecondaryNameNode.


1.6.3 Modify masters file on Master machine

The conf/masters file defines on which machines Hadoop will start SecondaryNameNode daemons in our multi-node cluster. In our case, there will be two machines: HadoopNameNode and HadoopSecondaryNameNode.

From the Hadoop HDFS user guide: “The secondary NameNode merges the fsimage and the edits log files periodically and keeps edits log size within a limit. It is usually run on a different machine than the primary NameNode since its memory requirements are on the same order as the primary NameNode. The secondary NameNode is started by bin/start-dfs.sh on the nodes specified in the conf/masters file.”

$ vi $HADOOP_CONF/masters

Provide an entry for each hostname where you want to run a SecondaryNameNode daemon. In our case, that is HadoopNameNode and HadoopSecondaryNameNode.
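For reference, on this cluster the masters file would simply list the public DNS names of those two machines, one per line (based on the hostnames used elsewhere in this article; yours will differ):

ec2-54-209-221-112.compute-1.amazonaws.com
ec2-54-209-221-47.compute-1.amazonaws.com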


1.6.4 Modify the slaves file on master machine

The slaves file is used for starting DataNodes and TaskTrackers.

$ vi $HADOOP_CONF/slaves
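On the master, this file lists the two slave machines, one per line (again shown with the public DNS names used elsewhere in this article; yours will differ):

ec2-54-209-223-7.compute-1.amazonaws.com
ec2-54-209-219-2.compute-1.amazonaws.com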


1.6.5 Copy masters and slaves to SecondaryNameNode

Since the SecondaryNameNode configuration will be the same as the NameNode’s, we need to copy the masters and slaves files to HadoopSecondaryNameNode.
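From the $HADOOP_CONF directory on the master, this can be done with scp along the lines of the earlier copy (the SNN hostname is the one used in section 1.6.1; yours will differ):

$ scp masters slaves ubuntu@ec2-54-209-221-47.compute-1.amazonaws.com:/home/ubuntu/hadoop/conf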


1.6.6 Configure masters and slaves on the slave nodes

Since we are configuring the slaves (HadoopSlave1 & HadoopSlave2), the masters file on a slave machine is going to be empty:

$ vi $HADOOP_CONF/masters


Next, update the ‘slaves’ file on the slave server (HadoopSlave1) with the address of the slave node itself. Notice that the ‘slaves’ file on a slave node contains only its own address and not that of any other DataNode in the cluster.

$ vi $HADOOP_CONF/slaves
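For HadoopSlave1, that single entry is its own address (shown here as the public DNS name used elsewhere in this article; yours will differ):

ec2-54-209-223-7.compute-1.amazonaws.com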


Similarly, update the masters and slaves files for HadoopSlave2.

1.7 Hadoop Daemon Startup

The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster. You need to do this the first time you set up a Hadoop installation. Do not format a running Hadoop filesystem; this will cause all your data to be erased.

To format the NameNode:

$ hadoop namenode -format


Let’s start all the Hadoop daemons from HadoopNameNode:

$ cd $HADOOP_CONF

$ start-all.sh

This will start:

  • NameNode, JobTracker and SecondaryNameNode daemons on HadoopNameNode


  • SecondaryNameNode daemons on HadoopSecondaryNameNode


  • and DataNode and TaskTracker daemons on slave nodes HadoopSlave1 and HadoopSlave2

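To double-check which daemons are actually running on a given node, you can run the jps command (it ships with the JDK) on that node. For example, on the master you would expect output roughly like the following; the process IDs are illustrative only:

$ jps
2539 NameNode
2745 SecondaryNameNode
2829 JobTracker
3105 Jps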

We can check the NameNode status at http://ec2-54-209-221-112.compute-1.amazonaws.com:50070/dfshealth.jsp

Check the JobTracker status at http://ec2-54-209-221-112.compute-1.amazonaws.com:50030/jobtracker.jsp

Slave node status for HadoopSlave1: http://ec2-54-209-223-7.compute-1.amazonaws.com:50060/tasktracker.jsp

Slave node status for HadoopSlave2: http://ec2-54-209-219-2.compute-1.amazonaws.com:50060/tasktracker.jsp

To quickly verify our setup, run the Hadoop pi example:

ubuntu@ec2-54-209-221-112:~/hadoop$ hadoop jar hadoop-examples-1.2.1.jar pi 10 1000000

Number of Maps  = 10
Samples per Map = 1000000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
14/01/13 15:44:12 INFO mapred.FileInputFormat: Total input paths to process : 10
14/01/13 15:44:13 INFO mapred.JobClient: Running job: job_201401131425_0001
14/01/13 15:44:14 INFO mapred.JobClient:  map 0% reduce 0%
14/01/13 15:44:32 INFO mapred.JobClient:  map 20% reduce 0%
14/01/13 15:44:33 INFO mapred.JobClient:  map 40% reduce 0%
14/01/13 15:44:46 INFO mapred.JobClient:  map 60% reduce 0%
14/01/13 15:44:56 INFO mapred.JobClient:  map 80% reduce 0%
14/01/13 15:44:58 INFO mapred.JobClient:  map 100% reduce 20%
14/01/13 15:45:03 INFO mapred.JobClient:  map 100% reduce 33%
14/01/13 15:45:06 INFO mapred.JobClient:  map 100% reduce 100%
14/01/13 15:45:09 INFO mapred.JobClient: Job complete: job_201401131425_0001
14/01/13 15:45:09 INFO mapred.JobClient: Counters: 30
14/01/13 15:45:09 INFO mapred.JobClient:   Job Counters
14/01/13 15:45:09 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/13 15:45:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=145601
14/01/13 15:45:09 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/13 15:45:09 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/01/13 15:45:09 INFO mapred.JobClient:     Launched map tasks=10
14/01/13 15:45:09 INFO mapred.JobClient:     Data-local map tasks=10
14/01/13 15:45:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=33926
14/01/13 15:45:09 INFO mapred.JobClient:   File Input Format Counters
14/01/13 15:45:09 INFO mapred.JobClient:     Bytes Read=1180
14/01/13 15:45:09 INFO mapred.JobClient:   File Output Format Counters
14/01/13 15:45:09 INFO mapred.JobClient:     Bytes Written=97
14/01/13 15:45:09 INFO mapred.JobClient:   FileSystemCounters
14/01/13 15:45:09 INFO mapred.JobClient:     FILE_BYTES_READ=226
14/01/13 15:45:09 INFO mapred.JobClient:     HDFS_BYTES_READ=2740
14/01/13 15:45:09 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=622606
14/01/13 15:45:09 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=215
14/01/13 15:45:09 INFO mapred.JobClient:   Map-Reduce Framework
14/01/13 15:45:09 INFO mapred.JobClient:     Map output materialized bytes=280
14/01/13 15:45:09 INFO mapred.JobClient:     Map input records=10
14/01/13 15:45:09 INFO mapred.JobClient:     Reduce shuffle bytes=280
14/01/13 15:45:09 INFO mapred.JobClient:     Spilled Records=40
14/01/13 15:45:09 INFO mapred.JobClient:     Map output bytes=180
14/01/13 15:45:09 INFO mapred.JobClient:     Total committed heap usage (bytes)=2039111680
14/01/13 15:45:09 INFO mapred.JobClient:     CPU time spent (ms)=9110
14/01/13 15:45:09 INFO mapred.JobClient:     Map input bytes=240
14/01/13 15:45:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1560
14/01/13 15:45:09 INFO mapred.JobClient:     Combine input records=0
14/01/13 15:45:09 INFO mapred.JobClient:     Reduce input records=20
14/01/13 15:45:09 INFO mapred.JobClient:     Reduce input groups=20
14/01/13 15:45:09 INFO mapred.JobClient:     Combine output records=0
14/01/13 15:45:09 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1788379136
14/01/13 15:45:09 INFO mapred.JobClient:     Reduce output records=0
14/01/13 15:45:09 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=10679681024
14/01/13 15:45:09 INFO mapred.JobClient:     Map output records=20
Job Finished in 57.825 seconds
Estimated value of Pi is 3.14158440000000000000

You can check the JobTracker status page to look at the complete job status.

Drill down into the completed job and you can see more details on the Map and Reduce tasks.

At last, do not forget to terminate your Amazon EC2 instances or you will continue to get charged.
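If you have the AWS CLI installed and configured, the instances can also be terminated from the shell; the instance IDs below are placeholders for your own:

$ aws ec2 terminate-instances --instance-ids i-11111111 i-22222222 i-33333333 i-44444444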
That’s it for this article. I hope you found it useful.

Happy Hadoop Year!
