Platinum Partner
bigdata,apache,amazon ec2,apache hadoop,big data

How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 1

Learn how to set up a four node Hadoop cluster using AWS EC2, PuTTy(gen), and WinSCP.

After spending some time playing around on Single-Node pseudo-distributed cluster, it's time to get into real world hadoop. Depending on what works best – Its important to note that there are multiple ways to achieve this and I am going to cover how to setup multi-node hadoop cluster on Amazon EC2. We are going to setup 4 node hadoop cluster as below.

  • NameNode (Master)
  • SecondaryNameNode
  • DataNode (Slave1)
  • DataNode (Slave2)

Here’s what you will need

  1. Amazon AWS Account
  2. PuTTy Windows Client (to connect to Amazon EC2 instance)
  3. PuTTYgen (to generate private key – this will be used in putty to connect to EC2 instance)
  4. WinSCP (secury copy)

This will be a two part series

In Part-1 I will cover infrastructure side as below

  1. Setting up Amazon EC2 Instances
  2. Setting up client access to Amazon Instances (using Putty.)
  3. Setup WinSCP access to EC2 instances

In Part-2 I will cover the hadoop multi node cluster installation

  1. Hadoop Multi-Node Installation and setup

1. Setting up Amazon EC2 Instances

With 4 node clusters and minimum volume size of 8GB there would be an average $2 of charge per day with all 4 running instances. You can stop the instance anytime to avoid the charge, but you will loose the public IP and host and restarting the instance will create new ones,. You can also terminate your Amazon EC2 instance anytime and by default it will delete your instance upon termination, so just be careful what you are doing.

1.1 Get Amazon AWS Account

If you do not already have a account, please create a new one. I already have AWS account and going to skip the sign-up process. Amazon EC2 comes with eligible free-tier instances.

Screen shot 2014-01-10 at 1.33.28 PM

1.2 Launch Instance

Once you have signed up for Amazon account. Login to Amazon Web Services, click on My Account and navigate to Amazon EC2 Console

Launch_Instance

1.3 Select AMI

I am picking Ubuntu Server 12.04.3  Server 64-bit OS

Step_1_choose_AMI1.4 Select Instance Type

Select the micro instance

Step_2_Instance_Type

1.5 Configure Number of Instances

As mentioned we are setting up 4 node hadoop cluster, so please enter 4 as number of instances. Please check Amazon EC2 free-tier requirements, you may setup 3 node cluster with < 30GB storage size to avoid any charges.  In production environment you want to have SecondayNameNode as separate machine

Step_3_Instance_Details

1.6 Add Storage

Minimum volume size is 8GB

Step_4_Add_Storage

1.7 Instance Description

Give your instance name and description

Step_5_Instance_description

1.8 Define a Security Group

Create a new security group, later on we are going to modify the security group with security rules.

Step_6_Security_Group

1.9 Launch Instance and Create Security Pair

Review and Launch Instance.

Amazon EC2 uses public–key cryptography to encrypt and decrypt login information. Public–key cryptography uses a public key to encrypt a piece of data, such as a password, then the recipient uses the private key to decrypt the data. The public and private keys are known as a key pair.

Create a new keypair and give it a name “hadoopec2cluster” and download the keypair (.pem) file to your local machine. Click Launch Instance

hadoopec2cluster_keypair

1.10 Launching Instances

Once you click “Launch Instance” 4 instance should be launched with “pending” state

launching_instance

Once in “running” state we are now going to rename the instance name as below.

  1. HadoopNameNode (Master)
  2. HadoopSecondaryNameNode
  3. HadoopSlave1 (data node will reside here)
  4. HaddopSlave2  (data node will reside here)

running_instances

Please note down the Instance ID, Public DNS/URL (ec2-54-209-221-112.compute-1.amazonaws.com)  and Public IP for each instance for your reference.. We will need it later on to connect from Putty client.  Also notice we are using “HadoopEC2SecurityGroup”.

Public_DNS_IP_instance_id

You can use the existing group or create a new one. When you create a group with default options it add a rule for SSH at port 22.In order to have TCP and ICMP access we need to add 2 additional security rules. Add ‘All TCP’, ‘All ICMP’ and ‘SSH (22)’ under the inbound rules to “HadoopEC2SecurityGroup”. This will allow ping, SSH, and other similar commands among servers and from any other machine on internet. Make sure to “Apply Rule changes” to save your changes.

These protocols and ports are also required to enable communication among cluster servers. As this is a test setup we are allowing access to all for TCP, ICMP and SSH and not bothering about the details of individual server port and security.

Security_Group_Rule

2. Setting up client access to Amazon Instances

Now, lets make sure we can connect to all 4 instances.For that we are going to use Putty client We are going setup password-less SSH access among servers to setup the cluster. This allows remote access from Master Server to Slave Servers so Master Server can remotely start the Data Node and Task Tracker services on Slave servers.

We are going to use downloaded hadoopec2cluster.pem file to generate the private key (.ppk). In order to generate the private key we need Puttygen client. You can download the putty and puttygen and various utilities in zip from here.

2.1 Generating Private Key

Let’s launch PUTTYGEN client and import the key pair we created during launch instance step – “hadoopec2cluster.pem”

Navigate to Conversions and “Import Key”

import_key

load_private_keyOnce you import the key You can enter passphrase to protect your private key or leave the passphrase fields blank to use the private key without any passphrase. Passphrase protects the private key from any unauthorized access to servers using your machine and your private key.

Any access to server using passphrase protected private key will require the user to enter the passphrase to enable the private key enabled access to AWS EC2 server.

2.2 Save Private Key

Now save the private key by clicking on “Save Private Key” and click “Yes” as we are going to leave passphrase empty.

Save_PrivateKey

Save the .ppk file and give it a meaningful name

save_ppk_file

Now we are ready to connect to our Amazon Instance Machine for the first time.

2.3 Connect to Amazon Instance

Let’s connect to HadoopNameNode first. Launch Putty client, grab the public URL , import the .ppk private key that we just created for password-less SSH access. As per amazon documentation, for Ubuntu machines username is “ubuntu”

2.3.1 Provide private key for authentication

connect_to_HadoopNameNode3

2.3.2 Hostname and Port and Connection Type

and “Open” to launch putty session

connect_to_HadoopNameNode4

when you launch the session first time, you will see below message, click “Yes”connect_to_HadoopNameNode

and will prompt you for the username, enter ubuntu, if everything goes well you will be presented welcome message with Unix shell at the end.

connect_to_HadoopNameNode2

If there is a problem with your key, you may receive below error messageputty_error_message

Similarly connect to remaining 3 machines HadoopSecondaryNameNode, HaddopSlave1,HadoopSlave2 respectively to make sure you can connect successfully.

4_connnected_amazon_instances

2.4 Enable Public Access

Issue ifconfig command and note down the ip address. Next, we are going to update the hostname with ec2 public URL and finally we are going to update /etc/hosts file to map  the ec2 public URL with ip address. This will help us to configure master ans slaves nodes with hostname instead of ip address.

Following is the output on HadoopNameNode ifconfig

ifconfig

host_name_ip_address_mapping

now, issue the hostname command, it will display the ip address same as inet address from ifconfig command.

hostname_by_ip

We need to modify the hostname to ec2 public URL with below command

prebuffer_0nbsp;sudo hostname ec2-54-209-221-112.compute-1.amazonaws.com

update_hostname

2.5 Modify /etc/hosts

Lets change the host to EC2 public IP and hostname.

Open the /etc/hosts in vi, in a very first line it will show 127.0.0.1 localhost, we need to replace that with amazon ec2 hostname and ip address we just collected.

open_etc_hosts_in_vi

Modify the file and save your changes

modify_vi_hosts

Repeat 2.3 and 2.4 sections for remaining 3 machines.

3. Setup WinSCP access to EC2 instances

In order to securely transfer files from your windows machine to Amazon EC2 WinSCP is a handy utility.

Provide hostname, username and private key file and save your configuration and Login

winscp

winscp_error

If you see above error, just ignore and you upon successful login you will see unix file system of a logged in user /home/ubuntu your Amazon EC2 Ubuntu machine.

winscp_view

Upload the .pem file to master machine (HadoopNameNode). It will be used while connecting to slave nodes during hadoop startup daemons.

If you have made this far, good work! Things will start to get more interesting in Part-2.

Published at DZone with permission of {{ articles[0].authors[0].realName }}, DZone MVB. (source)

Opinions expressed by DZone contributors are their own.

{{ tag }}, {{tag}},

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}
{{ parent.authors[0].realName || parent.author}}

{{ parent.authors[0].tagline || parent.tagline }}

{{ parent.views }} ViewsClicks
Tweet

{{parent.nComments}}