Hadoop Hangover: Launch a Hadoop Cluster CDH4 Using Apache Whirr

This post explains how to launch a CDH4 MRv1 or CDH4 YARN cluster on EC2 instances. It is often said that Whirr lets you launch a cluster in a matter of 5 minutes. That is true if, and only if, everything works out well! ;)

Hopefully, this article helps you in that regard.
So, let's row the boat...

  • Download the stable version of Apache Whirr, i.e. whirr-0.8.1.tar.gz
  • Extract the tarball
  • $ tar -xzvf whirr-0.8.1.tar.gz
    $ cd whirr-0.8.1
  • Generate the key
  • $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
  • Make a properties file to launch the cluster with the following configuration.
  • # Cluster name goes here
    whirr.cluster-name=testcluster
     
    # Change the number of machines in the cluster here
    # Using 3 DataNode+TaskTracker machines and 1 JobTracker+NameNode machine
    # Ganglia monitoring is configured on every machine
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode+ganglia-monitor+ganglia-metad,3 hadoop-datanode+hadoop-tasktracker+ganglia-monitor
     
    # Install Java (the key is set twice below; keep only the line for the JDK you want)
    whirr.java.install-function=install_openjdk
    whirr.java.install-function=install_oab_java
     
    ## Install CDH4 MRv1
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop
    whirr.env.REPO=cdh4
     
    # For EC2 set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
    whirr.provider=aws-ec2
    whirr.hardware-id=c1.xlarge
     
    # Credentials should go here
    whirr.identity=XXXXXXXXXXXXXXXXX
    whirr.credential=XXXXXXXXXXXXXXXXXXXX
    whirr.cluster-user=whirr
    whirr.private-key-file=/home/ubuntu/.ssh/yourKey
    whirr.public-key-file=/home/ubuntu/.ssh/yourKey.pub
    
  • Now let me tell you how to avoid getting headaches!
    • Cluster name: keep your cluster name simple. Avoid names like testCluster or testCluster1, i.e. no capitals and no numerals.
    • Choose the number of datanodes judiciously.
    • Your launch may not succeed if Java is not installed, so make sure the image has Java. This properties file takes care of that.
    • It is a good idea to go ahead with MRv1 for now and switch to MRv2 later, once a production-stable release is available.
    • This is the minimal set of configurations for launching a Hadoop cluster, but you can do a lot of performance tuning on top of it.
    • I launched this cluster from an EC2 instance. Initially I faced errors regarding the user; setting the configuration below solved the problem.
    • whirr.cluster-user=whirr
    • Set proper permissions for ~/.ssh and the whirr-0.8.1 folder before launching.
  • Well, we are ready to launch the cluster. Name the properties file "whirr_cdh.properties".
  • $ cd whirr-0.8.1
    $ bin/whirr launch-cluster --config whirr_cdh.properties
    In the console you can see links to the NameNode and JobTracker web UIs. At the end, it also prints how to ssh to the instances.
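
Two of the headaches above (the cluster name and the permissions) are easy to check before you launch. Here is a minimal sketch of such a check. The function names valid_cluster_name and secure_ssh_files are mine, not part of Whirr, and the "lowercase letters only" rule is simply the advice from this post turned into a test, not a rule taken from the Whirr documentation.

```shell
#!/bin/sh
# Pre-launch checks; a sketch, not part of Whirr itself.

# Return 0 only if the name consists of lowercase ASCII letters
# (no capitals, no numerals, not empty), as advised in this post.
valid_cluster_name() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z]+$'
}

# Restrict the SSH directory to its owner (700) and the private key
# to owner read/write (600). Both paths are supplied by the caller.
secure_ssh_files() {
    chmod 700 "$1" && chmod 600 "$2"
}

# Example usage with the paths used in this post:
#   valid_cluster_name testcluster || echo "pick a simpler cluster name"
#   secure_ssh_files ~/.ssh ~/.ssh/id_rsa_whirr
```

If the first check fails, fix whirr.cluster-name in the properties file before running launch-cluster.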

  • You should now have the generated files in ~/.whirr/testcluster: instances, hadoop-proxy.sh and hadoop-site.xml
  • Start the proxy
  • $ sh ~/.whirr/testcluster/hadoop-proxy.sh
  • Open another terminal and run the following commands; you should then be able to access HDFS.
  • $ export HADOOP_CONF_DIR=~/.whirr/testcluster
    $ hadoop fs -ls /
  • Alternatively, you can download the Hadoop tarball and run, from the extracted directory,
  • $ bin/hadoop --config ~/.whirr/testcluster fs -ls /
  • Okay! So I know that you will not be satisfied unless you see a web UI.
  • Now, launch Firefox (version 3.0 or later)
  • Download the FoxyProxy extension by clicking this link.
  • Steps to configure and access the UI
  • Select Tools > FoxyProxy > Options
  • Click the “Add New Proxy” button.
  • Select “Manual Proxy Configuration”
  • Enter “localhost” for the “Host or IP Address” field.
  • Enter “6666” for the “Port” field.
  • Click on the “General” tab at the top of the dialog box.
  • Enter “EC2” for the “Proxy Name” field.
  • Click on the “URL Patterns” tab at the top of the dialog box.
  • Click the “Add New Pattern” button.
  • Enter “EC2” for the “Pattern Name” field.
  • Enter “*compute-1.amazonaws.com*, *.ec2.internal*, *.compute-1.internal*” for the “URL pattern” field (not case sensitive)
  • Select the “Whitelist” and “Wildcards” radio buttons.
  • Click the “OK” button to dismiss the new URL pattern dialog box.
  • Click the “OK” button to dismiss the new proxy dialog box.
  • Completely disable FoxyProxy for now.
  • After closing the dialog you should see two proxy names, Default and EC2.
  • Click “Use proxy EC2 for all URLs” in the FoxyProxy pop-up menu.
  • Copy the JobTracker URL (shown while the proxy is running, ec2-***-**-***-**.********.amazonaws.com) and paste it into the browser.
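
If the page does not load, it helps to test the tunnel without the browser. The sketch below assumes that hadoop-proxy.sh opens a SOCKS proxy on localhost:6666 (the host and port entered in FoxyProxy above) and that the JobTracker web UI listens on its default MRv1 port, 50030. The function names and the JT_HOST variable are hypothetical; use the hostname printed by launch-cluster.

```shell
#!/bin/sh
# Check the proxy from the command line; a sketch under the assumptions above.

PROXY_HOST=localhost   # same host as in the FoxyProxy settings
PROXY_PORT=6666        # same port as in the FoxyProxy settings
JT_PORT=50030          # default MRv1 JobTracker web UI port (assumed not overridden)

# Build the JobTracker web UI URL for a given public hostname.
jobtracker_url() {
    printf 'http://%s:%s/\n' "$1" "$JT_PORT"
}

# Print only the HTTP status code of the JobTracker page, fetched through
# the SOCKS proxy. --socks5-hostname lets the proxy side resolve the name.
check_jobtracker() {
    curl --silent --output /dev/null --write-out '%{http_code}\n' \
         --socks5-hostname "$PROXY_HOST:$PROXY_PORT" \
         "$(jobtracker_url "$1")"
}

# Usage, with the proxy running in another terminal:
#   check_jobtracker "$JT_HOST"
```

If curl cannot connect at all, the proxy is probably not running. If curl gets a response but Firefox does not, the FoxyProxy pattern is the more likely culprit.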

So, we are good to go! 
  • If you want to launch MRv2 (YARN), use this properties file instead.
  • ## Cluster name goes here.
    whirr.cluster-name=yarncluster
    # Change the number of machines in the cluster here
    whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,2 hadoop-datanode+yarn-nodemanager
    
    # Install Java (the key is set twice below; keep only the line for the JDK you want)
    whirr.java.install-function=install_openjdk
    whirr.java.install-function=install_oab_java
    
    ## Install CDH4 YARN
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop
    whirr.yarn.configure-function=configure_cdh_yarn
    whirr.yarn.start-function=start_cdh_yarn
    whirr.mr_jobhistory.start-function=start_cdh_mr_jobhistory
    whirr.env.REPO=cdh4
    whirr.env.MAPREDUCE_VERSION=2
    
    # For EC2 set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
    whirr.provider=aws-ec2
    whirr.hardware-id=c1.xlarge
    
    # Credentials should go here
    whirr.identity=XXXXXXXXXXXXXXXXX
    whirr.credential=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    whirr.cluster-user=whirr
    whirr.private-key-file=/home/ubuntu/.ssh/yourKey
    whirr.public-key-file=/home/ubuntu/.ssh/yourKey.pub
  • The rest of the process is the same!
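
Since the generated files live under ~/.whirr/ followed by the cluster name, the only thing that changes between the two clusters is that directory. Here is a small sketch that reads the name from the properties file so you do not have to type it again. The helper names and the file name whirr_yarn.properties are hypothetical, and I am assuming the YARN launch generates hadoop-proxy.sh in the same way as the MRv1 one.

```shell
#!/bin/sh
# Derive the per-cluster directory from a Whirr properties file; a sketch.

# Print the value of whirr.cluster-name. If the key is repeated, this
# sketch takes the last one; that tie-break is my choice, not Whirr's.
cluster_name_from() {
    sed -n 's/^whirr\.cluster-name=//p' "$1" | tail -n 1
}

# Print the directory that holds instances, hadoop-proxy.sh and
# hadoop-site.xml for the cluster described by the properties file.
cluster_dir_from() {
    printf '%s/.whirr/%s\n' "$HOME" "$(cluster_name_from "$1")"
}

# Usage, run from the whirr-0.8.1 directory:
#   bin/whirr launch-cluster --config whirr_yarn.properties
#   sh "$(cluster_dir_from whirr_yarn.properties)/hadoop-proxy.sh"
```
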
Happy Learning! :)
