Install a Hadoop Cluster on Ubuntu 18.04.1

We explore how to install a multi-node Hadoop 3.1.1 cluster on your Ubuntu-based system. Read on to get started with this powerful big data framework!

In this post, I will be installing a three-node Hadoop Cluster on Ubuntu 18.04.1 including HDFS.

In a previous post called Install Hadoop on Ubuntu 17.10, I walked through how to install a single node Hadoop server.

This post will go further by installing a multi-node Hadoop 3.1.1 cluster on Ubuntu 18.04.1.

Getting Started

To begin, we will need three virtual machines created for our cluster.

We will create a Hadoop Master server with 4 vCPUs, 4 GB of memory, and 40 GB of hard drive space.

We will also create two Hadoop Nodes with 4 vCPUs, 8 GB of memory, and 40 GB of hard drive space for each node.

For this article, I installed Ubuntu Server 18.04.1 on all three servers, installed all updates and rebooted.

Also make sure that you configure each server with a static IP address, and either set up internal DNS resolution or add each server to the /etc/hosts file.
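
If you go the /etc/hosts route, the entries might look something like the sketch below. Only the hadoop1 address (192.168.1.35) appears in this post; the other two addresses are placeholders I made up, so swap in your own.

```shell
# Print /etc/hosts entries for the three servers.
# Only 192.168.1.35 (hadoop1) comes from this post; the other two
# addresses are placeholders and must be replaced with your own.
hosts_entries() {
  cat <<'EOF'
192.168.1.35 hadoop1.admintome.lab hadoop1
192.168.1.36 hadoop2.admintome.lab hadoop2
192.168.1.37 hadoop3.admintome.lab hadoop3
EOF
}

# Review the output first. To apply it, run as root on every server:
#   hosts_entries >> /etc/hosts
hosts_entries
```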

We are now ready to prepare our servers for running Hadoop.

Preparing the Hadoop Servers

You will need to perform the steps in this section on all your servers.

First we need to install Java 8, which Hadoop 3.1.1 requires. This post uses Oracle Java 8 from the webupd8team PPA.

# add-apt-repository ppa:webupd8team/java
# apt update
# apt install -y oracle-java8-set-default

Accept the license terms.

Next, download the Hadoop binaries:

# wget http://apache.claz.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz

Untar the archive and move it to /usr/local/hadoop:

# tar -xzvf hadoop-3.1.1.tar.gz
# mv hadoop-3.1.1 /usr/local/hadoop

Next, we need to update our default environment variables to include the JAVA_HOME and Hadoop binary directories.

First we need to know where Java was installed to. Run the following command to find out.

# update-alternatives --display java
java - manual mode
  link best version is /usr/lib/jvm/java-8-oracle/jre/bin/java
  link currently points to /usr/lib/jvm/java-8-oracle/jre/bin/java
  link java is /usr/bin/java
  slave java.1.gz is /usr/share/man/man1/java.1.gz
/usr/lib/jvm/java-8-oracle/jre/bin/java - priority 1081
  slave java.1.gz: /usr/lib/jvm/java-8-oracle/man/man1/java.1.gz

As you can see JAVA_HOME should be set to /usr/lib/jvm/java-8-oracle/jre.

Open /etc/environment and update the PATH line to include the Hadoop binary directories.

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"

Also add a line for the JAVA_HOME variable.

JAVA_HOME="/usr/lib/jvm/java-8-oracle/jre"

Make sure the directory matches the output from update-alternatives above minus the bin/java part.
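
If you would rather not trim the path by hand, a small helper can do it. This is only a sketch; the function name java_home_from_binary is my own and not part of any tool.

```shell
# Strip the trailing /bin/java from a Java binary path to get JAVA_HOME.
# java_home_from_binary is a hypothetical helper name.
java_home_from_binary() {
  printf '%s\n' "${1%/bin/java}"
}

# Path taken from the update-alternatives output above.
java_home_from_binary /usr/lib/jvm/java-8-oracle/jre/bin/java

# On a live server, resolve the symlink chain first
# (readlink -f is available in GNU coreutils, which Ubuntu ships):
#   java_home_from_binary "$(readlink -f "$(command -v java)")"
```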

Next, we will add a hadoop user and give them the correct permissions.

# adduser hadoop
# usermod -aG hadoop hadoop
# chown hadoop:root -R /usr/local/hadoop
# chmod g+rwx -R /usr/local/hadoop

Log in as the hadoop user and generate an SSH key. You only need to complete this step on the Hadoop Master.

# su - hadoop
$ ssh-keygen -t rsa

Accept all the defaults for ssh-keygen.

Now log in as the hadoop user and copy the SSH key to all Hadoop Nodes. Again, you only need to complete this step on the Hadoop Master.

# su - hadoop
$ ssh-copy-id hadoop@hadoop1.admintome.lab
$ ssh-copy-id hadoop@hadoop2.admintome.lab
$ ssh-copy-id hadoop@hadoop3.admintome.lab
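
Before moving on, it can help to confirm that key-based login really works for every host. The sketch below only prints the check commands so you can review them first; the function name is my own, and the host names are the ones used in this post.

```shell
# Hosts from this post; adjust to your own names.
NODES="hadoop1.admintome.lab hadoop2.admintome.lab hadoop3.admintome.lab"

# Print one ssh check command per node.
# BatchMode=yes makes ssh fail instead of prompting for a password,
# so a failure means the key was not copied. -n keeps ssh off stdin.
ssh_check_commands() {
  for node in $NODES; do
    printf 'ssh -n -o BatchMode=yes hadoop@%s true\n' "$node"
  done
}

# Review the commands, then run them as the hadoop user on the Hadoop Master:
#   ssh_check_commands | sh
ssh_check_commands
```

Each command should return without any output or password prompt.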

Configuring the Hadoop Master

Open the /usr/local/hadoop/etc/hadoop/core-site.xml file and enter the following:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop1.admintome.lab:9000</value>
  </property>
</configuration>

Save and exit.

Next, open the /usr/local/hadoop/etc/hadoop/hdfs-site.xml file and add the following:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

Save and exit.

Open the /usr/local/hadoop/etc/hadoop/workers file and add these two lines (one for each of your Hadoop Nodes):

hadoop2.admintome.lab
hadoop3.admintome.lab

Save and exit.

Copy the configuration files to each of your Hadoop Nodes from your Hadoop Master.

$ scp /usr/local/hadoop/etc/hadoop/* hadoop2.admintome.lab:/usr/local/hadoop/etc/hadoop/
$ scp /usr/local/hadoop/etc/hadoop/* hadoop3.admintome.lab:/usr/local/hadoop/etc/hadoop/
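
With more than a couple of nodes, typing one scp line per host gets tedious. As a rough sketch, you could generate the commands from the workers file instead. The function name scp_commands is my own invention, and it assumes the workers file holds one host name per line.

```shell
# Print one scp command per host listed in a workers file.
# Blank lines are skipped; a last line without a newline is still read.
scp_commands() {
  workers_file=$1
  conf_dir=/usr/local/hadoop/etc/hadoop
  while IFS= read -r host || [ -n "$host" ]; do
    [ -n "$host" ] || continue
    printf 'scp %s/* %s:%s/\n' "$conf_dir" "$host" "$conf_dir"
  done < "$workers_file"
}

# Usage on the Hadoop Master as the hadoop user:
#   scp_commands /usr/local/hadoop/etc/hadoop/workers > copy-conf.sh
#   sh copy-conf.sh
```

Writing the commands to a file first gives you a chance to check them before anything is copied.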

Format the HDFS file system:

$ source /etc/environment
$ hdfs namenode -format

Now you can start HDFS:

hadoop@hadoop1:~$ start-dfs.sh
Starting namenodes on [hadoop1.admintome.lab]
Starting datanodes
Starting secondary namenodes [hadoop1]
hadoop@hadoop1:~$

Validate that everything started right by running the jps command as the Hadoop user on all your Hadoop servers.

On the Hadoop Master you should see this.

hadoop@hadoop1:~$ jps
13634 Jps
13478 SecondaryNameNode
13174 NameNode

and on each of your Hadoop Nodes you should see:

hadoop@hadoop2:~$ jps
8672 Jps
8579 DataNode
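
If you want to script this check rather than eyeball it, something along these lines could work. It is only a sketch: has_daemons is a name I made up, and it simply looks for exact daemon names in the second column of jps output.

```shell
# Succeed only if every expected daemon name appears in the
# jps-style output read from stdin (second column, exact match).
has_daemons() {
  jps_output=$(cat)
  for daemon in "$@"; do
    printf '%s\n' "$jps_output" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' || return 1
  done
}

# Sample output copied from the Hadoop Master above.
printf '13634 Jps\n13478 SecondaryNameNode\n13174 NameNode\n' |
  has_daemons NameNode SecondaryNameNode && echo "master OK"

# On a live Hadoop Node:
#   jps | has_daemons DataNode && echo "node OK"
```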

HDFS Web UI

You can now access the HDFS web UI by browsing to port 9870 on your Hadoop Master.

http://hadoop1.admintome.lab:9870

You should see the UI:

As you can see we have almost 60 GB free on our HDFS file system.

Starting Yarn

Now that HDFS is running we are ready to start the Yarn scheduler.

Hadoop, on its own, cannot schedule any jobs, so we need to run Yarn to schedule jobs on our Hadoop cluster.

First, export the following environment variables in the hadoop user's shell:

export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME

On each of your Hadoop Nodes (hadoop2 and hadoop3), you will need to add these lines to /usr/local/hadoop/etc/hadoop/yarn-site.xml, inside the <configuration> element:

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop1</value>
  </property>

Save and exit the file.

Run this command to start Yarn:

$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers

We can verify that it started correctly by running this command:

$ yarn node -list
2018-08-15 04:40:26,688 INFO client.RMProxy: Connecting to ResourceManager at hadoop1.admintome.lab/192.168.1.35:8032
Total Nodes:2
         Node-Id       Node-State  Node-Http-Address  Number-of-Running-Containers
hadoop3.admintome.lab:35337          RUNNING  hadoop3.admintome.lab:8042                             0
hadoop2.admintome.lab:38135          RUNNING  hadoop2.admintome.lab:8042                             0

There are not any running containers because we haven't started any jobs yet.
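
For a quick scripted check, you could count the RUNNING nodes in that output. The sketch below assumes the column layout shown above, where the node state is the second whitespace-separated field; count_running_nodes is a name I picked for illustration.

```shell
# Count nodes whose Node-State column is RUNNING in
# `yarn node -list` output read from stdin.
count_running_nodes() {
  awk '$2 == "RUNNING" { n++ } END { print n + 0 }'
}

# Header and node lines copied from the output above.
cat <<'EOF' | count_running_nodes
Total Nodes:2
         Node-Id       Node-State  Node-Http-Address  Number-of-Running-Containers
hadoop3.admintome.lab:35337          RUNNING  hadoop3.admintome.lab:8042                             0
hadoop2.admintome.lab:38135          RUNNING  hadoop2.admintome.lab:8042                             0
EOF

# On the Hadoop Master:
#   yarn node -list | count_running_nodes
```

For this cluster the count should match the two lines in the workers file.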

Hadoop Web UI

You can view the Hadoop Web UI by going to this URL:

http://hadoop1.admintome.lab:8088/cluster

Replace the host name with your Hadoop Master's host name.

Running an Example Hadoop Job

We can now run a sample Hadoop job and schedule it on our cluster.

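The example we will run uses MapReduce to calculate pi. The two arguments in the command below are the number of map tasks (16) and the number of samples per map (1000).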

Run the following command to run the job:

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar pi 16 1000

It will take several minutes to complete.

When it is done, you should see that it has calculated pi:

Job Finished in 72.973 seconds
Estimated value of Pi is 3.14250000000000000000
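
To get a feel for what the job is computing, here is a tiny single-machine sketch of the same idea: sample points in the unit square and count how many land inside the quarter circle. Note that this uses awk's plain pseudo-random rand() with a fixed seed, which is my own simplification and not how the Hadoop example picks its sample points, so the digits will differ from the job output above.

```shell
# Estimate pi by sampling random points in the unit square and counting
# those that fall inside the quarter circle of radius 1.
# Uses awk's pseudo-random rand(), not the sampling used by the Hadoop example.
estimate_pi() {
  maps=$1
  samples=$2
  awk -v maps="$maps" -v samples="$samples" 'BEGIN {
    srand(42)                      # fixed seed so runs are repeatable
    total = maps * samples
    for (i = 0; i < total; i++) {
      x = rand(); y = rand()
      if (x * x + y * y <= 1) inside++
    }
    printf "%.4f\n", 4 * inside / total
  }'
}

# Same sample count as the job above: 16 maps x 1000 samples.
estimate_pi 16 1000
```

With only 16,000 samples the estimate is rough, which is also why the cluster job reports 3.1425 rather than something closer to 3.14159.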

Conclusion

You have now installed a Hadoop Cluster on Ubuntu 18.04.1.

I hope you have enjoyed this post. If you did then please share it and comment below.

Topics:
big data, hadoop cluster, tutorial, ubuntu

Published at DZone with permission of