Install a Hadoop Cluster on Ubuntu 18.04.1
We explore how to install a multi-node Hadoop 3.1.1 cluster on your Ubuntu-based system. Read on to get started with this powerful big data framework!
In this post, I will be installing a three-node Hadoop cluster on Ubuntu 18.04.1, including HDFS.
In a previous post, Install Hadoop on Ubuntu 17.10, I walked through how to install a single-node Hadoop server.
This post will go further by installing a multi-node Hadoop 3.1.1 cluster on Ubuntu 18.04.1.
Getting Started
To begin, we will need three virtual machines created for our cluster.
We will create a Hadoop Master server with 4 vCPUs, 4 GB of memory, and 40 GB of hard drive space.
We will also create two Hadoop Nodes with 4 vCPUs, 8 GB of memory, and 40 GB of hard drive space for each node.
For this article, I installed Ubuntu Server 18.04.1 on all three servers, installed all updates and rebooted.
Also make sure that you configure each server with a static IP address and either internal DNS resolution or an entry for each server in the /etc/hosts file (an example is shown below).
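If you go the /etc/hosts route, the entries would look something like this. The address for hadoop1 matches the one that appears later in this post; the addresses for hadoop2 and hadoop3 are placeholders, so substitute your own:
192.168.1.35 hadoop1.admintome.lab hadoop1
192.168.1.36 hadoop2.admintome.lab hadoop2
192.168.1.37 hadoop3.admintome.lab hadoop3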
We are now ready to prepare our servers for running Hadoop.
Preparing the Hadoop Servers
You will need to perform the steps in this section on all your servers.
First, we need to install Oracle Java 8 since, as of Ubuntu 18.04.1, OpenJDK 8 is no longer available.
# add-apt-repository ppa:webupd8team/java
# apt update
# apt install -y oracle-java8-set-default
Accept the license terms.
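If you are scripting this install across all three servers, you can pre-accept the Oracle license non-interactively with debconf before running apt install. This is an optional sketch that assumes the standard webupd8team installer debconf key:
# echo "oracle-java8-installer shared/accepted-oracle-license-v1-1 select true" | debconf-set-selections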
Next, download the Hadoop binaries:
# wget http://apache.claz.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
Untar the archive and move it to /usr/local/
# tar -xzvf hadoop-3.1.1.tar.gz
# mv hadoop-3.1.1 /usr/local/hadoop
Next, we need to update our default environment variables to include the JAVA_HOME and Hadoop binary directories.
First, we need to know where Java was installed. Run the following command to find out:
# update-alternatives --display java
java - manual mode
link best version is /usr/lib/jvm/java-8-oracle/jre/bin/java
link currently points to /usr/lib/jvm/java-8-oracle/jre/bin/java
link java is /usr/bin/java
slave java.1.gz is /usr/share/man/man1/java.1.gz
/usr/lib/jvm/java-8-oracle/jre/bin/java - priority 1081
slave java.1.gz: /usr/lib/jvm/java-8-oracle/man/man1/java.1.gz
As you can see, JAVA_HOME should be set to /usr/lib/jvm/java-8-oracle/jre.
Open /etc/environment and update the PATH line to include the Hadoop binary directories.
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
Also add a line for the JAVA_HOME variable.
JAVA_HOME="/usr/lib/jvm/java-8-oracle/jre"
Make sure the directory matches the output from update-alternatives above minus the bin/java part.
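Log out and back in so the new /etc/environment is applied, then verify that Java and the Hadoop binaries resolve. This is just a quick sanity check, not one of the original steps:
# echo $JAVA_HOME
/usr/lib/jvm/java-8-oracle/jre
# hadoop version
The last command should report Hadoop 3.1.1.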
Next, we will add a hadoop user and give them the correct permissions.
# adduser hadoop
# usermod -aG hadoop hadoop
# chown hadoop:root -R /usr/local/hadoop
# chmod g+rwx -R /usr/local/hadoop
Log in as the hadoop user and generate an SSH key. You only need to complete this step on the Hadoop Master.
# su - hadoop
# ssh-keygen -t rsa
Accept all the defaults for ssh-keygen.
Now, log in as the hadoop user and copy the SSH key to all Hadoop Nodes. Again, you only need to complete this step on the Hadoop Master.
# su - hadoop
$ ssh-copy-id hadoop@hadoop1.admintome.lab
$ ssh-copy-id hadoop@hadoop2.admintome.lab
$ ssh-copy-id hadoop@hadoop3.admintome.lab
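To confirm key-based login works before continuing, you can run a quick check from the Hadoop Master. Each command should print the remote hostname without prompting for a password:
$ ssh hadoop@hadoop2.admintome.lab hostname
$ ssh hadoop@hadoop3.admintome.lab hostname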
Configuring the Hadoop Master
Open the /usr/local/hadoop/etc/hadoop/core-site.xml file and enter the following:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop1.admintome.lab:9000</value>
</property>
</configuration>
Save and exit.
Next, open the /usr/local/hadoop/etc/hadoop/hdfs-site.xml file and add the following:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
Save and exit.
Open the /usr/local/hadoop/etc/hadoop/workers file and add these two lines (one for each of your Hadoop Nodes):
hadoop2.admintome.lab
hadoop3.admintome.lab
Save and exit.
Copy the configuration files to each of your Hadoop Nodes from your Hadoop Master.
# scp /usr/local/hadoop/etc/hadoop/* hadoop2.admintome.lab:/usr/local/hadoop/etc/hadoop/
# scp /usr/local/hadoop/etc/hadoop/* hadoop3.admintome.lab:/usr/local/hadoop/etc/hadoop/
Format the HDFS file system:
$ source /etc/environment
$ hdfs namenode -format
Now you can start HDFS:
hadoop@hadoop1:~$ start-dfs.sh
Starting namenodes on [hadoop1.admintome.lab]
Starting datanodes
Starting secondary namenodes [hadoop1]
hadoop@hadoop1:~$
Validate that everything started correctly by running the jps command as the hadoop user on all your Hadoop servers.
On the Hadoop Master, you should see this:
hadoop@hadoop1:~$ jps
13634 Jps
13478 SecondaryNameNode
13174 NameNode
And on each of your Hadoop Nodes, you should see:
hadoop@hadoop2:~$ jps
8672 Jps
8579 DataNode
HDFS Web UI
You can now access the HDFS web UI by browsing to your Hadoop Master Server port 9870.
http://hadoop1.admintome.lab:9870
You should see the HDFS web UI. As you can see, we have almost 60 GB free on our HDFS file system.
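You can check the same numbers from the command line as the hadoop user on the Hadoop Master; the Configured Capacity and DFS Remaining values in the report should roughly match what the web UI shows:
$ hdfs dfsadmin -report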
Starting Yarn
Now that HDFS is running, we are ready to start the Yarn scheduler.
Hadoop, on its own, cannot schedule any jobs, so we need to run Yarn in order to schedule jobs on our Hadoop cluster. Set the following environment variables for the hadoop user before starting Yarn (for example, in ~/.bashrc or in your current shell):
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
On each of your Hadoop slaves (hadoop2 and hadoop3), you will need to add these lines inside the <configuration> block of /usr/local/hadoop/etc/hadoop/yarn-site.xml:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop1</value>
</property>
Save and exit the file.
Run this command to start Yarn:
$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
We can verify that it started correctly by running this command:
$ yarn node -list
2018-08-15 04:40:26,688 INFO client.RMProxy: Connecting to ResourceManager at hadoop1.admintome.lab/192.168.1.35:8032
Total Nodes:2
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
hadoop3.admintome.lab:35337 RUNNING hadoop3.admintome.lab:8042 0
hadoop2.admintome.lab:38135 RUNNING hadoop2.admintome.lab:8042 0
There are not any running containers because we haven't started any jobs yet.
Hadoop Web UI
You can view the Hadoop Web UI by going to this URL:
http://hadoop1.admintome.lab:8088/cluster
Replace the hostname with your Hadoop Master's hostname.
Running an Example Hadoop Job
We can now run a sample Hadoop job and schedule it on our cluster.
The example we will run uses MapReduce to calculate Pi.
Run the following command to run the job:
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar pi 16 1000
It will take several minutes to complete.
When it is done, you should see that it has calculated Pi:
Job Finished in 72.973 seconds
Estimated value of Pi is 3.14250000000000000000
Conclusion
You have now installed a Hadoop Cluster on Ubuntu 18.04.1.
I hope you have enjoyed this post. If you did then please share it and comment below.