Install a Hadoop Cluster on Ubuntu 18.04.1
We explore how to install a multiple node Hadoop 3.1.1 cluster on your Ubuntu-based system. Read on to get started with this powerful big data framework!
In this post, I will be installing a three-node Hadoop Cluster on Ubuntu 18.04.1 including HDFS.
In a previous post called Install Hadoop on Ubuntu 17.10, I walked through how to install a single node Hadoop server.
This post will go further by installing a multiple node Hadoop 3.1.1 cluster on Ubuntu 18.04.1.
To begin, we will need three virtual machines created for our cluster.
We will create a Hadoop Master server with 4 vCPUs, 4 GB of memory, and 40 GB of hard drive space.
We will also create two Hadoop Nodes with 4 vCPUs, 8 GB of memory, and 40 GB of hard drive space for each node.
For this article, I installed Ubuntu Server 18.04.1 on all three servers, installed all updates and rebooted.
Also, make sure that each server has a static IP address and that the servers can resolve each other's host names, either through internal DNS or through entries in the /etc/hosts file.
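If you go the /etc/hosts route, the entries could look something like the sketch below. The host names hadoop1 through hadoop3 under admintome.lab and the master address 192.168.1.35 are the ones that show up later in this post. The two node addresses are placeholders I made up for illustration, so substitute your own.

```shell
# Print /etc/hosts entries for the three cluster members.
# 192.168.1.35 is the master's address seen later in this post;
# the .36 and .37 addresses are hypothetical placeholders.
hosts_entries() {
    domain="$1"
    printf '%s\t%s.%s %s\n' 192.168.1.35 hadoop1 "$domain" hadoop1
    printf '%s\t%s.%s %s\n' 192.168.1.36 hadoop2 "$domain" hadoop2
    printf '%s\t%s.%s %s\n' 192.168.1.37 hadoop3 "$domain" hadoop3
}

# Review the output first, then append it as root, for example:
#   hosts_entries admintome.lab >> /etc/hosts
hosts_entries admintome.lab
```

Printing the lines before appending them keeps the step easy to double-check on each server.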
We are now ready to prepare our servers for running Hadoop.
Preparing the Hadoop Servers
You will need to perform the steps in this section on all your servers.
First, we need to install Java 8. This article uses Oracle Java 8 from the webupd8team PPA.
# add-apt-repository ppa:webupd8team/java
# apt update
# apt install -y oracle-java8-set-default
Accept the license terms.
Next, download the Hadoop Binaries
# wget http://apache.claz.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
Untar the archive and move it to /usr/local/
# tar -xzvf hadoop-3.1.1.tar.gz
# mv hadoop-3.1.1 /usr/local/hadoop
Next, we need to update our default environment variables to include the JAVA_HOME and Hadoop binary directories.
First we need to know where Java was installed to. Run the following command to find out.
# update-alternatives --display java
java - manual mode
  link best version is /usr/lib/jvm/java-8-oracle/jre/bin/java
  link currently points to /usr/lib/jvm/java-8-oracle/jre/bin/java
  link java is /usr/bin/java
  slave java.1.gz is /usr/share/man/man1/java.1.gz
/usr/lib/jvm/java-8-oracle/jre/bin/java - priority 1081
  slave java.1.gz: /usr/lib/jvm/java-8-oracle/man/man1/java.1.gz
As you can see JAVA_HOME should be set to /usr/lib/jvm/java-8-oracle/jre.
Open /etc/environment and update the PATH line to include the Hadoop binary directories.
Also add a line for the JAVA_HOME variable.
Make sure the directory matches the output from update-alternatives above minus the bin/java part.
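As a rough sketch of what those two lines in /etc/environment might end up looking like, the snippet below strips bin/java from the path reported by update-alternatives and prints a PATH and a JAVA_HOME line. The base PATH value is an assumption (a stock Ubuntu PATH), and the helper names are made up for this example, so compare the output against your own file before changing anything.

```shell
# Derive JAVA_HOME by stripping the trailing /bin/java from the java path
# reported by update-alternatives.
java_home_from_path() {
    printf '%s\n' "${1%/bin/java}"
}

# Print the two lines for /etc/environment.
# The base PATH is an assumption: a stock Ubuntu PATH with Hadoop's
# bin and sbin directories appended.
environment_lines() {
    java_home="$(java_home_from_path "$1")"
    hadoop_dir="$2"
    printf 'PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:%s/bin:%s/sbin"\n' "$hadoop_dir" "$hadoop_dir"
    printf 'JAVA_HOME="%s"\n' "$java_home"
}

environment_lines /usr/lib/jvm/java-8-oracle/jre/bin/java /usr/local/hadoop
```

Both the bin and sbin directories are included because the hdfs and yarn commands live in bin, while start-dfs.sh and start-yarn.sh live in sbin.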
Next, we will add a hadoop user and give them the correct permissions.
# adduser hadoop
# usermod -aG hadoop hadoop
# chown hadoop:root -R /usr/local/hadoop
# chmod g+rwx -R /usr/local/hadoop
Log in as the hadoop user and generate an SSH key. You only need to complete this step on the Hadoop Master.
# su - hadoop
$ ssh-keygen -t rsa
Accept all the defaults for ssh-keygen.
Now log in as the hadoop user and copy the SSH key to all of your Hadoop servers. Again, you only need to complete this step on the Hadoop Master.
# su - hadoop
$ ssh-copy-id hadoop@hadoop1.admintome.lab
$ ssh-copy-id hadoop@hadoop2.admintome.lab
$ ssh-copy-id hadoop@hadoop3.admintome.lab
Configuring the Hadoop Master
Open the /usr/local/hadoop/etc/hadoop/core-site.xml file and enter the following:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop1.admintome.lab:9000</value>
  </property>
</configuration>
Save and exit.
Next, open the /usr/local/hadoop/etc/hadoop/hdfs-site.xml file and add the following:
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
Save and exit.
Open the /usr/local/hadoop/etc/hadoop/workers file and add these two lines (one for each of your Hadoop Nodes):
hadoop2.admintome.lab
hadoop3.admintome.lab
Save and exit.
Copy the configuration files to each of your Hadoop Nodes from your Hadoop Master.
# scp /usr/local/hadoop/etc/hadoop/* hadoop2.admintome.lab:/usr/local/hadoop/etc/hadoop/
# scp /usr/local/hadoop/etc/hadoop/* hadoop3.admintome.lab:/usr/local/hadoop/etc/hadoop/
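If you add more nodes later, typing one scp per host gets tedious. One possible approach, sketched below, is to loop over the workers file. The push_config function and the DRY_RUN switch are my own illustrative names, not part of Hadoop, and by default the loop only prints the commands it would run.

```shell
# Print (or run) one scp command per host listed in a workers file.
# With DRY_RUN=1 (the default here) the commands are only printed.
push_config() {
    workers_file="$1"
    conf_dir="$2"
    while IFS= read -r host; do
        # Skip blank lines and comment lines.
        case "$host" in ''|'#'*) continue ;; esac
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "scp $conf_dir/* $host:$conf_dir/"
        else
            # Keep scp from reading the loop's stdin.
            scp "$conf_dir"/* "$host:$conf_dir/" < /dev/null
        fi
    done < "$workers_file"
}

# Example with a throwaway workers file; on the master you would pass
# /usr/local/hadoop/etc/hadoop/workers instead.
tmp_workers="$(mktemp)"
printf '%s\n' hadoop2.admintome.lab hadoop3.admintome.lab > "$tmp_workers"
push_config "$tmp_workers" /usr/local/hadoop/etc/hadoop
rm -f "$tmp_workers"
```

Once the printed commands look right, running it with DRY_RUN=0 performs the copy.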
Format the HDFS file system
$ source /etc/environment
$ hdfs namenode -format
Now you can start HDFS
hadoop@hadoop1:~$ start-dfs.sh
Starting namenodes on [hadoop1.admintome.lab]
Starting datanodes
Starting secondary namenodes [hadoop1]
hadoop@hadoop1:~$
Validate that everything started right by running the jps command as the Hadoop user on all your Hadoop servers.
On the Hadoop Master you should see this.
hadoop@hadoop1:~$ jps
13634 Jps
13478 SecondaryNameNode
13174 NameNode
On each of your Hadoop Nodes you should see:
hadoop@hadoop2:~$ jps
8672 Jps
8579 DataNode
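If you would rather not eyeball the jps output on every server, a small check like the one below could do it for you. This is only a sketch: has_daemons is a hypothetical helper name, and the sample text is the master output shown above.

```shell
# Succeed only if every expected daemon name appears in the jps output.
# $1 is the jps output; the remaining arguments are daemon names.
has_daemons() {
    jps_output="$1"
    shift
    for daemon in "$@"; do
        # jps prints "<pid> <name>"; compare the name exactly so that
        # NameNode does not match SecondaryNameNode.
        printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' || return 1
    done
    return 0
}

# On a live server you would call: has_daemons "$(jps)" NameNode SecondaryNameNode
sample="13634 Jps
13478 SecondaryNameNode
13174 NameNode"
if has_daemons "$sample" NameNode SecondaryNameNode; then
    echo "master daemons are running"
fi
```

On the Hadoop Nodes the same function can be called with DataNode as the only expected name.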
HDFS Web UI
You can now access the HDFS web UI by browsing to your Hadoop Master Server port 9870.
You should see the UI:
As you can see we have almost 60 GB free on our HDFS file system.
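Keep in mind that the free space reported here is raw disk space across the DataNodes. Because we set dfs.replication to 2, every block is stored twice, so the amount of data we can actually store is roughly half of that. A quick back-of-the-envelope calculation, assuming 60 GB of raw free space and a hypothetical usable_gb helper:

```shell
# Rough usable capacity: raw free space divided by the replication factor.
# This ignores non-DFS usage and reserved space, so treat it as an upper bound.
usable_gb() {
    raw_gb="$1"
    replication="$2"
    echo $(( raw_gb / replication ))
}

usable_gb 60 2   # prints 30
```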
Now that HDFS is running, we are ready to start the YARN scheduler.
Hadoop, on its own, cannot schedule any jobs, so we need to run YARN in order to schedule jobs on our Hadoop cluster.
First, export the following environment variables:
export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
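Next, add the following property inside the <configuration> element of the /usr/local/hadoop/etc/hadoop/yarn-site.xml file:
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop1</value>
</property>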
Run this command to start YARN:
$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
We can verify that it started correctly by running this command:
$ yarn node -list
2018-08-15 04:40:26,688 INFO client.RMProxy: Connecting to ResourceManager at hadoop1.admintome.lab/192.168.1.35:8032
Total Nodes:2
Node-Id                      Node-State  Node-Http-Address           Number-of-Running-Containers
hadoop3.admintome.lab:35337  RUNNING     hadoop3.admintome.lab:8042  0
hadoop2.admintome.lab:38135  RUNNING     hadoop2.admintome.lab:8042  0
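If you want to script this check, one option is to count the nodes whose state column reads RUNNING. The sketch below uses a hypothetical count_running_nodes helper and feeds it the sample output from this post; on a live cluster you would pipe `yarn node -list` into it instead.

```shell
# Count the nodes reported as RUNNING in `yarn node -list` output (read from stdin).
# The node state is the second whitespace-separated column.
count_running_nodes() {
    awk '$2 == "RUNNING" { n++ } END { print n + 0 }'
}

# Sample output copied from this post; this prints 2.
count_running_nodes <<'EOF'
2018-08-15 04:40:26,688 INFO client.RMProxy: Connecting to ResourceManager at hadoop1.admintome.lab/192.168.1.35:8032
Total Nodes:2
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
hadoop3.admintome.lab:35337 RUNNING hadoop3.admintome.lab:8042 0
hadoop2.admintome.lab:38135 RUNNING hadoop2.admintome.lab:8042 0
EOF
```

Comparing the count with the number of lines in your workers file is a quick way to spot a NodeManager that failed to start.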
There are not any running containers because we haven't started any jobs yet.
Hadoop Web UI
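You can view the Hadoop Web UI by browsing to port 8088, the ResourceManager's default web port, on your Hadoop Master, for example http://hadoop1.admintome.lab:8088.
Replace the host name with your Hadoop Master's host name.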
Running an Example Hadoop Job
We can now run a sample Hadoop job and schedule it on our cluster.
The example we will run uses MapReduce to estimate the value of pi.
Run the following command to run the job:
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar pi 16 1000
It will take several minutes to complete.
When it is done, you should see the estimated value of pi:
Job Finished in 72.973 seconds
Estimated value of Pi is 3.14250000000000000000
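If you plan to use this job as a smoke test after changing the cluster, you could check the estimate automatically. The sketch below pulls the number out of the job output and compares it with pi. The pi_estimate_ok name and the 0.01 tolerance are arbitrary choices of mine for illustration.

```shell
# Read job output on stdin and succeed if the reported estimate is within
# a tolerance of pi. The default tolerance of 0.01 is an arbitrary choice.
pi_estimate_ok() {
    awk -v tol="${1:-0.01}" '
        /Estimated value of Pi is/ { est = $NF; seen = 1 }
        END {
            pi = atan2(0, -1)
            diff = est - pi
            if (diff < 0) diff = -diff
            exit !(seen && diff <= tol)
        }'
}

# Sample output copied from this post.
if printf '%s\n' 'Job Finished in 72.973 seconds' \
        'Estimated value of Pi is 3.14250000000000000000' | pi_estimate_ok; then
    echo "estimate looks sane"
fi
```

The estimate above is off by a little under 0.001, which is about what you would expect from only 16 maps with 1000 samples each.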
You have now installed a Hadoop Cluster on Ubuntu 18.04.1.
I hope you have enjoyed this post. If you did then please share it and comment below.
Published at DZone with permission of Bill Ward, DZone MVB. See the original article here.