Install a Hadoop Cluster on Ubuntu 18.04.1

We explore how to install a multi-node Hadoop 3.1.1 cluster on your Ubuntu-based system. Read on to get started with this powerful big data framework!

By Bill Ward · Aug. 16, 18 · Tutorial

In this post, I will be installing a three-node Hadoop cluster on Ubuntu 18.04.1, including HDFS.

In a previous post, Install Hadoop on Ubuntu 17.10, I walked through how to install a single-node Hadoop server.

This post goes further by installing a multi-node Hadoop 3.1.1 cluster on Ubuntu 18.04.1.

Getting Started

To begin, we will need three virtual machines created for our cluster.

We will create a Hadoop Master server with 4 vCPUs, 4 GB of memory, and 40 GB of hard drive space.

We will also create two Hadoop Nodes with 4 vCPUs, 8 GB of memory, and 40 GB of hard drive space for each node.

For this article, I installed Ubuntu Server 18.04.1 on all three servers, installed all updates, and rebooted.

Also make sure that you configure each server with a static IP address and that the servers can resolve each other's names, either through internal DNS or by adding each server to the /etc/hosts file.
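
If you go the /etc/hosts route, each of the three servers would get entries along these lines. Only hadoop1's address (192.168.1.35) appears later in this post; the other two addresses are placeholders, so substitute your own:

192.168.1.35   hadoop1.admintome.lab   hadoop1
192.168.1.36   hadoop2.admintome.lab   hadoop2
192.168.1.37   hadoop3.admintome.lab   hadoop3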

We are now ready to prepare our servers for running Hadoop.

Preparing the Hadoop Servers

You will need to perform the steps in this section on all your servers.

First, we need to install Oracle Java 8, since as of Ubuntu 18.04.1 OpenJDK 8 is no longer available.

# add-apt-repository ppa:webupd8team/java
# apt update
# apt install -y oracle-java8-set-default

Accept the license terms.
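
Once the install finishes, a quick optional check confirms the Oracle JDK is the active Java:

# java -version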

Next, download the Hadoop binaries:

# wget http://apache.claz.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz

Untar the archive and move it to /usr/local/hadoop:

# tar -xzvf hadoop-3.1.1.tar.gz
# mv hadoop-3.1.1 /usr/local/hadoop

Next, we need to update our default environment variables to include the JAVA_HOME and Hadoop binary directories.

First, we need to know where Java was installed. Run the following command to find out:

# update-alternatives --display java
java - manual mode
  link best version is /usr/lib/jvm/java-8-oracle/jre/bin/java
  link currently points to /usr/lib/jvm/java-8-oracle/jre/bin/java
  link java is /usr/bin/java
  slave java.1.gz is /usr/share/man/man1/java.1.gz
/usr/lib/jvm/java-8-oracle/jre/bin/java - priority 1081
  slave java.1.gz: /usr/lib/jvm/java-8-oracle/man/man1/java.1.gz

As you can see, JAVA_HOME should be set to /usr/lib/jvm/java-8-oracle/jre.

Open /etc/environment and update the PATH line to include the Hadoop binary directories.

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"

Also add a line for the JAVA_HOME variable.

JAVA_HOME="/usr/lib/jvm/java-8-oracle/jre"

Make sure the directory matches the output from update-alternatives above minus the bin/java part.
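
As a quick optional check, log out and back in so /etc/environment is re-read, then confirm the shell can find both Java and the Hadoop binaries (the first command should print the Oracle JRE path, the second should report Hadoop 3.1.1):

# echo $JAVA_HOME
# hadoop version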

Next, we will add a hadoop user and give them the correct permissions.

# adduser hadoop
# usermod -aG hadoop hadoop
# chown hadoop:root -R /usr/local/hadoop
# chmod g+rwx -R /usr/local/hadoop

Log in as the hadoop user and generate an SSH key. You only need to complete this step on the Hadoop Master.

# su - hadoop
$ ssh-keygen -t rsa

Accept all the defaults for ssh-keygen.

Now, log in as the hadoop user and copy the SSH key to all Hadoop Nodes. Again, you only need to complete this step on the Hadoop Master.

# su - hadoop
$ ssh-copy-id hadoop@hadoop1.admintome.lab
$ ssh-copy-id hadoop@hadoop2.admintome.lab
$ ssh-copy-id hadoop@hadoop3.admintome.lab
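
Before moving on, you can optionally confirm that passwordless SSH works from the Hadoop Master; each command should print the remote hostname without prompting for a password:

$ ssh hadoop@hadoop2.admintome.lab hostname
$ ssh hadoop@hadoop3.admintome.lab hostname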

Configuring the Hadoop Master

Open the /usr/local/hadoop/etc/hadoop/core-site.xml file and enter the following:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop1.admintome.lab:9000</value>
  </property>
</configuration>

Save and exit.

Next, open the /usr/local/hadoop/etc/hadoop/hdfs-site.xml file and add the following:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

Save and exit.

Open the /usr/local/hadoop/etc/hadoop/workers file and add these two lines (one for each of your Hadoop Nodes):

hadoop2.admintome.lab
hadoop3.admintome.lab

Save and exit.

Copy the configuration files from your Hadoop Master to each of your Hadoop Nodes. Run these commands as the hadoop user so the SSH key you distributed earlier is used:

$ scp /usr/local/hadoop/etc/hadoop/* hadoop2.admintome.lab:/usr/local/hadoop/etc/hadoop/
$ scp /usr/local/hadoop/etc/hadoop/* hadoop3.admintome.lab:/usr/local/hadoop/etc/hadoop/
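
A quick optional spot check from the Hadoop Master confirms the copy succeeded; the checksums should match on all three servers:

$ md5sum /usr/local/hadoop/etc/hadoop/hdfs-site.xml
$ ssh hadoop2.admintome.lab md5sum /usr/local/hadoop/etc/hadoop/hdfs-site.xml
$ ssh hadoop3.admintome.lab md5sum /usr/local/hadoop/etc/hadoop/hdfs-site.xml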

Next, format the HDFS file system. Run the following on the Hadoop Master as the hadoop user:

$ source /etc/environment
$ hdfs namenode -format

Now you can start HDFS:

hadoop@hadoop1:~$ start-dfs.sh
Starting namenodes on [hadoop1.admintome.lab]
Starting datanodes
Starting secondary namenodes [hadoop1]
hadoop@hadoop1:~$

Validate that everything started right by running the jps command as the Hadoop user on all your Hadoop servers.

On the Hadoop Master, you should see this:

hadoop@hadoop1:~$ jps
13634 Jps
13478 SecondaryNameNode
13174 NameNode

And on each of your Hadoop Nodes, you should see:

hadoop@hadoop2:~$ jps
8672 Jps
8579 DataNode

HDFS Web UI

You can now access the HDFS web UI by browsing to your Hadoop Master Server port 9870.

http://hadoop1.admintome.lab:9870

You should see the NameNode UI. As you can see, we have almost 60 GB free on our HDFS file system.
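
The same capacity information is available from the command line, which is handy if your workstation cannot resolve the cluster hostnames. Run this as the hadoop user on the Hadoop Master:

$ hdfs dfsadmin -report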

Starting YARN

Now that HDFS is running, we are ready to start the YARN scheduler.

Hadoop, on its own, cannot schedule any jobs, so we need to run YARN so that we can schedule jobs on our Hadoop cluster. First, set the following environment variables as the hadoop user (adding them to the hadoop user's ~/.bashrc, for example, keeps them set across logins):

export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME

On each of your Hadoop Nodes (hadoop2 and hadoop3), you will need to add these lines inside the <configuration> block of /usr/local/hadoop/etc/hadoop/yarn-site.xml:

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop1</value>
  </property>

Save and exit the file.

Run this command on the Hadoop Master to start YARN:

$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers

We can verify that it started correctly by running this command (running jps again should now also show a ResourceManager process on the Hadoop Master and a NodeManager process on each node):

$ yarn node -list
2018-08-15 04:40:26,688 INFO client.RMProxy: Connecting to ResourceManager at hadoop1.admintome.lab/192.168.1.35:8032
Total Nodes:2
         Node-Id       Node-State  Node-Http-Address  Number-of-Running-Containers
hadoop3.admintome.lab:35337          RUNNING  hadoop3.admintome.lab:8042                             0
hadoop2.admintome.lab:38135          RUNNING  hadoop2.admintome.lab:8042                             0

There aren't any running containers because we haven't started any jobs yet.

Hadoop Web UI

You can view the Hadoop Web UI by going to this URL:

http://hadoop1.admintome.lab:8088/cluster

Replace the host name with your Hadoop Master's host name.

Running an Example Hadoop Job

We can now run a sample Hadoop job and schedule it on our cluster.

The example we will run uses MapReduce to estimate the value of pi.

Run the following command to run the job:

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar pi 16 1000

Here, 16 is the number of map tasks and 1000 is the number of samples each map computes; the job will take several minutes to complete.
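
While the job runs, you can optionally follow its progress from the command line (or in the YARN web UI described above):

$ yarn application -list -appStates RUNNING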

When it is done, you should see that it has calculated pi:

Job Finished in 72.973 seconds
Estimated value of Pi is 3.14250000000000000000

Conclusion

You have now installed a Hadoop Cluster on Ubuntu 18.04.1.

I hope you have enjoyed this post. If you did then please share it and comment below.

Click here for other great articles from AdminTome Blog.


Published at DZone with permission of Bill Ward, DZone MVB. See the original article here.

