Setting up Multi-Node Hadoop Cluster Just Got Easy
If you are searching for exact steps to configure a Multi-Node Hadoop Cluster, look no more. This article has step-by-step details to set up a Multi-Node cluster for Hadoop 2.7.3 and Spark 1.6.2.
In this blog, we walk through how to set up a Hadoop multi-node cluster in a distributed environment. So let's not waste any time! Here are the steps you need to perform.
Download & install Hadoop for local machine (Single Node Setup) from
http://hadoop.apache.org/releases.html – 2.7.3
Use java: jdk1.8.0_111
Download Apache Spark from
Choose Spark release: 1.6.2
Mapping the Nodes
First of all, edit the hosts file in the /etc/ folder on all nodes, specifying the IP address of each system followed by its host name.
# vi /etc/hosts

Enter the following lines in the /etc/hosts file:

192.168.1.xxx hadoop-master
192.168.1.xxx hadoop-slave-1
192.168.56.xxx hadoop-slave-2
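Before moving on, it can help to confirm that every node's hosts file really contains all three names. The following is a minimal sketch, not part of the original setup: check_hosts is a hypothetical helper that scans a hosts-style file for each name you pass in and reports the ones that are missing.

```shell
# check_hosts FILE NAME...: report each host name that has no entry in FILE.
# Hypothetical helper for illustration; returns 0 only if every name is mapped.
check_hosts() {
    hosts_file=$1
    shift
    rc=0
    for name in "$@"; do
        # Skip comment lines; look for the name in any column after the IP address.
        if ! awk -v n="$name" '
                /^[[:space:]]*#/ { next }
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }' "$hosts_file"; then
            echo "missing entry for $name in $hosts_file" >&2
            rc=1
        fi
    done
    return $rc
}

# Example, run on each node:
#   check_hosts /etc/hosts hadoop-master hadoop-slave-1 hadoop-slave-2
```

Running the example on the master and on each slave catches a node whose hosts file was not updated.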
Passwordless Login Through ssh
Next, set up passwordless ssh login by configuring key-based login.
Set up ssh on every node so that the nodes can communicate with one another without being prompted for a password.
# su hduser
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoop-slave-1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoop-slave-2
Note: the .ssh folder should have permission 700, the authorized_keys file should have 644, and the hduser home directory should have 755, on both the master and the slaves. (This is very important; I wasted a lot of time trying to figure this out.)
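The permission note can be applied in one step. This is a sketch under two assumptions: that "hduser" in the note means the hduser home directory, and that the key file is the standard authorized_keys inside .ssh. fix_ssh_perms is a hypothetical helper name.

```shell
# fix_ssh_perms HOME_DIR: apply the modes from the note to one user's home.
# Assumes HOME_DIR/.ssh/authorized_keys already exists (ssh-copy-id creates it).
fix_ssh_perms() {
    home_dir=$1
    chmod 755 "$home_dir" &&
    chmod 700 "$home_dir/.ssh" &&
    chmod 644 "$home_dir/.ssh/authorized_keys"
}

# Example, run as hduser on the master and on every slave:
#   fix_ssh_perms "$HOME"
```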
Set up Java Environment for Master and Slave
Folder structure for both Master and Slave must be the same.
Extract the JDK archive in /home/hduser/software and set the path in hduser’s .bashrc as:
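A plausible form of those .bashrc lines is sketched here. The directory name is an assumption: it presumes the jdk1.8.0_111 archive mentioned earlier extracts to a folder of the same name under /home/hduser/software, so adjust it to whatever your archive actually produces.

```shell
# Assumed JDK location: jdk1.8.0_111 extracted under /home/hduser/software.
export JAVA_HOME=/home/hduser/software/jdk1.8.0_111
# Put the JDK tools (java, jps, ...) on the PATH.
export PATH="$PATH:$JAVA_HOME/bin"
```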
- Install Hadoop in /usr/local
- Set $HADOOP_HOME in bashrc as:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
- Create a directory named hadoop_data in /opt and a directory named dfs in $HADOOP_HOME.
- Inside dfs, create a directory called name, and inside name create a directory named data.
- The permissions for name and dfs should be 777.
- Make sure that /opt/hadoop_data is owned by hduser and has permissions 777.
- Your core-site.xml file should look like:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop_data</value>
    <description>directory for hadoop data</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:54311</value>
    <description>data to be put on this URI</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:54311</value>
    <description>Use HDFS as file storage engine</description>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
- Your hdfs-site.xml file should look like:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/hadoop/dfs/name/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/hadoop/dfs/name</value>
    <final>true</final>
  </property>
</configuration>
- Your mapred-site.xml should look like:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:9001</value>
  </property>
</configuration>
- Your yarn-site.xml should look like:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>hadoop-master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop-master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>hadoop-master:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>hadoop-master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>hadoop-master:8033</value>
  </property>
</configuration>
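The directory layout that these configuration files point at (hadoop_data under /opt, and dfs/name/data under $HADOOP_HOME) can be created in one go. This is a minimal sketch using the 777 permissions the steps above call for; make_hadoop_dirs is a hypothetical helper, and the paths are passed in as arguments so the same function works on the master and the slaves.

```shell
# make_hadoop_dirs DATA_DIR HADOOP_HOME_DIR OWNER
# Creates DATA_DIR and HADOOP_HOME_DIR/dfs/name/data, then applies the
# ownership and 777 modes described in the setup steps.
make_hadoop_dirs() {
    data_dir=$1
    hadoop_home=$2
    owner=$3
    mkdir -p "$data_dir" "$hadoop_home/dfs/name/data" || return 1
    chmod 777 "$data_dir" "$hadoop_home/dfs" "$hadoop_home/dfs/name" || return 1
    chown "$owner" "$data_dir"
}

# Example, run as root on every node:
#   make_hadoop_dirs /opt/hadoop_data /usr/local/hadoop hduser
```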
Now, set JAVA_HOME in hadoop-env.sh
Next, on the master node, list the nodes in the $HADOOP_HOME/etc/hadoop/slaves file:

hadoop-master
hadoop-slave-1
hadoop-slave-2

Remove the localhost entry from the above file.
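A leftover localhost line in the slaves file is easy to miss. As a small sanity check (a sketch; check_slaves is a hypothetical helper, not a Hadoop tool), the file can be scanned for it before the cluster is started:

```shell
# check_slaves FILE: fail if FILE still contains a bare "localhost" line.
check_slaves() {
    if grep -qx '[[:space:]]*localhost[[:space:]]*' "$1"; then
        echo "remove the localhost entry from $1" >&2
        return 1
    fi
}

# Example, run on the master:
#   check_slaves "$HADOOP_HOME/etc/hadoop/slaves"
```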
Important Note: The location of Hadoop and Spark should be the same on the master and the slaves.
Install Spark in /home/hduser/software
Set your $SPARK_HOME in bashrc as:
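The exact lines depend on the folder name your Spark 1.6.2 archive extracts to. The sketch below uses a hypothetical directory, /home/hduser/software/spark, purely as a stand-in; substitute the real folder name.

```shell
# Hypothetical install directory; replace "spark" with the folder created
# when the Spark 1.6.2 archive is extracted under /home/hduser/software.
export SPARK_HOME=/home/hduser/software/spark
# Make spark-submit and the other Spark launchers available on the PATH.
export PATH="$PATH:$SPARK_HOME/bin"
```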
Add the following line in spark-env.sh
export SPARK_MASTER_IP=192.168.2.xxx # IP address of master
Copy your hdfs-site.xml and core-site.xml files from $HADOOP_HOME/etc/hadoop into the $SPARK_HOME/conf folder.
On the master node, add the IP addresses of the slaves to the slaves file located in $SPARK_HOME/conf.
- To run Hadoop, go to $HADOOP_HOME in the master and run: hadoop namenode -format
- Next, change directory by executing the command:
start-dfs.sh will start the NameNode, SecondaryNameNode, and DataNode on the master and a DataNode on every slave node.
start-yarn.sh will start the ResourceManager and a NodeManager on the master node and a NodeManager on each slave.
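One way to confirm that these daemons actually came up is to compare the output of the JDK's jps tool against the expected list. The helper below is a sketch: missing_daemons is a hypothetical name, and it assumes jps prints one "<pid> <ClassName>" pair per line.

```shell
# missing_daemons NAME...: read jps output on stdin and print every expected
# daemon that is not running. Returns 1 if anything is missing.
missing_daemons() {
    jps_output=$(cat)
    rc=0
    for daemon in "$@"; do
        # Compare the second column (the class name) exactly.
        if ! printf '%s\n' "$jps_output" |
                awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }'; then
            echo "$daemon"
            rc=1
        fi
    done
    return $rc
}

# On the master:
#   jps | missing_daemons NameNode SecondaryNameNode DataNode ResourceManager NodeManager
# On a slave:
#   jps | missing_daemons DataNode NodeManager
```

If the function prints nothing, every daemon named on the command line was found.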
Run hadoop namenode -format only once; otherwise you will get an incompatible cluster_id exception. To resolve this error, clear the temporary data location for the DataNode, i.e., remove the files present in the $HADOOP_HOME/dfs/name/data folder.
Use the following command:
rm -rf filename
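Because rm -rf is unforgiving, a slightly more defensive variant may be worth using. This is a sketch, not the author's command: clear_datanode_dir is a hypothetical helper that empties a directory but refuses obviously wrong arguments, and it relies on the -mindepth and -delete options found in GNU and BSD find.

```shell
# clear_datanode_dir DIR: delete everything inside DIR but keep DIR itself.
clear_datanode_dir() {
    dir=$1
    # Refuse empty, root, or nonexistent paths so a typo cannot wipe the wrong tree.
    if [ -z "$dir" ] || [ "$dir" = "/" ] || [ ! -d "$dir" ]; then
        echo "refusing to clear '$dir'" >&2
        return 1
    fi
    # -mindepth 1 skips DIR itself; -delete removes children depth-first.
    find "$dir" -mindepth 1 -delete
}

# Example, run on each node whose DataNode reports the cluster ID error:
#   clear_datanode_dir "$HADOOP_HOME/dfs/name/data"
```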
- Start spark. Go to $SPARK_HOME/sbin and run
- Start thrift server and log into beeline using hduser as username and password.
- To start the thrift server use the following command inside $SPARK_HOME
./bin/spark-submit --master <spark master IP> --conf spark.sql.hive.thriftServer.singleSession=true --class pathOfClassToRun pathToYourApplicationJar hdfs://hadoop-master:54311/pathToStoreLocation
Look up the Spark master address in the master node's web UI at hadoop-master:8080.
If you run into any of the issues below, check the Hadoop logs folder located in $HADOOP_HOME/logs.
Incompatible cluster IDs: clear the temporary data location for the DataNode.
Failed to start database: if you face this problem, remove the metastore_db directory.
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@7962a746 : remove metastore_db/dbex.l
HiveSQLException: org.apache.hadoop.security.AccessControlException: Permission denied: user=anonymous, access=WRITE : log in to beeline with hduser as both the user name and the password.
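For the last issue, logging in as hduser means passing the user name and password to beeline explicitly. The sketch below builds the JDBC URL and shows the call in a comment. It assumes the Thrift server listens on port 10000, the usual default; thrift_jdbc_url is a hypothetical helper name.

```shell
# thrift_jdbc_url HOST [PORT]: build the jdbc:hive2 URL that beeline connects to.
# Port 10000 is assumed as the default; pass a second argument to override it.
thrift_jdbc_url() {
    printf 'jdbc:hive2://%s:%s\n' "$1" "${2:-10000}"
}

# Example: connect as hduser instead of the anonymous user.
#   beeline -u "$(thrift_jdbc_url hadoop-master)" -n hduser -p hduser
```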
Published at DZone with permission of Rao Swati , DZone MVB. See the original article here.