First Hadoop Cluster Challenges
Yesterday we were working on setting up our first Hadoop cluster. Though there is much online documentation on this even then we faced a few challenges getting with it. In this post I am providing details on the faced problems and solutions:
Passwordless login from NameNode to DataNode and vice versa:
Though setting paswordless login from NameNode to DataNode was easy, we had to just follow the steps mentioned at different tutorials:
ssh-keygen -t rsa .ssh/id_rsa.pub nsinfra@datanode1:~nsinfra/.ssh/authorized_keys .ssh/id_rsa.pub nsinfra@datanode2:~nsinfra/.ssh/authorized_keys
We executed above three commands and assigned 700 permission on both
.ssh & authorized_keys on NameNode before copying it to DataNodes.
And we were able to ssh datanode from namenode without a password.
But we struggled with the reverse process, we referred to one of the tutorials from net to setup Hadoop cluster. Below were the commands specified:
ssh-keygen -t rsa .ssh/id_rsa.pub nsinfra@datanode1:~nsinfra/.ssh/authorized_keys2
Since we had 3 datanode in the cluster, looking at above command structure, we made an assumption that we need to have three different auth keys and thereby executed the below:
ssh-keygen -t rsa .ssh/id_rsa.pub nsinfra@datanode1:~nsinfra/.ssh/authorized_keys1 .ssh/id_rsa.pub nsinfra@datanode2:~nsinfra/.ssh/authorized_keys2 .ssh/id_rsa.pub nsinfra@datanode3:~nsinfra/.ssh/authorized_keys3
Also, we assigned 700 permission on the .ssh and authorized_keyX as before.
To our surprise only 2nd worked, while the datanode2 and datanode3 could not be SSHed without password. When dwelling deep in SSH tutorials we realized the issue. We have taken authorized_keyX file for granted and changed the name. But authorized_key2 is the only file which is used by the SSH, and creating 1 and 3 versions makes no sense. Now the question was how do we setup 3 machines? We found that version2 is common for all the machine and hence copying 1's and 3's contents in 2 shall resolve the problem.
cat authorized_keys1 >> authorized_keys2 cat authorized_keys3 >> authorized_keys2
java.io.IOException: Incompatible namespaceIDs
After successfully setting nodes with passwordless SSH access we started the cluster. NameNode got started correctly but the DataNodes were not. We analyzed this using the JPS utility, processes named datanode were not started on DataNodes. We further looked at logs of data node, and found logged exceptions:
... ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/dfs/data: ... at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:281) at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:121) at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:230) at org.apache.hadoop.dfs.DataNode.(DataNode.java:199) at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1202) at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1146) at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1167) at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1326)
On googling we found that this is a common problem and being refered as
HDFS-107 (formerly known as HADOOP-1212). Below are the solution which
were mentioned by experts:
QuickFix#1: Cleanup and restart
- Stop the cluster
- Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /app/hadoop/tmp/dfs/data
- Reformat the NameNode (NOTE: all HDFS data is lost during this process!)
- Restart the cluster
Fix#2: Updating namespaceID of problematic DataNodes
This workaround is “minimally invasive” as you only have to edit one file on the problematic DataNodes:
- Stop the DataNode
- Edit the value of namespaceID in /current/VERSION to match the value of the current NameNode
- Restart the DataNode