In my earlier post about Hadoop cluster planning for data nodes, I covered the steps for setting up a Hadoop cluster to handle 100 TB of data in a year. Now let's take a step forward and plan for name nodes. Alongside the primary name node, we need a failover name node as well. (This is often conflated with the "secondary name node," which in Hadoop actually performs checkpointing rather than automatic failover.) The failover name node should be an exact or near-exact replica of the primary name node.
Both name node servers should have highly reliable storage for the namespace image and edit-log journaling. That's why RAID is recommended for name nodes, in contrast to the JBOD setup recommended for data nodes.
Master servers should have at least four redundant storage volumes, some local and some networked, but each can be relatively small (typically 1 TB).
It is easy to determine the memory needed for both the name node and its failover counterpart: add the memory the name node needs to manage the HDFS cluster metadata in memory to the memory needed by the OS. The failover node's memory should be identical to the primary name node's.
The amount of memory required for the master nodes depends on the number of file system objects (files and block replicas) the name node must create and track. As a rule of thumb, 64 GB of RAM supports approximately 100 million files. So if you know the number of files to be stored across the data nodes, you can scale this rule to estimate RAM size.
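The rule of thumb above can be sketched as a small helper. This is a minimal illustration that assumes the 64 GB per 100 million files rule scales linearly; the function name is my own, not part of any Hadoop tooling.

```python
# Rough NameNode RAM estimate from file count.
# Rule of thumb from the text: 64 GB of RAM supports ~100 million files.
# Linear scaling is an assumption for illustration.

def estimate_namenode_ram_gb(num_files: int) -> float:
    """Scale the 64 GB / 100M-files rule of thumb linearly."""
    GB_PER_100M_FILES = 64
    return num_files / 100_000_000 * GB_PER_100M_FILES

print(estimate_namenode_ram_gb(100_000_000))  # 64.0
print(estimate_namenode_ram_gb(250_000_000))  # 160.0
```

In practice you would round the result up to the next standard DIMM configuration rather than use the raw figure.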
We can size memory based on the cluster node count, as well. For a small cluster of 5-50 nodes, 64 GB of RAM should be fair enough. For medium-to-large clusters of 50 to 1,000 nodes, 128 GB of RAM is recommended.
Or use this formula: memory amount = HDFS cluster management memory + name node memory + OS memory.
As starting points, 8-16 GB for the OS, 8-32 GB for the name node, and 8-64 GB for HDFS cluster management should be enough.
The name node memory and HDFS cluster management memory can be refined based on the number of data nodes and files to be processed; use the file-count rule and the cluster-size guideline above to estimate these values.
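The formula above can be written out as a short sketch. The three inputs are placeholders you would fill in from the ranges just given; the mid-range values in the example are illustrative picks, not recommendations from Hadoop itself.

```python
# Sketch of the total-memory formula from the text:
#   total = HDFS cluster management memory + name node memory + OS memory
# All figures in GB; the example values are mid-range picks from the
# suggested ranges (OS 8-16, name node 8-32, cluster management 8-64).

def total_master_memory_gb(cluster_mgmt_gb: int, namenode_gb: int, os_gb: int) -> int:
    return cluster_mgmt_gb + namenode_gb + os_gb

print(total_master_memory_gb(cluster_mgmt_gb=32, namenode_gb=16, os_gb=8))  # 56
```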
Name nodes and their clients are very chatty, so we recommend providing 16 or even 24 CPU cores on the master nodes to handle the messaging traffic.
Providing multiple network ports and 10 Gb/s bandwidth to the switch is also worthwhile (if the switch can handle it).
Previously, in MapReduce v1, resources were configured as mapper and reducer slots to control the amount of memory used on each node. That practice has been discarded. Resources are now configured in terms of amounts of memory (in megabytes) and CPU (vcores), and YARN can auto-tune the yarn.nodemanager.resource.* settings that control the memory and CPU available on each node for both mappers and reducers. If configuring these manually, simply set them to the amount of memory and number of cores on the machine after subtracting out resources needed for other services. Normally, we reserve two cores per node, one for the NodeManager (the successor to the TaskTracker) and one for the HDFS DataNode. The rest can be assigned to the parameter:
yarn.nodemanager.resource.cpu-vcores = total CPU cores - 2
For yarn.nodemanager.resource.memory-mb, use the HDFS cluster management memory from the memory sizing above.
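Putting the two settings together, here is a minimal sketch that derives the values you would place in yarn-site.xml. It assumes two reserved cores as described above; the helper function is illustrative, not a Hadoop API.

```python
# Derive the two YARN NodeManager settings discussed above.
# Assumptions: 2 cores reserved (NodeManager + DataNode), and the
# memory figure comes from the HDFS cluster management sizing.

def yarn_nodemanager_settings(total_cores: int, cluster_mgmt_memory_gb: int) -> dict:
    return {
        "yarn.nodemanager.resource.cpu-vcores": total_cores - 2,
        "yarn.nodemanager.resource.memory-mb": cluster_mgmt_memory_gb * 1024,
    }

print(yarn_nodemanager_settings(total_cores=16, cluster_mgmt_memory_gb=32))
# {'yarn.nodemanager.resource.cpu-vcores': 14, 'yarn.nodemanager.resource.memory-mb': 32768}
```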
I hope this blog is helpful to you and you enjoyed reading it!