Refactor of Hadoop


Hadoop 1.x has some well-known problems, such as the lack of high availability (HA) and the small files problem.

Hadoop 2.x with YARN solves the HA problem, but only one master can be active and serve clients at a time; the other masters are standbys. I think this is a waste of masters.


In my proposed refactoring, the Hadoop master nodes do not store metadata themselves; instead, they add locks to files that HDFS is writing.

The metadata should be stored in a separate role called the Metadata Node cluster, which all of the masters can access.

The Metadata Node cluster can be implemented with ZooKeeper, using the ZooKeeper tree model as the file system tree model. If a znode represents a file, it should have child znodes that describe the data block information, while the file znode's own data describes the file itself: file length, access information, and so on.
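To make the layout concrete, here is a minimal in-memory sketch of that znode tree. The field names (`length`, `write_lock`, `block_id`, `datanodes`) and the one-child-per-block layout are my assumptions for illustration, not anything specified in the article:

```python
class ZNode:
    """A simplified znode: per-node data plus named children."""
    def __init__(self, data=None):
        self.data = data or {}
        self.children = {}

def create_file_znode(root, name, length, blocks):
    """Create a file znode under root; each data block becomes a child znode."""
    # The file znode's own data describes the file itself.
    file_node = ZNode(data={"length": length, "write_lock": False})
    for i, (block_id, datanodes) in enumerate(blocks):
        # One child znode per block: which block, and which DataNodes hold replicas.
        file_node.children[f"block-{i}"] = ZNode(
            data={"block_id": block_id, "datanodes": datanodes})
    root.children[name] = file_node
    return file_node

root = ZNode()
f = create_file_znode(root, "logs.txt", length=256 * 1024 * 1024,
                      blocks=[("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])])
print(len(f.children))  # 2 block children
```

In a real deployment these would be actual znodes created through a ZooKeeper client, with the file-level metadata serialized into the znode's data bytes.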

ZooKeeper natively runs as a replicated cluster, so we don't need to worry about HA.

If there are too many small files, we can use several Metadata Node clusters to store the metadata, and use a shard rule to determine which cluster a given file's metadata belongs to. It should be possible to add Metadata Node clusters to the system transparently.
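One possible shard rule is to hash the file path and take it modulo the number of clusters. The cluster names and the MD5-based scheme below are illustrative assumptions:

```python
import hashlib

# Hypothetical list of Metadata Node clusters.
METADATA_CLUSTERS = ["meta-cluster-0", "meta-cluster-1", "meta-cluster-2"]

def shard_for(path: str) -> str:
    """Pick the metadata cluster for a file by hashing its path."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return METADATA_CLUSTERS[int(digest, 16) % len(METADATA_CLUSTERS)]
```

Note that plain modulo hashing remaps most files when a cluster is added, so adding clusters "transparently" would in practice call for consistent hashing or a directory service that records each file's shard.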

When a client wants to write a file, the master should check whether a write lock flag exists in the file's metadata.

If there is no write lock flag, the master adds one to the file znode's data, and the client writes the file as usual. After the client is done, the master removes the write lock flag.

If there is a write lock flag, the master should either refuse the client's request, or delay it by putting it in a queue and making the client wait until the write lock is removed.
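The write-lock protocol above can be sketched as follows. The in-memory `metadata` dict stands in for the file znode's data, and the per-file wait queue of delayed writers is my assumption about how "delay the request" would be implemented:

```python
from collections import deque

class Master:
    """Sketch of a master granting, queuing, and releasing write locks."""
    def __init__(self):
        self.metadata = {}     # path -> file metadata (stand-in for znode data)
        self.wait_queues = {}  # path -> queued writer client ids

    def request_write(self, path, client):
        meta = self.metadata.setdefault(path, {"write_lock": False})
        if not meta["write_lock"]:
            meta["write_lock"] = True  # grant: set the flag, client writes as usual
            return "granted"
        # Delay: park the client until the current writer finishes.
        self.wait_queues.setdefault(path, deque()).append(client)
        return "queued"

    def finish_write(self, path):
        meta = self.metadata[path]
        queue = self.wait_queues.get(path)
        if queue:
            queue.popleft()             # hand the lock straight to the next writer
        else:
            meta["write_lock"] = False  # no waiters: remove the write lock flag
```

For example, a second writer on the same path is queued and only gets the lock after the first writer calls `finish_write`.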

If we use a pessimistic strategy for reads, then when a client reads a file, the master checks whether a write lock flag exists on that file. If no write lock flag exists, the client reads the file as usual; otherwise, the master warns the client that the file is locked for writing.

If we use an optimistic strategy for reads, the master ignores the write lock flag and lets the client read as usual.
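The contrast between the two read strategies is small enough to show directly, reusing the same `write_lock` flag from the file metadata:

```python
def read_pessimistic(meta):
    """Refuse the read while a writer holds the lock."""
    if meta.get("write_lock"):
        return "locked-for-writing"  # warn the client the file is being written
    return "read-ok"

def read_optimistic(meta):
    """Ignore the write lock; the reader may observe in-flight data."""
    return "read-ok"
```

The trade-off is the usual one: the pessimistic read never observes a half-written file but can be starved by writers, while the optimistic read never blocks but may return data from an unfinished write.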


After this refactoring, all of the master nodes can serve clients at the same time, and the small files problem is solved as well.

What do you think?
