
Refactor of Hadoop


Problem

Hadoop 1.x has several well-known problems, such as the lack of high availability (HA) and poor handling of too many small files.

Hadoop 2.x with YARN no longer has the HA problem. However, only one master can be active and serve clients at a time, while the other masters stay on standby. I think this wastes the standby masters.

Solution

The Hadoop master nodes should no longer store metadata. Their remaining job is to place locks on the files that HDFS is writing.

The metadata should instead be stored in a separate role, called the Metadata Node cluster, which all the masters can access.

The Metadata Node cluster can be implemented with ZooKeeper. We can use the ZooKeeper tree model as the file system tree model. If a znode represents a file, it should have child znodes that describe the file's data blocks, while the file znode's own data describes the file itself, such as its length and access information.
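The tree layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: `Znode`, `create`, and `block_ids` are hypothetical names, the tree lives in memory rather than in a real ZooKeeper ensemble, and the field names (`length`, `owner`, `size`, `datanodes`) are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Znode:
    """In-memory stand-in for a ZooKeeper znode: a data payload plus named children."""
    data: dict = field(default_factory=dict)
    children: Dict[str, "Znode"] = field(default_factory=dict)


def create(root: Znode, path: str, data: dict) -> Znode:
    """Create (or update) the znode at `path`, creating missing parents on the way."""
    node = root
    for part in path.strip("/").split("/"):
        node = node.children.setdefault(part, Znode())
    node.data = data
    return node


def block_ids(root: Znode, path: str) -> List[str]:
    """Return the sorted names of the block child znodes under a file znode."""
    node = root
    for part in path.strip("/").split("/"):
        node = node.children[part]
    return sorted(node.children)


root = Znode()
# File znode: its own data describes the file itself (hypothetical fields).
create(root, "/logs/app.log", {"length": 200, "owner": "hdfs"})
# Child znodes: each one describes a data block of the file.
create(root, "/logs/app.log/block-0", {"size": 128, "datanodes": ["dn1", "dn2"]})
create(root, "/logs/app.log/block-1", {"size": 72, "datanodes": ["dn2", "dn3"]})

print(block_ids(root, "/logs/app.log"))
```

Because the file's own attributes and its block list live at different levels of the tree, a master can read the file attributes without listing the block children.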

ZooKeeper is designed to run as a replicated cluster, so we don't need to worry about HA for the metadata.

If there are too many small files, we should use several Metadata Node clusters to store the metadata and use a shard rule to determine which cluster stores a given file's metadata. It should be possible to add a Metadata Node cluster to the system transparently.
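The article does not say what the shard rule is. One simple possibility, shown here purely as an assumption, is to hash the file path and take the result modulo the number of clusters. The function name `pick_cluster` and the cluster names are hypothetical.

```python
import hashlib
from typing import List


def pick_cluster(file_path: str, clusters: List[str]) -> str:
    """Map a file path to one Metadata Node cluster using a stable hash."""
    # sha256 gives the same result on every master, unlike Python's built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.sha256(file_path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(clusters)
    return clusters[index]


clusters = ["meta-cluster-0", "meta-cluster-1", "meta-cluster-2"]
chosen = pick_cluster("/logs/app.log", clusters)
print(chosen)
```

Note that a plain modulo rule remaps many paths whenever a cluster is added, so it does not by itself make adding a cluster transparent. A scheme such as consistent hashing would be one way to limit that movement.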

When the client wishes to write a file, the master should check if there is a write lock flag in the file metadata.

If there is no write lock flag, the master should add one to the file znode's data, and then the client writes the file as usual. After the client is done, the master removes the write lock flag.

If there is a write lock flag, the master should either refuse the client's request or delay it: put the request in a queue and let the client wait until the write lock is removed.
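The write path above can be sketched as follows. Everything here is an assumption made for illustration: the `Master` class, its method names, and the `write_lock` and `waiting` fields are hypothetical, and a plain dictionary stands in for the shared Metadata Node cluster. Handing the lock to the next queued client on release is one possible reading of "wait until the write lock is removed".

```python
from typing import Dict


class Master:
    """Hypothetical master that keeps no metadata of its own. It only reads and
    updates the shared metadata store (a dict standing in for ZooKeeper)."""

    def __init__(self, metadata: Dict[str, dict]) -> None:
        self.metadata = metadata

    def request_write(self, client: str, path: str, wait: bool = False) -> str:
        znode = self.metadata.setdefault(path, {"write_lock": None, "waiting": []})
        if znode["write_lock"] is None:
            znode["write_lock"] = client  # set the flag, then the client writes
            return "granted"
        if wait:
            znode["waiting"].append(client)  # delay: queue the request
            return "queued"
        return "refused"

    def finish_write(self, client: str, path: str) -> None:
        znode = self.metadata[path]
        if znode["write_lock"] != client:
            raise PermissionError(f"{client} does not hold the lock on {path}")
        # Pass the lock to the next waiting client, or clear the flag.
        znode["write_lock"] = znode["waiting"].pop(0) if znode["waiting"] else None


shared = {}  # the same store is visible to every master
m1, m2 = Master(shared), Master(shared)

r1 = m1.request_write("client-a", "/logs/app.log")
r2 = m2.request_write("client-b", "/logs/app.log")
r3 = m2.request_write("client-b", "/logs/app.log", wait=True)
m1.finish_write("client-a", "/logs/app.log")
holder = shared["/logs/app.log"]["write_lock"]
print(r1, r2, r3, holder)
```

In this sketch the check and the update of the flag are two separate steps, which is only safe in a single process. With several active masters the check-and-set would have to be atomic, for example by using ZooKeeper's versioned (conditional) updates.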

If we use a pessimistic strategy for reads, then when a client reads a file, the master should check whether the file has a write lock flag. If no flag exists, the client reads the file as usual; otherwise, the master warns the client that the file is locked for writing.

If we use an optimistic strategy for reads, the master can ignore the write lock flag and let the client read as usual.
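The two read strategies differ only in whether the write lock flag is consulted. A minimal sketch, assuming a hypothetical `read_file` helper and a dictionary in place of the metadata store (the `content` field is a simplification, since real file data would live in data blocks rather than in the metadata):

```python
from typing import Dict


class FileLockedError(Exception):
    """Raised when a pessimistic read hits a file that is locked for writing."""


def read_file(metadata: Dict[str, dict], path: str, strategy: str = "pessimistic") -> str:
    znode = metadata[path]
    if strategy == "pessimistic" and znode.get("write_lock"):
        raise FileLockedError(f"{path} is locked for writing")
    # Optimistic reads, and pessimistic reads of unlocked files, proceed as usual.
    return znode["content"]


metadata = {"/logs/app.log": {"write_lock": "client-a", "content": "partial data"}}

optimistic_result = read_file(metadata, "/logs/app.log", strategy="optimistic")

try:
    read_file(metadata, "/logs/app.log", strategy="pessimistic")
    pessimistic_blocked = False
except FileLockedError:
    pessimistic_blocked = True

print(optimistic_result, pessimistic_blocked)
```

The optimistic read never blocks, but it may return data from a file that is still being written. The pessimistic read avoids that at the cost of rejecting readers while the lock is held.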

Conclusion

After the refactoring, all the master nodes can serve clients at the same time, and the small file problem is addressed too, because the metadata is sharded across several Metadata Node clusters.

What is your opinion?

