
The Problem with Hadoop in HPC


When it comes to handling big data, Hadoop is a major player – but it doesn't seem to have much traction in the high-performance computing (HPC) community. In a thoughtful and detailed blog post, high-end computing enthusiast Glenn K. Lockwood dissects this disparity, tracing much of it to Hadoop's commercial origins and its intended use outside the scientific community.

I think what makes Hadoop uncomfortable to the HPC community is that, unlike virtually every other technology that has found successful adoption within research computing, Hadoop was not designed by HPC people ... By contrast, Hadoop was developed by Yahoo, and the original MapReduce was developed by Google.  They were not created to solve problems in fundamental science or national defense; they were created to provide a service for the masses.

Hadoop is also written in Java, a decision that made sense in the context of commercial applications and web services but clashes with the supercomputing world. In a field defined by the phrase "high performance," Java carries the opposite reputation: it is widely perceived as slow and inefficient.

The idea of running Java applications on supercomputers is beginning to look less funny nowadays with the explosion of cheap genome sequencing... With that being said though, Java is still a very strange way to interact with a supercomputer.  Java applications don't compile, look, or feel like normal applications in UNIX as a result of their cross-platform compatibility... For the vast majority of HPC users coming from traditional domain sciences and the professionals who support their infrastructure, Java applications remain unconventional and foreign.
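To make that contrast concrete, here is a minimal sketch of what a basic Hadoop map task looks like in Java. This is the canonical word-count mapper, not code from Lockwood's post, and the class name TokenCountMapper is only an illustrative placeholder. Instead of a C or Fortran program compiled with mpicc and launched through a batch scheduler, the logic lives in a class that extends Hadoop's Mapper type, gets packaged into a JAR, and is submitted with the hadoop jar command, after which the framework handles partitioning, shuffling, and fault tolerance.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative map task: the framework calls map() once per input record and
// takes care of distributing work, shuffling output, and retrying failed tasks.
public class TokenCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit (token, 1) into the shuffle phase
        }
    }
}

Nothing about this is wrong, but it is a long way from the compile-link-submit workflow most HPC users and the administrators who support them are used to.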

Lockwood also points out that Hadoop reinvents the wheel, re-implementing functionality that has existed in high-performance computing for decades – often in ways that frustrate supercomputing professionals.

...these poor reinventions are not the result of ignorance; rather, Hadoop's reinvention of a lot of HPC technologies arises from reason #1 above: Hadoop was not designed to run on supercomputers and it was not designed to fit into the existing matrix of technologies available to traditional HPC.  Rather, it was created to interoperate with web-oriented infrastructure.

The way Hadoop has evolved also runs counter to how technologies typically develop within high-performance computing, which may be another source of frustration: it arrived as an answer to a question HPC wasn't yet asking.

The evolution of Hadoop has very much been a backwards one; it entered HPC as a solution to a problem which, by and large, did not yet exist.  As a result, it followed a common, but backwards, pattern by which computer scientists, not domain scientists, get excited by a new toy and invest a lot of effort into creating proof-of-concept codes and use cases.  Unfortunately, this sort of development is fundamentally unsustainable because of its nucleation in a vacuum, and in the case of Hadoop, researchers moved on to the next big thing and largely abandoned their model applications as the shine of Hadoop faded.

However, there are ways to help Hadoop fit more snugly into high-performance computing, including adapting MapReduce to be more performance-oriented and bringing established HPC technologies into Hadoop itself. Ultimately, it's a matter of overcoming bias against Hadoop's origins and working out the kinks. While not front-and-center in supercomputing, Hadoop shouldn't be dismissed either – it could help solve specific (if as yet nonexistent) problems.
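As one concrete example of that kind of accommodation (a hypothetical sketch, not a prescription from Lockwood's post), a MapReduce job can be pointed at a site's existing POSIX-mounted parallel file system such as Lustre or GPFS instead of a dedicated HDFS deployment, simply by overriding Hadoop's default file system URI. The driver below reuses the mapper sketched earlier; the /lustre/scratch paths are imaginary mount points and the class name is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountOnLustre {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Bypass HDFS entirely: treat the POSIX-mounted parallel file system
        // as Hadoop's default file system.
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "word count on a parallel file system");
        job.setJarByClass(WordCountOnLustre.class);
        job.setMapperClass(TokenCountMapper.class);   // mapper sketched above
        job.setCombinerClass(IntSumReducer.class);    // sums counts map-side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical mount points; any shared POSIX path visible to all
        // compute nodes would work here.
        FileInputFormat.addInputPath(job, new Path("/lustre/scratch/corpus"));
        FileOutputFormat.setOutputPath(job, new Path("/lustre/scratch/wordcounts"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This doesn't make Hadoop an HPC-native tool, but it does let it lean on storage infrastructure a site already operates, which is the spirit of the accommodations described above.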

I think I have a pretty good idea about why Hadoop has received a lukewarm, and sometimes cold, reception in HPC circles, and many of these underlying reasons are wholly justified.  Hadoop's from the wrong side of the tracks from the purists' perspective, and it's not really changing the way the world will do its high-performance computing.  There is a disproportionate amount of hype surrounding it as a result of its revolutionary successes in the commercial data sector.

For more information, read Lockwood's original post here.
