
How Apache Hadoop 3 Adds Value Over Apache Hadoop 2


A discussion of what the Hadoop team has done to up their game with Hadoop 3, including time to market concerns, scalability issues, and more.



Thank you to Vinod Vavilapalli and Saumitra Buragohain for contributing to this blog.

Everyone is asking: what is the difference between Apache Hadoop 3 and Apache Hadoop 2? What does all this commotion and ruckus mean? What is Hadoop 3 paving the way toward?

Where to start?! Hadoop 3 combines the efforts of hundreds of contributors over the last five years since Hadoop 2 launched. Several of these committers work at Hortonworks.

Let's start with the top value propositions of Hadoop 3 and how they can help your organization.

Agility and Time to Market

While Hadoop 2 uses YARN containers for process isolation, containerization in Hadoop 3 brings agility and package isolation à la Docker. A container-based service makes it possible to build apps quickly and roll them out in minutes, which also means faster time to market for new services.

Total Cost of Ownership

Hadoop 2 has far more storage overhead than Hadoop 3. For example, in Hadoop 2, 6 blocks with 3x replication consume 18 blocks of storage space.

With erasure coding in Hadoop 3 (for example, a Reed-Solomon 6+3 scheme), those same 6 blocks occupy only 9 blocks of space: 6 data blocks plus 3 parity blocks. Instead of the 3x hit on storage, erasure coding incurs an overhead of just 1.5x while maintaining the same level of data recoverability. Storage overhead drops from 200% to 50%, roughly halving the storage cost of HDFS while retaining data durability.
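The arithmetic above can be sketched as a quick sanity check. This is an illustrative calculation, not Hadoop code; the function names are made up, and the erasure-coding parameters assume a Reed-Solomon 6+3 layout:

```python
def replication_cost(data_blocks, replicas=3):
    """Total blocks stored under plain replication."""
    return data_blocks * replicas

def erasure_coding_cost(data_blocks, data_units=6, parity_units=3):
    """Total blocks stored under a Reed-Solomon (data_units, parity_units) scheme.
    Each full stripe of data_units data blocks adds parity_units parity blocks."""
    stripes = -(-data_blocks // data_units)  # ceiling division
    return data_blocks + stripes * parity_units

blocks = 6
rep = replication_cost(blocks)    # 18 blocks stored
ec = erasure_coding_cost(blocks)  # 9 blocks stored
print(rep, ec)                             # 18 9
print((rep - blocks) / blocks * 100, "%")  # 200.0 % overhead
print((ec - blocks) / blocks * 100, "%")   # 50.0 % overhead
```

The overhead figures match the article: replication costs 2 extra blocks per data block (200%), while the 6+3 scheme costs half an extra block per data block (50%).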

Scalability and Availability

Hadoop 1 and Hadoop 2 use a single active NameNode to manage the entire namespace. Hadoop 3 supports multiple NameNodes for multiple namespaces via NameNode federation, which improves scalability.

In Hadoop 2, there is only one standby NameNode. Hadoop 3 supports multiple standby NameNodes: if one standby goes down over the weekend, the others keep the cluster operating. This feature gives you a longer servicing window.

Hadoop 2 uses the original YARN Timeline Service, which has scalability issues. Hadoop 3 introduces Timeline Service v2, which improves the scalability and reliability of the service.

New Use Cases

Hadoop 2 doesn't support GPUs. Hadoop 3 enables the scheduling of additional resources, such as disks and GPUs, for better integration with containers, deep learning, and machine learning workloads. This provides the basis for supporting GPUs in Hadoop clusters, which enhances the performance of computations required for data science and AI use cases.

Hadoop 2 cannot balance data across the disks within a node; Hadoop 3 introduces intra-node disk balancing. If you repurpose a server or add new drives alongside older, smaller-capacity drives, data ends up unevenly spread across the disks. Intra-node disk balancing redistributes the data evenly across each DataNode's disks.
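To illustrate the target state the disk balancer moves toward, here is a toy sketch (not the actual HDFS disk balancer algorithm, and the function name is invented) that distributes data in proportion to each disk's capacity:

```python
def balanced_layout(disk_capacities_gb, total_data_gb):
    """Spread total_data_gb across disks in proportion to each disk's capacity,
    which is roughly the even layout intra-node balancing aims for."""
    total_capacity = sum(disk_capacities_gb)
    return [total_data_gb * cap / total_capacity for cap in disk_capacities_gb]

# A node with two older 4 TB drives and two newly added 8 TB drives, holding 6 TB of data.
disks_gb = [4000, 4000, 8000, 8000]
print(balanced_layout(disks_gb, 6000))  # [1000.0, 1000.0, 2000.0, 2000.0]
```

Without balancing, the 6 TB might sit almost entirely on the two old drives; after balancing, each disk holds a share proportional to its size.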

Hadoop 2 has only inter-queue preemption. Hadoop 3 introduces intra-queue preemption, which takes this a step further by allowing preemption between applications within a single queue. This means that you can prioritize jobs within a queue based on user limits and/or application priority.
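The idea behind priority-based intra-queue preemption can be sketched as a toy example (this is an illustration of the concept, not YARN's actual scheduler logic; names and data are invented):

```python
def pick_preemption_victims(apps, needed_mb):
    """Pick apps to preempt within one queue, lowest priority first,
    until enough memory is freed. apps is a list of (name, priority, used_mb)."""
    victims, freed = [], 0
    for name, priority, used_mb in sorted(apps, key=lambda a: a[1]):
        if freed >= needed_mb:
            break
        victims.append(name)
        freed += used_mb
    return victims

# A high-priority app needs 4000 MB in a full queue.
queue = [("etl-job", 1, 4096), ("ad-hoc-query", 5, 2048), ("dashboard", 3, 1024)]
print(pick_preemption_victims(queue, 4000))  # ['etl-job']
```

Here the lowest-priority application in the queue gives up its resources first, which is the behavior intra-queue preemption makes possible.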

In conclusion, we are very excited about the upcoming Hadoop 3 releases. The accelerated release schedule planned for this year will bring even more capabilities into the hands of users as soon as possible. In last year's blog, Data Lake 3.0: The Ez Button To Deploy In Minutes And Cut TCO By Half, you can see many of the Data Lake 3.0 architecture innovations from the Apache Hadoop community that will come to life in our next release of the Hortonworks Data Platform.

Learn More About Hadoop 3:

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers, and partners to get answers to questions, collaborate on technical articles, and share code examples from GitHub. Join the discussion.

Topics: big data, hadoop, data scalability, data availability
