
How Apache Hadoop 3 Adds Value Over Apache Hadoop 2


Hadoop 3 combines the efforts of hundreds of contributors over the five years since Hadoop 2 launched. Learn how Hadoop 3 can help your organization.



Everyone is asking what the difference is between Apache Hadoop 3 and Apache Hadoop 2. What does all the commotion mean, and what is Hadoop 3 paving the way toward?

Where to start? Hadoop 3 combines the efforts of hundreds of contributors over the five years since Hadoop 2 launched. Several of these committers work at Hortonworks.

Let's start with the top value propositions of Hadoop 3 and how they can help your organization.

Agility and Time to Market

Although Hadoop 2 uses containers, Hadoop 3 containerization brings the agility and package isolation story of Docker. A container-based service makes it possible to build apps quickly and roll them out in minutes, which also means faster time to market for services.

Total Cost of Ownership

Hadoop 2 has a lot more storage overhead than Hadoop 3. For example, in Hadoop 2, six blocks stored with 3x replication occupy 18 blocks of space.

With erasure coding in Hadoop 3, the same six blocks occupy nine blocks of space (six data blocks plus three parity blocks). Instead of the 3x hit on storage, erasure coding incurs 1.5x while maintaining the same level of data recoverability. In other words, storage overhead drops from 200% to 50%, which halves the storage cost of HDFS while retaining data durability and translates into substantial cost savings.
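The arithmetic behind those figures can be checked with a short Python sketch. The function name `storage_footprint` and its parameters are made up for illustration and are not part of any Hadoop API. The sketch assumes the data fills exactly one group of six data blocks with three parity blocks, as in the example above.

```python
def storage_footprint(data_blocks, replication=None, parity_blocks=None):
    """Return (total_blocks, overhead_fraction) for one storage scheme.

    Illustrative helper only: pass `replication` for plain replication,
    or `parity_blocks` for an erasure-coded group.
    """
    if replication is not None:
        # Every data block is stored `replication` times.
        total = data_blocks * replication
    else:
        # Data blocks are stored once, plus the parity blocks.
        total = data_blocks + parity_blocks
    # Overhead is the extra space relative to the raw data.
    overhead = (total - data_blocks) / data_blocks
    return total, overhead


replicated = storage_footprint(6, replication=3)
erasure_coded = storage_footprint(6, parity_blocks=3)

print(replicated)     # (18, 2.0)  -> 18 blocks, 200% overhead
print(erasure_coded)  # (9, 0.5)   -> 9 blocks, 50% overhead
```

Going from 18 blocks to 9 for the same data is where the "halves the storage cost" figure comes from.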

Scalability and Availability

Hadoop 1 and Hadoop 2 deployments typically rely on a single NameNode to manage the entire namespace. Hadoop 3 builds on NameNode federation, with multiple NameNodes serving multiple namespaces, which improves scalability.

In Hadoop 2, there is only one standby NameNode. Hadoop 3 supports multiple standby NameNodes. If one standby node goes down over the weekend, you have the benefit of other standby NameNodes so the cluster can continue to operate. This feature gives you a longer servicing window.

Hadoop 2 uses an older timeline service, which has scalability issues. Hadoop 3 introduces Timeline Service v2, which improves the scalability and reliability of the timeline service.

New Use Cases

Hadoop 2 doesn't support GPUs. Hadoop 3 enables scheduling of additional resources such as disks and GPUs for better integration with containers, deep learning, and machine learning. This feature provides the basis for supporting GPUs in Hadoop clusters, which enhances the performance of computations required for data science and AI use cases.

Hadoop 2 does not support intra-node disk balancing; Hadoop 3 does. If you repurpose a server or add new storage alongside older, smaller-capacity drives, disk usage on that server becomes unevenly distributed. With intra-node disk balancing, data is spread evenly across the disks within each server.
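To make the idea concrete, here is a minimal Python sketch of what "evenly distributed" can mean for disks of different sizes: every disk ends up at the same percentage utilization. This is a toy model, not the algorithm Hadoop's disk balancer actually uses. The function `plan_rebalance`, the disk names, and the capacities are all invented for illustration.

```python
def plan_rebalance(disks):
    """Compute how much data each disk should give up or receive.

    disks: dict mapping disk name -> (capacity_gb, used_gb).
    Returns a dict mapping disk name -> GB to move, where a positive
    value means "move this much off the disk" and a negative value
    means "this disk should receive this much".
    """
    total_capacity = sum(cap for cap, _ in disks.values())
    total_used = sum(used for _, used in disks.values())
    # Target: every disk at the same fraction of its own capacity.
    target = total_used / total_capacity
    return {
        name: round(used - target * cap, 2)
        for name, (cap, used) in disks.items()
    }


# Two older 1 TB drives that are nearly full, plus a newly added empty 4 TB drive.
disks = {
    "disk1": (1000, 900),
    "disk2": (1000, 900),
    "disk3": (4000, 0),
}
plan = plan_rebalance(disks)
print(plan)
```

In this example the target utilization is 30%, so each older drive sheds 600 GB and the new drive absorbs 1200 GB.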

Hadoop 2 supports preemption only between queues. Hadoop 3 introduces intra-queue preemption, which takes this a step further by allowing preemption between applications within a single queue. This means that you can prioritize jobs within a queue based on user limits and/or application priority.
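The priority-based case can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not YARN's scheduler logic: the `App` class, the `select_preemptions` function, the application names, and the convention that a higher number means higher priority are all hypothetical, and user limits are ignored.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    priority: int    # assumption: higher number = more important
    containers: int  # containers the app currently holds


def select_preemptions(running, requester_priority, needed):
    """Choose containers to reclaim inside one queue.

    Only apps with a lower priority than the requester are candidates,
    and the lowest-priority apps give up containers first.
    Returns a dict mapping app name -> number of containers to preempt.
    """
    plan = {}
    candidates = [a for a in running if a.priority < requester_priority]
    for app in sorted(candidates, key=lambda a: a.priority):
        if needed <= 0:
            break
        take = min(app.containers, needed)
        if take > 0:
            plan[app.name] = take
            needed -= take
    return plan


# Three apps sharing one queue; a priority-4 app arrives needing 5 containers.
queue = [
    App("etl-nightly", priority=1, containers=4),
    App("adhoc-report", priority=2, containers=3),
    App("fraud-model", priority=5, containers=2),
]
plan = select_preemptions(queue, requester_priority=4, needed=5)
print(plan)  # {'etl-nightly': 4, 'adhoc-report': 1}
```

The higher-priority `fraud-model` app is never touched, which is the point of prioritizing within a queue rather than only across queues.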

In conclusion, we are very excited about the upcoming release of Hadoop 3. The accelerated release schedule planned for this year will put even more capabilities into the hands of users as soon as possible. If you look at the article published last year called Data Lake 3.0: The Ez Button to Deploy in Minutes and Cut TCO By Half, you will see many of the Data Lake 3.0 architecture ideas and innovations from the Apache Hadoop community come to life in our next release of the Hortonworks Data Platform.



Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
