Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Getting to Know the Apache Hadoop 3 Alpha

DZone's Guide to

Getting to Know the Apache Hadoop 3 Alpha

The Apache Hadoop project recently announced its 3.0.0-alpha1 release. Come check out what's new!

· Big Data Zone
Free Resource

Learn best practices according to DataOps. Download the free O'Reilly eBook on building a modern Big Data platform.

The Apache Hadoop project recently announced its 3.0.0-alpha1 release.

Given the scope of a new major release, the Apache Hadoop community decided to release a series of alpha and beta releases leading up to 3.0.0 GA. This gives downstream applications and end users an opportunity to test and provide feedback on the changes, which can be incorporated during the alpha and beta process.

The 3.0.0-alpha1 release incorporates thousands of new fixes, improvements, and features since the previous minor release, 2.7.0, which was released over a year ago. The full changelog and release notes are available on the Hadoop website, but we’d like to drill into the major new changes that landed in 3.0.0-alpha1.

Disclaimer: As this release is an alpha, there are no guarantees regarding API stability or quality. The feature set and behavior are subject to change during the course of development.

HDFS Erasure Coding

HDFS Erasure Coding is a major new feature, and one of the driving features for releasing Hadoop 3.0.0. As described in this Cloudera blog post, erasure coding can reduce storage costs by up to 50% compared to 3x replication while maintaining the same or better durability. Moreover, as shown in a joint Intel-Cloudera blog post, erasure coding adds minimal overhead when using Intel’s optimized ISA-L routines, and can actually improve read and write performance in some conditions.

Considering trends in data growth and datacenter hardware, we foresee HDFS erasure coding being an important feature in years to come.

YARN Timeline Service v.2

3.0.0-alpha1 includes an early preview of YARN Timeline Service v.2, including. This includes container events, metrics, and application-specific information like the number of map and reduce tasks.

The Timeline Service v2 is a new implementation that improves upon the scalability and reliability of TSv1, and enhances usability by introducing flows and aggregation.

More details are available in the YARN Timeline Service v.2 docs.

Shell Script Rewrite

The Hadoop shell scripts have been rewritten with an eye toward unifying behavior, addressing numerous long-standing bugs, improving documentation, as well as adding new functionality. This update affects users who use Hadoop environment variables or integrate with the shell commands.

See HADOOP-9902 for incompatible changes and other details, as well as the the Unix Shell Guide.

Support for Multiple Standby NameNodes

HDFS NameNode high availability with QuorumJournalManager uses a Paxos quorum to store the NameNode edit log. With a three-node quorum, this change means we can tolerate the loss of any one node and still continue operation.

However, business-critical deployments may wish to run with higher levels of fault-tolerance, e.g. a five-node quorum to be able to tolerate the loss of any two nodes.

QuorumJournalManager already supports an arbitrary number of nodes, but fault tolerance was limited since HDFS was only able to run a single active and single standby NameNode. Hadoop 3 eliminates this restriction by supporting running multiple standby NameNodes. This improves the fault tolerance of HDFS.

See the updated HDFS HA docs for information on configuring multiple standby NameNodes.

Java 8 Minimum Runtime Version

Another major motivation for a new major release was bumping the minimum supported Java version to Java 8. This is critically important as Java 7 was end-of-lifed in April 2015, meaning Oracle would no longer publicly support it with security fixes.

Thus, all Hadoop 3 JARs are compiled for a Java 8 language version, and require a Java 8 runtime version. This is also important because many libraries only support Java 8, so this bump allowing Hadoop to upgrade its dependencies to more modern versions.

New Default Ports for Several Services

To avoid bind errors on startup, the default ports of the NameNode, Secondary NameNode, DataNode, and KMS have been moved out of the Linux ephemeral range (32768-61000). See the release notes for HDFS-9427 and HADOOP-12811 for a full list of port changes.

This update should improve the reliability of rolling restarts on large clusters.

Intra-DataNode Balancer

Intra-DataNode balancing functionality addresses the intra-node skew that can occur when disks are added or replaced. See the disk balancer section in the HDFS Commands Guide for more information.

Stay tuned for more information on this feature in a future blog post!

Miscellaneous

The 3.0.0-alpha1 release is also the first release that incorporates a number of changes to the release process.

  • We are now using Apache Yetus (incubating) to automatically generate our release notes and changelog from JIRA, a vast improvement over our previous system of manually editing a text file.
  • Building the release artifacts has been Docker-ized and streamlined with a new create-release script, which improves correctness and enables us to make more frequent releases.
  • Hadoop’s versioning scheme has been adjusted, letting us more easily release from multiple concurrent branches in parallel (e.g. 2.6.x, 2.7.x, 3..0.x), as well as the addition of -alphaX/-betaX tags.

Conclusion

Apache Hadoop 3.0.0-alpha1 marks a major development milestone on the road to a 3.0.0 GA release. The features and improvements listed here will eventually be incorporated into CDH after our usual rigorous testing and integration process.

Find the perfect platform for a scalable self-service model to manage Big Data workloads in the Cloud. Download the free O'Reilly eBook to learn more.

Topics:
apache ,hadoop ,big data

Published at DZone with permission of Andrew Wang. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}