
Hadoop 3.0 and the Decoupling of Hadoop Compute From Storage


The network infrastructure in the typical enterprise data center of the past couldn't move large amounts of data between servers. Now, data no longer has to be co-located with the compute.


The traditional Hadoop architecture was founded upon the belief that the only way to get good performance with large-scale distributed data processing was to bring the compute to the data. And in the early part of this century, that was true. The network infrastructure in the typical enterprise data center of that time was not up to the task of moving large amounts of data between servers. The data had to be co-located with the compute.

But now, the network infrastructure in enterprise data centers, as well as that of the public cloud providers, is no longer a bottleneck for big data computing. It's time to decouple Hadoop compute from storage. I've written about this in previous blog posts, and our partners like Dell EMC have weighed in on this topic as well. Industry analysts have recognized it, too, as noted in this recent IDC report on the benefits of separating compute and storage for big data deployments:

"Decoupling compute and storage is proving to be useful in Big Data deployments. It provides increased resource utilization, increased flexibility, and lower costs." — Ritu Jyoti, IDC

In 2018, discussions about big data infrastructure no longer revolve around methods to reduce network traffic through the use of clever data placement algorithms; instead, there are now more discussions about how to reduce the cost of reliable, distributed storage.

The Hadoop open-source community has brought this discussion to the forefront with the recent introduction of Apache Hadoop version 3.0. One of the key features of Hadoop 3 is Erasure Coding for the Hadoop Distributed File System (HDFS), as an alternative to the venerable HDFS 3x data replication. Under typical configurations, Erasure Coding reduces HDFS storage cost by ~50% compared with the traditional 3x data replication.
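The arithmetic behind that ~50% figure is straightforward. Here is a minimal sketch in plain Python (no Hadoop dependencies) comparing the raw-storage overhead of 3x replication against a Reed-Solomon (6,3) layout, the data/parity split used by Hadoop 3's RS-6-3 erasure coding policy:

```python
# Compare raw storage consumed per byte of user data:
# 3x replication vs. Reed-Solomon (6 data blocks + 3 parity blocks).

def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data."""
    return (data_units + parity_units) / data_units

replication_3x = storage_overhead(1, 2)  # 1 original + 2 extra replicas = 3.0x
rs_6_3 = storage_overhead(6, 3)          # 6 data + 3 parity blocks   = 1.5x

print(f"3x replication: {replication_3x:.1f}x raw storage")      # 3.0x
print(f"RS(6,3) EC:     {rs_6_3:.1f}x raw storage")              # 1.5x
print(f"Savings:        {1 - rs_6_3 / replication_3x:.0%}")      # 50%
```

Both schemes tolerate the loss of any three storage locations holding a stripe's worth of data being reduced to recoverable pieces, but erasure coding pays half the capacity bill, which is exactly the ~50% reduction cited above.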

Over the past few years, the Hadoop community has discussed the potential storage cost savings that Erasure Coding would bring to HDFS, and many have questioned whether 3x data replication still makes sense given the advancements in hardware and networks over the last ten years. Now that it's here, HDFS Erasure Coding does fundamentally change Hadoop storage economics, as outlined in this recent article in Datanami, which includes a brief interview with yours truly. It's also an acknowledgment (finally!) by the Hadoop community that the data doesn't have to be co-located with the compute.

To appreciate just how dramatic the shift in this conversation has been, let's compare the performance numbers in a 2010 Yahoo presentation on Hadoop scaling with those in a more recent presentation describing HDFS with Erasure Coding.

In the first presentation, the Benchmarks slide refers to the DFSIO benchmark, with read throughput of 66 MB/s and write throughput of 40 MB/s. The performance numbers for the Sort benchmark are given with the caveat of being based on a "very carefully tuned user application." The use of 3x replication in HDFS was considered a powerful tool for data protection and performance.

The comparable Benchmark slide in the second presentation refers to the same DFSIO benchmark. The read throughput is 1,262 MB/s for HDFS with 3x replication compared to 2,321 MB/s on HDFS with Erasure Coding (6+3 Striping). This is with 30 simultaneous mappers and no mention of careful application tuning! The 3x replication used by HDFS is now viewed as an archaic, expensive, and unnecessary overhead for achieving (limited) data reliability.

HDFS with Erasure Coding (EC) utilizes the network for each file read and write. This is an implicit acknowledgment that the network is not a bottleneck to performance. Indeed, the primary performance impact of HDFS EC is due to its CPU cycle consumption rather than network latency. The use of Intel's open-source Intelligent Storage Acceleration Library (ISA-L) is addressing that concern; this brief video provides a high-level overview of how using EC with Intel ISA-L can ensure high performance.
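To see why decoding costs CPU cycles rather than just network round trips, consider this deliberately simplified sketch: a single-parity XOR code (far simpler than the Reed-Solomon coding HDFS EC actually uses, but the same idea) reconstructing a lost block from the survivors, which in a real cluster would be fetched over the network:

```python
# Simplified single-parity erasure code using XOR. Hadoop's EC uses
# Reed-Solomon, which generalizes this to multiple parity blocks; the
# CPU cost of the encode/decode loop below is what ISA-L accelerates.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data_blocks = [b"block-A!", b"block-B!", b"block-C!"]
parity = xor_blocks(data_blocks)  # stored alongside the data blocks

# Simulate losing block B, then rebuild it from the survivors + parity:
recovered = xor_blocks([data_blocks[0], data_blocks[2], parity])
assert recovered == b"block-B!"
```

The reconstruction touches every surviving block in the stripe, so both network reads and per-byte compute scale with stripe width; that is the trade EC makes in exchange for halving storage overhead.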

A recent blog post on Hadoop 3.0 from our partners at HPE also highlights the advantages of separating storage and compute resources for big data, including performance results with HDFS EC and the use of ISA-L. Bottom line, this demonstrates the potential for significant cost savings in storage (more than 6x lower $/TB in this case) without sacrificing performance.



Published at DZone with permission of

