Apache Spark: Resilient Distributed Datasets


Everything you need to know about RDDs.


RDDs represent both the idea of how a large dataset is represented in Apache Spark and the abstraction for working with it. This section will cover the former, and the following sections will cover the latter. According to the seminal paper on Spark, "RDDs are immutable, fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators." Let’s dissect this description to truly understand the ideas behind the RDD concept.


RDDs are designed to be immutable, which means you can't modify a specific row in the dataset an RDD represents. Instead, you call one of the available RDD operations to transform the rows into the shape you want, and that operation returns a new RDD. The original RDD stays unchanged, and the new RDD contains the transformed data. This immutability requires each RDD to carry its lineage information, which Spark leverages to provide fault tolerance efficiently.
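To make the idea concrete, here is a minimal sketch in plain Python (not Spark's actual API; `MiniRDD` is a hypothetical stand-in) showing the pattern the paragraph describes: a transformation returns a new dataset object and leaves the original untouched.

```python
# Toy sketch of RDD-style immutability (not Spark's implementation):
# transformations return a *new* dataset; the original is never mutated.
class MiniRDD:
    def __init__(self, rows):
        self._rows = tuple(rows)  # stored immutably

    def map(self, fn):
        # Returns a brand-new MiniRDD; self._rows is untouched.
        return MiniRDD(fn(r) for r in self._rows)

    def collect(self):
        return list(self._rows)

base = MiniRDD([1, 2, 3])
doubled = base.map(lambda x: x * 2)

print(base.collect())     # [1, 2, 3] -- original unchanged
print(doubled.collect())  # [2, 4, 6]
```

In real Spark code the shape is the same: `rdd.map(...)` hands back a new RDD, and the variable holding the original RDD still refers to the unmodified dataset.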

[Figure: Apache Spark architecture]


The ability to process a large dataset in parallel usually requires a cluster of machines to host the data and execute the computational logic. If one or more of those machines die or become extremely slow due to unexpected circumstances, how does that affect the overall processing? The good news is that Spark automatically handles such failures on behalf of its users by rebuilding the failed portion of the dataset from its lineage information.
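The lineage idea can be sketched as follows. This is a simplified toy model, not Spark internals; the point is that each derived dataset records its parent and the transformation that produced it, so a lost partition can be recomputed rather than restored from a replica.

```python
# Toy sketch of lineage-based recovery (not Spark internals): each dataset
# records how it was derived, so any part can be recomputed from its parent.
class LineageRDD:
    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base data, for a root dataset
        self.parent = parent  # parent dataset in the lineage chain
        self.fn = fn          # transformation used to derive this dataset

    def map(self, fn):
        # Record lineage instead of eagerly materializing results.
        return LineageRDD(parent=self, fn=fn)

    def compute(self):
        # Walking the lineage chain recomputes the data from scratch --
        # this is what lets Spark rebuild a lost partition after a failure.
        if self.parent is None:
            return list(self.source)
        return [self.fn(r) for r in self.parent.compute()]

root = LineageRDD(source=[1, 2, 3])
derived = root.map(lambda x: x + 10).map(lambda x: x * 2)
print(derived.compute())  # [22, 24, 26]
```

Because recomputation only touches the failed portion, this approach avoids the cost of replicating every intermediate dataset across the cluster.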

Parallel Data Structures

Imagine you are given a large log file, 1TB in size, and asked to find out how many log statements contain the word "exception." A slow solution would be to iterate through the file from beginning to end, checking each log statement for that word. A faster solution is to divide the 1TB file into several chunks and run that same check on each chunk in a parallelized manner, speeding up the overall processing time. Each chunk holds a collection of rows, which is simply a data structure that stores a set of rows and lets you iterate through them, and all the chunks are processed in parallel. This is where the phrase "parallel data structures" comes from.
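A small Python sketch of this chunked approach, assuming the log fits in a list of lines for illustration. Each chunk is counted independently, which is exactly what makes the work parallelizable; here a thread pool stands in for a cluster of machines.

```python
from concurrent.futures import ThreadPoolExecutor

def count_exceptions(chunk):
    # The per-chunk logic: check each log statement for the word.
    return sum(1 for line in chunk if "exception" in line.lower())

log_lines = [
    "INFO starting up",
    "ERROR NullPointerException thrown",
    "WARN slow response",
    "ERROR IOException while reading file",
]

# Split the "file" into chunks; on a cluster, each chunk would live on a
# different machine and be processed there.
chunks = [log_lines[0:2], log_lines[2:4]]

with ThreadPoolExecutor() as pool:
    total = sum(pool.map(count_exceptions, chunks))

print(total)  # 2
```

Spark's `filter` and `count` operations follow the same shape at scale: the dataset is partitioned, each partition is processed where it lives, and the partial results are combined.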

In-Memory Computing

The idea of speeding up the computation of large datasets that reside on disks in a parallelized manner using a cluster of machines was introduced in the MapReduce paper from Google. This idea was implemented and made available in the Hadoop open source project. Building on that solid foundation, RDDs push the speed boundary by introducing the ability to do distributed in-memory computation.

It is always fascinating to examine the stories that led up to the creation of an innovative idea. In the world of big data processing, once you are able to extract insights from large datasets in a reliable manner using a set of rudimentary techniques, you want to apply more sophisticated techniques and to reduce the amount of time it takes to do so. This is where distributed in-memory computation helps.

The sophisticated technique I am referring to is using machine learning to perform predictions or extract patterns from large datasets. Machine learning algorithms are iterative in nature, meaning they need to go through many iterations to arrive at an optimal state. This is where distributed in-memory computation can help reduce the completion time from days to hours. Another use case that benefits hugely from distributed in-memory computation is interactive data mining, where multiple ad hoc queries are performed on the same subset of data. If that subset is persisted in memory, those queries take seconds, not minutes, to complete.
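The benefit of persisting a working set in memory can be sketched in a few lines. This toy example is not Spark's `persist()` implementation (though Spark's `rdd.persist()`/`cache()` play exactly this role); it just shows that the expensive scan happens once, and every later query reuses the in-memory result.

```python
# Toy sketch of in-memory persistence: pay the expensive scan once,
# then serve every subsequent query from the cached working set.
_cache = None

def load_subset():
    global _cache
    if _cache is None:
        # Stand-in for an expensive scan over a large on-disk dataset.
        _cache = [x for x in range(1_000_000) if x % 7 == 0]
    return _cache

first = load_subset()   # pays the full cost of the scan
second = load_subset()  # served directly from memory

print(first is second)  # True -- the same in-memory object is reused
```

For iterative machine learning workloads the same principle applies across iterations: each pass over the training data reads from memory instead of re-reading from disk.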


