'mapPartitions' in Apache Spark: 5 Key Benefits
'mapPartitions' is a powerful transformation that gives Spark programmers the flexibility to process each data partition as a whole by writing custom logic in an ordinary single-threaded style. This article highlights five key benefits of 'mapPartitions'.
'mapPartitions' is the only narrow transformation provided by the Apache Spark framework for partition-wise processing, that is, for processing a data partition as a whole. All other narrow transformations, such as 'map' and 'flatMap', process partitions record by record. Used judiciously, 'mapPartitions' can improve the performance and efficiency of the underlying Spark job manifold.
'mapPartitions' passes an iterator over the partition's data to the computing function and expects an iterator over a new data collection as the return value. Below is the 'mapPartitions' API applicable to a Dataset of type <T>. It expects a functional interface of type 'MapPartitionsFunction' to process each data partition as a whole, along with an Encoder of type <U>, where <U> is the data type of the returned Dataset.
public <U> Dataset<U> mapPartitions(MapPartitionsFunction<T, U> f, Encoder<U> encoder)
While implementing a custom 'MapPartitionsFunction', one has to provide a partition processing routine of the following type:
Iterator<U> call(Iterator<T> input) throws Exception
As is evident from the signature of this partition processing routine, it takes an iterator over the partition's data of type <T> and, after processing, returns an iterator over a new data collection of type <U>.
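To make the routine's shape concrete, here is a minimal sketch outside Spark. The `PartitionFunction` interface below is a hypothetical stand-in for Spark's `MapPartitionsFunction`, and the normalization logic is illustrative only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PartitionRoutineSketch {
    // Hypothetical stand-in for Spark's MapPartitionsFunction<T, U>:
    // one call per partition, Iterator<T> in, Iterator<U> out.
    interface PartitionFunction<T, U> {
        Iterator<U> call(Iterator<T> input);
    }

    // Example routine: trims and upper-cases every record in the partition.
    static final PartitionFunction<String, String> NORMALIZE = input -> {
        List<String> out = new ArrayList<>();
        while (input.hasNext()) {
            out.add(input.next().trim().toUpperCase());
        }
        return out.iterator();
    };

    // Helper that drains the returned iterator into a list for inspection.
    static List<String> apply(List<String> partition) {
        List<String> result = new ArrayList<>();
        NORMALIZE.call(partition.iterator()).forEachRemaining(result::add);
        return result;
    }
}
```

In real Spark code, the framework invokes `call` once per partition and consumes the returned iterator itself; the helper here only exists so the sketch can be run standalone.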
There are several benefits of using 'mapPartitions' in a Spark application. Here are the five I feel are most important:
1. Low Processing Overhead: For data processing tasks doable via the 'map', 'flatMap', or 'filter' transformations, one can always opt for 'mapPartitions' if the operations involved are light on memory demand. This is because transformations such as 'map' and 'flatMap' incur the overhead of a function invocation for each element of the data collection residing in a partition, whereas 'mapPartitions' incurs only one function invocation for the whole partition. Further, transformations such as 'map' and 'flatMap' may require an additional 'filter' transformation to remove null records from the output, but 'mapPartitions' needs no such explicit filtering step because it can return a data collection of a size different from that of the input.
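The size-changing property can be sketched outside Spark with a plain iterator-in, iterator-out routine. The parsing example below is illustrative; it shows how invalid records are dropped inline rather than emitted as nulls for a later 'filter' step.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SizeChangingPartition {
    // Parses integers across a whole partition. Unparseable records are
    // dropped inline, so the output iterator may simply be shorter than
    // the input -- no separate filter transformation is needed.
    static Iterator<Integer> parsePartition(Iterator<String> records) {
        List<Integer> out = new ArrayList<>();
        while (records.hasNext()) {
            try {
                out.add(Integer.parseInt(records.next()));
            } catch (NumberFormatException dropped) {
                // skip the bad record instead of emitting a null to filter later
            }
        }
        return out.iterator();
    }

    // Helper that drains the returned iterator into a list for inspection.
    static List<Integer> run(List<String> partition) {
        List<Integer> result = new ArrayList<>();
        parsePartition(partition.iterator()).forEachRemaining(result::add);
        return result;
    }
}
```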
2. Combined Processing Opportunity: 'mapPartitions' provides the opportunity to combine multiple 'map', 'flatMap', or 'filter' operations into a single pass. The combined processing yields higher efficiency for Spark applications, since the overhead of setting up and managing multiple data transformation steps is avoided.
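As a simplified illustration of such fusion (again using plain iterators rather than Spark itself), the single partition-wide pass below combines what would otherwise be a 'filter' followed by a 'map':

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CombinedPartitionOps {
    // One partition-wide pass that fuses a filter (keep even numbers)
    // with a map (square them), instead of two separate transformations.
    static Iterator<Integer> filterAndSquare(Iterator<Integer> records) {
        List<Integer> out = new ArrayList<>();
        while (records.hasNext()) {
            int v = records.next();
            if (v % 2 == 0) {        // the "filter" step
                out.add(v * v);      // the "map" step
            }
        }
        return out.iterator();
    }

    // Helper that drains the returned iterator into a list for inspection.
    static List<Integer> run(List<Integer> partition) {
        List<Integer> result = new ArrayList<>();
        filterAndSquare(partition.iterator()).forEachRemaining(result::add);
        return result;
    }
}
```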
3. Stateful Partition-Wise Processing: 'mapPartitions' gives the user the power to process a data partition in a way that involves partition-specific state or correlation, meaning the processing of a record can depend on state shared across the other records of the same partition.
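A record-wise 'map' cannot look at neighboring records, but a partition-wide pass can. The sketch below (plain Java, illustrative only) collapses consecutive duplicate records within a partition; the "previous record" variable is exactly the kind of shared state this benefit refers to.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class StatefulPartition {
    // Collapses consecutive duplicates within a partition. The "prev"
    // variable is state carried across records -- inexpressible with a
    // per-record map(), but natural in a partition-wide pass.
    static Iterator<String> dedupeConsecutive(Iterator<String> records) {
        List<String> out = new ArrayList<>();
        String prev = null;
        while (records.hasNext()) {
            String cur = records.next();
            if (!cur.equals(prev)) {
                out.add(cur);
            }
            prev = cur;
        }
        return out.iterator();
    }

    // Helper that drains the returned iterator into a list for inspection.
    static List<String> run(List<String> partition) {
        List<String> result = new ArrayList<>();
        dedupeConsecutive(partition.iterator()).forEachRemaining(result::add);
        return result;
    }
}
```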
4. Efficient Local and Global Aggregation Opportunity: Although Spark provides explicit aggregation APIs for RDDs and Datasets, 'mapPartitions' can also be used to perform custom local and global aggregation more efficiently and simply than the UDAF way of implementing custom aggregation in Spark. Local aggregation is very important in aggregation (reduce) operations on large data sets because it greatly reduces the amount of shuffled data. The amount of reduction depends on how the data is distributed across partitions with respect to the aggregation key(s). Where each partition carries multiple data entries for most of the aggregation keys, local aggregation significantly reduces the amount of shuffled data. The global aggregation opportunity is available with 'mapPartitions' when the data is partitioned such that all the records to be aggregated for each aggregation key reside in a single partition.
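The local-aggregation idea can be sketched as a per-partition partial sum per key (plain Java for illustration): many (key, value) records collapse to at most one partial result per key, and only that small map would need to cross the network in a shuffle.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class LocalAggregation {
    // Per-partition pre-aggregation: folds all (key, value) records of a
    // partition into one partial sum per key. In Spark, only these partials
    // would be shuffled, not the raw records.
    static Map<String, Long> localSums(Iterator<Map.Entry<String, Long>> records) {
        Map<String, Long> partials = new HashMap<>();
        while (records.hasNext()) {
            Map.Entry<String, Long> rec = records.next();
            partials.merge(rec.getKey(), rec.getValue(), Long::sum);
        }
        return partials;
    }

    // Helper so the sketch can be exercised on an in-memory "partition".
    static Map<String, Long> run(List<Map.Entry<String, Long>> partition) {
        return localSums(partition.iterator());
    }
}
```

If the data were partitioned by the aggregation key, the same routine would already yield the global result per key, since no key would span two partitions.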
5. Avoidance of Repetitive Heavy Initialization: 'mapPartitions' also comes to the rescue in cases where a heavy initialization is required before processing the data records of a partition. Record-wise narrow transformations like 'map' and 'flatMap' become very inefficient in such cases, since they incur the major overhead of repeated initialization/de-initialization. With 'mapPartitions', the heavy initialization executes only once and suffices for all the data records residing in the partition. An example of a heavy initialization is opening a DB connection to update/insert records.
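This pattern can be sketched as follows. `HeavyResource` below is a hypothetical stand-in for something expensive like a DB connection, and the counter only exists to show that it is constructed once per partition, not once per record.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class OncePerPartitionInit {
    // Counts how often the (hypothetical) expensive resource is built.
    static final AtomicInteger INIT_COUNT = new AtomicInteger();

    // Hypothetical stand-in for an expensive resource, e.g. a DB connection.
    static class HeavyResource {
        HeavyResource() { INIT_COUNT.incrementAndGet(); }
        String handle(String record) { return "saved:" + record; }
    }

    // Partition-wide processing: the resource is built once, then reused for
    // every record in the partition (a per-record map() would pay the
    // construction cost for each record).
    static List<String> processPartition(Iterator<String> records) {
        HeavyResource resource = new HeavyResource();  // once per partition
        List<String> out = new ArrayList<>();
        while (records.hasNext()) {
            out.add(resource.handle(records.next()));
        }
        return out;
    }
}
```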
All the above benefits of 'mapPartitions' depend largely on three key factors: the type of processing, the number of partitions, and the data distribution across partitions. If the processing is favorable for 'mapPartitions', the right partitioning can be achieved in one way or another. (To learn more about tuning Spark partitioning, refer to "Guide to Spark Partitioning: Spark Partitioning Explained in Depth.")
Published at DZone with permission of Ajay Gupta. See the original article here.