Over a million developers have joined DZone.

Apache Spark: Spark Union Adds Up the Partition of Input RDDs

DZone's Guide to

Apache Spark: Spark Union Adds Up the Partition of Input RDDs

Learn about the behavior of Apache Spark's RDD partitions during a union operation and the different cases in which you might find unknown results during the operation.

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Back when I was doing a union of two-pair RDDs, I found a strange behavior regarding the number of partitions. The output RDD got a different number of partitions than the input RDD.

For example, suppose rdd1 and rdd2 each have two partitions. After the union of these RDDs, I was expecting the same number of partitions for the output RDD. But the output RDD showed the number of partitions as being the sum of the partitions of the input RDDs.

After doing some research, I found that the partitioner has an impact on this strange behavior. More details on the partitioner can be found here.

I found two cases, which are described below, in which that strange behavior can happen.

Case 1

If the partitioner is different for both input RDDs, then the output RDD will have partitions as the sum of the partitions of the input RDD. To check the partitioner, do the following:


Here, different partitioners have two meanings:

  1. Both input RDDs have different partitioners. For example, rdd1 has a hash partition and rdd2 has a range partitioner.
  2. One RDD has partitioners and the other one doesn't.


You can see in above example, that rdd1 has HashPartitioner with two partitions while rdd2 does not have any partitioner with four partitions. After the union of these two, the output rdd3 has 6 (2 + 4) partitions.

Case 2

If the partitioner is the same for both input RDDs but the number of partitions is different, then the output RDD will get the partitions as the sum of the partitions of input RDDs.


As you can see in the above example, both input RDDs have the same partitioner but with a different number of partitions — rdd1 has 2 while rdd2 has 3. So, after union, the output rdd3 has 5 (2 + 3) partitions.

You can see below that what happens when both partitioners and the number of partitions are the same for both input RDDs.


In above example, the output rdd3 has the same number of partitions as the input RDD.

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

big data ,data science ,rdds ,apache spark ,unions

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}