Apache Spark: Spark Union Adds Up the Partition of Input RDDs

Learn how Apache Spark sets the number of RDD partitions during a union operation, and the cases in which the result may be unexpected.

A while back, when I was doing a union of two pair RDDs, I noticed strange behavior in the number of partitions: the output RDD had a different number of partitions than the input RDDs.

For example, suppose rdd1 and rdd2 each have two partitions. I expected the output RDD of their union to have two partitions as well. Instead, its partition count was the sum of the input RDDs' partition counts.

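After doing some research, I found that the partitioner drives this behavior: union keeps the input partition count only when every input RDD has the same partitioner, and otherwise it simply concatenates the partitions of the inputs. More details on the partitioner can be found here.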

I found two cases, described below, in which this behavior occurs.

Case 1

If the two input RDDs have different partitioners, the output RDD's partition count is the sum of the input RDDs' partition counts. You can check an RDD's partitioner through its partitioner field, which is an Option that is empty when the RDD has no partitioner.
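As a minimal sketch of that check, assuming Spark is on the classpath and a local master is acceptable: the names plain and hashed, the sample data, and the local[2] setting are illustrative assumptions, not taken from the original article.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuse the running context (e.g. in spark-shell) or start a local one.
val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-check")
val sc = SparkContext.getOrCreate(conf)

// A pair RDD built with parallelize carries no partitioner.
val plain = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")), 4)
println(plain.partitioner)          // None

// partitionBy attaches the given partitioner to the resulting RDD.
val hashed = plain.partitionBy(new HashPartitioner(2))
println(hashed.partitioner.isDefined) // true
println(hashed.getNumPartitions)      // 2
```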

Here, "different partitioners" can mean one of two things:

  1. The input RDDs have partitioners of different kinds. For example, rdd1 has a hash partitioner and rdd2 has a range partitioner.
  2. One RDD has a partitioner and the other has none.
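The second meaning can be sketched in Scala as follows. This is an illustrative reconstruction under the same assumptions as before (local master, made-up sample data), using a HashPartitioner with two partitions for rdd1 and an unpartitioned rdd2 with four partitions.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuse the running context (e.g. in spark-shell) or start a local one.
val conf = new SparkConf().setMaster("local[2]").setAppName("union-case-1")
val sc = SparkContext.getOrCreate(conf)

// rdd1: hash-partitioned into 2 partitions.
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")), 4)
  .partitionBy(new HashPartitioner(2))

// rdd2: 4 partitions, no partitioner.
val rdd2 = sc.parallelize(Seq((3, "c"), (4, "d")), 4)

val rdd3 = rdd1.union(rdd2)

println(rdd1.partitioner.isDefined) // true
println(rdd2.partitioner)           // None
println(rdd3.getNumPartitions)      // 6
println(rdd3.partitioner)           // None: the union drops the partitioner
```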

Take rdd1 with a HashPartitioner and two partitions, and rdd2 with four partitions and no partitioner. After the union of these two, the output rdd3 has 6 (2 + 4) partitions.

Case 2

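If both input RDDs use the same kind of partitioner but with a different number of partitions, the output RDD again gets the sum of the input RDDs' partition counts. Spark treats two HashPartitioners as equal only when their partition counts match, so this is effectively a special case of Case 1.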
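A small Scala sketch of this case, again with illustrative data and a local master as assumptions: both RDDs use a HashPartitioner, with 2 and 3 partitions respectively.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuse the running context (e.g. in spark-shell) or start a local one.
val conf = new SparkConf().setMaster("local[2]").setAppName("union-case-2")
val sc = SparkContext.getOrCreate(conf)

val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")), 4)
  .partitionBy(new HashPartitioner(2))
val rdd2 = sc.parallelize(Seq((3, "c"), (4, "d")), 4)
  .partitionBy(new HashPartitioner(3))

// Same partitioner class, but the instances are not equal
// because their partition counts differ.
println(rdd1.partitioner == rdd2.partitioner) // false

val rdd3 = rdd1.union(rdd2)
println(rdd3.getNumPartitions)                // 5
```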

Say both input RDDs use a HashPartitioner, but rdd1 has 2 partitions while rdd2 has 3. After the union, the output rdd3 has 5 (2 + 3) partitions.

The behavior is different when both the partitioner and the number of partitions are the same for both input RDDs.
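As a hedged Scala sketch of this situation (sample data and the local master are assumptions), both RDDs below are partitioned with a HashPartitioner of 2 partitions before the union.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Reuse the running context (e.g. in spark-shell) or start a local one.
val conf = new SparkConf().setMaster("local[2]").setAppName("union-same-partitioner")
val sc = SparkContext.getOrCreate(conf)

val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")), 4)
  .partitionBy(new HashPartitioner(2))
val rdd2 = sc.parallelize(Seq((3, "c"), (4, "d")), 4)
  .partitionBy(new HashPartitioner(2))

// Equal partitioners: same class and same partition count.
println(rdd1.partitioner == rdd2.partitioner) // true

val rdd3 = rdd1.union(rdd2)
println(rdd3.getNumPartitions)                // 2
println(rdd3.partitioner == rdd1.partitioner) // true: the partitioner is kept
```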

In that case, the output rdd3 has the same number of partitions as each input RDD.

Published at DZone with permission of Rishi Khandelwal, DZone MVB. See the original article here.
