Apache Spark Performance Tuning – Degree of Parallelism

Today we learn about improving performance and increasing speed through partition tuning in a Spark application running on YARN.

By Rathnadevi Manivannan · Jun. 30, 17 · Tutorial

This is the third article of a four-part series about Apache Spark on YARN. Apache Spark allows developers to run multiple tasks in parallel across machines in a cluster, or across multiple cores on a desktop. A partition, or split, is a logical chunk of a distributed data set. Apache Spark builds a Directed Acyclic Graph (DAG) with jobs, stages, and tasks for the submitted application. The number of tasks will be determined based on the number of partitions.

A few performance bottlenecks were identified in the SFO Fire Department call service dataset use case with the YARN cluster manager. To understand the use case and the performance bottlenecks identified, refer to our previous blog on Apache Spark on YARN – Performance and Bottlenecks. The resource planning bottleneck was addressed, and notable performance improvements were achieved in the use case Spark application, as discussed in our previous blog on Apache Spark on YARN – Resource Planning.

In this blog post, let us discuss the partition problem and how to tune the partitions of the use case Spark application.

Spark Partition Principles

The general principles to follow when tuning partitions for a Spark application are as follows (a short sketch for inspecting and adjusting partition counts appears after the list):

  • Too few partitions – Cannot utilize all cores available in the cluster.
  • Too many partitions – Excessive overhead in managing many small tasks.
  • Reasonable partitions – Helps us to utilize the cores available in the cluster and avoids excessive overhead in managing small tasks.
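
As a quick way to see where an application stands against these principles, you can inspect and adjust the partition count directly. The following is a minimal sketch in Scala, assuming a SparkSession named spark is already available and using the dataset path that appears later in this post:

val df = spark.read.option("header", "true").csv("/user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv")
println(s"Partitions at read time: ${df.rdd.getNumPartitions}")  // driven by the input splits (~128 MB each)
val repartitioned = df.repartition(23)                           // explicitly change the partition count
println(s"Partitions after repartition: ${repartitioned.rdd.getNumPartitions}")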

Understanding Use Case Performance 

The performance duration (without any performance tuning) based on different API implementations of the use case Spark application running on YARN is shown in the below diagram:

[Figure: Spark application performance with the default configuration]

The performance duration after tuning the number of executors, cores, and memory for the RDD and DataFrame implementations of the use case Spark application is shown in the below diagram:

[Figure: Spark application performance after resource tuning]

For details on tuning the number of executors, cores, and memory for the RDD and DataFrame implementations of the use case Spark application, refer to our previous blog on Apache Spark on YARN – Resource Planning.

Let us understand the Spark data partitions of the use case application and decide whether to increase or decrease the number of partitions using Spark configuration properties.

Understanding Spark Data Partitions

The two configuration properties in Spark for tuning the number of partitions at runtime are as follows:

  • spark.default.parallelism – the default number of partitions for RDDs and RDD-based shuffle operations.
  • spark.sql.shuffle.partitions – the number of partitions used when shuffling data in the DataFrame/SQL API (200 by default).
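
Both properties can be set when building the SparkSession, or passed with --conf at submission time as shown later in this post. Here is a minimal sketch in Scala; the application name and the value 23 are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("FireServiceCallAnalysis")            // illustrative name
  .config("spark.default.parallelism", "23")     // default partition count for RDDs and RDD shuffles
  .config("spark.sql.shuffle.partitions", "23")  // shuffle partition count for the DataFrame/SQL API (200 by default)
  .getOrCreate()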

The default parallelism and shuffle partition problems in both the RDD and DataFrame API-based implementations are shown in the below diagram:

[Figure: Fire service call analysis DataFrame test – stage statistics]

The count() action stage using the default parallelism (12 partitions) is shown in the below diagram:

[Figure: count() stage with 12 default partitions and 200 shuffle partitions]

From the Summary Metrics for Input Size/Records section, the Max partition size is ~128 MB.

Examining the event timeline for those 200 shuffle partition tasks shows tasks with high scheduler delay and little computation time. This indicates that 200 tasks are not necessary here, and that the shuffle partition count can be decreased to reduce the scheduler burden.

[Figure: High number of tasks problem]

The Stages view in Spark UI indicates that most of the tasks are simply launched and terminated without any computation, as shown in the below diagram:

[Figure: Number of tasks problem]

Spark Partition Tuning

Let us first decide the number of partitions based on the input dataset size. The rule of thumb for partition size while working with HDFS is 128 MB (the default HDFS block size). As our input dataset size is about 1.5 GB (1500 MB) and we are going with 128 MB per partition, the number of partitions will be:

Total input dataset size / partition size => 1500 MB / 128 MB = 11.71 ≈ 12 partitions.

This is equal to the Spark default parallelism (spark.default.parallelism) value. The metrics based on default parallelism are shown in the above section.

Now, let us perform a test by reducing the partition size and increasing the number of partitions.

Consider a partition size of 64 MB.

Number of partitions = Total input dataset size / partition size => 1500 MB / 64 MB = 23.43 ≈ 23 partitions.
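
The same arithmetic as a quick Scala sketch (the sizes in MB are taken from the text above):

val datasetSizeMb = 1500.0
val partitionsAt128Mb = math.round(datasetSizeMb / 128)  // ≈ 12 partitions
val partitionsAt64Mb  = math.round(datasetSizeMb / 64)   // ≈ 23 partitions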

The DataFrame API implementation is executed using the below partition configurations:
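
The original configuration snippet did not survive extraction from the page. For the 23-partition run, the values match the spark-submit command shown in the next section; the other partition counts tested are not recoverable here:

spark.default.parallelism=23
spark.sql.shuffle.partitions=23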

The RDD API implementation is executed using the below partition configurations:
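
This snippet was also lost in extraction. Per the note below, only the default parallelism setting applies to the RDD implementation, so for the 23-partition run the configuration reduces to:

spark.default.parallelism=23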

Note: The spark.sql.shuffle.partitions property is not applicable to the RDD API-based implementation.

Running Spark on YARN With Partition Tuning

DataFrame API Spark-Submit

./bin/spark-submit --name FireServiceCallAnalysisDataFramePartitionTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --conf spark.sql.shuffle.partitions=23 --conf spark.default.parallelism=23 --class com.treselle.fscalls.analysis.FireServiceCallAnalysisDF /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

Note: Update the values of the spark.default.parallelism and spark.sql.shuffle.partitions properties when testing with different numbers of partitions.
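
For the RDD implementation, the submission would follow the same pattern. The following is a sketch only: the driver class name FireServiceCallAnalysisRDD is an assumption, and spark.sql.shuffle.partitions is omitted because it does not apply to the RDD API:

./bin/spark-submit --name FireServiceCallAnalysisRDDPartitionTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --conf spark.default.parallelism=23 --class com.treselle.fscalls.analysis.FireServiceCallAnalysisRDD /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv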

The Stages view based on spark.default.parallelism=23 and spark.sql.shuffle.partitions=23 is shown in the below diagram:

[Figure: Stages view with 23 default and 23 shuffle partitions]

Consider the Tasks: Succeeded/Total column in the above diagram. Both the default and shuffle partition settings are applied, and the number of tasks is 23.

The count() action stage using the default parallelism (23 partitions) is shown in the below screenshot:

[Figure: count() stage with 23 partitions]

From the Summary Metrics for Input Size/Records section, the max partition size is ~66 MB.

Looking into the shuffle stage tasks, the scheduler launched 23 tasks, and most of the time is spent on shuffle read/write. There are no tasks without computation.

[Figure: Shuffle event timeline and task statistics with 23 partitions]

The output obtained after executing the Spark application with different numbers of partitions is shown in the below diagram:

[Figure: Partition tuning output]

In this blog, we discussed partition principles, examined the use case performance, decided on the number of partitions, and tuned the partitions using Spark configuration properties.

The resource planning bottleneck was addressed, and notable performance improvements were achieved in the use case Spark application, as discussed in our previous blog on Apache Spark on YARN – Resource Planning. To understand the use case and the performance bottlenecks identified, refer to our previous blog on Apache Spark on YARN – Performance and Bottlenecks. However, even after partition tuning, the overall performance of the Spark application remains largely unchanged.

In our upcoming blog, we will discuss the final bottleneck of the use case in “Apache Spark Performance Tuning – Straggler Tasks.”

The final performance achieved after resource tuning, partition tuning, and fixing the straggler tasks problem is shown in the below diagram:

[Figure: Output after the straggler task fix]


Published at DZone with permission of Rathnadevi Manivannan. See the original article here.

Opinions expressed by DZone contributors are their own.
