
Three Ways to Make Spark Data Processing Faster

We take a look at how developers and data scientists can use the popular open source Apache Spark framework to make data processing more efficient.

Ashok Reddy · Nov. 15, 18 · Analysis


Why Spark?

Apache Spark is a general-purpose, distributed data processing framework. Unlike traditional MapReduce, which reads data from disk at every step and makes processing very time consuming, Spark supports three processing options (sketched in the code after the list):

  1. In-Memory Deserialized (not efficient)

  2. In-Memory Serialized (better)

  3. Off-the-Disk (best suited for large volumes of data)
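
These three options correspond to the storage levels that can be passed to persist on an RDD. Here is a minimal Scala sketch; the application name and input path are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Minimal sketch: the three options above map to RDD storage levels.
// The application name and input path are placeholders.
val spark = SparkSession.builder().appName("storage-levels").getOrCreate()
val events = spark.sparkContext.textFile("hdfs:///data/events")

events.persist(StorageLevel.MEMORY_ONLY)        // 1. in-memory, deserialized objects
// events.persist(StorageLevel.MEMORY_ONLY_SER) // 2. in-memory, serialized (more compact)
// events.persist(StorageLevel.DISK_ONLY)       // 3. off the disk, for very large inputs

println(events.count())                         // the first action materializes the cache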

A MapReduce job is limited to two stages, which makes it clunky, whereas Spark lets developers build jobs with multiple stages, which in turn allows the scheduler to run many tasks in parallel.

Spark as a framework takes care of many aspects of clustered computation; however, applying the techniques below can help achieve better parallelism.

1. Sizing the YARN Resources

Developers should have a deep understanding of how Spark works, besides being proficient with its APIs. First, simply increasing the number of executors will not give us better parallelism unless we make sure we have enough executors for the volume of data we intend to process. Second, allocate a minimum of 4 GB and two cores to each executor, and limit the RDD partitions so their number equals the number of executors multiplied by the number of cores per executor (see the sketch below). Getting this sizing right is an iterative process, though it can be made easier with the help of monitoring tools and the Spark UI.
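
As a concrete illustration of this guideline, here is a minimal Scala sketch. The executor count and input path are placeholders; the right values depend on your data volume and the capacity of your YARN queue.

import org.apache.spark.sql.SparkSession

// Sketch of the sizing guideline above. The executor count (10) is a placeholder;
// derive it from the input volume and the capacity of your YARN queue.
val numExecutors     = 10
val coresPerExecutor = 2

val spark = SparkSession.builder()
  .appName("yarn-sizing")
  .config("spark.executor.instances", numExecutors.toString)
  .config("spark.executor.memory", "4g")                      // at least 4 GB per executor
  .config("spark.executor.cores", coresPerExecutor.toString)
  .getOrCreate()

// Keep the partition count equal to executors x cores so every core has work.
val input = spark.sparkContext
  .textFile("hdfs:///data/input")                             // hypothetical path
  .repartition(numExecutors * coresPerExecutor)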

2. Choose the Right Join

Spark Core supports a variety of joins for getting insights and deriving key metrics. Choosing the right join makes it possible to overcome uneven sharding and limited parallelism; a sketch follows the list below.

  • Make the RDDs partition-aware when using functions like reduceByKey or groupByKey.

  • Use the hash shuffle join (the default) when joining DataFrames, and make sure both sides have a more or less equal number of rows.

  • Apply a broadcast join when joining tables with very different row counts; for example, a small customers table and a large orders table.
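
The Scala sketch below illustrates the first and last bullets. The paths, table names, and the customer_id join column are assumptions made for the example.

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Sketch only; the paths, table names, and the customer_id column are illustrative.
val spark = SparkSession.builder().appName("join-choices").getOrCreate()

// Partition-aware aggregation: declaring a partitioner before reduceByKey lets
// later key-based operations reuse the same partitioning and skip a shuffle.
val partitioner = new HashPartitioner(16)
val orderTotals = spark.sparkContext
  .textFile("hdfs:///data/orders_csv")            // lines like "customerId,amount"
  .map { line =>
    val fields = line.split(",")
    (fields(0), fields(1).toDouble)
  }
  .partitionBy(partitioner)
  .reduceByKey(_ + _)

// Broadcast join: ship the small customers table to every executor instead of
// shuffling the large orders table across the cluster.
val orders    = spark.read.parquet("hdfs:///data/orders")
val customers = spark.read.parquet("hdfs:///data/customers")
val enriched  = orders.join(broadcast(customers), "customer_id")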

3. Choose the Right Data Format

Spark has built-in support for reading and writing data in standard formats like text, JSON, Parquet, etc. However, a good understanding of these formats and of partitioning techniques is necessary to avoid an unnecessary memory footprint and to save quite a lot of processing time. The sketch after the list illustrates these points.

  • Ensure blocks are a minimum of 256 MB.

  • Partition the data by a key that distributes the data evenly.

  • Use the Parquet format if you are only interested in a few columns.

  • Serialize the data (e.g., with Kryo) to make processing memory efficient.
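
Here is a minimal Scala sketch of these points; the dataset, paths, and the event_date partition column are assumptions made for illustration.

import org.apache.spark.sql.SparkSession

// Sketch only; the dataset, paths, and partition column (event_date) are illustrative.
val spark = SparkSession.builder()
  .appName("data-formats")
  // Kryo keeps serialized and shuffled data more compact than Java serialization.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val events = spark.read.json("hdfs:///data/events_json")

// Columnar Parquet lets later jobs read only the columns they need, and
// partitioning by an evenly distributed key keeps file sizes balanced.
events.write
  .partitionBy("event_date")
  .parquet("hdfs:///data/events_parquet")

// Downstream read: only the selected columns are scanned from disk.
val slim = spark.read
  .parquet("hdfs:///data/events_parquet")
  .select("event_date", "user_id")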

This is not an exhaustive list; however, these are some of the tuning techniques we applied on real workloads that helped us achieve a great performance boost.

Tags: Data Processing, Database, Apache Spark
