
Three Ways to Make Spark Data Processing Faster

We take a look at how developers and data scientists can use the popular open source Apache Spark framework to make data processing more efficient.

Why Spark?

Apache Spark is a general-purpose, distributed data processing framework. Unlike traditional MapReduce, which reads data from disk at every stage and therefore makes processing very time-consuming, Spark supports three processing options (illustrated in the sketch after this list):

  1. In-Memory Deserialized (Not Efficient)

  2. In-Memory Serialized (Better)

  3. Off-the-Disk (Best suited for large volumes of data)
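These three options map onto Spark's RDD storage levels. Below is a minimal Scala sketch, assuming a hypothetical HDFS input path, that shows how each option is selected with persist(); only one level can be active per RDD, so the alternatives are commented out.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("storage-level-demo").getOrCreate()

// Hypothetical input path, used only for illustration.
val lines = spark.sparkContext.textFile("hdfs:///data/events/*.log")

// 1. In-memory, deserialized objects: fastest to access, largest footprint.
lines.persist(StorageLevel.MEMORY_ONLY)

// 2. In-memory, serialized bytes: more compact, at a small deserialization cost.
// lines.persist(StorageLevel.MEMORY_ONLY_SER)

// 3. Off the disk: suited to volumes that do not fit in memory.
// lines.persist(StorageLevel.DISK_ONLY)

println(lines.count())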

A MapReduce job is limited to two stages per job, which makes it clunky, whereas Spark lets developers build pipelines of multiple stages, enabling the scheduler to run many tasks in parallel.
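As a rough illustration (the word-count logic and input path are placeholders), each wide transformation in the Scala chain below introduces a shuffle and therefore a new stage, and the tasks within every stage run in parallel across partitions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stage-demo").getOrCreate()
val lines = spark.sparkContext.textFile("hdfs:///data/events/*.log")

// Stage 1: narrow transformations (flatMap, map) are pipelined together.
val pairs = lines.flatMap(_.split("\\s+")).map(word => (word, 1))

// Stage 2: reduceByKey shuffles data by key, so a new stage starts here.
val counts = pairs.reduceByKey(_ + _)

// Stage 3: sortByKey shuffles again, starting a third stage.
counts.sortByKey(ascending = false).take(10).foreach(println)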

Spark as a framework takes care of many aspects of clustered computation; however, applying the techniques below can help achieve better parallelism.

1. Sizing the YARN Resources

Developers should have a deep understanding of how Spark works, beyond being proficient with its APIs. Simply increasing the number of executors will not give us better parallelism unless we make sure we have enough executors for the volume of data we intend to process. Secondly, allocate a minimum of 4GB of memory and two cores to each executor, and size the RDD partitions so their count equals the number of executors multiplied by the number of cores. Getting this sizing right is an iterative process, though it can be made easier with the help of monitoring tools and the Spark UI.
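As a sketch, the sizing above might look like the following on YARN; the executor count, memory, and partition numbers are illustrative rather than recommendations, and in practice these settings are usually passed as spark-submit flags (--num-executors, --executor-cores, --executor-memory) rather than hard-coded:

import org.apache.spark.sql.SparkSession

// Illustrative sizing only: 10 executors x 2 cores x 4GB each.
val spark = SparkSession.builder()
  .appName("yarn-sizing-demo")
  .config("spark.executor.instances", "10")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "4g")
  .getOrCreate()

// Hypothetical input path, for illustration only.
val rdd = spark.sparkContext.textFile("hdfs:///data/events/*.log")

// Match partition count to executors * cores (10 * 2 = 20 parallel tasks).
val sized = rdd.repartition(20)
println(sized.getNumPartitions)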

2. Choose the Right Join

Spark supports a variety of joins for getting insights and deriving key metrics. Leveraging the right join makes it possible to overcome uneven sharding and limited-parallelism issues.

  • Make the RDDs partition-aware when using functions like reduceByKey or groupByKey.

  • Use a shuffle join (Spark's default when neither side is small enough to broadcast) when joining DataFrames that have a more or less equal number of rows.

  • Apply a broadcast join when joining tables of very different sizes; for example, a small customers table with a much larger orders table (see the sketch after this list).
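The Scala sketch below pulls these join ideas together; the customer/order data, column names, and partition count are hypothetical and only illustrate the pattern:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-demo").getOrCreate()
import spark.implicits._

// Partition-aware aggregation: pre-partitioning the pair RDD lets
// reduceByKey combine values without shuffling the raw records again.
val orderAmounts = spark.sparkContext
  .parallelize(Seq(("cust-1", 10.0), ("cust-2", 25.0), ("cust-1", 5.0)))
  .partitionBy(new HashPartitioner(8))
val totals = orderAmounts.reduceByKey(_ + _)

// Broadcast join: ship the small customers table to every executor
// instead of shuffling the much larger orders side.
val orders = totals.toDF("customer_id", "total_amount")
val customers = Seq(("cust-1", "Alice"), ("cust-2", "Bob")).toDF("customer_id", "name")
orders.join(broadcast(customers), "customer_id").show()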

3. Choose the Right Data Format

Spark has built-in support for reading and writing data in standard formats such as text, JSON, Parquet, etc. However, a good understanding of these formats and of partitioning techniques is necessary to avoid an unnecessary memory footprint and to save quite a lot of processing time.

  • Ensure blocks are a minimum of 256 MB.

  • Partition the data by a key that distributes the data evenly.

  • Use the Parquet format if you are only interested in a few columns; its columnar layout avoids reading the rest.

  • Serialize the data with Kryo to make processing more memory efficient (a combined sketch follows this list).
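A minimal Scala sketch of the format points above, assuming a hypothetical Order case class and output path; the Parquet row-group setting echoes the 256 MB block guidance:

import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for illustration.
case class Order(customerId: String, country: String, amount: Double)

val spark = SparkSession.builder()
  .appName("format-demo")
  // Kryo keeps shuffled and cached data compact in memory.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

// Parquet row-group (block) size of 256 MB, per the guidance above.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 256 * 1024 * 1024)

val orders = Seq(Order("cust-1", "US", 10.0), Order("cust-2", "DE", 25.0)).toDS()

// Columnar Parquet output, partitioned by a key with an even spread,
// so readers that need only a few columns (or one country) skip the rest.
orders.write
  .mode("overwrite")
  .partitionBy("country")
  .parquet("hdfs:///warehouse/orders_parquet")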

This is not an exhaustive list; however, these are some of the tuning techniques we applied on real workloads, and they helped us achieve a great performance boost.
