
Making Apache Spark Four Times Faster


How to improve the performance of Apache Spark, and how Java is not good at dealing with over 100GB of memory.


This is a follow-up to my previous post, Apache Spark with Air ontime performance data.

To recap an interesting point from that post: when using all 48 cores of the server, the result was worse than with only 12 cores. I wanted to understand why, so I started digging. My primary suspicion was that Java (I never trust Java) does not deal well with 100GB of memory.

There are a few links pointing to potential issues with a huge heap:

Following the last article’s advice, I ran four instances of Spark’s slaves. This is an old technique for better utilizing resources, as a single instance often doesn’t scale well (as is well known from the old MySQL days).

I added the following to the config:

export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=12
export SPARK_WORKER_MEMORY=25g
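These three exports belong in conf/spark-env.sh for Spark’s standalone mode. A minimal sketch of applying them and restarting the cluster, assuming the stock sbin/ scripts and a SPARK_HOME variable (the file path and restart commands are standard Spark, not from the original post):

```shell
# Append the worker settings to conf/spark-env.sh (standalone mode).
cat >> "$SPARK_HOME/conf/spark-env.sh" <<'EOF'
export SPARK_WORKER_INSTANCES=4   # four worker JVMs per host
export SPARK_WORKER_CORES=12      # 4 x 12 = 48 cores total
export SPARK_WORKER_MEMORY=25g    # 4 x 25g = 100g total
EOF

# Restart the standalone cluster so the new worker layout takes effect.
"$SPARK_HOME/sbin/stop-all.sh"
"$SPARK_HOME/sbin/start-all.sh"
```

The arithmetic is the point: four workers at 12 cores and 25GB each give the same 48 cores and 100GB as before, but each JVM now manages only a quarter of the heap.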

The full description of the test can be found in my previous post Apache Spark with Air ontime performance data.

The results:

Although the results for four instances still don’t scale much beyond 12 cores, at least there is no extra penalty for using more.

It could be that the dataset is just not big enough to show the setup’s full potential.

I think this is a clear indication that with a 25GB heap, Java performs much better than with 100GB, at least with Oracle’s JDK (there are comments that a third-party commercial JDK may handle large heaps better).
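One way to check whether GC pauses on the big heap are the culprit is to enable GC logging for the executors. This is a hedged sketch: the spark.executor.extraJavaOptions property and the JDK 8 era GC flags are real, but the log path is just an example:

```shell
# Sketch: log GC activity per executor so pause times with a single
# 100GB heap can be compared against the 25GB-per-worker layout.
cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/executor-gc.log
EOF

# Long stop-the-world pauses show up as "Full GC" entries in the log,
# with their duration in seconds:
grep "Full GC" /tmp/executor-gc.log
```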

This is something to keep in mind when running Java-based servers (like Apache Spark) on high-end hardware.



Published at DZone with permission of Peter Zaitsev, DZone MVB. See the original article here.
