Making Apache Spark Four Times Faster

How to improve the performance of Apache Spark, and how Java is not good at dealing with over 100GB of memory.


This is a follow-up to my previous post, Apache Spark with Air ontime performance data.

To recap an interesting point from that post: when using 48 cores on the server, the result was worse than with 12 cores. I wanted to understand the reason, so I started digging. My primary suspicion was that Java (I never trust Java) was not good at dealing with 100GB of memory.

There are a few links pointing to the potential issues with a huge heap:

Following the last article’s advice, I ran four instances of Spark’s slaves. This is an old technique to better utilize resources, as often (as is well known from old MySQL times) one instance doesn’t scale well.

I added the following to the config:
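As an illustration only, and not the exact lines from the original setup, here is a minimal Python sketch of how one large worker might be split into four smaller ones. It assumes Spark's standalone mode, where `conf/spark-env.sh` reads `SPARK_WORKER_INSTANCES`, `SPARK_WORKER_CORES`, and `SPARK_WORKER_MEMORY`. The helpers `split_worker` and `render_spark_env` are hypothetical names, and the totals (48 cores, 100GB) are taken from the figures mentioned in this post.

```python
# Hypothetical helpers: divide a server's resources evenly among several
# Spark standalone workers and render the matching spark-env.sh lines.

def split_worker(total_cores: int, total_memory_gb: int, instances: int) -> dict:
    """Return per-worker settings for `instances` equally sized workers."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    return {
        # Number of worker processes to start on this machine.
        "SPARK_WORKER_INSTANCES": str(instances),
        # Cores each worker may hand out to executors.
        "SPARK_WORKER_CORES": str(total_cores // instances),
        # Memory each worker may hand out to executors.
        "SPARK_WORKER_MEMORY": f"{total_memory_gb // instances}g",
    }


def render_spark_env(settings: dict) -> str:
    """Format the settings as export lines for conf/spark-env.sh."""
    return "\n".join(f"export {key}={value}" for key, value in settings.items())


# Assumed totals: 48 cores and 100GB, split across four workers.
settings = split_worker(total_cores=48, total_memory_gb=100, instances=4)
print(render_spark_env(settings))
```

Note that `SPARK_WORKER_MEMORY` caps what a worker can give to its executors. The executor heap itself is set by `spark.executor.memory`, so the 25g here should be read as an upper bound per instance rather than the heap size of any one JVM.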


The full description of the test can be found in my previous post Apache Spark with Air ontime performance data.

The results:
[Figure: Apache Spark results chart]

Although the results for four instances still don’t scale much beyond 12 cores, at least there is no extra penalty for using more.

It could be that the dataset is just not big enough to show the setup’s full potential.

I think there is a clear indication that with a 25GB heap, Java performs much better than with a 100GB heap, at least with Oracle’s JDK (there are comments that a third-party commercial JDK may handle this better).

This is something to keep in mind when running Java-based servers (like Apache Spark) on high-end hardware.


Published at DZone with permission of Peter Zaitsev, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
