
Understanding Apache Spark's Execution Model Using SparkListeners – Part 1




When you execute an action on an RDD, Apache Spark runs a job that in turn triggers tasks, scheduled by DAGScheduler and TaskScheduler, respectively. These are low-level details that are often useful to understand when a seemingly simple transformation is no longer simple performance-wise and takes ages to complete.

There are a few ways to monitor Spark. The WebUI is the most obvious choice, with toDebugString and logs at the other end of the spectrum: still useful, but requiring more skill than opening a browser at http://localhost:4040 and looking at the Details for Stage page in the Stages tab for a given job.

Details for Stage in WebUI

There are, however, other ways that are not used so often, which I'm going to present in this blog post: Scheduler Listeners.

Scheduler Listeners

A scheduler listener (also known as a SparkListener) is a class that listens to execution events from Spark's DAGScheduler, the main part of Spark's execution engine. It extends org.apache.spark.scheduler.SparkListener.

A SparkListener can receive events about when applications, jobs, stages, and tasks start and complete, as well as other infrastructure-centric events like drivers being added or removed, an RDD being unpersisted, or environment properties changing. Everything the WebUI shows about the health of Spark applications and the entire infrastructure flows through these events.
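To get a feel for the callbacks, here is a minimal sketch of a custom listener that reacts to two of these events. It assumes Spark is on the classpath; the class name JobLoggingListener is made up for illustration, and you only need to override the callbacks you care about (the base class provides empty defaults):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}

// Hypothetical listener that prints a line when a job starts
// and when a stage completes. All other events fall through to
// SparkListener's empty default implementations.
class JobLoggingListener extends SparkListener {

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageIds.size} stage(s)")

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    println(s"Stage ${info.stageId} (${info.name}) completed with ${info.numTasks} task(s)")
  }
}
```

Such a listener could then be registered like any other, either via SparkContext.addSparkListener or via spark.extraListeners, as described below.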

spark.extraListeners

By default, Spark starts with no listeners but the one for the WebUI. You can, however, change the default behaviour using the spark.extraListeners setting (default: empty).

spark.extraListeners is a comma-separated list of listener class names that are registered with Spark’s listener bus when SparkContext is initialized.
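Besides the command line, the property can be set like any other Spark configuration entry, e.g. in conf/spark-defaults.conf (assuming a standard Spark distribution layout), so it applies to every application launched from that installation:

```
spark.extraListeners    org.apache.spark.scheduler.StatsReportListener
```

Since the value is a comma-separated list, several listener class names can be given at once; each class needs a zero-argument constructor so Spark can instantiate it.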

You can be informed about the extra listeners being registered in the logs as follows:

INFO SparkContext: Registered listener org.apache.spark.scheduler.StatsReportListener

StatsReportListener – Logging summary statistics

Interestingly, Spark comes with two listeners that are worth knowing about: org.apache.spark.scheduler.StatsReportListener and org.apache.spark.scheduler.EventLoggingListener.

Let’s focus on StatsReportListener first, and leave EventLoggingListener for the next blog post.

org.apache.spark.scheduler.StatsReportListener (see the class’ scaladoc) is a SparkListener that logs summary statistics when a stage completes.

It listens to SparkListenerTaskEnd and SparkListenerStageCompleted events and logs the summary at INFO level:

15/11/04 15:39:45 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@4d3956a4
15/11/04 15:39:45 INFO StatsReportListener: task runtime:(count: 8, mean: 36.625000, stdev: 5.893588, max: 52.000000, min: 33.000000)
15/11/04 15:39:45 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/11/04 15:39:45 INFO StatsReportListener: 33.0 ms 33.0 ms 33.0 ms 34.0 ms 35.0 ms 36.0 ms 52.0 ms 52.0 ms 52.0 ms
15/11/04 15:39:45 INFO StatsReportListener: task result size:(count: 8, mean: 953.000000, stdev: 0.000000, max: 953.000000, min: 953.000000)
15/11/04 15:39:45 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/11/04 15:39:45 INFO StatsReportListener: 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B
15/11/04 15:39:45 INFO StatsReportListener: executor (non-fetch) time pct: (count: 8, mean: 17.660220, stdev: 1.948627, max: 20.000000, min: 13.461538)
15/11/04 15:39:45 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/11/04 15:39:45 INFO StatsReportListener: 13 % 13 % 13 % 17 % 18 % 20 % 20 % 20 % 20 %
15/11/04 15:39:45 INFO StatsReportListener: other time pct: (count: 8, mean: 82.339780, stdev: 1.948627, max: 86.538462, min: 80.000000)
15/11/04 15:39:45 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/11/04 15:39:45 INFO StatsReportListener: 80 % 80 % 80 % 82 % 82 % 83 % 87 % 87 % 87 %

To enable the listener, register it with SparkContext, either programmatically using the SparkContext.addSparkListener(listener: SparkListener) method inside your Spark application, or at launch time with the --conf command-line option:

$ ./bin/spark-shell --conf spark.extraListeners=org.apache.spark.scheduler.StatsReportListener
...
INFO SparkContext: Registered listener org.apache.spark.scheduler.StatsReportListener

Once you do, you should see the INFO message and the above summary after every stage completes.
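The programmatic route mentioned above can be sketched as follows; this assumes you paste it into spark-shell, where sc is the ready-made SparkContext:

```scala
import org.apache.spark.scheduler.StatsReportListener

// Register the listener on the running SparkContext (sc in spark-shell).
sc.addSparkListener(new StatsReportListener)

// Any job run afterwards triggers the per-stage summary in the logs,
// e.g. a simple count over 8 partitions:
sc.parallelize(1 to 100, 8).count()
```

The difference from spark.extraListeners is timing: addSparkListener only sees events fired after the call, while listeners named in spark.extraListeners are registered during SparkContext initialization and observe the application from the start.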

With the listener, your Spark operations toolbox now has another tool to fight bottlenecks in Spark applications, besides the WebUI or logs. Happy tuning!



Published at DZone with permission of deepsense.io Blog, DZone MVB. See the original article here.

