
Understanding Apache Spark's Execution Model Using SparkListeners – Part 1


When you execute an action on an RDD, Apache Spark runs a job that is in turn split into stages by DAGScheduler and executed as tasks by TaskScheduler. These are low-level details that often become useful to understand when a seemingly simple transformation is no longer simple performance-wise and takes ages to complete.
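
Even a tiny job makes these moving parts visible. Here is a minimal sketch for spark-shell (the RDD and the numbers are made up for illustration):

val rdd = sc.parallelize(1 to 100, numSlices = 8) // 8 partitions => 8 tasks per stage
rdd.map(_ * 2).count()                            // count() is an action: it submits a job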

There are a few ways to monitor Spark. The WebUI is the most obvious choice, with toDebugString and logs at the other end of the spectrum: still useful, but requiring more skill than opening a browser at http://localhost:4040 and looking at the Details for Stage page in the Stages tab for a given job.

[Figure: Details for Stage in WebUI]

There are, however, other ways that are not used so often, and these are what I'm going to present in this blog post: Scheduler Listeners.

Scheduler Listeners

A Scheduler listener (also known as a SparkListener) is a class that listens to execution events from Spark's DAGScheduler, the main part of the execution engine in Spark. It extends org.apache.spark.scheduler.SparkListener.

A SparkListener can receive events about when applications, jobs, stages, and tasks start and complete, as well as other infrastructure-centric events like drivers being added or removed, an RDD being unpersisted, or environment properties changing. This is the very information the WebUI uses to report on the health of Spark applications and the entire infrastructure.
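
To make the event callbacks concrete, here is a minimal sketch of a custom listener (the class name MyAppListener is made up; the event classes and their fields come from org.apache.spark.scheduler):

import org.apache.spark.scheduler._

class MyAppListener extends SparkListener {
  // Called when a job starts; the event carries the job id
  // and the infos of the stages the job consists of.
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stage(s)")

  // Called when a stage completes (successfully or with a failure).
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    println(s"Stage ${stageCompleted.stageInfo.stageId} completed: ${stageCompleted.stageInfo.name}")

  // Called when an RDD is unpersisted.
  override def onUnpersistRDD(unpersistRDD: SparkListenerUnpersistRDD): Unit =
    println(s"RDD ${unpersistRDD.rddId} was unpersisted")
}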

spark.extraListeners

By default, Spark starts with no listeners but the one backing the WebUI. You can, however, change the default behaviour using the spark.extraListeners setting (default: empty).

spark.extraListeners is a comma-separated list of listener class names that are registered with Spark’s listener bus when SparkContext is initialized.
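
Setting it programmatically could look like this minimal sketch (it assumes the hypothetical MyAppListener from above is on the classpath; as far as I know, a listener registered this way needs a no-argument constructor or one accepting a SparkConf):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("listeners-demo")
  .set("spark.extraListeners", "com.example.MyAppListener") // hypothetical class
val sc = new SparkContext(conf) // the listener is registered during initialization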

Spark informs you about every extra listener being registered in the logs as follows:

INFO SparkContext: Registered listener org.apache.spark.scheduler.StatsReportListener

StatsReportListener – Logging summary statistics

Interestingly, Spark comes with two listeners that are worth knowing about: org.apache.spark.scheduler.StatsReportListener and org.apache.spark.scheduler.EventLoggingListener.

Let’s focus on StatsReportListener first, and leave EventLoggingListener for the next blog post.

org.apache.spark.scheduler.StatsReportListener (see the class’ scaladoc) is a SparkListener that logs summary statistics when a stage completes.

It listens to SparkListenerTaskEnd and SparkListenerStageCompleted events, and logs the summary at INFO level:

15/11/04 15:39:45 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@4d3956a4
15/11/04 15:39:45 INFO StatsReportListener: task runtime:(count: 8, mean: 36.625000, stdev: 5.893588, max: 52.000000, min: 33.000000)
15/11/04 15:39:45 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/11/04 15:39:45 INFO StatsReportListener: 33.0 ms 33.0 ms 33.0 ms 34.0 ms 35.0 ms 36.0 ms 52.0 ms 52.0 ms 52.0 ms
15/11/04 15:39:45 INFO StatsReportListener: task result size:(count: 8, mean: 953.000000, stdev: 0.000000, max: 953.000000, min: 953.000000)
15/11/04 15:39:45 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/11/04 15:39:45 INFO StatsReportListener: 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B
15/11/04 15:39:45 INFO StatsReportListener: executor (non-fetch) time pct: (count: 8, mean: 17.660220, stdev: 1.948627, max: 20.000000, min: 13.461538)
15/11/04 15:39:45 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/11/04 15:39:45 INFO StatsReportListener: 13 % 13 % 13 % 17 % 18 % 20 % 20 % 20 % 20 %
15/11/04 15:39:45 INFO StatsReportListener: other time pct: (count: 8, mean: 82.339780, stdev: 1.948627, max: 86.538462, min: 80.000000)
15/11/04 15:39:45 INFO StatsReportListener: 0% 5% 10% 25% 50% 75% 90% 95% 100%
15/11/04 15:39:45 INFO StatsReportListener: 80 % 80 % 80 % 82 % 82 % 83 % 87 % 87 % 87 %

To enable the listener, you register it with SparkContext. You can do that either programmatically, using the SparkContext.addSparkListener(listener: SparkListener) method inside your Spark application, or with the --conf command-line option:

$ ./bin/spark-shell --conf spark.extraListeners=org.apache.spark.scheduler.StatsReportListener
...
INFO SparkContext: Registered listener org.apache.spark.scheduler.StatsReportListener

When you do, you should see the registration INFO message and a summary like the one above after every stage completes.
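
The programmatic variant, e.g. right inside spark-shell, could look like this minimal sketch:

import org.apache.spark.scheduler.StatsReportListener

sc.addSparkListener(new StatsReportListener) // summaries are logged from now on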

With the listener, your Spark operation toolbox now has another tool to fight bottlenecks in Spark applications, besides the WebUI and logs. Happy tuning!

