
Monitoring Apache Spark Streaming: Understanding Key Metrics

Swaroop Ramachandra explains in the second part of his series how to understand key metrics for performance and health in Apache Spark.


This post is part 2 of the Monitoring Apache Spark series. Here's a link to part 1. In this post, we will dig deeper into specific Apache Spark metrics and understand their behavior. Finally, we will discuss how you can automate the task of configuring monitoring for Spark and gain visibility into key concerns.

Key Spark Metrics and Behavior Patterns

Once you have identified and broken down the Spark, infrastructure, and application components you want to monitor, you need to understand which metrics actually affect the performance of your application and your infrastructure. Let's dig deeper into some of the things you should care about monitoring.

  1. In Spark, memory-related issues are common if you haven't paid attention to memory usage while building your application. Make sure you track garbage collection and memory across the cluster on each component, specifically the executors and the driver. Garbage-collection stalls or abnormal patterns can increase back pressure.
  2. Performance issues mostly show up during shuffles. Although avoiding shuffles and reducing the amount of data moved between nodes is desirable, it is often unavoidable. Track the shuffle read and write bytes on each executor and the driver, as well as the total input bytes. As a rule of thumb: the more shuffles, the lower the performance.
  3. Track the latency of your Spark application. This is the time it takes to complete a single job on your Spark cluster.
  4. Measure the throughput of your Spark application. This gives you the number of records that are being processed per second.
  5. Monitor the number of open file descriptors. Shuffles can open so many file descriptors that you hit a "Too many open files" exception. Check your ulimit as well.
  6. Monitor scheduling delay in streaming mode. It gives the time taken by the scheduler to submit jobs to the cluster.
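As a rough illustration of the latency, throughput, and scheduling-delay metrics above, here is a minimal sketch in plain Python. The batch numbers are hypothetical; in practice they come from the Spark UI or its monitoring endpoints, and the trend window is an arbitrary choice.

```python
# Sketch: computing per-batch throughput and spotting a growing scheduling
# delay from hypothetical batch records (records, processing ms, delay ms).

def throughput(records_processed, processing_time_ms):
    """Records processed per second for a single batch."""
    if processing_time_ms <= 0:
        return 0.0
    return records_processed / (processing_time_ms / 1000.0)

def scheduling_delay_trending_up(delays_ms, window=3):
    """True if the last `window` scheduling delays are strictly increasing,
    an early sign that batches are queuing faster than they complete."""
    recent = delays_ms[-window:]
    return len(recent) == window and all(
        a < b for a, b in zip(recent, recent[1:]))

# Hypothetical batches: (records, processing time ms, scheduling delay ms).
batches = [(5000, 2000, 10), (5200, 2100, 40), (5100, 2300, 120)]
rates = [throughput(r, t) for r, t, _ in batches]
delays = [d for _, _, d in batches]
print(rates[0], scheduling_delay_trending_up(delays))
```

A strictly increasing delay over several consecutive batches is the pattern to alert on; a single slow batch is usually just noise.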

At OpsClarity, we understand Spark in depth and have built our software to handle all the aforementioned problems, with a series of alerts for performance. Here, we've curated a list of metrics you should consider monitoring and alerting on.

Spark Driver (one driver for each application):

•   Block manager: MemoryUsed

•   Block manager: MemoryRemaining

•   active jobs

•   total jobs

•   # of receivers

•   # of completed batches

•   # of running batches

•   # of records received

•   # of records processed

•   # of unprocessed batches

•   average message processing time

Spark Executor (multiple executors for each application):

•   # of RDD blocks

•   Memory Used

•   Max Memory Allocated

•   Disk Used

•   # of Active Tasks

•   # of Failed Tasks

•   # of Completed Tasks

•   Total Duration for which the Executor has run

•   Total Input Bytes

•   Total Shuffle Read Bytes

•   Total Shuffle Write Bytes

Spark Master:

•   # of Workers Alive in the Cluster

•   # of Workers Provisioned in the Cluster

•   # of Apps Running

•   # of Apps Waiting

Spark Worker:

•   # of Free Cores

•   # of Used Cores

•   # of Executors Currently Running

•   Memory Used

•   Memory Free

HDFS:

•   Live Data Nodes

•   Down Data Nodes

•   HDFS disk %

•   Corrupted Blocks

•   Under-Replicated Blocks

•   Memory %

•   Network %

•   CPU Load

•   NameNode Heap %

•   NameNode RPC

YARN Application Manager:

•   Active Nodes

•   Lost Nodes

•   Unhealthy Nodes

YARN Node Manager:

•   Available GB

•   Available VCores

•   Containers Running

•   Containers Failed
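Several of the YARN metrics above are available from the ResourceManager REST API (`GET /ws/v1/cluster/metrics`, port 8088 by default). The sketch below fetches them and turns node counts into alerts; the field names follow the Hadoop REST documentation, while the ResourceManager address and the alert rules are assumptions.

```python
import json
import urllib.request

# Sketch: read YARN cluster metrics from the ResourceManager REST API and
# produce alerts for lost or unhealthy NodeManagers.

def fetch_cluster_metrics(rm_url):
    """Fetch the clusterMetrics object from the ResourceManager."""
    with urllib.request.urlopen(rm_url + "/ws/v1/cluster/metrics") as resp:
        return json.load(resp)["clusterMetrics"]

def node_health_alerts(metrics):
    """Return human-readable alerts for lost or unhealthy nodes."""
    alerts = []
    if metrics.get("lostNodes", 0) > 0:
        alerts.append("lost nodes: %d" % metrics["lostNodes"])
    if metrics.get("unhealthyNodes", 0) > 0:
        alerts.append("unhealthy nodes: %d" % metrics["unhealthyNodes"])
    return alerts

# Usage against a live ResourceManager (hostname is a placeholder):
#   m = fetch_cluster_metrics("http://resourcemanager:8088")
#   print(node_health_alerts(m), m.get("availableMB"), m.get("availableVirtualCores"))
```

The same response carries `availableMB` and `availableVirtualCores`, which answer the "does the cluster have enough resources?" question directly.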


How Do I Troubleshoot Issues in Spark?

 While you can hit a number of bumps when running Apache Spark, here are a few things to ask yourself during the debugging process.

  1. Is the Spark pipeline healthy? Are there affected components that could lead to degraded performance?
  2. Does the cluster have enough resources (cores, memory) available for jobs to run?
  3. How well is the data parallelized? Are there too many or too few partitions?
  4. How much memory is the application consuming per executor?
  5. What is the garbage collection pattern on the executors and the driver? Heavy GC correlating with an increase in ingestion rate can wreak havoc on the executors.
  6. Are you using the right serializer?
  7. Has there been an increase in shuffle reads/writes?
  8. What is the throughput and latency of my application?
Spark topology discovered automatically, with the health of each component overlaid on top.
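Two of the knobs in the checklist above (serializer choice and partition count) can be sketched concretely. The conf key below is a standard Spark property; the "2-4 partitions per core" band is a common rule of thumb rather than a hard rule, and the helper names are hypothetical.

```python
# Sketch: serializer and partitioning heuristics from the checklist above.

SUGGESTED_CONF = {
    # Kryo is usually faster and more compact than Java serialization.
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

def suggested_partitions(total_cores, per_core=3):
    """Rule-of-thumb partition count: a few partitions per available core,
    so tasks stay small without drowning the scheduler."""
    return max(1, total_cores * per_core)

def partitioning_ok(num_partitions, total_cores, low=2, high=4):
    """True if the partition count falls in the common 2-4x-cores band."""
    return low * total_cores <= num_partitions <= high * total_cores
```

The conf entry can be passed at submit time, e.g. `--conf spark.serializer=org.apache.spark.serializer.KryoSerializer`; the partition check is something a monitoring script could run against the job's configured parallelism.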

The Spark UI provides details of a single job in great depth by breaking it down into stages and tasks. However, to get an aggregated view of how the cluster is performing, and to understand how a dynamic infrastructure like Spark, with its many moving components, affects availability and performance, you need a more complete, production-ready monitoring tool. Understanding what to alert on, and what to ignore, helps you focus on the things that matter and increases your agility in building applications on Spark.




It is important to have software (like OpsClarity) that understands the connections between components, such as Hadoop services and messaging systems like Kafka.

