This post is part 2 of the Monitoring Apache Spark series. Here's a link to part 1. In this post, we will dig deeper into the specific Apache Spark metrics and understand their behavior. Finally, we will discuss how you can automate the task of configuring monitoring for Spark and provide visibility into key concerns.
Key Spark Metrics and Behavior Patterns
Once you have identified and broken down the Spark, infrastructure, and application components you want to monitor, you need to understand which metrics actually affect the performance of your application and your infrastructure. Let’s dig deeper into what you should care about monitoring.
- Memory-related issues are common in Spark if you haven’t paid attention to memory usage when building your application. Track garbage collection and memory usage on every component across the cluster, specifically the executors and the driver. Garbage collection stalls or abnormal GC patterns can increase back pressure.
- Performance issues mostly show up during Shuffles. Avoiding shuffles and reducing the amount of data moved between nodes is desirable, but shuffles are often unavoidable. Track the shuffle read and write bytes on each executor and on the driver, as well as the total Input bytes. As a rule of thumb, the more shuffles, the lower the performance.
- Track the Latency of your Spark application. This is the time it takes to complete a single job on your Spark cluster.
- Measure the Throughput of your Spark application. This gives you the number of records that are being processed per second.
- Monitor the number of File Descriptors that are open. Shuffles sometimes open too many file descriptors, resulting in a “Too many open files” exception. Check your ulimit as well.
- Monitor Scheduling delay in streaming mode. It gives the time taken by the scheduler to submit jobs to the cluster.
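To make the latency, throughput, and scheduling delay definitions above concrete, here is a minimal Python sketch. It assumes you already collect per-batch timestamps from your streaming application; the `BatchTiming` record and the function names are hypothetical and are not part of any Spark API. Throughput is computed here as records per second of processing time, which is one reasonable definition among several.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BatchTiming:
    """Hypothetical record of one completed batch; times are epoch milliseconds."""
    submitted_ms: int   # when the batch was handed to the scheduler
    started_ms: int     # when processing of the batch began
    finished_ms: int    # when processing of the batch ended
    num_records: int    # number of records in the batch


def scheduling_delay_ms(batch: BatchTiming) -> int:
    """Time the batch waited between submission and the start of processing."""
    return batch.started_ms - batch.submitted_ms


def latency_ms(batch: BatchTiming) -> int:
    """End-to-end time from submission to completion."""
    return batch.finished_ms - batch.submitted_ms


def throughput_per_sec(batches: Iterable[BatchTiming]) -> float:
    """Records processed per second of processing time, across all batches."""
    batches = list(batches)
    total_records = sum(b.num_records for b in batches)
    busy_ms = sum(b.finished_ms - b.started_ms for b in batches)
    if busy_ms <= 0:
        return 0.0
    return total_records / (busy_ms / 1000.0)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        BatchTiming(0, 200, 1200, 5000),
        BatchTiming(1000, 1300, 2300, 3000),
    ]
    for b in sample:
        print("scheduling delay:", scheduling_delay_ms(b), "ms,",
              "latency:", latency_ms(b), "ms")
    print("throughput:", throughput_per_sec(sample), "records/sec")
```

A scheduling delay that keeps growing from batch to batch usually means batches are arriving faster than they can be processed, so it is worth alerting on the trend rather than on a single value.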
At OpsClarity, we understand Spark in depth, and we built our software to handle all of the aforementioned problems, with a series of alerts for performance. Here, we’ve curated a list of metrics you should consider monitoring and alerting on.
Spark Driver (one driver for each application):
- Block manager
- Active jobs
- Total jobs
- # of receivers
- # of completed batches
- # of running batches
- # of records received
- # of records processed
- # of unprocessed batches
- Average message processing time
Spark Executor (multiple executors for each application):
- # of RDD blocks
- Max memory allocated
- # of active tasks
- # of failed tasks
- # of completed tasks
- Total duration for which the executor has run
- Total input bytes
- Total shuffle read bytes
- Total shuffle write bytes
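Several of the executor metrics listed above can be checked programmatically. The sketch below assumes you have already fetched the executor list from Spark's monitoring REST API (the `/api/v1/applications/<app-id>/executors` endpoint) and parsed it into Python dictionaries. The sample payload is made up, and the alert thresholds are illustrative assumptions rather than Spark defaults.

```python
from typing import Any, Dict, List

# Made-up sample shaped like the executor entries returned by Spark's REST API.
# Only the fields used below are included; the values are illustrative.
SAMPLE_EXECUTORS: List[Dict[str, Any]] = [
    {"id": "driver", "memoryUsed": 100, "maxMemory": 1000, "failedTasks": 0,
     "totalDuration": 0, "totalGCTime": 0},
    {"id": "1", "memoryUsed": 950, "maxMemory": 1000, "failedTasks": 0,
     "totalDuration": 60000, "totalGCTime": 9000},
    {"id": "2", "memoryUsed": 300, "maxMemory": 1000, "failedTasks": 3,
     "totalDuration": 60000, "totalGCTime": 1200},
]

GC_FRACTION_LIMIT = 0.10      # assumed threshold: GC time / task time
MEMORY_FRACTION_LIMIT = 0.90  # assumed threshold: storage memory used / max


def executor_warnings(executor: Dict[str, Any]) -> List[str]:
    """Return human-readable warnings for a single executor entry."""
    warnings: List[str] = []

    duration = executor.get("totalDuration", 0)
    gc_time = executor.get("totalGCTime", 0)
    if duration > 0 and gc_time / duration > GC_FRACTION_LIMIT:
        warnings.append("high GC time")

    max_memory = executor.get("maxMemory", 0)
    used = executor.get("memoryUsed", 0)
    if max_memory > 0 and used / max_memory > MEMORY_FRACTION_LIMIT:
        warnings.append("storage memory nearly full")

    if executor.get("failedTasks", 0) > 0:
        warnings.append("failed tasks")

    return warnings


def scan_executors(executors: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map executor id to its warnings, skipping executors with none."""
    report: Dict[str, List[str]] = {}
    for executor in executors:
        found = executor_warnings(executor)
        if found:
            report[executor["id"]] = found
    return report


if __name__ == "__main__":
    for executor_id, found in scan_executors(SAMPLE_EXECUTORS).items():
        print(executor_id, "->", ", ".join(found))
```

The same pattern extends to the shuffle and input byte counters: compare each executor against the cluster average to spot skew, rather than relying on fixed limits.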
Spark Master:
- # of Workers Alive in the Cluster
- # of Apps Running and Waiting
- # of Apps Waiting
- # of Workers Provisioned in the Cluster
Spark Worker:
- # of Free Cores
- # of Used Cores
- # of Executors Currently Running
Hadoop services:
- Live Data Nodes
- Down Data Nodes
- HDFS disk %
- NameNode Heap %
- Application Manager
How Do I Troubleshoot Issues in Spark?
While you can hit a number of bumps when running Apache Spark, here are a few things to ask yourself during the debugging process.
- Is the Spark pipeline healthy? Are there affected components that could lead to degraded performance?
- Does the cluster have enough resources (cores, memory) available for jobs to run?
- How well is the data parallelized? Are there too many or too few partitions?
- How much memory is the application consuming per executor?
- What is the garbage collection pattern on the executors and the driver? Too much GC that correlates with an increase in ingestion rate can wreak havoc on the executors.
- Are you using the right serializer?
- Has there been an increase in shuffle reads / writes?
- What are the throughput and latency of your application?
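One of the questions above, whether the data has too many or too few partitions, can be turned into a rough check. The sketch below assumes the guideline from the Spark tuning guide of roughly 2 to 3 tasks per CPU core. The function names and the labels they return are hypothetical, and the right numbers depend on your workload, so treat the result as a starting point for investigation.

```python
from typing import Tuple


def recommended_partition_range(total_cores: int,
                                min_per_core: int = 2,
                                max_per_core: int = 3) -> Tuple[int, int]:
    """Return a (low, high) partition count based on tasks-per-core guidance."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return total_cores * min_per_core, total_cores * max_per_core


def assess_partitions(num_partitions: int, total_cores: int) -> str:
    """Classify a partition count against the recommended range."""
    low, high = recommended_partition_range(total_cores)
    if num_partitions < low:
        return "too few"    # some cores are likely to sit idle
    if num_partitions > high:
        return "too many"   # per-task scheduling overhead may dominate
    return "ok"


if __name__ == "__main__":
    # Hypothetical cluster with 16 executor cores in total.
    for partitions in (8, 40, 500):
        print(partitions, "partitions:", assess_partitions(partitions, 16))
```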
The Spark UI shows a single job in great depth by breaking it down into stages and tasks. However, if you need an aggregated view of how the cluster is performing, and want to understand how a dynamic infrastructure like Spark, with its many moving components, affects availability and performance, you need a more complete, production-ready monitoring tool. Knowing what is worth alerting on and what is not helps you focus on the things that matter and build applications on Spark faster.
It is also important to have software (like OpsClarity) that connects Spark to related components, such as Hadoop services and messaging systems like Kafka.