Monitoring Apache Spark Streaming: Understanding Key Metrics

To monitor Apache Spark effectively, you need to know what you can and should monitor. Read on to learn the key metrics for Spark installations.

By Swaroop Ramachandra · Aug. 03, 16 · Tutorial


This post is part 2 of the Monitoring Apache Spark series. In part 1, Monitoring Apache Spark: Why Is It Challenging?, we discussed the basics of Spark architecture and why it is challenging to monitor a distributed and dynamic Spark cluster. In this post, we will dig deeper into specific Apache Spark metrics and understand their behavior. Finally, we will discuss how you can automate the task of configuring monitoring for Spark and gain visibility into key concerns.

Key Spark Metrics and Behavior Patterns

Once you have identified the Spark, infrastructure, and application components you want to monitor, you need to understand which metrics actually affect the performance of your application and your infrastructure. Let's dig deeper into the things you should monitor.

  1. In Spark, memory-related issues are common if you haven't paid attention to memory usage when building your application. Make sure you track garbage collection and memory across the cluster on each component, specifically the executors and the driver. Garbage collection stalls or abnormal GC patterns can increase back pressure.

  2. Performance issues mostly show up during shuffles. Avoiding shuffles and reducing the amount of data moved between nodes is desirable but often unavoidable. Track the shuffle read and write bytes on each executor and the driver, as well as the total input bytes. As a rule of thumb, the more shuffling, the lower the performance.

  3. Track the latency of your Spark application. This is the time it takes to complete a single job on your Spark cluster.
  4. Measure the throughput of your Spark application. This gives you the number of records processed per second.
  5. Monitor the number of open file descriptors. Shuffles sometimes cause too many file descriptors to be opened, resulting in a "Too many open files" exception. Check your ulimit as well.
  6. In streaming mode, monitor the scheduling delay: the time taken by the scheduler to submit jobs to the cluster. (See the sketch after this list for one way to track several of these metrics.)
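
Latency, throughput, and scheduling delay (items 3, 4, and 6) can be tracked per batch through Spark Streaming's listener API, and GC visibility (item 1) can be switched on via executor JVM options. Below is a minimal Scala sketch; the application name, batch interval, socket source, and GC flags are illustrative placeholders rather than recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs scheduling delay, processing time, and throughput for every completed batch.
class BatchMetricsListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    val schedulingDelayMs = info.schedulingDelay.getOrElse(0L) // time spent waiting on the scheduler
    val processingTimeMs  = info.processingDelay.getOrElse(0L) // time spent actually running the batch
    val records = info.numRecords
    val throughput =
      if (processingTimeMs > 0) records * 1000.0 / processingTimeMs else 0.0
    println(f"batch=${info.batchTime} schedulingDelay=${schedulingDelayMs}ms " +
      f"processingTime=${processingTimeMs}ms records=$records throughput=$throughput%.1f records/s")
  }
}

object MonitoredStreamingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("monitored-streaming-app")
      // Surfaces executor GC activity in the executor logs (metric 1 above).
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val ssc = new StreamingContext(conf, Seconds(10)) // illustrative 10s batch interval
    ssc.addStreamingListener(new BatchMetricsListener)
    // Placeholder input so the context has something to run; replace with your real sources.
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```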

At OpsClarity, we understand Spark in depth, and alerts are automatically configured for you: when throughput or latency changes abnormally, when garbage collection and memory usage patterns go awry, or when shuffles are heavy, you get alerted. We do this through anomaly detection models baked into the product. As mentioned before, this also makes it easy to identify where the bottlenecks are, instead of jumping from component to component trying to find the root cause of an issue. Below is a curated list of metrics you should consider monitoring and alerting on; these are automatically made available with OpsClarity.

Spark Driver (one driver for each application)

  • Block manager: MemoryUsed, MemoryRemaining
  • Scheduler: active jobs, total jobs
  • Streaming: # of receivers, # of completed batches, # of running batches, # of records received, # of records processed, # of unprocessed batches, average message processing time

Spark Executor (multiple executors for each application)

  • # of RDD blocks
  • Memory Used
  • Max Memory Allocated
  • Disk Used
  • # of Active Tasks
  • # of Failed Tasks
  • # of Completed Tasks
  • Total Duration for which the Executor has run
  • Total Input Bytes
  • Total Shuffle Read Bytes
  • Total Shuffle Write Bytes

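Most of the executor metrics above are also exposed by Spark's monitoring REST API on the driver UI (default port 4040, available since Spark 1.4). Here is a rough Scala sketch of pulling them; the host and port are placeholders, and a production collector would use a proper JSON library rather than a regex.

```scala
import scala.io.Source

// Pulls per-executor metrics (rddBlocks, memoryUsed, diskUsed, totalInputBytes,
// totalShuffleRead, totalShuffleWrite, ...) from the Spark REST API.
object ExecutorMetricsFetcher {
  def main(args: Array[String]): Unit = {
    val driverUi = "http://localhost:4040" // driver UI; adjust for your deployment

    // Find the running application's ID.
    val appsJson = Source.fromURL(s"$driverUi/api/v1/applications").mkString
    val appId = """"id"\s*:\s*"([^"]+)"""".r
      .findFirstMatchIn(appsJson)
      .map(_.group(1))
      .getOrElse(sys.error("no running application found"))

    // One JSON object per executor (plus one for the driver).
    val executorsJson = Source.fromURL(s"$driverUi/api/v1/applications/$appId/executors").mkString
    println(executorsJson)
  }
}
```
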
Spark Master

  • # of Workers Alive in the Cluster
  • # of Apps Running and Waiting
  • # of Apps Waiting
  • # of Workers Provisioned in the Cluster

Spark Worker

  • # of Free Cores
  • # of Used Cores
  • # of Executors Currently Running
  • Memory Used
  • Memory Free

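The standalone master and worker metrics above are served as JSON by Spark's built-in MetricsServlet, which is enabled by default on the master and worker web UIs (typically ports 8080 and 8081). A hedged sketch; hostnames and ports will vary with your deployment.

```scala
import scala.io.Source

// Dumps the standalone master's metrics (workers, aliveWorkers, apps,
// waitingApps, ...) and a worker's metrics (coresUsed, coresFree,
// memUsed_MB, memFree_MB, executors, ...) as JSON.
object StandaloneMetricsDump {
  def main(args: Array[String]): Unit = {
    println(Source.fromURL("http://localhost:8080/metrics/master/json/").mkString)
    println(Source.fromURL("http://localhost:8081/metrics/json/").mkString)
  }
}
```
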
HDFS

  • Live Data Nodes
  • Down Data Nodes
  • HDFS Disk %
  • Corrupted Blocks
  • Under-Replicated Blocks
  • Memory %
  • Network %
  • CPU Load
  • NameNode Heap %
  • NameNode RPC

YARN

  • Application Manager: Launches, Registrations
  • Node Manager: Active Nodes, Lost Nodes, Unhealthy Nodes
  • Containers: Available GB, Available VCores, Containers Running, Containers Failed
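
The YARN column maps almost one-to-one onto the ResourceManager's cluster metrics REST endpoint, so the same pull-based approach works there too. A minimal sketch, assuming the ResourceManager web UI on its default port 8088.

```scala
import scala.io.Source

// Fetches cluster-wide YARN metrics (activeNodes, lostNodes, unhealthyNodes,
// availableMB, availableVirtualCores, containersAllocated, ...) from the
// ResourceManager REST API.
object YarnClusterMetrics {
  def main(args: Array[String]): Unit = {
    val rm = "http://localhost:8088" // ResourceManager address; adjust for your cluster
    println(Source.fromURL(s"$rm/ws/v1/cluster/metrics").mkString)
  }
}
```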


How Do I Troubleshoot Issues in Spark?

While you can hit a number of bumps when running Apache Spark, here are a few questions to ask yourself during the debugging process.

  1. Is the Spark pipeline healthy? Are there affected components that could lead to degraded performance?
  2. Does the cluster have enough resources (cores, memory) available for jobs to run?
  3. How well is the data parallelized? Are there too many or too few partitions?
  4. How much memory is the application consuming per executor?
  5. What is the garbage collection pattern on the executors and the driver? Heavy GC correlating with an increase in ingestion rate can wreak havoc on the executors.
  6. Are you using the right serializer? (See the sketch after this list.)
  7. Has there been an increase in shuffle reads/writes?
  8. What is the throughput and latency of my application?
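
Two of these questions, partitioning (3) and serialization (6), are commonly addressed directly in application code. The sketch below is illustrative: Kryo is generally faster and more compact than the default Java serializer, and the input path and partition count shown are placeholders that must be tuned to your own cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tuning-sketch")
      // Kryo usually beats the default Java serializer for shuffle-heavy jobs.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///data/events") // placeholder input path

    // Too few partitions underuse the cluster; too many add scheduling overhead.
    // 200 is illustrative -- a common starting point is 2-3x the total executor cores.
    val repartitioned = data.repartition(200)
    println(s"partitions: ${repartitioned.getNumPartitions}")

    sc.stop()
  }
}
```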

The Spark UI provides details of a single job in great depth by breaking it down into stages and tasks. However, to get an aggregated view of how the cluster is performing, and to understand how a dynamic infrastructure like Spark, with its many moving components, affects availability and performance, you need a more complete, production-ready monitoring tool. Understanding what is and is not important to be alerted on can greatly help you focus on the things that matter and increase your agility in building applications on Spark.

At OpsClarity, we not only discover and automatically monitor Spark but also its connected components, such as Hadoop services and messaging systems like Kafka. With an understanding of how your components are connected (the Operational Knowledge Graph), OpsClarity can help you seamlessly monitor performance and availability across different services.

Published at DZone with permission of Swaroop Ramachandra, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
