Understanding Apache Spark Failures and Bottlenecks

Section 1

Introduction to Spark Performance

Apache Spark is a powerful open-source distributed computing framework for scalable and efficient analysis of big data apps running on commodity compute clusters. Spark provides a framework for programming entire clusters with built-in data parallelism and fault tolerance while hiding the underlying complexities of using distributed systems.

Spark has seen a massive spike in adoption by enterprises across a wide swath of verticals, applications, and use cases. Spark provides speed (up to 100x faster in-memory execution than Hadoop MapReduce) and easy access to all Spark components (write apps in R, Python, Scala, and Java) via unified high-level APIs. Spark also handles a wide range of workloads (ETL, BI, analytics, ML, graph processing, etc.) and performs interactive SQL queries, batch processing, streaming data analytics, and data pipelines. Spark is also replacing MapReduce as the processing engine component of Hadoop.

Spark applications are easy to write and easy to understand when everything goes according to plan. However, it becomes very difficult when Spark applications start to slow down or fail. Sometimes a well-tuned application might fail due to a data change or a data layout change. Sometimes an application which had been running well so far, starts behaving badly due to resource starvation. The list goes on and on.

It's not only important to understand a Spark application, but also its underlying runtime components like disk usage, network usage, contention, etc., so that we can make an informed decision when things go bad.

Section 2

Challenges of Monitoring and Tuning Spark

Building big data apps on Spark that monetize and extract business value from data have become a default standard in larger enterprises.

While Spark offers tremendous ease of use for developers and data scientists, deploying, monitoring, and optimizing production apps can be an altogether complex and cumbersome exercise. These create significant challenges for the operations team (and end-users) who are responsible for managing the big data apps holistically, while addressing many of the business requirements around SLA Management, MTTR, DevOps productivity, etc.

Tools such as Apache Ambari and Cloudera Manager primarily provide a systems view point to administer the cluster and measure metrics related to service health/performance and resource utilization. They only provide high-level metrics for individual jobs and point you to relevant sections in YARN or Spark Web UI for further debugging and troubleshooting. A guided path to address issues related to missed SLAs, performance, failures, and resource utilization for big data apps remains a huge gap in the ecosystem.

This is a preview of the Understanding Apache Spark Failures and Bottlenecks Refcard. To read the entire Refcard, please download the PDF from the link above.

Understanding Apache Spark Failures and Bottlenecks

Introduction to Spark Performance

Challenges of Monitoring and Tuning Spark

Like This Refcard? Read More From DZone