Introduction to Spark Performance
Apache Spark is a powerful open-source distributed computing framework for scalable, efficient analysis of big data on commodity compute clusters. It provides an interface for programming entire clusters with built-in data parallelism and fault tolerance, while hiding the underlying complexity of distributed systems.
Spark has seen a sharp rise in adoption by enterprises across a wide range of industries, applications, and use cases. It offers speed (advertised as up to 100x faster than Hadoop MapReduce for some in-memory workloads) and easy access to all Spark components through unified high-level APIs, with applications written in R, Python, Scala, or Java. Spark also handles a wide range of workloads (ETL, BI, analytics, ML, graph processing, etc.), including interactive SQL queries, batch processing, streaming analytics, and data pipelines. In many Hadoop deployments, Spark is also replacing MapReduce as the processing engine.
Spark applications are easy to write and easy to understand when everything goes according to plan. However, they become very difficult to diagnose when they start to slow down or fail. Sometimes a well-tuned application fails because the data or its layout has changed. Sometimes an application that had been running well starts behaving badly because of resource starvation. The list goes on and on.
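As a rough illustration of how a data change alone can upset a well-tuned job, the sketch below simulates key-based partitioning in plain Python. It does not use Spark: the function names partition_sizes and skew_ratio, the simple modulo partitioner, and the sample key distributions are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition.

    Assumption: integer keys and a simple modulo partitioner,
    used here as a stand-in for key-based partitioning in general.
    """
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size.

    A value of 1.0 means the records are spread perfectly evenly.
    """
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical "before" data: 1000 records with evenly spread keys.
even_keys = list(range(1000))

# Hypothetical "after" data: same volume, but 700 records share one key.
skewed_keys = [7] * 700 + list(range(300))

even_sizes = partition_sizes(even_keys, 4)
skewed_sizes = partition_sizes(skewed_keys, 4)

print("even:  ", even_sizes, "skew ratio:", skew_ratio(even_sizes))
print("skewed:", skewed_sizes, "skew ratio:", skew_ratio(skewed_sizes))
```

With these made-up inputs, the even data set places 250 records in each of the four partitions, while the skewed one places 775 records in a single partition. The total volume is unchanged, yet one task would carry roughly three times the average load, which is the kind of shift that can slow down or break a job that was tuned for the earlier data.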
It is important to understand not only the Spark application itself but also its underlying runtime behavior, such as disk usage, network usage, and resource contention, so that we can make informed decisions when things go wrong.