The Complete Apache Spark Collection [Tutorials and Articles]
In this edition of "Best of DZone," we've compiled our best tutorials and articles on one of the most popular analytics engines for data processing, Apache Spark. Whether you're a beginner or a long-time user who has run into the inevitable bottlenecks, we've got your back!
Before we begin, we'd like to thank those who were a part of this article. DZone has been and continues to be a community powered by contributors like you who are eager and passionate to share what they know with the rest of the world.
Let's get started!
Apache Spark Tutorial (Fast Data Architecture Series) by Bill Ward — In this article, a data scientist and developer gives an Apache Spark tutorial that demonstrates how to get Apache Spark installed.
Overview of the Apache Spark Ecosystem by Frank Evans — Make the under-the-hood elements of Spark less of a mystery and transfer existing programming knowledge and methods into the power of the Spark engine.
Spark vs Kafka vs Flink
Spark Streaming vs. Kafka Streaming by Mahesh Chand Kandpal — If event time is very relevant and latencies in the seconds are completely unacceptable, Kafka should be your first choice. Otherwise, Spark works just fine.
Streaming and Structured Streaming
What Is Structured Streaming? by Himanshu Gupta — Structured Streaming is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users build streaming applications.
Apache Spark: Setting Up a Cluster on AWS by Jay Sridhar — You can augment and enhance Apache Spark clusters using Amazon EC2's computing resources. Find out how to set up clusters and run master and slave daemons on one node.
Databases, RDDs, and DataFrames
Reading Data From Oracle Database With Apache Spark by Emrah Mete — Learn how to connect Apache Spark to an Oracle database, read the data directly, and write it in a DataFrame.
What Are Spark Checkpoints on Data Frames? by Jean Georges Perrin — Checkpoints freeze the content of your DataFrames before performing additional operations. They're essential to effectively managing your DataFrames.
Understanding Apache Spark Failures and Bottlenecks by Rishitesh Mishra — When everything goes according to plan, it's easy to write and understand applications in Apache Spark. However, a well-tuned application can still fail after a change in the data or its layout, and an application that had been running well can start behaving badly because of resource starvation.
Smart Resource Utilization With Spark Dynamic Allocation by Haim Cohen — Configuring your Spark applications wisely will provide you with a good balance between smart allocation and performance.
Apache Spark Performance Tuning – Degree of Parallelism by Rathnadevi Manivannan — Learn about improving performance and increasing speed through partition tuning in a Spark application running on YARN.
Why Your Spark Applications Are Slow or Failing, Part 1: Memory Management and Part 2: Data Skew and Garbage Collection by Rishitesh Mishra — See how common memory management issues, data skew, and garbage collection can have a significant impact on your Spark application's performance.
Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds by Harry Powell and Gianmario Spacagna — Barclays Data Scientist Gianmario Spacagna and Harry Powell, Head of Advanced Analytics, describe how they iteratively process raw data directly from the central data warehouse into Spark and how Tachyon is their key enabling technology.
Introduction to Spark With Python: PySpark for Beginners by Kislay Keshari — Take a look at how to use Apache Spark with Python (PySpark) in order to perform analysis on robust data sets.
PySpark DataFrame Tutorial: Introduction to DataFrames by Kislay Keshari — Explore the idea of DataFrames and how they can help data analysts make sense of large datasets when paired with PySpark.
How to Perform Distributed Spark Streaming With PySpark by Neha Priya — Look at how to use PySpark to quickly analyze incoming data streams to provide real-time metrics.
Scala and Spark
Cleanframes: A Data Cleansing Library for Apache Spark! by Dawid Rutowicz — A developer discusses how to use an open source, Scala-based library that can help take some of the boilerplate code out of data cleansing.
Spark and Machine Learning
Churn Prediction With Apache Spark Machine Learning by Carol McDonald — Learn how to get started using Apache Spark’s machine learning decision trees and machine learning pipelines for classification.
Predictive Analytics With Spark ML by David Moyers — Whether you're running Spark on a large cluster or embedded within a single node app, Spark makes it easy to create predictive analytics with just a few lines of code.
A Glimpse at the Future of Apache Spark 3.0 With Deep Learning and Kubernetes by Oliver White — Learn how Spark 3.0, Kubernetes, and deep learning all come together.
No One Puts Baby in a Container
Running Apache Spark Applications in Docker Containers by Arseniy Tashoyan — Even once your Spark cluster is configured and ready, you still have a lot of work to do before you can run it in a Docker container. But these tips can help make it easier!
Quick Start With Apache Livy by Guglielmo Iozzia — Learn how to get started with Apache Livy, a project in the process of being incubated by Apache that interacts with Apache Spark through a REST interface.
Example ETL Application Using Apache Spark and Hive by Emrah Mete — In this article, we'll read a sample data set with Spark on HDFS (Hadoop Distributed File System), do a simple analytical operation, then write to a table that we'll make in Hive.
Be a Part of the Conversation!
Think we missed something? Want to contribute? Let us know in the comments below... or, join the conversation by becoming a member of our community of thousands of developers eager to share their knowledge and passion for programming with others.
Opinions expressed by DZone contributors are their own.