As the size of data continues to grow, the tools needed to work with it effectively become increasingly important. If software is eating the world, then Apache Spark has the world’s biggest appetite. The primary value of Spark is its ability to control an expandable cluster of machines and present them to the user as though they were a single machine. The objective of this article is to make the under-the-hood elements of Spark less of a mystery, and to show how existing programming knowledge and methods carry over to the Spark engine.
At the core of Spark’s functionality are two main components: the data storage/interaction format and the execution engine.
- Data storage and interaction is abstracted into a format Spark calls the Resilient Distributed Dataset (RDD). From the user’s perspective, an RDD is very similar to a single array. Each element keeps its position and can be virtually any object: scalar data (numbers, booleans, strings, etc.) or more complex data such as tuples, sub-arrays, and even objects that represent relational data rows. Under the hood, this format partitions the data across the nodes of a cluster and holds it in the cluster’s pooled memory as a single unit. The messy details of working with data across multiple machines are largely abstracted away from the developer, while the scaling benefits remain.
- Equally impressive is the execution engine. Spark includes built-in functionality for a number of common data analysis, data wrangling, and machine learning operations. It also includes a clear API that lets developers add their own functionality, which Spark then parallelizes across the entire cluster. Because of this level of abstraction and the array-like format of the RDD, custom functions and algorithms can be written much like a loop that performs some operation on each item of the data, while the Spark engine parallelizes those operations behind the scenes. The engine then recombines the results into a single output data set, keeping the order of the data intact.
Spark can be driven from several programming languages:
- Scala: This is the core of the Spark ecosystem. Scala is a separate language that blends functional and object-oriented programming; it compiles to Java bytecode and runs on the Java Virtual Machine (JVM), just as Java does. Whichever language you use to control and develop a data processing pipeline, the core engine underneath is written in Scala. This also means that new features and capabilities in Spark are most commonly released in Scala first. However, as other languages such as Python and R gain popularity, the wait for those features to reach the other bindings keeps shrinking.
- Python & R: These bindings use both the syntax and the object types of their respective languages. Most of the code you write is either a function that Spark will parallelize or a set of instructions for controlling the cluster, so the vast majority of what you can do in Spark is available through these gateways.
- SQL: This is the newest member of the Spark team, and provides a data structure abstraction analogous to a data frame in R or in Python’s pandas library.
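The array-like model described above (split the data into partitions, apply the same function to every partition in parallel, then recombine the results in order) can be sketched in plain Python. This is only an illustration of the idea under simplifying assumptions, not Spark’s implementation: threads on one machine stand in for cluster nodes, and the names `partition` and `parallel_map` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split a list into contiguous chunks, preserving element order."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_map(func, data, num_partitions=4):
    """Apply func to every element, handing each partition to a worker thread."""
    chunks = partition(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # pool.map yields results in the order the chunks were submitted,
        # so the recombined output keeps the original element order.
        mapped = pool.map(lambda chunk: [func(x) for x in chunk], chunks)
        return [item for chunk in mapped for item in chunk]


def square(x):
    """The per-element function a developer would write."""
    return x * x


numbers = list(range(10))
squares = parallel_map(square, numbers, num_partitions=3)
print(squares)
```

With a running SparkContext `sc`, the PySpark counterpart of the last step would be roughly `sc.parallelize(numbers).map(square).collect()`; the developer still writes only `square`.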
The infrastructure and abstractions of Spark are available for any functionality that you, the developer, can write. However, Spark also ships with built-in libraries containing highly optimized versions of well-known algorithms and machine learning pipelines that you can use right out of the box.
Machine learning on Spark is commonly approached through two libraries: MLlib, which ships with Spark, and KeystoneML, a separate project built on top of Spark.
- MLlib: This is the principal library for machine learning tasks. It includes both algorithms and specialized data structures. Machine learning algorithms for clustering, regression, classification, and collaborative filtering are available. So are data structures such as sparse and dense matrices and vectors, along with supervised learning structures that act like vectors but keep the features of a data set separate from its labels. This makes feeding data into a machine learning algorithm straightforward, without writing a lot of code to tell the algorithm how to organize the data internally.
- KeystoneML: Like the oil pipeline it takes its name from, KeystoneML is built to help construct machine learning pipelines. The pipelines help prepare the data for the model, build and iteratively test the model, and tune the parameters of the model to squeeze out the best performance and capability.
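The two ideas in the list above, a structure that keeps labels apart from features and a pipeline that prepares data in stages, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the `LabeledPoint` class here is a hypothetical stand-in rather than MLlib’s own type, and `fit_scaler`, `add_bias`, and `run_pipeline` are invented names that cover only the data-preparation part of a pipeline.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LabeledPoint:
    """One training example: the label is stored apart from the features."""
    label: float
    features: Tuple[float, ...]


def fit_scaler(points):
    """Learn the largest magnitude of each feature, return a scaling stage."""
    width = len(points[0].features)
    maxima = [max(abs(p.features[i]) for p in points) or 1.0 for i in range(width)]

    def scale(point):
        scaled = tuple(v / m for v, m in zip(point.features, maxima))
        return LabeledPoint(point.label, scaled)

    return scale


def add_bias(point):
    """Append a constant 1.0 feature; the label is left untouched."""
    return LabeledPoint(point.label, point.features + (1.0,))


def run_pipeline(points, stages):
    """Apply each stage, in order, to every point."""
    for stage in stages:
        points = [stage(p) for p in points]
    return points


training = [LabeledPoint(1.0, (2.0, 10.0)), LabeledPoint(0.0, (4.0, 5.0))]
prepared = run_pipeline(training, [fit_scaler(training), add_bias])
print(prepared)
```

Because every stage takes and returns a labeled point, stages can be reordered or swapped without touching the code that eventually trains the model.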
Spark’s natural language processing (NLP) tools are not as fully realized as its more numerical machine learning algorithms, but many of them are highly capable. MLlib has built-in features for models like Latent Dirichlet Allocation topic modeling and Naive Bayes, and for Term Frequency-Inverse Document Frequency feature engineering. Other common NLP processes are highly parallelizable, such as stopword removal, stemming/lemmatization, and term frequency filtering. For these, the Spark engine can take the same function that would accomplish the task on a single document and run it over millions of documents in parallel with little additional code.
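As a rough illustration of why these steps parallelize well, the sketch below removes stopwords and counts terms one document at a time, then combines the counts into TF-IDF scores. It is plain Python rather than MLlib, the stopword list is a tiny made-up example, and the weighting is the unsmoothed textbook formula `tf * log(N / df)`; library implementations may smooth it differently.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "and"}  # illustrative assumption, not a real list


def tokenize(document):
    """Lowercase, split on whitespace, and drop stopwords for ONE document."""
    return [word for word in document.lower().split() if word not in STOPWORDS]


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    # Per-document work: each document is handled independently, which is
    # the part an engine like Spark could spread across a cluster.
    term_counts = [Counter(tokenize(doc)) for doc in documents]
    # Corpus-wide work: in how many documents does each term appear?
    doc_freq = Counter(term for counts in term_counts for term in counts)
    total = len(documents)
    return [
        {term: count * math.log(total / doc_freq[term]) for term, count in counts.items()}
        for counts in term_counts
    ]


docs = ["The cat is a cat", "The dog and the cat", "A bird"]
scores = tf_idf(docs)
print(scores)
```

Only the document-frequency count needs to see the whole corpus; everything else is a function of a single document.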
Spark also includes a library called GraphX, which is currently accessible only from Scala. It is a first-class network graph analytics engine and data object store that can run many of the standard graph analytics functions, including clustering, classification, traversal, searching, and path finding.
Real-Time Stream Processing
Although many of the operations described so far have been batch oriented and based on static data, Spark’s fast use of pooled server memory means that new data can be added to a batch at near-real-time speeds. This technique is called micro-batching, and it’s the method that Spark uses for real-time data stream processing. The engine additionally provides end-to-end fault tolerance, so your data will not get lost in the pipeline during processing, as well as an exactly-once guarantee, meaning that a given portion of data in the stream will never be processed more than once.
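Micro-batching and the exactly-once idea can be sketched in plain Python. This is a simplified illustration, not Spark’s streaming API: batches are cut by record count here rather than by time interval, a finite `range` stands in for an unbounded stream, and `micro_batches` and `RunningTotal` are hypothetical names.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to batch_size records from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class RunningTotal:
    """Applies each numbered batch at most once, even if it is delivered again."""

    def __init__(self):
        self.total = 0
        self.applied = set()

    def apply(self, batch_id, batch):
        if batch_id in self.applied:  # redelivered batch: skip it
            return False
        self.total += sum(batch)
        self.applied.add(batch_id)
        return True


events = range(1, 8)  # stand-in for an unbounded stream of numbers
state = RunningTotal()
for batch_id, batch in enumerate(micro_batches(events, batch_size=3)):
    state.apply(batch_id, batch)
    state.apply(batch_id, batch)  # simulate the same batch arriving twice
print(state.total)
```

Each small batch is processed with ordinary batch logic (`sum` here), and tracking batch identifiers is one simple way to keep a replayed batch from being counted twice.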
Apache Spark is highly effective for big and small data processing tasks not because it reinvents the wheel, but because it amplifies the existing tools needed to perform effective analysis. Coupled with its highly scalable nature on commodity-grade hardware and its strong performance compared to other well-known Big Data processing engines, Spark may finally let software finish eating the world.