Spark as a Fundamental Open-Source Big Data Technology

A discussion of why Spark is such a great framework for big data projects, showing off a little sample code to help explain.

This article is featured in the new DZone Guide to Big Data: Volume, Variety, and Velocity. Get your free copy for insightful articles, industry stats, and more!

Big data — the analysis of datasets beyond the capacity of conventional tools — has been a necessity for many industries for many years. The particular unconventional tools that make big data feasible have changed through the years, however. As this article explains, Spark is the platform whose adoption for solving big data problems currently appears most explosive.

Time Before Spark

To understand Spark's potential, it helps to recall the shape of big data one decade ago. In 2008-2009, the big data-as-a-business concept was often conflated with Hadoop technology. Hadoop is an open-source framework for managing clusters (networks of multiple computers) operating on MapReduce programming tasks. MapReduce is a programming model popularized by Google in 2004 that structures the collection and analysis of large datasets. A decade ago, paradigmatic big data projects were coded as MapReduce batches applied to the data of a particular domain and then executed on Hadoop-managed clusters. Big data and Hadoop were so closely identified then, and for several years after, that groups unfamiliar with big data (e.g. venture capitalists, PR firms, HR departments) notoriously confused the two in their advertisements and other writing. It's fair to summarize, as Kaushik Pal does, that Hadoop is "the basic data platform for all big data-related offerings."
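
For readers who have not met the MapReduce model, the following is a minimal single-machine sketch in plain Python of its three phases: map, shuffle (group by key), and reduce. It is only an illustration of the programming model, not Hadoop's implementation, and every function name here (map_phase, shuffle_phase, reduce_phase, word_mapper, count_reducer, word_count) is a hypothetical name chosen for this example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call may emit many (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield (word.lower(), 1)


def count_reducer(word, ones):
    """Sum the 1s emitted for a word to get its total count."""
    return sum(ones)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    pairs = map_phase(lines, word_mapper)
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, count_reducer)
```

Calling word_count(["the cat", "The dog the"]) counts "the" three times and "cat" and "dog" once each. On a real cluster the map and reduce calls run on many machines at once and the shuffle moves data across the network; the shape of the program stays the same.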

Hadoop's emphasis on batch processing is clumsy for iterative and interactive jobs. Even more than that, Hadoop's MapReduce interpretation assumes that datasets reside in the Hadoop Distributed File System (HDFS). Many (perhaps the majority of) datasets fit this model uncomfortably. High-performance machine learning, for instance, emphasizes in-memory processing with relatively infrequent recourse to filesystem mass storage.
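
The cost of returning to storage on every pass can be sketched in plain Python. This is a toy illustration under stated assumptions, not a benchmark of Hadoop or Spark: the "dataset" is a four-line temporary file, the iterative job is a simple gradient step toward the mean, and all names (read_numbers, step, reload_every_pass, load_once, demo) are hypothetical.

```python
import os
import tempfile


def read_numbers(path, counter):
    """Read one number per line and record that storage was touched."""
    counter["reads"] += 1
    with open(path) as f:
        return [float(line) for line in f]


def step(data, estimate, rate=0.5):
    """One gradient step that moves the estimate toward the mean of the data."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - rate * gradient


def reload_every_pass(path, iterations):
    """Batch style: every iteration goes back to the file system."""
    counter = {"reads": 0}
    estimate = 0.0
    for _ in range(iterations):
        data = read_numbers(path, counter)
        estimate = step(data, estimate)
    return estimate, counter["reads"]


def load_once(path, iterations):
    """In-memory style: read the data once, then iterate on the cached copy."""
    counter = {"reads": 0}
    data = read_numbers(path, counter)
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(data, estimate)
    return estimate, counter["reads"]


def demo(iterations=10):
    """Run both styles on a small temporary file and return their results."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "numbers.txt")
        with open(path, "w") as f:
            f.write("1\n2\n3\n4\n")
        return reload_every_pass(path, iterations), load_once(path, iterations)
```

Both styles arrive at the same estimate, but the first touches storage once per iteration and the second only once in total. For an algorithm that makes many passes over a large dataset, that difference is what the in-memory emphasis is about.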

Spark, "a unified analytics engine for large-scale data processing" that began as a Berkeley class project in 2009, emphasizes:

  • Compatibility with Hadoop through the reuse of HDFS as a storage layer.

  • Interactive querying.

  • Support of machine learning.

  • Pipelining (i.e. ease of connection of different execution units so that a complex calculation can be achieved as a "bucket brigade" that passes data through successive stages of computation).
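
The "bucket brigade" idea behind pipelining can be sketched with ordinary Python generators: each stage takes a stream of items, does one job, and hands its output to the next stage. This is an analogy only, not Spark's API, and the stage names (read_records, drop_blank, parse_int, square) and the pipeline helper are hypothetical.

```python
def read_records(lines):
    """Stage 1: strip surrounding whitespace from each raw line."""
    for line in lines:
        yield line.strip()


def drop_blank(records):
    """Stage 2: discard empty records."""
    for record in records:
        if record:
            yield record


def parse_int(records):
    """Stage 3: convert each text record to an integer."""
    for record in records:
        yield int(record)


def square(numbers):
    """Stage 4: square each number."""
    for number in numbers:
        yield number * number


def pipeline(source, *stages):
    """Connect the stages so each one consumes the previous stage's output."""
    data = source
    for stage in stages:
        data = stage(data)
    return data
```

For example, list(pipeline([" 1\n", "\n", "3 \n"], read_records, drop_blank, parse_int, square)) passes each line through all four stages in turn. Because the stages are generators, nothing is computed until the final result is consumed, which loosely resembles how a chain of transformations can be declared first and executed later.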

Spark also features flexibility in several aspects, including the different programming languages it serves, the clouds in which it can be rented, and the big data libraries it integrates.

Spark vs. Hadoop

Spark is typically faster than Hadoop, by a factor of up to 100 or more for jobs that fit Spark's in-memory model well. Spark is tuned for typical ML tasks like NaiveBayes and K-Means computations, and can also help save time and alleviate hardware constraints. Early Spark projects, however, had a reputation for leaking memory, at least in the hands of novices. Additionally, long-running batch MapReduce jobs appear to be easier to get correct with Hadoop.

Spark is also a more general-purpose programming framework, as mentioned above and as the examples below show in more detail. Hadoop conceives big data rather inflexibly as Java-coded MapReduce operations; in contrast, the learning curve for Spark is far less steep. A conventional programmer in Python, Java, Scala, R, or even SQL can almost immediately start to write familiar-looking programs on a conventional desktop that simultaneously leverage Spark's power. Spark's official site has several evocative examples. Consider this word counter in Python:

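import pyspark

source = "file://..."
result = "file://..."

with pyspark.SparkContext("local", "WordCount") as sc:
    text_file = sc.textFile(source)
    # Split each line into words, pair each word with 1,
    # then add up the 1s for each distinct word.
    counts = (text_file.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
    # Write the (word, count) pairs to the output location.
    counts.saveAsTextFile(result)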

Any Python programmer can read this. While it runs on a low-powered development host, it also runs unchanged on Docker-ized Spark, with Spark on industrial-strength cloud clusters, experimental supercomputers, high up-time mainframes, and so on. Also, it's easy to refine such an example with conventional Python programming; a follow-up example might be:

import re
import pyspark

source = "file://..."
result = "file://..."

def better_word_splitter(line):
    """Split on whitespace, treating each run of consecutive
    spaces or tabs as a single separator."""
    return re.split(r"\s+", line.strip())

with pyspark.SparkContext("local", "WordCount2") as sc:
    text_file = sc.textFile(source)
    counts = text_file.flatMap(better_word_splitter)\
        .map(lambda word: (word, 1))\
        .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(result)
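
To see what a whitespace-aware splitter buys over a plain split on a single space, here is a small sketch that runs in ordinary Python with no Spark installation. The two function names are hypothetical, and the second is one reasonable way to split on runs of whitespace rather than a reproduction of the listing above.

```python
import re


def split_on_single_space(line):
    """Naive splitter: every single space is a separator."""
    return line.split(" ")


def split_on_whitespace_runs(line):
    """Treat any run of spaces or tabs as one separator; blank lines yield no words."""
    stripped = line.strip()
    return re.split(r"\s+", stripped) if stripped else []


sample = "big  data\tspark"
naive_words = split_on_single_space(sample)
clean_words = split_on_whitespace_runs(sample)
```

With the sample line, which contains a double space and a tab, the naive splitter produces an empty string and leaves "data\tspark" fused together, while the whitespace-aware version yields the three words. Either function could be passed to flatMap in the word counter, since both take a line and return a list of words.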

Spark is certainly newer than Hadoop and has a reputation for being less widely understood. At the same time, Spark complements and generalizes Hadoop, so existing specialists in programming domains like ETL transformations, ML, graph analysis, OLAP, dataset streaming, time-series analysis, or interactive and experimental queries can adopt Spark incrementally. Also, Spark's incorporation of these distinct domains simplifies architectural design; everything needed for a particular result can be written within a single pipeline and computed on a standard Spark cluster.

Another example from the official Apache Spark site, this time in Scala, illustrates some of the power of Spark's integration. Not long ago, predictive analysis was an undertaking for graduate school; now, the power of Spark reduces it to a few lines:

import org.apache.spark.ml.classification.LogisticRegression

// Every record of this DataFrame contains the label and
// features represented by a vector.
// (sqlContext and data are assumed to be defined already.)
val df = sqlContext.createDataFrame(data).toDF("label", "features")

// Set parameters for the algorithm.
// Here, we limit the number of iterations to 10.
val lr = new LogisticRegression().setMaxIter(10)

// Fit the model to the data.
val model = lr.fit(df)

// Inspect the model: get the feature weights.
// (Newer Spark releases name this field coefficients.)
val weights = model.weights

// Given a dataset, predict each point's label, and show the results.
model.transform(df).show()

Spark's exposure in general-purpose programming languages such as Scala means that it's easy to extend, adapt, and integrate such powerful results with other organizational assets. Too often in the past, big data was an isolated specialization. Spark's strengths bring big data to a wider range of programmers and projects.

Keep in mind what Spark brings operationally: once a program is correct, it will be fast, and it can scale heroically thanks to Spark's ability to manage a range of clusters.

Ready to Go

All these capabilities sound nice. But is Spark truly safe for projects that rely on million-dollar hardware, not to mention the value and security of proprietary data? Yes! Billion-dollar companies including GoDaddy, Alibaba, and Shopify rely on Spark for crucial services and results.

As intellectual property, Hadoop found a home in 2011 with the Apache Software Foundation in an innovative ownership arrangement. Spark later followed that same path. Interestingly enough, for the last four of those years, activity at the Spark repository exceeded that of the older and generally more prominent Hadoop repository. While that comparison means little in isolation, it at least hints at the large number of organizations that regard Spark as a fundamental open-source technology.

If anything, Spark's flexibility and integration make it a safer choice than Hadoop or other alternatives. While Hadoop itself behaves reliably, too many Hadoop-based projects have stumbled in interfacing to a MapReduce-focused kernel; the MapReduce part is correct, but the wrappers around it connecting to other organizational assets end up being novel and correspondingly shaky. In contrast, Spark's more general framework invites the kind of convenient, trustworthy interfaces that contribute to the success of a project as a whole.


Derrick Harris was right when he summarized Spark for a business audience over three years ago: "Spark is faster, more flexible, and easier to use than Hadoop MapReduce." Spark's sophisticated in-memory processing makes it faster — sometimes by orders of magnitude. Spark maintains a rich array of APIs for graphs, streaming, ML, and more that even manage Spark's own in-memory acceleration. Spark builds in pipelines and supports multiple clustering facilities. Programmers can work in any of five languages rather than just Hadoop's Java basis. 

For all these reasons, Spark's growth will only increase in the next few years. Spark is the one technology that big data practitioners most need to know.
