Even in the tech world, most professionals have had little to no exposure to Apache Spark: what it is, how it works, or the groundbreaking results you can expect from it. Because of this, it’s necessary to first introduce the technology before going into a deeper discussion of what Spark-as-a-Service has to offer.
An Introduction to Apache Spark
Apache Spark is an in-memory distributed processing and analytics platform that was originally built in 2009 at UC Berkeley. It began as a class project in which participating faculty and students set out to design a processing framework that could fill the gaps in big data technology: more specifically, a framework for the iterative, interactive processing of big data.
Since then, Spark has matured into a platform that data engineers and data scientists can use to build big data analytics applications in popular programming languages like Java, Python, Scala, and R. Currently, Spark is mostly used in a supporting role within the Hadoop ecosystem (including HDFS, MapReduce, and YARN). Despite its slow disk access requirements, MapReduce is still the main parallel processing workhorse for big data. Apache Spark is the alternative used when faster, in-memory parallel processing of big data is needed. Like MapReduce, Spark runs on clusters. Spark clusters are composed of multiple Worker Nodes, each of which contains an Executor and Tasks. Each Spark cluster has a single Cluster Manager that acts as a resource manager. YARN is a popular resource manager for managing Spark clusters.
Spark was designed to be run on hundreds of nodes in a cluster, while MapReduce is capable of running over tens of thousands of nodes. Both MapReduce and Spark are designed to meet general big data processing requirements, both run on YARN, and both use data that’s stored in HDFS. MapReduce is used for batch processing of data en masse. As a secondary step, high-latency big data analytics can then be generated from this data. Apache Spark, on the other hand, uses in-memory storage and processing, so that developers can generate real-time analytics from streaming big data sources in a single step.
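To make the batch model concrete, here is a minimal sketch of the map, shuffle, and reduce phases written in plain Python. It does not use the Hadoop or Spark APIs; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample sentences are assumptions made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the three phases into one batch job (illustrative only)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input standing in for files stored in HDFS.
    sample = ["spark runs on clusters", "mapreduce runs on clusters too"]
    print(word_count(sample))
```

In a real MapReduce job, the output of each phase is written to disk before the next phase reads it, which is where the latency described above comes from. Spark aims to keep such intermediate results in memory instead.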
Will it Replace MapReduce?
The short answer: Almost certainly.
Back in September of last year, Cloudera announced its plan to make Apache Spark its default data processing engine, in place of MapReduce. Spark is a faster and more flexible data processing engine, and it’s a lot easier to program than MapReduce. MapReduce, however, was designed for mega-scale batch processing jobs, and Spark does not yet match it at that scale. The Hadoop community is working to increase the amount of data that Spark can process, so that it can be used as a full-scale MapReduce replacement in the near future. Judging from recent investments and adoption rates in the tech world, Spark’s low-latency, in-memory processing capabilities are primed and almost ready to replace the older MapReduce batch processing framework.
Spark’s Four Submodules
Apache Spark is broken into four main submodules. These are:
- Spark SQL
- MLlib
- GraphX
- Spark Streaming
Data in Spark can be passed interchangeably through any of these four submodules. The following is just a quick series of explanations: Spark submodules in a nutshell, if you will.
Spark SQL for Big Data Querying
Spark DataFrames allow you to use SQL functions to process columnar, tabular data within Apache Spark. You can also use Spark as a processing engine for accessing and manipulating data in the Hive database. In this way, you can use Spark SQL and HiveQL to interact with, analyze, and query big data directly from HDFS.
MLlib for Machine Learning and Data Science on Spark
Spark’s MLlib submodule provides capabilities for machine learning and real-time analysis of big data. MLlib functionalities include:
- Dimensionality Reduction
- Collaborative Filtering, and more…
If you’d like to see more about doing data science with Spark’s MLlib submodule, check out the MLlib documentation.
Spark GraphX for Big Data Graph Processing
Got big graph data and want to do some in-memory processing on it? If so, Spark GraphX has you covered. You can use GraphX to create and modify the vertices and edges that comprise your graph dataset, as well as to carry out graph computations on graph data that’s stored in HDFS.
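For readers new to graph data, the following is a small plain-Python sketch of what "vertices and edges" and a "graph computation" look like. It is not the GraphX API: the sample vertices, the edge tuples, and the in_degrees helper are all hypothetical and exist only to illustrate the data model.

```python
from collections import Counter

# Hypothetical sample graph: vertex id -> attribute,
# and directed edges as (source id, destination id, attribute).
vertices = {1: "alice", 2: "bob", 3: "carol"}
edges = [(1, 2, "follows"), (3, 2, "follows"), (2, 1, "follows")]


def in_degrees(vertices, edges):
    """Count incoming edges per vertex, including vertices with none."""
    counts = Counter(dst for _src, dst, _attr in edges)
    return {vertex_id: counts.get(vertex_id, 0) for vertex_id in vertices}


if __name__ == "__main__":
    for vertex_id, degree in in_degrees(vertices, edges).items():
        print(vertices[vertex_id], degree)
```

A graph engine runs this kind of per-vertex computation in parallel over graphs far too large for one machine, but the underlying shape of the data is the same.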
Spark Streaming for Real-Time Processing of Big Data
Stream processing is Spark’s flagship capability, and Spark Streaming is the submodule that handles stream management within Spark. In this submodule, you convert continuously streaming sources of big data into discretized streams (DStreams) that are processed in micro-batches (think: batches on the scale of seconds). All other submodules within Spark can be used to process, analyze, and manipulate this DStream data.
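The micro-batch idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the Spark Streaming API: the micro_batches and process_batch functions, the (timestamp, value) event format, and the one-second interval are assumptions chosen for the example.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-width time windows.

    Returns a list of (window_start, [values]) tuples ordered by window
    start. Windows that received no events are simply omitted here.
    """
    if batch_seconds <= 0:
        raise ValueError("batch_seconds must be positive")
    windows = defaultdict(list)
    for timestamp, value in events:
        window_start = int(timestamp // batch_seconds) * batch_seconds
        windows[window_start].append(value)
    return sorted(windows.items())


def process_batch(values):
    """Hypothetical per-batch job: count and sum the values in one batch."""
    return {"count": len(values), "total": sum(values)}


if __name__ == "__main__":
    # Hypothetical stream of (seconds since start, reading) events.
    events = [(0.2, 5), (0.9, 3), (1.1, 7), (2.5, 1), (2.7, 4)]
    for start, values in micro_batches(events, batch_seconds=1):
        print(start, process_batch(values))
```

Each small batch is then handled like an ordinary batch job, which is why the same processing logic can be applied to both stored and streaming data.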
Now that you know what Spark is and its basic layout, let’s take a minute to consider Spark-as-a-Service.
Although vendors like IBM and Qubole have begun offering Spark-as-a-Service, Databricks, the company founded by Spark’s original creators, is the homegrown cloud-based Spark-as-a-Service provider. Databricks allows you to deploy Spark quickly and (relatively) easily, on an as-needed basis. So, if you have a short-term big data analytics project, you can quickly set that up on Databricks, load your data onto its cloud-based servers, process it, and then terminate the cluster after you’ve gotten the results you need.
Building and configuring Spark clusters is resource-intensive, so one of Databricks’ main draws is that it automates the cluster building and configuration process. To set up a Spark cluster in Databricks, all you need to do is specify how much memory capacity you need, and the platform will size and configure your cluster for you. Security, process monitoring, and resource monitoring are also built into Databricks’ cluster management services. Within the Spark cluster, Notebooks are available for writing jobs and processes using Scala, Python, or SQL. Databricks also has functionality that you can use to generate data visualizations and analytics dashboards directly within the platform.
Want to Explore Further Out into the World of Apache Spark?
If you want to learn how to begin using Spark to start building data products at your organization, you can get started with Apache Spark and Scala Certification Training.
About the Author:
Lillian Pierson, P.E. is a leading expert in the field of big data and data science. She equips working professionals and students with the data skills they need to stay competitive in today's data-driven economy.
She is the author of three highly referenced technical books by Wiley & Sons Publishers: Data Science for Dummies (2015), Big Data / Hadoop for Dummies (Dell Special Edition, 2015), and Big Data Automation for Dummies (BMC Special Edition, 2016).
Lillian has spent the last decade training and consulting for large technical organizations in the private sector, such as IBM, Dell, and Intel, as well as government organizations, from the U.S. Navy down to the local government level.
As the Founder of Data-Mania LLC, Lillian offers online and face-to-face training courses as well as workshops, and other educational materials in the area of big data, data science, and data analytics.