
Apache Spark Tutorial (Fast Data Architecture Series)


In this article, a data scientist and developer gives an Apache Spark tutorial that demonstrates how to get Apache Spark installed.


Continuing the Fast Data Architecture Series, this article focuses on Apache Spark. In this Apache Spark tutorial, we will learn what Spark is and why it is important for Fast Data Architecture. We will install Spark on our Mesos cluster and run a sample Spark application. The previous articles in this series are:

  1. Installing Apache Mesos 1.6.0 on Ubuntu 18.04
  2. Kafka Tutorial for Fast Data Architecture
  3. Kafka Python Tutorial for Fast Data Architecture

Video Introduction

Check out my Apache Spark Tutorial Video on YouTube:


What Is Apache Spark?

Apache Spark is a unified computing engine and a collection of libraries that help data scientists analyze big data. Unified means that Spark aims to support many different data analysis tasks, ranging from SQL queries to machine learning and graph processing. Before Spark, Hadoop MapReduce was the dominant player in data analysis platforms. Spark was developed to remedy some of the issues identified with Hadoop MapReduce.

There are several components to Spark that make up this unified computing engine, as outlined in the Apache Spark documentation:

  1. Spark SQL - A library that allows data scientists to analyze data using simple SQL queries (a short sketch follows this list).
  2. Spark Streaming - A library that processes streaming data in near real time using micro-batches.
  3. MLlib - A machine learning library for Spark.
  4. GraphX - A library that adds graph processing functionality to Spark.
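
To give a feel for what this looks like in practice, here is a minimal PySpark sketch of the Spark SQL library. It is only an illustration, assuming a working PySpark installation running in local mode; the data and names are made up.

# Minimal Spark SQL sketch (PySpark, local mode). Assumes pyspark is available,
# e.g. from the Spark binaries installed later in this tutorial.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("spark-sql-sketch") \
    .getOrCreate()

# A tiny in-memory DataFrame standing in for real data.
people = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"])
people.createOrReplaceTempView("people")

# Spark SQL lets you analyze the data with plain SQL.
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()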

We will be covering these in much more detail in future articles. In this article, we will install Apache Spark on the Mesos cluster that we set up in previous articles in the Fast Data Architecture Series.

Apache Spark Architecture

Spark consists of a driver program that manages the Spark application. Spark driver programs can be written in many languages, including Python, Scala, Java, and R. The driver program splits the application into tasks and schedules those tasks to run on executors. You can run Spark applications on many cluster managers, including Apache Mesos and Kubernetes, or in standalone mode. In this Apache Spark tutorial, we will deploy the Spark driver program to a Mesos cluster and run an example application that comes with Spark to test it.

The Spark driver program schedules work on Spark executors. Executors actually carry out the work of the Spark application. In our Mesos environment, these executors are scheduled on Mesos nodes and are short-lived. They are created, carry out their assigned tasks, report their status back to the driver, and then they are destroyed.
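
To make that division of labor concrete, here is a minimal driver program sketch in PySpark. It is only an illustration: the master URL, application name, and numbers are assumptions, and local[*] keeps everything on one machine (pointing the master at mesos://192.168.1.30:5050 would instead schedule the executors on our Mesos cluster).

# A minimal Spark driver program sketch in Python. The driver describes the
# work; Spark splits it into tasks that executors run in parallel.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("driver-program-sketch") \
    .getOrCreate()

# The driver defines a distributed dataset split into 8 partitions...
numbers = spark.sparkContext.parallelize(range(1, 1000001), 8)

# ...and the executors carry out one task per partition, reporting
# their results back to the driver.
total = numbers.map(lambda n: n * n).reduce(lambda a, b: a + b)
print("Sum of squares:", total)

spark.stop()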

Install Apache Spark

Run the following as root on your Mesos Masters and all of your Mesos Slaves. You will also want to install Spark on your local development system.

$ wget http://apache.claz.org/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
$ mkdir -p /usr/local/spark && tar -xvf spark-2.3.1-bin-hadoop2.7.tgz -C /usr/local/spark --strip-components=1
$ chown root:root -R /usr/local/spark/


This will create a binary installation of Apache Spark that we can use to deploy our Spark applications to Mesos. In the next section, we will create a SystemD service that will run our Spark cluster dispatcher.

Create a SystemD Service Definition

When we have Spark deployed in cluster mode on a Mesos cluster, we need the Spark Dispatcher running to schedule our Spark applications on Mesos. In this section, we will create a SystemD service definition that we will use to manage the Spark Dispatcher as a service.

On one of your Mesos Masters, create a new file /etc/systemd/system/spark.service and add the following contents:

[Unit]
Description=Spark Dispatcher Service
After=mesos-master.service
Requires=mesos-master.service

[Service]
Environment=MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
ExecStart=/usr/local/spark/bin/spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher --master mesos://192.168.1.30:5050

[Install]
WantedBy=multi-user.target


This file configures the Spark Dispatcher service to start after the mesos-master service. Also, notice that we specify the IP address and port of our Mesos Master; be sure that yours reflects the actual IP address and port of your Mesos Master. Now we can enable and start the service:

# systemctl daemon-reload
# systemctl start spark.service
# systemctl enable spark.service


You can make sure it is running using this command:

# systemctl status spark.service


If everything is working correctly, you will see the service as Started and Active. The next part of our Apache Spark Tutorial is to test our Spark deployment!

Testing Spark

Now that we have our Spark Dispatcher service running on our Mesos cluster, we can test it by running an example job. We will use an example that comes with Spark that calculates Pi for us.

bin/spark-submit --name SparkPiTestApp --class org.apache.spark.examples.SparkPi --master mesos://192.168.1.30:7077 --deploy-mode cluster --executor-memory 1G --total-executor-cores 30 /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar 100


You will see that our example is scheduled:

2018-07-11 16:44:12 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-11 16:44:12 INFO  RestSubmissionClient:54 - Submitting a request to launch an application in mesos://192.168.1.30:7077.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Submission successfully created as driver-20180711164412-0001. Polling submission state...
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Submitting a request for the status of submission driver-20180711164412-0001 in mesos://192.168.1.30:7077.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - State of driver driver-20180711164412-0001 is now RUNNING.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "2.3.1",
  "submissionId" : "driver-20180711164412-0001",
  "success" : true
}
2018-07-11 16:44:13 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-07-11 16:44:13 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-4edf5319-8ff1-45bc-b4ad-56b1291f4125


To see the output, we need to look at the Sandbox for our job in Mesos. Go to your Mesos web interface, http://{mesos-ip}:5050. You should see a task named Driver for SparkPiTestApp under Completed Tasks; this is our job.

Click on the Sandbox link for our job, then click on the stdout link to see the logging for our application. You will see that it calculated Pi for us.

2018-07-11 16:44:20 INFO  DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:38) finished in 3.485 s
2018-07-11 16:44:20 INFO  DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 3.551479 s
Pi is roughly 3.1418855141885516
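
For context, the bundled SparkPi example estimates Pi with random sampling: it scatters points over the unit square and counts how many fall inside the unit circle. The following PySpark sketch shows the same idea; it is not the Scala source that ships with Spark, and the sample count is an assumption based on the 100 argument we passed to spark-submit.

# Rough PySpark sketch of the SparkPi approach: sample random points in the
# unit square and estimate Pi from the fraction that land inside the circle.
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PiSketch").getOrCreate()
sc = spark.sparkContext

partitions = 100                # the "100" argument passed to spark-submit above
samples = partitions * 100000   # total number of random points to draw

def inside(_):
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

count = sc.parallelize(range(samples), partitions).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly", 4.0 * count / samples)

spark.stop()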


Conclusion

This Apache Spark tutorial simply demonstrated how to get Apache Spark installed. The true power of Spark lies in the APIs it provides for writing powerful analytical applications that process your raw data and produce meaningful results you can use to make real-world business decisions. Don't miss the next several articles, where we cover how to write Spark applications using the Python API and continue our exploration of the SMACK Stack. If you haven't already, please sign up for my weekly newsletter so you will get updates when I release new articles. Thanks for reading this tutorial. If you liked it or hated it, please leave a comment below.


Topics:
big data, apache spark, data architecture, tutorial
