Apache Spark Tutorial (Fast Data Architecture Series)

In this article, a data scientist and developer gives an Apache Spark tutorial that demonstrates how to get Apache Spark installed and running on a Mesos cluster.

By Bill Ward · Jul. 13, 18 · Tutorial


Continuing the Fast Data Architecture Series, this article focuses on Apache Spark. In this Apache Spark tutorial, we will learn what Spark is and why it is important for fast data architecture. We will install Spark on our Mesos cluster and run a sample Spark application. The previous articles in the series are:

  1. Installing Apache Mesos 1.6.0 on Ubuntu 18.04
  2. Kafka Tutorial for Fast Data Architecture
  3. Kafka Python Tutorial for Fast Data Architecture

Video Introduction

Check out my Apache Spark Tutorial Video on YouTube:


What Is Apache Spark?

Apache Spark is a unified computing engine and a collection of libraries that help data scientists analyze big data. Unified means that Spark aims to support many different data analysis tasks, ranging from SQL queries to machine learning and graph processing. Before Spark, Hadoop MapReduce was the dominant player among data analysis platforms; Spark was developed to remedy some of the issues identified with Hadoop MapReduce.

There are several components to Spark that make up this unified computing engine model, as outlined in the Apache Spark documentation (a short sketch follows the list below):

  1. Spark SQL - A library that allows data scientists to analyze data using simple SQL queries.
  2. Spark Streaming - A library that processes streaming data in near real-time using micro-batch processing.
  3. MLlib - A machine learning library for Spark.
  4. GraphX - A library that adds graph processing functionality to Spark.
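
To make the "unified" idea concrete, here is a minimal PySpark sketch that touches the Spark SQL piece. It is illustrative only: it assumes a local pyspark installation (for example via pip install pyspark) and a small people.csv file with name and age columns, neither of which is part of this tutorial's Mesos setup.

# Minimal Spark SQL sketch (illustrative; assumes pyspark is installed
# locally and a people.csv file with columns name,age exists).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("SparkSqlSketch")
         .master("local[*]")
         .getOrCreate())

# Load a CSV into a DataFrame and query it with plain SQL.
people = spark.read.csv("people.csv", header=True, inferSchema=True)
people.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()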

We will be covering these components in much more detail in future articles. For now, we will install Apache Spark on the Mesos cluster that we set up in previous articles in the Fast Data Architecture Series.

Apache Spark Architecture

Spark consists of a driver program that manages the Spark application. Driver programs can be written in many languages, including Python, Scala, Java, and R. The driver program splits the application into tasks and schedules those tasks to run on executors. You can run Spark applications on many cluster managers, including Apache Mesos and Kubernetes, or in standalone mode. In this Apache Spark tutorial, we will deploy to a Mesos cluster and run an example application that ships with Spark to test the setup.

The Spark driver program schedules work on Spark executors, which actually carry out the work of the Spark application. In our Mesos environment, these executors are scheduled on Mesos nodes and are short-lived: they are created, carry out their assigned tasks, report their status back to the driver, and are then destroyed.
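
For orientation, here is a minimal sketch of what a driver program can look like in Python. It is a sketch under assumptions: pyspark is installed on the machine running the driver, the Mesos native library is available (as configured for the dispatcher later in this article), and the Mesos master address is the one used throughout this tutorial.

# Minimal driver-program sketch (illustrative; assumes pyspark and the
# Mesos native library are available where the driver runs).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("DriverSketch")
        .setMaster("mesos://192.168.1.30:5050"))  # Mesos master used in this article
sc = SparkContext(conf=conf)

# The driver splits this job into tasks; executors on the Mesos nodes run them.
rdd = sc.parallelize(range(1000000), numSlices=8)
total = rdd.map(lambda x: x * x).sum()
print("Sum of squares:", total)

sc.stop()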

Install Apache Spark

Run the following on your Mesos Masters and all your Mesos Slaves. You will also want to install this on your local development system.

$ wget http://apache.claz.org/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
$ mkdir -p /usr/local/spark && tar -xvf spark-2.3.1-bin-hadoop2.7.tgz -C /usr/local/spark --strip-components=1
$ chown root:root -R /usr/local/spark/


This gives us a binary installation of Apache Spark that we can use to deploy Spark applications on Mesos. In the next section, we will create a SystemD service that runs the Spark cluster dispatcher.
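
Before moving on, you can optionally sanity-check the unpacked installation; this simply prints Spark's version banner and confirms the binaries are where we expect them (assuming the paths used above):

$ /usr/local/spark/bin/spark-submit --version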

Create a SystemD Service Definition

When Spark is deployed in cluster mode on a Mesos cluster, we need the Spark Dispatcher running to schedule our Spark applications on Mesos. In this section, we will create a SystemD service definition that we will use to manage the Spark Dispatcher as a service.

On one of your Mesos Masters, create a new file /etc/systemd/system/spark.service and add the following contents:

[Unit]
Description=Spark Dispatcher Service
After=mesos-master.service
Requires=mesos-master.service

[Service]
Environment=MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
ExecStart=/usr/local/spark/bin/spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher --master mesos://192.168.1.30:5050

[Install]
WantedBy=multi-user.target


This file configures the Spark dispatcher service to start up after the mesos-master service. Also, notice that we specify the IP and port of our Mesos Master; be sure that yours reflects the actual IP address and port of your Mesos Master. Now we can start and enable the service:

# systemctl daemon-reload
# systemctl start spark.service
# systemctl enable spark.service


You can make sure it is running using this command:

# systemctl status spark.service


If everything is working correctly, you will see the service reported as active (running). The next part of our Apache Spark Tutorial is to test our Spark deployment!
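
If the dispatcher did not start, its output is available through the systemd journal (standard systemd tooling, nothing Spark-specific):

# journalctl -u spark.service -e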

Testing Spark

Now that we have our Spark Dispatcher service running on our Mesos Cluster, we can test it by running an example job that comes with Spark and calculates Pi for us. Note that we submit to port 7077, the port the MesosClusterDispatcher listens on by default, rather than port 5050, which is the Mesos Master itself.

bin/spark-submit --name SparkPiTestApp --class org.apache.spark.examples.SparkPi --master mesos://192.168.1.30:7077 --deploy-mode cluster --executor-memory 1G --total-executor-cores 30 /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar 100


You will see that our example is scheduled:

2018-07-11 16:44:12 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-11 16:44:12 INFO  RestSubmissionClient:54 - Submitting a request to launch an application in mesos://192.168.1.30:7077.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Submission successfully created as driver-20180711164412-0001. Polling submission state...
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Submitting a request for the status of submission driver-20180711164412-0001 in mesos://192.168.1.30:7077.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - State of driver driver-20180711164412-0001 is now RUNNING.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "2.3.1",
  "submissionId" : "driver-20180711164412-0001",
  "success" : true
}
2018-07-11 16:44:13 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-07-11 16:44:13 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-4edf5319-8ff1-45bc-b4ad-56b1291f4125


To see the output, we need to look at the sandbox for our job in Mesos. Go to your Mesos web interface, http://{mesos-ip}:5050. Under Completed Tasks, you should see a task named Driver for SparkPiTestApp, which is our job.

Click on the Sandbox link for our job, then click on the stdout link to see the logging for our application. You will see that it calculated Pi for us.

2018-07-11 16:44:20 INFO  DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:38) finished in 3.485 s
2018-07-11 16:44:20 INFO  DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 3.551479 s
Pi is roughly 3.1418855141885516
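
For context, the bundled SparkPi example estimates Pi with Monte Carlo sampling: it generates random points, counts the fraction that land inside a circle inscribed in a square, and multiplies that fraction by four. A rough PySpark equivalent (a sketch of the technique, not the bundled Scala code) looks like this:

# Rough PySpark equivalent of the SparkPi example (illustrative sketch).
import random
from pyspark import SparkContext

sc = SparkContext(appName="PiSketch")

partitions = 100                  # same argument we passed to spark-submit above
samples = 100000 * partitions

def inside(_):
    # Sample a point in the unit square and test whether it falls inside
    # the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

count = sc.parallelize(range(samples), partitions).map(inside).sum()
print("Pi is roughly", 4.0 * count / samples)

sc.stop()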


Conclusion

This Apache Spark tutorial simply demonstrated how to get Apache Spark installed. The true power of Spark lies in the APIs it provides for writing analytical applications that process your raw data and produce meaningful results you can use to make real-world business decisions. Don't miss the next several articles, where we cover how to write Spark applications using the Python API and continue our exploration of the SMACK Stack. If you haven't already, please sign up for my weekly newsletter so you will get updates when I release new articles. Thanks for reading this tutorial. If you liked it or hated it, please leave a comment below.


Published at DZone with permission of Bill Ward, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
