
Apache Spark: Setting Up a Cluster on AWS

You can augment and enhance Apache Spark clusters using Amazon EC2's computing resources. Find out how to set up clusters and run master and slave daemons on one node.

By Jay Sridhar · Apr. 03, 17 · Opinion

“If the facts don’t fit the theory, change the facts.”
― Albert Einstein

Apache Spark is the newest kid on the Big Data block.

While re-using major components of the Apache Hadoop Framework, Apache Spark lets you execute big data processing jobs that do not neatly fit into the Map-Reduce paradigm. It provides support for many patterns similar to the Java 8 Streams functionality, while letting you run these jobs on a cluster.

Do you have a data processing job working nicely with Java 8 Streams, but need more horsepower and memory than a single machine can provide? Apache Spark is your friend.
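As an analogy (plain Python, no Spark required — the salary figures here are made up for illustration), the map/filter/reduce pipeline style shared by Java 8 Streams and Spark RDDs looks like this; Spark runs the same shape of pipeline, but partitions the data across a cluster:

```python
from functools import reduce

# Made-up salary figures, purely for illustration.
salaries = [1100000, 450000, 2750000, 980000, 5200000]

# Keep salaries of at least $1M, then sum them -- the same
# filter/reduce shape you would write against a Spark RDD.
high = list(filter(lambda s: s >= 1000000, salaries))
total = reduce(lambda x, y: x + y, high, 0)

print(len(high))   # 3
print(total)       # 9050000
```

With Spark, `salaries` would be an RDD and the filter would execute in parallel on the workers instead of in a single process.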

In this article, we delve into the basics of Apache Spark and show you how to set up a single-node cluster using the computing resources of Amazon EC2. For the purposes of the demonstration, we set up a single server and run the master and slave daemons on the same node. Such a setup is good for getting your feet wet with Apache Spark.

Create AWS Instance

Setting up an AWS EC2 instance is quite straightforward, and we have covered it in our guide to setting up a Hadoop cluster. The procedure is the same up to the point where the instance is running on EC2. Follow the steps in that guide until the instance is launched, then return here to continue with Apache Spark.

Instance Setup

Once the instance is up and running on AWS EC2, we need to set up the prerequisites for Apache Spark.

Install Java

Install Java on the node using the Ubuntu package openjdk-8-jdk-headless:

sudo apt-get -y install openjdk-8-jdk-headless


Install Apache Spark

Next, head over to the Apache Spark website and download the latest version. At the time of writing, the latest version is 2.1.0. We have chosen the build packaged with Hadoop 2.7 (the default).

Download and unpack the Apache Spark package.

mkdir ~/server
cd ~/server
wget <Link to Apache Spark Binary Distribution>
tar xvzf spark-2.1.0-bin-hadoop2.7.tgz


After unpacking, only one step remains to complete the installation: setting JAVA_HOME.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/


And that’s it for installation! The friendly folks at Apache Spark have certainly made our lives easy, haven’t they?
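As a quick sanity check (a sketch of our own, not part of the official setup), you can verify from Python that JAVA_HOME is set and points at a real directory before starting any daemons:

```python
import os

def java_home_ok(env):
    """Return True if JAVA_HOME is set and points at an existing directory."""
    path = env.get("JAVA_HOME", "")
    return bool(path) and os.path.isdir(path)

# On the EC2 node, pass the real environment:
print(java_home_ok(os.environ))
```

If this prints False, the start scripts below will fail to find Java, so fix the export first.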

Startup Master

Let us now fire up the Apache Spark master. The master is in charge of the cluster: it is where you submit jobs, and it is where you go for the status of the cluster. Start the master as follows:

cd ~/server
./spark-2.1.0-bin-hadoop2.7/sbin/start-master.sh


Once the master is running, navigate to port 8080 on the node's public DNS name and you get a snapshot of the cluster.

[Screenshot: Apache Spark master status page]

The URL highlighted in red is the Spark URL for the cluster. Copy it down, as you will need it to start the slave.

Slave Startup

Ensure that JAVA_HOME is set properly and run the following command.

cd ~/server
./spark-2.1.0-bin-hadoop2.7/sbin/start-slave.sh spark://ip-172-31-30-53.us-west-1.compute.internal:7077


And with that, your cluster should be functioning. Hit the status page at port 8080 again to check. You should now see the slave under Workers, along with the number of cores available and the memory.

Run Jobs Using Pyspark

Let us now run a job using the Python shell provided by Apache Spark. Starting the shell requires the Spark cluster URL mentioned earlier.

cd ~/server
./spark-2.1.0-bin-hadoop2.7/bin/pyspark --master spark://ip-172-31-30-53.us-west-1.compute.internal:7077


After a brief startup, you should see the pyspark prompt “>>>”.

For the purpose of testing, we are using a data file containing salaries of baseball players from 1985 through 2016. It contains 26429 records.

Here is a sample session with the pyspark shell using the file Salaries.csv:

>>> a = sc.textFile('Salaries.csv')
>>> a.count()
26429
>>> a.filter(lambda x : '2005' in x).count()
837


Python Code to Run Jobs

Let us now see how to run some sample Python code on the Spark cluster. The following shows code similar to the above pyspark session.

from pyspark import SparkContext

# Connect to the cluster using the Spark master URL copied earlier.
dataFile = "../data/Salaries.csv"
sc = SparkContext("spark://ip-172-31-30-53.us-west-1.compute.internal:7077", "Simple App")

# Load the file as an RDD of lines.
a = sc.textFile(dataFile)

print("Count of records: ", a.count())
print("Count of 2005 records: ", a.filter(lambda x: '2005' in x).count())

sc.stop()


Along with a bunch of diagnostic output, the code prints:

Count of records:  26429
Count of 2005 records:  837
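One caveat worth noting: the filter `'2005' in x` matches the substring anywhere in the line, so a salary such as 12005000 would also be counted. A stricter approach is to split the CSV line and compare the year field directly. The sketch below (plain Python, with made-up sample rows) assumes the year is the first column, as in the Lahman Salaries.csv layout:

```python
# Stricter year filter: compare the first CSV field instead of
# matching the substring '2005' anywhere in the line.
def is_year(line, year):
    return line.split(',')[0] == str(year)

lines = [
    "yearID,teamID,lgID,playerID,salary",  # header row
    "2004,ANA,AL,anderga01,12005000",      # salary contains '2005'!
    "2005,ANA,AL,colonba01,8000000",
]

print(sum('2005' in x for x in lines))       # 2  -- substring over-matches
print(sum(is_year(x, 2005) for x in lines))  # 1  -- only the true 2005 row
```

In the pyspark session, the stricter version would read `a.filter(lambda x: is_year(x, 2005)).count()`.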


Summary

And that, my friends, is a simple and complete Apache Spark tutorial. We covered the basics of setting up Apache Spark on an AWS EC2 instance. We ran both the Master and Slave daemons on the same node. Finally, we demonstrated an interactive pyspark session as well as some Python code to run jobs on the cluster.


Published at DZone with permission of Jay Sridhar, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
