Exploring Apache Spark 2.1 and Zeppelin in Hortonworks

A tutorial on how to launch a Spark job with Hortonworks' Data Cloud and add Zeppelin services.

Apache Spark 2.1 was released recently in the community. The main focus of this release was improvements in Structured Streaming and Machine Learning.

  • Structured Streaming: Kafka 0.10 support, metrics, and stability improvements
  • Machine Learning: SparkR improvements, including new ML algorithms for LDA, random forests, GMM, etc.

Wanna Try Spark 2.1 Now? Well, You Are In Luck…

Hortonworks Data Cloud (“HDCloud”) for AWS gives you a quick way to launch a Spark cluster in the cloud. With the latest HDCloud Technical Preview (version 1.12 TP, available at http://hortonworks.github.io/hdp-aws/), we have added an option for HDP 2.6 (Technical Preview), which includes a new cluster configuration for Spark 2.1 for Data Science workloads. Let’s use this new HDCloud Technical Preview to launch Spark 2.1 and set up Zeppelin:

  1. Launch a Spark 2.1 cluster with HDCloud
  2. Run an example Spark job using Spark 2.1
  3. Install and Configure Zeppelin to run with Spark 2.1

Step 1: Launch a Spark 2.1 Cluster With HDCloud

Grab the HDCloud Technical Preview, launch your Cloud Controller, log in, and create a cluster that includes Spark 2.1 by selecting HDP 2.6 (Technical Preview) and choosing the Apache Spark 2.1 Cluster Type.

Step 2: Run An Example Spark Job

Before we run the Spark Pi example, you’ll want to open access to the Spark History Server UI (which runs on port 18081). By default, HDCloud does not open port 18081 in the AWS Security Group, so you will need to open it from the AWS EC2 console.

From the AWS EC2 console, locate the EC2 instance for the cluster Master node. Click on the Security Group for this instance and edit the Inbound access for this port. For example, I created an Inbound rule that allows connections to port 18081.

Note: Opening this (or any) port for Inbound access should be done with caution. We strongly recommend you take great care in limiting Inbound port access, protocols, and client IP addresses to prevent malicious agents from gaining access to your data or resources. The example Inbound port rule above uses a very wide-open CIDR for illustrative purposes only. You should provide much more restrictive access.
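To make the note above concrete, here is a minimal Python sketch that refuses overly broad CIDR blocks before building an Inbound rule for port 18081. The function name, the 256-address limit, and the sample client address are hypothetical. The dictionary layout follows one entry of the `IpPermissions` argument that boto3's `authorize_security_group_ingress` accepts, on the assumption that you script the rule rather than use the EC2 console.

```python
import ipaddress

HISTORY_SERVER_PORT = 18081  # Spark2 History Server port named in this article


def build_ingress_rule(cidr, port=HISTORY_SERVER_PORT, max_addresses=256):
    """Return an inbound TCP rule for `port`, refusing overly broad CIDR blocks."""
    network = ipaddress.ip_network(cidr, strict=True)
    if network.num_addresses > max_addresses:
        raise ValueError(
            f"{cidr} admits {network.num_addresses} addresses; limit is {max_addresses}"
        )
    # Layout mirrors one IpPermissions entry for boto3 (assumption: the rule is
    # applied with boto3 instead of the EC2 console).
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": str(network)}],
    }


if __name__ == "__main__":
    # 203.0.113.7 is a documentation-only address standing in for your client IP.
    print(build_ingress_rule("203.0.113.7/32"))
    try:
        build_ingress_rule("0.0.0.0/0")  # would admit every IPv4 address
    except ValueError as err:
        print("rejected:", err)
```

A single-host `/32` block passes the check, while a block covering the whole IPv4 space is rejected before any rule is created.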

From the cloud controller, browse to the Ambari Web UI for your cluster. Log in and navigate to the Spark2 service. You can use the Ambari “Quick Link” to get to the Spark2 History Server UI.

The History Server will show the version and build of Spark. In this case, version 2.1.

Next, SSH into one of the cluster Worker nodes and run the Spark Pi example.
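As a rough sketch of what submitting the Spark Pi example can look like, the Python snippet below assembles a `spark-submit` command line to run on the Worker node. The client directory, the examples jar path, and the resource settings are assumptions rather than values from this article. The flags themselves are standard `spark-submit` options, and `org.apache.spark.examples.SparkPi` is the example class that ships with Spark.

```python
# Assumed values, not taken from the article: adjust them for your cluster.
SPARK_CLIENT_DIR = "/usr/hdp/current/spark2-client"  # assumed Spark2 client path
EXAMPLES_JAR = "examples/jars/spark-examples*.jar"   # assumed jar path; the shell expands the glob


def spark_pi_command(partitions=10, num_executors=3):
    """Build the spark-submit argument list for the bundled SparkPi example."""
    return [
        "./bin/spark-submit",
        "--class", "org.apache.spark.examples.SparkPi",
        "--master", "yarn",
        "--deploy-mode", "client",
        "--num-executors", str(num_executors),
        "--driver-memory", "512m",
        "--executor-memory", "512m",
        "--executor-cores", "1",
        EXAMPLES_JAR,
        str(partitions),  # SparkPi takes the number of partitions as its argument
    ]


if __name__ == "__main__":
    # Joined with plain spaces (no quoting) so the shell can still expand the jar glob.
    print(f"cd {SPARK_CLIENT_DIR} && {' '.join(spark_pi_command())}")
```

Printing the command instead of executing it lets you review the paths and memory settings before pasting it into the SSH session.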

You can see the completed job in the Spark History Server UI.
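The same information is also available programmatically. The sketch below assumes port 18081 is reachable from your machine and uses Spark's monitoring REST API endpoint `/api/v1/applications` on the History Server. The host name in the usage line and the sample payload are hypothetical.

```python
import json
from urllib.request import urlopen


def applications_url(host, port=18081):
    """URL of the History Server's application list (Spark monitoring REST API)."""
    return f"http://{host}:{port}/api/v1/applications"


def completed_apps(payload):
    """Return (id, name) pairs for applications with at least one completed attempt."""
    return [
        (app["id"], app["name"])
        for app in payload
        if any(attempt.get("completed") for attempt in app.get("attempts", []))
    ]


def fetch_completed(host):
    """Query the History Server and list its completed applications."""
    with urlopen(applications_url(host), timeout=10) as resp:
        return completed_apps(json.load(resp))


if __name__ == "__main__":
    # Hypothetical payload in the shape the endpoint returns, trimmed to the used fields.
    sample = [
        {"id": "application_0001", "name": "Spark Pi", "attempts": [{"completed": True}]},
        {"id": "application_0002", "name": "Still running", "attempts": [{"completed": False}]},
    ]
    print(completed_apps(sample))
```

Against a live cluster, a call such as `fetch_completed("master.example.com")` would return the finished jobs, with the host name replaced by the address of your Master node.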

To run more Spark examples in HDCloud, visit the “A Lap Around Spark” blog and try its examples on your cluster.

Optional Step 3: Add Zeppelin Service and Configure Zeppelin

In addition to running Spark jobs from the command line as shown above, you can run Spark jobs from the Zeppelin UI. To do so, follow the steps below to add Zeppelin to this Spark cluster and configure it.

Note: In a future HDCloud Technical Preview, we’ll look to add Zeppelin by default to the Spark 2.1 Cluster Type (so you won’t need the steps below). But for now, you need to manually install & configure Zeppelin into the Spark 2.1 cluster. So here goes…

Use Ambari UI to Add Service

Select Zeppelin Service to Add

Accept the defaults and go through the Ambari Add Service wizard.
Once Zeppelin is added to the cluster, go to the Zeppelin service in Ambari and use the Zeppelin UI quick link to launch the Zeppelin UI.

By default, Zeppelin comes configured with both Spark1 and Spark2 interpreters.

However, before using the Spark2 interpreter, use Ambari to navigate to the Zeppelin config and comment out or remove SPARK_HOME under the Advanced Zeppelin Env section. After commenting out SPARK_HOME, restart Zeppelin using Ambari.
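To illustrate what this edit amounts to, the following Python sketch comments out any active `export SPARK_HOME=...` line in a block of environment settings. It is only an illustration under stated assumptions: Ambari manages this configuration through its UI, the function name is hypothetical, and the sample paths are made up.

```python
import re

# Matches an active "export SPARK_HOME=..." line, keeping any leading indentation.
_SPARK_HOME_LINE = re.compile(r"^(\s*)(export\s+SPARK_HOME=.*)$")


def comment_out_spark_home(env_text):
    """Prefix active SPARK_HOME exports with '# ' and leave every other line untouched."""
    result = []
    for line in env_text.splitlines():
        match = _SPARK_HOME_LINE.match(line)
        if match:
            result.append(f"{match.group(1)}# {match.group(2)}")
        else:
            result.append(line)
    return "\n".join(result)


if __name__ == "__main__":
    # Hypothetical environment content; the paths are placeholders.
    sample = "\n".join([
        "export JAVA_HOME=/path/to/java",
        "export SPARK_HOME=/path/to/spark-client",
    ])
    print(comment_out_spark_home(sample))
```

Lines that are already commented out are left as they are, so running the function twice gives the same result as running it once.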

Once Zeppelin is restarted, visit the Zeppelin UI to create a new note with Spark 2.1, or edit the existing note to run with Spark 2.1.

Adding Security to Zeppelin UI

Note that when you add Zeppelin, its UI does not have authentication enabled. We strongly recommend adding security to the Zeppelin UI. To add authentication and protect the UI from unauthenticated access, see the “adding authentication to Zeppelin” section of the Zeppelin guide.

Open Issues

SparkR does not yet work with this technical preview.

What’s Next?

It is great to see such rapid progress in the Spark community, and we are excited to get your feedback on the latest Spark 2.1 release.

If you have issues or need help with launching Spark 2.1 or trying out HDCloud, please visit https://community.hortonworks.com/spaces/61/operations-track_2.html?type=question. We’d love to hear from you.


Published at DZone with permission of Vinay Shukla. See the original article here.

Opinions expressed by DZone contributors are their own.
