
Tools for Troubleshooting, Installation and Setup of Apache Spark Environments


DZone Zone Leader Tim Spann runs through a checklist for setting up Big Data applications with Apache Spark.



Let's run through some tools for installing, setting up, and troubleshooting a Big Data environment in Apache Spark. 

First, validate that you have connectivity and no firewall issues before you start. Conn-Check is an excellent tool for that.
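If you just need a quick one-off check rather than a full tool, plain netcat can verify that the usual Spark ports are reachable. The host name and ports below are placeholders for your own environment:

# Quick TCP reachability checks with netcat (hypothetical host; adjust to your cluster)
nc -z -w 5 spark-master.example.com 7077   # standalone master port
nc -z -w 5 spark-master.example.com 8080   # master web UI

A zero exit code means the port accepted a TCP connection; anything else points to a firewall or routing problem to fix before going further.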

If you need to set up a number of servers at once, check out Sup.
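Sup drives a fleet of machines over SSH from a single config file. As a minimal sketch of what that looks like, here is a hypothetical Supfile (host names and the exact command are assumptions for illustration):

# Supfile — minimal sketch; hosts and commands are placeholders
networks:
  staging:
    hosts:
      - ubuntu@spark-node1.example.com
      - ubuntu@spark-node2.example.com

commands:
  install-jdk:
    desc: Install the JDK on every node
    run: sudo apt-get install -y openjdk-8-jdk

You would then run a command across the whole network with something like `sup staging install-jdk`.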

First, get version 1.8 of the JDK. Apache Spark works best with Scala, Java, and Python. Next, get the version of Scala you need: Scala 2.10 is the standard version and is used for the precompiled downloads. You can use Scala 2.11, but you will need to build the package yourself, which requires Apache Maven. Install Python 2.6 or later for PySpark, and download SBT for building Scala projects.

Once everything is installed, a very cool tool to work with Apache Spark is the new Apache Zeppelin. It is great for data exploration and data science experiments, so give it a try.
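As a first notebook paragraph, something small like this confirms the Spark interpreter is wired up. It uses the SparkContext `sc` that Zeppelin provides to each notebook:

%spark
// Sum the numbers 1..100 on the cluster; should print 5050.0
val nums = sc.parallelize(1 to 100)
println(nums.sum())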

An example SBT build file (build.sbt) for a Spark job:

name := "Postgresql Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.postgresql" % "postgresql" % "9.4-1204-jdbc42"
libraryDependencies += "org.mongodb" % "mongo-java-driver" % "3.1.0"
libraryDependencies += "com.stratio.datasource" % "spark-mongodb_2.10" % "0.10.0"
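The build above pairs with a job class like the following. This is only a sketch of what a `PGApp` main class might look like using Spark 1.5's JDBC data source; the connection URL, credentials, and table name are placeholders, not values from the original article:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PGApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PGApp")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Load a PostgreSQL table through the JDBC data source
    // (URL, credentials, and table below are placeholders)
    val df = sqlContext.read.format("jdbc").options(Map(
      "url"     -> "jdbc:postgresql://localhost:5432/mydb?user=spark&password=spark",
      "dbtable" -> "public.events"
    )).load()

    df.printSchema()
    println(s"row count: ${df.count()}")

    sc.stop()
  }
}

Packaged with `sbt package`, this produces the JAR referenced in the spark-submit command below.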

An example of running a Spark Scala job with spark-submit. Note that all spark-submit options, including --driver-memory, must come before the application JAR; anything after the JAR is passed to the application as arguments:

sudo /deploy/spark-1.5.1-bin-hadoop2.6/bin/spark-submit \
  --packages com.stratio:spark-mongodb-core:0.8.7 \
  --master spark:// \
  --class "PGApp" \
  --driver-class-path /deploy/postgresql-9.4-1204.jdbc42.jar \
  --driver-memory 1G \
  target/scala-2.10/postgresql-project_2.10-1.0.jar


