
Tools for Troubleshooting, Installation and Setup of Apache Spark Environments


DZone Zone Leader Tim Spann runs through a checklist for setting up Big Data applications with Apache Spark.


Let's run through some tools for installing, setting up, and troubleshooting a Big Data environment built on Apache Spark.

First, validate that you have connectivity and no firewall issues before you start. Conn Check is an awesome tool for that.
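For a quick manual check before reaching for a dedicated tool, a few lines of Scala can confirm that a TCP port is reachable. This is only a minimal sketch and not a replacement for Conn Check. The object name `PortCheck`, the helper `canConnect`, and the host name are hypothetical; the ports assume a standalone cluster with default settings (7077 for the master, 8080 for its web UI).

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object PortCheck {
  // Returns true if a TCP connection to host:port succeeds within timeoutMs.
  def canConnect(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      // Any connection failure (refused, timed out, unknown host) yields false.
      Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical host name; replace with your own master node.
    val host = "spark-master.example.com"
    for (port <- Seq(7077, 8080)) {
      val status = if (canConnect(host, port)) "reachable" else "NOT reachable"
      println(s"$host:$port is $status")
    }
  }
}
```

A failed check does not say whether a firewall, a stopped service, or a DNS problem is to blame, so treat it as a first filter only.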

If you need to set up a number of servers at once, check out Sup.

Next, get version 1.8 of the JDK. Apache Spark works best with Scala, Java, and Python. Get the version of Scala you need: Scala 2.10 is the standard version and is used for the precompiled downloads. You can use Scala 2.11, but you will need to build the package yourself, which requires Apache Maven. Install Python 2.6 or later for PySpark. Also download SBT for building Scala jobs.
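Because the precompiled downloads target Scala 2.10, it can help to confirm which JVM and Scala versions your code actually runs on. The following is a small sketch that can be compiled on its own or pasted into a Scala REPL; the object name `VersionCheck` and the helper `binaryVersion` are made up for illustration.

```scala
import scala.util.Properties

object VersionCheck {
  // Reduces a full version such as "2.10.4" to its binary version "2.10".
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    val javaVersion  = Properties.javaVersion          // value of the java.version property
    val scalaVersion = Properties.versionNumberString  // version of the Scala library in use
    println(s"Java:  $javaVersion")
    println(s"Scala: $scalaVersion (binary ${binaryVersion(scalaVersion)})")
    if (binaryVersion(scalaVersion) != "2.10")
      println("Warning: the precompiled Spark downloads described here expect Scala 2.10")
  }
}
```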

Once everything is installed, a very cool tool to work with Apache Spark is the new Apache Zeppelin. It is great for data exploration and data science experiments, so give it a try.

An example SBT build file for a Spark job:

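name := "Postgresql Project"
version := "1.0"
// Must match the Scala version Spark was built with (2.10 for the precompiled downloads)
scalaVersion := "2.10.4"
// %% appends the Scala binary version to the artifact name, e.g. spark-core_2.10
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
// Plain Java libraries use a single % (no Scala version suffix)
libraryDependencies += "org.postgresql" % "postgresql" % "9.4-1204-jdbc42"
libraryDependencies += "org.mongodb" % "mongo-java-driver" % "3.1.0"
// Scala library with the _2.10 suffix written out by hand, so a single % is used
libraryDependencies += "com.stratio.datasource" % "spark-mongodb_2.10" % "0.10.0"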

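An example of running a Spark Scala job. Options for spark-submit, such as --driver-memory, must come before the application JAR; anything after the JAR is passed to the application as arguments:

sudo /deploy/spark-1.5.1-bin-hadoop2.6/bin/spark-submit \
  --packages com.stratio:spark-mongodb-core:0.8.7 \
  --master spark:// \
  --class "PGApp" \
  --driver-class-path /deploy/postgresql-9.4-1204.jdbc42.jar \
  --driver-memory 1G \
  target/scala-2.10/postgresql-project_2.10-1.0.jar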
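The command above submits a class called `PGApp`, which is not shown here. As a rough sketch of what such a job could look like with the Spark 1.5 DataFrame API, assuming a PostgreSQL database reachable over JDBC: the connection URL, credentials, and table name are placeholders, the helper `jdbcOptions` is made up for illustration, and none of this is the author's actual code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PGApp {
  // Builds the option map for Spark's JDBC data source.
  def jdbcOptions(url: String, table: String): Map[String, String] = Map(
    "url"     -> url,
    "dbtable" -> table,
    "driver"  -> "org.postgresql.Driver"
  )

  def main(args: Array[String]): Unit = {
    // No setMaster here: the master URL is supplied by spark-submit.
    val conf = new SparkConf().setAppName("PGApp")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Placeholder connection details; replace with your own database.
    val url = "jdbc:postgresql://localhost:5432/mydb?user=myuser&password=mypassword"
    val df = sqlContext.read
      .format("jdbc")
      .options(jdbcOptions(url, "my_table"))
      .load()

    df.printSchema()  // column names and types as Spark sees them
    df.show(10)       // first 10 rows

    sc.stop()
  }
}
```

The PostgreSQL driver JAR passed with --driver-class-path is what makes `org.postgresql.Driver` loadable on the driver side.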

Items to add to your Spark toolbox:



Published at DZone with permission of Tim Spann, DZone MVB. See the original article here.
