
Tools for Troubleshooting, Installation and Setup of Apache Spark Environments

DZone Zone Leader Tim Spann runs through a checklist for setting up Big Data applications with Apache Spark.


Let's run through some tools for installing, setting up, and troubleshooting an Apache Spark Big Data environment.

First, validate that you have connectivity and no firewall issues before you start. Conn Check is an awesome tool for that.
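If you just want a quick manual check with standard tools, a minimal sketch using `nc` works too. The host names and ports below are placeholders for your own cluster (7077 and 8080 are Spark's default master RPC and web UI ports):

```shell
#!/bin/sh
# Probe each host:port a Spark cluster needs before installing anything.
check_port() {
  host=$1; port=$2
  if nc -z -w 3 "$host" "$port" 2>/dev/null; then
    echo "OK   $host:$port"
  else
    echo "FAIL $host:$port"
  fi
}

# Placeholder hosts -- substitute your own machines.
check_port spark-master 7077   # Spark master RPC port (default)
check_port spark-master 8080   # Spark master web UI (default)
check_port namenode 8020       # HDFS namenode, if you use HDFS
```

A `FAIL` line means either the service is down or a firewall is in the way, which is exactly what you want to find out before you start debugging Spark itself.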

If you need to set up a number of servers at once, check out Sup.
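Sup drives a fleet of hosts from a single `Supfile`. As a rough sketch of what one might look like (the network name, user, and hosts are placeholders; check the Sup README for the exact schema your version expects):

```yaml
# Supfile -- hosts and commands below are illustrative placeholders
networks:
  production:
    hosts:
      - ubuntu@spark-master.example.com
      - ubuntu@spark-worker1.example.com

commands:
  uptime:
    desc: Show load on every host
    run: uptime
```

You would then run a command across the whole network with something like `sup production uptime`.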

First, get version 1.8 of the JDK. Apache Spark works best with Scala, Java, and Python, so get the version of Scala you need: Scala 2.10 is the standard version and is used for the precompiled downloads, while Scala 2.11 requires you to build the package yourself, which in turn requires Apache Maven. Install Python 2.6 or later for PySpark, and install SBT for building Scala projects.
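Before going further, it can save time to confirm each of those pieces is actually on the PATH. A minimal sketch in plain POSIX sh (the tool list mirrors the paragraph above; adjust it to taste):

```shell
#!/bin/sh
# Report whether each required tool is installed and on the PATH.
report_version() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: installed"
  else
    echo "$1: not found"
  fi
}

for tool in java scala python mvn sbt; do
  report_version "$tool"
done
```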

Once everything is installed, give the new Apache Zeppelin a try. It is a very cool notebook tool for data exploration and data science experiments.
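A Zeppelin note talks to Spark through interpreter paragraphs. A sketch of what a first exploration paragraph might look like (the JSON path is a placeholder, and this only runs inside Zeppelin's Spark interpreter, where `sc`, `sqlContext`, and `z` are pre-defined):

```scala
%spark
// sqlContext is injected by Zeppelin's Spark interpreter
val events = sqlContext.read.json("/data/events.json")  // placeholder path
events.printSchema()
z.show(events.limit(10))  // render the first rows as a Zeppelin table
```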

An example SBT build file for a Spark job:

name := "Postgresql Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.5.1"
libraryDependencies += "org.postgresql" % "postgresql" % "9.4-1204-jdbc42"
libraryDependencies += "org.mongodb" % "mongo-java-driver" % "3.1.0"
libraryDependencies += "com.stratio.datasource" % "spark-mongodb_2.10" % "0.10.0"
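The build above produces a jar containing the job's main class. As a minimal sketch of what that `PGApp` class might contain, assuming Spark 1.5's JDBC data source (the connection URL and table name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PGApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PGApp")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Connection details are placeholders -- substitute your own.
    val df = sqlContext.read
      .format("jdbc")
      .options(Map(
        "url"     -> "jdbc:postgresql://dbhost:5432/mydb?user=spark&password=secret",
        "dbtable" -> "events"))
      .load()

    df.show(10)
    sc.stop()
  }
}
```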

An example of submitting a Spark Scala job with spark-submit:

sudo /deploy/spark-1.5.1-bin-hadoop2.6/bin/spark-submit \
  --packages com.stratio:spark-mongodb-core:0.8.7 \
  --master spark:// \
  --class "PGApp" \
  --driver-class-path /deploy/postgresql-9.4-1204.jdbc42.jar \
  --driver-memory 1G \
  target/scala-2.10/postgresql-project_2.10-1.0.jar


