Simple Data Analysis Using Apache Spark

Apache Spark is excellent for certain kinds of distributed computation, especially iterative operations on large data sets.

Apache Spark is a framework for distributed computing. A typical Spark program runs in parallel across many nodes in a cluster. For a detailed introduction to Spark, see the Apache Spark website (https://spark.apache.org/documentation.html).

The purpose of this tutorial is to walk through a simple Spark example: setting up the development environment and running some simple analysis on a sample data file whose records consist of userId, age, gender, profession, and zip code (you can download the source and the data file from GitHub: https://github.com/rjilani/SimpleSparkAnalysis).

For this tutorial I will be using the IntelliJ IDEA Community Edition with the Scala plugin; you can download both the IDE and the plugin from the IntelliJ website. You can also use your favorite editor or the Scala IDE for Eclipse if you prefer. A basic understanding of Scala and Spark is assumed. The version of Scala used for this tutorial is 2.11.4, with Apache Spark 1.3.1.

Once you have installed the IntelliJ IDE and the Scala plugin, start a new project using the File->New->Project wizard and choose Scala and SBT in the New Project window. If this is your first project, you may also have to choose a Java project SDK and the Scala and SBT versions.

Once the project is created, copy and paste the following lines into your SBT file:

name := "SparkSimpleTest"

version := "1.0"

scalaVersion := "2.11.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.1",
  "org.apache.spark" %% "spark-sql" % "1.3.1",
  "org.apache.spark" %% "spark-streaming" % "1.3.1"
)

Note: you don't need the Spark SQL and Spark Streaming libraries to finish this tutorial, but add them anyway in case you want to use Spark SQL and Streaming in future examples. Once you save the SBT file, IntelliJ will ask you to refresh; when you do, it will download all the required dependencies. Once the dependencies are downloaded, you are ready for the fun stuff.

Go ahead and add a new Scala class of kind Object (without going into the Scala semantics, in plain English this means your class will be executable, with a main method inside it).

For the sake of brevity I will also omit the boilerplate code in this tutorial (you can download the full source file from GitHub: https://github.com/rjilani/SimpleSparkAnalysis). Now let's jump into the code, but before proceeding further let's cut the verbosity by turning off Spark logging using these two lines at the beginning of the code:
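One common way to do this, sketched here on the assumption that Spark 1.3.1 logs through log4j 1.x (which spark-core brings in), is to switch off the top-level loggers. The logger names "org" and "akka" and the QuietLogging object are illustrative assumptions, not necessarily what the downloadable source uses.

```scala
import org.apache.log4j.{Level, Logger}

object QuietLogging {
  // Turn off every logger under the "org" and "akka" prefixes
  // (assumed here to be the noisy ones; adjust as needed).
  def quiet(): Unit = {
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)
  }

  def main(args: Array[String]): Unit = {
    quiet()
    // Show the level that was just set on the "org" logger.
    println(Logger.getLogger("org").getLevel)
  }
}
```

Inside your own main method, the two Logger.getLogger(...) calls would go before the Spark context is created so that startup messages are suppressed as well.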



Now add the following two lines:

val conf = new SparkConf().setAppName("Simple Application").setMaster("local")

val sc = new SparkContext(conf)

The two lines above are boilerplate for creating a Spark context by passing it the configuration information. In the Spark programming model every application runs in a Spark context; you can think of the Spark context as the entry point for interacting with the Spark execution engine. The configuration object above tells Spark where to execute the Spark jobs (in this case the local machine). For a detailed explanation of the configuration options, please refer to the Spark documentation on the Spark website.
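To see the context in action before touching the data file, here is a minimal sketch of a complete program built around those two lines. The SimpleApp object, the run method, and the toy job that sums the numbers 1 to 100 are illustrative additions, not part of the tutorial's source.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  // Creates a local context, runs one tiny job, and always stops the context.
  def run(): Int = {
    // "local" runs Spark inside this JVM with a single worker thread.
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      // Distribute the numbers 1 to 100 and add them up.
      sc.parallelize(1 to 100).reduce(_ + _)
    } finally {
      sc.stop() // release the context so another one can be created later
    }
  }

  def main(args: Array[String]): Unit =
    println(s"Sum computed by Spark: ${run()}")
}
```

Passing "local[*]" to setMaster instead would let Spark use one worker thread per available core on the local machine.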

Spark is written in Scala and exploits the functional programming paradigm, so writing map and reduce jobs becomes very natural and intuitive. Spark exposes many transformation functions; a detailed explanation of them can be found on the Spark website (https://spark.apache.org/docs/latest/programming-guide.html#transformations).

Now let's get back to the fun stuff (real coding):

val data = sc.textFile("./Data/u.user") // Location of the data file
  .map(line => line.split(","))
  .map(userRecord => (userRecord(0), userRecord(1), userRecord(2), userRecord(3), userRecord(4)))

The code above reads a comma-delimited text file of user records and chains two transformations using the map function. textFile returns the file as an RDD (resilient distributed dataset) of lines; the first map takes a closure that splits each line into fields on the "," delimiter. The second map takes each array of fields and turns it into a Scala tuple, so the result is an RDD of tuples. Think of a Scala tuple as an immutable, fixed-size container that can hold objects of different types.
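Because the functions passed to map are ordinary Scala functions, the per-line logic can be tried without Spark. The sketch below applies the same two steps to a plain Scala list; the sample records are made up, and the ParseUserLine object and User type alias are hypothetical names introduced for illustration.

```scala
object ParseUserLine {
  // A user record: userId, age, gender, profession, zip code (all kept as strings).
  type User = (String, String, String, String, String)

  // Same two steps as the chained maps: split on commas, then build a 5-tuple.
  def parse(line: String): User = {
    val fields = line.split(",")
    (fields(0), fields(1), fields(2), fields(3), fields(4))
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample records in the assumed userId,age,gender,profession,zip layout.
    val lines = List("1,24,M,technician,85711", "2,53,F,other,94043")
    val users = lines.map(parse)
    users.foreach(println)
  }
}
```

One thing to keep in mind: split(",") drops trailing empty fields, so a line that ends with a comma yields fewer than five elements and indexing it would fail. Using split(",", -1) keeps the empty fields if the data may contain them.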

Every Spark RDD exposes a collect method that returns its contents as an array, so if you want to understand what is going on, you can iterate over the whole RDD as an array of tuples using the code below:


// Data file is transformed into an Array of tuples at this point
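A minimal sketch of that step might look like the following. To keep it self-contained, it builds the RDD from two made-up lines with parallelize instead of reading the data file; the CollectExample object and its method name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectExample {
  type User = (String, String, String, String, String)

  // Builds a tiny RDD of user tuples and pulls it back to the driver as an Array.
  def collectUsers(sc: SparkContext): Array[User] = {
    // Made-up records standing in for the data file.
    val lines = sc.parallelize(Seq("1,24,M,technician,85711", "2,53,F,other,94043"))
    val data = lines
      .map(_.split(","))
      .map(r => (r(0), r(1), r(2), r(3), r(4)))
    data.collect()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Collect Example").setMaster("local"))
    try collectUsers(sc).foreach(println)
    finally sc.stop()
  }
}
```

collect copies the entire RDD into the driver's memory, which is fine for a small sample file; for a large data set, take(n) returns only the first n elements.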





The whole fun of using Spark is to do some analysis on Big Data (no buzz intended). So let's ask some questions and do the real analysis.

1.  How many unique professions do we have in the data file?

//Number of unique professions in the data file
val uniqueProfessions = data.map {case (id, age, gender, profession,zipcode) => profession}.distinct().count()

Look at the code above. That is all it takes to count the unique professions in the whole data set. Isn't it amazing! Here is the explanation of the code:

The map function is again an example of a transformation. The parameter passed to map is a pattern-matching anonymous function (a block of case clauses, not a case class) that destructures each tuple and returns its profession field, giving an RDD of professions; we then call the distinct and count functions on that RDD.
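The same map, distinct, and count shape also works on ordinary Scala collections, which makes it easy to experiment with the pattern-matching function on its own. This is a sketch on a made-up in-memory sample, not the tutorial's RDD code; the object and method names are hypothetical.

```scala
object UniqueProfessions {
  type User = (String, String, String, String, String)

  // { case (...) => ... } is an anonymous function that pattern-matches its argument;
  // here it destructures the 5-tuple and keeps only the profession.
  def professions(users: Seq[User]): Seq[String] =
    users.map { case (_, _, _, profession, _) => profession }

  // Number of distinct professions in the sample.
  def uniqueCount(users: Seq[User]): Int =
    professions(users).distinct.size

  def main(args: Array[String]): Unit = {
    // Made-up sample records.
    val users = Seq(
      ("1", "24", "M", "technician", "85711"),
      ("2", "53", "F", "other", "94043"),
      ("3", "23", "M", "technician", "32067")
    )
    println(uniqueCount(users))
  }
}
```

On an RDD the final call is count(), which returns a Long, whereas size on a local collection returns an Int.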

2.  How many users belong to each profession?

//Group users by profession and sort them by descending order

val usersByProfession = data.map{ case (id, age, gender, profession,zipcode) => (profession, 1) }
.reduceByKey(_ + _).sortBy(-_._2)

The map function is again an example of a transformation. The parameter passed to map is a pattern-matching anonymous function that returns a tuple of the profession and the integer 1. The reduceByKey function then reduces these pairs to one tuple per unique profession, whose value is the sum of all the 1s for that key.

The sortBy function sorts the RDD by passing a closure that takes a tuple as input and returns the value to sort on, here the second element of the tuple (the number of users for each profession). The "-" sign negates that count, so the default ascending sort yields descending order by count.
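To make the reduce-then-sort step concrete, here is a rough local analogue on a plain Scala collection. Scala collections have no reduceByKey, so this sketch swaps in groupBy followed by a sum to get the same per-key totals; the object name and the sample input are made up.

```scala
object CountByProfession {
  // Pair each profession with 1, total the 1s per profession,
  // then sort by the negated count so the largest count comes first.
  def countDescending(professions: Seq[String]): Seq[(String, Int)] =
    professions
      .map(p => (p, 1))
      .groupBy(_._1)
      .map { case (p, pairs) => (p, pairs.map(_._2).sum) }
      .toSeq
      .sortBy(-_._2)

  def main(args: Array[String]): Unit = {
    // Made-up sample of profession values.
    val sample = Seq("technician", "other", "technician", "writer", "technician", "other")
    countDescending(sample).foreach(println)
  }
}
```

On an RDD, sortBy also accepts an ascending flag, so sortBy(_._2, ascending = false) is an alternative to negating the count.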

You can print the list of professions and their counts using the line below:
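A minimal sketch of the printing step, assuming Spark 1.3 or later (where reduceByKey is available on pair RDDs without extra imports). It counts a made-up list of professions rather than the data file, and the PrintProfessionCounts object and report method are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PrintProfessionCounts {
  // Counts professions and returns "profession: count" lines, most common first.
  def report(sc: SparkContext, professions: Seq[String]): Array[String] =
    sc.parallelize(professions)
      .map(p => (p, 1))
      .reduceByKey(_ + _)
      .sortBy(-_._2)
      .collect() // bring the (profession, count) pairs back to the driver
      .map { case (profession, count) => s"$profession: $count" }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Print Counts").setMaster("local"))
    try {
      // Made-up sample of profession values.
      val sample = Seq("technician", "other", "technician", "writer", "technician", "other")
      report(sc, sample).foreach(println)
    } finally sc.stop()
  }
}
```

With the tutorial's own RDD, the equivalent is to call collect() on usersByProfession and print each (profession, count) pair.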







3.  How many users belong to each zip code in the sample file?

//Group users by zip code and sort them by descending order

val usersByZipCode = data.map{ case (id, age, gender, profession,zipcode) => (zipcode, 1) }
.reduceByKey(_ + _).sortBy(-_._2)

4.  How many users are male and how many are female?

//Group users by Gender and sort them by descending order

val usersByGender = data.map{ case (id, age, gender, profession,zipcode) => (gender, 1) }
.reduceByKey(_ + _).sortBy(-_._2)

Items 3 and 4 use the same pattern as item 2. The only difference is that the map function returns a tuple keyed by zip code or gender, respectively, which is then reduced by the reduceByKey function.
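Since items 2 to 4 differ only in which field becomes the key, the shared pattern can be factored into one helper. This is a sketch of one possible refactoring, assuming Spark 1.3 or later; the countBy helper, the User alias, and the sample records are hypothetical and do not come from the tutorial's source.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CountByField {
  type User = (String, String, String, String, String)

  // Counts users per value of the chosen field, most common value first.
  def countBy(data: RDD[User])(field: User => String): RDD[(String, Int)] =
    data.map(user => (field(user), 1))
      .reduceByKey(_ + _)
      .sortBy(-_._2)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Count By Field").setMaster("local"))
    try {
      // Made-up sample records: userId, age, gender, profession, zip code.
      val data = sc.parallelize(Seq(
        ("1", "24", "M", "technician", "85711"),
        ("2", "53", "F", "other", "94043"),
        ("3", "23", "M", "writer", "85711")
      ))
      countBy(data)(_._3).collect().foreach(println) // users by gender
      countBy(data)(_._5).collect().foreach(println) // users by zip code
    } finally sc.stop()
  }
}
```

The field selector is just a function from a user tuple to a string, so countBy(data)(_._4) would reproduce the profession counts from item 2.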

I hope the above tutorial is easy to digest. If not, please hang in there, brush up on your Scala skills, and then review the code again. With time and practice you will find the code much easier to understand.
