
Learning Spark With Scala


Often, processing alone is not enough when it comes to big volumes of data. Data must be processed quickly, in real time, continuously, and concurrently.


Demand for stream processing has grown considerably, because processing large volumes of data is often not enough on its own.

Data has to be processed quickly so that a business can react to changing conditions in real time.

Stream processing means processing data continuously and concurrently as it arrives, rather than in periodic batches.

That is why I have started learning Apache Spark, which handles both batch workloads and real-time (streaming) workloads.

Apache Spark is an open-source, general-purpose cluster computing system designed for speed. It provides high-level APIs in Java, Scala, Python, and R, and an execution engine that runs Spark applications across a cluster. The Spark project has claimed speeds of up to 100 times faster than Hadoop MapReduce when data is held in memory, and up to ten times faster when running from disk; actual gains depend on the workload.

Spark also supports interactive queries, graph processing, in-memory computation, and batch processing behind a single, consistent API, with an emphasis on speed and ease of use.

Spark is not only a big data processing engine. It is a framework that provides a distributed environment for processing data, which means it can run general-purpose computations in parallel, not just traditional data pipelines.

To see its performance, let's take the example of a factorial.

Calculating the factorial of a very large number is expensive in any programming language: the intermediate products grow to hundreds of thousands of digits, so the CPU needs a long time to complete the calculation.

I have written the factorial function in two ways:

Using tail recursion in Scala:

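import scala.annotation.tailrec
def factorial(num: BigInt): BigInt = {
  // Accumulator-based helper; the recursive call is in tail position.
  @tailrec
  def factImp(num: BigInt, fact: BigInt): BigInt = {
    if (num == 0) fact
    else factImp(num - 1, num * fact)
  }
  factImp(num, 1)
}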

Computing the factorial of 200000 with the code above took about 20 s on my machine (quad-core Intel i5).
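
The post does not show how the timings were taken. One simple way to reproduce such a measurement is a small wall-clock timer around the call, sketched below. The names `timed` and `TimingSketch` are hypothetical and introduced only for illustration, and a single run like this is a rough estimate at best because JVM warm-up is not accounted for.

```scala
import scala.annotation.tailrec

object TimingSketch {
  // Hypothetical helper (not from the original post): evaluates a block once
  // and returns its result together with the elapsed time in milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  // Tail-recursive factorial, following the version described in the post.
  def factorial(num: BigInt): BigInt = {
    @tailrec
    def factImp(n: BigInt, acc: BigInt): BigInt =
      if (n == 0) acc else factImp(n - 1, n * acc)
    factImp(num, 1)
  }

  def main(args: Array[String]): Unit = {
    // A small input keeps the demo quick; raise it to repeat the experiment.
    val (value, ms) = timed(factorial(BigInt(2000)))
    println(s"factorial(2000) has ${value.toString.length} digits, took $ms ms")
  }
}
```

Running the measurement several times and discarding the first few runs usually gives a steadier number than a single call.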

Factorial function using Spark:

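def factorialUsingSpark(num: BigInt): BigInt = {
  if (num == 0) BigInt(1)
  else {
    // sc is assumed to be a SparkContext in scope (spark-shell predefines one).
    val list = (BigInt(1) to num).toList
    // Distribute the numbers and multiply them together in parallel.
    sc.parallelize(list).reduce(_ * _)
  }
}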
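
The function above relies on a `sc` value that is not defined in the snippet. Outside of `spark-shell`, one way to obtain it is through a local `SparkSession`, as in the minimal sketch below. It assumes the Spark SQL library is on the classpath; the object name `FactorialApp`, the application name, and the `local[*]` master setting are illustrative choices, not taken from the original post.

```scala
import org.apache.spark.sql.SparkSession

object FactorialApp {
  // Same idea as the post's function, but the session is passed in explicitly
  // instead of relying on a global `sc`.
  def factorialUsingSpark(spark: SparkSession, num: BigInt): BigInt =
    if (num == 0) BigInt(1)
    else {
      val sc = spark.sparkContext
      // Distribute 1..num across partitions and multiply everything together.
      sc.parallelize((BigInt(1) to num).toList).reduce(_ * _)
    }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM using all available cores (assumption:
    // a single-machine setup like the one described in the post).
    val spark = SparkSession.builder()
      .appName("factorial-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(factorialUsingSpark(spark, BigInt(20)))
    finally spark.stop() // always release Spark resources
  }
}
```

Passing the session as a parameter keeps the function easy to test and avoids depending on shell-provided globals.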

Spark took only about 5 s to compute the factorial of 200000 on the same machine, roughly 4x faster than the plain Scala version.

The timings depend on the hardware, of course, but they at least give an idea of how efficiently Spark can handle heavy computations.

So, this was my first step in learning Spark with Scala. I know it is not much; I still need to explore more of Spark, such as RDDs, DataFrames, and Structured Streaming, which I will write about in future posts. So, stay tuned!
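
A plausible way to picture where the speedup comes from: `reduce` first combines the elements inside each partition and then merges the per-partition results. The plain Scala sketch below imitates that shape without Spark by splitting the range into chunks, multiplying each chunk on its own, and then multiplying the partial products. It runs sequentially, so it shows the structure only, not the parallelism; `ChunkedProduct`, `chunkedFactorial`, and the `parts` parameter are hypothetical names used for illustration.

```scala
object ChunkedProduct {
  // Splits 1..n into roughly `parts` contiguous chunks (a stand-in for Spark
  // partitions), multiplies each chunk separately, then combines the results.
  def chunkedFactorial(n: Int, parts: Int): BigInt = {
    require(n >= 0 && parts > 0, "n must be non-negative and parts positive")
    if (n == 0) BigInt(1)
    else {
      val chunkSize = math.max(1, math.ceil(n.toDouble / parts).toInt)
      (1 to n)
        .grouped(chunkSize)                          // one group per "partition"
        .map(chunk => chunk.map(BigInt(_)).product)  // partial product per chunk
        .reduce(_ * _)                               // merge the partial products
    }
  }

  def main(args: Array[String]): Unit =
    println(chunkedFactorial(10, 3))
}
```

In Spark the per-chunk products can be computed on different cores at the same time, which is where a multi-core machine can gain over the single-threaded recursion.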

The complete code can be downloaded from GitHub. Comments and suggestions are welcome.


Published at DZone with permission of

