Over a million developers have joined DZone.

Run Word Count With Scala and Spark on HDInsight

DZone's Guide to

Run Word Count With Scala and Spark on HDInsight

After solving the problem with word counts on Scala and Spark, the next step is to deploy the solution to HDInsight using Spark, HDFS, and Scala.

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Previously, we tried to solve a word count problem with a Scala and Spark approach.

The next step is to deploy our solution to HDInsight using Spark, HDFS, and Scala.

We shall provision a Spark cluster:


Since we are going to use HDInsight, we can utilize HDFS and therefore use the Azure storage:


Then, we choose our instance types:


We are ready to create the Spark cluster:


Our data shall be uploaded to the HDFS file system.

To do so, we will upload our text files to the Azure storage account that is integrated with HDFS.

For more information on managing a storage account with the Azure CLI, check the official guide. Any text file will work.

azure storage blob upload mytextfile.txt sparkclusterscala example/data/mytextfile.txt 

Since we use JDFS, we shall make some changes to the original script:

val text = sc.textFile("wasb:///example/data/mytextfile.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)

Then, we can upload our Scala class to the head node using SSH:

scp WordCountscala.scala demon@{your cluster}-ssh.azurehdinsight.net:/home/demo/WordCountscala.scala 

Again, in order to run the script, things are pretty straightforward:

spark-shell -i WordCountscala.scala 

And once the task is done, we are presented with the Spark prompt. Plus, we can now save our results to the HDFS file system.

<scala> counts.saveAsTextFile("/wordcount_results")          

And do a quick check.

hdfs dfs -ls
hdfs dfs -text /wordcount_results/part-00000

And that's it! 

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

big data ,hdinsights ,tutorial ,scala ,spark ,hdfs

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}