
Run Word Count With Scala and Spark on HDInsight


After solving the problem with word counts on Scala and Spark, the next step is to deploy the solution to HDInsight using Spark, HDFS, and Scala.


Previously, we tried to solve a word count problem with a Scala and Spark approach.
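
As a reminder, the solution looked roughly like this (a sketch of the local version; the exact input path is illustrative):

val text = sc.textFile("mytextfile.txt")
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect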

The next step is to deploy our solution to HDInsight using Spark, HDFS, and Scala.

We shall provision a Spark cluster:

[Screenshot: provisioning a new HDInsight Spark cluster]

Since we are going to use HDInsight, we can use Azure Blob storage as the cluster's HDFS-compatible file system:

[Screenshot: configuring the Azure storage account for the cluster]

Then, we choose our instance types:

[Screenshot: choosing the instance types]

We are ready to create the Spark cluster:

[Screenshot: confirming and creating the Spark cluster]

Our data shall live on the HDFS file system. To get it there, we upload our text file to the Azure storage account that is integrated with HDFS.

For more information on managing a storage account with the Azure CLI, check the official guide. Any text file will work.

azure storage blob upload mytextfile.txt sparkclusterscala example/data/mytextfile.txt 
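
To verify the upload, we can list the blobs in the container (this assumes the same classic azure-xplat-cli used for the upload above):

azure storage blob list sparkclusterscala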

Since we use HDFS, we shall make some changes to the original script:

// Read the input file from the cluster's default storage (WASB-backed HDFS)
val text = sc.textFile("wasb:///example/data/mytextfile.txt")
// Split each line into words, pair each word with a 1, and sum the pairs per word
val counts = text.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
counts.collect
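
Note that the short wasb:/// form resolves to the cluster's default storage container. If the file lived in a different container or account, the fully qualified form could be used instead (the container and account names here are placeholders):

val text = sc.textFile("wasb://mycontainer@myaccount.blob.core.windows.net/example/data/mytextfile.txt")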


Then, we can upload our Scala script to the head node using scp:

scp WordCountscala.scala demo@{your cluster}-ssh.azurehdinsight.net:/home/demo/WordCountscala.scala
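
With the script in place, we connect to the head node over SSH (same endpoint as the scp command):

ssh demo@{your cluster}-ssh.azurehdinsight.net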

Once connected, running the script is straightforward:

spark-shell -i WordCountscala.scala 

And once the task is done, we are presented with the Spark prompt. Plus, we can now save our results to the HDFS file system.

scala> counts.saveAsTextFile("/wordcount_results")

And do a quick check:

hdfs dfs -ls /wordcount_results/
hdfs dfs -text /wordcount_results/part-00000
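
Since saveAsTextFile writes one part-XXXXX file per partition, the results may span several files. If a single file is preferred, the RDD can be coalesced to one partition before saving, at the cost of write parallelism (the output path here is illustrative):

scala> counts.coalesce(1).saveAsTextFile("/wordcount_results_single")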

And that's it! 

