HDInsight, Writing Hadoop Map Reduce Jobs In C# And Querying Results Back Using LINQ
In this post, we’ll explore:
- HDInsight/Hadoop on Azure in general, and how to get started with it
- Writing Map Reduce jobs for Hadoop in C# and storing the results in HDFS
- Transferring the result data from HDFS to Hive
- Reading the data back from Hive using C# and LINQ
Preface
If you are new to Hadoop and Big Data concepts, I suggest you quickly check out
- A quick introduction to Hadoop
- A quick introduction to Map Reduce
- An introduction to Hadoop HDInsight Services on Azure
There are a couple of ways you can start with HDInsight:
- Go to Azure Preview features and opt in for HDInsight, and/or install it locally
Step 1: Setting up your instance locally on Windows
For development, I highly recommend installing the HDInsight developer version locally – you can find it right inside the Web Platform Installer.
Once you install HDInsight locally, ensure all the Hadoop services are running.
Also, you may use the following links once your cluster is up and running.
- Go to http://localhost:50030 to see the HDInsight dashboard locally
- Go to http://localhost:50070 to explore the HDFS file system
Here is the HDInsight dashboard running locally.
And now you are set.
Step 2: Install the Map Reduce package via NuGet
Let us explore how to write a few Map Reduce jobs in C#. We’ll write a quick job to count namespace declarations in C# source files. Earlier, in a couple of posts related to Hadoop on Azure – Analyzing some ‘Big Data’ using C# and Extracting Top 500 MSDN Links from Stack Overflow – I showed how to use C# Map Reduce jobs with Hadoop Streaming to do some meaningful analytics. In this post, we’ll re-write the mapper and reducer leveraging the new .NET SDK, and will apply them to a few code files (you can apply the same approach to any dataset).
The new .NET SDK for Hadoop makes it very easy to work with Hadoop from .NET – with more types for supporting Map Reduce jobs, for creating LINQ to Hive queries, etc. The SDK also provides an easier way to create and submit your own Map Reduce jobs directly in C#, either to the local developer instance or to an Azure Hadoop cluster.
To start with, create a console project and install the Microsoft.Hadoop.Mapreduce package via NuGet.
Install-Package Microsoft.Hadoop.Mapreduce
This will add the required dependencies.
Step 3: Writing your Mapper and Reducer
The mapper will read its input from the HDFS file system, and the reducer will emit its output back to HDFS. HDFS is Hadoop’s distributed file system, which provides reliable, scalable storage through replication. Check out the Apache HDFS architecture guide for details.
With the Hadoop SDK, you can now inherit your mapper from the MapperBase class and your reducer from the ReducerCombinerBase class. This is equivalent to the independent mapper and reducer exes I demonstrated earlier using Hadoop streaming – we’ve just got a better way of doing the same thing. In the Map method, we extract the namespace declarations using a regex and emit them (see the Hadoop streaming details in my previous article).
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Microsoft.Hadoop.MapReduce;

//Mapper
public class NamespaceMapper : MapperBase
{
    //Override the Map method
    public override void Map(string inputLine, MapperContext context)
    {
        //Extract the namespace declarations in the C# files
        var reg = new Regex(@"(using)\s[A-Za-z0-9_\.]*\;");
        var matches = reg.Matches(inputLine);
        foreach (Match match in matches)
        {
            //Just emit the namespaces
            context.EmitKeyValue(match.Value, "1");
        }
    }
}

//Reducer
public class NamespaceReducer : ReducerCombinerBase
{
    //Accepts each key and counts its occurrences
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        //Write back the key and its count
        context.EmitKeyValue(key, values.Count().ToString());
    }
}
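To get a feel for what the mapper emits, here is a quick standalone sanity check of the namespace regex – a hypothetical snippet run outside Hadoop; only the pattern itself comes from the mapper above, and the tab-separated output mirrors the key/value shape EmitKeyValue produces via Hadoop streaming.

using System;
using System.Text.RegularExpressions;

class RegexSanityCheck
{
    static void Main()
    {
        //Same pattern as in NamespaceMapper
        var reg = new Regex(@"(using)\s[A-Za-z0-9_\.]*\;");
        var line = "using System.Text; using System.Linq;";
        foreach (Match m in reg.Matches(line))
        {
            //Prints e.g. "using System.Text;<TAB>1"
            Console.WriteLine(m.Value + "\t1");
        }
    }
}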
Next, let us write a Map Reduce job and configure it.
Step 4: Writing your Namespace Counter Job
You can simply specify your Mapper and Reducer types and inherit from HadoopJob to create a job class. Here we go.
//Our Namespace counter job
public class NamespaceCounterJob : HadoopJob<NamespaceMapper, NamespaceReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var config = new HadoopJobConfiguration();
        config.InputPath = "input/CodeFiles";
        config.OutputFolder = "output/CodeFiles";
        return config;
    }
}
Note that we are overriding the Configure method to specify the configuration parameters – in this case, the input and output folders for our mapper/reducer. The lines of the files in the input folder will be passed to our mapper instances, and the combined output from the reducer instances will be placed in the output folder.
Step 5: Submitting the job
Finally, we need to connect to the cluster and submit the job using the ExecuteJob method. Here we go with the main driver.
class Program
{
    static void Main(string[] args)
    {
        var hadoop = Hadoop.Connect();
        var result = hadoop.MapReduceJob.ExecuteJob<NamespaceCounterJob>();
    }
}
We are invoking the ExecuteJob method with the NamespaceCounterJob type we just created. Here we are submitting the job locally – for an actual execution scenario against an Azure HDInsight cluster, you should pass the Azure connection parameters to Hadoop.Connect, along the lines of the sketch below.
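For illustration, connecting to an Azure cluster might look roughly like this. Treat it as a hedged sketch: the exact Connect overload and parameter order varied across versions of the .NET SDK for Hadoop, and every value here is a placeholder, not a real name or credential.

//Hypothetical sketch – verify the Connect overload against your SDK version
var hadoop = Hadoop.Connect(
    new Uri("https://yourcluster.azurehdinsight.net"), //cluster endpoint (placeholder)
    "clusterUser",                                     //cluster login (placeholder)
    "hadoopUser",                                      //hadoop user (placeholder)
    "password",                                        //password (placeholder)
    "yourstorage.blob.core.windows.net",               //backing storage account (placeholder)
    "storageKey",                                      //storage key (placeholder)
    "container",                                       //container backing the cluster storage (placeholder)
    false);                                            //whether to create the container
var result = hadoop.MapReduceJob.ExecuteJob<NamespaceCounterJob>();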
Step 6: Executing the job
Before executing the job, you should prepare your input – in this case, copy the source code files into the input folder we provided as part of the configuration when creating our job (see NamespaceCounterJob). To do this, fire up the Hadoop command line console from the desktop. If your cluster is on Azure, you can remote login to the cluster head node by choosing Remote Login from the HDInsight dashboard.
- Create a folder using the hadoop fs -mkdir input/CodeFiles command
- Copy a few C# files to the folder using hadoop fs -copyFromLocal your\path\*.cs input/CodeFiles
Here, I’m copying all my .cs files under the BasicsRevisited folder to input/CodeFiles.
Now, build your project in Visual Studio, open the bin folder, and execute your exe file (the name of my executable is simply MapReduce.exe). This will internally kick off MRRunner.exe, and your Map Reduce job will get executed. You can see that the detected file dependencies are automatically submitted.
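Alternatively, you can drive the submission through MRRunner.exe directly. A hypothetical invocation would look like the line below – the -dll switch is an assumption based on the SDK samples, so check your SDK’s documentation for the exact syntax.

MRRunner.exe -dll MapReduce.exe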
Once the Map Reduce job is completed, you’ll find the combined output placed in the output/CodeFiles folder. You can issue the -ls and -cat commands to list the files and view the content of the part-00000 file where the output is placed (yes, a little Linux knowledge will help at times).
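For example, with the output location we configured above:

hadoop fs -ls output/CodeFiles
hadoop fs -cat output/CodeFiles/part-00000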
The part-00000 file contains the combined output of our task – the namespaces along with their counts from the files I submitted.
Step 7: Loading data from HDFS to Hive
As a next step, let us load the data from HDFS into Hive so that we can query it. We’ll create a table using the CREATE TABLE Hive syntax and load the data into it. You can run the ‘hive’ command from the Hadoop command line and execute the following statements.
CREATE TABLE nstable (
  namespace STRING,
  count INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH 'output/CodeFiles/part-00000' INTO TABLE nstable;
And here is what you might see.
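Once the load completes, a quick query from the same hive prompt confirms the data landed (illustrative – any SELECT will do):

SELECT * FROM nstable LIMIT 10;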
Now you can read the data back from Hive using LINQ and C# – a rough sketch follows the list below.
- See my article Querying data from the Hive using LINQ and C#
- I wrote a VS add-in so that you can connect to Hive and query the data
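As a rough illustration of the LINQ to Hive approach from the Microsoft.Hadoop.Hive package, a strongly-typed query against our nstable might look like the sketch below. The type names (HiveConnection, HiveTable, HiveRow), the GetTable mapping call, and the WebHCat endpoint on port 50111 are assumptions based on that era’s SDK samples – the exact API surface changed across releases, so refer to the article linked above for working code.

using System;
using System.Linq;
using Microsoft.Hadoop.Hive;

//Hypothetical sketch – verify these types against your Microsoft.Hadoop.Hive version
public class NamespaceCount : HiveRow
{
    public string Namespace { get; set; }
    public int Count { get; set; }
}

public class NamespaceDatabase : HiveConnection
{
    public NamespaceDatabase(Uri webHCatUri, string user, string password)
        : base(webHCatUri, user, password) { }

    //Maps to the 'nstable' table we created above (assumed mapping API)
    public HiveTable<NamespaceCount> NsTable
    {
        get { return this.GetTable<NamespaceCount>("nstable"); }
    }
}

class HiveQuerySample
{
    static void Main()
    {
        //Port 50111 is the default WebHCat/Templeton endpoint
        var db = new NamespaceDatabase(new Uri("http://localhost:50111"), "hadoop", null);
        var top = db.NsTable.OrderByDescending(n => n.Count).Take(10);
        foreach (var n in top)
            Console.WriteLine(n.Namespace + " : " + n.Count);
    }
}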
And there you go. Now you know everything about writing your own Hadoop Map Reduce jobs in C#, loading the data into Hive, and querying it back in C# to visualize your data. Happy coding.
Published at DZone with permission of Anoop Madhusudanan, DZone MVB.