
Microsoft and Hadoop – Windows Azure HDInsight

By Istvan Szegedi · Dec. 09, 12

Traditionally, Microsoft Windows has been something of a stepchild in the Hadoop world – the ‘hadoop’ command used to manage actions from the command line and the startup/shutdown scripts were written with Linux/*nix in mind, assuming bash. Thus, if you wanted to run Hadoop on Windows, you had to install Cygwin. The Apache Hadoop documentation also states the following (quoted from the Hadoop r1.1.0 documentation):

“• GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
• Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.”

Microsoft and Hortonworks joined forces to make Hadoop available on Windows Server for on-premise deployments, as well as on Windows Azure to support big data in the cloud.

This post covers Windows Azure HDInsight (Hadoop on Azure, see https://www.hadooponazure.com ). As of this writing, the service requires an invitation to participate in the CTP (Community Technology Preview), but the invitation process is managed very efficiently – after filling in the survey, I received the service access code within a couple of days.

New Cluster Request

The first step is to request a new cluster. You need to define the cluster name and the credentials used to log in to the headnode. By default, the cluster consists of 3 nodes.

After a few minutes you will have a running cluster; click on the “Go to Cluster” link to navigate to the main page.

WordCount with HDInsight on Azure

No Hadoop test is complete without the standard WordCount application – the “Hello, World” of Hadoop. Microsoft Azure HDInsight provides an example file (davinci.txt) and the Java jar file to run WordCount.

First you need to go to the JavaScript console to upload the text file using fs.put():

js> fs.put()

Choose File -> Browse
Destination: /user/istvan/example/data/davinci

Create a job:

[Screenshot: Create Job dialog]

The actual command that Microsoft Azure HDInsight executes is as follows:

c:\apps\dist\hadoop-1.1.0-snapshot\bin\hadoop.cmd jar c:\apps\jobs\templates\634898986181212311.hadoop-examples-1.1.0-snapshot.jar wordcount /user/istvan/example/data/davinci davinci-output

You can validate the output from the JavaScript console:

js> result = fs.read("davinci-output")
"(lo)cra"	1
"1490	1
"1498,"	1
"35"	1
"40,"	1
"as-is".	1
"a_	1
"absoluti	1
"alack!	1

Microsoft HDInsight Streaming – Hadoop Job in C#

Hadoop streaming is a utility that supports running external map and reduce jobs. These external jobs can be written in various programming languages, such as Python or Ruby – but since we are talking about Microsoft HDInsight, the example had better be based on .NET C#…

The demo application for C# streaming is again a WordCount example, using imitations of the Unix cat and wc commands. You could run the demo from the “Samples” tile, but I prefer to demonstrate Hadoop streaming from the command line to have a closer look at what is going on under the hood.

In order to run the Hadoop command line from a Windows cmd prompt, you need to log in to the HDInsight headnode using Remote Desktop. First click on the “Remote Desktop” tile, then log in to the remote node using the credentials you defined at cluster creation time. Once you are logged in, click on the Hadoop Command Line shortcut.

In the Hadoop command line, go to the Hadoop distribution directory (as of writing this post, Microsoft Azure HDInsight is based on Hadoop 1.1.0):

c:> cd \apps\dist
c:> hadoop fs -get /example/apps/wc.exe .
c:> hadoop fs -get /example/apps/cat.exe .
c:> cd \apps\dist\hadoop-1.1.0-snapshot
c:\apps\dist\hadoop-1.1.0-snapshot> hadoop jar lib\hadoop-streaming.jar -input "/user/istvan/example/data/davinci" -output "/user/istvan/example/dataoutput" -mapper "..\..\jars\cat.exe" -reducer "..\..\jars\wc.exe" -file "c:\apps\dist\wc.exe" -file "c:\apps\dist\cat.exe"

The C# code for wc.exe is as follows:

using System;
using System.IO;
using System.Linq;

namespace wc
{
    class wc
    {
        static void Main(string[] args)
        {
            string line;
            var count = 0;

            // Read from a file argument if given, otherwise from stdin.
            if (args.Length > 0)
            {
                Console.SetIn(new StreamReader(args[0]));
            }

            // Count separator characters per line as a word-count approximation.
            while ((line = Console.ReadLine()) != null)
            {
                count += line.Count(cr => (cr == ' ' || cr == '\n'));
            }
            Console.WriteLine(count);
        }
    }
}

And the code for cat.exe is:

using System;
using System.IO;

namespace cat
{
    class cat
    {
        static void Main(string[] args)
        {
            // Read from a file argument if given, otherwise from stdin.
            if (args.Length > 0)
            {
                Console.SetIn(new StreamReader(args[0]));
            }

            // Echo every input line to stdout unchanged.
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                Console.WriteLine(line);
            }
        }
    }
}

Interactive Console

Microsoft Azure HDInsight comes with two types of interactive console: one is the standard Hadoop Hive console; the other is unique in the Hadoop world and is based on JavaScript.

Let us start with Hive. You need to upload your data using the JavaScript fs.put() method, as described above. Then you can create your Hive table and run a SELECT query as follows:

CREATE TABLE stockprice (yyyymmdd STRING, open_price FLOAT, high_price FLOAT, low_price FLOAT, close_price FLOAT, stock_volume INT, adjclose_price FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' LOCATION '/user/istvan/input/';

SELECT yyyymmdd, high_price, stock_volume FROM stockprice ORDER BY high_price DESC;
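To make it concrete what that ORDER BY returns, here is a small Python sketch that applies the same sort to a few rows in the stockprice column layout (the rows are illustrative – the open/close/adjclose values are made up, only the high and volume figures match the query output shown later):

```python
import csv
import io

# A few rows in the stockprice layout:
# yyyymmdd, open, high, low, close, volume, adjclose (open/close/adjclose are illustrative).
sample = """05/10/2012,770.00,774.38,765.00,767.00,2735900,767.00
01/10/2012,760.00,765,756.00,761.00,3168000,761.00
04/10/2012,762.00,769.89,759.00,768.00,2454200,768.00
"""

rows = list(csv.reader(io.StringIO(sample)))
# Equivalent of: SELECT yyyymmdd, high_price, stock_volume FROM stockprice ORDER BY high_price DESC
result = sorted(rows, key=lambda r: float(r[2]), reverse=True)
for r in result:
    print(r[0], r[2], r[5])
```

The rows come out sorted by the high price in descending order, with the date, high and volume columns projected – the same shape as the Hive result.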

[Screenshot: Interactive Hive console]

[Screenshot: Interactive Hive SELECT results]

The other flavor of the HDInsight interactive console is based on JavaScript – as mentioned before, this is a unique offering from Microsoft. Under the covers, the JavaScript commands are converted into Pig statements.

[Screenshot: JavaScript console]

The syntax resembles a LINQ-style query, though it is not the same:

js> pig.from("/user/istvan/input/goog_stock.csv", "date,open,high,low,close,volume,adjclose", ",").select("date, high, volume").orderby("high desc").to("result")

js> result = fs.read("result")
05/10/2012	774.38	2735900
04/10/2012	769.89	2454200
02/10/2012	765.99	2790200
01/10/2012	765	3168000
25/09/2012	764.89	6058500

Under the Hood

Microsoft and Hortonworks have re-implemented the key binaries (NameNode, JobTracker, SecondaryNameNode, DataNode, TaskTracker) as executables (exe files), and they run as services in the background. The key ‘hadoop’ command – traditionally a bash script – has also been re-implemented, as hadoop.cmd.

The distribution consists of Hadoop 1.1.0, Pig 0.9.3, Hive 0.9.0, Mahout 0.5 and Sqoop 1.4.2.


Published at DZone with permission of Istvan Szegedi, DZone MVB. See the original article here.
