
Introducing Apache Hadoop Services for Windows Azure

By Roger Jennings · Sep. 01, 2012


The SQL Server team (@SQLServer) announced Apache Hadoop Services for Windows Azure, a.k.a. Apache Hadoop on Windows Azure or Hadoop on Azure, at the Professional Association for SQL Server (PASS) Summit in October 2011.

• Update 8/23/2012 for Service Update 3 (SU3), released to the web on 8/21/2012: SU3 updated Hadoop Common, MapReduce, and HDFS to v1.0.1, Hive to 0.8.1, Pig to 0.9.3, and Sqoop to 1.4.2. Versions of Mahout and CMU Pegasus are unchanged from Service Update 2 (SU2). SU3 delivers 3-node clusters, REST management APIs for Hadoop job submission, progress inquiry, and killing jobs, as well as a C# SDK v1.0, PowerShell cmdlets, and direct browser access to a cluster. Stay tuned for a post to Red Gate Software's ACloudyPlace blog about SU3.

Update 4/2/2012: Updated the samples gallery screen capture in step 3 with five samples added in March 2012, noted Apache Sqoop's promotion to a top-level project, and added links to sections from the table of contents.

Table of Contents

  • Introduction
  • Tutorial: Running the 10GB GraySort Sample's TeraGen Job
  • Tutorial: Running the 10GB GraySort Sample's TeraSort Job
  • Tutorial: Running the 10GB GraySort Sample's TeraValidate Job
  • Apache Hadoop on Windows Azure Resources

Introduction

Val Fontama's Availability of Community Technology Preview (CTP) of Hadoop Based Service on Windows Azure post of 12/14/2011 described Apache Hadoop on Windows Azure and how to obtain an invitation to the CTP:

In October at the PASS Summit 2011, Microsoft announced expanded investments in "Big Data," including a new Apache Hadoop™-based distribution for Windows Server and service for Windows Azure. In doing so, we extended Microsoft's leadership in BI and data warehousing, enabling our customers to glean and manage insights for any data, any size, anywhere. We delivered on our promise this past Monday, when we announced the release of the Community Technology Preview (CTP) of our Hadoop based service for Windows Azure.

Today this preview is available to an initial set of customers. Those interested in joining the preview may request to do so by filling out this survey. Microsoft will issue a code that will be used by the selected customers to access the Hadoop based service. We look forward to making it available to the general public in early 2012. Customers will gain the following benefits from this preview:

  • Broader access to Hadoop through simplified deployment and programmability. Microsoft has simplified setup and deployment of Hadoop, making it possible to set up and configure Hadoop on Windows Azure in a few hours instead of days. Since the service is hosted on Windows Azure, customers only download a package that includes the Hive add-in and Hive ODBC Driver. In addition, Microsoft has introduced new JavaScript libraries to make JavaScript a first-class programming language in Hadoop. Through this library JavaScript programmers can easily write MapReduce programs in JavaScript and run these jobs from simple web browsers. These improvements reduce the barrier to entry by enabling customers to easily deploy and explore Hadoop on Windows.
  • Breakthrough insights through integration with Microsoft Excel and BI tools. This preview ships with a new Hive add-in for Excel that enables users to interact with data in Hadoop from Excel. With the Hive add-in customers can issue Hive queries to pull and analyze unstructured data from Hadoop in the familiar Excel. Second, the preview includes a Hive ODBC Driver that integrates Hadoop with Microsoft BI tools. This driver enables customers to integrate and analyze unstructured data from Hadoop using award-winning Microsoft BI tools such as PowerPivot and Power View. As a result customers can gain insight on all their data, including unstructured data stored in Hadoop.
  • Elasticity, thanks to Windows Azure. This preview of the Hadoop based service runs on Windows Azure, offering an elastic and scalable platform for distributed storage and compute.
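The JavaScript MapReduce support described above exposes the same map/shuffle/reduce model as any Hadoop job. As a rough, language-neutral sketch of that model (in Python rather than the preview's JavaScript API; the function names are illustrative, not part of the service):

```python
from collections import defaultdict

def map_phase(line):
    # Like a Hadoop mapper: emit a (word, 1) pair for every word.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    # Like a Hadoop reducer: sum every count emitted for one key.
    return word, sum(counts)

def word_count(lines):
    # Shuffle: group intermediate (key, value) pairs by key, then reduce.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

counts = word_count(["hadoop on azure", "hadoop on windows"])
```

In Hadoop proper the shuffle step is performed by the framework between the map and reduce tasks; the programmer supplies only the two functions.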

We look forward to your feedback! Learn more at www.microsoft.com/bigdata.

Val Fontama
Senior Product Manager
SQL Server Product Management


Mary Jo Foley (@maryjofoley) ranked Apache Hadoop on Windows Azure #9 of her 10 Sexiest Microsoft Business Teases for 2012 on 12/29/2011:

…

9. AzureHadoop (or is it HadoopAzure?): Microsoft made available the preview bits for the Hadoop distribution for Windows Azure in December 2011. The final release is slated for March 2012. (Microsoft and partner Hortonworks are also working on an on-premises Hadoop on Windows Server distribution.) Hadoop on Windows Azure is interesting because it combines Microsoft's big-data plans and products with its cloud platform. The idea Microsoft will be pushing in 2012 is that Hadoop on Azure will give users of Microsoft's analytics tools, including plain-old Excel, a way to make use of the growing number of data sets stored on Windows Azure.

…


Tutorial: Running the 10GB GraySort Sample's TeraGen Job

Following is a step-by-step tutorial for running the first process of the 10GB GraySort sample project:

1. After you receive your invitation code, navigate to https://www.hadooponazure.com/ and log in with your Windows Live ID and invitation code to open the Account page with the Request a new cluster content active. Type a globally unique DNS name for your cluster (hadoop1 for this example), select a cluster size (Large for this example), and type an administrative username, password, and password confirmation:

Click screen captures for a full-size (1024x768-px) image.

• Update 8/23/2012 for SU3: The no-charge cluster size is now fixed at 3 nodes with 1.5 TB disk space and a five-day, non-renewable lifetime. Users who want m

Note: There is no charge for Windows Azure resources used during the CTP, so you don't need to provide a credit card to create your cluster.

2. When your cluster is provisioned, the Account page's content changes to include tiles to create a new job as well as access your cluster by different methods:


Note: You must renew your cluster every two days. It's clear that faux-Metro UIs are de rigueur at Microsoft these days.

3. Click the Samples tile to open the Account/Sample Gallery page, which describes the currently available samples as of 4/2/2012.

Figure 4 - Nine samples

Note: The above screen capture includes five additional sample projects added in March 2012. The Apache Software Foundation described Sqoop as an open-source big data tool used for efficient bulk transfer between Apache Hadoop and structured datastores in its The Apache Software Foundation Announces Apache Sqoop as a Top-Level Project post of 4/2/2012.

4. The GraySort MapReduce sample is a useful starting point because it runs in a reasonably short time (about 4 minutes with a Large cluster), so click the 10GB GraySort tile to open its Account/SampleDetails page, which describes the sample:


5. Click the Deploy to your cluster button to automatically populate text boxes with values for the TeraGen program, which generates 10 GB of data:


Note: If you have tried SQL Azure Labs' Microsoft Codename "Cloud Numerics" CTP, you'll notice that the process for creating the Hadoop cluster and executing the first MapReduce job is much more automated than that described in my Deploying "Cloud Numerics" Sample Applications to ... post of 1/28/2012 (updated 1/30/2012).
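The record layout TeraGen produces can be approximated in a few lines of Python. This is an illustrative sketch only: the fixed 100-byte row with a leading 10-byte key follows the GraySort benchmark's conventions, and the key alphabet and filler bytes here are assumptions, not the sample's actual output.

```python
import random
import string

def teragen_rows(n, seed=0):
    # Emit n fixed-width records: a 10-byte key (random uppercase letters
    # here; the real generator uses arbitrary bytes), a 10-digit row id,
    # and filler padding each row to exactly 100 bytes.
    rng = random.Random(seed)
    for rowid in range(n):
        key = "".join(rng.choice(string.ascii_uppercase) for _ in range(10))
        yield key + f"{rowid:010d}" + "X" * 80

rows = list(teragen_rows(5))
```

At 100 bytes per row, the sample's 10 GB of output corresponds to roughly 100 million such records, which is what makes it a useful smoke test for a new cluster.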

5. Click the Execute job button to run the TeraGen program, which initially displays this Job Info page:


6. After a few seconds, the program begins adding lines of debug output for the 50 maps in increments close to 1 percent:

Note: Hadoop automatically repairs the failures reported above, but it's surprising that lines for 78 and 79 percent are missing.

7. When processing completes, click the left-arrow button to return to the Account page with a tile for the TeraGen process added:


8. Click the Job History tile to display a summary of the preceding operation, which confirms successful completion with an Exit Code = OK cell:


9. Click the left-arrow button to return to the main Account page and click the Manage Cluster tile to display total storage used (30 GB) and other data source options (Data Market, Windows Azure blob storage, and Amazon S3):



Tutorial: Running the 10GB GraySort Sample's TeraSort Job

10. Return to the Account page, click the Samples tile to open the Account/Samples page (see step 3), click the 10GB GraySort tile to open its Account/SampleDetails page (see step 4), and click the Deploy to your cluster button to open the Create Job page.

11. Replace teragen with terasort in the Parameter 1 text box, add a space and -Dmapred.reduce.tasks=25 as a suffix to the existing Parameter 2 text box value, and change the Parameter 3 text box's value to /example/data/10gb-sort-input /example/data/10gb-sort-out:


12. Click the top, middle, and bottom arrow symbols adjacent to the three text boxes to validate the three parameter values:


Note: If you receive Invalid instead of OK for any of the parameter values and you've verified the content is identical to the above, click the garbage-can symbols next to the offending text boxes to remove them, click the Add Parameter button three times to recreate them, retype the parameter values shown above, and click the validation arrows again.

The final command should read:

hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar terasort "-Dmapred.map.tasks=50 -Dmapred.reduce.tasks=25" /example/data/10gb-sort-input /example/data/10gb-sort-out
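The quoted second parameter passes standard Hadoop -D property overrides (50 map tasks, 25 reduce tasks); the last two arguments are the HDFS input and output paths. TeraSort's core idea, a total-order sort in which each reducer receives a contiguous range of keys, can be sketched in Python. This is a simplification: the real job samples the input keys to build a balanced partition list, while this toy partitioner just buckets on the key's first byte.

```python
def terasort(records, num_reducers=25):
    # Partition records by key range so each "reducer" owns a contiguous
    # range of keys, then sort within each partition; concatenating the
    # partitions in order yields a globally sorted data set.
    partitions = [[] for _ in range(num_reducers)]
    for rec in records:
        # Toy range partitioner on the first key byte. The real TeraSort
        # samples keys to choose balanced split points instead.
        bucket = min(ord(rec[0]) * num_reducers // 256, num_reducers - 1)
        partitions[bucket].append(rec)
    return [sorted(part) for part in partitions]

sample = ["ZEBRA00001", "APPLE00002", "MANGO00003", "APRIL00004"]
output = [rec for part in terasort(sample) for rec in part]
```

Because the buckets cover disjoint, ordered key ranges, no merge step is needed after the per-partition sorts.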

13. Click the Execute job button to start the sorting process:


14. About 45 minutes after you start the job, the StdOutput and StdError results appear:

Notice that reduce operations don't occur until mapping is ~80% complete.
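This behavior is governed by Hadoop's reduce "slowstart" setting (mapred.reduce.slowstart.completed.maps), which delays launching reduce tasks until a given fraction of map tasks has finished. A minimal sketch of the rule, assuming the 0.8 threshold implied by the observation above (the stock Hadoop default is lower):

```python
def reducers_can_start(maps_completed, total_maps, slowstart=0.8):
    # Hadoop launches reduce tasks only once the fraction of completed
    # map tasks reaches mapred.reduce.slowstart.completed.maps.
    # slowstart=0.8 is assumed here to match the ~80% observed above.
    return maps_completed / total_maps >= slowstart

reducers_can_start(39, 50)  # 78% of maps done: reducers still waiting
reducers_can_start(40, 50)  # 80% of maps done: reducers may launch
```

Starting reducers early lets the shuffle overlap with the tail of the map phase at the cost of holding reduce slots idle.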

15. Click the Job History tile to display the summary for the terasort option:


16. Return to the Account page and click the Manage Cluster tile to determine the additional storage space used (10 GB) by the TeraSort operation:


Tutorial: Running the 10GB GraySort Sample's TeraValidate Job

17. Return to the Account page, click the Samples tile to open the Account/Samples page (see step 3), click the 10GB GraySort tile to open its Account/SampleDetails page (see step 4), and click the Deploy to your cluster button to open the Create Job page.

18. Delete the parameters and add three new empty parameters. Type teravalidate in the Parameter 1 text box and click the arrow to validate the parameter, type "-Dmapred.map.tasks=50 -Dmapred.reduce.tasks=25" in the Parameter 2 text box and validate it, and type /example/data/10gb-sort-out /example/data/10gb-sort-validate in the Parameter 3 text box and validate it:


The final command should read:

hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar teravalidate "-Dmapred.map.tasks=50 -Dmapred.reduce.tasks=25" /example/data/10gb-sort-out /example/data/10gb-sort-validate
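TeraValidate's job is to confirm that the TeraSort output is globally sorted, including across partition boundaries. A minimal Python sketch of the check, under the assumed 10-byte-key record layout used in the sketches above:

```python
def teravalidate(records):
    # Scan the output once, verifying every record's 10-byte key is >=
    # the previous key; any inversion means the sort output is invalid.
    prev_key = None
    for rec in records:
        key = rec[:10]
        if prev_key is not None and key < prev_key:
            return False
        prev_key = key
    return True
```

A single linear pass suffices, which is why this third job completes in about five minutes versus TeraSort's 45.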

19. Click the Execute job button to start the task:


20. After about five minutes, the task completes with the following (partial) debug output:

Note: It's not clear from the debug output above how to determine the result of the validation task.

21. Return to the Account page and click the Job History tile to verify the outcome:


Note: Job #3's failure was due to a mismatch in the input file name, as emphasized above.


Apache Hadoop on Windows Azure Resources

Download the Apache Hadoop-based Services for Windows Azure How-To and FAQ whitepaper in PDF or *.docx format.

Wesley McSwain posted an Apache Hadoop Based Services for Windows Azure How To Guide, which is similar (but not identical) to the above document, to the TechNet wiki on 12/13/2011. The latest update when this post was written was 1/18/2012. Here's its content:

This content is a work in progress for the benefit of the Hadoop community. Please feel free to contribute to this wiki page based on your expertise and experience with Hadoop.

If you have any questions, please use the group's DL: http://tech.groups.yahoo.com/group/hadooponazurectp/

How-Tos
  1. Setup your Hadoop cluster
    • Provision a temporary Hadoop cluster on Microsoft's Elastic Map Reduce portal
    • Provision a new Hadoop cluster on your Windows Azure subscription.
    • Provision a new Hadoop cluster on your on-premises hardware.
  2. Running sample jobs
    • How to run the sample Pi Estimator job
    • Running the sample WordCount Hadoop job with a few twists
    • Running the 10GB sort Hadoop job with TeraGen, TeraSort, and TeraValidate options *
  3. Writing your own job and running it on the cluster
    • Writing your very own WordCount Hadoop job in Java and deploying it to a Windows Azure cluster
    • Running a JavaScript map/reduce job
  4. Job administration
    • Understanding MapReduce job administration by running the 10GB sort Hadoop job with the TeraSort option
  5. Interactive console:
    • Tasks with the interactive JavaScript console
      • How to create and run a JavaScript map reduce job
    • Tasks with Hive on the interactive console
      • How to run Hive queries from the interactive console
  6. Remote desktop
    • How to remote login to the Hadoop cluster
    • Using the Hadoop command shell
    • View the Job Tracker
    • View HDFS
  7. Connecting Windows Azure blob storage from the Hadoop cluster
    • Configuring the Hadoop cluster to connect with Azure storage
    • Running a Hadoop job using Azure storage as input and output parameters
  8. Open ports
    • How to connect Excel to Hadoop on Azure via HiveODBC
    • How to FTP data to Hadoop on Azure
    • How to SFTP data to Hadoop on Azure
  9. Manage data
    • Import data from Data Market
    • Setup ASV - use your Windows Azure blob storage account
    • Setup S3 - use your Amazon S3 account
  10. Apache Hadoop on Windows Azure:
    • Tips and tricks to manage your Hadoop cluster
    • Running Apache Pig (Pig Latin) at Apache Hadoop on Windows Azure
  11. Scenarios
    • Querying a web log file via HiveQL
* Does not include details for the TeraSort or TeraValidate options
FAQs
  • Frequently asked questions about Hadoop on Windows Azure
More information
  • See Apache Hadoop on Windows.

Published at DZone with permission of Roger Jennings. See the original article here.
