
Machine Learning Using Microsoft HDInsight on Azure

Istvan Szegedi · Jan. 02, 13

Introduction

Our last post was about the joint effort by Microsoft and Hortonworks to deliver Hadoop on Microsoft Windows Azure, dubbed HDInsight. One of the key Microsoft HDInsight components is Mahout, a scalable machine learning library that provides a number of algorithms built on the Hadoop platform. Machine learning supports a wide range of use cases, from email spam filtering to fraud detection to recommending books or movies, similar to the features on Amazon.com. These algorithms can be divided into three main categories: recommenders (collaborative filtering), categorization, and clustering. More details about these algorithms can be found on the Apache Mahout wiki.

Recommendation engine on Azure

A standard recommender example in machine learning is a movie recommender. Surprisingly, this example is not among the provided HDInsight examples, so we need to implement it on our own using Mahout 0.5 components.

The movie recommender is an item-based recommendation algorithm. The key concept is that, given a large dataset of users, movies, and values indicating how much each user liked a particular movie, the algorithm recommends movies to users. A commonly used dataset for movie recommendation is from GroupLens.

The downloadable file that is part of the 100k dataset (u.data) is not suitable for Mahout as is, because its format is:

user item value timestamp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
....

Mahout requires the data to be in the following format: userid,itemid,value
so the content has to be converted to:

196,242,3
186,302,3
22,377,1
....
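
The conversion described above can be sketched in a few lines of Python. This is only an illustration, not the method used in the original post: the function name convert_ratings is hypothetical, and the sketch assumes that u.data is whitespace (tab) separated with the timestamp as the fourth column, as in the sample shown earlier.

```python
import csv
import io


def convert_ratings(src, dst):
    """Copy the user, item and value columns of u.data-style lines to CSV."""
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for line in src:
        fields = line.split()            # split on tabs or spaces
        if len(fields) < 3:
            continue                     # skip blank or malformed lines
        user_id, item_id, value = fields[:3]   # drop the trailing timestamp
        writer.writerow([user_id, item_id, value])
        count += 1
    return count


# In-memory demo using the three sample rows from the article.
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"
out = io.StringIO()
written = convert_ratings(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

To convert the real file, the same function could be called with open("u.data") as the source and open("recommend.csv", "w", newline="") as the destination; the result then has to be copied into HDFS (for example with hadoop fs -put) before the job is run.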

There is no web-based console for running Mahout on Azure. Instead, we go to Remote Desktop to download the RDP configuration, log in to the Azure head node via RDP, and open the Hadoop command line to get a prompt.

c:>c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop jar c:\apps\dist\mahout-0.5\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver recommenditembased --input recommend.csv --output recommendout --tempDir recommendtmp --usersFile user-ids.txt --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --numRecommendations 5

The standard mahout.cmd seems to have a few bugs: if we run it as shipped, it throws an error complaining about Java usage. I had to modify the file so that it no longer sets the HADOOP_CLASSPATH environment variable. The change is the set line that is commented out with @rem in the listing below:

@rem run it
if not [%MAHOUT_LOCAL%] == [] (
    echo "MAHOUT_LOCAL is set, running locally"
    %JAVA% %JAVA_HEAP_MAX% %MAHOUT_OPTS% -classpath %MAHOUT_CLASSPATH% %CLASS% %*
) else (
    if [%MAHOUT_JOB%] == [] (
        echo "error: could not find mahout-examples-*.job in %MAHOUT_HOME% or %MAHOUT_HOME%\examples\target"
        goto :eof
    ) else (
        @rem  set HADOOP_CLASSPATH=%MAHOUT_CLASSPATH%
        if /i [%1] == [hadoop] (
            echo running: %HADOOP_HOME%\bin\%*
            call %HADOOP_HOME%\bin\%*
        ) else (
            echo running: %HADOOP_HOME%\bin\hadoop jar %MAHOUT_JOB% %CLASS% %*
            call %HADOOP_HOME%\bin\hadoop jar %MAHOUT_JOB% %CLASS% %*
        )
    )
)

After this change we can run Mahout as expected:

c:\apps\dist\mahout-0.5\bin>mahout.cmd recommenditembased --input recommend.csv --output recommendout --tempDir recommendtmp --usersFile user-ids.txt --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --numRecommendations 5

The input argument defines the path to the input directory, and the output argument defines the path to the output directory.

The numRecommendations argument is the number of recommendations per user.

The usersFile argument lists the users to recommend for (in our case it contained only three users: 112, 286, and 301):

c:>hadoop fs -cat user-ids.txt
112
286
301

The similarityClassname argument is the name of the distributed similarity class. It can be SIMILARITY_EUCLIDEAN_DISTANCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_PEARSON_CORRELATION, etc. This class determines the algorithm used to calculate similarities between items.
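
To give a feel for what an item-to-item similarity based on Euclidean distance measures, here is a minimal single-machine sketch. It is not Mahout's distributed implementation: the function names are hypothetical, the ratings are made up, and the 1 / (1 + distance) mapping is a common textbook choice that may differ from the exact formula Mahout 0.5 uses.

```python
from collections import defaultdict
from math import sqrt


def item_vectors(ratings):
    """Group (user, item, value) triples into {item: {user: value}}."""
    vectors = defaultdict(dict)
    for user, item, value in ratings:
        vectors[item][user] = float(value)
    return vectors


def euclidean_similarity(a, b):
    """Similarity in (0, 1], computed over the users who rated both items."""
    common = a.keys() & b.keys()
    if not common:
        return 0.0                       # no shared users, no evidence
    distance = sqrt(sum((a[u] - b[u]) ** 2 for u in common))
    return 1.0 / (1.0 + distance)        # assumed mapping, not Mahout's exact one


# Made-up ratings: items A and B are rated alike, item C is rated differently.
ratings = [
    (1, "A", 5), (1, "B", 5), (1, "C", 1),
    (2, "A", 4), (2, "B", 4), (2, "C", 2),
    (3, "A", 1), (3, "B", 2),
]
vectors = item_vectors(ratings)
sim_ab = euclidean_similarity(vectors["A"], vectors["B"])
sim_ac = euclidean_similarity(vectors["A"], vectors["C"])
print(round(sim_ab, 3), round(sim_ac, 3))
```

Items that users rate alike end up with a higher similarity, and an item-based recommender then suggests items that are similar to the ones a user already rated highly.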

The execution of the MapReduce tasks can be monitored via the Hadoop MapReduce admin console:

[Screenshots: Mahout recommender job in the Hadoop MapReduce admin console]

Once the job is finished, we need to use Hadoop filesystem commands to display the output file produced by the RecommenderJob:

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fs -ls .
Found 5 items
drwxr-xr-x   - istvan supergroup          0 2012-12-21 11:00 /user/istvan/.Trash
-rw-r--r--   3 istvan supergroup    1079173 2012-12-23 22:40 /user/istvan/recommend.csv
drwxr-xr-x   - istvan supergroup          0 2012-12-24 12:24 /user/istvan/recommendout
drwxr-xr-x   - istvan supergroup          0 2012-12-24 12:22 /user/istvan/recommendtmp
-rw-r--r--   3 istvan supergroup         15 2012-12-23 22:40 /user/istvan/user-ids.txt

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fs -ls recommendout
Found 3 items
-rw-r--r--   3 istvan supergroup          0 2012-12-24 12:24 /user/istvan/recommendout/_SUCCESS
drwxr-xr-x   - istvan supergroup          0 2012-12-24 12:23 /user/istvan/recommendout/_logs
-rw-r--r--   3 istvan supergroup        153 2012-12-24 12:24 /user/istvan/recommendout/part-r-00000

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>hadoop fs -cat recommendout/part-r*
112     [1228:5.0,1473:5.0,1612:5.0,1624:5.0,1602:5.0]
286     [1620:5.0,1617:5.0,1615:5.0,1612:5.0,1611:5.0]
301     [1620:5.0,1607:5.0,1534:5.0,1514:5.0,1503:5.0]

Thus the RecommenderJob recommends items 1228, 1473, 1612, 1624, and 1602 to user 112; items 1620, 1617, 1615, 1612, and 1611 to user 286; and items 1620, 1607, 1534, 1514, and 1503 to user 301.
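
If the recommendations are to be processed further, the output lines can be parsed with a few lines of Python. This is a sketch that assumes the format shown in the listing above (a user ID, whitespace, then a bracketed list of item:score pairs); parse_recommendations is a hypothetical helper, not part of Mahout.

```python
def parse_recommendations(line):
    """Parse 'user [item:score,...]' into (user_id, [(item_id, score), ...])."""
    user, items = line.split(None, 1)    # split on the first run of whitespace
    items = items.strip().strip("[]")
    pairs = []
    for entry in items.split(","):
        if entry:                        # tolerate an empty list "[]"
            item_id, score = entry.split(":")
            pairs.append((int(item_id), float(score)))
    return int(user), pairs


# The three output lines from the job above.
output = """112     [1228:5.0,1473:5.0,1612:5.0,1624:5.0,1602:5.0]
286     [1620:5.0,1617:5.0,1615:5.0,1612:5.0,1611:5.0]
301     [1620:5.0,1607:5.0,1534:5.0,1514:5.0,1503:5.0]"""

recommendations = dict(parse_recommendations(line) for line in output.splitlines())
for user, pairs in recommendations.items():
    print(user, [item for item, _ in pairs])
```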

For those inclined to theory and scientific papers, I suggest reading the paper by Sarwar, Karypis, Konstan, and Riedl, which provides the background for item-based recommendation algorithms.

Mahout examples on Azure

Hadoop on Azure comes with two predefined Mahout examples: one for classification and one for clustering. Both have to be run from the command line, in a similar way to the item-based recommendation engine described above.


The classification demo is based on the naive Bayes classifier: first you train the classifier on a set of known (labeled) data, and then you run it on the actual data set. This approach is called supervised learning.
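
The train-then-classify cycle can be illustrated with a tiny multinomial naive Bayes classifier. This is a textbook sketch with Laplace smoothing, not Mahout's Bayes implementation (which is distributed and weights terms differently); the function names and the four training sentences are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(documents):
    """documents: list of (label, text). Returns word counts and document counts per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, text in documents:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-probability, using Laplace smoothing."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)       # class prior
        for word in text.lower().split():
            # add-one smoothing so unseen words do not zero out the score
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data with two of the newsgroup names as labels.
training = [
    ("rec.motorcycles", "new helmet and bike for the weekend ride"),
    ("rec.motorcycles", "engine oil for my bike"),
    ("sci.space", "the shuttle launch reached orbit"),
    ("sci.space", "satellite orbit and rocket launch"),
]
model = train(training)
print(classify("bike ride", *model))
print(classify("rocket orbit", *model))
```

The first phase only counts words per known category; the second phase scores an unseen text against each category and picks the most likely one.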

To run the classification example, we need to download the 20news-bydate.tar.gz file from http://people.csail.mit.edu/jrennie/20newsgroups/20news-bydate.tar.gz and extract it under the mahout-0.5/examples/bin/work directory. The data set has two subsets: one for training the classifier and the other for testing it. Then we can run the command:

c:\apps\dist\mahout-0.5\examples\bin> build-20news-bayes.cmd

This kicks off the Hadoop MapReduce job, and after a while it prints the confusion matrix produced by the Bayes algorithm. The confusion matrix tells us, category by category, which items the classifier identified correctly and which it did not.

For instance, there is a category called rec.motorcycles (column a): the classifier correctly identified 381 of the 398 items belonging to this category, while it incorrectly classified 9 items as rec.autos (column f), 2 items as sci.electronics (column n), and so on.

work_path=c:\apps\dist\mahout-0.5\examples\bin\work\20news-bydate\
running: c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop jar c:\apps\dist\mahout-0.5\bin\..\\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver testclassifier   -m examples/bin/work/20news-bydate/bayes-model   -d examples/bin/work/20news-bydate/bayes-test-input   -type bayes   -ng 1   -source hdfs   -method "mapreduce"

12/12/24 17:55:58 INFO mapred.JobClient:     Map output records=7532
12/12/24 17:55:59 INFO bayes.BayesClassifierDriver: =======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    u    <--Classified as
381  0    0    0    0    9    1    0    0    0    1    0    0    2    0    1    0    0    3    0    0    |  398  a = rec.motorcycles
1    284  0    0    0    0    1    0    6    3    11   0    66   3    0    1    6    0    4    9    0    |  395  b = comp.windows.x
2    0    339  2    0    3    5    1    0    0    0    0    1    1    12   1    7    0    2    0    0    |  376  c = talk.politics.mideast
4    0    1    327  0    2    2    0    0    2    1    1    0    5    1    4    12   0    2    0    0    |  364  d = talk.politics.guns
7    0    4    32   27   7    7    2    0    12   0    0    6    0    100  9    7    31   0    0    0    |  251  e = talk.religion.misc
10   0    0    0    0    359  2    2    0    1    3    0    1    6    0    1    0    0    11   0    0    |  396  f = rec.autos
0    0    0    0    0    1    383  9    1    0    0    0    0    0    0    0    0    0    3    0    0    |  397  g = rec.sport.baseball
1    0    0    0    0    0    9    382  0    0    0    0    1    1    1    0    2    0    2    0    0    |  399  h = rec.sport.hockey
2    0    0    0    0    4    3    0    330  4    4    0    5    12   0    0    2    0    12   7    0    |  385  i = comp.sys.mac.hardware
0    3    0    0    0    0    1    0    0    368  0    0    10   4    1    3    2    0    2    0    0    |  394  j = sci.space
0    0    0    0    0    3    1    0    27   2    291  0    11   25   0    0    1    0    13   18   0    |  392  k = comp.sys.ibm.pc.hardware
8    0    1    109  0    6    11   4    1    18   0    98   1    3    11   10   27   1    1    0    0    |  310  l = talk.politics.misc
0    11   0    0    0    3    6    0    10   6    11   0    299  13   0    2    13   0    7    8    0    |  389  m = comp.graphics
6    0    1    0    0    4    2    0    5    2    12   0    8    321  0    4    14   0    8    6    0    |  393  n = sci.electronics
2    0    0    0    0    0    4    1    0    3    1    0    3    1    372  6    0    2    1    2    0    |  398  o = soc.religion.christian
4    0    0    1    0    2    3    3    0    4    2    0    7    12   6    342  1    0    9    0    0    |  396  p = sci.med
0    1    0    1    0    1    4    0    3    0    1    0    8    4    0    2    369  0    1    1    0    |  396  q = sci.crypt
10   0    4    10   1    5    6    2    2    6    2    0    2    1    86   15   14   152  0    1    0    |  319  r = alt.atheism
4    0    0    0    0    9    1    1    8    1    12   0    3    6    0    2    0    0    341  2    0    |  390  s = misc.forsale
8    5    0    0    0    1    6    0    8    5    50   0    40   2    1    0    9    0    3    256  0    |  394  t = comp.os.ms-windows.misc
0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    |  0    u = unknown
Default Category: unknown: 20

12/12/24 17:55:59 INFO driver.MahoutDriver: Program took 129826 ms

c:\apps\dist\mahout-0.5\examples\bin
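
The figures quoted earlier for rec.motorcycles can be derived from a single row of the matrix. The sketch below copies row a from the output above and computes the share of correctly classified items together with the most frequent misclassifications; the variable names are illustrative only.

```python
import string

# Row "a" (rec.motorcycles) copied from the confusion matrix, columns a..u.
row = [381, 0, 0, 0, 0, 9, 1, 0, 0, 0, 1, 0, 0, 2, 0, 1, 0, 0, 3, 0, 0]
columns = string.ascii_lowercase[:len(row)]
actual = "a"                              # the diagonal entry for this row

total = sum(row)                          # all items that truly belong to the category
correct = row[columns.index(actual)]      # items classified into the right column
recall = correct / total

# Non-zero off-diagonal entries, largest first.
mistakes = sorted(
    ((col, n) for col, n in zip(columns, row) if col != actual and n > 0),
    key=lambda pair: pair[1],
    reverse=True,
)
print(f"{correct} of {total} correct ({recall:.1%})")
print("misclassified as:", mistakes)
```

Repeating the same calculation for every row gives a per-category picture of where the classifier is strong and which categories it tends to confuse.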

Again, for those interested in theory and scientific papers, I suggest reading the following webpage.


Published at DZone with permission of Istvan Szegedi, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.