Fundamentals of Machine Learning

Let's face it: computing was created to analyze data, and machine learning represents the state of the art in making sense of data. For many years it has been out of reach for the common developer.

This is perhaps one of the highest-paid and most sought-after skills today. No question about it: this is the place to really make it big as a developer.

Figure 1: The world of machine learning

Machine learning represents the logical extension of simple data retrieval and storage. It is about developing building blocks that make computers learn and behave more intelligently.

Machine learning makes it possible to mine historical data and make predictions about future trends. Without realizing it, you are probably already using the benefits of machine learning. Search engine results, online recommendations, ad targeting, fraud detection, and spam filtering are all examples of what is possible with machine learning.

Machine learning is about making data-driven decisions. While instinct might be important, it is difficult to beat empirical data.

The many facets of machine learning

Once you dive deep into the subject, you start addressing topics such as:

  1. Supervised and unsupervised learning

  2. Classification

  3. Markov models, Bayesian networks, and much more

Mahout and Hadoop

The Apache Mahout project's goal is to build a scalable machine learning library.

There is some degree of overlap with big data analytics within Hadoop.

Mahout is an entire open-source machine learning project that you can use for free with Hadoop. You can learn more here:

  1. http://mahout.apache.org/

Mahout includes algorithms for clustering, classification, and collaborative filtering. You can also find:

  1. Matrix factorization based recommenders

  2. K-Means, Fuzzy K-Means clustering

  3. Latent Dirichlet Allocation

  4. Singular Value Decomposition

  5. Logistic regression classifier

  6. (Complementary) Naive Bayes classifier

  7. Random forest classifier
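
To make one of these algorithms concrete, here is a minimal sketch of K-Means in plain Python. It is not Mahout's API or implementation. The function names, the toy points, and the naive choice of the first k points as starting centroids are all assumptions made purely for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, iterations=10):
    """Cluster 2-D points into k groups using the basic K-Means loop."""
    # Naive initialization (an assumption): the first k points are the centroids.
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignments


# Hypothetical toy data: two visibly separate groups, interleaved.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8),
          (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

Each pass does the same two things: assign every point to its closest centroid, then recompute each centroid as the average of the points assigned to it. A library like Mahout applies the same idea at a scale where the data no longer fits on one machine.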

I went to UC Berkeley, and they offer many awesome courses there.

I wish I had more time. I would seriously consider taking this free MIT online class, which you can find here:

  1. http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/index.htm

Azure is democratizing machine learning

Historically, machine learning has required complex software and high-end computers. This field of computing also required seasoned data scientists. What's been needed is a fully managed cloud service for this form of machine learning, also known as predictive analytics.

Welcome To ML Studio

Using simple drag-and-drop gestures and data flow graphs, you can set up experiments and take advantage of sophisticated algorithms without writing code.

Data Scientists Code in R

R is a popular open source programming environment for statistics and data mining. The good news is that it is easily integrated into ML Studio. I have a lot of friends using functional languages for machine learning, such as F#. It's pretty clear, however, that R is dominant in this space.

Polls and surveys of data miners show that R's popularity has increased substantially in recent years. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which John Chambers, the creator of the S language, is a member. R is named partly after the first names of its first two authors. R is a GNU project and is written primarily in C, Fortran, and R.

DATA ANALYTICS

Below is a framework that provides a way for you to think about the predictive nature of machine learning. It's all about providing insight for business decisions where limited resources are applied to grow revenue or limit expenses. This might include insights into consumer spending patterns or into optimizing the supply chain.

How to think about the analytics spectrum

One great way to think about machine learning is to break down analytics into 3 questions:

  1. What happened?

    • Historical
  2. What will happen?

    • Predictive
  3. What should I do next?

    • Prescriptive
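
As a rough illustration of the three questions, here is a small Python sketch. The sales figures, the stock level, and the reorder rule are all made up for the example, and a straight-line trend is only one of many ways to forecast.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept).
    """
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


monthly_sales = [100, 110, 120, 130]  # hypothetical units sold per month

# 1. What happened? (historical): summarize the past.
average_sales = sum(monthly_sales) / len(monthly_sales)

# 2. What will happen? (predictive): extend the trend one month ahead.
slope, intercept = fit_line(monthly_sales)
forecast = slope * len(monthly_sales) + intercept

# 3. What should I do next? (prescriptive): turn the forecast into an action.
stock_on_hand = 125  # hypothetical inventory level
action = "reorder" if forecast > stock_on_hand else "hold"

print(average_sales, forecast, action)
```

The historical step only describes data you already have, the predictive step extrapolates from it, and the prescriptive step combines that prediction with a business rule to recommend a decision.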

How to think of the personas doing analytics

  1. The information worker

    • Typically takes a self-service approach using Power BI.

      • Power BI for Office 365 is a self-service business intelligence (BI) solution delivered through Excel and Office 365 that provides information workers with data analysis and visualization capabilities to identify deeper business insights about their data.
  2. IT professionals

    • Involved in data transformation, data warehousing, creating data marts and cubes for analytics, and data modeling
    • Work for GMs or directors
  3. Data scientists

    • Deeply technical and skilled not just with code, but with mathematics, statistics, and probability
    • Can use a variety of techniques to attach probabilities to predictions (e.g., there is a 42% chance that prices will go up in the next 18 hours)
    • Favor techniques such as Monte Carlo simulation and parameterizing the model
    • What to look for in a data scientist

      • Domain Knowledge
      • Clear Understanding Of The Scientific Method

        • Objectivity, Hypothesis, Validation, Transparency
      • Strong in Math and Statistics

      • Intellectual Curiosity and Critical Thinking

      • Visualization and Communication

      • Advanced Computing And Data Management
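
To give a feel for the kind of probability statement a data scientist might make, here is a minimal Monte Carlo sketch in Python. It assumes a simple random-walk price model with Gaussian hourly steps. The function name and the drift and volatility parameters are hypothetical, and no real price is claimed to behave this way.

```python
import random


def probability_of_rise(start_price, hours, drift, volatility,
                        trials=10_000, seed=42):
    """Estimate the chance that a price ends above where it started.

    Each trial simulates one possible future: the price takes one random
    Gaussian step per hour (mean `drift`, standard deviation `volatility`).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rises = 0
    for _ in range(trials):
        price = start_price
        for _ in range(hours):
            price += rng.gauss(drift, volatility)
        if price > start_price:
            rises += 1
    return rises / trials


# Parameterizing the model: the same simulation under two assumptions.
no_trend = probability_of_rise(100.0, hours=18, drift=0.0, volatility=1.0)
upward_trend = probability_of_rise(100.0, hours=18, drift=0.5, volatility=1.0)
print(no_trend, upward_trend)
```

The idea is to simulate many possible futures and count how often the event of interest occurs. Changing the parameters changes the answer, which is why choosing and validating them is such a large part of the job.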

Academic backgrounds

If you went to school to become a data scientist, what would you study?

  1. Applied Mathematics

  2. Computer Science

  3. Econometrics

  4. Statistics

  5. Engineering

Industries that really benefit from data science

  1. Financial Services

  2. Telecommunications

  3. Information Technology

  4. Manufacturing

  5. Utilities

  6. Healthcare

  7. Marketing

Wrapping up

This post provided a high-level view of some of the characteristics and concepts of machine learning. In the next post, we will start playing around with the Azure portal.

Published at DZone with permission of Bruno Terkaly, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
