Classification From Scratch, Overview

The goal behind this series of posts is to recode (more or less) most of the standard algorithms used in predictive modeling, from scratch, in R

Before my course on "big data and economics" at the University of Barcelona in July, I wanted to upload a series of posts on classification techniques, to get some insight into machine learning tools.

A common belief is that machine learning algorithms are black boxes, and I wanted to push back on that idea. First of all, isn't that also the case for regression models, like generalized additive models (with splines)? Do you really know what the algorithm is doing? Even for logistic regression: textbooks give the math formulas easily enough, but what is actually done when I run it in R?

When I started working in academia, someone told me something like, "if you really want to understand a theory, teach it." That has been my motto for more than 15 years. I wanted to add a second part to that statement: "if you really want to understand an algorithm, recode it." So let's try this... My ambition is to recode (more or less) most of the standard algorithms used in predictive modeling, from scratch, in R. What I plan to cover, within the next two weeks, is:

  • the logistic regression.
  • the logistic regression with splines.
  • the logistic regression with kernels (and knn).
  • the penalized logistic regression.
  • the heuristics of neural networks.
  • an introduction to SVM.
  • classification trees.
  • bagging and random forests.
  • gradient boosting (and adaboost).

I will use two datasets as illustrations. The first one is inspired by the cover of "Foundations of Machine Learning" by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. At least, with this dataset, it will be possible to plot predictions (since there are only two features, both continuous).
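To fix ideas, here is a purely illustrative way to simulate a small dataset of that flavor (two continuous features, x1 and x2, and a 0/1 label z); it is only a sketch, not the actual data used for the figures.

# purely illustrative toy data: two continuous features and a 0/1 label
set.seed(1)
n  = 10
x1 = runif(n)
x2 = runif(n)
z  = (x1 + x2 + rnorm(n, sd = .3) > 1) * 1
df = data.frame(x1 = x1, x2 = x2, z = z)
# black points are z = 1, white points are z = 0
plot(df$x1, df$x2, pch = 21, cex = 2, bg = c("white", "black")[df$z + 1])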

Here is some code to get a visualization of the prediction (here, the probability of being a black point).
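A minimal sketch of such a visualization, assuming the illustrative data frame df from above, a plain logistic regression as the fitted model, and the clr10 palette defined just below:

# fit a (linear) logistic regression on the two features
reg = glm(z ~ x1 + x2, data = df, family = binomial)
# predicted probability of being a black point, evaluated on a fine grid
vx = seq(0, 1, length = 101)
vy = seq(0, 1, length = 101)
grid = expand.grid(x1 = vx, x2 = vy)
proba = matrix(predict(reg, newdata = grid, type = "response"), length(vx), length(vy))
# level plot of the probability, with the observations on top
image(vx, vy, proba, col = clr10, xlab = "x1", ylab = "x2")
points(df$x1, df$x2, pch = 21, cex = 2, bg = c("white", "black")[df$z + 1])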

Note that colors are defined here as:

clr10 = c("#ffffff","#f7fcfd","#e5f5f9","#ccece6","#99d8c9","#66c2a4","#41ae76","#238b45","#006d2c","#00441b")

Or with some nonlinear model:
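For instance (just one possible choice, k-nearest neighbors here), reusing the grid and palette from above:

library(class)
# k-nearest neighbors (k = 3); prob = TRUE returns the proportion of votes
# for the winning class, which we convert into P(z = 1)
pred = knn(train = df[, c("x1", "x2")], test = grid, cl = factor(df$z), k = 3, prob = TRUE)
p1 = ifelse(pred == "1", attr(pred, "prob"), 1 - attr(pred, "prob"))
proba = matrix(p1, length(vx), length(vy))
image(vx, vy, proba, col = clr10, xlab = "x1", ylab = "x2")
points(df$x1, df$x2, pch = 21, cex = 2, bg = c("white", "black")[df$z + 1])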

The second one is a dataset I got from Gilbert Saporta, about heart attacks and survival (our binary variable).

# heart attack data; PRONO is the prognosis (survival or death)
myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv", header = TRUE, sep = ";")
# recode the prognosis as 0/1: 1 = survived, 0 = died
myocarde$PRONO = (myocarde$PRONO == "SURVIE") * 1
y = myocarde$PRONO
# design matrix: a column of ones (intercept) and the seven covariates
X = as.matrix(cbind(1, myocarde[, 1:7]))

So far, I do not plan to talk (too much) about the choice of tuning parameters (and cross-validation), about comparing models, etc. The goal here is simply to understand what's going on when we call glm, glmnet, gam, random forest, svm, xgboost, or any other function to get a prediction model.
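As a point of reference, the kind of black-box one-liner those posts will try to unpack looks, for a logistic regression on the myocarde data, something like this (a minimal illustration):

# the standard black-box call for a logistic regression
reg = glm(PRONO ~ ., data = myocarde, family = binomial)
summary(reg)$coefficients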
