# Classification From Scratch, Part 6: Neural Nets

# Classification From Scratch, Part 6: Neural Nets

### A look into the implications of neural networks on big data, and how to represent the necessary algorithms in R. Let's get started!

Join the DZone community and get the full member experience.

Join For Free**Learn how to operationalize machine learning and data science projects to monetize your AI initiatives. Download the Gartner report now. **

This is the sixth post of our series on classification from scratch. The latest one was on the lasso regression, which was still based on a logistic regression model, assuming that the variable of interest ** Y** has a Bernoulli distribution. From now on, we will discuss a technique that did not originate from those probabilistic models, even if they might still have a probabilistic interpretation. Somehow. Today, we will start with neural nets.

Maybe I should start with a disclaimer. The goal is not to replicate well-designed R functions, used for predictive modeling. It is simply to get a basic understanding of what's going on.

## Networks, Nodes, and Edges

First of all, neural nets are nets or networks. I will skip the parallel with "neural" stuff because it does not help me understand what is happening (all apologies for my poor knowledge of biology, and cells)

So, it's about some network. Networks have nodes, and edges (possibly connected) that connect nodes.

Or maybe, to more specific (at least it helped me understand what's going on), some sort of flow network.

In such a network, we usually have sources (here multiple) sources (here **{s1, s2, s3}**), on the left, on a sink (here **{t}**), on the right. To continue with this metaphorical introduction, information from the sources should reach the sink. And usually, sources are explanatory variables, {x1 ,...,xp}, and the sink is our variable of interest ** y**. And we want to create a graph, from the sources to the sink. We will have directed edges, with only one (unique) direction, where we will put weights. It is not a flow, the parallel with the flow will stop here. For instance, the most simple network will be the following one, with no layer (i.e., no node between the source and the sink):

The output here is a binary variable:

It can also be

But here, it's not a big deal. In our network, our output will be

because it is easier to handle. For instance, consider y=*f*(something), for some function *f* taking values in (0,1). One can consider the sigmoid function

This is actually the logistic function (so we should not be surprised to have results somehow close the logistic regression...). This function *f* is called the activation function, and there are thousands of such functions. If

people consider the hyperbolic tangent

or the inverse tangent function

And as input for such function, we consider a weighted sum of incoming nodes. So here

We can also add a constant actually

So far, we are not far away from the logistic regression. Except that our starting point was a probabilistic model, in the sense that the later was interpreted as a probability (the probability that Y=1) and we wanted the model with the highest likelihood. But we'll talk about the selection of weights later one. First, let us construct our first (very simple) neural network. First, we have the sigmoid function:

`sigmoid = function(x) 1 / (1 + exp(-x))`

Then consider some weights. In our model with seven explanatory variables, with need 7 weights. Or 8 if we include the constant term. Let us consider *w*=1

```
weights_0 = rep(1,8)
X = as.matrix(cbind(1,myocarde[,1:7]))
y_5_1 = sigmoid(X %*% weights_0)
```

then use

`y_5_1 = sigmoid(X %*% weights_0)`

In order to see if we get a "good" prediction, let's plot the ROC curve, and compare it with the one we got with a (simple) logistic regression:

```
library(ROCR)
pred = ROCR::prediction(y_5_1,myocarde$PRONO)
perf = ROCR::performance(pred,"tpr", "fpr")
plot(perf,col="blue",lwd=2)
reg = glm(PRONO~.,data=myocarde,family=binomial(link = "logit"))
y_0 = predict(reg,type="response")
pred0 = ROCR::prediction(y_0,myocarde$PRONO)
perf0 = ROCR::performance(pred0,"tpr", "fpr")
plot(perf0,add=TRUE,col="red")
```

That's not bad for a very first attempt. Except that we've been cheating here, since we did use

How, for real, should we choose those weights?

## Using a Loss Function

Well, if we want an "optimal" set of weights, we need to "optimize" an objective function. So we need to quantify the loss of a mistake, between the prediction, and the observation. Consider here a quadratic loss function:

```
loss = function(weights){
mean( (myocarde$PRONO-sigmoid(X %*% weights))^2) }
```

It might be stupid to use a quadratic loss function for a classification, but here, it's not the point. We just want to understand what the algorithm we use is, and the loss function *l* is just one parameter. Then we want to solve

Thus, consider

`weights_1 = optim(weights_0,loss)$par`

(where the starting point is the OLS estimate). Again, to see what's going on, let us visualize the ROC curve

```
y_5_2 = sigmoid(X %*% weights_1)
pred = ROCR::prediction(y_5_2,myocarde$PRONO)
perf = ROCR::performance(pred,"tpr", "fpr")
plot(perf,col="blue",lwd=2)
plot(perf0,add=TRUE,col="red")
```

That's not amazing, but again, that's only a first step.

## A Single Layer

Let us add a single layer in our network.

Those nodes are connected to the sources (incoming from sources) from the left and then connected to the sink, on the right. Those nodes are not inter-connected. And again, for that network, we need edges (i.e., series of weights). For instance, on the network above, we did add one single layer, with (only) three nodes.

For such a network, the prediction formula is

or more synthetically

Usually, we consider the same activation function everywhere. Don't ask me why I find that weird.

Now, we have a lot of weights to choose. Let us use again OLS estimates:

```
weights_1 <- lm(PRONO~1+FRCAR+INCAR+INSYS+PAPUL+PVENT,data=myocarde)$coefficients
X1 = as.matrix(cbind(1,myocarde[,c("FRCAR","INCAR","INSYS","PAPUL","PVENT")]))
weights_2 <- lm(PRONO~1+INSYS+PRDIA,data=myocarde)$coefficients
X2=as.matrix(cbind(1,myocarde[,c("INSYS","PRDIA")]))
weights_3 <- lm(PRONO~1+PAPUL+PVENT+REPUL,data=myocarde)$coefficients
X3=as.matrix(cbind(1,myocarde[,c("PAPUL","PVENT","REPUL")]))
```

In that case, we did specify edges, and which sources (explanatory variables) should be used for each additional node. Actually, here, other techniques could be have been used, like using a PCA. Each node will then be one of the components. But we'll use that idea later one...

`X = cbind(sigmoid(X1 %*% weights_1), sigmoid(X2 %*% weights_2), sigmoid(X3 %*% weights_3))`

But we're not done here. Those were weights from the source to the know nodes, in the layer. We still need the weights from the nodes to the sink. Here, let us use a simple average

```
weights = c(1/3,1/3,1/3)
y_5_3 <- sigmoid(X %*% weights)
```

Again, we can plot the ROC curve to see what we've done...

```
pred = ROCR::prediction(y_5_3,myocarde$PRONO)
perf = ROCR::performance(pred,"tpr", "fpr")
plot(perf,col="blue",lwd=2)
plot(perf0,add=TRUE,col="red")
```

## On Back Propagation

Now, we need some optimal selection of those weights. Observe that with only 3 nodes, there are already (7 + 1) x 3 + 3 = 27 parameters in that model! Clearly, parcimony is not the major issue when you start using neural nets! If

we want to solve

for some loss function, which is

for the quadratic norm, or

if we want to use cross-entropy.

For convenience, let us center all the variable we create, otherwise, we get numerical problems.

```
center = function(z) (z-mean(z))/sd(z)
loss = function(weights){
weights_1 = weights[0+(1:7)]
weights_2 = weights[7+(1:7)]
weights_3 = weights[14+(1:7)]
weights_ = weights[21+1:4]
X1=X2=X3=as.matrix(myocarde[,1:7])
Z1 = center(X1 %*% weights_1)
Z2 = center(X2 %*% weights_2)
Z3 = center(X3 %*% weights_3)
X = cbind(1,sigmoid(Z1), sigmoid(Z2), sigmoid(Z3))
mean( (myocarde$PRONO-sigmoid(X %*% weights_))^2)}
```

Now that we have our objective function, consider some starting points. We can consider weights from a PCA, and then use a gradient descent algorithm,

```
pca = princomp(myocarde[,1:7])
W = get_pca_var(pca)$contrib
weights_0 = c(W[,1],W[,2],W[,3],c(-1,rep(1,3)/3))
weights_opt = optim(weights_0,loss)$par
```

The prediction is then obtained using

```
weights_1 = weights_opt[0+(1:7)]
weights_2 = weights_opt[7+(1:7)]
weights_3 = weights_opt[14+(1:7)]
weights_ = weights_opt[21+1:4]
X1=X2=X3=as.matrix(myocarde[,1:7])
Z1 = center(X1 %*% weights_1)
Z2 = center(X2 %*% weights_2)
Z3 = center(X3 %*% weights_3)
X = cbind(1,sigmoid(Z1), sigmoid(Z2), sigmoid(Z3))
y_5_4 = sigmoid(X %*% weights_)
```

And, like before, why not plot the ROC curve of that model?

```
pred = ROCR::prediction(y_5_4,myocarde$PRONO)
perf = ROCR::performance(pred,"tpr", "fpr")
plot(perf,col="blue",lwd=2)
plot(perf,add=TRUE,col="red")
```

That's not too bad. But with 27 coefficients, that's what we would expect, no?

## Using the nnet() Function

That's more or less what is done in neural nets functions. Let's now have a look at some dedicated R functions.

```
library(nnet)
myocarde_minmax = myocarde
minmax = function(z) (z-min(z))/(max(z)-min(z))
for(j in 1:7) myocarde_minmax[,j] = minmax(myocarde_minmax[,j])
```

Here, variables are linearly transformed, to take values in (0,1). Then we can construct a neural network with one single layer, and three nodes:

```
model_nnet = nnet(PRONO~.,data=myocarde_minmax,size=3)
summary(model_nnet)
a 7-3-1 network with 28 weights
options were -
b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1
-9.60 -1.79 21.00 14.72 -20.45 -5.05 14.37 -17.37
b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2
4.72 2.83 -3.37 -1.64 1.49 2.12 2.31 4.00
b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3
-0.58 -6.03 25.14 18.03 -1.19 7.52 -19.47 -12.95
b->o h1->o h2->o h3->o
-1.32 29.00 -10.32 26.27
```

Here, it is the complete full network. And actually, there are (online) some functions that can be used to visualize that network:

```
library (devtools)
source_url( 'https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r' )
plot.nnet (model_nnet)
```

Nice, isn't it? We clearly see the intermediary layer, with three nodes, and on top the constants. Edges are the plain lines, the darker, the heavier (in terms of weights).

## Using neuralnet()

Other R functions can actually be considered.

```
library(neuralnet)
model_nnet = neuralnet(formula(glm(PRONO~.,data=myocarde_minmax)),
myocarde_minmax,hidden=3, act.fct = sigmoid)
plot(model_nnet)
```

Again, for the same network structure, with one (hidden) layer, and three nodes in it.

## Network With Multiple Layers

The good thing is that it's not possible to add more layers. Like two layers. Nodes from the first layer are no longer connected with the sink, but with nodes in the second layer. And those nodes will then be connected to the sink. We now have something like

where

I may be rambling here (a little bit) but that's a lot of parameters. Here is the visualization of such a network,

```
library(neuralnet)
model_nnet = neuralnet(formula(glm(PRONO~.,data=myocarde_minmax)),
myocarde_minmax,hidden=3, act.fct = sigmoid)
plot(model_nnet)
```

## Application

Let us get back on our simple dataset, with only two covariates.

```
library(neuralnet)
df_minmax =df
df_minmax$y=(df_minmax$y=="1")*1
minmax = function(z) (z-min(z))/(max(z)-min(z))
for(j in 1:2) df_minmax[,j] = minmax(df[,j])
X = as.matrix(cbind(1,df_minmax[,1:2]))
```

Consider only one layer, with two nodes

```
model_nnet = neuralnet(formula(lm(y~.,data=df_minmax)),
df_minmax,hidden=c(2))
plot(model_nnet)
```

Here, we did not specify it, but the activation function is the sigmoid (actually, it is called logistic here)

```
model_nnet$act.fct
function (x)
{
1/(1 + exp(-x))
}
attr(,"type")
[1] "logistic"
f=model_nnet$act.fct
```

The weights (on the figure) can be obtained using

```
w0 = model_nnet$weights[[1]][[2]][,1]
w1 = model_nnet$weights[[1]][[1]][,1]
w2 = model_nnet$weights[[1]][[1]][,2]
```

Now, to get our prediction, we should use

which can be obtained using

```
f(cbind(1,f(X%*%w1),f(X%*%w2))%*%w0)
[,1]
[1,] 0.7336477343
[2,] 0.7317999050
[3,] 0.7185803540
[4,] 0.7404005280
[5,] 0.7518482779
[6,] 0.4939774149
[7,] 0.4965876378
[8,] 0.7101714888
[9,] 0.5050760026
[10,] 0.5049877644
```

Unfortunately, it is not the output of the model here,

```
neuralnet::prediction(model_nnet)
Data Error:0;
$rep1
x1 x2 y
1 0.1250 0.0000000000 0.02030470787
2 0.0625 0.1176470588 0.89621706711
3 0.9375 0.2352941176 0.01995171956
4 0.0000 0.4705882353 1.10849420363
5 0.5000 0.4705882353 -0.01364966058
6 0.3125 0.5294117647 -0.02409150561
7 0.6875 0.8235294118 0.93743057765
8 0.3750 0.8823529412 1.01320924782
9 1.0000 0.9058823529 1.04805134309
10 0.5625 1.0000000000 1.00377379767
```

If anyone has a clue, I'd be glad to know what went wrong here... I find that odd to have outputs outside the (0,1) interval, but the output is neither

```
cbind(1,f(X%*%w1),f(X%*%w2))%*%w0
[,1]
[1,] 1.01320924782
[2,] 1.00377379767
[3,] 0.93743057765
[4,] 1.04805134309
[5,] 1.10849420363
[6,] -0.02409150561
[7,] -0.01364966058
[8,] 0.89621706711
[9,] 0.02030470787
[10,] 0.01995171956
```

**Bias comes in a variety of forms, all of them potentially damaging to the efficacy of your ML algorithm. Our Chief Data Scientist discusses the source of most headlines about AI failures here. **

Published at DZone with permission of Arthur Charpentier , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}