Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Classification From Scratch, Part 7 of 8: SVM

DZone's Guide to

Classification From Scratch, Part 7 of 8: SVM

In this post, we continue our discussion of regression models in by looking at Support Vector Machines and how they apply to big data.

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

This is the seventh post of our series on classification from scratch. The latest one was on the neural nets, and today, we will discuss SVM, support vector machines.

A Formal Introduction

Here y takes values in {-1,+1}. Our model will be

Image title

Thus, the space is divided by a (linear) border

Image title

The distance from point xi to ∆ is 

Image title

If the space is linearly separable, the problem is ill-posed (there is an infinite number of solutions). So consider



The strategy is to maximize the margin. One can prove that we want to solve

Image title

subject to

Image title

Again, the problem is ill posed (non identifiable), and we can consider m=1:

Image title

subject to

Image title

The optimization objective can be written

Image title

The Primal Problem

In the separable case, consider the following primal problem,

Image title

subject to

Image title

In the non-separable case, introduce slack (error) variables 

Image title

there is no error 

Image title

Let C denote the cost of misclassification. The optimization problem becomes

Image title

Image title

Let's try to code this optimization problem. The dataset is here:

n = length(myocarde[,"PRONO"])
myocarde0 = myocarde
myocarde0$PRONO = myocarde$PRONO*2-1
C = .5

And we have to set a value for the cost, C. In the (linearly) constrained optimization function in R, we need to provide the objective function f(ø) and the gradient: 

Image title

f = function(param){
  w  = param[1:7]
  b  = param[8]
  xi = param[8+1:nrow(myocarde)]
  .5*sum(w^2) + C*sum(xi)}
grad_f = function(param){
  w  = param[1:7]
  b  = param[8]
  xi = param[8+1:nrow(myocarde)]
  c(2*w,0,rep(C,length(xi)))}

and (linear) constraints are written as:

Image title

U = rbind(cbind(myocarde0[,"PRONO"]*as.matrix(myocarde[,1:7]),diag(n),myocarde0[,"PRONO"]),
cbind(matrix(0,n,7),diag(n,n),matrix(0,n,1)))
C = c(rep(1,n),rep(0,n))

Then we use

constrOptim(theta=p_init, f, grad_f, ui = U,ci = C)

Observe that something is missing here: we need a starting point for the algorithm, øo. Unfortunately, I could not think of a simple technique to get a valid starting point (that satisfies those linear constraints).

Let us try something else. Because those functions are quite simple: either linear or quadratic. Actually, one can recognize in the separable case, but also in the non-separable case, a classic quadratic program

Image title


subject to Az ≥ b.

library(quadprog)
eps = 5e-4
y = myocarde[,"PRONO"]*2-1
X = as.matrix(cbind(1,myocarde[,1:7]))
n = length(y)
D = diag(n+7+1)
diag(D)[8+0:n] = 0 
d = matrix(c(rep(0,7),0,rep(C,n)), nrow=n+7+1)
A = Ui
b = Ci
sol = solve.QP(D+eps*diag(n+7+1), d, t(A), b, meq=1, factorized=FALSE)
qpsol = sol$solution
(omega = qpsol[1:7])
[1] -0.106642005446 -0.002026198103 -0.022513312261 -0.018958578746 -0.023105767847 -0.018958578746 -1.080638988521
(b     = qpsol[n+7+1])
[1] 997.6289927

Given an observation x, the prediction is

Image title

y_pred = 2*((as.matrix(myocarde0[,1:7])%*%omega+b)>0)-1

Observe that here, we do have a classifier, depending if the point lies on the left or on the right (above or below, etc.) the separating line (or hyperplane). We do not have a probability, because there is no probabilistic model here. So far.

The Dual Problem

The Lagrangian of the separable problem could be written introducing Lagrange multipliers

Image title

as

Image title

Somehow,  alpha[i] represents the influence of the observation (yi, xi).

Consider the Dual Problem, with 

Image title

and

Image title

Image title

subject to 

Image title

The Lagrangian of the non-separable problem could be written introducing Lagrange multipliers 

Image title

and define the Lagrangian

Image title

as

Image title

Somehow, alpha[i] represents the influence of the observation (yi, xi).

The Dual Problem become with G=[Gij] and Gij=yiyjxj^Txi

Image title

subject to 

Image title

As previously, one can also use quadratic programming

library(quadprog)
eps = 5e-4
y = myocarde[,"PRONO"]*2-1
X = as.matrix(cbind(1,myocarde[,1:7]))
n = length(y)
Q = sapply(1:n, function(i) y[i]*t(X)[,i])
D = t(Q)%*%Q
d = matrix(1, nrow=n)
A = rbind(y,diag(n),-diag(n))
C = .5
b = c(0,rep(0,n),rep(-C,n))
sol = solve.QP(D+eps*diag(n), d, t(A), b, meq=1, factorized=FALSE)
qpsol = sol$solution

The two problems are connected in the sense that for all x

Image title

To recover the solution of the primal problem,

Image title

thus

omega = apply(qpsol*y*X,2,sum)
omega
                           1                        FRCAR                        INCAR                        INSYS 
 0.0000000000000002439074265  0.0550138658687635215271960 -0.0920163239049630876653652  0.3609571899422952534486342 
                       PRDIA                        PAPUL                        PVENT                        REPUL 
-0.1094017965288692356695677 -0.0485213403643276475207813 -0.0660058643191372279579454  0.0010093656567606212794835

whil eb=yω^Tx (but actually, one can add the constant vector in the matrix of explanatory variables).

More generally, consider the following function (to make sure that D is a definite-positive matrix, we use the nearPD function).

svm.fit = function(X, y, C=NULL) {
 n.samples = nrow(X)
 n.features = ncol(X)
 K = matrix(rep(0, n.samples*n.samples), nrow=n.samples)
 for (i in 1:n.samples){
  for (j in 1:n.samples){
   K[i,j] = X[i,] %*% X[j,] }}
 Dmat = outer(y,y) * K
 Dmat = as.matrix(nearPD(Dmat)$mat) 
 dvec = rep(1, n.samples)
 Amat = rbind(y, diag(n.samples), -1*diag(n.samples))
 bvec = c(0, rep(0, n.samples), rep(-C, n.samples))
 res = solve.QP(Dmat,dvec,t(Amat),bvec=bvec, meq=1)
 a = res$solution 
 bomega = apply(a*y*X,2,sum)
 return(bomega)
}

On our dataset, we obtain

M = as.matrix(myocarde[,1:7])
center = function(z) (z-mean(z))/sd(z)
for(j in 1:7) M[,j] = center(M[,j])
bomega = svm.fit(cbind(1,M),myocarde$PRONO*2-1,C=.5)
y_pred = 2*((cbind(1,M)%*%bomega)>0)-1
table(obs=myocarde0$PRONO,pred=y_pred)
    pred
obs  -1  1
  -1 27  2
  1   9 33

i.e. 11 misclassifications, out of 71 points (which is also what we got with the logistic regression).

Kernel-Based Approach

In some cases, it might be difficult to "separate" by a linear separators the two sets of points, like below:

It might be difficult, here, because of our want to find a straight line in the two dimensional space (x1,x2). But maybe, we can distort the space, possibly by adding another dimension

That's heuristically the idea. Because on the case above, in dimension 3, the set of points is now linearly separable. And the trick to do so is to use a kernel. The difficult task is to find the good one (if any).

A positive kernel on X is a function K: X x X ->R symmetric, and such that for any 

Image title

Image title

For example, the linear kernel is k(xi,xj)=xi^Txj. That's what we've been using here, so far. One can also define the product kernel 

Image title

Finally, the Gaussian kernel is: 

Image title

Since it is a function of ||xi-xj||, it is also called a radial kernel.

linear.kernel = function(x1, x2) {
 return (x1%*%x2)
}
svm.fit = function(X, y, FUN=linear.kernel, C=NULL) {
 n.samples = nrow(X)
 n.features = ncol(X)
 K = matrix(rep(0, n.samples*n.samples), nrow=n.samples)
 for (i in 1:n.samples){
  for (j in 1:n.samples){
   K[i,j] = FUN(X[i,], X[j,])
  }
 }
 Dmat = outer(y,y) * K
 Dmat = as.matrix(nearPD(Dmat)$mat) 
 dvec = rep(1, n.samples)
 Amat = rbind(y, diag(n.samples), -1*diag(n.samples))
 bvec = c(0, rep(0, n.samples), rep(-C, n.samples))
 res = solve.QP(Dmat,dvec,t(Amat),bvec=bvec, meq=1)
 a = res$solution 
 bomega = apply(a*y*X,2,sum)
 return(bomega)
}

Link to the Regression

To relate this duality optimization problem to OLS, recall that 

Image title

But one can also write

Image title

Application (on Our Small Dataset)

One can actually use a dedicated R package to run an SVM. To get the linear kernel, use

library(kernlab)
df0 = df
df0$y = 2*(df$y=="1")-1
SVM1 = ksvm(y ~ x1 + x2, data = df0, C=.5, kernel = "vanilladot" , type="C-svc")

Since the dataset is not linearly separable, there will be some mistakes here

table(df0$y,predict(SVM1))

     -1 1
  -1  2 2
  1   1 5

The problem with that function is that it cannot be used to get a prediction for points other than those in the sample (and I could neither extract w nor b from the 24 slots of that object). But it's possible by adding a small option in the function

SVM2 = ksvm(y ~ x1 + x2, data = df0, C=.5, kernel = "vanilladot" , prob.model=TRUE, type="C-svc")

With that function, we convert the distance as some sort of probability. Someday, I will try to replicate the probabilistic version of SVM, I promise, but today, the goal is just to understand what is done when running the SVM algorithm. To visualize the prediction, use

pred_SVM2 = function(x,y){
return(predict(SVM2,newdata=data.frame(x1=x,x2=y), type="probabilities")[,2])}
plot(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")],
     cex=1.5,xlab="",
     ylab="",xlim=c(0,1),ylim=c(0,1))
vu = seq(-.1,1.1,length=251)
vv = outer(vu,vu,function(x,y) pred_SVM2(x,y))
contour(vu,vu,vv,add=TRUE,lwd=2,nlevels = .5,col="red")


Here the cost is C=.5, but of course, we can change it

SVM2 = ksvm(y ~ x1 + x2, data = df0, C=2, kernel = "vanilladot" , prob.model=TRUE, type="C-svc")
pred_SVM2 = function(x,y){
return(predict(SVM2,newdata=data.frame(x1=x,x2=y), type="probabilities")[,2])}
plot(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")],
     cex=1.5,xlab="",
     ylab="",xlim=c(0,1),ylim=c(0,1))
vu = seq(-.1,1.1,length=251)
vv = outer(vu,vu,function(x,y) pred_SVM2(x,y))
contour(vu,vu,vv,add=TRUE,lwd=2,levels = .5,col="red")


As expected, we have a linear separator. But slightly different. Now, let us consider the "Radial Basis Gaussian kernel."

SVM3 = ksvm(y ~ x1 + x2, data = df0, C=2, kernel = "rbfdot" , prob.model=TRUE, type="C-svc")

Observe that here, we've been able to separate the white and the black points.

table(df0$y,predict(SVM3))

     -1 1
  -1  4 0
  1   0 6
plot(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")],
     cex=1.5,xlab="",
     ylab="",xlim=c(0,1),ylim=c(0,1))
vu = seq(-.1,1.1,length=251)
vv = outer(vu,vu,function(x,y) pred_SVM3(x,y))
contour(vu,vu,vv,add=TRUE,lwd=2,levels = .5,col="red")


Now, to be completely honest, if I understand the theory of the algorithm used to compute and b with linear kernels (using quadratic programming), I do not feel comfortable with this R function. Especially if you run it several times... you can get (with exactly the same set of parameters)

or

(to be continued...)

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:
big data ,r ,svm ,support vector machines

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}