Breezing Through Support Vector Machines
Breezing Through Support Vector Machines
In this article, we discuss some basics behind how vector algebra works under-the-hood in machine learning models.
Join the DZone community and get the full member experience.Join For Free
When we add "Machine" to anything it looks cool... perhaps due to an assumption made about the introduction of both "intelligence" and "automation" hinted by the use of such a term.
So, let's take that out and we are back to old, classical vector algebra. It's like a person with a bunch of sticks to figure out which one to lay where in a 2-D plane to separate one class of objects from another, provided class definitions are already known.
The problem is which particular shape and length must be chosen to show maximum contrast between classes.
We need to arrive at a function definition, in such a way that the value a given function takes changes drastically (e.g. from a large positive value to a large negative value).
In it's simplest form, output Y can be represented as a vector product of two independent vectors K and X respectively, where we have to compute values of vector K, given input vector X, so that it minimizes the error between computed approximation Yhat and Y.
You may also like: R Essentials.
(K)*(X) is the generalized dot product (a.k.a. the "inner product" of input vectors). Doing the dot product of X with transpose of K is equivalent to computing the projection of vector X on vector K. This is the "shadow" cast of one vector on another.
To put it more simply, it is a weighted sum of values of vectors K and X respectively; for each observation value of X, where the corresponding value of K for a given value of X must be computed in order to arrive at Yhat, which most closely resembles the classification sets that exist in Y.
The value that such computation takes, considering that it will be a numeric value, must provide a good enough demarcation in order to separate different classes that exist in Y. Such "demarcation" has to be as wide as possible so as not to interfere with individual class boundaries.
The contour of such demarcation depends on the dimensional space we are operating in.
For a 2-D space, it is simply a plane ( Z1*X+Z2*Y=C) — think about a DMZ zone or a "No man's land" that exists between nations that share borders. And this plane would evaluate to a starkly contrasting value (-C for one side and +C for another, for a symmetrical demarcation to exist between the two class boundaries).
In fact, the equation Yhat<-t(K)*X , which is same as : K1*X1+K2*X2+K3*X3+...KN*XN, for N observations, is indeed this plane that seeks to distinguish and separate the two classes on either side.
A starker contrast in computed value on either side of demarcation, will result in more pronounced errors of misclassification and in turn enable optimization algorithm to minimize such errors by adjusting values of vector K.
As an example, the computed value abruptly switches from -100 to +100, as the model tries to distinguish Class A from Class B.
The optimization objective for the algorithm would be to maximize this abrupt contrast.
Obviously, the planes can distinguish linearly distributed classes. If we can't imagine a line that can separate the two classes without bending or twisting it, then we have look at non-linear solutions to arriving at providing maximum contrast.
The problem is similar to taking a stroll or drive to remember important landmarks so that we can make it back or go there again. In doing so, every location that one visits needs to be compared in similarity with a set of previously identified landmarks.
Hence, we need a similarity function that tells in Binary or Boolean terms as to whether a given observation is very similar to a given landmark.
Similarity functions can be generally expressed as Similarity(L, X).
One commonly used form would be a polynomial expressed as:
Another popular, general function can be Gaussian — one that takes the form:
We are tweaking it by replacing mean(X) by LM[i], where LM is a vector of Landmarks. Then, we introduce constant epsilon (just like we did in polynomial similarity function), which in this case takes a value as (1/(2*sd(x)^2))
Obviously, it can be hard to choose which similarity function would be most suitable and what should be the values of various constants.
In R, the function
best.svm comes in handy (in package e1071).
We will use this on our previous data set german_credit.csv. ( Refer to my previous article.)
best.svm has now automatically selected the similarity function type as Radial. (Similar to the Gaussian function definition, as discussed here).
Opinions expressed by DZone contributors are their own.