Machine Learning: The Bigger Picture, Part I
The first part of Tamis van der Laan's article on Machine Learning, featured in DZone's Guide to Big Data Processing, available now!
Join the DZone community and get the full member experience.Join For Free
In the past few decades, computer systems have achieved a whole lot. They have managed to organize and catalog the information produced by our civilization as a whole. They have relieved us from stringent cognitive tasks and increased our productivity significantly. One could say that where the industrial revolution automated labor, the digital revolution has automated cognitive labor. This statement isn’t entirely correct however, if it was we would all be without a job. So what can humans do that machines can’t?
The classic computer system can be seen as a big information switching network that directs information to where it needs to go. A good example of such a system is SAGE, one of the very first and largest computers ever built. The SAGE computer system was built during the cold war for the sole task of integrating all information from radar systems across the United States for defense purposes. The problem was the tracking and identification of airplanes flying over the United States. As nuclear weapons could only be deployed by means of aircraft, there was a need to monitor the entire US airspace for Russian nuclear bombers. The SAGE system integrated the information coming from all radar installations and routed this information to operators that would classify airplanes as hostile or friendly. This information was then passed onto other operators and eventually to the command center, where the decision was made to intercept enemy aircraft using fighters or launch ground to air missile systems. One could argue that SAGE was one of the first big data processing systems ever built. The following video link demonstrates the operations of the SAGE system:
Although impressive, we cannot say the SAGE system was very intelligent. The system would simply route information to operators in an orderly fashion. Because the information was routed automatically to the appropriate workstation labor could be divided between operators and each operators could specialize in there own specific task. Because of this, decisions could be made much faster and enemy airplanes could be intercepted in a timely manner. The system, however, was not intelligent, as it heavily relied on the cognitive abilities of the operators to make complex decisions. It did, however, provide an information routing infrastructure that radically increased the speed of operations. Due to an increase in computational power, computers can now make complex decisions on there own. One thing that still defines them however is that they need to be programmed by hand as to which decision they should make in which situation. One could naively think that with enough programmers and time we can let computers take over all our cognitive tasks. This, however, is not the case, there are applications that simply do not lend themselves to the classic way of programming. A good example of such a problem is face recognition. It turns out that it is very difficult to write a algorithm by hand that recognizes faces.
The problem here is that if you ask someone how they recognizes a face they will simply tell you they don’t know, they just do.
The second problem is that what defines a face is very difficult to describe. One could for example say a face has an oval shape, but describing that shape exactly and all the variance between the shape of peoples faces is extremely difficult, let alone all the shape differences caused by the angle at which a face can be photographed.
The third problem is that things tend to change over time. For example, beards have become more popular in the recent years, which could throw off the system.
Machine learning is a eld of computer science that concerns itself with the creation of algorithms that can learn from data and allows us to solve problems which cannot be programmed directly by hand such as face recognition. The fundamental idea is, instead of writing an algorithm that recognizes faces directly we write an algorithm that learns to recognize faces indirectly by example. The algorithm takes these examples and learns what is called a model which quanti es what constitutes a face or not. Because the system learns based off examples we can feed in samples in a continuous manner such that the algorithm can update its internal model continuously. This ensures we are always able to recognize faces even with changing trends in facial hair. The de nition of machine learning is very broad, i.e. algorithms that learn from data and hence is often applied in context. Some examples include computer vision, speech recognition and natural language processing. Machine learning is often applied in conjunction with big data systems. For example for analyzing large volumes of text documents in order to extract there topic. In London a system is operational that tracks the movement of people using camera’s all across the city. In the US a system is being developed which uses a single camera attached to a drone to observe a whole city ad hoc (youtu.be/QGxNyaXfJsA).
Machine Learning By Example
In order to get an idea of how machine learning works we are going to use a fairly simple example. We are going to build an industrial sorting machine that needs to differentiate between tomatoes and kiwis. The industrial machine uses a special measuring device that uses a laser to measure the color of the object on the assembly line from red to green. Using this information the sorting machine has to decide to put object into the tomato or kiwi bin. A machine learning algorithm that allows us to differentiate between different classes is called a classi er.
The Discriminative Classifier
To start of our machine learning journey we need samples of tomatoes and kiwis: 50 different tomatoes and 50 different kiwis. We then use our measuring device to measure all our examples. This results into 100 oat measurements that indicates the red to green color of the fruit, besides the measurements we also record in which class the fruit falls, namely the tomato or kiwi class. We will label the tomato class with a value of -1 and the kiwi class with a label of +1. We thus have two sets of 50 measurements and 50 labels. Let’s plot our measurements on the x axis and the labels on the y axis.
We see that the tomatoes and kiwis tend to lie in their own space on the horizontal axis. Even though there is some overlap between the two classes. This is to be expected as there exist for example unripe tomatoes that have a greenish color. This also tells us that only measuring color is probably not enough information to properly separate all kiwis and tomatoes, but for demonstration purposes I have chosen to keep things simple.
If we take a measurement of a unknown fruit we can now classify it as either a tomato or kiwi by simply looking in which region the fruit is located. What we have constructed is called discriminative classifier. The discriminative part refers to the fact that it splits space into regions corresponding to classes. This type of classi er has been very successful and is used all over in many applications. The next step in the process is to build a model that allows us to separate between the two fruits. We do this by setting up a function that is -1 for the tomato region and +1 for the kiwi region and consequently divides the space into two class regions.
The boundary between the two class regions is called the classi cation boundary. The goal is to divide the space in such a way that as much fruit is labeled as correctly as possible by our model. This process is called learning and we will talk about this process later on in the article. Given our model we can now in essence throw away our samples and simply apply the model to classify new objects as either a tomato or kiwi. We do this by measuring the color of the object and deciding in which class region this measurement falls.
In the example above we see that our measurement falls into the -1 region and we thus label the object as tomato. This approach is ne but as one caveat associated with it. Our model does not retain any information about the density or frequency of our data with respect to our measurement. To explain what this means consider the following scenario, what if our new sample lies very far away from all our training examples.
The question now is, are we looking at an exceptionally red tomato, or has something gone wrong, for example a piece of brightly colored red plastic has found its way into the system? This is a question our discriminative classifier can’t answer.
The Generative Classifier
In order to deal with the limitation stated above we introduce the generative classifier. Our generative classifier will make use of two normal distributions in order model our two fruit classes. Each normal distribution will represent the probability of the class given the color measurement. That is each normal distribution models the likelihood of a measurement being part of its respective class.
In order to discriminate between tomatoes and kiwis we now look at which of the two probability density functions is larger in order to classify a new measurement. In fact we can again split the space into two regions as before and throw away the normal distributions all together if we wanted to.
This is something we will not do, because the normal distributions include extra information we can use. Namely the normal distributions can tell us if a sample is likely to belong to one of the classes or not.
Consider the measurement from gure 4 which falls far away from the samples in our training set. We see that the measurement is not only improbable according to our training samples but also according to our model, i.e. the normal distributions. Hence we could detect this with our model and stop the assembly line for inspection by an operator. You might wonder where the word generative comes from. The word generative comes from the fact that we can draw samples according to our normal distributions and in effect generate measurements.
Learning and Optimization
So far we have seen that there are two classes of machinelearning algorithms, discriminative and generative. We assumethat we have a set of samples (actually called the training set)with which we can construct our model. Next we will take a look at how we actually construct a model using the training data, theprocess of which is called learning.
Let’s start with the generative classifier, we are looking to fit twonormal distributions to our data. The two normal distributionsare quantified using their mean μ and standard deviation σ. Soif we find those parameters we have constructed our modeland fitted the normal distribution onto their respective classdata. To compute the means we take the sample average of ourmeasurements in each class. To compute the standard deviationwe take the square root of the sample variance in the class data.The sample mean and deviation can be computed using thefollowing two equations:
Next we move onto the discriminative classifier. This can beillustrated with our discriminative classifier. Our discriminativeclassifier uses the following function as its model:
f(x) = sign[a · x + b]
We have measurement x, constants a and b that determinewhere the measurement will be separated into the −1 and +1regions. The goal is to find values for a and b such that as manysamples are classified correctly.
Finding the parameters is not always analytically tractable, thisis also the case for our discriminative 4 classifier, in this case we require some kind of optimization method which finds theparameters for us iteratively. We use the following optimization algorithm to find the values of a and b:
Algorithm 1: Optimization Algorithm
Start with a random guess for parameters a and b.
Update variables for all samples with measurement x and label y:
If maximum iterations not met go back to step 2.
In this algorithm f(x) is the result of applying our model tosample measurement x. This produces a −1 or +1 classification.We then compare this result to the original label y associatedwith the sample. We then update the parameters based on thiscomparison, i.e. we move the decision boundary in such a waythe sample is more likely to be updated. The size of the update isdefined by α, which is called the learning rate.
Supervised and Unsupervised Learning
Now that we know the difference between a generative anddiscriminative classifier as well as how to learn the parametersof such a classifier we now move on to unsupervised learning.You may not have realized it but in the example above we haveapplied supervised learning. We had a training set consisting ofour measurements (fruit color) and labels (tomato or kiwi). Butwhat if we don’t know the label of our examples? In other words, we have measurements of fruit we know are either of the type tomato or kiwi but we don’t know which.
This is the problem of unsupervised learning, and is consideredmore complex then supervised learning. The way to tacklethis problem is to refer back to our generative method ofclassification. Using our generative method we fitted two gaussians onto our class data. This time however we have noclass data and need to fit our gaussians to our data without knowledge of our classes. The way we can do this is by trying to fit the two gaussians such that they explain our data as goodas possible. We want to maximize the likelihood that our data was generated by our two normal distributions. The way to do this is by using a fairly complex optimization algorithm called expectation maximization, which is an iterative procedure for finding the maximum a posteriori estimates of the parameters.
This algorithm lies beyond the scope of this article but many resources on the subject exist. After applying expectation maximization optimization algorithm we get the values of the means and standard deviations producing the following result:
Now we can classify points by comparing the likelihoods as before.
Parametric and Non-Parametric Algorithms
Next we are going to look at the difference between parametricand non-parametric classifiers. The term parametric algorithmrefers to the parameters your model uses to define your model.Using a parametric algorithm we use the training data to extractvalues for our parameters that completely define our model. The two models we have looked at so far both use parameters. The discriminative model uses two parameters a and b, while thegenerative model uses four parameters two means μ1, μ2 and two standard deviations σ1, σ2. After finding these parameters we no longer need the training set and can literally throw it away.
Non-parametric algorithms don’t use parameters to specify their model but instead use the training data directly. This often results in more computational overhead and memory usage. But these non parametric classifiers are often more accurate and easy to set up on small training sets compared to parametric classifiers.
A good example of a non-parametric algorithm is the k nearest neighbor classifier. Given a new measurement we look up the k nearest neighbors and assign the most frequent class label to the measurement. When we take the nearest 5 neighbors of our measurement and look at our labels we get -1,-1,-1,+1,-1. We see that -1 is the most frequent label and hence we correctly assign our measurement to the -1, i.e. tomato class.
Opinions expressed by DZone contributors are their own.