
Machine Learning: The Bigger Picture, Part II


The second part of Tamis van der Laan's article from DZone's Guide to Big Data Processing, Volume III, available now!


This is the second part of Tamis van der Laan's article featured in the new DZone Guide to Big Data Processing, Volume III. Get your free copy for more insightful articles, industry statistics, and more. 

Overfitting

So far, we have assumed we only have a machine learning model, a training set of samples, and an optimization algorithm to learn from those examples. The next thing we will talk about is the problem of overfitting. If we take our example of a discriminative classifier, we see that it splits the space into two distinct regions, one for each class. We can also consider a classifier that splits the space into multiple regions for classification. Given our example, we get the result in the figure below:

[Figure: a classifier that splits the single color measurement into multiple regions]

We see that the model now splits the space into multiple regions and manages to classify more samples in our training set correctly. Thus, it now labels more tomatoes as tomatoes and kiwis as kiwis. You would think that we have improved our model, but we have done the exact opposite. Remember that some tomatoes are more green, while some kiwis are more red. This means that the two classes overlap based on our single measurement and are indistinguishable from each other. By carving out extra regions to capture these overlapping samples, the model is fitting noise in the training set rather than the underlying pattern.
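To make this concrete, here is a minimal sketch in Python using scikit-learn and synthetic data; the feature name, the class distributions, and the choice of a decision tree are illustrative assumptions, not taken from the article. An unconstrained tree is free to carve the single measurement into many tiny regions and scores almost perfectly on the training set, while a single-split model does not:

```python
# Sketch: an overly flexible classifier carving one measurement into many regions.
# The data is synthetic and purely illustrative; "redness" stands in for the
# color measurement, and the class overlap is generated on purpose.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two overlapping classes along a single measurement.
redness_a = rng.normal(loc=0.7, scale=0.15, size=100)   # class 0
redness_b = rng.normal(loc=0.5, scale=0.15, size=100)   # class 1
X = np.concatenate([redness_a, redness_b]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

# A tree with no depth limit splits the axis into many small regions.
overfit_model = DecisionTreeClassifier(random_state=0).fit(X, y)
simple_model = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

print("flexible model, training accuracy:   ", overfit_model.score(X, y))  # close to 1.0
print("single-split model, training accuracy:", simple_model.score(X, y))  # lower, but honest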

Testing and Validation

To show the results of overfitting, we take into account another set of samples. We now have one set of 100 samples, called the training set, used to learn a machine learning model, and another set of 100 samples, called the validation set, used to evaluate the model. If we apply our model to the validation set, we get the figure below:

[Figure: the overfitted model applied to the validation set]

We see that our model performs much worse on our validation set than on our training set. This is because the two classes overlap on our single measurement, so the extra regions the model learned capture noise in the training set rather than a real pattern. This example shows that it is always important to use a validation set to check whether the model is overfitting the training data. More complex evaluation techniques are available, such as n-fold cross-validation, which can give a more accurate indication of how badly your model is overfitting the data.
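As a rough sketch of how this check looks in practice (again using scikit-learn and the same kind of synthetic data, which are my own choices rather than anything prescribed by the article), we can hold out a validation set and optionally run n-fold cross-validation:

```python
# Sketch: detecting overfitting with a held-out validation set and k-fold cross-validation.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0.7, 0.15, 100), rng.normal(0.5, 0.15, 100)]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

# Hold out a validation set: train on one half, evaluate on the other.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:  ", model.score(X_train, y_train))  # looks great
print("validation accuracy:", model.score(X_val, y_val))      # noticeably worse -> overfitting

# n-fold cross-validation averages over several train/validation splits,
# giving a more stable estimate than a single split.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("5-fold cross-validation accuracy:", scores.mean())
```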

Multidimensional Inputs

So far, we have only considered using one measurement to classify our fruits. We have also seen that because we only use one measurement, the two classes tend to overlap. One could therefore think it is a good idea to add a second measurement so that we can better separate our fruit classes. This is indeed the case, and increasing the dimensionality (adding more measurements) can also increase the accuracy we can attain. To show this, we are going to add an additional measurement: a sensor which measures how soft or hard the surface of an object is. The idea is that because apples have a hard surface and kiwis have a soft surface, we can get better separation between the two classes of fruit. We will compare the discriminative and generative classifiers for 1-dimensional and 2-dimensional input measurements:

[Figure: discriminative and generative classifiers with 1-dimensional vs. 2-dimensional inputs]

In the plot above, we see that moving from 1 to 2 dimensions does increase the accuracy of our classifier. Obviously, not all measurements are going to increase the performance of our algorithm; poorly chosen measurements can in fact degrade it. Consider, for example, measuring the temperature of the environment or fluctuations in air movement. These measurements have nothing to do with our goal of classifying fruit and will only distort our results.
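The sketch below illustrates the same idea with synthetic data; the feature names, their distributions, and the choice of logistic regression are assumptions made for illustration. Adding an informative second measurement raises cross-validated accuracy, while adding an irrelevant one does not:

```python
# Sketch: comparing a classifier trained on one measurement vs. two.
# "redness" alone overlaps between the classes, while "hardness" helps separate them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 100
redness = np.concatenate([rng.normal(0.7, 0.15, n), rng.normal(0.5, 0.15, n)])
hardness = np.concatenate([rng.normal(0.8, 0.10, n), rng.normal(0.3, 0.10, n)])
noise = rng.normal(size=2 * n)  # an irrelevant measurement, e.g. room temperature
y = np.array([0] * n + [1] * n)

def cv_accuracy(features):
    """Cross-validated accuracy for a given list of feature columns."""
    X = np.column_stack(features)
    return cross_val_score(LogisticRegression(), X, y, cv=5).mean()

print("redness only:          ", cv_accuracy([redness]))
print("redness + hardness:    ", cv_accuracy([redness, hardness]))
print("redness + random noise:", cv_accuracy([redness, noise]))
```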

The Curse of Dimensionality

There is one catch to increasing the number of input measurements, which has been named the curse of dimensionality. The curse of dimensionality is not difficult to understand, but it can be deceptive if you don't know about it. The problem is well illustrated by considering the volume of a hypercube. A hypercube is the generalized term for a cube in n dimensions. For example, a hypercube in 2 dimensions would simply be a square. The volume of a 2-dimensional hypercube would be its area, which is r^2, where r is the length of its sides. A hypercube in 3 dimensions would be a cube with volume r^3, but we could also consider a hypercube in 4 dimensions, which would have a volume of r^4. We can also decrease the number of dimensions: a 1-dimensional hypercube is simply a line segment with volume (length) r, while a 0-dimensional hypercube is a single point (whose volume is r^0 = 1 by convention). We see that the volume of a generalized hypercube equals r^n, where n is the number of dimensions.

Thus, if we take a hypercube with r = 2 and increase the number of dimensions from 0 to 10, we see that its volume increases exponentially.

[Figure: hypercube volume r^n for r = 2, for n = 0 to 10]

Now consider the case where we have 5 samples for our training set, and our input measurements all sit between 0 and 2. If we have 1 input measurement, we essentially have an input space equal to a 1-dimensional hypercube with side r = 2. This 1-dimensional hypercube has a volume of 2, which can be covered by 5 training samples. Now consider an input space with 3 dimensions (n = 3); then we need to cover a volume of r^n = 2^3 = 8, which cannot be properly covered by our 5 training samples. In fact, we need about 5^3 = 125 samples to properly cover the input space. This is the curse of dimensionality: the more input measurements or dimensions you have, the higher the accuracy you can achieve, but because the volume grows exponentially, you also require exponentially more samples to properly cover the input space.
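A few lines of arithmetic make the growth explicit; the loop simply evaluates the volume r^n and the corresponding sample count 5^n used in the example above:

```python
# Sketch: the volume of a hypercube with side r grows as r**n, so the number
# of samples needed to cover the input space at a fixed density grows just as fast.
r = 2                  # side length of the hypercube
samples_per_axis = 5   # samples that comfortably cover a single dimension

for n in range(0, 11):
    volume = r ** n
    samples_needed = samples_per_axis ** n
    print(f"dimensions={n:2d}  volume={volume:6d}  samples needed ~ {samples_needed}")
```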


Dimensionality Reduction

We have just seen that as the dimensionality of the input grows, we require exponentially more samples to properly cover the input space. In the real world, however, samples often lie on what is called a multidimensional manifold. Although the term multidimensional manifold sounds complex, the fundamental idea is quite simple. To illustrate it, we consider a 2-dimensional input space with a 1-dimensional manifold.

[Figure: a 2-dimensional input space with data lying on a 1-dimensional manifold]

We see that we have two input dimensions, but the data essentially sits on a 1-dimensional line (the manifold). What we can do is project the data onto this line to get a single value for each sample. Hence, we have reduced the space from 2 dimensions to just 1 dimension by exploiting our knowledge of the manifold.
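One common way to perform such a projection in practice is principal component analysis (PCA); the article does not prescribe a specific technique, so treat the sketch below, with its synthetic 2-dimensional data lying near a line, as one illustrative choice:

```python
# Sketch: projecting 2-D data that lies near a 1-D line onto that line with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
t = rng.uniform(0, 1, 200)                                # position along the underlying line
X = np.column_stack([t, 2 * t]) + rng.normal(scale=0.02, size=(200, 2))  # 2-D inputs with noise

pca = PCA(n_components=1)
X_1d = pca.fit_transform(X)                               # one value per sample

print("explained variance ratio:", pca.explained_variance_ratio_[0])  # close to 1.0
print("shape before:", X.shape, " after:", X_1d.shape)
```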

Feature Engineering

So far, we have been talking about measurements of object properties when it comes to generating input for our classifiers. These measurements are actually called features in the machine learning literature. Features can also be constructed from multiple measurements. For example, in addition to our color measurement, we might also want to measure the shape of the object. We do this by using a camera that takes a picture of the object. Using several image processing techniques, we extract the shape of the object, which we use as an additional feature.

This process is called feature engineering, and it plays an important role in building good classification systems. Note that, in principle, feature engineering is not strictly needed: given enough data, a complex machine learning model should be able to learn the features directly from the data. Such machine learning systems exist, and the most popular and effective are called deep learning systems. Because only raw input data is provided to such systems and they have to learn the features directly from examples, they require massive training sets to work. One thing to note is that there is a limit to feature engineering. Feature engineering is, in essence, a way of transferring knowledge about the domain directly from a human to a machine. However, humans are limited in their ability to write down all the subtleties of the domain in the form of features. Therefore, learning from raw data often produces more accurate results, given enough data.
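As a toy illustration of feature engineering, here is a small sketch; the measurement names, their values, and the "roundness" feature are invented for this example, and the article's shape feature would instead come from actual image processing:

```python
# Sketch: hand-engineering a feature from raw measurements and combining it with
# an existing measurement to form the input matrix for a classifier.
import numpy as np

def roundness(width_cm, height_cm):
    """A simple hand-crafted shape feature: 1.0 for a perfectly round object,
    smaller values for elongated ones."""
    width_cm = np.asarray(width_cm, dtype=float)
    height_cm = np.asarray(height_cm, dtype=float)
    return np.minimum(width_cm, height_cm) / np.maximum(width_cm, height_cm)

# Raw measurements for a few objects (illustrative values).
width = [7.5, 6.8, 5.0]
height = [7.3, 7.0, 6.5]
redness = [0.9, 0.8, 0.2]

# Combine raw and engineered features into one input matrix.
X = np.column_stack([redness, roundness(width, height)])
print(X)
```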

For more insights on machine learning, neural nets, data health, and more, get your free copy of the new DZone Guide to Big Data Processing, Volume III!

Tamis works at Stackstate, which provides a real-time IT operations monitoring tool for DevOps teams. For more information, visit stackstate.com. 



