# Machine Learning: The Bigger Picture, Part II

### The second part of Tamis van der Laan's article from DZone's Guide to Big Data Processing, Volume III, available now!

## Overfitting

So far we have assumed we only have a machine learning model, a training set of samples, and an optimization algorithm to learn from these examples. The next thing we will talk about is the problem of overfitting. If we take our example of a discriminative classifier, we see that it splits the space into two distinct regions, one for each class. We can also consider a classifier that splits the space into multiple regions per class. Given our example, we get the result in the figure below:

We see that the model now splits the space into multiple regions and manages to classify more samples in our training set correctly. Thus, it now labels more tomatoes as tomatoes and more kiwis as kiwis. You would think that we have improved our model, but what we have done is the exact opposite. Remember that some tomatoes are more green, while some kiwis are more red. This means that the two classes overlap in our single measurement and cannot be told apart perfectly. The extra regions therefore fit quirks of this particular training set rather than a real difference between the fruits, which is what we call overfitting.

## Testing and Validation

To show the results of overfitting, we take into account another set of samples. We now have one set of 100 samples, called the training set, used to fit the machine learning model, and another set of 100 samples, called the validation set, used to evaluate it. If we apply our model to the validation set, we get the figure below:

We see that our model performs much worse on our validation set than on our training set. This is because the classes overlap in our single measurement, so the extra regions the model carved out match the training samples but not new ones. This example shows why it is always important to use a validation set to check whether the model is overfitting the training data. More elaborate evaluation techniques are available, such as n-fold cross-validation, which can give a more accurate indication of how badly your model is overfitting the data.
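The gap between training and validation performance can be reproduced with a small simulation. The sketch below is an illustration only, not the author's experiment: the class means, the spread, and the two models (a nearest-mean classifier standing in for the single-boundary model, and a 1-nearest-neighbour classifier standing in for the many-region model) are all assumptions chosen to mimic two overlapping fruit classes.

```python
import random

def make_samples(n, rng):
    """Draw n labelled 1-D colour measurements from two overlapping classes."""
    samples = []
    for _ in range(n):
        label = rng.choice(["tomato", "kiwi"])
        mean = 1.5 if label == "tomato" else 0.0  # assumed class means
        samples.append((rng.gauss(mean, 1.0), label))
    return samples

def nearest_mean_classifier(train):
    """Simple model: one boundary, halfway between the two class means."""
    means = {}
    for label in ("tomato", "kiwi"):
        values = [x for x, y in train if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def nearest_neighbour_classifier(train):
    """Complex model: memorises every training sample (many small regions)."""
    return lambda x: min(train, key=lambda sample: abs(x - sample[0]))[1]

def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in samples) / len(samples)

rng = random.Random(0)             # fixed seed so the run is repeatable
train = make_samples(100, rng)     # used to fit the models
validation = make_samples(100, rng)  # held out, used only for evaluation

for name, build in [("nearest mean", nearest_mean_classifier),
                    ("1-nearest neighbour", nearest_neighbour_classifier)]:
    model = build(train)
    print(f"{name}: train={accuracy(model, train):.2f} "
          f"validation={accuracy(model, validation):.2f}")
```

The memorising model scores perfectly on the samples it was fitted to, because each training point is its own nearest neighbour. Only the held-out validation set reveals how much of that score is memorisation.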

## Multidimensional Inputs

So far we have only considered using one measurement to classify our fruits. We have also seen that because we only use one measurement, the two classes tend to overlap. One could therefore think it is a good idea to add a second measurement so that we can better separate our fruit classes. This is indeed the case: increasing the dimensionality (adding more measurements) also increases the accuracy we can attain. To show this, we are going to add an additional measurement. We are going to have a sensor which measures how soft or hard the surface of an object is. The idea is that because tomatoes have a hard surface and kiwis have a soft surface, we can get better separation between the two classes of fruit. We will compare the discriminative and generative classifiers for 1-dimensional and 2-dimensional input measurements:

In the plot above, we see that moving from 1 to 2 dimensions does increase the accuracy of our classifier. Not every measurement will improve the performance of our algorithm, however; poorly chosen measurements can in fact degrade it. Consider, for example, measuring the temperature of the environment or fluctuations in air movement. These measurements have nothing to do with our goal of classifying fruit, and will only distort our results.

## The Curse of Dimensionality

There is one catch to increasing the number of input measurements, which has been named the curse of dimensionality. The curse of dimensionality is not difficult to understand, but it can be deceptive if you don't know about it. The problem is well illustrated by considering the volume of a hypercube. A hypercube is the generalized term for a cube in n dimensions. For example, a hypercube in 2 dimensions is simply a square. The volume of a 2-dimensional hypercube is its area, which is r^{2}, where r is the length of the sides of the hypercube. A hypercube in 3 dimensions is a cube with volume r^{3}, and we could also consider a hypercube with 4 dimensions, which has a volume of r^{4}. We can also decrease the number of dimensions. A 1-dimensional hypercube is simply a line and has a volume (or length) of r, while a 0-dimensional hypercube is a single point and has a volume of r^{0}=1. We see that the volume of a generalized hypercube equals r^{n}, where n is the number of dimensions.

Thus, if we take a hypercube with side r=2 and increase the number of dimensions from 0 to 10, we see that its volume grows exponentially, doubling with every added dimension until it reaches 2^{10}=1024.
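The growth is easy to tabulate. The following sketch simply evaluates r^{n} for r=2; the function name `hypercube_volume` is introduced here for illustration.

```python
def hypercube_volume(side, dims):
    """Volume of a hypercube with the given side length in `dims` dimensions."""
    return side ** dims  # r^n; note that r**0 == 1 for a single point

# Tabulate the volume for r = 2 as the number of dimensions goes from 0 to 10.
for n in range(0, 11):
    print(f"n={n:2d}  volume={hypercube_volume(2, n)}")
```

Each extra dimension multiplies the volume by r, so for r=2 the volume doubles at every step.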

Now consider the case where we have 5 samples for our training set, and our input measurements all sit between 0 and 2. If we have 1 input measurement, we essentially have an input space equal to a 1-dimensional hypercube with side r=2. The 1-dimensional hypercube has a length of 2, which can be covered by 5 training samples. Now consider an input space with 3 dimensions (n=3). We then need to cover a volume of r^{n}=2^{3}=8, which cannot be properly covered by our 5 training samples. In fact, to keep 5 samples along each axis we need about 5^{3}=125 samples to properly cover the input space. This is the curse of dimensionality: the more input measurements or dimensions you have, the higher the accuracy you can achieve, but because the volume grows exponentially, you will also require exponentially more samples to properly cover the input space.
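One way to make "covering the input space" concrete is to cut each axis into a fixed number of bins and count how many of the resulting cells contain at least one sample. The sketch below does this for uniformly random samples between 0 and 2. The choice of 5 bins per axis, the uniform sampling, and the function names are assumptions made for illustration.

```python
import random

def samples_needed(per_axis, dims):
    """Samples required to keep `per_axis` samples along every dimension."""
    return per_axis ** dims

def covered_fraction(n_samples, dims, bins_per_axis=5, side=2.0, seed=0):
    """Fraction of grid cells hit by n_samples uniformly random points."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_samples):
        point = [rng.uniform(0.0, side) for _ in range(dims)]
        # Map each coordinate to its bin index, clamping the upper edge.
        cell = tuple(min(int(v / side * bins_per_axis), bins_per_axis - 1)
                     for v in point)
        occupied.add(cell)
    return len(occupied) / bins_per_axis ** dims

for n in (1, 2, 3):
    print(f"dims={n}  cells={samples_needed(5, n):3d}  "
          f"covered by 5 samples={covered_fraction(5, n):.3f}")
```

With only 5 samples, at most 5 cells can ever be occupied, so the covered fraction can be no larger than 5/5^{n} and shrinks quickly as dimensions are added.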

## Dimensionality Reduction

We have just seen that as the dimensionality of the input grows, we require exponentially more samples to properly cover the input space. In the real world, however, samples often lie on what is called a multidimensional manifold. Although the term multidimensional manifold sounds complex, the fundamental idea is quite simple. To illustrate it, we consider a 2-dimensional input space with a 1-dimensional manifold.

We see that we have two input dimensions but the data essentially sits on a 1 dimensional line (the manifold). What we can do is project the data onto this line to get a single value for each sample. Hence we have reduced the space from 2 dimensions to just 1 dimension by exploiting our knowledge of the manifold.
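The projection step can be sketched in a few lines. Here the line is not known in advance but estimated from the data as the direction of largest variance, which is the idea behind principal component analysis. That choice, and the function names, are assumptions for illustration; the article only states that the data is projected onto the manifold.

```python
import math

def principal_direction(points):
    """Return the mean of 2-D points and the unit vector of largest variance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Orientation of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def project(points):
    """Reduce each 2-D point to one value: its position along the line."""
    (mx, my), (dx, dy) = principal_direction(points)
    return [(x - mx) * dx + (y - my) * dy for x, y in points]

# Toy data lying close to the line y = 2x (made-up values).
data = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.1), (3.0, 5.9)]
print(project(data))
```

Each sample is now described by a single number instead of two, while the ordering of the samples along the line is preserved.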

## Feature Engineering

So far, we have been talking about the measurements of object properties when it comes to generating input for our classifiers. These measurements are actually called features in the machine learning literature. Features can also be constructed from multiple measurements. For example, in addition to our color measurement, we might also want to measure the shape of the object. We do this by using a camera that takes a picture of the object. Using several image processing techniques, we extract the shape of the object, which we use as an additional feature.
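As a rough illustration of turning a picture into a shape feature, the sketch below takes an already segmented image, written as rows of characters where `#` marks an object pixel, and derives two numbers from it. The mask format, the two features (bounding-box aspect ratio and fill ratio), and the function name are hypothetical choices; the article does not say which image processing techniques are used.

```python
def shape_features(mask):
    """Compute simple shape features from a binary mask ('#' = object pixel)."""
    pixels = [(r, c)
              for r, row in enumerate(mask)
              for c, value in enumerate(row)
              if value == "#"]
    if not pixels:
        raise ValueError("mask contains no object pixels")
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return {
        # Width relative to height of the bounding box (1.0 = as wide as tall).
        "aspect_ratio": width / height,
        # Share of the bounding box actually filled by the object.
        "extent": len(pixels) / (width * height),
    }

# A made-up, roughly oval object.
oval = [
    "......",
    "..##..",
    ".####.",
    ".####.",
    "..##..",
    "......",
]
print(shape_features(oval))
```

The resulting numbers can be appended to the color measurement to form a richer feature vector for the classifier.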

This process is called feature engineering, and it plays an important role in building good classification systems. Note that, in principle, feature engineering is not strictly needed. Given enough data, a complex machine learning model should be able to learn the features directly from the data. Such machine learning systems exist, and the most popular and effective are called deep learning systems. Because only raw input data is provided to such systems and they have to learn the features directly from examples, they require massive training sets to work. One thing to note is that there is a limit to feature engineering. Feature engineering is in essence a way of transferring knowledge about the domain directly from a human to a machine. However, humans are limited in their ability to write down all the subtleties of the domain in the form of features. Therefore, learning from raw data often produces more accurate results, given enough data.
