This article was excerpted from the book Algorithms of the Intelligent Web, 2nd edition.
To understand deep learning, let's consider the application of image recognition: given a picture or a video, how do we build classifiers that recognize the objects it contains? Such a system has potentially wide-reaching uses. With the advent of the quantified self, Google Glass is starting to take off, and we can imagine applications for this device that recognize objects and provide information to the user.
Let's take the example of recognizing a car. What deep learning attempts to do is build up layers of understanding, with each layer building upon the previous one. Figure 1 shows some of the possible layers of understanding for a deep network trained to recognize cars. Both this example and some of the images below have been reproduced from Andrew Ng's lecture on the subject.
Figure 1: Visualizing a deep network for recognizing cars. Some graphical content is taken from Andrew Ng's talk. A base training set of pictures is used to create a basis of edges. These edges can be composed together to detect parts of cars, and those parts of cars can in turn be combined to detect an object type, which in this case is a car.
At the bottom of Figure 1 we see a number of stock images of cars, which we will consider our training set. The question now is: how do we use deep learning to recognize the similarities between these images (i.e., that they all contain a car), possibly without any hand-labeled ground truth (the algorithm isn't told that each scene contains a car)?
As you will see, deep learning relies on progressively higher concept abstractions built directly from lower-level abstractions. In the case of our image recognition problem, we start out with the smallest element of information in our pictures: the pixel. The entire image set is used to construct a basis of features that can be used in composite to detect a slightly higher level of abstraction, such as lines and curves. At the next level, these lines and curves are combined to create parts of cars that have been seen in the training set, and these parts are further combined to create object detectors for a whole car!
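The pixels-to-edges-to-parts-to-object pipeline above can be sketched as a stack of simple layers. The following is a minimal illustrative sketch, not code from the book: the layer names and sizes are assumptions, and the weights are random placeholders standing in for the learned feature bases.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    """One layer of features: a linear map followed by a ReLU nonlinearity."""
    return np.maximum(0.0, w @ x)

# A fake 8x8 grayscale image, flattened to 64 pixel intensities.
pixels = rng.random(64)

# Randomly initialized weights stand in for learned feature bases;
# in a trained network each row would be one learned detector.
w_edges = rng.standard_normal((32, 64)) * 0.1   # pixels -> edge detectors
w_parts = rng.standard_normal((16, 32)) * 0.1   # edges  -> part detectors (wheels, doors, ...)
w_object = rng.standard_normal((1, 16)) * 0.1   # parts  -> whole-car detector

edges = layer(pixels, w_edges)     # lowest abstraction above raw pixels
parts = layer(edges, w_parts)      # mid-level abstraction
car_score = layer(parts, w_object) # highest abstraction: "is this a car?"

print(edges.shape, parts.shape, car_score.shape)  # (32,) (16,) (1,)
```

The point of the sketch is purely structural: each layer consumes only the previous layer's output, so higher-level detectors are built entirely from compositions of lower-level ones.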
There are two really important concepts to note here. First, note that no explicit feature engineering has been performed! In this example, unsupervised feature learning has been performed; that is, representations of the data have been learned without any explicit interaction from the user. This may parallel how we as humans perform recognition (and we are very good at pattern recognition indeed!).
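To make the idea of unsupervised feature learning concrete, here is a hypothetical toy sketch (not the book's method) using a tiny linear autoencoder: it learns a compressed representation of unlabeled data purely by minimizing reconstruction error, with no "car"/"not car" labels anywhere. The data dimensions, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Unlabeled "images": 200 samples of 16 correlated measurements.
# The data secretly lives in a 4-dimensional subspace.
basis = rng.standard_normal((4, 16))
data = rng.standard_normal((200, 4)) @ basis

# Encoder compresses 16 inputs to 4 learned features; decoder reconstructs.
w_enc = rng.standard_normal((16, 4)) * 0.1
w_dec = rng.standard_normal((4, 16)) * 0.1

initial_mse = np.mean((data @ w_enc @ w_dec - data) ** 2)

lr = 0.01
for _ in range(500):
    code = data @ w_enc        # encode: the learned representation
    recon = code @ w_dec       # decode: attempt to rebuild the input
    err = recon - data
    # Gradient descent on the reconstruction error -- no labels involved.
    g_dec = code.T @ err / len(data)
    g_enc = data.T @ (err @ w_dec.T) / len(data)
    w_dec -= lr * g_dec
    w_enc -= lr * g_enc

final_mse = np.mean((data @ w_enc @ w_dec - data) ** 2)
print(initial_mse, "->", final_mse)
```

Because the only training signal is "reconstruct your own input," the features in `code` are discovered from the structure of the data itself, which is the essence of learning representations without user-provided labels.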
The second important fact to note is that the concept of a car has never been made explicit. Provided there is sufficient variance in the input set of pictures, the highest-level "car" detectors should perform sufficiently well on any car presented.