We’ve all read about neural nets and know that they’re used in machine learning. Most of us know that they are in some way modeled after neurons and the way neurons are connected in the brain. But beyond that, what they are and how they actually work remains mysterious.
- The fundamental concept behind the neuron is analog. This is a bit confusing since many of us remember from biology class that neurons communicate via pulses. All of those pulses are pretty much the same. Neurons don't transmit different amplitude pulses but instead send streams of pulses at different frequencies. It’s like the difference between AM and FM radio, and it provides the same advantage of noise immunity (you can listen to FM during a thunderstorm). Neurons measure the temporal accumulation of pulses in order to determine whether they in turn emit a pulse. They do not count pulses.
- Artificial neural nets have been around for a while. Frank Rosenblatt invented the perceptron algorithm in 1957, a special-case neural net that contains a single layer of neurons. It is essentially a binary classifier. The initial algorithm was implemented on an early IBM digital computer, but because digital computers of the era were so slow, so big, and so expensive, much of the work on perceptrons was done on purpose-built electromechanical, and even electrochemical, implementations.
- Also, at about the same time (1959 through the early 60s), David Hubel did research on living neural nets in the visual cortex of cats. Combining the unexpected results from the biological experiments with the apparent layering of interconnected neurons seen in microscopic examination of brain tissue, Hubel inferred that visual processing happened in layers of neurons that connected laterally at the same level and transmitted output signals to follow-on (deeper) intermediate layers of neurons.
- A single-layer neural net (the perceptron) is essentially a simple linear classifier. If your problem has two parameters (X and Y), and each (X, Y) point represents one of two possible classes (A and B), then (if it is possible) the perceptron algorithm will find the equation of the straight line in the XY plane that best divides the group of points representing class A from the group of points representing class B.
- Each neuron in a neural net has multiple inputs and one output. Associated with each input is a weighting factor (you can think of it as a variable resistor, or a volume control) that determines how much of the input signal is perceived by the neuron. All of the training that occurs in any neural net is nothing more than the adjustment of the weighting factors for every input on every neuron. After the neural net is trained, we record all of those weights as our solution (a minimal sketch of such a neuron appears after this list).
- Neural nets operate in two distinct phases:
- The learning phase, in which the neural net is exposed to a lot of annotated training data. This is the hard part of the problem: it requires a large, well-prepared, clean input dataset and a great deal of human and computer time. The result is a straightforward (albeit potentially very large) algebraic equation.
- The operational phase, in which parameters (annotated just like the training data) are slotted into the algebraic equation created during the learning phase. The output of that equation can be tested against a threshold, resulting in an automated classification decision.

In short: single-layer neural nets were invented in the 1950s and implemented using analog hardware. Brain research showed that vision centers used hidden layers of neurons, and these hidden layers necessitated backpropagation algorithms. Training a neural net is hard; using it is easy. Deep neural nets train slowly and need more data.
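To make the weighting idea from the list above concrete, here is a minimal sketch in Python of a single artificial neuron: one weight per input, a weighted sum, and a threshold on the single output. The specific weights and threshold are made-up illustrative values; in a real net they would be found during training.

```python
# One artificial neuron: each input has its own weight (the "volume control"),
# and the only thing training ever changes is these weight values.
def neuron(inputs, weights, threshold=0.5):
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation > threshold else 0   # the single output

weights = [0.8, -0.3, 0.5]                 # recorded as "the solution" after training
print(neuron([1.0, 0.2, 0.4], weights))    # -> 1
print(neuron([0.1, 0.9, 0.0], weights))    # -> 0
```

Everything that follows (hidden layers, backpropagation, deep learning) is really about how those weight values get chosen.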
The perceptron was an interesting novelty in the late 1950s, and, as you might expect, the popular press got very excited and enthusiastically announced the technology with inappropriate hyperbole. The New York Times reported the perceptron to be “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” But very early on, researchers discovered that because the perceptron was just a linear classifier, it was unable to solve a wide range of problem types. Most notably, Marvin Minsky denigrated the perceptron because it couldn’t model a simple XOR gate. Perceptrons were abandoned for decades.
In fact, the k-means algorithm, one of the earliest machine learning algorithms (invented in the early 1950s), could use the same X, Y input data sets and could easily classify those points into an arbitrary number of classes. It had a big advantage over linear classifiers: a programmer could look at the code as well as the interim results and understand what was happening, because the algorithm was implemented in a way that was concrete and logical. Perceptrons had a black-box quality that seemed magical, and as we have progressed to the modern-day deep learning nets, it’s possible they have become mystical.
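To illustrate that transparency, here is a bare-bones k-means sketch (my own toy illustration: two dimensions, a fixed number of passes, and an arbitrary data set). Every intermediate value (the centroids and the per-point assignments) can be printed and inspected at any step.

```python
import random

def kmeans(points, k=2, iterations=10):
    # start with k randomly chosen points as the initial centroids
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # move each centroid to the average of the points assigned to it
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, clusters = kmeans(points)
print(centroids)   # one centroid near each group of points
```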
Because of the weaknesses of the single-layer perceptron, the field stagnated for decades. But eventually, the cross-pollination between the biological research into real neural nets and the computational research into simulated feedback and highly connected logical networks sparked the invention of the multilayer perceptron, which we know today as a neural network. Most of the work in the 1990s experimented with three-layer networks (sketched in code after the list below):
- The input layer, which accepted the input parameters (essentially a perceptron).
- The hidden layer, which accepted the outputs of the input layer (transformations of the original input parameters).
- The output layer, which accepted the outputs of the hidden layer and produced the output classification.
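As a rough sketch of that arrangement (the layer sizes and random weights here are placeholders; a real network learns its weight matrices during training), the forward pass through a three-layer network looks like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w_hidden = rng.normal(size=(2, 3))   # input layer (2 parameters) -> hidden layer (3 neurons)
w_output = rng.normal(size=(3, 1))   # hidden layer -> output layer (1 classification)

def forward(x):
    hidden = sigmoid(x @ w_hidden)   # transformations of the original input parameters
    return sigmoid(hidden @ w_output)

print(forward(np.array([0.5, -1.0])))   # a value between 0 and 1, tested against a threshold
```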
It seemed likely to the researchers that an intermediate layer of neurons might provide a computational advantage. After all, nature used multiple layers of neurons, and the experiments of Hubel suggested that the deeper layers computed a generalized version of the original input. But training a neural net with a hidden layer introduced some serious challenges. From the beginning, the core method for training a single layer neural net (perceptron) was a technique called gradient descent:
- For each data point: apply that data point's parameters as the inputs.
  - For each input:
    - Adjust the input weight by a tiny amount.
    - Test the actual output against the expected output.
    - If it is closer, keep the new weight value.
    - If it is farther, make the change in the other direction and keep that new weight value.
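Here is a sketch of that tweak-and-test loop for a single-layer neuron, repeated over the training data as the next paragraph describes. The toy AND-like data set, the fixed bias, and the step size are my own illustrative choices:

```python
# Hypothetical two-input training data (an AND-like function) and a neuron
# with a fixed bias; only the two input weights are adjusted.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

def output(inputs, weights, bias=-0.5):
    return 1 if sum(x * w for x, w in zip(inputs, weights)) + bias > 0 else 0

def total_error(weights):
    return sum(abs(output(x, weights) - target) for x, target in data)

weights = [0.0, 0.0]
step = 0.1
for _ in range(50):                        # repeat over the training set many times
    for inputs, target in data:            # for each data point...
        for i in range(len(weights)):      # ...and for each input weight
            before = total_error(weights)
            weights[i] += step             # adjust the weight by a tiny amount
            if total_error(weights) > before:
                weights[i] -= 2 * step     # worse? move the other way instead and keep it
print(weights, total_error(weights))       # the error settles at 0 for this toy problem
```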
Training a neural net involves repeating this tweak-and-test process over all of the training data until the average error between the expected output and the actual output plateaus. Of course, nothing is that simple, but the ancillary problems could be tackled with well-known techniques (e.g., stochastic weight value changes). The new problem specific to these hidden-layer networks was the less direct connection between input and output. The causal link between the inputs and the output was obscured, or at least blurred, by the hidden layer. How could you adjust the weights on the inputs of the hidden-layer neurons in a sensible and efficient way so that the network would converge on a solution? We could measure the error between the expected and actual output, but how could we propagate the desired correction back to the hidden-layer input weights and ultimately to the raw input-layer weights? This became a major pillar of research upon which multilayer neural networks, and in the extreme, deep learning networks, evolved.
This problem was worked on by a number of research groups, and the solution was invented independently a number of times. The core concept of backpropagation is to more accurately predict the amount of change needed for each weight by knowing the correlation between the local rate of change of the weight and the rate of change of the error it induces. This implies that we are operating on the derivatives of the weight and error changes. Because both of those must be differentiable, we can no longer use simple thresholds for the neuron activation. This is why the activation function becomes important, and many, if not most, neural networks use the logistic function because, in the extreme, it gives a binary output (0 to 1), but it is forgiving (differentiable) as it transitions between those values. It’s as if the neuron outputs are graded on a curve, not just pass-fail. In essence, the output layer is able to indicate to the hidden layer how much influence its recent weight changes are having on the output. Without backpropagation, multilayer neural nets would be at best computationally impractical and at worst impossible.
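Here is a condensed sketch of that idea on a tiny one-hidden-layer network, using the logistic (sigmoid) activation because its derivative is cheap to compute. The layer sizes, learning rate, iteration count, and XOR-style data are illustrative choices on my part, not anything prescribed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
w1, b1 = rng.normal(size=(2, 4)), np.zeros(4)    # input -> hidden
w2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # hidden -> output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR, the classic perceptron-breaker

lr = 0.5
for _ in range(20000):
    # forward pass
    hidden = sigmoid(X @ w1 + b1)
    out = sigmoid(hidden @ w2 + b2)
    # backward pass: the output error, scaled by the sigmoid's slope, is pushed
    # back through w2 so the hidden layer learns how much its recent weight
    # changes influenced the final output.
    out_delta = (out - y) * out * (1 - out)
    hidden_delta = (out_delta @ w2.T) * hidden * (1 - hidden)
    w2 -= lr * hidden.T @ out_delta
    b2 -= lr * out_delta.sum(axis=0)
    w1 -= lr * X.T @ hidden_delta
    b1 -= lr * hidden_delta.sum(axis=0)

print(out.round(2).ravel())   # typically converges toward [0, 1, 1, 0]
```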
As you might guess, neural nets with hidden layers take longer to train. Not only do we have more neurons and thus more weights to adjust, but we must also do more computation for each of those weights. In addition, because the direct causal relationship is blurred by the hidden layer, the system needs to take smaller steps to ensure that the gradient descent stays on track. And that means more passes over the training set. This problem only gets worse as we progress to deep learning systems that have many hidden layers (I have seen up to six). To get back to a biological reference, your cerebral cortex has six distinct layers.
So, if we continue to take our AI hints from biology, then perhaps our deep learning models are on the right track!