Artificial Neural Networks: Some Misconceptions (Part 2)
In the next part of this series on ANN misconceptions, learn how neural networks come in many architectures and why size matters, but bigger isn't always better.
Let's continue learning about misconceptions around artificial neural networks.
Neural Networks Come in Many Architectures
In Part 1, we discussed the simplest neural network architecture: the multi-layer perceptron. There are many different neural network architectures (far too many to mention here), and the performance of any neural network is a function of its architecture and weights. Many modern-day advances in the field of machine learning come not from rethinking the way that perceptrons and optimization algorithms work but from being creative about how these components fit together. Below, I discuss some very interesting and creative neural network architectures that have developed over time.
In recurrent neural networks, some or all connections flow backward, meaning that feedback loops exist in the network. These networks are believed to perform better on time series data. As such, they may be particularly relevant in the context of the financial markets. For more information, here is a link to a fantastic article.
This diagram shows three popular recurrent neural network architectures — namely, the Elman neural network, the Jordan neural network, and the Hopfield single-layer neural network.
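To make the feedback loop concrete, here is a minimal sketch of an Elman-style network in NumPy. The sizes and random weights are purely illustrative; the key point is that the hidden state from one time step feeds back into the next:

```python
import numpy as np

def elman_forward(xs, W_in, W_rec, W_out):
    """Forward pass of a minimal Elman network: the hidden state at each
    step feeds back into the next step's hidden computation."""
    h = np.zeros(W_rec.shape[0])           # context (previous hidden state)
    outputs = []
    for x in xs:
        h = np.tanh(W_in @ x + W_rec @ h)  # depends on input AND past state
        outputs.append(W_out @ h)
    return np.array(outputs)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))               # 5 time steps, 3 inputs each
W_in = 0.1 * rng.normal(size=(4, 3))       # input -> hidden
W_rec = 0.1 * rng.normal(size=(4, 4))      # hidden -> hidden (feedback loop)
W_out = 0.1 * rng.normal(size=(2, 4))      # hidden -> 2 outputs
ys = elman_forward(xs, W_in, W_rec, W_out)
print(ys.shape)  # (5, 2)
```

A Jordan network differs only in that the feedback comes from the output layer rather than the hidden layer.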
A more recent and interesting recurrent neural network architecture is the Neural Turing Machine, which combines a recurrent neural network architecture with external memory. It has been shown that these neural networks are Turing-complete and able to learn sorting algorithms and other computing tasks.
One of the first fully connected neural networks was the Boltzmann neural network, also known as the Boltzmann machine. These were the first networks capable of learning internal representations and solving very difficult combinatorial problems. One interpretation of the Boltzmann machine is that it is a Monte Carlo version of the Hopfield recurrent neural network. Boltzmann machines can be quite difficult to train, but when constrained, they can prove more efficient than traditional neural networks. The most popular constraint is to disallow direct connections between hidden neurons. This particular architecture is referred to as a Restricted Boltzmann Machine, which is used in Deep Boltzmann Machines.
This diagram shows how different connection patterns between the nodes of a Boltzmann machine can significantly affect the results of the neural network (graphs to the right of the networks).
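A toy sketch of the restriction in practice: because there are no hidden-hidden or visible-visible connections, a Restricted Boltzmann Machine can be trained with the standard contrastive divergence (CD-1) update. The sizes and data below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy RBM: 6 visible units, 3 hidden units; the "restriction" is that
# connections only run between the two layers, never within a layer.
W = 0.1 * rng.normal(size=(6, 3))
b_v = np.zeros(6)
b_h = np.zeros(3)

def cd1_update(v0, lr=0.1):
    """One contrastive-divergence (CD-1) weight update."""
    p_h0 = sigmoid(v0 @ W + b_h)                 # hidden given the data
    h0 = (rng.random(3) < p_h0).astype(float)    # sample the hidden units
    p_v1 = sigmoid(h0 @ W.T + b_v)               # reconstruct the visible units
    p_h1 = sigmoid(p_v1 @ W + b_h)               # hidden given the reconstruction
    # positive phase minus negative phase
    return lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))

v = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
W += cd1_update(v)
```

In an unrestricted Boltzmann machine, the intra-layer connections would force much slower sampling-based training, which is exactly why the restricted variant became the practical one.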
Deep neural networks are neural networks with multiple hidden layers. They have become extremely popular in recent years due to their unparalleled success in image recognition and voice recognition problems. The number of deep neural network architectures is growing quite quickly, but some of the most popular include deep belief networks, convolutional neural networks, deep restricted Boltzmann machines, and stacked auto-encoders. One of the biggest problems with deep neural networks, especially in the context of non-stationary financial markets, is overfitting. For more info, see DeepLearning.net.
This diagram shows a deep neural network that consists of multiple hidden layers.
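Structurally, "deep" just means stacking hidden layers; a forward pass is the same operation repeated. A minimal sketch with illustrative sizes:

```python
import numpy as np

def deep_forward(x, layers):
    """Forward pass through a stack of layers; 'deep' simply means more
    than one hidden layer between input and output."""
    for W, b in layers:
        x = np.tanh(x @ W + b)
    return x

rng = np.random.default_rng(8)
sizes = [10, 32, 32, 32, 1]   # input, three hidden layers, one output
layers = [(0.3 * rng.normal(size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
out = deep_forward(rng.normal(size=(4, 10)), layers)
print(out.shape)  # (4, 1)
```

Each additional layer adds free parameters, which is precisely why the overfitting concern mentioned above grows with depth.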
Adaptive neural networks are neural networks that simultaneously adapt and optimize their architectures while learning. This is done by either growing the architecture (adding more hidden neurons) or shrinking it (pruning unnecessary hidden neurons). I believe that adaptive neural networks are most appropriate for financial markets because markets are non-stationary: the features extracted by the neural network may strengthen or weaken over time depending on market dynamics. The implication is that any architecture that worked optimally in the past would need to be altered to work optimally today.
This diagram shows two different types of adaptive neural network architectures. The left image is a cascade neural network and the right image is a self-organizing map.
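The "shrinking" half of the idea can be sketched very simply. This is a hypothetical, crude saliency rule (incoming times outgoing weight magnitude), not any particular published pruning algorithm:

```python
import numpy as np

def prune_hidden(W1, W2, threshold=0.05):
    """Drop hidden neurons whose combined incoming/outgoing weight
    magnitude is negligible, shrinking the architecture."""
    importance = np.abs(W1).sum(axis=1) * np.abs(W2).sum(axis=0)
    keep = importance > threshold
    return W1[keep], W2[:, keep]

rng = np.random.default_rng(2)
W1 = rng.normal(size=(8, 3))   # input -> 8 hidden neurons
W2 = rng.normal(size=(2, 8))   # hidden -> 2 outputs
W1[3] = 0.0                    # neuron 3 contributes nothing
W1p, W2p = prune_hidden(W1, W2)
print(W1p.shape, W2p.shape)
```

Growing works in the opposite direction (e.g. cascade correlation adds hidden neurons one at a time), but the pruning direction is the easier one to illustrate.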
Although not a different type of architecture in the sense of perceptrons and connections, radial basis function networks use radial basis functions as their activation functions. These are real-valued functions whose outputs depend on the distance from a particular point. The most commonly used radial basis function is the Gaussian. Because radial basis functions can take on much more complex forms, they were originally used for performing function interpolation. As such, a radial basis function neural network can have a much higher information capacity. Radial basis functions are also used in the kernel of a support vector machine.
This diagram shows how curve fitting can be done using radial basis functions.
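Curve fitting with radial basis functions can be shown directly: place a Gaussian at each data point and solve a linear system so the sum passes exactly through the data. This is a minimal interpolation sketch, not a full RBF network with trained centers:

```python
import numpy as np

def gaussian_rbf(r, eps=1.0):
    """Gaussian radial basis function: output depends only on distance r."""
    return np.exp(-(eps * r) ** 2)

def rbf_interpolate(centers, values, query, eps=1.0):
    """Solve for weights so the RBF sum passes through every
    (center, value) pair, then evaluate at the query points."""
    Phi = gaussian_rbf(np.abs(centers[:, None] - centers[None, :]), eps)
    w = np.linalg.solve(Phi, values)
    Phi_q = gaussian_rbf(np.abs(query[:, None] - centers[None, :]), eps)
    return Phi_q @ w

centers = np.linspace(0.0, 3.0, 7)
values = np.sin(centers)
approx = rbf_interpolate(centers, values, centers)
print(np.allclose(approx, values))  # True: the interpolant hits every point
```

An RBF network relaxes this exact-interpolation setup by using fewer centers than data points and fitting the weights by least squares.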
In summary, many hundreds of neural network architectures exist, and the performance of one neural network can be significantly superior to another. As such, quantitative analysts interested in using neural networks should probably test multiple architectures and consider combining their outputs in an ensemble to maximize their investment performance.
Size Matters, but Bigger Isn’t Always Better
Having selected an architecture, one must then decide how large or small the neural network should be. How many inputs are there? How many hidden neurons should be used? How many hidden layers (if we are using a deep neural network)? And how many output neurons are required? These questions are important because if the neural network is too large (or too small), it could potentially overfit (or underfit) the data, meaning that the network would not generalize well out of sample.
How Many and Which Inputs Should Be Used?
The number of inputs depends on the problem being solved, the quantity and quality of available data, and perhaps some creativity. Inputs are simply variables that we believe have some predictive power over the dependent variable being predicted. If the inputs to a problem are unclear, you can systematically determine which variables should be included by looking at the correlations and cross-correlation between potential independent variables and the dependent variables.
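A sketch of that screening step on synthetic data (the variables, coefficients, and the 0.2 cutoff are all illustrative choices, not a recommended recipe):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)                              # strongly informative
x2 = rng.normal(size=n)                              # pure noise
x3 = rng.normal(size=n)                              # moderately informative
y = 2.0 * x1 + 0.8 * x3 + rng.normal(scale=0.5, size=n)

candidates = {"x1": x1, "x2": x2, "x3": x3}
corrs = {name: float(np.corrcoef(v, y)[0, 1]) for name, v in candidates.items()}
selected = [name for name, c in corrs.items() if abs(c) > 0.2]  # arbitrary cutoff
print(selected)   # x1 and x3 clear the threshold; the noise variable does not
```

In practice, the cutoff would be chosen with more care (e.g. against a significance level), but the mechanics are the same.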
There are two problems with using correlations to select input variables. First, if you are using a linear correlation metric, you may inadvertently exclude useful variables. Second, two relatively uncorrelated variables could potentially be combined to produce a strongly correlated variable. If you look at the variables in isolation, you may miss this opportunity. To overcome the second problem, you could use principal component analysis to extract useful eigenvectors (linear combinations of the variables) as inputs. That said, a problem with this is that the eigenvectors may not generalize well, and they also assume the distribution of input patterns is stationary.
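The eigenvector-extraction step amounts to diagonalizing the covariance matrix of the inputs. A toy sketch, with two correlated variables driven by one underlying factor:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two correlated raw inputs driven by one common underlying factor.
common = rng.normal(size=400)
x1 = common + 0.3 * rng.normal(size=400)
x2 = common + 0.3 * rng.normal(size=400)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)
# Principal components are the eigenvectors of the covariance matrix;
# their projections can be fed to the network as condensed inputs.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
pc1 = Xc @ eigvecs[:, np.argmax(eigvals)]
explained = float(eigvals.max() / eigvals.sum())
print(round(explained, 2))   # the leading component carries most of the variance
```

Note that this decomposition is fit on one sample of data; if the market regime shifts, the eigenvectors fit yesterday may no longer be the right combinations today, which is the stationarity caveat above.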
Another problem with selecting variables is multicollinearity. Multicollinearity is when two or more of the independent variables being fed into the model are highly correlated. In the context of regression models, this may cause regression coefficients to change erratically in response to small changes in the model or the data. Given that neural networks and regression models are similar, I suspect this is also a problem for neural networks.
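A standard diagnostic for multicollinearity is the variance inflation factor (VIF), sketched here with NumPy on made-up data (a common rule of thumb flags values above about 10):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2) from
    regressing that column on all the others."""
    n, k = X.shape
    scores = []
    for j in range(k):
        A = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1.0 - resid.var() / X[:, j].var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

rng = np.random.default_rng(4)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.05, size=300)   # nearly a duplicate of a
c = rng.normal(size=300)
scores = vif(np.column_stack([a, b, c]))
print(scores.round(1))                     # a and b blow up; c stays near 1
```

Dropping or combining one of the offending variables before training is the usual remedy.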
Last, but not least, one statistical bias that may be introduced when selecting variables is omitted variable bias. Omitted variable bias occurs when a model is created that leaves out one or more important causal variables. The bias is created when the model incorrectly compensates for the missing variable by over- or underestimating the effect of one of the other variables; e.g., the weights on those variables may become too large, or the sum of squared errors (SSE) may be large.
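The compensation effect is easy to demonstrate with a simple regression on synthetic data: omit a causal variable that is correlated with an included one, and the included variable's coefficient inflates to absorb the missing effect.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2000
z = rng.normal(size=n)                       # the causal variable we omit
x = 0.8 * z + 0.6 * rng.normal(size=n)       # correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(scale=0.5, size=n)

def ols_slope(x, y):
    """OLS slope of y on x (single regressor with intercept)."""
    return float(np.cov(x, y)[0, 1] / np.var(x, ddof=1))

# With z omitted, the slope on x absorbs part of z's effect.
print(round(ols_slope(x, y), 2))   # well above the true coefficient of 1.0
```

The same mechanism applies to a neural network's weights when an important input is left out.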
How Many Hidden Neurons Should You Use?
The optimal number of hidden units is problem-specific. That said, as a general rule of thumb, the more hidden units used, the higher the risk of overfitting. Overfitting is when the neural network does not learn the underlying statistical properties of the data but rather "memorizes" the patterns and any noise they may contain. This results in neural networks that perform well in sample but poorly out of sample. So, how can we avoid overfitting? There are two popular approaches used in the industry, namely early stopping and regularization, and then there is my personal favorite approach, global search.
Early stopping involves splitting your training set into a main training set and a validation set. Then, instead of training a neural network for a fixed number of iterations, you train until the performance of the neural network on the validation set begins to deteriorate. Essentially, this prevents the neural network from using all of the available parameters and limits its ability to simply memorize every pattern it sees. The image below shows two potential stopping points for the neural network (a and b).
The image below shows the performance and overfitting of the neural network when stopped at a or b.
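A minimal sketch of the early-stopping loop, using a toy two-layer network trained by full-batch gradient descent on synthetic data (the patience value and network sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic 1D regression task, split into training and validation sets.
X = rng.uniform(-1, 1, size=(120, 1))
y = np.sin(3 * X) + rng.normal(scale=0.1, size=(120, 1))
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

W1, b1 = 0.5 * rng.normal(size=(1, 16)), np.zeros(16)
W2, b2 = 0.5 * rng.normal(size=(16, 1)), np.zeros(1)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

best_val, patience, stall = np.inf, 20, 0
for epoch in range(5000):
    H, pred = forward(X_tr)
    err = pred - y_tr
    # Backpropagation through the two layers.
    gW2, gb2 = H.T @ err / len(X_tr), err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)
    gW1, gb1 = X_tr.T @ dH / len(X_tr), dH.mean(axis=0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= 0.1 * g
    # Early stopping: quit once validation error stops improving.
    val_mse = float(np.mean((forward(X_val)[1] - y_val) ** 2))
    if val_mse < best_val - 1e-6:
        best_val, stall = val_mse, 0
    else:
        stall += 1
        if stall >= patience:
            break
print(round(best_val, 3))
```

In a real implementation, one would also save the weights at the best validation point and restore them after stopping.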
Regularization penalizes the neural network for using complex architectures. Complexity in this approach is measured by the size of the neural network weights. Regularization is done by adding a term to the sum-squared-error objective function, which depends on the size of the weights. This is the equivalent of adding a prior, which essentially makes the neural network believe that the function it is approximating is smooth.
In one common formulation, the regularized objective is F = β·E_D + α·E_W, where E_D is the sum of squared errors, E_W = (1/n)·Σ wᵢ² is the mean squared weight, and n is the number of weights in the neural network. The parameters α and β control the degree to which the neural network overfits or underfits the data. Good values for α and β can be derived using Bayesian analysis and optimization. This and the above are explained in considerably more detail in this brilliant chapter.
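The penalty's effect is easy to see numerically: with identical errors, larger weights yield a larger objective, so the optimizer is pushed toward smoother, smaller-weight solutions. A toy illustration (α and β values here are arbitrary, not Bayesian-derived):

```python
import numpy as np

def regularized_objective(errors, weights, alpha=0.01, beta=1.0):
    """F = beta * E_D + alpha * E_W: the usual sum-squared-error term
    plus a penalty on the mean squared weight size."""
    E_D = float(np.sum(errors ** 2))
    E_W = float(np.mean(weights ** 2))
    return beta * E_D + alpha * E_W

errors = np.array([0.1, -0.2, 0.05])
small = np.array([0.3, -0.4, 0.2])
large = np.array([30.0, -40.0, 20.0])
print(regularized_objective(errors, small) < regularized_objective(errors, large))  # True
```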
My favorite technique, which is also by far the most computationally expensive, is global search. In this approach, a search algorithm is used to try different neural network architectures and arrive at a near-optimal choice. This is most often done using genetic algorithms, which are discussed further on in this article.
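A sketch of that search over a single architectural parameter (the hidden-layer size). The fitness function here is a made-up stand-in for "train a network of this size and score it out of sample"; in reality, each fitness evaluation is a full training run, which is why this approach is so expensive:

```python
import numpy as np

rng = np.random.default_rng(6)

def fitness(n_hidden):
    # Stand-in for out-of-sample performance: penalizes both tiny
    # (underfitting) and huge (overfitting) networks, with a noisy
    # optimum near 12 hidden units.
    return -((n_hidden - 12) ** 2) + rng.normal(scale=2.0)

pop = rng.integers(1, 64, size=10)             # candidate hidden-layer sizes
for _ in range(30):
    scores = np.array([fitness(n) for n in pop])
    parents = pop[np.argsort(scores)[-4:]]      # selection: keep the fittest
    pop = np.array([max(1, int((rng.choice(parents) + rng.choice(parents)) / 2
                               + rng.integers(-2, 3)))   # crossover + mutation
                    for _ in range(10)])
print(int(np.median(pop)))   # typically settles near the optimum
```

The same loop generalizes to richer encodings (layer counts, activation choices, connection patterns) at correspondingly higher cost.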
What Are the Outputs?
Neural networks can be used for either regression or classification. Under a regression model, a single real value is output, meaning that only one output neuron is required. Under a classification model, an output neuron is required for each potential class to which the pattern may belong. If the classes are unknown, unsupervised neural network techniques such as self-organizing maps should be used.
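The difference shows up only in the output layer. A minimal sketch with made-up activations and weights:

```python
import numpy as np

def output_layer(h, W, task):
    """Map the final hidden activations h to outputs: one linear unit
    for regression, a softmax over one unit per class for classification."""
    z = h @ W
    if task == "regression":
        return z                      # a single real-valued output
    e = np.exp(z - z.max())           # numerically stable softmax
    return e / e.sum()

h = np.array([0.2, -0.5, 0.8])
W_reg = np.array([[0.5], [1.0], [-0.25]])   # one output neuron
W_clf = 0.1 * np.ones((3, 4))               # four classes, four output neurons
print(output_layer(h, W_reg, "regression"))      # [-0.6]
probs = output_layer(h, W_clf, "classification")
print(probs.sum())                               # 1.0
```

For classification, the neuron with the highest probability is taken as the predicted class.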
In conclusion, the best approach is to follow Ockham's Razor, which argues that of two models with equivalent performance, the one with fewer free parameters will generalize better. On the other hand, one should never opt for an overly simplistic model at the cost of performance. Similarly, one should not assume that just because a neural network has more hidden neurons, and perhaps more hidden layers, it will outperform a much simpler network. Unfortunately, it seems to me that too much emphasis is placed on large networks and too little on making good design decisions. In the case of neural networks, bigger isn't always better.
Opinions expressed by DZone contributors are their own.