Experimenting With Generative Capsule Networks
Let's look at an attempt to overcome the limitation of rotation by proposing a generative version of capsule networks.
We have been experimenting with Generative Capsule Networks (CapsNets), which have the potential to structure SingularityNET’s disparate data and services.
Generative Capsule Networks have demonstrated the capability to generalize the process of generating shifted images to never-encountered shifts far outside the range of the training set. They have also shown better performance in one-shot transfer of the capability to reconstruct rotated images from some classes to others, and in extrapolating outside the range of the training set.
Models with better generalization and transfer learning performance would greatly benefit the network, whose nodes will be heavily reused for different tasks and trained on user data that might not be well prepared.
In a previous post, we studied the capabilities of networks with traditional models of formal neurons to generalize transformations. In particular, we found that the (adversarial) autoencoder architectures we considered can learn to rotate images by a specified angle if images of the same class and similar angles were in the training set.
However, their reconstruction of images to be rotated by a novel angle is much worse, even if this angle was presented in the training set for images of other classes. The same is true even for shifts. The figure below shows examples of the reconstruction of "4" with novel shifts; the last row shows the result of the reconstruction for the shift encountered in the training set.
Reconstruction of images for novel shifts by autoencoders
Here, we make an attempt to overcome this limitation by proposing a generative version of capsule networks.
The idea behind CapsNets is attractive in the context of the considered problem. Indeed, capsules were intended for representing object-part relationships, and they have non-scalar outputs for representing their "poses". Poses do not necessarily imply spatial transformations and can encode more than coordinates on the image plane; that is, the routing mechanism is quite general. Although CapsNets didn’t show the capability of learning invariants in our experiments, they might be a better starting point than traditional networks.
We would like to generate images of objects subjected to some unknown transformation, whose parameters can be treated as an object "pose." We would also like our model to generate an object as a set of its parts, whose poses are connected to the pose of the object but can differ from it.
Imagine we have a neuron corresponding to some visual concept, e.g. the digit "4." This digit consists of a few strokes. It can have a different position, size, and orientation, as well as line width or other style features. Thus, we can supplement the activation of the neuron, whose high value means that "4" should be generated, with a vector describing its pose. This is similar to capsules but works in the opposite direction, so we will call them generative capsules. Such a generative capsule should activate several capsules of the next level and set their poses depending on its own pose. Apparently, routing mechanisms are not needed here. Routing is needed to invert this generative process in the discriminative CapsNets.
Each generative capsule can activate several capsules of the next level, and different higher-level capsules can activate the same lower-level capsules, but with different poses. For example, "4" and "7" have the same element "|", but at a different location and orientation. Thus, we introduce one dense layer, which is similar to a conventional dense layer, except that a pose vector is calculated in addition to the activation output.
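The dense layer of generative capsules described above could look roughly like the following NumPy sketch. It is not the authors' implementation: the parameter names (W_act, W_pose, b_pose), the sigmoid activation, and the way pose proposals from several parents are averaged are all assumptions made for illustration.

```python
import numpy as np

def generative_capsule_dense(act_in, pose_in, W_act, W_pose, b_pose):
    """One dense layer of generative capsules (illustrative sketch only).

    act_in : (n_in,)                     activations of higher-level capsules
    pose_in: (n_in, d_in)                pose vectors of higher-level capsules
    W_act  : (n_in, n_out)               parent -> child activation weights
    W_pose : (n_in, n_out, d_in, d_out)  parent-pose -> child-pose maps (assumed form)
    b_pose : (n_in, n_out, d_out)        pose offset of each child relative to each parent
    """
    # Activation output: the same weighted sum as in a conventional dense layer.
    act_out = 1.0 / (1.0 + np.exp(-(act_in @ W_act)))             # (n_out,)

    # Pose proposed by every parent for every child.
    proposed = np.einsum("id,ijde->ije", pose_in, W_pose) + b_pose  # (n_in, n_out, d_out)

    # Assumed combination rule: average proposals, weighted by each parent's contribution.
    contrib = act_in[:, None] * np.abs(W_act)                     # (n_in, n_out)
    weights = contrib / (contrib.sum(axis=0, keepdims=True) + 1e-8)
    pose_out = np.einsum("ij,ije->je", weights, proposed)         # (n_out, d_out)
    return act_out, pose_out

rng = np.random.default_rng(0)
n_in, n_out, d_in, d_out = 10, 16, 12, 4          # sizes chosen for the example only
W_act = rng.normal(size=(n_in, n_out))
W_pose = rng.normal(size=(n_in, n_out, d_in, d_out))
b_pose = rng.normal(size=(n_in, n_out, d_out))

act_in = np.zeros(n_in)
act_in[4] = 1.0                                    # only the capsule for "4" is active
pose_in = np.tile(rng.normal(size=d_in), (n_in, 1))  # one pose shared by all parents
act_out, pose_out = generative_capsule_dense(act_in, pose_in, W_act, W_pose, b_pose)
```

With a single active parent, every child simply receives the pose that this parent proposes for it, so "4" and "7" can place the same stroke capsule at different locations and orientations.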
In generative CapsNets, the trickier part is to go from capsules to convolutional feature maps. In discriminative CapsNets, conventional feature maps are used to activate primary capsules and to calculate pose vectors/matrices, which depend on the positions of active neurons on the feature maps. This is quite straightforward. Here, we need to calculate the positions of the neurons to be activated on the feature maps from the pose vectors of generative capsules. This should not be done with conventional connections between neurons. With such connections, the network would only be able to memorize which neurons to activate for which capsules and poses. It would not be able to generalize to decide which neurons should be activated for a novel pose.
Thus, some form of dynamic routing is needed here. However, it cannot be the same routing as used in the discriminative capsules (routing by agreement). So, we propose routing by address. Let neurons on the feature map possess internal addresses. A generative capsule produces an address vector, which is compared with the internal addresses of the neurons of the next layer, and the capsule connects to those neurons whose addresses match best.
This can be considered not as a heuristic, but as a general mechanism belonging to the same class of dynamic addressing as used in the Neural Turing Machine and similar models.
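A minimal sketch of routing by address, assuming that neuron addresses are normalized (x, y) coordinates and that matching is a softmax over negative squared distances. The article does not specify the matching function, so route_by_address and its sharpness parameter are hypothetical.

```python
import numpy as np

def coordinate_addresses(height, width):
    """Pre-defined addresses: the (x, y) position of every neuron, scaled to [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, height),
                         np.linspace(-1, 1, width), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel()], axis=1)      # (height * width, 2)

def route_by_address(query, addresses, sharpness=50.0):
    """Soft connection weights from one capsule to all neurons (assumed form).

    Neurons whose address is closest to the query receive the largest weight.
    """
    sq_dist = ((addresses - query) ** 2).sum(axis=1)
    logits = -sharpness * sq_dist
    w = np.exp(logits - logits.max())                      # numerically stable softmax
    return w / w.sum()

addresses = coordinate_addresses(7, 7)
query = np.array([0.3, -0.7])                              # address produced by a capsule
weights = route_by_address(query, addresses).reshape(7, 7)
peak_row, peak_col = np.unravel_index(weights.argmax(), weights.shape)
```

Because the comparison is defined for any coordinate, a capsule can address a neuron it never activated during training, as long as it computes the right address from its pose.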
If the internal addresses of neurons on the feature maps correspond to their coordinates, generative capsules will be able to calculate proper addresses for transformed patterns. These addresses can be considered as network parameters to be learned. However, it is very difficult to train such a network end-to-end without separating the training stages for filters, weights, and addresses. We tried this, but the performance of the network turned out to be poor. Thus, we used pre-defined addresses corresponding to the positions of neurons on the feature maps in this study. These positions naturally follow from the convolution operation applied to the maps, so they don’t carry additional prior information.
The next problem is that the filters for the (transposed) convolution are learned in such a way that they are not arranged in any meaningful order. Imagine that one capsule represents the stroke "|", which can be shifted, rotated, and can vary in thickness. The capsule can calculate the (x, y) coordinates of the neurons to be activated on the feature maps, but how can it decide which feature map to select if the feature maps don’t correspond to the same filter systematically rotated by an increasing angle? If the capsule has never been trained within a certain range of poses, it will not be able to guess the appropriate index of the feature map to be activated. Thus, it will only be able to memorize which poses correspond to which feature-map indices, not to generalize or extrapolate. This is exactly the same problem that makes discriminative capsules incapable of extrapolating rotation invariance, as we found out earlier.
However, we can still hope that the network will be able to transfer its capability to generate images for lower-level capsules in different poses from one object to another. Thus, we don’t try to solve the problem of extrapolating how to address feature maps for completely novel poses. In our implementation, each capsule calculates the activation of each feature map directly from its pose (by the traditional weighted summation) instead of producing a feature-map index as part of the output address.
In contrast to the original CapsNets, our model consists of a conventional convolutional network as an encoder and a generative capsule network as a decoder.
We considered generative CapsNets with different parameters. Here, we analyze the following one. The encoder has two convolutional layers with kernel=4 and stride=2 (64 and 128 feature maps), an intermediate dense layer with 1024 units, and a final dense layer. The final layer outputs one categorical variable with 10 values, treated as the activations of the primary generative capsules, and 10 real-valued variables, treated as a shared pose vector. This pose vector is extended with the cos and sin of the angle by which the input image should be rotated (or with the shifts by which it should be shifted).
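The split between spatial addressing and channel selection can be illustrated as follows. This is a sketch under assumptions: the linear maps W_xy and W_chan, the ReLU, and the softmax addressing are hypothetical stand-ins, not the published implementation.

```python
import numpy as np

def capsule_to_feature_maps(act, pose, W_xy, W_chan, height, width, sharpness=50.0):
    """Write one generative capsule onto a stack of feature maps (sketch).

    act   : scalar activation of the capsule
    pose  : (d,) pose vector
    W_xy  : (d, 2) maps the pose to an (x, y) address in [-1, 1] coordinates
    W_chan: (d, n_maps) maps the pose to one activation per feature map
    """
    # Spatial part: routing by address over pre-defined neuron coordinates.
    ys, xs = np.meshgrid(np.linspace(-1, 1, height),
                         np.linspace(-1, 1, width), indexing="ij")
    x0, y0 = pose @ W_xy
    logits = -sharpness * ((xs - x0) ** 2 + (ys - y0) ** 2)
    spatial = np.exp(logits - logits.max())
    spatial /= spatial.sum()                            # (height, width), sums to 1

    # Channel part: plain weighted summation of the pose; no map index is produced.
    channel = np.maximum(pose @ W_chan, 0.0)            # (n_maps,)

    return act * channel[:, None, None] * spatial[None, :, :]

# Toy parameters: the first two pose elements are the position, the third is "style".
W_xy = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
W_chan = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 2.0, -1.0]])
maps = capsule_to_feature_maps(1.0, np.array([0.3, -0.7, 1.0]), W_xy, W_chan, 7, 7)
```

In this toy setup, changing the first two pose elements moves the activation across the maps without changing which maps are used, while the channel mix is learned and therefore only as general as the poses seen in training.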
The decoder has one intermediate layer of generative capsules, which are activated by primary capsules and which activate neurons on the feature maps, and two (transposed) convolutional layers.
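As a small worked example of the encoder output described above, the sketch below shows how a 10-element shared pose could be extended with the transformation parameters, and how the spatial size shrinks through the two convolutions. The padding of 1 and the 28x28 input size are assumptions (the article gives only kernel and stride), and extend_pose is a hypothetical helper.

```python
import numpy as np

def conv_out_size(n, kernel=4, stride=2, padding=1):
    """Output size of a convolution along one axis (padding=1 is an assumption)."""
    return (n + 2 * padding - kernel) // stride + 1

def extend_pose(shared_pose, angle=None, shift=None):
    """Append transformation parameters to the shared pose vector.

    For rotation, cos and sin of the angle are appended; for shifts, (dx, dy).
    """
    extra = []
    if angle is not None:
        extra += [np.cos(angle), np.sin(angle)]
    if shift is not None:
        extra += list(shift)
    return np.concatenate([shared_pose, np.asarray(extra, dtype=float)])

size_after_conv1 = conv_out_size(28)                 # MNIST images are 28x28
size_after_conv2 = conv_out_size(size_after_conv1)
pose = extend_pose(np.zeros(10), angle=np.pi / 2)    # 10 shared values + cos + sin
```

Encoding the angle as (cos, sin) rather than as a raw number keeps the representation continuous across the wrap-around at ±180°.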
Our implementation of different models for different experiments can be found here.
First, we checked the capability of our model to generalize the process of generating shifted images. As in our previous experiment with traditional convolutional autoencoders, we fed the model the original images from MNIST and trained it to reconstruct digits shifted by the provided values. For the digits "4" and "9", these values belonged to a narrower range than for the other digits. The question was whether the model would be able to reconstruct "4" and "9" shifted by values outside the range presented in the training set. It turned out that generative capsule networks are capable of doing this, in contrast to the traditional autoencoders (see the figure above):
Successful reconstruction for novel shifts by the generative CapsNet
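The construction of training targets for this shift experiment can be sketched as follows. The concrete ranges (8 pixels for most digits, 3 pixels for "4" and "9") are assumptions for illustration, since the article does not state them, and shift_image and sample_shift are hypothetical helpers.

```python
import numpy as np

def shift_image(img, dx, dy):
    """Shift a 2-D image by integer (dx, dy) pixels, filling vacated pixels with zeros."""
    h, w = img.shape
    out = np.zeros_like(img)
    y0 = max(0, -dy)
    y1 = max(y0, min(h, h - dy))
    x0 = max(0, -dx)
    x1 = max(x0, min(w, w - dx))
    src = img[y0:y1, x0:x1]                      # part of the image that stays visible
    ty, tx = max(0, dy), max(0, dx)
    out[ty:ty + src.shape[0], tx:tx + src.shape[1]] = src
    return out

def sample_shift(label, rng, full_range=8, narrow_range=3, restricted=(4, 9)):
    """Sample a training shift; restricted digits only see a narrower range (assumed values)."""
    r = narrow_range if label in restricted else full_range
    dx, dy = rng.integers(-r, r + 1, size=2)
    return int(dx), int(dy)

rng = np.random.default_rng(0)
digit = np.zeros((28, 28))
digit[10:18, 12:16] = 1.0                        # placeholder standing in for an MNIST "4"
dx, dy = sample_shift(4, rng)
target = shift_image(digit, dx, dy)              # the decoder is trained to output this
```

At test time, the same shift_image call with values outside the narrow range produces the targets against which reconstructions of "4" and "9" are compared.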
This is also reflected in the dependence of the reconstruction loss on the shift. Here, we show reconstruction losses for autoencoders and two models with generative capsules. The alternative capsule model shows almost perfect generalization: its loss doesn’t increase outside the range of the training set:
Its code can be found here. It differs in that it doesn’t rely on the digit caps, but uses a continuous latent code.
Moreover, generative capsule networks are even capable of extrapolating the shift transform (far) beyond the range of shifts used for all examples in the training set. Of course, this is true only for pre-defined addresses. If the addresses are trainable internal parameters of neurons, then the addresses of neurons that were never activated during training cannot be learned, and the model will not be able to activate them appropriately for novel shifts. However, this doesn’t imply that the model cannot extrapolate outside the range of the transformation parameters presented in the training set; it only means that preliminary training of the addresses is required.
Next, we conducted the same experiments on reconstructing rotated images as described in our previous post. The idea was again to train the model to reconstruct images of all digits rotated over the whole range of angles, except "4" and "9", for which a narrower range (specifically, [–45°, 45°]) was used.
Unfortunately, the results of rotation by novel angles turned out to be imperfect:
Results of reconstruction with rotation by generative CapsNets
Surprisingly, if we extend the length of the latent code, both for traditional autoencoders and for generative capsule networks with continuous variables, the transfer of the capability to rotate images over the whole range of angles becomes very good. Here is the result for autoencoders with 128 elements in the latent code:
Improved transfer of rotations
This might contradict the intuition that more complex models should have worse generalization capabilities. Of course, this experiment is not about pure generalization but rather about transfer. Indeed, if angles from some range were not used anywhere in the training set, the models would not be able to rotate images by these angles (more precisely, by angles not too close to the range of the training set), as we will see below. Nevertheless, it is still interesting to see how this transfer happens.
If we activate only one neuron in the latent code and run the decoder for different rotation angles, we will see the following picture (columns stand for different neurons being activated):
“Concepts” learned by neurons of the latent code of the autoencoder
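The probing procedure behind this figure can be sketched as follows, assuming the trained decoder is exposed as a callable decoder(latent, angle) that returns an image. That interface and the toy_decoder used here to make the example runnable are placeholders, not the actual model.

```python
import numpy as np

def probe_latent_units(decoder, latent_size, angles, units=None):
    """Decode one-hot latent codes for several angles.

    Returns an array of shape (n_angles, n_units, H, W): rows are angles,
    columns are the latent neurons being activated.
    """
    units = range(latent_size) if units is None else units
    rows = []
    for angle in angles:
        row = []
        for u in units:
            latent = np.zeros(latent_size)
            latent[u] = 1.0                       # activate a single latent neuron
            row.append(decoder(latent, angle))
        rows.append(row)
    return np.array(rows)

def toy_decoder(latent, angle):
    """Placeholder decoder so the sketch runs; a real one would render an image."""
    return np.full((28, 28), latent.sum() * np.cos(angle))

grid = probe_latent_units(toy_decoder, latent_size=5, angles=[0.0, np.pi])
```

Plotting grid as an image mosaic gives one column per latent neuron and one row per rotation angle, which is the layout of the figure above.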
Apparently, the network learned some ‘basis functions’ and memorized the result of rotation for them. The latent code appears to be rich enough to represent any image from the training set as a sum of these functions. Since the network remembers how to rotate the basis functions for all angles, it can rotate a digit to a novel angle. If the latent code is small, the network has to memorize the results of rotating the digits themselves. The result for large latent codes is not bad, since the capability to apply a transformation is transferred between different classes, but the learned representation is not interpretable.
One more modification of this experiment is to train models on all angles for all digits except "4" and "9", and to train on "4" and "9" only with the rotation angle equal to 0.1π. Capsule networks appear better in this case, and autoencoders display a strange dependence of the loss on the angle (here, the size of the latent code is 20):
The surprisingly bad reconstruction for small rotation angles results from the fact that the autoencoders reconstruct the non-rotated version of "4":
Reconstruction of the memorized digit for small rotation angles by autoencoders
That is, the autoencoder learns a specialized representation for reconstructing a non-rotated "4" and uses the common representation shared by all digits only for larger angles. The reconstruction by the generative CapsNet doesn’t have this problem:
Correct reconstruction by generative CapsNets
True generalization supposes the extrapolation for novel values of transformation parameters, which were not encountered in the training set at all. In the following experiment, we trained the model on all digits for angles in the range [–180°, 180°], and checked what happens in the whole range of angles. Apparently, extrapolation is quite poor:
Extrapolation beyond the range of the training set by generative CapsNets
Nevertheless, generative capsules with continuous latent code have lower error outside [–180°, 180°] in comparison with traditional models:
Published at DZone with permission of Alexey Potapov. See the original article here.