Java Art Generation With Neural Style Transfer

Neural-style transfer is the process of creating a new image by mixing two images together. Check out how to do it using AI, neural networks, and Java.

In this post, we are going to build a deep learning Java application using Deeplearning4j for the purpose of generating art. Besides being an attractive and fascinating topic, neural style transfer provides great insight into what deep convolution layers are learning. Feel free to run the application and try it with your own images.

What Is Neural-Style Transfer?

Neural-style transfer is the process of creating a new image by mixing two images together. Let's suppose we have these two images below:

The generated art image will then look like this:

Since we like the art in the bottom image, we would like to transfer that style onto our own photos. Of course, we would prefer to preserve the photo's content as much as possible and at the same time transform it according to the art image's style.

We need to find a way to capture content and style image features so that we can mix them together in such a way that the output will look satisfactory to the eye.

Deep convolutional neural networks like VGG-16 are already, in a way, capturing these features, given that they are able to classify/recognize a large variety of images (millions) with quite high accuracy. We just need to look deeper at the neural layers and understand/visualize what they are doing.

What Are Convolutional Networks Learning?

A great paper already offers insight into this. Its authors developed quite a sophisticated way to visualize internal layers by using deconvolutional networks and other specific methods. Here, we will focus only on the high-level intuition of what neural layers are doing.

Let's first bring into focus the VGG-16 architecture we saw in our cat image recognition application:

While training with images, let's suppose we pick the first layer and start monitoring the activation values of some of its units/neurons (usually nine to 12). From all the activation values, let's pick the nine maximum values for each of the chosen units. For each of these nine values, we visualize the patch of the image that causes that activation to be maximized. In other words, these are the parts of the images that make those neurons fire with the largest values.

Since we are just in the first layer, the units capture only a small part of the images and rather low-level features, as shown below:

It looks like the first neuron is interested in diagonal lines, the third and fourth in vertical and diagonal lines, and the eighth for sure likes the color green. It is noticeable that all of these are really small parts of images and that the layer captures rather low-level features.

Let's move a bit deeper and choose the second layer:

In this layer, neurons start to detect more features; the second detects thin vertical lines, the sixth and seventh start capturing round shapes, and the 14th is obsessed with the color yellow.

Deeper into the third layer:

This layer starts to detect more interesting stuff; the sixth is more activated for round shapes that look like tires, the tenth is not easy to explain but likes orange and round shapes, while the 11th starts detecting some humans.

Even deeper... into layers 4 and 5:

So the deeper we go, the larger the part of the image each neuron detects, therefore capturing high-level features (the second neuron in the fifth layer is really into dogs), compared to the lower layers, which capture rather small parts of the image.

This gives great insight into what deep convolutional layers are learning and also, coming back to our style transfer, into how to generate art while keeping the content, using two images. We just need to generate a new image that, when fed to the neural network as input, generates more or less the same activation values as the content (photo) and style (art painting) images.

Transfer Learning

One great thing about deep learning is that it's highly portable between applications and even between different programming languages and frameworks. The reason is simply that what a deep learning algorithm produces is just weights; these are simply decimal values, so they can be easily transported and imported into different environments.

Anyway, for our case, we are going to use the VGG-16 architecture pre-trained on ImageNet. Usually, VGG-19 is used, but unfortunately, it turns out to be too slow on CPU; maybe on GPU it will be better. Here's the Java code:

private ComputationGraph loadModel() throws IOException {
    // VGG-16 from the Deeplearning4j model zoo, with weights pre-trained on ImageNet
    ZooModel zooModel = new VGG16();
    ComputationGraph vgg16 = (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);
    return vgg16;
}

Load Images

At the beginning, we have only the content image and the style image, so the combined image starts out as a rather noisy image. Loading images is a fairly easy task:

private static final DataNormalization IMAGE_PRE_PROCESSOR = new VGG16ImagePreProcessor();
private static final NativeImageLoader LOADER = new NativeImageLoader(HEIGHT, WIDTH, CHANNELS);

private INDArray loadImage(String contentFile) throws IOException {
    INDArray content = LOADER.asMatrix(new ClassPathResource(contentFile).getFile());
    // normalize the pixels in place with the ImageNet mean values
    IMAGE_PRE_PROCESSOR.transform(content);
    return content;
}

// usage
INDArray content = loadImage(CONTENT_FILE);
INDArray style = loadImage(STYLE_FILE);

Please note that after loading the pixels, we are normalizing (IMAGE_PRE_PROCESSOR) the pixels with the mean values from all images used during training the VGG-16 with the ImageNet dataset. Normalization helps speed up training and is something that is, more or less, always done.
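To make the normalization step concrete, here is a minimal plain-Java sketch, assuming that the normalization amounts to subtracting one mean value per color channel. It works on plain arrays rather than the Deeplearning4j API, and the class name, method name, and mean values used with it are hypothetical, chosen only for illustration.

```java
/** Plain-array sketch of per-channel mean subtraction (not the DL4J API). */
final class MeanSubtractionSketch {

    /**
     * Subtracts one mean value per channel from an image stored as
     * [channel][row][column]. Returns a new array; the input is not modified.
     */
    static double[][][] subtractChannelMeans(double[][][] image, double[] channelMeans) {
        if (image.length != channelMeans.length) {
            throw new IllegalArgumentException("need exactly one mean per channel");
        }
        double[][][] out = new double[image.length][][];
        for (int c = 0; c < image.length; c++) {
            out[c] = new double[image[c].length][];
            for (int y = 0; y < image[c].length; y++) {
                out[c][y] = new double[image[c][y].length];
                for (int x = 0; x < image[c][y].length; x++) {
                    out[c][y][x] = image[c][y][x] - channelMeans[c];
                }
            }
        }
        return out;
    }
}
```

For example, a two-channel image with pixel values {10, 20} and {30, 40} and means {10, 25} becomes {0, 10} and {5, 15}.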

Now, it's time to generate a noisy image:

private INDArray createCombinationImage() throws IOException {
    INDArray content = LOADER.asMatrix(new ClassPathResource(CONTENT_FILE).getFile());
    INDArray combination = createCombineImageWithRandomPixels();
    // blend: NOISE_RATION of random pixels plus (1 - NOISE_RATION) of the content image
    combination.muli(NOISE_RATION).addi(content.muli(1.0 - NOISE_RATION));
    return combination;
}

As we can see from the code, the combined image is not purely noisy but some of it is taken from content (NOISE_RATION controls the percentage). The idea is taken from this TensorFlow implementation and is done to speed up training, therefore getting good results faster. Anyway, the algorithm eventually will produce more or less the same results with pure noise images, but it will just take longer and will require more iterations.
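The blending itself is just a weighted average of noise and content. The following is a minimal sketch on plain arrays, not the Deeplearning4j code; the class name, the uniform noise distribution, and its range are assumptions made for illustration.

```java
import java.util.Random;

/** Plain-array sketch of mixing random noise with the content image. */
final class NoiseBlendSketch {

    /** Returns noiseRatio * noise + (1 - noiseRatio) * content, element by element. */
    static double[] blend(double[] content, double[] noise, double noiseRatio) {
        if (content.length != noise.length) {
            throw new IllegalArgumentException("content and noise must have the same length");
        }
        double[] combined = new double[content.length];
        for (int i = 0; i < content.length; i++) {
            combined[i] = noiseRatio * noise[i] + (1.0 - noiseRatio) * content[i];
        }
        return combined;
    }

    /** Uniform noise in [-range, range); distribution and range are illustrative choices. */
    static double[] uniformNoise(int length, double range, long seed) {
        Random random = new Random(seed);
        double[] noise = new double[length];
        for (int i = 0; i < length; i++) {
            noise[i] = (random.nextDouble() * 2.0 - 1.0) * range;
        }
        return noise;
    }
}
```

With a noise ratio of 0 the result is the content image itself, and with a ratio of 1 it is pure noise; values in between trade a faster start against how much the content constrains the result.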

Content Image Cost Function

As we mentioned earlier, we will use the intermediate layer activation values produced by a neural network as a metric showing how similar two images are. First, let's get those layer activations by doing a forward pass for the content and combined image using the VGG-16 pre-trained model:

Map<String, INDArray> activationsContentMap = vgg16FineTune.feedForward(content, true);
Map<String, INDArray> activationsCombMap = vgg16FineTune.feedForward(combination, true);

Now, for each image, we have a map with the layer name as key and the activation values on that layer as value. We will choose a deep layer (conv4_2) for our content image cost function because we want to capture features that are as high-level as possible: we would like the combined (generated) image to retain the look and shape of the content. At the same time, we choose only one layer because we don't want the combined image to look exactly like the content, but rather to leave some space for the art.

Once we have activations for the chosen layer for both images, content and combined, it's time to compare them and see how similar they are. In order to measure their similarity, we will use their squared difference divided by activation dimensions, as described by this paper:

Fij denotes the combined image's layer activation values and Pij the content image's layer activation values. Basically, it's just the squared Euclidean distance between the two activations in that particular layer.
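For reference, the content loss in the original neural style transfer paper by Gatys et al. is usually written as below. This assumes that is the paper referred to here; the layer superscript l and the image symbols p (content) and x (combined) are that paper's notation rather than this article's.

```latex
\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}
```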

Ideally, we want the difference to be zero. In other words, we minimize as much as possible the difference between the image features on that layer. In this way, we transfer the features captured by that layer from the content image to the combined image.

The implementation in Java of the cost function will look like below:

public double contentLoss(INDArray combActivations, INDArray contentActivations) {
    return sumOfSquaredErrors(contentActivations, combActivations) / (4.0 * (CHANNELS) * (WIDTH) * (HEIGHT));
}

public double sumOfSquaredErrors(INDArray a, INDArray b) {
    INDArray diff = a.sub(b); // difference
    INDArray squares = Transforms.pow(diff, 2); // element-wise squaring
    return squares.sumNumber().doubleValue();
}

The only nonessential difference from the mathematical formula in the paper is that we divide by the activation dimensions rather than by 2.

Style Image Cost Function

The approach for the style image is quite similar to the one for the content image, in that we still use differences in neural layer activation values as a similarity measurement between images. There is, however, a difference in how the activation values are processed in the style cost function.

Recalling from the previous convolution layers post and our cat recognition application, a typical convolution operation results in an output with several channels (the third dimension) besides width and height (e.g. 16 X 20 X 356, w X h X c). Usually, convolution shrinks width and height and increases the number of channels.

Style is defined as the correlation between units across channels in a specifically chosen layer. For example, if we have a layer with shape 12X12X32 and we pick the tenth channel, all 12X12=144 units of the tenth channel will be correlated with the 144 units of each of the other channels (1, 2, ..., 9, 11, 12, ..., 32).

Mathematically, this is called the Gram Matrix (G) and is calculated by multiplying the units' values across channels in a layer. If the values are almost the same, the Gram Matrix will output a big value, in contrast to when the values are completely different. So the Gram Matrix captures how related different channels are to each other (similar to the intuition behind correlation). From the paper, it will look like below:

l is the chosen layer and k is an index that iterates over the channels in that layer. Notice that k* is not iterating, because this is the channel we compare with all the other channels. i and j refer to the position of the unit within a channel.

The implementation in Java looks like below:

public double styleLoss(INDArray style, INDArray combination) {
    INDArray s = gramMatrix(style);
    INDArray c = gramMatrix(combination);
    int[] shape = style.shape();
    int N = shape[0];            // number of channels
    int M = shape[1] * shape[2]; // units per channel
    return sumOfSquaredErrors(s, c) / (4.0 * (N * N) * (M * M));
}

public INDArray gramMatrix(INDArray x) {
    INDArray flattened = flatten(x);
    INDArray gram = flattened.mmul(flattened.transpose());
    return gram;
}

Once we have the Gram Matrix, we do the same as for the content: calculate the squared Euclidean distance, i.e. the squared difference between the Gram Matrices of the combined and style images' activation values.

Gijl denotes the combined image's Gram values and Aijl the style image's Gram values on a specific layer l.
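As a small worked example of the Gram Matrix and the style loss, here is a sketch on plain arrays. It assumes the activations of one layer are already flattened to one row per channel; the class name is hypothetical, and the scaling by 4 * N^2 * M^2 mirrors the Java code above.

```java
/** Plain-array sketch of the Gram matrix and style loss (not the DL4J API). */
final class GramMatrixSketch {

    /**
     * features[c][p] holds the activation of channel c at flattened
     * spatial position p (p runs over height * width).
     * Returns the channels x channels matrix of channel dot products.
     */
    static double[][] gramMatrix(double[][] features) {
        int channels = features.length;
        double[][] gram = new double[channels][channels];
        for (int a = 0; a < channels; a++) {
            for (int b = 0; b < channels; b++) {
                double dot = 0.0;
                for (int p = 0; p < features[a].length; p++) {
                    dot += features[a][p] * features[b][p];
                }
                gram[a][b] = dot;
            }
        }
        return gram;
    }

    /** Squared difference of the two Gram matrices, scaled by 4 * N^2 * M^2. */
    static double styleLoss(double[][] styleFeatures, double[][] combinedFeatures) {
        double[][] s = gramMatrix(styleFeatures);
        double[][] c = gramMatrix(combinedFeatures);
        double sum = 0.0;
        for (int a = 0; a < s.length; a++) {
            for (int b = 0; b < s.length; b++) {
                double diff = s[a][b] - c[a][b];
                sum += diff * diff;
            }
        }
        double n = styleFeatures.length;    // N: number of channels
        double m = styleFeatures[0].length; // M: height * width
        return sum / (4.0 * n * n * m * m);
    }
}
```

For two channels with values {1, 2} and {3, 4}, the Gram Matrix is {{5, 11}, {11, 25}}: the diagonal holds each channel's dot product with itself, and the off-diagonal entries measure how strongly the two channels fire together.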

There is one last detail about the style cost function: usually, choosing more than one layer gives better results. So, for the final style cost function, we are going to choose four layers and add them together:

E is just the equation above and wl denotes the weight of layer l, so we control the impact, or contribution, of each layer. We may want the lower layers to contribute less than the upper layers, but still have them contribute.

Finally, in Java it looks like below:

private static final String[] STYLE_LAYERS = new String[]{
    // "layerName,weight" entries, one per style layer
};

private Double allStyleLayersLoss(Map<String, INDArray> activationsStyleMap, Map<String, INDArray> activationsCombMap) {
    Double styles = 0.0;
    for (String styleLayers : STYLE_LAYERS) {
        String[] split = styleLayers.split(",");
        String styleLayerName = split[0];
        double weight = Double.parseDouble(split[1]);
        // weighted style loss of this layer
        styles += styleLoss(activationsStyleMap.get(styleLayerName).dup(), activationsCombMap.get(styleLayerName).dup()) * weight;
    }
    return styles;
}

Total Cost Function

Total cost measures how similar or different the combined image is from the content image's features in the selected layer and from the style image's features in the multiple selected layers. In order to control how much we want our combined image to look like the content or the style, two weights are introduced, α and β:

Increasing α will cause the combined image to look more like content while increasing β will cause the combined image to have more style. Usually, we decrease α and increase β.
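Written out, the weighted combination described above takes the following form. The subscripted loss names are labels chosen here for the content and style cost functions of the previous sections; only α and β come from the text.

```latex
\mathcal{L}_{total} = \alpha \, \mathcal{L}_{content} + \beta \, \mathcal{L}_{style}
```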

Updating Combined Image

By now, we have a great way to measure how similar the combined image is to the content and the style. What we need now is to react to the comparison result in order to change the combined image so that next time, it will have less difference, or a lower cost function value. Step by step, we will change the combined image to become closer and closer to the content and style images' layer features.

The amount of change is determined by the derivative of the total cost function. The derivative simply gives a direction to go. We then multiply the derivative by a coefficient α (the learning rate, not the content weight α above) that simply defines how much you want to progress or change. You don't want values that are too small, as this will require a lot of iterations to improve, whereas values that are too big can make the algorithm never converge or produce unstable values (see here for more).

If it were TensorFlow, we would be done by now, since it handles the derivative of the cost function automatically for us. Although Deeplearning4j requires manual calculation of the derivative (ND4J offers some autodiff; feel free to experiment with it for automatic differentiation) and is not designed to work in the way style transfer requires, it has all the flexibility and pieces needed to build the algorithm.

Thanks to Jacob Schrum, we were able to build the derivative implementation in Java. You can find the details in the Deeplearning4j examples on GitHub; the class implementation originally started in the MM-NEAT repository.

The last step is to update the combined image with the derivative value (multiplied by α, as well):
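The effect of the step size can be seen on a one-dimensional toy cost. The sketch below is not the style transfer cost: it assumes the simple function f(x) = (x - target)^2, whose derivative is 2 * (x - target), and the class and parameter names are hypothetical.

```java
/** Toy gradient descent on f(x) = (x - target)^2 to illustrate the step size. */
final class GradientDescentSketch {

    /** Repeatedly moves x against the derivative, scaled by the learning rate. */
    static double minimize(double start, double target, double learningRate, int iterations) {
        double x = start;
        for (int i = 0; i < iterations; i++) {
            double derivative = 2.0 * (x - target); // direction of steepest increase
            x -= learningRate * derivative;         // step in the opposite direction
        }
        return x;
    }
}
```

Starting from 0 with a target of 3, a learning rate of 0.1 gets very close to 3 within 100 iterations, a rate of 0.001 has barely moved after the same number of iterations, and a rate of 1.1 overshoots further on every step and diverges.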

AdamUpdater adamUpdater = createADAMUpdater();
for (int iteration = 0; iteration < ITERATIONS; iteration++) {
    log.info("iteration  " + iteration);
    // forward pass of the current combined image
    Map<String, INDArray> activationsCombMap = vgg16FineTune.feedForward(combination, true);

    // derivatives of the style and content costs with respect to the combined image
    INDArray styleBackProb = backPropagateStyles(vgg16FineTune, activationsStyleGramMap, activationsCombMap);
    INDArray backPropContent = backPropagateContent(vgg16FineTune, activationsContentMap, activationsCombMap);

    // weighted sum, matching the total cost function
    INDArray backPropAllValues = backPropContent.muli(ALPHA).addi(styleBackProb.muli(BETA));

    adamUpdater.applyUpdater(backPropAllValues, iteration);
    // subtract the update from the combined image, as described in the text
    combination.subi(backPropAllValues);

    log.info("Total Loss: " + totalLoss(activationsStyleMap, activationsCombMap, activationsContentMap));
    if (iteration % SAVE_IMAGE_CHECKPOINT == 0) {
        //save image can be found at target/classes/styletransfer/out
        saveImage(combination.dup(), iteration);
    }
}

We simply subtract the derivative value from the combined image's pixels on each iteration, and minimizing the cost function this way brings us closer, iteration by iteration, to the image we want. To make the algorithm effective, we update using ADAM, which simply helps gradient descent converge more stably. Basically, a simpler updater will work fine as well, but it will take slightly more time.

What we've described so far is gradient descent; more specifically, stochastic gradient descent, since we are updating with only one sample at a time. Usually, for style transfer, L-BFGS is used, but with Deeplearning4j, this will be harder, and I don't have any insight on how to approach it.
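For intuition about what an ADAM update does with the derivative, here is a minimal sketch of the standard ADAM rule on plain arrays. It is not Deeplearning4j's updater class; the class name is hypothetical, and the hyperparameter values in the usage note are the commonly used defaults, not values taken from this application.

```java
/** Plain-array sketch of the standard ADAM update rule (not the DL4J updater). */
final class AdamSketch {
    private final double learningRate;
    private final double beta1;
    private final double beta2;
    private final double epsilon;
    private final double[] m; // running average of gradients
    private final double[] v; // running average of squared gradients
    private int step = 0;

    AdamSketch(int size, double learningRate, double beta1, double beta2, double epsilon) {
        this.learningRate = learningRate;
        this.beta1 = beta1;
        this.beta2 = beta2;
        this.epsilon = epsilon;
        this.m = new double[size];
        this.v = new double[size];
    }

    /** Applies one ADAM step to params in place, given the gradient of the cost. */
    void update(double[] params, double[] gradient) {
        step++;
        for (int i = 0; i < params.length; i++) {
            m[i] = beta1 * m[i] + (1.0 - beta1) * gradient[i];
            v[i] = beta2 * v[i] + (1.0 - beta2) * gradient[i] * gradient[i];
            double mHat = m[i] / (1.0 - Math.pow(beta1, step)); // bias correction
            double vHat = v[i] / (1.0 - Math.pow(beta2, step));
            params[i] -= learningRate * mHat / (Math.sqrt(vHat) + epsilon);
        }
    }
}
```

A typical construction would be `new AdamSketch(size, 0.1, 0.9, 0.999, 1e-8)`. Because each step is divided by the running magnitude of the gradient, the size of the update depends much less on the raw scale of the derivative than in plain gradient descent.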


Originally, the showcase was implemented in MM-NEAT, together with Jacob Schrum, but later it was contributed to the deeplearning4j-examples project, so feel free to download it from either source (the DL4J version is slightly refactored).

Basically, the class code can be easily copied and run on different projects, as it has no other dependencies (besides Deeplearning4j, of course).

Usually, to get decent results, you need to run a minimum of 500 iterations, but 1,000 is often recommended, while 5,000 iterations produce really high-quality images. Anyway, expect to let the algorithm run for a couple of hours (three to four) for 1,000 iterations.

There are a few parameters that we can play with in order to make the combined image look best to us:

  • Change the loss weights α (impacts content) and β (impacts style), which simply affect how much you want your image to look like the content or the style. Some values suggested in other implementations are (0.025, 5), (5, 100), and (10, 40), but feel free to experiment; there is plenty of room for optimization.
  • Change the style layers' weights. Currently, we have bigger values for higher layers and smaller values for lower layers. There are other implementations showing very good results with equal weights, for example with all style-layer weights set to 0.2. It would also be interesting to try increasing the impact of the low-level layers and notice how the images are transformed.
  • Change the content layer or style layers to lower or deeper layers and notice how the image is greatly affected.
  • There are other parameters, related mostly to the algorithm itself, that are worth considering in the context of speeding up the execution, like the ADAM beta_1 and beta_2 momentum constants or the learning rate.

Please find below some use cases. Feel free to share more use cases since there exist many more interesting art mixtures out there.

Example #1:

Example #2:

Example #3:

Results and Future Work

In general, the style transfer algorithm is slow because each iteration requires a forward pass and several backpropagation passes (a couple of hours at 800X600 resolution with TensorFlow). I didn't personally perform any comparison of Deeplearning4j with other frameworks like TensorFlow, but at first look, I have the impression that it is slower; especially if you try to run at a high resolution like 800X600, it becomes almost not computable on CPU. Maybe running on GPU will help; I have not tried that yet, so feel free to suggest new insights or experiments.

There is a new paper that suggests a state-of-the-art technique to make style transfer faster. The implementation is so efficient that it can also be applied to videos, so it will be quite interesting to find a way to implement it in Java. Please find below a few demos and implementations in TensorFlow:
  1. Fast Style Transfer
  2. Fast Neural Style
  3. Tensorflow Fast Style Transfer