Deep Learning Achievements of 2017 (Part 1)


In this two-part series, we're taking stock of the most recent achievements in deep learning from the past year. In Part 1, we look at text, voice, and computer vision.


At Statsbot, we’re constantly reviewing deep learning achievements to improve our models and product. Around Christmas time, our team decided to take stock of the recent achievements in deep learning over the past year (and a bit longer). We had a data scientist write this article to tell you about the most significant developments that can affect our future.


Google Neural Machine Translation

Almost a year ago, Google announced the launch of a new model for Google Translate. The company described the network architecture in detail: it is a recurrent neural network (RNN).

The key outcome: the gap to human translation accuracy narrowed by 55-85% (as rated by people on a six-point scale). It is difficult to reproduce good results with this model without a dataset as huge as Google's.

Negotiations: Will There Be a Deal?

You probably heard the silly news that Facebook turned off its chatbot, which went out of control and made up its own language. The company created this chatbot for negotiations. Its purpose is to conduct text negotiations with another agent and reach a deal on how to divide items (books, hats, etc.) between the two of them. Each agent has its own goal in the negotiations that the other does not know about. It’s impossible to leave the negotiations without a deal.

For training, they collected a dataset of human negotiations and trained a supervised recurrent network. Then, they trained the agent further with reinforcement learning by having it talk with itself, under one constraint: its language had to stay similar to human language.

The bot has learned one of the real negotiation strategies: showing a fake interest in certain aspects of the deal, only to give up on them later and benefit from its real goals. It was the first attempt to create such an interactive bot, and it was quite successful.

The full story is in this article, and the code is publicly available.

Certainly, the news that the bot allegedly invented a language was blown out of proportion. During training (in negotiations with a copy of the same agent), they disabled the constraint that kept the text similar to human language, and the algorithm modified the language of interaction. Nothing unusual.

Over the past year, recurrent networks have been actively developed and used in many tasks and applications. The architecture of RNNs has become much more complicated, but in some areas, similar results were achieved by simple feedforward networks (DSSM). For example, Google has reached the same quality it previously achieved with LSTMs for its mail feature, Smart Reply. In addition, Yandex launched a new search engine based on such networks.


WaveNet: A Generative Model for Raw Audio

DeepMind researchers reported on audio generation in their article. Briefly, they built WaveNet, an autoregressive, fully convolutional model based on previous approaches to image generation (PixelRNN and PixelCNN).

The network was trained end-to-end: text for the input, audio for the output. The researchers got an excellent result: the gap to human speech was reduced by 50%.

The main disadvantage of the network is its low speed: because of the autoregression, sounds are generated sequentially, and it takes about one to two minutes to create one second of audio.
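To see why autoregression makes generation slow, consider the following minimal sketch. It is not WaveNet itself: `next_sample` is a hypothetical stand-in for the network, and its three-sample context is an arbitrary assumption. The point is only that each new sample depends on the previously generated ones, so the loop cannot be parallelized over time.

```python
def next_sample(context):
    """Hypothetical stand-in for the network: predicts the next sample
    from the most recent samples (here, a fixed weighted average)."""
    weights = [0.5, 0.3, 0.2]  # most recent sample gets the largest weight
    recent = context[-len(weights):]
    return sum(w * s for w, s in zip(weights, reversed(recent)))


def generate(seed, num_samples):
    """Generate samples one at a time; step t needs the output of step t-1."""
    audio = list(seed)
    for _ in range(num_samples):
        audio.append(next_sample(audio))
    return audio


samples = generate(seed=[0.0, 0.5, 1.0], num_samples=5)
```

Every second of audio requires one such sequential step per sample, which is where the long generation time comes from.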

Look at… sorry, hear... this example.

If you remove the dependence of the network on the input text and leave only the dependence on the previously generated phoneme, the network will generate phonemes that resemble human language but are meaningless.

Hear the example of the generated voice here.

This same model can be applied not only to speech but also, for example, to creating music. Imagine audio generated by the model after it was trained on a dataset of piano music (again without any dependence on the input data).

Read a full version of DeepMind research if you’re interested.

Lip Reading

Lip reading is another deep learning achievement and victory over humans.

Google DeepMind, in collaboration with Oxford University, reported in the article Lip Reading Sentences in the Wild on how their model, which had been trained on a television dataset, was able to surpass a professional lip reader from the BBC.

There are 100,000 sentences with audio and video in the dataset. The model uses an LSTM on audio and a CNN plus LSTM on video. These two state vectors are fed to the final LSTM, which generates the result (characters).

Different types of input data were used during training: audio, video, and audio plus video. In other words, it is an “omnichannel” model.

Synthesizing Obama: Synchronization of Lip Movement From Audio

The University of Washington has done serious work on generating the lip movements of former US President Obama. He was chosen because of the huge number of recordings of his speeches online (17 hours of HD video).

They couldn’t get by with just the network, as it produced too many artifacts. Therefore, the authors of the article added several crutches (or tricks, if you like) to improve the texture and timings.

You can see that the results are amazing. Soon, you won’t be able to trust even a video of the president.

Computer Vision

OCR: Google Maps and Street View

In their post and article, the Google Brain team reported on how they introduced a new OCR (Optical Character Recognition) engine into Google Maps, which recognizes street signs and store signs.

While developing the technology, the company compiled a new dataset, FSNS (French Street Name Signs), which contains many complex cases.

To recognize each sign, the network uses up to four of its photos. The features are extracted with the CNN and weighted with the help of spatial attention (pixel coordinates are taken into account), and the result is fed to the LSTM.

The same approach is applied to the task of recognizing store names on signboards (there can be a lot of “noise” data, and the network itself must “focus” in the right places). This algorithm was applied to 80 billion photos.

Visual Reasoning

There is a type of task called visual reasoning where a neural network is asked to answer a question using a photo. For example: “Is there a same-size rubber thing in the picture as a yellow metal cylinder?” The question is truly nontrivial, and until recently, the problem was solved with an accuracy of only 68.5%.

And again, the breakthrough was achieved by the team from DeepMind; on the CLEVR dataset, they reached a super-human accuracy of 95.5%.

The network architecture is very interesting:

  1. Using the pre-trained LSTM on the text question, we get the embedding of the question.
  2. Using the CNN (just four layers) with the picture, we get feature maps (features that characterize the picture).
  3. Next, we form pairwise combinations of coordinatewise slices of the feature maps (yellow, blue, red in the picture below), adding coordinates and the text embedding to each of them.
  4. We pass all these triples through another network and sum the outputs.
  5. The resulting representation is run through another feedforward network, which outputs the answer via a softmax.
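The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not DeepMind's implementation: `make_objects`, `g`, and `relation_network` are hypothetical names, `g` is a fixed function standing in for the learned pairwise network, and the final feedforward network is reduced to a bare softmax.

```python
import math
from itertools import product


def make_objects(feature_map):
    """Turn each spatial cell of a feature map into an 'object':
    its feature vector plus its (row, col) coordinates."""
    objects = []
    for r, row in enumerate(feature_map):
        for c, cell in enumerate(row):
            objects.append(list(cell) + [float(r), float(c)])
    return objects


def g(pair_and_question):
    """Toy stand-in for the pairwise network (a learned MLP in practice)."""
    return [sum(pair_and_question), max(pair_and_question)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relation_network(feature_map, question_embedding):
    objects = make_objects(feature_map)
    summed = [0.0, 0.0]
    # Steps 3-4: every pair of objects, combined with the question embedding,
    # goes through g, and the outputs are summed.
    for obj_i, obj_j in product(objects, repeat=2):
        out = g(obj_i + obj_j + question_embedding)
        summed = [s + o for s, o in zip(summed, out)]
    # Step 5: a readout network would go here; the sketch applies softmax directly.
    return softmax(summed)


feature_map = [[[0.1, 0.2], [0.3, 0.4]],
               [[0.5, 0.6], [0.7, 0.8]]]  # 2x2 cells, 2 channels each
answer_probs = relation_network(feature_map, question_embedding=[1.0, 0.0])
```

Because the outputs of `g` are summed, the result does not depend on the order in which the pairs are visited.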


An interesting application of neural networks was created by the company Uizard: generating layout code from a screenshot made by an interface designer.

This is an extremely useful application of neural networks, which can make life easier when developing software. The authors claim that they reached 77% accuracy. However, this is still under research, and there is no talk of real-world use yet.

There is no code or dataset in open source, but they promise to upload it.

SketchRNN: Teaching a Machine to Draw

Perhaps you’ve seen Quick, Draw! from Google, where the goal is to draw sketches of various objects in 20 seconds. The corporation collected this dataset in order to teach the neural network to draw, as Google described in their blog and article.

The collected dataset consists of 70 thousand sketches, which eventually became publicly available. Sketches are not pictures but detailed vector representations of drawings (where the user pressed the “pencil,” where they released it, where the line was drawn, and so on).
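One common way to encode such a vector drawing is as a sequence of pen offsets with a pen-state flag. The sketch below assumes that encoding for illustration; the function name `to_offsets` and the sample drawing are hypothetical, and the exact format of the published dataset may differ.

```python
def to_offsets(strokes):
    """Convert strokes given as lists of absolute (x, y) points into
    (dx, dy, pen_lifted) steps; pen_lifted is 1 on the last point of a stroke."""
    steps = []
    prev_x, prev_y = 0, 0
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            pen_lifted = 1 if i == len(stroke) - 1 else 0
            steps.append((x - prev_x, y - prev_y, pen_lifted))
            prev_x, prev_y = x, y
    return steps


# Two strokes: an "L" shape, then a separate short line.
drawing = [[(0, 0), (0, 10), (5, 10)], [(8, 0), (8, 4)]]
steps = to_offsets(drawing)
```

A sequence like this is a natural input for a recurrent network, since each step describes one pen movement.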

Researchers trained a sequence-to-sequence variational autoencoder (VAE) using RNNs as the encoder and decoder.

As befits an autoencoder, the model produces a latent vector that characterizes the original picture.

Because the decoder can reconstruct a drawing from this vector, you can change the vector and get new sketches.

And even perform vector arithmetic to create a catpig:


One of the hottest topics in deep learning is generative adversarial networks (GANs). Most often, this idea is used to work with images, so I will explain the concept using them.

The idea is a competition between two networks: the generator and the discriminator. The first network creates a picture, and the second one tries to determine whether the picture is real or generated.

Schematically, it looks like this:

During training, the generator produces an image from a random vector (noise) and feeds it to the input of the discriminator, which says whether it is fake or not. The discriminator is also given real images from the dataset.

It is difficult to train such a construction, as it is hard to find the equilibrium point of the two networks. Most often, the discriminator wins and the training stagnates. However, the advantage of the system is that we can solve problems in which it is difficult for us to define the loss function (for example, improving the quality of a photo): we leave that to the discriminator.
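The balance between the two networks can be made concrete with the standard cross-entropy losses. The snippet below is a minimal sketch: the discriminator outputs are hypothetical numbers rather than outputs of a trained model, and the generator loss is the commonly used non-saturating variant, which is an assumption and not something the article specifies.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images should score 1, generated ones 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: the generator wants D(G(z)) close to 1."""
    return -math.log(d_fake)


# Hypothetical discriminator outputs (probability that the input is real).
# A confident discriminator: low loss for D, high loss for G.
strong_d = (discriminator_loss(0.99, 0.01), generator_loss(0.01))
# At equilibrium the discriminator cannot tell and outputs 0.5 for everything.
balanced = (discriminator_loss(0.5, 0.5), generator_loss(0.5))
```

When the discriminator is confidently right, its own loss is near zero while the generator's loss is large, which corresponds to the "discriminator wins" situation described above.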

A classic example of the GAN training result is pictures of bedrooms or people.

Previously, we considered an autoencoder (Sketch-RNN), which encodes the original data into a latent representation. The same thing happens with the generator.

The idea of generating an image using a vector is clearly shown in this project in the example of faces. You can change the vector and see how the faces change.

The same arithmetic works over the latent space: “a man with glasses” minus “a man” plus “a woman” equals “a woman with glasses.”
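As a toy illustration of this arithmetic, the three-dimensional vectors below are made up so that each coordinate is easy to read (roughly "maleness", "glasses", "femaleness"). Real latent spaces have many more dimensions without such clean meanings, and the result would still have to be passed through the generator to get an image.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


# Hypothetical latent vectors, hand-picked for readability.
man_with_glasses = [0.9, 0.8, 0.1]
man = [0.9, 0.0, 0.1]
woman = [0.1, 0.0, 0.9]

# "man with glasses" - "man" + "woman"
woman_with_glasses = add(sub(man_with_glasses, man), woman)
```

The subtraction isolates the "glasses" direction, and the addition applies it to a different starting point.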

Changing Face Age With GANs

If you add a controlled parameter to the latent vector during training, you can change that parameter at generation time and so control the corresponding attribute of the generated image. This approach is called conditional GAN.

This is what the authors of the article Face Aging With Conditional Generative Adversarial Networks did. Having trained the engine on the IMDB dataset with known ages of actors, the researchers were able to change the apparent age of a person's face.
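A simple way to picture the conditioning is to concatenate the condition with the noise vector before it enters the generator. The sketch below assumes a one-hot age group as the condition; the function names, the number of groups, and the noise size are all hypothetical, and a real model may inject the condition differently.

```python
import random


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def generator_input(noise, age_group, num_groups):
    """Concatenate the random noise with the condition so the generator sees both."""
    return noise + one_hot(age_group, num_groups)


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4)]
# Same noise (same identity), different condition (different age group).
young = generator_input(noise, age_group=0, num_groups=3)
old = generator_input(noise, age_group=2, num_groups=3)
```

Keeping the noise fixed and changing only the condition is what lets you vary one attribute of the output while leaving the rest alone.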

Professional Photos

Google has found another interesting application of GANs: the selection and improvement of photos. The GAN was trained on a professional photo dataset: the generator tries to improve bad photos (professionally shot and then degraded with special filters), and the discriminator tries to distinguish “improved” photos from real professional ones.

A trained algorithm went through Google Street View panoramas in search of the best composition and produced some pictures of professional and semi-professional quality (as per photographers’ ratings).

Synthesis of an Image From a Text Description

An impressive example of GANs is generating images using text.

The authors of this research suggest embedding text into the input of not only a generator (conditional GAN) but also a discriminator so that it verifies the correspondence of the text to the picture. To make sure the discriminator learned to perform this function, they added to the training data pairs of real pictures with incorrect text.
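The construction of the discriminator's training examples can be sketched as follows. This is an assumption-laden illustration: the function name, the way a mismatched text is chosen (simply the next caption in the list), and the string placeholders for images are all hypothetical.

```python
def discriminator_training_examples(real_pairs, fake_images):
    """real_pairs: list of (image, text); fake_images: generator outputs
    for the same texts. Returns (image, text, label) triples."""
    examples = []
    n = len(real_pairs)
    for i, (image, text) in enumerate(real_pairs):
        wrong_text = real_pairs[(i + 1) % n][1]
        examples.append((image, text, 1))            # real image, matching text
        examples.append((image, wrong_text, 0))      # real image, mismatched text
        examples.append((fake_images[i], text, 0))   # generated image, matching text
    return examples


real_pairs = [("img_bird", "a red bird"), ("img_flower", "a yellow flower")]
fake_images = ["fake_0", "fake_1"]
examples = discriminator_training_examples(real_pairs, fake_images)
```

Without the mismatched pairs, the discriminator could score well by judging realism alone and ignoring the text.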


One of the eye-catching articles of 2016 was Image-to-Image Translation With Conditional Adversarial Networks (Pix2Pix) by Berkeley AI Research (BAIR). Researchers solved the problem of image-to-image generation: for example, creating a map from a satellite image or a realistic texture of objects from their sketch.

Here is another example of the successful performance of conditional GANs. In this case, the condition is the whole picture. UNet, popular in image segmentation, was used as the architecture of the generator, and a new PatchGAN classifier was used as the discriminator to combat blurred images (the picture is cut into N patches, and the fake/real prediction is made for each of them separately).
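The per-patch idea can be illustrated with a toy example. This is only a sketch of the concept: `score_patch` is a hypothetical placeholder (a mean intensity, not a learned classifier), and real implementations typically get the grid of patch predictions from a convolutional network rather than by cutting the image explicitly.

```python
def split_into_patches(image, patch_size):
    """Cut a 2D image (list of rows) into non-overlapping square patches."""
    patches = []
    for top in range(0, len(image), patch_size):
        for left in range(0, len(image[0]), patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


def score_patch(patch):
    """Hypothetical per-patch classifier: the mean intensity stands in
    for 'probability that this patch is real'."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [1.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0]]
patch_scores = [score_patch(p) for p in split_into_patches(image, patch_size=2)]
```

Judging each patch separately forces the generator to get local detail right everywhere, instead of only producing a plausible image on average.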

The authors released an online demo of their networks, which attracted great interest from the users.

You can find the source code here.


In order to apply Pix2Pix, you need a dataset with corresponding pairs of pictures from different domains. In the case of maps, for example, it is not a problem to assemble such a dataset. However, if you want to do something more complicated like “transfiguring” objects or styling, then pairs of objects cannot be found in principle.

Therefore, the authors of Pix2Pix decided to develop their idea and came up with CycleGAN for transfer between different domains of images without specific pairs, described in Unpaired Image-to-Image Translation.

The idea is to teach two generator-discriminator pairs to transfer an image from one domain to another and back, while requiring cycle consistency: after applying the two generators in sequence, we should get an image similar to the original, as measured by an L1 loss. The cycle loss is required to ensure that the generator does not simply produce pictures of the other domain that are completely unrelated to the original image.
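The cycle-consistency term can be written down in a few lines. In this sketch the two "generators" are hypothetical toy functions (brighten and darken) rather than trained networks, and images are flat lists of pixel values; only the structure of the loss is meant to carry over.

```python
def l1_distance(a, b):
    """Mean absolute difference between two images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def g_ab(image):
    """Hypothetical generator A -> B (toy: brighten)."""
    return [v + 0.2 for v in image]


def g_ba(image):
    """Hypothetical generator B -> A (toy: darken, a slightly imperfect inverse)."""
    return [v - 0.15 for v in image]


def cycle_loss(image_a, image_b):
    forward = l1_distance(g_ba(g_ab(image_a)), image_a)   # A -> B -> A
    backward = l1_distance(g_ab(g_ba(image_b)), image_b)  # B -> A -> B
    return forward + backward


loss = cycle_loss([0.1, 0.5, 0.9], [0.3, 0.6, 0.7])
```

Here each round trip shifts every pixel by 0.05, so the loss is small but nonzero; it would be exactly zero only if the two generators were perfect inverses of each other.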

This approach allows you to learn the mapping of horses to zebras.

Such transformations are unstable and often create unsuccessful options:

You can find the source code here.

Development of Molecules in Oncology

Machine learning is now coming to medicine. In addition to analyzing ultrasound and MRI scans and making diagnoses, it can be used to find new drugs to fight cancer.

We already reported in detail about this research. Briefly, with the help of Adversarial Autoencoder (AAE), you can learn the latent representation of molecules and then use it to search for new ones. As a result, 69 molecules were found, half of which are used to fight cancer, and the others have serious potential.


Adversarial attacks are an actively explored topic. What are adversarial attacks? Standard networks trained, for example, on ImageNet are completely unstable when special noise is added to the classified picture. In the example below, we see that the picture with noise is practically unchanged to the human eye, but the model goes crazy and predicts a completely different class.

Such an attack can be carried out with, for example, the fast gradient sign method (FGSM): having access to the parameters of the model, you can make one or several gradient steps towards the desired class and change the original picture.
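A single targeted FGSM step can be sketched on a tiny logistic-regression "model", where the gradient with respect to the input has a closed form. Everything here is an illustrative assumption: the weights, the input, and the step size `eps` are made up, and `eps` is far larger than the barely visible perturbations used against image classifiers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm_targeted(weights, bias, x, target, eps):
    """One targeted FGSM step. For cross-entropy loss on this model,
    d(loss)/dx = (p - target) * w."""
    p = predict(weights, bias, x)
    grad = [(p - target) * w for w in weights]
    # Move against the gradient to reduce the loss for the target class.
    return [v - eps * sign(g) for v, g in zip(x, grad)]


weights, bias = [2.0, -3.0, 1.0], 0.0  # hypothetical trained parameters
x = [0.2, 0.6, 0.1]                    # originally classified as class 0
x_adv = fgsm_targeted(weights, bias, x, target=1.0, eps=0.3)
before, after = predict(weights, bias, x), predict(weights, bias, x_adv)
```

Each input coordinate moves by exactly `eps` in the direction that most helps the target class, which is why the perturbation can be kept uniformly small while still changing the prediction.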

One of the tasks on Kaggle is related to this; the participants are encouraged to create universal attacks/defenses, which are all eventually run against each other to determine the best.

Why should we even investigate these attacks? First, if we want to protect our products, we can add noise to the CAPTCHA to prevent spammers from recognizing it automatically. Secondly, algorithms are more and more involved in our lives, with face recognition systems and self-driving cars. In this case, attackers can use the shortcomings of the algorithms.

Here is an example in which special glasses allow you to deceive a face recognition system and pass yourself off as another person. So, we need to take possible attacks into account when training models.

Similar manipulations of road signs also prevent them from being recognized correctly.


Stay tuned for Part 2, where we'll talk about 2017's advancements in reinforcement learning, 2017 deep learning news, and more.


Published at DZone with permission of Ed Tyantov . See the original article here.
