RNN, Seq2Seq, Transformers: Introduction to Neural Architectures Commonly Used in NLP

In this article, we’ll give you a high-level introduction to deep learning in NLP, and we’ll explain, briefly, how an RNN and a transformer work.

By Rostyslav Demush · Updated Aug. 08, 19 · Opinion

Just a few years ago, RNNs and their gated variants (that added multiplicative interactions and mechanisms for better gradient transfer) were the most popular architectures used for NLP.

Prominent researchers, such as Andrej Karpathy, were singing odes to RNNs' unreasonable effectiveness, and large corporations were keen on adopting the models to put them into virtual agents and other NLP applications.

Now that Transformers (BERT, GPT-2) have come along, the community rarely even mentions recurrent nets.

In this article, we’ll give you a high-level introduction to deep learning in NLP; we’ll explain, briefly, how an RNN and a transformer work, and which specific properties of the latter make it a better architecture for a wide range of NLP tasks.

Here we go!

Let’s start with RNNs and why, up until recently, they were considered special.

Recurrent neural networks are a family of neural architectures with a cool property — a looping mechanism — that makes them a natural choice for processing sequential data of variable length. RNNs, unlike standard feed-forward networks, can hang on to information from previous time steps while taking in a new input.

Here’s how it works

Say we’re building an e-commerce chatbot consisting of an RNN that processes text and a feed-forward net that predicts the intent behind it. The bot receives this message: "Hi! Do you guys have this shirt in any different colors?”

We have, as our input, 11 words (11 word embeddings), and the sequence, chopped into tokens, looks like this: I1, I2, …, I11.

The core idea behind an RNN is that it applies the same weight matrix to every single input and produces a sequence of hidden states (there will be as many of them as there are inputs) that carry information from previous time steps.

Each hidden state (Ht) is computed based on the previous hidden state (Ht-1) and the current input (It); as we’ve mentioned, it’s really the same state being modified over and over at each time step.

So, the processing starts with the first word embedding (I1) going into the model along with the initial hidden state (H0); inside the RNN's first unit, a linear transformation is performed on both I1 and H0, a bias is added, and the final value is put through some kind of nonlinearity (sigmoid, ReLU, etc.) — that’s how we get H1.

Afterwards, the model eats I2 paired with H1 and performs the same computations, then I3 with H2 go in, followed by I4 with H3, and so on until we crunch through the entire sequence.
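To make the recurrence concrete, here is a minimal NumPy sketch of the loop described above (toy dimensions, random weights, and tanh as the nonlinearity; it's an illustration of the mechanics, not the chatbot's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                              # embedding size, hidden-state size (arbitrary)
W_x = rng.standard_normal((d_h, d_in)) * 0.1   # input-to-hidden weights
W_h = rng.standard_normal((d_h, d_h)) * 0.1    # hidden-to-hidden weights, reused at every step
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    # H_t = tanh(W_h @ H_{t-1} + W_x @ I_t + b)
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

inputs = rng.standard_normal((11, d_in))       # stand-ins for the 11 word embeddings I1..I11
h = np.zeros(d_h)                              # H0, the initial hidden state
hidden_states = []
for x_t in inputs:                             # the same weight matrices are applied at every step
    h = rnn_step(h, x_t)
    hidden_states.append(h)
```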

Since we’re using the same weight matrix over and over, the RNN can work with lengthy sequences and not increase in size itself. Another advantage is that, theoretically, each time step can have access to data from many steps ago.

The issues

RNN’s distinctive feature — that it uses the same layer many times — is also what makes it extremely prone to the effects of vanishing and exploding gradients. In practice, it’s too difficult for these networks to preserve data over many steps.

Also, RNNs don’t see any hierarchy in a sequence. The model alters the hidden state every time a new input is processed, regardless of how insignificant that input might be. Therefore, the data from earlier time steps may end up being completely washed out by the time the network gets to the end of the sequence.

This means that in our example “Hi! Do you guys have this shirt in any different colors?” the feed-forward net may be left trying to predict the intent based on just “any different colors?” which wouldn’t be easy to do even for a human.

Another inherent disadvantage lies in the nature of sequential processing: since parts of the input are being processed one at a time (we can’t compute H2 unless we have H1) the network’s computations, overall, are very slow.

The gated variants

To fight the issues discussed above, different architectural modifications have been proposed to improve RNNs, the most popular being the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU).

The main idea behind LSTM, roughly speaking, is to have a cell state — a memory storage — inside every unit in addition to the hidden state (they are both vectors of the same size).

Also, these models have three gates (forget gate, input gate, output gate) that determine which info to write, read, or erase from the cell state.

All the gates are vectors of the same length as the hidden state, and here’s exactly what they’re for:

  • The forget gate determines what should be kept vs. what should be erased from the previous time step.
  • The input gate determines what new information should be let into the cell state.
  • The output gate determines what data from the cell state should be incorporated into the hidden state.

They’re all computed using the sigmoid function so they always output values between 0 and 1.

If a gate produces something closer to 1, it's considered open (data can be included in the cell state) and if it gives out a value closer to 0, the information is to be ignored.
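Roughly, a single LSTM step looks like this in NumPy (toy sizes, random weights, biases omitted for brevity; each gate squashes the concatenated previous hidden state and current input through a sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d_in, d_h = 8, 16
# One weight matrix per gate plus one for the candidate cell content,
# each acting on the concatenation [h_prev, x_t].
W_f, W_i, W_o, W_c = (rng.standard_normal((d_h, d_h + d_in)) * 0.1 for _ in range(4))

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)              # forget gate: what to erase from the cell state
    i = sigmoid(W_i @ z)              # input gate: what new information to write
    o = sigmoid(W_o @ z)              # output gate: what to expose in the hidden state
    c_tilde = np.tanh(W_c @ z)        # candidate cell content
    c = f * c_prev + i * c_tilde      # updated cell state (the memory storage)
    h = o * np.tanh(c)                # updated hidden state
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(h, c, rng.standard_normal(d_in))
```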

GRUs operate similarly to LSTMs, but they’re way simpler in architecture; they do away with the cell state and compute only two gates instead of three on the way to updating the hidden state.

The point of GRUs is to retain the power and robustness of LSTMs (in terms of mitigating vanishing gradients) while getting rid of their complexity. The GRU’s gates, sketched in code after this list, are:

  • The update gate determines what parts of the hidden state should be modified vs. what parts should be preserved. In a way, it does what the input and forget gates do in LSTMs.
  • The reset gate determines what parts of the previous hidden state matter now. If it outputs a number closer to 1, we can just copy the previous state and free the network from having to update the weights (no weight adjustments, no vanishing gradient).
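And here is the corresponding sketch for a single GRU step (same toy setup as the LSTM example; note that there is no cell state, just the two gates, and the exact gating convention varies slightly across implementations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d_in, d_h = 8, 16
W_z, W_r, W_n = (rng.standard_normal((d_h, d_h + d_in)) * 0.1 for _ in range(3))

def gru_step(h_prev, x_t):
    zx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ zx)                                        # update gate: modify vs. preserve
    r = sigmoid(W_r @ zx)                                        # reset gate: how much of the past matters now
    h_tilde = np.tanh(W_n @ np.concatenate([r * h_prev, x_t]))   # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde                        # blend of old state and candidate

h = gru_step(np.zeros(d_h), rng.standard_normal(d_in))
```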

Both LSTMs and GRUs are capable of controlling the flow of information, grasping long-range dependencies, and letting error signals flow at different strengths depending on the input.

Sequence-to-sequence (seq2seq) models and attention mechanisms

Sequence-to-sequence models, once so popular in the domain of neural machine translation (NMT), consist of two RNNs — an encoder and a decoder — stacked together.

The encoder processes the inputs sequentially and produces a thought vector that holds data from every time step. Its output is then passed on to the decoder, which uses that context to predict an appropriate target sequence (a translation, a chatbot’s reply, etc.).

The problem with a vanilla seq2seq, however, is that it tries to cram the context of the entire input into one fixed-size vector, and there’s a limit to how much data that vector can carry.

Here’s where attention mechanisms help. They allow the decoder network to focus on relevant parts of the input when producing the output. They do so by adding an extra input to each decoding step that comes from the encoding steps.
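Here's a minimal sketch of that extra input at one decoding step, assuming simple dot-product scoring and already-computed encoder hidden states (random vectors stand in for real states here):

```python
import numpy as np

rng = np.random.default_rng(3)
encoder_states = rng.standard_normal((11, 16))   # one hidden state per input token
decoder_state = rng.standard_normal(16)          # the decoder's current hidden state

scores = encoder_states @ decoder_state          # how relevant each input position is right now
weights = np.exp(scores - scores.max())
weights /= weights.sum()                         # softmax over the input positions
context = weights @ encoder_states               # the extra input fed into this decoding step
```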

RNNs’ fall and transformers

Yes, we can lengthen the short-term memory of RNNs using LSTM units and even give them a long memory using attention. But we still can’t completely eliminate the impact of vanishing gradients, make these models (whose design inhibits parallel computing) faster, or get them to explicitly model long-range dependencies and hierarchy in sequences.

The Transformer, a model introduced by researchers from Google in 2017, surmounts all of these RNN shortcomings. This new and revolutionary architecture enables us to do away with recurrent computations and achieve state-of-the-art results in a wide range of NLP tasks — NMT, question answering, etc. — by relying solely on attention mechanisms.

A transformer, too, consists of an encoder and a decoder. It actually has a stack of encoders on one side and a stack of decoders (with the same number of units) on the other side.

Encoders

Each encoder unit consists of a self-attention layer and a feed-forward layer.

Self-attention is the mechanism that allows a cell to compare the content of an input to all other inputs in the sequence and incorporate the relationships between them into the embedding. If we’re talking about a word, self-attention lets the model represent which other words in the sentence it has strong relationships with.

In the transformer model, each position can interact with all the other positions in the input simultaneously; the network's computations are trivial to parallelize.

Self-attention layers are further enhanced by the multi-head attention mechanism, which improves the model’s ability to focus on various positions and enables it to create representation subspaces (by applying different weight matrices to the same input).
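To make this tangible, here's a minimal single-head scaled dot-product self-attention pass in NumPy (toy sizes, random projection matrices); multi-head attention simply runs several such heads, each with its own W_q, W_k, and W_v, and concatenates their outputs:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
seq_len, d_model, d_k = 11, 32, 8
X = rng.standard_normal((seq_len, d_model))       # one embedding per position in the sequence
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v               # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                   # every position compared with every other position
A = softmax(scores, axis=-1)                      # attention weights; each row sums to 1
Z = A @ V                                         # each output mixes information from the whole sequence
```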

To establish the order of the input, Transformers add another vector to each embedding (this is known as positional encoding), which helps them recognize the position of each input in a sequence as well as distances between them.
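One common choice is the sinusoidal scheme from the original Transformer paper; here's a small sketch (the sequence length and embedding size below are arbitrary):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                        # positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]                          # embedding dimensions
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimensions get sines
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimensions get cosines
    return pe

embeddings = np.random.default_rng(5).standard_normal((11, 32))
inputs_with_positions = embeddings + positional_encoding(11, 32)   # position info added to each embedding
```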

Each encoder pushes its output up to the unit directly above it.

On the decoder side, the cells, too, have a self-attention layer, a feed-forward layer, and one additional element — an encoder-decoder attention layer — in between. The decoder component in Transformers takes in the output from the top encoder — a series of attention vectors — and uses it to focus on relevant parts of the input when predicting target sequences.

Overall, transformers are lighter models than RNNs; they’re easier to train, they lend themselves well to parallelization, and they can learn long-range dependencies.

Concluding notes

The transformer architecture has become the base for many groundbreaking models.

Google researchers used the ideas from the "Attention Is All You Need" paper to develop BERT, a powerful language representation model that can be easily adapted to various NLP tasks (through the addition of just one fine-tuned output layer), and OpenAI scientists have managed to create GPT-2, an unbelievably coherent language model that, according to them, is too dangerous to be released.

The multi-head attention technique is now being tried out in various areas of research. Soon, we might see it transform multiple industries in a profound way. This will be exciting.

Opinions expressed by DZone contributors are their own.
