What matters: attention mechanism

People tend to focus on only part of an image, say a person in a photo. Similarly, for a given sequence of words, we should pay attention to a few keywords instead of treating every word equally. Take “this is an apple”: when you read it aloud, you will probably stress “apple” more than “is” or “an”, because “apple” carries the meaning of the sentence. In seq2seq models with attention (check this post if you need a refresher), we learn weights over the input words so that important words get a higher weight.

The attention mechanism was first applied to machine translation in the paper Neural Machine Translation by Jointly Learning to Align and Translate.


In this part, we keep the encoder as it is, but add special components to the decoder. In brief, when predicting the next word, we want to consider all the previously predicted words as well as the whole input sequence.
Here s_i is the hidden state of the decoder at step i, and h_j is the hidden state of the encoder at position j (we use a bi-LSTM, so h_j is actually the concatenation of the forward and backward hidden state vectors). The context vector is defined as c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, where \alpha_{ij} is the weight at decoding step i for encoder hidden state h_j. To be clear, every time we generate a word (say at step i), we compute a new context vector. Simply put, \alpha_{ij} indicates how much attention the model pays to hidden state h_j at step i; the weights are normalized so that they sum to 1 over j.

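The weights are obtained with a softmax over alignment scores: \alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik}), with e_{ij} = a(s_{i-1}, h_j). Here a is called the alignment model; it scores how well the decoder state s_{i-1} matches the encoder state h_j. It is a feedforward NN whose parameters are trained jointly with the rest of the model.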
Given the sequence generated so far, the decoder decides which parts of the input to attend to and how strongly. In effect, the attention mechanism expands the single sentence vector into multiple vectors, because we consider all the encoder hidden states and not only the last one. No alignment labels are needed: the alignment model is learned jointly with the rest of the network, so the whole process adapts itself during training.
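
To make the computation concrete, here is a minimal sketch in plain Python of one attention step: score every encoder state, normalize the scores with a softmax, and take the weighted sum. The additive scorer v^T tanh(W s + U h) is one common choice of feedforward alignment model; the helper names, matrix shapes, and toy values are assumptions made up for illustration.

```python
import math


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def alignment_score(s_prev, h_j, W, U, v):
    """e_ij = v^T tanh(W s_{i-1} + U h_j), a one-hidden-layer feedforward scorer."""
    ws = matvec(W, s_prev)
    uh = matvec(U, h_j)
    return sum(vk * math.tanh(a + b) for vk, a, b in zip(v, ws, uh))


def softmax(scores):
    """Normalize scores into weights that sum to 1 (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


def context_vector(s_prev, encoder_states, W, U, v):
    """Return (alphas, c_i) with c_i = sum_j alpha_ij * h_j."""
    scores = [alignment_score(s_prev, h, W, U, v) for h in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    c = [sum(a * h[k] for a, h in zip(alphas, encoder_states)) for k in range(dim)]
    return alphas, c


# Toy values (assumed): 2-dimensional states, 3 encoder positions.
W = [[0.5, -0.2], [0.1, 0.3]]
U = [[0.4, 0.0], [-0.3, 0.2]]
v = [1.0, -1.0]
s_prev = [0.2, -0.1]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

alphas, c = context_vector(s_prev, encoder_states, W, U, v)
print("alphas:", alphas)
print("context:", c)
```

In a real model, W, U and v would be learned by backpropagation together with the encoder and decoder rather than fixed by hand.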

PyTorch Code

As I mentioned in the previous post, PyTorch has a nice tutorial on seq2seq with attention.

Attention Decoder
The result of torch.bmm is the context vector c_i (the weighted sum of the encoder hidden states), and attn_weights is \alpha.
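
For readers without the tutorial at hand, below is a minimal sketch of a single attention decoder step, written loosely in the style of that tutorial. The class name AttnDecoderStep, the layer sizes, and the random inputs are assumptions for illustration, not the tutorial's exact code. Note that this simplified variant computes attn_weights from the embedded input and the decoder state only, which differs from the alignment model a(s_{i-1}, h_j) that scores each encoder state separately.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnDecoderStep(nn.Module):
    """One decoding step with attention over encoder outputs (hypothetical sketch)."""

    def __init__(self, hidden_size, output_size, max_length):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        # Produces one score per source position from [embedded input; decoder state].
        self.attn = nn.Linear(hidden_size * 2, max_length)
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input_token, hidden, encoder_outputs):
        # input_token: (1,) long, hidden: (1, 1, H), encoder_outputs: (max_length, H)
        embedded = self.embedding(input_token).view(1, 1, -1)
        # attn_weights plays the role of alpha: shape (1, max_length), sums to 1.
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), dim=1)), dim=1)
        # bmm takes the weighted sum of the encoder outputs: the context vector c_i.
        context = torch.bmm(attn_weights.unsqueeze(0),
                            encoder_outputs.unsqueeze(0))  # (1, 1, H)
        # Merge the embedded input with the context vector before the GRU.
        gru_input = torch.cat((embedded[0], context[0]), dim=1)
        gru_input = F.relu(self.attn_combine(gru_input)).unsqueeze(0)
        output, hidden = self.gru(gru_input, hidden)
        log_probs = F.log_softmax(self.out(output[0]), dim=1)
        return log_probs, hidden, attn_weights


# Toy usage with assumed sizes: 5 source positions, hidden size 8, vocabulary of 20.
torch.manual_seed(0)
decoder = AttnDecoderStep(hidden_size=8, output_size=20, max_length=5)
encoder_outputs = torch.randn(5, 8)
hidden = torch.zeros(1, 1, 8)
log_probs, hidden, attn_weights = decoder(torch.tensor([0]), hidden, encoder_outputs)
print(attn_weights.shape, log_probs.shape)
```

Calling the module once per output token and feeding back the returned hidden state gives a new context vector at every decoding step.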

In brief, in plain seq2seq we used the last encoder hidden state to initialize the decoder state, so a single vector had to represent the whole input sequence. With attention, we consider every encoder hidden state and take a linear combination of them at each generation step of the decoder.

How it helps with Summarization

Besides machine translation, attention also helps with the summarization task, as shown in the paper A Neural Attention Model for Abstractive Sentence Summarization.

Attention-based Summarization System

Unlike a traditional noisy-channel approach, the neural model combines a neural probabilistic language model with an encoder that acts as a conditional summarization model.

Resources & Reference

Tutorial slides from Graham Neubig (CMU)
Seq2seq course from
