When looking at an image, people focus on only a part of it, say a person in a photo. Similarly, for a given sequence of words, we should pay attention to a few keywords instead of treating each word equally. For example, when you read "this is an apple" aloud, I am sure you will stress "apple" rather than "is" or "an", because you naturally pay attention to the word "apple", the meaningful word in this sentence. In seq2seq models (check this post if you forget), attention learns a weight for each word, so that the important words get higher weights.
In the paper Neural Machine Translation by Jointly Learning to Align and Translate, the attention mechanism was first introduced for the task of machine translation.
In this part, we keep the encoder as it is, but add special components to the decoding part. In brief, when predicting the next word, we want to take into account all the previously predicted words and the whole input sequence.
Here $s_i$ is the hidden state of the decoder at step $i$, and $h_j$ is the hidden state of the encoder at position $j$ (we use a bi-LSTM as the encoder, so $h_j$ is actually a concatenation of the forward and backward hidden state vectors). The context vector is defined as

$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j,$$

where $\alpha_{ij}$ is the weight at decoding step $i$ for encoder hidden state $j$. To be clear, every time we want to generate a word (say at step $i$), we compute a new context vector. Simply put, $\alpha_{ij}$ indicates how much attention the model pays to a certain hidden state at a certain step, and we normalize the weights to sum up to 1:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j),$$

where $a$ is called an alignment model, scoring how well $s_{i-1}$ and $h_j$ match. It can be seen as a feedforward NN whose parameters are jointly trained with the rest of the model.
Given the sequence generated so far, the decoder decides which parts of the input it should pay attention to, and how much. In fact, the attention mechanism expands the single sentence vector into multiple vectors, because we consider all the encoder hidden states rather than only the last one! No separate training is needed for the alignment model; its parameters are learned jointly with everything else, so the whole process is self-adapted.
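Before looking at the official tutorial code, here is a minimal PyTorch sketch of such an additive alignment model; the class name, layer names, and dimensions are my own illustrative choices, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Sketch of the alignment model a(s_{i-1}, h_j) as a small feedforward net."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)  # projects the decoder state s_{i-1}
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)  # projects the encoder states h_j
        self.v = nn.Linear(attn_dim, 1, bias=False)          # turns each score into a scalar e_{ij}

    def forward(self, s_prev, enc_outputs):
        # s_prev:      (batch, dec_dim)     previous decoder hidden state s_{i-1}
        # enc_outputs: (batch, T, enc_dim)  all encoder hidden states h_1 .. h_T
        scores = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_outputs)))  # (batch, T, 1)
        alpha = F.softmax(scores.squeeze(-1), dim=1)          # alpha_ij, sums to 1 over j
        context = torch.bmm(alpha.unsqueeze(1), enc_outputs)  # (batch, 1, enc_dim): sum_j alpha_ij * h_j
        return context.squeeze(1), alpha
```

The tanh-then-linear scoring is one common form of the alignment model; since it is an ordinary `nn.Module`, its parameters are trained jointly with the encoder and decoder by backpropagation.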
PyTorch Code
As I mentioned in the previous blog, PyTorch has a nice tutorial on seq2seq with an attention module.
The output of `bmm` is the context vector $c_i$, and `attn_weights` is the $\alpha_{ij}$.
In brief, in plain seq2seq we used the last hidden state of the encoder to initialize the decoder, thus representing the whole input sequence with a single vector. With attention, we keep every encoder hidden state and take a linear combination of them at every generation step of the decoding part, as in the toy run below.
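Here is a toy run of the `AdditiveAttention` sketch from above, only to show that a fresh context vector is produced at every decoding step and that the weights sum to 1 (all shapes and numbers here are made up for illustration):

```python
import torch

# Toy dimensions, chosen purely for illustration
batch, T, enc_dim, dec_dim = 2, 7, 512, 256
attn = AdditiveAttention(dec_dim, enc_dim, attn_dim=128)

enc_outputs = torch.randn(batch, T, enc_dim)  # bi-LSTM outputs: one h_j per source position
s_prev = torch.randn(batch, dec_dim)          # decoder hidden state from the previous step

context, alpha = attn(s_prev, enc_outputs)
print(context.shape)     # torch.Size([2, 512]): the context vector c_i for this step
print(alpha.sum(dim=1))  # roughly tensor([1., 1.]): attention weights sum to 1 over the input
```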
How it helps with Summarization
Besides machine translation, attention also helps with the summarization task, as shown in this paper: A Neural Attention Model for Abstractive Sentence Summarization.

Different from a traditional noisy-channel approach, the neural model combines a neural probabilistic language model with an encoder, forming a conditional summarization model.
Resources & Reference
Tutorial slides from Graham Neubig (CMU)
Seq2seq course from deeplearning.ai