# To copy or not, that is the question: copying mechanism

In daily conversation we often repeat something mentioned earlier in the dialogue, such as the name of a person or an organization: “Hi, my name is Pikachu.” “Hi, Pikachu, …” There is a high probability that the word “Pikachu” is not in the vocabulary extracted from the training data. In the paper Incorporating Copying Mechanism in Sequence-to-Sequence Learning, the authors propose CopyNet, which brings a copying mechanism to seq2seq models with an encoder-decoder structure. See my earlier post for the prerequisite knowledge.

### Structure

Encoder: a bi-directional RNN is used. We define $h_t$ to be the RNN’s hidden state at position $t$, and $c=\phi(\left\{h_1,...,h_{T_s} \right\})$ to be the context vector, where $\phi$ summarizes the hidden states.
Note that we also keep the whole collection of hidden states, $M=\left\{ h_1,...,h_{T_s} \right\}$, as the short-term memory of the words from $x_1$ to $x_{T_s}$, where $T_s$ is the length of the input sentence. $M$ is considered to be the new representation of the sentence. We also define $s$ to be the decoder state; normally, at step $t$, we have $s_t = f(y_{t-1},s_{t-1},c)$, which considers the previous output of the decoder, the previous decoder state and the context vector.
Decoder: it accesses $M$ in multiple ways when predicting the output. The rest of this post explains how the decoder works.
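
To make the memory $M$ concrete, here is a minimal sketch of a bi-directional encoder. It is an illustration only, not the paper's implementation: it assumes a plain tanh RNN cell with random weights, and the helper names (`matvec`, `rnn_pass`, `encode`, `rand_matrix`) are made up for this post.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_pass(inputs, W_x, W_h):
    """Run a plain tanh RNN over the inputs and return every hidden state."""
    h = [0.0] * len(W_h)
    states = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, h))]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

def encode(inputs, fwd, bwd):
    """Build M: one vector per position, forward and backward states concatenated."""
    h_fwd = rnn_pass(inputs, *fwd)
    h_bwd = rnn_pass(inputs[::-1], *bwd)[::-1]  # re-align with the input order
    return [f + b for f, b in zip(h_fwd, h_bwd)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T_s, d_in, d_h = 5, 4, 3  # toy sizes, chosen arbitrarily
x = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(T_s)]
fwd = (rand_matrix(d_h, d_in, rng), rand_matrix(d_h, d_h, rng))
bwd = (rand_matrix(d_h, d_in, rng), rand_matrix(d_h, d_h, rng))
M = encode(x, fwd, bwd)  # T_s vectors, each of size 2 * d_h
```

Each row of `M` plays the role of one $h_\tau$; the decoder can read these rows both through attention and through the copy mechanism described next.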

### Prediction

There are two natural ways to produce a word: copy it from the input sentence or generate it from a vocabulary. We define $\nu$ as our vocabulary, a list of frequently used words; $X$ as the set of unique words in the input sentence (a different set for every sentence); and a special token UNK that represents all other, unknown words.
To produce a new word $y_t$ at decoding step $t$, we consider both generate-mode and copy-mode by simply adding their probabilities together.
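
Written as a formula, the mixture looks roughly like this. It is a reconstruction in this post's notation; $\mathsf{g}$ and $\mathsf{c}$ mark generate-mode and copy-mode, and "$\cdot$" abbreviates the conditioning on the decoder state, the previous output, the context vector and $M$:

$$
p(y_t \mid \cdot) = p(y_t, \mathsf{g} \mid \cdot) + p(y_t, \mathsf{c} \mid \cdot)
$$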

We define the two types of probabilities below:
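
Following equations (5) and (6) of the paper, rewritten here in this post's notation (a reconstruction; $Z = \sum_{v \in \nu \cup \{\text{UNK}\}} e^{\psi_g(v)} + \sum_{x \in X} e^{\psi_c(x)}$ is the normalization term shared by both modes):

$$
p(y_t, \mathsf{g} \mid \cdot) =
\begin{cases}
\frac{1}{Z} e^{\psi_g(y_t)}, & y_t \in \nu \\
0, & y_t \in X \cap \bar{\nu} \\
\frac{1}{Z} e^{\psi_g(\text{UNK})}, & y_t \notin \nu \cup X
\end{cases}
\tag{5}
$$

$$
p(y_t, \mathsf{c} \mid \cdot) =
\begin{cases}
\frac{1}{Z} \sum_{j:\, x_j = y_t} e^{\psi_c(x_j)}, & y_t \in X \\
0, & \text{otherwise}
\end{cases}
\tag{6}
$$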

where $\psi(\cdot)$ is the scoring function. For generating, the word is either in the vocabulary (the first line) or an unknown word (the third line). For copying, we only consider the case where the word appears in the input sentence (the fourth line). To be more specific, there are four cases, depending on whether the word $y_t$:
* is UNK
* is a word from vocabulary $\nu$ only
* is a word from $X$ only
* is a word from $X$ and vocabulary $\nu$ at the same time
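
The four cases can be sketched as a small lookup that reports which modes contribute probability mass to a word. The function name `contributing_modes` and the toy vocabulary are assumptions made for illustration.

```python
def contributing_modes(word, vocab, source_words):
    """Return which modes can produce `word` for one input sentence."""
    in_vocab = word in vocab
    in_source = word in source_words
    if in_vocab and in_source:
        return {"generate", "copy"}   # both probabilities are added
    if in_vocab:
        return {"generate"}           # vocabulary word not in the input
    if in_source:
        return {"copy"}               # OOV word that appears in the input
    return {"generate-UNK"}           # neither: falls back to the UNK token

vocab = {"hi", "my", "name", "is"}             # toy vocabulary
source = "hi my name is Pikachu".split()       # toy input sentence
modes_for_name = contributing_modes("Pikachu", vocab, source)
```

Here "Pikachu" is out of vocabulary but present in the input, so only copy-mode can produce it.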

Now we will define the scoring functions.
Generate-Mode: when the word is from the vocabulary $\nu$ or is UNK, we have:
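
As a reconstruction from the paper in this post's notation (the paper names the matrix $W_o$; $\mathbf{v}_i$ denotes the one-hot indicator vector of the word $v_i$):

$$
\psi_g(y_t = v_i) = \mathbf{v}_i^\top W s_t, \qquad v_i \in \nu \cup \{\text{UNK}\}
$$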

where $W$ is a trainable matrix.
Copy-Mode: when the word is from the original input sentence, we have:
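
Again as a reconstruction from the paper, where $W_c$ is a second trainable matrix and $\sigma$ is a nonlinear activation (the paper reports that tanh worked better than a linear map):

$$
\psi_c(y_t = x_j) = \sigma\left(h_j^\top W_c\right) s_t, \qquad x_j \in X
$$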

In equation (6), we sum over all matching positions, because a word may occur multiple times in the input sentence.
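
Putting the two modes together, the sketch below computes the final word probabilities from already-computed scores. It is a simplified illustration under stated assumptions: the scores are given as plain numbers rather than produced by a trained model, the normalizer sums copy scores over input positions, and the name `copynet_probs` is hypothetical.

```python
import math

def copynet_probs(gen_scores, copy_scores, source_words):
    """Combine generate-mode and copy-mode scores under one shared normalizer.

    gen_scores:   dict mapping each vocabulary word (and "UNK") to psi_g
    copy_scores:  list with one psi_c per input position
    source_words: the input words, aligned with copy_scores
    """
    # Subtract the maximum score for numerical stability; this cancels out in the ratio.
    shift = max(list(gen_scores.values()) + list(copy_scores))
    gen_exp = {w: math.exp(s - shift) for w, s in gen_scores.items()}
    copy_exp = [math.exp(s - shift) for s in copy_scores]
    Z = sum(gen_exp.values()) + sum(copy_exp)

    probs = {w: e / Z for w, e in gen_exp.items()}   # generate-mode part
    for w, e in zip(source_words, copy_exp):         # copy-mode part
        probs[w] = probs.get(w, 0.0) + e / Z         # repeated words accumulate
    return probs

gen_scores = {"hi": 1.0, "my": 0.5, "UNK": 0.0}      # toy scores
source_words = ["hi", "Pikachu", "Pikachu"]
copy_scores = [0.2, 2.0, 1.5]
probs = copynet_probs(gen_scores, copy_scores, source_words)
```

In this toy input, "Pikachu" receives the copy mass of both of its positions, while "hi" receives generate mass plus the copy mass of its single position.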

### Update States

We introduced the decoder state $s_t = f(y_{t-1},s_{t-1},c)$ for a generic attention-based seq2seq model. CopyNet changes how $y_{t-1}$ is represented so that it carries more information. In the paper, it is defined as $[e(y_{t-1}),\xi(y_{t-1})]^T$, where the first term is the embedding of the word $y_{t-1}$ and the second is a weighted sum of $M$:
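
As a reconstruction from the paper, using this post's symbol $\xi$ for the selective read:

$$
\xi(y_{t-1}) = \sum_{\tau=1}^{T_s} \rho_{t\tau} h_\tau, \qquad
\rho_{t\tau} =
\begin{cases}
\frac{1}{K}\, p(x_\tau, \mathsf{c} \mid s_{t-1}, M), & x_\tau = y_{t-1} \\
0, & \text{otherwise}
\end{cases}
$$

with $K = \sum_{\tau':\, x_{\tau'} = y_{t-1}} p(x_{\tau'}, \mathsf{c} \mid s_{t-1}, M)$.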

The authors call $\xi$ the selective read. The word may occur at multiple positions in the input, so we sum the corresponding hidden states with weights. Here $K$ is a normalization term that makes $\rho$ a probability distribution, playing the same role as $Z$ in equations (5) and (6).
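
The selective read can be sketched as follows. This is a minimal illustration: `copy_probs` stands for the per-position copy-mode probabilities from the previous step, the function name is hypothetical, and returning a zero vector when the previous word does not occur in the input is an assumption of this sketch.

```python
def selective_read(prev_word, source_words, copy_probs, memory):
    """Weighted sum of the memory rows whose input word equals prev_word."""
    dim = len(memory[0])
    matches = [i for i, w in enumerate(source_words) if w == prev_word]
    K = sum(copy_probs[i] for i in matches)   # normalization term
    if not matches or K == 0.0:
        return [0.0] * dim                    # assumption: nothing to read
    out = [0.0] * dim
    for i in matches:
        rho = copy_probs[i] / K               # weights over matches sum to 1
        for d in range(dim):
            out[d] += rho * memory[i][d]
    return out

source_words = ["a", "b", "a"]
copy_probs = [0.1, 0.5, 0.3]                  # toy copy-mode probabilities
memory = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]  # toy M, one row per position
xi = selective_read("a", source_words, copy_probs, memory)
```

The result would then be concatenated with the embedding $e(y_{t-1})$ to form the decoder input.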

### Experiments

Synthetic dataset: in the paper, the authors use a set of rules to construct a dataset of simple copying patterns, for example:

* abcdxef -> cdxg
* abydxef -> xdig

Compared with the Enc-Dec and RNNSearch models, CopyNet achieves competitive accuracy. (Please check the paper for more details.)

Summarization: an example result is shown below.

The Chinese text is shown with its word segmentation, and the underlined words are OOV. The highlighted words are the copied ones (those whose copy-mode probability is higher than their generate-mode probability).