Posted in Theory

What’s next: seq2seq models (1)

The short blog contains my notes from Seq2seq Tutorial. Please leave comments if you are interested in this topic.

Seq2seq model is a typical model which takes sequences as inputs and another sequence as the outputs. What are sequences? Think about an article as a sequence of words or a video file as a sequence of images. The model has been successfully applied in many NLP tasks like machine translation, summarization, etc. In terms of machine translation, we take a sequence of English words, for example, and we wish to translate the sentence into French, thus the output would be a sequence of French words. For summarization, we can take a long article as the input sequence and the output sequence would be the summary of the article, say the headline of a news.

Language Models

To grasp the idea of seq2seq model, we should know the two prerequisite topics: word embeddings and recurrent language models. Word embeddings transform a single word into a dense vector (my old post to review) so we can have the inputs as a sequence of vectors. Now we will take a look at the recurrent language models.

There was an early work A Neural Probabilistic Language Model talking about neural language models. In general, as the tutorial shows:

Predicting the next target word.

If we want to predict the target word (“mat”) in this case, we should know all previous word sequence, so then the conditional probability gives the distribution of the target words and we will need the one with the maximum probability. Also, it is shown here:

where w_i^j stands for the subsequence of w_i tow_j. For a short sequence which contains only 3 words, we can simply write it as:

\hat P (w_1,w_2,w_3) = \hat P(w_1) \hat P(w_2|w_1)\hat P(w_3|w_1,w_2)

With n-gram models, the current word is only related to its previous (n-1) words. By applying the chain rule, we have:

where you can imagine in our previous case,\hat P(w_3|w_1,w_2)\approx\hat P(w_3|w_2).

Typically, a seq2seq model consists of an encoder and a decoder. In the literature of deep learning, we use RNNs for both components. The following screenshot shows a nice example:


The blue part shows the encoder which takes a sequence (ABCD) as inputs, andv is the hidden state at the last timestamp. The purple part shows the decoder who outputs a sequence (XYZQ) by taking the last hidden state of encoder v as its initial hidden state. So here v is considered to have all the information of the input sequence.

To formulate the model, x_1..x_T is our input sequence, the output or target sequence is y_1..y_{T'}, because we would have different lengths (T ,T') for the sequences. The goal is to find the optimal target sequence which maximizes the conditional probability.

The vanilla encoder and decoder

Following nice graphs are from here.

Vanilla Encoder
Vanilla Decoder

Keep in mind that there is a softmax operation in the decoder part before the outputs since it is generating a probability distribution over the vocabulary.


  • Generate word by word from left to right. It is natural that we are generating the sequence word by word at each timestep, however, we may miss the best sequence (combination of the words). We can apply beam search to tackle this problem. For more info please check this paper: Beam Search Strategies for Neural Machine Translation
  • Fixed size embedding. The vector v captures all the information of the input sequence. Is it too much? There is a better way: Attention Mechanism!

Some tutorials on seq2seq models with code:
Tensorflow NMT
PyTorch Translation with attention (I do recommend this!)

In the following posts,  I will introduce:

  • Attention Mechanism
  • Copying Mechanism
  • Pointer Networks
  • Works on Summarization
Posted in Deep Learning, Theory

Deep Learning 16: Understanding Capsule Nets

This post is the learning notes from Prof Hung-Yi Lee‘s lecture, the pdf could be found here (page40-52). I have read few articles, and I found this is a must-read. It is simple, and you can easily understand what is going on. I would say it is a good starting point for further readings.

Paper link: Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, “Dynamic Routing Between Capsules”, NIPS, 2017

Continue reading “Deep Learning 16: Understanding Capsule Nets”

Posted in Natural Language Processing, Problem Shooting, Python, Theory

Working with ROUGE 1.5.5 Evaluation Metric in Python

If you use ROUGE Evaluation metric for text summarization systems or machine translation systems, you must have noticed that there are many versions of them. So how to get it work with your own systems with Python? What packages are helpful? In this post, I will give some ideas based on engineering’s view (which means I am not going to introduce what is ROUGE). I also suffered from few issues and finally got them solved. My methods might not be the best ways but they worked.

Continue reading “Working with ROUGE 1.5.5 Evaluation Metric in Python”

Posted in Algorithm, Deep Learning, Machine Learning, Theory

Reinforcement Learning (1): Q-Learning basics

Hi! In the following posts, I will introduce Q-Learning, the first part to learn if you want to pick up reinforcement learning. But before that, let us shed light on some fundamental concepts in reinforcement learning (RL).

Kindergarten Example

Q-Learning works in this way: do an action and get reward and observation from the environment, as shown below. Image is taken from here :

Berkeley’s CS 294: Deep Reinforcement Learning by John Schulman & Pieter Abbeel

Imagine a baby boy in a kindergarten and how he performs on the first day. He does not know the kindergarten and knows nothing about how to behave. So he begins with random actions, say he hits the other kids, and when he performed this, he has no idea if it is right or not. Then the teacher becomes mad and gives him a punishment (a negative reward), then he knows hits others is not a good action; in the next time, the boy washes his lunch box, and the teacher rewards him with candy, then he knows this action is a good one. So in our kindergarten example, simply the Agent is the boy, who has no knowledge in the very beginning; Action is how he behaves; Environment contains all the objectives that he could perform on; Reward is something he gets from the environment (punishment or the candy), and Observation is what he could observe or the feedback from the environment.

Candies lol

Exploitation vs. Exploration

To understand how Q-learning works, it is important to know exploration and exploitation.
Let’s say our baby boy from the kindergarten goes home one day, and his mom prepares five boxes (we call them A-E), where there are different numbers of candies inside the boxes and he doesn’t know which one has more candies. So if his goal is to get as much candy as possible, what he would do?

Method 1: Obviously he could choose an arbitrary box each time. However, it is not guaranteed that he could get as much as possible.

Method 2: Another method would choose a “possible” box. Each time, he can choose the box with the maximum expectation of the candies. So to get a distribution of the candies, say he could open box 1000 times uniformly then keep track of the number of candies.

Method 3: If he has some prior knowledge about these boxes, for example, his mom told him that box A has 10 (expectation), box B has 20 (expectation) and others unknown. So based on his goal, it seems box B is a good choice. But box C might have even more candies! We could either choose box B, or randomly choose a box from C-E.

We call these methods policies in Q-learning. In brief, we choose our action (choose a box) based on our current state in a policy. So we define latex]\pi[/latex] as a policy, which maps states to actions.

Exploitation is to choose an action based on information that we have known. Method 2 is an exploitation-only policy. We say we know the expectations of all actions and then choose the best one.
Exploration is to explore the new actions that we have no information about. Method 1 is an exploration-only policy. Method 3 is a balanced version of these two. This provides us the idea of \epsilon-greedy policy.

Epsilon-greedy policy

Ranges from 0 to 1,\epsilon is the probability of exploration, which is set to search for new things. Typically, we just random a state and return that action. In practice, we initialize it with a value between 0 and 1; then we usually let it shrink during episodes t . An Episode is a whole game process from the start to the terminal state. Say in Flappy Bird, you start the game until the death state. Intuitively, imagine when an agent starts to play a new game, it has no “experience” about the game, so it is natural to go randomly; after some episodes, it is about to learn the skills and tricks, then it tends to use its own experience to play instead of randomly choose an action, because the more episodes it plays, the more confident it is about the experience (the more accurate the reward approximation is). There are various settings to \epsilon, say \epsilon=\frac{1}{ \sqrt { t } }, where t refers to episode.

Slide from Percy Liang


Q-learning has a table called Q-table, which is a table of states, actions and approximated rewards. Let’s get back to the kindergarten example.

Kindergarten states and actions

We simplify the problem: states for the boy are washing lunch box (wash) and hitting others (hit), there are four actions marked as action A to D. Our Q table shows bellow, where we could observe that each row is a state and the corresponding reward values to different actions. Some state-action pairs are illegal and reach no values. The values indicate number of candies as rewards.

Wash 10 -5
Hit -10 5

Q-table example

Posted in Algorithm, Deep Learning, Theory

Deep Learning 15: Unsupervised learning in DL? Try Autoencoder!

There are unsupervised learning models in multiple-level learning methods, for example, RBMs and Autoencoder. In brief, Autoencoder is trying to find a way to reconstruct the original inputs — another way to represent itself. In addition, it is useful for dimensionality reduction. For example, say there is a 32 * 32 -sized image, it is possible to represent it by using a fewer number of parameters. This is called you are “encoding” an image. The goal is to learn the new representation, so it is also applied as pre-training; then a traditional machine learning models could be applied depending on the tasks — it is a typical “two-stage” way for solving problems.[3] by Hinton is a great work on this problem, which shows the ability of “compressing” in neural networks, solving the bottleneck of massive information.

Continue reading “Deep Learning 15: Unsupervised learning in DL? Try Autoencoder!”

Posted in Algorithm, Machine Learning, Theory

ML Recap Slides sharing

Few friends with me did some works together since last October. All of us were looking for jobs in machine learning or deep learning. We all agreed that we need to review some interesting algorithms together. We had a draft of machine learning algorithms (part 1) during this new year:


Click here for a full version: mlrecap.

Also, we are working on part 2; there are some advanced algorithms which you can see from our outline. It is expected to finish around this June.



These slides are suitable for people to review old things. Some details are not included, so do not suggest readers learn some concepts from our slides. If you find mistakes, please leave comments. If you are interested in some particular algorithms, leave comments and we will consider updating our part 2 outline.

Posted in Python

TensorFlow 07: Word Embeddings (2) – Loading Pre-trained Vectors

A brief introduction on Word2vec please check this post. In this post, we try to load pre-trained Word2vec model, which is a huge file contains all the word vectors trained on huge corpora.


Download here .I downloaded the GloVe one, the vocabulary size is 4 million, dimension is 50. It is a smaller one trained on a “global” corpus (from wikipedia). There are models trained on Twitter as well in the page.


The model is formatted as (word vector) in each line, separated by a space. Below shows a screenshot: not only the words, but also some marks like comma are included in the model.image


There is an easy way for you to load the model by reading the vector file. Here I separate the words and vectors, because the words will be fed into vocabulary.

import numpy as np
filename = 'glove.6B.50d.txt'
def loadGloVe(filename):
    vocab = []
    embd = []
    file = open(filename,'r')
    for line in file.readlines():
        row = line.strip().split(' ')
    print('Loaded GloVe!')
    return vocab,embd
vocab,embd = loadGloVe(filename)
vocab_size = len(vocab)
embedding_dim = len(embd[0])
embedding = np.asarray(embd)

The vocab is a list of words or marks. The embedding is the huge 2-d array with all the word vectors. We initialize the embedding size to be the number of column of the embedding array.

Embedding Layer

After loading in the vectors, we need to use them to initialize W of the embedding layer in your network.

W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)

Here W is first built as Variables, but initialized by constant zeros. Be careful with the shape: [vocab_size, embedding_dim], where we can know after loading the model. If trainable is set to be False, it would not be updated during training. Change to True for a trainable setup. Then an embedding_placeholder is set up to receive the real values (fed from the feed_dict in, and at last W is assigned.

After creating a session and initialize global variables, run the embedding_init operation by feeding in the 2-D array embedding., feed_dict={embedding_placeholder: embedding})


Suppose you have raw documents, the first thing you need to do is to build a vocabulary, which will map each word into an id. TensorFlow process the following code to lookup embeddings:

tf.nn.embedding_lookup(W, input_x)

where W is the huge embedding matrix, input_x is a tensor with ids. In another word, it will lookup embeddings by given Ids.

So we would choose the pre-trained model when we build the vocabulary: word-id maps.

from tensorflow.contrib import learn
#init vocab processor
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
#fit the vocab from glove
pretrain =
#transform inputs
x = np.array(list(vocab_processor.transform(your_raw_input)))

First init the vocab processor by passing in a max_document_length, in default, shorter sentences would be padded by zeros. Then we fit the processor by the vocab list to build the word-id maps. Finally, use the processor to transform from real raw documents.

Now you are ready to train your own network with pre-trained word vectors!