Keep Updating

# Theory

# Slideshare (3): Unsupervised Transfer Learning Methods

A brief introduction to unsupervised transfer learning methods.

The presentation focused on unsupervised transfer learning methods, introducing feature-based and model-based strategies and a few recent papers from ICML and ACL.

*Slides: Unsupervised Transfer Learning*

Comments are welcome!

# TensorFlow 08: save and restore a subset of variables

TensorFlow provides save and restore functions that let us save and re-use model parameters. If you have a trained VGG model, for example, it is helpful to restore the first few layers and then reuse them in your own network. This raises a question: how do we restore only a subset of the parameters? You can always check the official TF documentation here. In this post, I will take some code from the documentation and add some practical points.
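To make that concrete, here is a minimal sketch using the TF 1.x `tf.train.Saver` API; the scope name and checkpoint path are placeholders, not values from the post:

```python
import tensorflow as tf  # TF 1.x style API

# Collect only the variables you want to restore, e.g. the first VGG block.
# The scope name below is a placeholder -- adjust it to match your checkpoint.
vars_to_restore = tf.get_collection(
    tf.GraphKeys.GLOBAL_VARIABLES, scope='vgg_16/conv1')

# A Saver built from var_list only saves/restores that subset of variables.
restore_saver = tf.train.Saver(var_list=vars_to_restore)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())        # initialize everything first
    restore_saver.restore(sess, '/path/to/vgg.ckpt')   # then overwrite the subset
```

The key point is that a `Saver` constructed with `var_list` only touches those variables, so the rest of your network keeps its freshly initialized values.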

# To copy or not, that is the question: copying mechanism

In daily conversation, we often repeat something mentioned earlier in the dialogue, such as the names of people or organizations: “Hi, my name is Pikachu.” “Hi, Pikachu, …” There is a high probability that the word “Pikachu” will not appear in the vocabulary extracted from the training data. So in the paper Incorporating Copying Mechanism in Sequence-to-Sequence Learning, the authors proposed CopyNet, which brings a copying mechanism to seq2seq models with an encoder-decoder structure. Read my old post to pick up the prerequisite knowledge.
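To give a rough flavor of the idea, here is a simplified pointer-style copy mixture in plain NumPy; it is not the exact CopyNet formulation, and all names and numbers are invented for illustration:

```python
import numpy as np

def mix_generate_and_copy(p_vocab, attn_weights, src_token_ids, p_gen, extended_vocab_size):
    """Combine a generation distribution over the vocabulary with a copy
    distribution over source positions into one distribution over an
    extended vocabulary (known words + out-of-vocabulary source tokens)."""
    p_final = np.zeros(extended_vocab_size)
    p_final[:len(p_vocab)] = p_gen * p_vocab           # generate from the vocabulary
    for pos, token_id in enumerate(src_token_ids):     # copy from the source sentence
        p_final[token_id] += (1.0 - p_gen) * attn_weights[pos]
    return p_final

# Toy usage: "Pikachu" is an OOV token mapped to an extended-vocab slot (id 3).
p_vocab = np.array([0.7, 0.2, 0.1])   # probabilities over a 3-word vocabulary
attn = np.array([0.1, 0.9])           # attention over 2 source tokens
src_ids = [1, 3]                      # the 2nd source token is OOV
print(mix_generate_and_copy(p_vocab, attn, src_ids, p_gen=0.4, extended_vocab_size=4))
```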

# What matters: attention mechanism

People are drawn to only a part of an image, say a person in a photo. Similarly, for a given sequence of words, we should pay attention to a few keywords instead of treating each word equally. For example, when you read “this is an apple” aloud, you will stress “apple” rather than “is” or “an”, because you naturally pay attention to the word that carries the meaning of the sentence. In seq2seq models (check this post if you forget), we learn weights over the words, where important words get higher weight.
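As a bare-bones illustration (toy dot-product attention; the vectors and names are made up for this example), the weights are simply a softmax over alignment scores, so the informative word ends up with the largest weight:

```python
import numpy as np

def attention_weights(query, encoder_states):
    """Dot-product attention: score each encoder state against the query,
    then normalize the scores with a softmax."""
    scores = encoder_states @ query                 # one score per source word
    scores = scores - scores.max()                  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights

# Toy example: 4 source words ("this", "is", "an", "apple"), 3-dim states.
states = np.array([[0.1, 0.0, 0.2],
                   [0.0, 0.1, 0.1],
                   [0.2, 0.0, 0.0],
                   [0.9, 0.8, 0.7]])   # "apple" happens to match the query best
query = np.array([1.0, 1.0, 1.0])
print(attention_weights(query, states))  # the largest weight falls on the last word
```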

# What’s next: seq2seq models

This short post contains my notes from the Seq2seq Tutorial. Please leave comments if you are interested in this topic.

# Deep Learning 16: Understanding Capsule Nets

This post contains my learning notes from Prof. Hung-Yi Lee’s lecture; the PDF can be found here (pages 40-52). I have read a few articles on the topic, and I found this one is a must-read. It is simple, and you can easily understand what is going on. I would say it is a good starting point for further reading.

Paper link: Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, “Dynamic Routing Between Capsules”, NIPS, 2017

# Working with ROUGE 1.5.5 Evaluation Metric in Python

If you use the ROUGE evaluation metric for text summarization or machine translation systems, you must have noticed that there are many versions of it. So how do you get it to work with your own systems in Python? What packages are helpful? In this post, I will share some ideas from an engineering point of view (which means I am not going to introduce what ROUGE is). I also suffered from a few issues and finally got them solved. My methods might not be the best ways, but they worked.
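One common route is the `pyrouge` wrapper around the official ROUGE 1.5.5 Perl script. Here is a minimal sketch, assuming ROUGE 1.5.5 is already installed and registered with pyrouge; the directories and filename patterns are placeholders:

```python
from pyrouge import Rouge155  # wrapper around the official ROUGE 1.5.5 Perl script

# Assumes ROUGE 1.5.5 has been set up, e.g. via `pyrouge_set_rouge_path`.
r = Rouge155()
r.system_dir = 'summaries/system'            # your generated summaries (placeholder)
r.model_dir = 'summaries/reference'          # gold/reference summaries (placeholder)
r.system_filename_pattern = r'doc.(\d+).txt'
r.model_filename_pattern = 'doc.#ID#.txt'

output = r.convert_and_evaluate()            # runs the Perl script under the hood
scores = r.output_to_dict(output)
print(scores['rouge_1_f_score'])
```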

# Reinforcement Learning (1): Q-Learning basics

Hi! In the following posts, I will introduce Q-Learning, the first topic to learn if you want to pick up reinforcement learning. But before that, let us shed light on some fundamental concepts in reinforcement learning (RL).

## Kindergarten Example

Q-Learning works in this way: take an action, then receive a reward and an observation from the environment, as shown below. The image is taken from here:

*Berkeley’s CS 294: Deep Reinforcement Learning by John Schulman & Pieter Abbeel*

Imagine a baby boy in a kindergarten and how he behaves on his first day. He does not know the kindergarten and knows nothing about how to behave. So he begins with random actions; say he hits the other kids, and when he does this, he has no idea whether it is right or not. Then the teacher becomes mad and gives him a punishment (a negative reward), so he learns that hitting others is not a good action. The next time, the boy washes his lunch box, and the teacher rewards him with candy, so he learns that this action is a good one. So in our kindergarten example, the **Agent** is the boy, who has no knowledge at the very beginning; the **Action** is how he behaves; the **Environment** contains all the objects he could act on; the **Reward** is what he gets from the environment (the punishment or the candy); and the **Observation** is what he can observe, i.e. the feedback from the environment.

*Candies lol*
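In code, the same interaction loop looks roughly like this; it is a generic sketch with made-up class and action names, not any particular RL library:

```python
import random

class KindergartenEnv:
    """Toy environment: each action gets a fixed reward from the 'teacher'."""
    rewards = {'hit_others': -10, 'wash_lunch_box': +5}

    def step(self, action):
        reward = self.rewards[action]
        observation = 'teacher is angry' if reward < 0 else 'teacher is happy'
        return observation, reward

env = KindergartenEnv()
for _ in range(3):
    action = random.choice(['hit_others', 'wash_lunch_box'])  # no knowledge yet
    obs, reward = env.step(action)                            # feedback from environment
    print(action, '->', reward, '|', obs)
```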

## Exploitation vs. Exploration

To understand how Q-learning works, it is important to know exploration and exploitation.

Let’s say our baby boy from the kindergarten goes home one day, and his mom has prepared five boxes (call them A-E) with different numbers of candies inside, and he doesn’t know which one has more candies. If his goal is to get as much candy as possible, what would he do?

*Method 1:* Obviously, he could choose an arbitrary box each time. However, this does not guarantee that he gets as much candy as possible.

*Method 2:* Another method is to choose a “promising” box. Each time, he chooses the box with the maximum expected number of candies. To estimate the distribution of the candies, he could first open the boxes 1000 times uniformly and keep track of the number of candies in each (see the sketch after this list).

*Method 3:* He may have some prior knowledge about these boxes; for example, his mom told him that box A holds 10 candies (in expectation), box B holds 20 (in expectation), and the others are unknown. Based on his goal, box B seems like a good choice. But box C might have even more candies! So he could either choose box B or randomly choose a box from C-E.
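Here is a toy simulation of Method 2, referenced above; the candy counts and noise model are invented for the example. First estimate each box’s expected payoff by uniform sampling, then always pick the best estimate:

```python
import random

true_means = {'A': 10, 'B': 20, 'C': 25, 'D': 5, 'E': 8}   # hidden from the boy

def open_box(box):
    # candies drawn around the (unknown) true mean of that box
    return max(0, round(random.gauss(true_means[box], 3)))

# Method 2: sample every box uniformly to estimate its expectation ...
estimates = {box: sum(open_box(box) for _ in range(1000)) / 1000
             for box in true_means}
# ... then exploit: always pick the box with the highest estimated payoff.
best_box = max(estimates, key=estimates.get)
print(estimates, '->', best_box)
```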

We call these methods **policies** in Q-learning. In brief, a policy tells us how to choose our action (which box to open) based on our current state. So we define $\pi$ as a policy, which maps states to actions.

**Exploitation** means choosing an action based on the information we already have. *Method 2* is an exploitation-only policy: we assume we know the expectations of all actions and then choose the best one.

**Exploration** means trying new actions that we have no information about. *Method 1* is an exploration-only policy. *Method 3* is a balance of the two, and it gives us the idea of the $\epsilon$-greedy policy.

## Epsilon-greedy policy

$\epsilon$ ranges from 0 to 1 and is the probability of exploration, i.e. how often we search for new things. Typically, when exploring we just pick a random action and return it. In practice, we initialize $\epsilon$ with a value between 0 and 1 and then let it shrink over episodes. An **Episode** is one whole game process from the start to the terminal state; say in Flappy Bird, you start the game and play until the death state.

Intuitively, when an agent starts to play a new game, it has no “experience” of the game, so it is natural for it to act randomly; after some episodes, it starts to learn the skills and tricks, and it tends to rely on its own experience instead of randomly choosing an action, because the more episodes it plays, the more confident it is in that experience (the more accurate the reward approximation is). There are various schedules for $\epsilon$; a common one lets it decay with the episode number $k$, e.g. $\epsilon = 1/k$.
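A minimal sketch of the $\epsilon$-greedy rule follows; here `Q` is any table of estimated action values, and the decay schedule is just one common choice:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore (random action),
    otherwise exploit (the best-known action for this state)."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

Q = {('wash', 'A'): 10, ('hit', 'C'): 5}    # toy value estimates
for episode in range(1, 6):
    epsilon = 1.0 / episode                 # shrink exploration over episodes
    print(episode, epsilon_greedy(Q, 'wash', ['A', 'B', 'C', 'D'], epsilon))
```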

*Slide from Percy Liang*

## Q-table

**Q-learning** has a table called Q-table, which is a table of states, actions and approximated rewards. Let’s get back to the kindergarten example.

*Kindergarten states and actions*

We simplify the problem: the states for the boy are washing his lunch box (wash) and hitting others (hit), and there are four actions marked A to D. Our Q-table is shown below; each row is a state, and the entries are the reward values for the different actions. Some state-action pairs are illegal and therefore have no values. The values indicate the number of candies received as reward.

|      | A  | B   | C | D  |
|------|----|-----|---|----|
| Wash | 10 | —   | — | -5 |
| Hit  | —  | -10 | 5 | —  |

*Q-table example*
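The same table as a plain Python dictionary (a sketch mirroring the example values; illegal state-action pairs are simply absent):

```python
# Q-table for the kindergarten example: (state, action) -> estimated reward.
Q = {
    ('wash', 'A'): 10,  ('wash', 'D'): -5,
    ('hit',  'B'): -10, ('hit',  'C'): 5,
}

def best_action(state, actions=('A', 'B', 'C', 'D')):
    """Pick the legal action with the highest estimated reward for this state."""
    legal = [a for a in actions if (state, a) in Q]
    return max(legal, key=lambda a: Q[(state, a)])

print(best_action('wash'))  # -> 'A' (reward 10)
print(best_action('hit'))   # -> 'C' (reward 5)
```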