Posted in Algorithm, Deep Learning, Machine Learning, Theory

Reinforcement Learning (1): Q-Learning basics

Hi! In the following posts, I will introduce Q-Learning, the first thing to learn if you want to pick up reinforcement learning. But before that, let us shed light on some fundamental concepts in reinforcement learning (RL).

Kindergarten Example

Q-Learning works like this: the agent takes an action and receives a reward and an observation from the environment, as shown below. The image is taken from:


Berkeley’s CS 294: Deep Reinforcement Learning by John Schulman & Pieter Abbeel

Imagine a baby boy on his first day in a kindergarten. He does not know the place and knows nothing about how to behave, so he begins with random actions. Say he hits another kid; when he does this, he has no idea whether it is right or wrong. The teacher gets mad and gives him a punishment (a negative reward), so he learns that hitting others is not a good action. Next time, the boy washes his lunch box and the teacher rewards him with candy, so he learns that this action is a good one. In our kindergarten example, the Agent is the boy, who has no knowledge at the very beginning; an Action is how he behaves; the Environment contains all the objects he could act on; the Reward is what he gets from the environment (the punishment or the candy); and the Observation is what he can observe, that is, the feedback from the environment.


Candies lol

Exploitation vs. Exploration

To understand how Q-learning works, it is important to understand exploration and exploitation.
Let’s say our baby boy from the kindergarten goes home one day, and his mom has prepared five boxes (we call them A-E). The boxes hold different numbers of candies, and he doesn’t know which one has the most. If his goal is to get as much candy as possible, what should he do?

Method 1: He could simply choose an arbitrary box each time. However, this does not guarantee that he gets as much candy as possible.

Method 2: He could choose the most promising box: each time, he picks the box with the highest expected number of candies. To estimate that distribution, say he first opens the boxes 1000 times, chosen uniformly at random, and keeps track of the number of candies in each.

Method 3: Suppose he has some prior knowledge about these boxes. For example, his mom told him that box A holds 10 candies (in expectation) and box B holds 20 (in expectation), while the others are unknown. Given his goal, box B looks like a good choice. But box C might have even more candies! So he could either choose box B or randomly choose a box from C-E.
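The estimation step in Method 2 can be sketched in a few lines of Python. This is only an illustration: the candy ranges in `CANDY_RANGES`, the `draw` helper and the choice of 200 openings per box (1000 openings spread evenly over five boxes) are assumptions made for the example, not part of the original story.

```python
import random

# Hypothetical boxes: each opening yields a candy count drawn uniformly
# from the given (low, high) range. These numbers are made up.
CANDY_RANGES = {"A": (5, 15), "B": (15, 25), "C": (0, 10), "D": (20, 40), "E": (0, 5)}

def draw(box, rng):
    """Open one box once and return the number of candies found."""
    low, high = CANDY_RANGES[box]
    return rng.randint(low, high)

def estimate_expectations(boxes, trials_per_box, rng):
    """Average the candies seen over repeated openings of each box."""
    return {
        box: sum(draw(box, rng) for _ in range(trials_per_box)) / trials_per_box
        for box in boxes
    }

rng = random.Random(0)
estimates = estimate_expectations(CANDY_RANGES, trials_per_box=200, rng=rng)
best_box = max(estimates, key=estimates.get)  # the box Method 2 would pick
```

Once the estimates exist, always picking `best_box` uses only what is already known, while the 1000 initial openings are where all the trial and error happens.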

We call these methods policies in Q-learning. In brief, a policy chooses our action (which box to open) based on our current state. We write \pi for a policy, which maps states to actions.

Exploitation means choosing an action based on the information we already have. Method 2 is an exploitation-only policy: we assume we know the expectations of all actions and then choose the best one.
Exploration means trying new actions that we have no information about. Method 1 is an exploration-only policy. Method 3 is a balanced version of the two. This leads to the idea of the \epsilon-greedy policy.

Epsilon-greedy policy

\epsilon ranges from 0 to 1 and is the probability of exploration, that is, of trying something new. With probability \epsilon we pick a random action; otherwise we take the best action known so far. In practice, we initialize \epsilon with a value between 0 and 1 and usually let it shrink over the episodes t. An episode is a whole game from the start to the terminal state; in Flappy Bird, for instance, an episode runs from the start of the game until the bird dies. Intuitively, when an agent starts to play a new game it has no “experience” of the game, so it is natural to act randomly. After some episodes it begins to learn the skills and tricks, and it tends to rely on its own experience instead of choosing actions at random, because the more episodes it plays, the more confident it is in that experience (the more accurate its reward approximation is). There are various schedules for \epsilon, for example \epsilon=\frac{1}{\sqrt{t}}, where t is the episode number.
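Here is a minimal sketch of an \epsilon-greedy choice over the candy boxes, using the 1/\sqrt{t} schedule mentioned above. The function names and the numbers in `estimates` are assumptions for illustration (the unknown boxes C-E are simply given an estimate of 0).

```python
import math
import random

def epsilon_for_episode(t):
    """Exploration probability for episode t (t >= 1): epsilon = 1 / sqrt(t)."""
    return 1.0 / math.sqrt(t)

def epsilon_greedy(estimates, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    boxes = list(estimates)
    if rng.random() < epsilon:
        return rng.choice(boxes)          # explore: any box, uniformly at random
    return max(boxes, key=estimates.get)  # exploit: highest estimated candies

# Hypothetical running estimates of candies per box (illustrative numbers only).
estimates = {"A": 10.0, "B": 20.0, "C": 0.0, "D": 0.0, "E": 0.0}

# Early episodes explore a lot, later ones mostly exploit.
choices = [epsilon_greedy(estimates, epsilon_for_episode(t)) for t in range(1, 101)]
```

With this schedule the first episode is fully random (\epsilon = 1), and by episode 100 the agent explores only about one time in ten.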


Slide from Percy Liang

Q-table

Q-learning keeps a table called the Q-table, which is a table of states, actions and approximated rewards. Let’s get back to the kindergarten example.


Kindergarten states and actions

We simplify the problem: the states for the boy are washing his lunch box (wash) and hitting others (hit), and there are four actions, marked A to D. Our Q-table is shown below: each row is a state, and its entries are the reward values for the different actions. Some state-action pairs are illegal and have no value. The values are numbers of candies given as rewards.

A B C D
Wash 10 -5
Hit -10 5

Q-table example
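In code, a small Q-table like this one can be kept as a nested dictionary. The sketch below is an assumption-laden illustration: the table above does not say which of the actions A to D each value belongs to, so the column assignment here is made up, and the helper names are hypothetical.

```python
# Q-table as a nested dict: state -> {action: estimated reward in candies}.
# Which of A-D each value belongs to is an assumption made for illustration;
# actions missing from a state's dict are the illegal state-action pairs.
q_table = {
    "wash": {"A": 10, "B": -5},
    "hit": {"C": -10, "D": 5},
}

def legal_actions(state):
    """Actions that have a value in this state."""
    return sorted(q_table[state])

def best_action(state):
    """Greedy choice: the legal action with the highest Q-value in this state."""
    actions = q_table[state]
    return max(actions, key=actions.get)
```

Leaving illegal pairs out of the dictionary, instead of storing a dummy number, keeps them from ever being picked by the greedy lookup.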

Posted in Algorithm, Deep Learning, Theory

Deep Learning 15: Unsupervised learning in DL? Try Autoencoder!

There are unsupervised learning models among multi-layer learning methods, for example RBMs and autoencoders. In brief, an autoencoder tries to find a way to reconstruct its original input, that is, another way to represent it. It is also useful for dimensionality reduction: a 32 * 32 image, for example, can be represented with fewer parameters, which is called “encoding” the image. The goal is to learn the new representation, so it is also used for pre-training; a traditional machine learning model can then be applied depending on the task, which is a typical “two-stage” way of solving problems. [3] by Hinton is a great work on this problem, showing the ability of neural networks to “compress” and ease the bottleneck of massive information.

Continue reading “Deep Learning 15: Unsupervised learning in DL? Try Autoencoder!”

Posted in Algorithm, Machine Learning, Theory

ML Recap Slides sharing

A few friends and I have been working together since last October. All of us were looking for jobs in machine learning or deep learning, and we agreed that we needed to review some interesting algorithms together. We put together a draft of machine learning algorithms (part 1) over the new year:


Click here for a full version: mlrecap.

We are also working on part 2, which covers some more advanced algorithms that you can see in our outline. We expect to finish it around this June.


These slides are meant for reviewing things you have already learned. Some details are not included, so we do not suggest learning concepts from our slides for the first time. If you find mistakes, please leave a comment. If you are interested in particular algorithms, leave a comment and we will consider updating our part 2 outline.

Posted in Algorithm, Theory

Deep Learning 14 : Optimization, an Overview

As a cardinal part of deep learning and machine learning, optimization has long been a mathematical problem for researchers. Why do we need optimization? Remember that you have a loss function for linear regression, and you need to find the optimum of that function, for example to minimize the squared error. You might also be very familiar with gradient descent, especially if you have started learning neural networks. In this post, we will cover some optimization methods and the conditions under which we should apply them.

We can split optimization problems into two groups: constrained and unconstrained. Unconstrained means that, given a function, we try to minimize or maximize it without any other conditions. Constrained means that we must satisfy some conditions while optimizing the function.

Unconstrained Optimization

Definition: \min_{x\in R^n}{f(x)}, where x^* denotes the optimum.
There are several iterative methods for this kind of unconstrained problem, for example Gradient Descent (GD), the Newton Method and the Quasi-Newton Method (an optimized Newton Method). Compared with GD, the other two usually converge in fewer iterations. The graph illustrates this: the red arrow denotes GD and the green one the Newton Method.

[Figure: paths taken by GD (red) and the Newton Method (green)]

The original “Newton Method” (also known as the Newton–Raphson method) finds the roots of a function, that is, solutions of f(x)=0, and it is iterative. The animation below, which I borrowed from Wikipedia, gives a quick picture. The job is to find an x where f(x)=0. First, we pick a starting point x_1 and compute f(x_1), then take the tangent at the point (x_1, f(x_1)). The next point is exactly the x where the tangent intersects the x-axis, so x_{2}=x_{1}-{\frac {f(x_{1})}{f'(x_{1})}}. We repeat this, x_{n+1}=x_{n}-{\frac {f(x_{n})}{f'(x_{n})}}, until a sufficiently accurate value is reached.

[Animation: Newton Method iterations, from Wikipedia]
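The tangent-line iteration described above fits in a few lines of Python. This is a minimal sketch: the function name `newton_root`, the tolerance and the example f(x) = x^2 - 2 are choices made for illustration, not part of the original post.

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Find x with f(x) close to 0 by repeating x <- x - f(x) / f'(x)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x                      # accurate enough, stop early
        slope = f_prime(x)
        if slope == 0:
            raise ZeroDivisionError("tangent is horizontal; pick another starting point")
        x = x - fx / slope                # where the tangent crosses the x-axis
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```

Starting from x0 = 1.0, the iterates are 1.5, then about 1.4167, then about 1.41422, so only a handful of steps are needed here.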

Now you understand that the Newton Method is well suited to solving problems of the form f(x)=0. How does that help with optimization? You need a little calculus here. If f is a twice-differentiable function and you want to find its maximum or minimum, the problem becomes finding the roots of the derivative f' (solutions to f'(x)=0), also known as the stationary points of f. Applying the Newton Method to f' gives the update x_{n+1}=x_{n}-{\frac {f'(x_{n})}{f''(x_{n})}}.
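To make the comparison with gradient descent concrete, here is a small sketch that minimizes one function both ways. The function f(x) = e^x - 2x, the learning rate and the step counts are assumptions picked for illustration; its only stationary point is the minimum at x = ln 2.

```python
import math

def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeat x <- x - learning_rate * f'(x)."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

def newton_minimize(grad, hess, x0, steps=10):
    """Repeat x <- x - f'(x) / f''(x): Newton's root search applied to f'."""
    x = x0
    for _ in range(steps):
        x -= grad(x) / hess(x)
    return x

# Illustrative function: f(x) = e^x - 2x, so f'(x) = e^x - 2 and f''(x) = e^x.
grad = lambda x: math.exp(x) - 2
hess = lambda x: math.exp(x)

x_gd = gradient_descent(grad, x0=0.0)
x_newton = newton_minimize(grad, hess, x0=0.0)
```

Both runs end close to ln 2, but the Newton version gets there in 10 steps while the gradient descent run is given 100. The price is that Newton needs the second derivative, which can be expensive in many dimensions.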

Constrained Optimization: Equality Optimization

Definition: \min{f(x, y)}, subject to g(x,y)=c.
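In an equality-constrained problem there can be more than one equality constraint; here we discuss a single constraint for convenience. The method is called the Lagrange Multiplier method. When you have constraints, a natural approach is to try to eliminate them. So it goes this way: we bring in a new variable \lambda and create a Lagrange function:

L(x, y, \lambda) = f(x, y) + \lambda \left( g(x, y) - c \right)

Then we need to solve the following equations:

\frac{\partial L}{\partial x} = 0, \quad \frac{\partial L}{\partial y} = 0, \quad \frac{\partial L}{\partial \lambda} = g(x, y) - c = 0

Once we have the right \lambda (if \lambda turns out to be zero, the point is already a stationary point of f itself and the constraint plays no role), we substitute it back into the Lagrange function. A stationary point of the Lagrange function gives a candidate for the constrained optimum of f(x,y), because g(x,y)-c is always 0 on the constraint, so the two functions agree there.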
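A small worked example may help. The problem below (minimize x^2 + y^2 on the line x + y = 1) is chosen for illustration, and it assumes the sign convention L = f + \lambda (g - c) for the Lagrange function; with the opposite sign only the sign of \lambda changes.

```latex
% Minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y = 1.
\begin{align*}
L(x, y, \lambda) &= x^2 + y^2 + \lambda\,(x + y - 1) \\
\frac{\partial L}{\partial x} &= 2x + \lambda = 0 \\
\frac{\partial L}{\partial y} &= 2y + \lambda = 0 \\
\frac{\partial L}{\partial \lambda} &= x + y - 1 = 0 \\
\Rightarrow\quad x = y &= \tfrac{1}{2}, \qquad \lambda = -1, \qquad f\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{2}
\end{align*}
```

The first two equations force x = y, the third then fixes both at 1/2, and plugging back in gives \lambda = -1. Geometrically, (1/2, 1/2) is the point on the line x + y = 1 closest to the origin.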

Constrained Optimization: Inequality Optimization

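Definition:

\min_{x}{f(x)}, subject to g_i(x) \leq 0 for i=1,\dots,k and h_j(x)=0 for j=1,\dots,l.
Following the same idea, the Lagrange Multiplier method can be extended. A Generalized Lagrange function is written this way:

L(x, \alpha, \beta) = f(x) + \sum_{i=1}^{k} \alpha_i g_i(x) + \sum_{j=1}^{l} \beta_j h_j(x)

Here all \alpha_i and \beta_j are Lagrange multipliers, and \alpha_i \geq 0.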

 

Posted in Algorithm, Deep Learning, Theory

Deep Learning 13: Understanding Generative Adversarial Network

Proposed in 2014, the interesting Generative Adversarial Network (GAN) now has many variants. You might not be surprised that the relevant papers read more like statistics research. When a model is proposed, its evaluation is based on some fundamental probability distributions, from which generalized applications start. Continue reading “Deep Learning 13: Understanding Generative Adversarial Network”

Posted in Algorithm, Theory

Two sample problem(2): kernel function, feature space and reproducing kernel map

Find Two sample problem (1) here.

We will take a look at RKHS (Reproducing Kernel Hilbert Space) in this post. You might think of it as a very statistical term, but it is amazing because of its various applications. You will need to refresh your linear algebra. We start with some basic terms and definitions. Continue reading “Two sample problem(2): kernel function, feature space and reproducing kernel map”