# Slideshare (2): Machine Learning Recap Slides sharing

A few friends and I have been working together since last October. All of us were looking for jobs in machine learning or deep learning, and we agreed that we needed to review some interesting algorithms together. We finished a draft of the machine learning algorithms slides (part 1) over the New Year. We are also working on part 2, which covers some more advanced algorithms that you can see in our outline. We expect to finish it around this June. These slides are meant for people who want to review material they have already learned. Some details are not included, so we do not recommend learning concepts for the first time from our slides. If you find mistakes, please leave a comment. If you are interested in particular algorithms, leave a comment and we will consider updating our part 2 outline.

# Deep Learning 14: Optimization, an Overview

As a cardinal part of deep learning and machine learning, optimization has long been a mathematical problem for researchers. Why do we need optimization? Recall that linear regression has a loss function, and you need to find the optimum of that function, for example to minimize the squared error. You might also be very familiar with gradient descent, especially if you have started learning neural networks. In this post, we will cover some optimization methods and the conditions under which each should be applied.

We can split optimization problems into two groups: constrained and unconstrained. Unconstrained means that, given a function, we try to minimize or maximize it without any other conditions. Constrained means that we must satisfy some conditions while optimizing the function.

### Unconstrained Optimization

Definition: $\min_{x\in \mathbb{R}^n}{f(x)}$, where $x^*$ denotes the optimum.

There are several iterative methods for this kind of unconstrained problem, for example Gradient Descent (GD), Newton's Method and the Quasi-Newton Method (an optimized Newton's Method). Compared with GD, the other two typically converge in fewer iterations. The graph illustrates this: the red arrow denotes GD and the green one denotes Newton's Method.

The original Newton's Method (also known as the Newton–Raphson method) tries to find the roots of a function, that is, the solutions of $f(x)=0$. It still needs several iterations. The fantastic gif below, which I borrowed from Wikipedia, gives a quick picture of the idea. The job is to find the $x$ where $f(x)=0$. First, we pick an initial point $x_1$ (for example at random) and compute $f(x_1)$, then we take the tangent at the point $(x_1, f(x_1))$. The next point is exactly the $x$ where that tangent intersects the x-axis, so $x_{2}=x_{1}-{\frac {f(x_{1})}{f'(x_{1})}}$. In general, $x_{n+1}=x_{n}-{\frac {f(x_{n})}{f'(x_{n})}}$, and we repeat this step until a sufficiently accurate value is reached.

Now you can see that Newton's Method is well suited to solving problems of the form $f(x)=0$. How does that help with optimization? You need a bit of math here. If $f$ is a twice-differentiable function and you want to find its maximum or minimum, the problem becomes finding the roots of the derivative $f'$ (the solutions of $f'(x)=0$), also known as the stationary points of $f$. Applying the same update to $f'$ gives $x_{n+1}=x_{n}-{\frac {f'(x_{n})}{f''(x_{n})}}$.
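To make the two uses of Newton's Method concrete, here is a minimal sketch in plain Python. The function names (`newton_root`, `newton_stationary_point`), the tolerance, the iteration cap and the two test functions are all assumptions chosen for illustration; they do not come from the original post.

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Find a root of f with the Newton-Raphson update x <- x - f(x) / f'(x).

    tol and max_iter are illustrative defaults, not tuned values.
    """
    x = x0
    for _ in range(max_iter):
        slope = f_prime(x)
        if slope == 0:
            # The tangent is horizontal, so it never crosses the x-axis.
            raise ZeroDivisionError("derivative is zero; pick another start")
        step = f(x) / slope
        x -= step
        if abs(step) < tol:
            break
    return x


def newton_stationary_point(f_prime, f_second, x0, **kwargs):
    """Find a stationary point of f by running Newton's Method on f'."""
    return newton_root(f_prime, f_second, x0, **kwargs)


# Root finding: solve x^2 - 2 = 0, starting from x = 1.
sqrt_two = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)

# Optimization: h(x) = x^4 - 4x has h'(x) = 4x^3 - 4 and h''(x) = 12x^2.
# Its only real stationary point is the minimum at x = 1.
x_min = newton_stationary_point(
    lambda x: 4 * x ** 3 - 4, lambda x: 12 * x ** 2, x0=2.0
)

print(sqrt_two, x_min)
```

Note that the second call only finds a stationary point. Whether it is a minimum or a maximum still has to be checked, for example with the sign of the second derivative.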

### Constrained Optimization: Equality Constraints

Definition: $\min{f(x, y)}$, subject to $g(x,y)=c$.
In the equality-constrained problem there can be more than one equality constraint; for convenience, we discuss a single constraint here. The method is called the Lagrange multiplier method. When you have constraints, a natural idea is to try to eliminate them. We bring in a new variable $\lambda$ and create the Lagrange function:

$L(x,y,\lambda)=f(x,y)+\lambda\,(g(x,y)-c)$

Then we need to solve the following equations:

$\frac{\partial L}{\partial x}=0,\quad \frac{\partial L}{\partial y}=0,\quad \frac{\partial L}{\partial \lambda}=0$

Once we have the right $\lambda$ (it is nonzero whenever the constraint actually affects the optimum), we can substitute it back into the Lagrange function. Because $g(x,y)-c$ is always 0 on the constraint, the Lagrange function equals $f(x,y)$ there, so the stationary point we found is a candidate optimum of $f(x,y)$ subject to the constraint.
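As a small worked example of the method, the sketch below minimizes $f(x,y)=x^2+y^2$ subject to $x+y=1$. It assumes the Lagrange function is written as $L(x,y,\lambda)=f(x,y)+\lambda\,(g(x,y)-c)$. The problem, the helper `solve_linear` and the variable names are all chosen for illustration and are not from the original post. Setting the three partial derivatives of $L$ to zero gives a linear system here, which is solved with a basic Gaussian elimination.

```python
def solve_linear(a, b):
    """Solve the square system a @ x = b by Gaussian elimination
    with partial pivoting (hypothetical helper for this example)."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


# L(x, y, lam) = x^2 + y^2 + lam * (x + y - 1)
# dL/dx   = 2x + lam   = 0
# dL/dy   = 2y + lam   = 0
# dL/dlam = x + y - 1  = 0
coefficients = [
    [2, 0, 1],
    [0, 2, 1],
    [1, 1, 0],
]
right_hand_side = [0, 0, 1]

x_opt, y_opt, lam = solve_linear(coefficients, right_hand_side)
print(x_opt, y_opt, lam)
```

The solution is $x=y=0.5$ with $\lambda=-1$: the point on the line $x+y=1$ closest to the origin. With the opposite sign convention for the $\lambda$ term, the same point is found and only the sign of $\lambda$ flips.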

### Constrained Optimization: Inequality Constraints

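Definition: $\min_{x}{f(x)}$, subject to $g_i(x)\le 0$ and $h_j(x)=0$.
Following the same idea, the Lagrange multiplier method can be extended. The generalized Lagrange function is written as follows:

$L(x,\alpha,\beta)=f(x)+\sum_{i}\alpha_i g_i(x)+\sum_{j}\beta_j h_j(x)$

Here all $\alpha_i$ and $\beta_j$ are Lagrange multipliers, with $\alpha_i \ge 0$.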

# Deep Learning 13: Understanding Generative Adversarial Network

Proposed in 2014, the Generative Adversarial Network (GAN) now has many variants. You might not be surprised that the relevant papers read more like statistics research. When a model is proposed, its evaluation is based on some fundamental probability distributions, and more general applications start from there.

# Is your model good enough? Evaluation metrics in Classification and Regression

I was working on my research with sklearn and realized that choosing the right evaluation metric has always been a problem for me. If someone asks me, "Does your model perform well?", the first thing that comes to mind is accuracy. Besides accuracy, there are many other metrics, and the right one depends on your problem.

# Two sample problem(2): kernel function, feature space and reproducing kernel map

Find Two sample problem (1) here.

We will take a look at RKHS (Reproducing Kernel Hilbert Space) in this post. You might think of it as a very statistical term, but it is remarkable for its wide range of applications. You will need to refresh your memory of some linear algebra computations. We start with some basic terms and definitions.

# Understanding SVM(2)

A brief introduction is here. (I wrote a blog post about it last year, but I do not think it is detailed enough.)

This post contains my learning notes from this video (English slides, but a Chinese speaker). It starts with a quick introduction to SVM, then covers the magic of how to solve for max/min values. You will also find kernel SVM.

# Two sample problem(1): Parzen Windows, Maximum Mean Discrepancy

There is a nice tutorial from Alex. I expanded the math part to show more details. I wrote it in LaTeX and then posted screenshots.

# NLP 05: From Word2vec to Doc2vec: a simple example with Gensim

#### Introduction

First introduced by Mikolov [1] in 2013, word2vec learns distributed representations (word embeddings) using a neural network. It is based on the distributional hypothesis that words that occur in similar contexts (neighboring words) tend to have similar meanings. There are two models: CBOW (continuous bag of words), where we use a bag of context words to predict a target word, and skip-gram, where we use one word to predict its neighbors. For more, although it is not highly recommended, have a look at the TensorFlow tutorial here.
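To illustrate the difference between the two models, here is a minimal sketch in plain Python of the training examples each one would see for a single sentence. It does not use Gensim or train any network. The function names, the window size and the toy sentence are assumptions made for this illustration only.

```python
def context_window(tokens, i, window):
    """Return the neighbors of tokens[i] within `window` positions."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_pairs(tokens, window=2):
    """CBOW: a bag of context words is used to predict the target word."""
    return [
        (context_window(tokens, i, window), tokens[i])
        for i in range(len(tokens))
    ]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: one word is used to predict each of its neighbors."""
    return [
        (tokens[i], neighbor)
        for i in range(len(tokens))
        for neighbor in context_window(tokens, i, window)
    ]


sentence = "the cat sat on the mat".split()
cbow = cbow_pairs(sentence, window=1)
skipgram = skipgram_pairs(sentence, window=1)

print(cbow[1])       # (context words, target word) for "cat"
print(skipgram[:3])  # first few (input word, neighbor) pairs
```

CBOW produces one example per position with several input words, while skip-gram produces one example per (word, neighbor) pair, so it yields more examples from the same text.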

# Deep Learning 12: Energy-Based Learning (2)–Regularization & Loss Functions

First, let’s see what regularization is, using a simple example. Then we will have a look at some different types of loss functions.

## Regularization

Today I reviewed the definition of regularization from Andrew’s lecture videos.