Deep Learning 12: Energy-Based Learning (2)–Regularization & Loss Functions

First, let’s see what regularization is through a simple example. Then we will look at several types of loss functions.

Regularization

Today I reviewed the definition of regularization from Andrew’s lecture videos.
You have probably noticed that some of the loss functions include a regularizer. So what is it for? Mainly, it prevents overfitting, especially when we have some prior knowledge. Take house prices as an example:

To address overfitting, we can either reduce the number of features, for example by selecting them manually, or use regularization. Regularization works well when we have many features and each of them contributes a little: we keep them all, but reduce the magnitude of their parameters.

So we want to find the relation between size and price, with the size as the feature $x$ and the parameters ${\theta}_{0},..,{\theta}_{4}$. We choose from the three models shown in the screenshot. Obviously the first one underfits and the third one overfits. The middle one uses three parameters and is “just right”. So to some extent, more features and parameters are not always better. We need to “penalize” some of them so that they contribute less to the model.

The loss function we choose is $\underset{\theta}{\min} \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^{2}$; we minimize this energy to find the parameters $\theta$. The $\frac{1}{m}$ averages over the $m$ training examples, and the $\frac{1}{2}$ cancels the factor of 2 that appears when we differentiate the square. As written in the screenshot above, if we want to penalize ${\theta}_{3}$ and ${\theta}_{4}$, we can add two new terms:
$\underset{\theta}{\min} \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^{2} + 1000\,{\theta}_{3}^{2} + 1000\,{\theta}_{4}^{2}$
We multiply ${\theta}_{3}^{2}$ and ${\theta}_{4}^{2}$ by a large value. The terms are squared so that the penalty is never negative and grows with the magnitude of the parameter in either direction. When we minimize the energy, we get extremely small ${\theta}_{3}$ and ${\theta}_{4}$, so they contribute much less than the rest of the parameters.
In general, we can penalize all of the parameters. Next is a regularized negative log-likelihood loss function:
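
To make the effect of the two extra terms concrete, here is a minimal sketch in plain Python. It assumes a degree-4 polynomial hypothesis and a penalty on the squares of ${\theta}_{3}$ and ${\theta}_{4}$; the data points, the function names and the weight of 1000 are made up for illustration only.

```python
def hypothesis(theta, x):
    """Polynomial h_theta(x) = theta_0 + theta_1*x + ... + theta_4*x^4."""
    return sum(t * x ** k for k, t in enumerate(theta))


def penalized_cost(theta, xs, ys, weight=1000.0):
    """Squared-error cost plus a heavy penalty on theta_3 and theta_4."""
    m = len(xs)
    fit = sum((hypothesis(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    penalty = weight * theta[3] ** 2 + weight * theta[4] ** 2
    return fit + penalty


# Made-up data lying exactly on y = 1 + x^2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 5.0, 10.0]

quadratic = [1.0, 0.0, 1.0, 0.0, 0.0]  # theta_3 = theta_4 = 0
wiggly = [1.0, 0.0, 1.0, 0.1, 0.0]     # small cubic term switched on

print(penalized_cost(quadratic, xs, ys))
print(penalized_cost(wiggly, xs, ys))
```

Even a cubic coefficient as small as 0.1 adds about 10 to the cost through the penalty alone, so a minimizer is strongly pushed toward ${\theta}_{3} \approx 0$ and ${\theta}_{4} \approx 0$.
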

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} {\theta}_{j}^{2}$
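
As a rough sketch of how this cost could be computed, the snippet below assumes a logistic hypothesis $h_{\theta}(x) = \sigma(\theta^{T}x)$, a leading feature of 1 for the bias, and the common convention that ${\theta}_{0}$ is left out of the penalty. The function names and the tiny data set are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def regularized_nll(theta, xs, ys, lam):
    """Negative log-likelihood plus (lam / 2m) * sum_j theta_j^2 for j >= 1.

    xs is a list of feature lists whose first entry is 1.0 (bias term).
    """
    m = len(xs)
    nll = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        nll -= y * math.log(h) + (1 - y) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t ** 2 for t in theta[1:])
    return nll / m + penalty


# Made-up examples: [bias, feature] with binary labels.
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
ys = [1, 0, 1]

print(regularized_nll([0.0, 1.0], xs, ys, lam=0.0))
print(regularized_nll([0.0, 1.0], xs, ys, lam=10.0))
```

With the same $\theta$, a larger $\lambda$ gives a larger cost, so the optimizer has to trade data fit against parameter size.
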

So $\lambda$ controls the trade-off between overfitting and underfitting. Overfitting is a sign of high variance: the model puts too much emphasis on the training data. Underfitting is a sign of high bias. A larger $\lambda$ shrinks the parameters, which reduces variance and helps avoid overfitting, but if $\lambda$ is too large the model starts to underfit.
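
The shrinking effect of $\lambda$ can be seen in the simplest possible case. The sketch below assumes a one-parameter model $h_{\theta}(x) = \theta x$ with no intercept and the squared-error cost plus $\frac{\lambda}{2m}{\theta}^{2}$. Setting the derivative to zero gives $\theta = \frac{\sum x y}{\sum x^{2} + \lambda}$. The data are made up.

```python
def ridge_slope(xs, ys, lam):
    """Minimizer of (1/2m) * sum((theta*x - y)^2) + (lam/2m) * theta^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lam in [0.0, 14.0, 1386.0]:
    print(lam, ridge_slope(xs, ys, lam))
```

With $\lambda = 0$ the slope is recovered exactly. As $\lambda$ grows, the fitted slope is pulled toward zero, and the model eventually becomes too flat to follow the data.
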

Loss Functions

Let’s set regularization aside for a moment and go through some standard loss functions.

Energy Loss

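The energy loss is straightforward: it is simply the energy of the correct answer, $L_{energy}(Y^{i}, E(W, \mathcal{Y}, X^{i})) = E(W, Y^{i}, X^{i})$. It only pushes down the energy of the correct answer, so it works only when doing so automatically makes the energies of the other answers higher.

It is popular in regression and neural network training.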
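
As a small illustration, the sketch below assumes a hypothetical regression energy $E(W, Y, X) = (WX - Y)^{2}$ with a single scalar weight. In that setting the energy loss, taken as the average energy of the correct answers, is just the mean squared error. The names and data are made up.

```python
def energy(w, y, x):
    """Hypothetical energy: squared distance between prediction w*x and y."""
    return (w * x - y) ** 2


def energy_loss(w, samples):
    """Average energy of the correct answers over the training set."""
    return sum(energy(w, y, x) for x, y in samples) / len(samples)


# Made-up (x, y) pairs lying on y = 2x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

loss_bad = energy_loss(0.0, samples)
loss_good = energy_loss(2.0, samples)
print(loss_bad, loss_good)
```

Here lowering the energy at the correct $y$ necessarily raises it elsewhere, because the energy is a fixed-shape quadratic in $y$. That is why this loss is enough for such architectures.
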

Generalized Perceptron Loss

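Similarly, we give the definition below:

$L_{perceptron}(Y^{i}, E(W, \mathcal{Y}, X^{i})) = E(W, Y^{i}, X^{i}) - \underset{Y \in \mathcal{Y}}{\min} E(W, Y, X^{i})$

We subtract the minimum value (a lower bound) of the energy over all answers from the energy of the correct answer, which makes the perceptron loss non-negative. It is zero exactly when the correct answer already has the lowest energy.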
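
A minimal sketch of this loss for a discrete set of answers follows. It assumes the energies for one input are already available as a dictionary from label to energy. The labels and numbers are made up.

```python
def perceptron_loss(energies, correct):
    """Energy of the correct answer minus the lowest energy over all answers."""
    return energies[correct] - min(energies.values())


# Made-up energies for one input; the correct label is "cat".
energies_wrong = {"cat": 2.0, "dog": 0.5, "bird": 3.0}   # "dog" has the lowest energy
energies_right = {"cat": 0.25, "dog": 0.5, "bird": 3.0}  # "cat" has the lowest energy

print(perceptron_loss(energies_wrong, "cat"))
print(perceptron_loss(energies_right, "cat"))
```

The loss is positive while some other answer has a lower energy, and it drops to zero as soon as the correct answer is the minimum, even if only by a tiny amount.
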

Generalized Margin Losses

This family contains many losses: hinge loss, LVQ2 loss, minimum classification error (MCE) loss, square-square loss, square-exponential loss, etc. You have probably seen some of them in general machine learning approaches. The main goal is to create an energy gap (a margin) between the correct answer and the incorrect ones.
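
As one example from this family, here is a sketch of the hinge loss in its energy-based form, $\max(0, m + E_{correct} - E_{offending})$, where the "most offending" answer is assumed to be the incorrect answer with the lowest energy. The margin of 1.0, the labels and the energies are made up.

```python
def hinge_loss(energies, correct, margin=1.0):
    """max(0, margin + E(correct) - E(most offending incorrect answer))."""
    most_offending = min(e for label, e in energies.items() if label != correct)
    return max(0.0, margin + energies[correct] - most_offending)


# Made-up energies for one input; the correct label is "cat".
narrow_gap = {"cat": 0.25, "dog": 0.5, "bird": 3.0}  # gap of 0.25, below the margin
wide_gap = {"cat": 0.25, "dog": 2.0, "bird": 3.0}    # gap of 1.75, above the margin

print(hinge_loss(narrow_gap, "cat"))
print(hinge_loss(wide_gap, "cat"))
```

In the first case the correct answer already has the lowest energy, yet the loss is still positive because the gap is smaller than the margin. The loss only vanishes once the gap reaches the margin.
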

TBC..