First, let’s see what regularization is through a simple example. Then we will have a look at some different types of loss functions.
I reviewed the definition of regularization today from Andrew’s lecture videos.
You have probably noticed that some of the loss functions include a regularizer. So what is it for? Its main purpose is to prevent overfitting, especially when we have some prior knowledge about the problem. Take house prices as an example:
To address overfitting, we can either reduce the number of features, for example by selecting them manually, or use regularization. Regularization works well when we have many features and each of them contributes a little: we keep them all, but reduce the magnitude of their parameters.
So we want to find the relation between size and price, given the size feature as $x$ and the parameters as $\theta$. Choose from the three models shown in the screenshot. Clearly the first one is underfitting and the third one is overfitting. The middle one takes three parameters into consideration and performs “just right”. So, to some extent, more features and more parameters are not always better. We need to “penalize” some of them so that they contribute less to the model.
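The loss function we choose is $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$; we minimize this energy to find the parameters $\theta$. The factor $\frac{1}{2}$ is there to cancel the 2 that appears when we differentiate the squared term. As written in the screenshot above, if we want to penalize $\theta_3$ and $\theta_4$, we could add two new terms, each weighted by a large constant such as 1000:

$$
\min_\theta \; \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2
$$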
We multiply $\theta_3^2$ and $\theta_4^2$ by a large value. When we minimize the energy, we get extremely small $\theta_3$ and $\theta_4$, so they contribute much less than the rest of the parameters.
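To make the effect of the two extra terms concrete, here is a minimal sketch in Python with NumPy. The synthetic data, the quartic model, the penalty weight of 1000, and the helper name `fit_with_penalty` are all assumptions made for illustration; they are not taken from the lecture.

```python
import numpy as np

def fit_with_penalty(x, y, penalty):
    """Minimize 1/(2m) * sum((X @ theta - y)**2) + sum(penalty * theta**2).

    `penalty` holds one non-negative weight per parameter (hypothetical helper).
    """
    m = len(x)
    # Polynomial features: columns are 1, x, x^2, ..., one per parameter.
    X = np.vander(x, N=len(penalty), increasing=True)
    P = np.diag(penalty)
    # Setting the gradient to zero gives (X^T X / m + 2P) theta = X^T y / m.
    return np.linalg.solve(X.T @ X / m + 2 * P, X.T @ y / m)

# Synthetic data (assumption): a quadratic trend plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 20)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# Quartic model with no penalty at all.
theta_plain = fit_with_penalty(x, y, np.zeros(5))
# Same model, but theta_3 and theta_4 are weighted by 1000 in the loss.
theta_penalized = fit_with_penalty(x, y, np.array([0.0, 0.0, 0.0, 1000.0, 1000.0]))

print("no penalty:   ", theta_plain)
print("with penalty: ", theta_penalized)
```

With the penalty in place, the last two entries of `theta_penalized` are pushed towards zero, so the fitted curve behaves almost like the quadratic model.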
In general, we can penalize the other parameters as well. Next is a regularized negative log-likelihood loss function:
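As one concrete instance, assuming a logistic-regression hypothesis $h_\theta$ and an L2 penalty on every parameter except the bias $\theta_0$ (an assumption for illustration, not necessarily the exact form intended here), a regularized negative log-likelihood can be written as:

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
$$

The first term is the data-fitting part and the second term is the regularizer weighted by $\lambda$.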
So $\lambda$ controls the balance between overfitting and underfitting. Overfitting is a sign of high variance: the model puts too much emphasis on the training data. Underfitting is a sign of high bias. A larger $\lambda$ gives more ability to avoid overfitting, but if it is too large the model starts to underfit.
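The trade-off can be seen numerically with a small sketch. It assumes an L2 (ridge) penalty on a degree-6 polynomial fit, synthetic data, and a hypothetical helper `ridge_fit`; the bias term is left unpenalized by convention.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum((X @ theta - y)**2) + lam * sum(theta[1:]**2)."""
    n = X.shape[1]
    reg = lam * np.eye(n)
    reg[0, 0] = 0.0  # do not penalize the bias term
    # Normal equations with the penalty added: (X^T X + reg) theta = X^T y.
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

# Synthetic data (assumption): a noisy sine curve.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=x.size)
X = np.vander(x, N=7, increasing=True)  # features 1, x, ..., x^6

lambdas = [1e-4, 1e-2, 1.0, 100.0]
train_mse = []
penalty = []
for lam in lambdas:
    theta = ridge_fit(X, y, lam)
    train_mse.append(np.mean((X @ theta - y) ** 2))
    penalty.append(np.sum(theta[1:] ** 2))
    print(f"lambda={lam:g}  train MSE={train_mse[-1]:.4f}  sum(theta_j^2)={penalty[-1]:.4f}")
```

As $\lambda$ grows, the parameters shrink and the training error goes up: the model moves away from overfitting and, for large enough $\lambda$, towards underfitting.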
Let’s forget about the regularization for a moment, and go through some standard loss functions.
Generalized Perceptron Loss
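A minimal sketch of this loss, assuming the usual energy-based form in which the loss for one sample is the energy of the correct answer minus the lowest energy over all answers. The function name `perceptron_loss` and the example energies are hypothetical.

```python
import numpy as np

def perceptron_loss(energies, correct):
    """Energy of the correct answer minus the minimum energy over all answers.

    The result is zero when the correct answer already has the lowest energy,
    and positive otherwise. Note that it does not enforce any margin.
    """
    energies = np.asarray(energies, dtype=float)
    return float(energies[correct] - energies.min())

# The correct answer (index 0) has the lowest energy, so the loss is zero.
loss_ok = perceptron_loss([0.2, 1.0, 3.0], correct=0)
# Answer 1 has lower energy than the correct answer, so the loss is positive.
loss_bad = perceptron_loss([2.0, 0.5, 3.0], correct=0)
print(loss_ok, loss_bad)
```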
Generalized Margin Losses
This family includes many losses: hinge loss, LVQ2 loss, minimum classification error (MCE) loss, square-square loss, square-exponential loss, etc. You have definitely seen some of them in general machine learning approaches. The main goal is to create an energy gap between the correct answer and the incorrect ones.
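The energy gap idea can be sketched with the hinge loss. The sketch assumes the common energy-based form, which compares the energy of the correct answer with that of the lowest-energy incorrect answer and asks for a gap of at least `margin`. The function names and the example energies are hypothetical.

```python
import numpy as np

def most_offending_incorrect(energies, correct):
    """Lowest energy among the incorrect answers."""
    masked = np.array(energies, dtype=float)  # copy so the input is untouched
    masked[correct] = np.inf                  # exclude the correct answer
    return float(masked.min())

def hinge_loss(energies, correct, margin=1.0):
    """max(0, margin + E(correct) - E(most offending incorrect answer))."""
    e_correct = float(np.asarray(energies, dtype=float)[correct])
    e_incorrect = most_offending_incorrect(energies, correct)
    return max(0.0, margin + e_correct - e_incorrect)

# The correct answer wins, but the gap (0.8) is below the margin: small loss.
loss_small_gap = hinge_loss([0.2, 1.0, 3.0], correct=0)
# The gap (2.5) exceeds the margin, so the loss is zero.
loss_large_gap = hinge_loss([0.0, 2.5, 3.0], correct=0)
print(loss_small_gap, loss_large_gap)
```

Unlike the perceptron-style loss, this one stays positive until the correct answer is better than every incorrect answer by at least the margin.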