It is worth looking back at the loss functions we have applied so far, for example cross-entropy. Other loss functions target different practical problems, and you need to decide which one suits yours. Energy-Based Models (EBMs) offer a more general way to look at them. These are learning notes from A Tutorial on Energy-Based Learning.
The purpose of learning is to find an energy function that assigns low energies to correct values and higher energies to incorrect ones. Inference then amounts to finding the values that minimize the energy. This gives us a common inference/learning framework for many kinds of statistical models, probabilistic and, in particular, non-probabilistic.
Consider an image classification problem with 6 classes: Human, Animal, Airplane, Car, Truck and “None of the above”. The images are the inputs (X). Each image is a vector (say, RGB channel values in the range [0, 255]). As the output Y, we expect the model to provide a score for each class, like the probabilities after a softmax in a neural network. From the point of view of the energy function, we need a way to measure the quality of the model. If we provide an animal image, we expect the energy of “Animal” to be the lowest and the energies of the other classes to be higher. In other words, a small energy value indicates high compatibility between the input X and the output Y.
We write the energy function as $E(Y, X)$. $Y^*$ is the final answer produced by the model, chosen from a set $\mathcal{Y}$. So, given all possible answers, we need to find the one with the smallest energy: $Y^* = \arg\min_{Y \in \mathcal{Y}} E(Y, X)$.
When the set $\mathcal{Y}$ is small, we can find the answer by simply evaluating every candidate and keeping the one with the minimum energy. When $\mathcal{Y}$ has a huge number of elements, that is, when there are too many classes (as in face recognition), this exhaustive search becomes very costly, even though the set is discrete and finite. Many NLP tasks fall into the same category.
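The exhaustive search over a small answer set can be sketched in a few lines of Python. This is only an illustration: the helper `infer`, the function `toy_energy` and the energy values are made up for the example and do not come from the tutorial.

```python
# Hypothetical energies for ONE animal image; lower means more compatible.
TOY_ENERGIES = {
    "Human": 2.1,
    "Animal": 0.3,
    "Airplane": 5.0,
    "Car": 4.2,
    "Truck": 4.4,
    "None of the above": 3.0,
}

def toy_energy(y, x):
    """Stand-in for E(Y, X). A real model would compute this from the image x."""
    return TOY_ENERGIES[y]

def infer(energy, x, candidates):
    """Exhaustive inference: return the candidate Y with the lowest energy E(Y, X)."""
    return min(candidates, key=lambda y: energy(y, x))

image = None  # placeholder input; the toy energy ignores it
best = infer(toy_energy, image, list(TOY_ENERGIES))
print(best)  # Animal
```

The cost grows linearly with the number of candidates, which is why this approach stops being practical for very large answer sets.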
For those cases we rely on a dedicated inference procedure. Such a procedure may produce only an approximate result, which may or may not be the global minimum of $E(Y, X)$. In practice the energy function is often non-convex in Y, so a local minimum is usually easier to find than the global one. In some cases the energy function has several equivalent minima. We will see different types of them.
From an application point of view, a model can be asked four types of questions:
1. Prediction, classification and decision-making: find the best Y, given X. The model returns an answer, such as the class the image belongs to or the decision to be made (like “steer left” in a self-driving car).
2. Ranking: which Y is more compatible with the given X? This is similar to the first type, but the model can return several answers that fit a given input. An example is recommending the top-k items instead of selecting a single item each time.
3. Detection: is the current Y compatible with X? Face detection is an example. The energy has to be compared against a threshold, and that threshold is generally unknown in advance.
4. Conditional density estimation: find $P(Y \mid X)$. The result is usually fed as input to other systems.
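The first three question types can all be answered from the same table of energies. The sketch below assumes a hypothetical set of energies for one input and made-up helper names (`predict`, `rank`, `detect`); the threshold value is arbitrary and only serves the illustration.

```python
# Hypothetical energies E(Y, X) for one fixed input X; lower means more compatible.
energies = {"Human": 2.1, "Animal": 0.3, "Airplane": 5.0, "Car": 4.2}

def predict(energies):
    """Prediction: the single answer with the lowest energy."""
    return min(energies, key=energies.get)

def rank(energies, k):
    """Ranking: the k most compatible answers, lowest energy first."""
    return sorted(energies, key=energies.get)[:k]

def detect(energies, y, threshold):
    """Detection: is answer y compatible with X, i.e. is its energy below the threshold?"""
    return energies[y] < threshold

print(predict(energies))                # Animal
print(rank(energies, 2))                # ['Animal', 'Human']
print(detect(energies, "Animal", 1.0))  # True
print(detect(energies, "Car", 1.0))     # False
```

Conditional density estimation needs calibrated values rather than raw energies, so it is left out of this sketch.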
Combining Results: Gibbs Distribution
Sometimes we need to combine results from different models. Energy values are measured in arbitrary units and are uncalibrated, so energies from different models sit on different scales. The most common way to calibrate them is to turn the energies into a Gibbs distribution:
$$P(Y \mid X) = \frac{e^{-\beta E(Y, X)}}{\int_{y \in \mathcal{Y}} e^{-\beta E(y, X)}}$$
where $\beta$ is an arbitrary positive constant (an inverse temperature). The denominator is the partition function: it sums (or integrates) $e^{-\beta E(y, X)}$ over all $y \in \mathcal{Y}$ and serves as the normalizer. As a result, all energies are mapped to values between 0 and 1 that sum to 1, as a probability distribution requires.
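For a discrete answer set, the Gibbs distribution is a softmax over the negated energies. Here is a minimal sketch, assuming a finite list of energies; the function name `gibbs` and the sample values are illustrative only.

```python
import math

def gibbs(energies, beta=1.0):
    """Map a list of energies to probabilities proportional to exp(-beta * E)."""
    # Subtracting the minimum energy avoids overflow in exp() and does not
    # change the result, because the common factor cancels in the ratio.
    e_min = min(energies)
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    partition = sum(weights)  # the partition function (up to the cancelled factor)
    return [w / partition for w in weights]

energies = [2.1, 0.3, 5.0]
probs = gibbs(energies)
print(probs)       # the lowest energy (index 1) gets the highest probability
print(sum(probs))  # sums to 1 up to floating-point rounding
```

Because only energy differences matter, adding the same constant to every energy leaves the probabilities unchanged, while a larger `beta` makes the distribution more peaked around the lowest energy.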
When we say we “train” a model, we mean that we first design the model and then learn its parameters W. This gives us $\mathcal{E} = \{E(W, Y, X) : W \in \mathcal{W}\}$, a set of parameterized energy functions.
A set of training samples is given as $S = \{(X^i, Y^i) : i = 1, \dots, P\}$.
The main job is to find the best energy function, so we need a way to measure the quality of a candidate: a loss functional. With a loss functional $\mathcal{L}(E, S)$, written $\mathcal{L}(W, S)$ because E is parameterized by W, we can look for the best parameters $W^* = \arg\min_{W} \mathcal{L}(W, S)$, that is, the W that yields the lowest loss on the training set S. We simply define it to be:
$$\mathcal{L}(E, S) = \frac{1}{P} \sum_{i=1}^{P} L(Y^i, E(W, \mathcal{Y}, X^i)) + R(W)$$
where $(X^i, Y^i)$ is the i-th training sample and P is the total number of training samples. The first term on the right is the average of the per-sample loss L. The second term, $R(W)$, is a regularizer that encodes our prior knowledge about which energy functions in the set are preferable.
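The structure of the loss functional (average per-sample loss plus regularizer) can be sketched as follows. Everything concrete here is an assumption made for illustration: the scalar energy `E(W, Y, X) = (Y - W * X)^2`, the choice of using the energy of the correct answer as the per-sample loss, the L2 regularizer and its weight are not taken from the text above.

```python
def energy(w, y, x):
    """Toy scalar energy: low when y is close to w * x (illustrative assumption)."""
    return (y - w * x) ** 2

def per_sample_loss(w, x, y):
    """Assumed per-sample loss: simply the energy of the correct answer."""
    return energy(w, y, x)

def regularizer(w, weight=0.01):
    """Assumed L2 regularizer R(W), expressing a preference for small parameters."""
    return weight * w ** 2

def loss_functional(w, samples):
    """Average per-sample loss over the training set plus the regularizer."""
    p = len(samples)
    data_term = sum(per_sample_loss(w, x, y) for x, y in samples) / p
    return data_term + regularizer(w)

samples = [(1.0, 2.0), (2.0, 4.0)]  # pairs (X^i, Y^i) generated by y = 2 * x
print(loss_functional(2.0, samples))  # data term is 0, only the regularizer remains
print(loss_functional(1.0, samples))  # larger: the energies of the correct answers grow
```

Training would then search for the W that minimizes `loss_functional`, for example by gradient descent.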
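Before introducing some well-known loss functions, keep three Ys in mind for each training sample i: $Y^i$ is the correct answer, $Y^{*i}$ is the answer produced by the model (the one with the lowest energy), and $\bar{Y}^i$ is the most offending incorrect answer (the one with the lowest energy among all incorrect answers).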
Suppose we have trained an image classification model for 4 classes [flower, dog, people, fish], and we give it a new input image of a dog. The model returns a tensor of probabilities: [flower=0.01, dog=0.70, people=0.20, fish=0.09]. The correct answer $Y^i$ is obviously dog. $Y^*$ is also dog (the highest probability, hence the lowest energy). $\bar{Y}$ is people (the lowest energy among the incorrect answers).
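The same example can be checked in code. The sketch assumes that energies are recovered from the probabilities as the negative log-probability (a Gibbs distribution with an inverse temperature of 1, up to an additive constant); the variable names are illustrative.

```python
import math

probs = {"flower": 0.01, "dog": 0.70, "people": 0.20, "fish": 0.09}
correct = "dog"  # the correct answer for this image

# Assumption: E = -log p, so a higher probability means a lower energy.
energies = {label: -math.log(p) for label, p in probs.items()}

# Answer produced by the model: lowest energy over all answers.
y_star = min(energies, key=energies.get)

# Most offending incorrect answer: lowest energy among the incorrect answers only.
y_bar = min((label for label in energies if label != correct), key=energies.get)

print(y_star)  # dog
print(y_bar)   # people
```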
We want to “push down” the energies of the correct answers and “pull up” the energies of the incorrect ones. Here is a short summary: