Let’s try some ways to speed up our learning!
Cross Entropy
CE is a widely used cost function. It is always larger than 0 and approaches 0 as the network’s outputs get close to the targets, which is exactly what we want from a cost function. CE also avoids the learning slowdown that the quadratic cost suffers when the sigmoid saturates. It is defined as below [1]:

C = −(1/n) Σ_x [ y ln(a) + (1−y) ln(1−a) ]

Differentiating with respect to the weights and the bias (the chain-rule steps are worked through in [1]), you eventually get:

∂C/∂w_j = (1/n) Σ_x x_j (σ(z) − y)
∂C/∂b = (1/n) Σ_x (σ(z) − y)

where a = σ(z) is the sigmoid of the weighted input z.
From the partial derivatives, we can see that the learning speed is controlled by (σ(z)−y), the error in the output. When we make a terrible mistake, the error is large, so the learning speed is large as well.
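To make this concrete, here is a minimal sketch, assuming a single sigmoid neuron with one input and one weight; the helper names `quadratic_grad_w` and `cross_entropy_grad_w` are mine for illustration, not from [1]. It compares the weight gradient under the quadratic cost with the one under cross-entropy when the neuron starts out badly wrong:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quadratic_grad_w(x, y, w, b):
    # dC/dw = (a - y) * sigma'(z) * x; the sigma'(z) factor shrinks when the sigmoid saturates
    z = w * x + b
    a = sigmoid(z)
    return (a - y) * a * (1 - a) * x

def cross_entropy_grad_w(x, y, w, b):
    # dC/dw = (sigma(z) - y) * x; proportional to the error, no sigma'(z) term
    z = w * x + b
    return (sigmoid(z) - y) * x

# A badly wrong neuron: input x = 1, target y = 0, but the output is close to 1.
x, y, w, b = 1.0, 0.0, 2.0, 2.0
print(quadratic_grad_w(x, y, w, b))      # ~0.017 -> learning crawls
print(cross_entropy_grad_w(x, y, w, b))  # ~0.982 -> learning is fast
```

The quadratic gradient is tiny because of the σ′(z) factor, while the cross-entropy gradient stays proportional to the error, which is the speedup described above.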
Softmax
Instead of applying a sigmoid activation in the output layer, we use a softmax here.
According to Michael Nielsen [1], the activation of output neuron j is defined as below:

a_j = e^(z_j) / Σ_k e^(z_k)

where the denominator sums over all output neurons. If one activation a_j increases, the others must decrease, because they all sum up to 1.
The softmax output can be treated as a probability distribution: 1) each value lies in (0,1); 2) the values sum up to 1. In practice, we want to map real-valued scores to probabilities, which helps us classify. Softmax converts raw values into posterior probabilities, and this is also where the non-linearity comes from.
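As an illustration, here is a minimal softmax sketch in NumPy; the max-subtraction is a standard numerical-stability trick rather than part of the definition above, and the example scores are made up:

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)      # subtracting the max avoids overflow and cancels in the ratio
    exps = np.exp(shifted)
    return exps / np.sum(exps)   # every output lies in (0, 1) and they sum to 1

scores = np.array([2.0, 1.0, -1.0])   # raw weighted inputs z_j for three classes
probs = softmax(scores)
print(probs)        # approximately [0.705, 0.259, 0.035]
print(probs.sum())  # 1.0, so the outputs behave like a posterior distribution

# Raising one raw value pushes its probability up and all the others down.
print(softmax(np.array([4.0, 1.0, -1.0])))  # approximately [0.947, 0.047, 0.006]
```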
* Sigmoid: σ(z) = 1 / (1 + e^(−z))

Comparison of activation functions [3]:
* ReLU (red) has range [0, ∞); the softplus function is its smooth approximation.
* Sigmoid (green) has range (0, 1).
The sigmoid function is useful for binary outputs (0 or 1), and it gives a confident output (close to 0 or 1) when the input is very large or very small.
The softmax function, designed for multi-class outputs, performs well in our case, as the sketch below shows.
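A minimal sketch of that contrast, using made-up weighted inputs: with independent sigmoids several classes can look “very likely” at once, whereas softmax forces a single normalised distribution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

z = np.array([3.0, 2.5, -1.0])   # weighted inputs of a 3-class output layer (made up)

print(sigmoid(z))  # ~[0.953, 0.924, 0.269]: two classes both look "very likely"
print(softmax(z))  # ~[0.615, 0.373, 0.011]: one normalised distribution over classes
```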
For other activation functions, such as tan-sigmoid, linear combinations, and the step function, see [2].
[1] http://neuralnetworksanddeeplearning.com/chap3.html
[2] https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions
[3] https://www.quora.com/What-is-special-about-rectifier-neural-units-used-in-NN-learning