Let’s try some ways to speed up our learning!
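Cross-entropy (CE) is a widely used cost function. It is always non-negative, and it approaches 0 as the neuron's output gets close to the desired output, which is a good property for a cost function. CE also helps avoid the learning slowdown that occurs when a sigmoid neuron saturates. For a single sigmoid neuron it is defined as below:

$$C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln(1 - a) \right]$$

Here n is the number of training examples, the sum runs over all training inputs x, y is the desired output, and a is the neuron's actual output.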
(need some calculations here…)
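As a rough sketch of that calculation, assume the standard single-neuron cross-entropy $C = -\frac{1}{n} \sum_x [y \ln a + (1 - y) \ln(1 - a)]$ with $a = \sigma(z)$ and $z = \sum_j w_j x_j + b$ (these symbols are assumptions, since the original working is not shown here). Applying the chain rule to a weight $w_j$ gives:

$$\frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \sigma'(z)\, x_j = \frac{1}{n} \sum_x \frac{\sigma'(z)\, x_j}{\sigma(z)\,(1 - \sigma(z))} \left( \sigma(z) - y \right)$$

Since $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, that factor cancels with the denominator. The same steps apply to the bias b, with $x_j$ replaced by 1.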
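Eventually, you will get:

$$\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j \left( \sigma(z) - y \right), \qquad \frac{\partial C}{\partial b} = \frac{1}{n} \sum_x \left( \sigma(z) - y \right)$$

where a = σ(z) is the output of the sigmoid function and x_j is the j-th input to the neuron.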
From the partial derivatives, we can see that the learning speed is proportional to (σ(z)−y), the error in the output. When we are making a terrible mistake, the error is large, so the neuron learns quickly.
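To make this concrete, here is a minimal sketch for one training example on a single-input sigmoid neuron. The function names and the starting weights are illustrative assumptions, not taken from the original. It computes the cross-entropy gradient as x·(a − y) and (a − y), and checks the weight gradient against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, b, x, y):
    # Cost for a single training example (n = 1).
    a = sigmoid(w * x + b)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def cross_entropy_grad(w, b, x, y):
    # Analytic gradient: (dC/dw, dC/db) = (x * (a - y), a - y).
    a = sigmoid(w * x + b)
    return x * (a - y), a - y

def numeric_grad_w(w, b, x, y, eps=1e-6):
    # Central finite difference, used only as a sanity check.
    return (cross_entropy(w + eps, b, x, y)
            - cross_entropy(w - eps, b, x, y)) / (2.0 * eps)

# Hypothetical setup: input 1.0, desired output 0.0.
x, y = 1.0, 0.0
grad_bad = cross_entropy_grad(2.0, 2.0, x, y)     # output near 1: badly wrong
grad_mild = cross_entropy_grad(-1.0, -1.0, x, y)  # output near 0: almost right

print("badly wrong start:", grad_bad)
print("almost right start:", grad_mild)
```

The badly wrong starting point produces the larger gradient, so a gradient-descent step from there moves the weights further.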
Instead of applying a sigmoid function to the weighted inputs to get the outputs, we use a softmax here.
According to Michael, the activation value at neuron j is defined as below:

$$a_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

where the denominator sums the exponentials over all output neurons. If one of the activation values a_j increases, the others must decrease, because they always sum to 1.
The softmax output can be treated as a probability distribution: 1) each value lies in (0,1); 2) the values sum to 1. In practice, we would like to map real values to probabilities, which helps us classify. The softmax function converts raw values into posterior probabilities. This is also where the non-linearity comes from.
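A small sketch of these properties, using only the standard library. Subtracting the maximum before exponentiating is a common numerical-stability trick and is an assumption here, not part of the definition; it does not change the result. The input values are made up for illustration.

```python
import math

def softmax(z):
    # Shift by the max so math.exp never overflows; the ratios are unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weighted inputs for three output neurons.
before = softmax([1.0, 2.0, 3.0])
# Raise only the third input: its activation goes up, the others go down.
after = softmax([1.0, 2.0, 4.0])

print(before, sum(before))
print(after, sum(after))
```

The predicted class is simply the index of the largest activation.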
ReLU (red) has range [0, ∞); the softplus function, a smooth approximation of it, has range (0, ∞).
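For reference, a minimal sketch of the two functions, assuming the usual definitions relu(z) = max(0, z) and softplus(z) = ln(1 + e^z). The rearranged softplus formula below is an equivalent form chosen to avoid overflow for large z.

```python
import math

def relu(z):
    return max(0.0, z)

def softplus(z):
    # ln(1 + e^z) rewritten as max(z, 0) + ln(1 + e^(-|z|)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

for z in (-30.0, -1.0, 0.0, 1.0, 30.0):
    print(z, relu(z), softplus(z))
```

Softplus stays strictly positive and approaches ReLU as |z| grows.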
The sigmoid function is useful for binary outputs (0 or 1): it saturates close to 1 for large positive inputs and close to 0 for large negative inputs, which can be read as high confidence in the prediction.
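A quick sketch of that saturation, with a hypothetical 0.5 threshold for turning the output into a binary label (the threshold and the helper name `predict` are assumptions for illustration):

```python
import math

def sigmoid(z):
    # Two branches so math.exp is only called on non-positive arguments.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(z, threshold=0.5):
    # Map the sigmoid output to a binary label.
    return 1 if sigmoid(z) >= threshold else 0

for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, sigmoid(z), predict(z))
```

Outputs near 0.5 correspond to inputs near 0, where the neuron is least certain.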
Softmax function, designed for multi-class outputs, performs well in our case.
Other activation functions, such as tan-sigmoid, linear combination, and the step function, can be found in .