# Deep Learning 09: Small Tricks(2)

Let’s try some ways to speed up our learning!

Cross Entropy
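Cross-entropy (CE) is a widely used cost function. It is always greater than 0 and approaches 0 as the network's output gets close to the desired output, which is a good property for a cost function. CE also helps avoid the learning slowdown that occurs when a sigmoid neuron saturates. For a single sigmoid neuron it is defined as below:

$$C = -\frac{1}{n}\sum_x \left[\, y \ln a + (1-y)\ln(1-a) \,\right]$$

(some calculation is needed here…)
Eventually, you will get:

$$\frac{\partial C}{\partial w_j} = \frac{1}{n}\sum_x x_j\,(\sigma(z)-y), \qquad \frac{\partial C}{\partial b} = \frac{1}{n}\sum_x (\sigma(z)-y)$$

where a = σ(z) is the output of the sigmoid function, y is the desired output, and the sum runs over the n training inputs x.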
From the partial derivatives, we can see that the learning speed is proportional to (σ(z)−y), the error in the output. When we are making a terrible mistake, the error is large, so the learning speed is large too.
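To make the role of the error term concrete, here is a minimal sketch for a single sigmoid neuron with one input and one training example. The function names (`sigmoid`, `cross_entropy`, `grad_w`) and the sample weights are assumptions chosen for illustration, not taken from the original post.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(a, y):
    """Cross-entropy cost for one example: output a in (0, 1), target y."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def grad_w(x, w, b, y):
    """dC/dw for one example: x * (sigmoid(z) - y), with z = w*x + b."""
    return x * (sigmoid(w * x + b) - y)

# Hypothetical example: input 1.0, desired output 0.0.
x, y = 1.0, 0.0
bad = grad_w(x, w=2.0, b=2.0, y=y)     # output far from the target
good = grad_w(x, w=-2.0, b=-2.0, y=y)  # output close to the target

print("gradient when badly wrong:  ", bad)
print("gradient when nearly right: ", good)
```

The badly wrong neuron gets the larger gradient, so it takes the larger step, which is the behaviour described above.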

Softmax
Instead of applying a sigmoid function to the weighted inputs of the output layer to get the outputs, we use a softmax here.
According to Michael, the activation value at neuron j is defined as below:

$$a_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

where the denominator sums over all output neurons. If one of the activation values a increases, the others must decrease, because they sum up to 1.
The softmax output can be treated as a probability distribution: 1) each value lies in (0,1); 2) the values sum up to 1. In practice, we would like to map real values to probabilities, which helps us classify. The softmax function converts raw values into posterior probabilities. This is also exactly where the non-linearity comes in.
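The two properties can be checked with a small sketch. The `softmax` helper and the sample values are assumptions for illustration. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the definition, and it does not change the result.

```python
import math

def softmax(z):
    """Map a list of real values to probabilities that sum to 1."""
    m = max(z)  # stability trick (an addition here): shift so the largest exponent is 0
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw output values for three classes.
a = softmax([2.0, 1.0, 0.1])
# Raise only the first raw value and recompute.
a_bumped = softmax([3.0, 1.0, 0.1])

print(a, sum(a))
print(a_bumped, sum(a_bumped))
```

Raising the first raw value increases its probability and pushes the other two down, since the outputs must still sum to 1.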
Comparison:

ReLU (red) has range [0,∞); the softplus function is a smooth approximation of it.

Sigmoid (green): the sigmoid function is useful for binary outputs (0 or 1), and it shows high confidence (an output close to 0 or 1) when the input is very negative or very positive.
The softmax function, designed for multi-class outputs, performs well in our case.
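As a rough numerical companion to the comparison, the sketch below evaluates ReLU, softplus, and sigmoid at a few sample points. The function names and the sample inputs are assumptions chosen for illustration.

```python
import math

def relu(z):
    """Rectified linear unit: 0 for negative inputs, z otherwise."""
    return max(0.0, z)

def softplus(z):
    """Smooth approximation of ReLU: ln(1 + e^z), always positive."""
    return math.log1p(math.exp(z))

def sigmoid(z):
    """Logistic function: squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Compare the three functions at a negative, zero, and positive input.
for z in (-5.0, 0.0, 5.0):
    print(f"z={z:+.1f}  relu={relu(z):.4f}  softplus={softplus(z):.4f}  sigmoid={sigmoid(z):.4f}")
```

For large positive inputs softplus stays close to ReLU, while the sigmoid saturates toward 1. For large negative inputs the sigmoid saturates toward 0.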

Other activation functions include Tan-Sigmoid, linear combination, the step function, etc.