I’ve been optimizing my ConvNet code these days. Not all of the methods are working well, because I don’t have very strong knowledge of the hyper-parameters. Anyway, I’ll just write down some of the things I’ve learnt here.
Dropout
I made a mistake in the previous version of my ConvNet: I put the dropout in the middle of the conv layers. Usually, to prevent overfitting, we add dropout between the fully connected layers.
In TensorFlow, dropout keeps each connection with probability keep_prob and scales the kept ones up by 1/keep_prob, thus keeping the expected sum unchanged. I train with keep_prob = 0.5. It is said there are different “default” probabilities: in CV maybe 0.5 is good, while in NLP 0.2 is nice.
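As a concrete illustration, here is a minimal TF1-style sketch (the layer sizes and variable names are made up, not my actual network) with dropout placed between two fully connected layers and keep_prob fed in at run time:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1024])   # e.g. flattened conv features
keep_prob = tf.placeholder(tf.float32)         # 0.5 for training, 1.0 for testing

w1 = tf.Variable(tf.truncated_normal([1024, 256], stddev=0.1))
b1 = tf.Variable(tf.zeros([256]))
fc1 = tf.nn.relu(tf.matmul(x, w1) + b1)

# tf.nn.dropout keeps each unit with probability keep_prob and scales the
# kept units by 1/keep_prob, so the expected sum of activations is unchanged.
fc1_drop = tf.nn.dropout(fc1, keep_prob)

w2 = tf.Variable(tf.truncated_normal([256, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))
logits = tf.matmul(fc1_drop, w2) + b2
```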
Recently I went to a talk given by an NLP-focused group. They used a CNN first, then a dropout layer, followed by an LSTM; the inputs were sentences. However, when they removed the dropout layer, the accuracy improved. Their explanation was that the dropout layer drops some REALLY IMPORTANT features (or connections) here, which leads to bad performance. So I was wondering whether the dropout probability they chose was a reasonable one.
When testing, keep_prob is usually set to 1.0, i.e. dropout is turned off.
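A rough usage sketch under the same assumptions (sess, train_step, accuracy, y_ and the data batches are hypothetical names): feed keep_prob = 0.5 while training and 1.0 while evaluating.

```python
# Hypothetical training step: dropout active, kept units scaled by 1/0.5.
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys, keep_prob: 0.5})

# Hypothetical test step: keep_prob = 1.0 disables dropout entirely.
test_acc = sess.run(accuracy, feed_dict={x: test_xs, y_: test_ys, keep_prob: 1.0})
```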
Learning Rate
Starting from a small learning rate gives slower convergence but a better model. In contrast, a large learning rate converges faster. Running a large number of epochs is not always good… but with a small learning rate, be patient and wait until the loss is small enough.
Here are some of my results:
(Learning rate 0.1, constant)
(Initial learning rate 0.001, exponential learning rate decay)
With this setting, when the loss is around 3.08…, it drops slowly (compared to the run below) to 2.3, 2.24, …
(Initial learning rate 0.1, exponential learning rate decay)
What we want is: when the loss is huge, the learning rate is relatively big, so the model learns faster; when the loss is small, the learning slows down. Exponential decay achieves this.
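Here is a minimal sketch of exponential learning rate decay with tf.train.exponential_decay in TF1 (the decay_steps and decay_rate values are illustrative, not the ones I actually used, and loss is assumed to be defined elsewhere):

```python
import tensorflow as tf

global_step = tf.Variable(0, trainable=False)

# decayed_lr = 0.1 * 0.96 ^ (global_step / 1000)
learning_rate = tf.train.exponential_decay(
    learning_rate=0.1,      # initial learning rate
    global_step=global_step,
    decay_steps=1000,       # illustrative: decay every 1000 steps
    decay_rate=0.96,        # illustrative: multiply the rate by 0.96 each time
    staircase=True)         # True: decay in discrete steps; False: smooth decay

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# Passing global_step makes minimize() increment it, which drives the decay.
train_step = optimizer.minimize(loss, global_step=global_step)
```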