Learning notes for Lecture 7 Modeling sequences: A brief overview. by Geoffrey Hinton 
Targets of Sequence modeling:
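– Turn an input seq into an output seq that lives in a different domain (e.g. speech recognition)
– Predict the next term in the input sequence (the target is the input advanced by 1 step; more natural than predicting one pixel of an image from the other pixels)
– Predicting the next term blurs the distinction between supervised and unsupervised learning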
1. Autoregressive models
Use a fixed number of previous terms to predict the next one. (The assumption is that what happened in the past can affect the future.) – linear
2. Feed-forward neural nets
Generalize autoregressive models by using one or more layers of hidden units. – non-linear
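The two memoryless models above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the weight layout (oldest term first) and the choice of tanh hidden units are assumptions of mine, not something taken from the lecture.

```python
import math

def ar_predict(history, weights):
    """Linear autoregressive prediction from the last len(weights) terms.

    weights[0] multiplies the oldest term in the window, weights[-1] the newest.
    """
    k = len(weights)
    if len(history) < k:
        raise ValueError("need at least as many past terms as weights")
    window = history[-k:]
    return sum(w * x for w, x in zip(weights, window))

def ffnn_predict(history, w_hidden, b_hidden, w_out, b_out):
    """Same fixed window, passed through one layer of tanh hidden units."""
    k = len(w_hidden[0])
    if len(history) < k:
        raise ValueError("need at least as many past terms as inputs")
    window = history[-k:]
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, window)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Weights [-1, 2] continue any arithmetic sequence: x_t = 2*x_{t-1} - x_{t-2}.
next_term = ar_predict([1, 2, 3, 4], [-1.0, 2.0])
```

Both functions only ever see a fixed window of the past, which is why they count as memoryless.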
Beyond memoryless models
-Stores info in hidden states for a long time
-Real world is noisy: infer a probability distribution over the space of hidden state vectors! (Probs help a lot.)
Two types of hidden-state models are tractable:
1. Linear Dynamical Systems
-Driving inputs directly influence the hidden state, and the hidden state determines the outputs.
-Inferring the hidden state helps us predict the output. A linearly transformed Gaussian is still a Gaussian, so the distribution over hidden states can be computed efficiently with “Kalman filtering”, a recursive way of updating the representation of the hidden state given each new observation.
-The hidden state has linear dynamics with Gaussian noise (a linear model with Gaussian noise) -> does a good job.
“So, given the observations of outputs, we can not really get the exact hidden state it was in, but we could estimate a Gaussian distribution over the possible hidden states, by assuming the model is correct based on the observing reality.”
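The recursive update behind Kalman filtering can be sketched for the simplest case, a scalar hidden state. The parameter names (`a` for the dynamics, `c` for the output map, `q` and `r` for the two noise variances) and their default values are my own assumptions for illustration.

```python
def kalman_step(mean, var, y, a=1.0, c=1.0, q=0.1, r=1.0):
    """One predict/update cycle for h_t = a*h_{t-1} + noise(q), y_t = c*h_t + noise(r)."""
    # Predict: push the Gaussian belief through the linear dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update: correct the prediction using the new observation y.
    gain = var_pred * c / (c * c * var_pred + r)
    new_mean = mean_pred + gain * (y - c * mean_pred)
    new_var = (1.0 - gain * c) * var_pred
    return new_mean, new_var

def kalman_filter(observations, mean0=0.0, var0=1.0, **params):
    """Run kalman_step over a sequence; returns a list of (mean, variance) pairs."""
    mean, var = mean0, var0
    estimates = []
    for y in observations:
        mean, var = kalman_step(mean, var, y, **params)
        estimates.append((mean, var))
    return estimates
```

The belief stays Gaussian at every step, so a mean and a variance are all that needs to be carried forward.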
2. Hidden Markov Models
Transitions between states are stochastic and controlled by a transition matrix. The output model is also stochastic -> we cannot be sure which state produced a given output (the state is “hidden”). But it’s easy to represent a probability distribution across N states with N numbers.
To predict the next output, we need to infer the prob. dist. over hidden states. (HMMs have efficient algos for inference and learning, speech recognition for example)
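(Personally, I like HMMs a lot. They can show higher accuracy than other models, but that depends on the application.)
A limitation: memory is short. An HMM with N hidden states can only remember log2(N) bits about what it has generated so far, so remembering a lot requires a huge number of states!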
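As a rough sketch of how an HMM infers the distribution over hidden states and then predicts the next output, here is the forward recursion in plain Python. The function names and the list-of-lists layout (`transition[i][j]`, `emission[state][symbol]`) are assumptions for illustration. It skips the rescaling a practical implementation would use to avoid underflow on long sequences, and it expects at least one observation.

```python
def hmm_forward(initial, transition, emission, observations):
    """Return (distribution over the current hidden state, likelihood of the sequence)."""
    n = len(initial)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [
            emission[j][obs] * sum(alpha[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    likelihood = sum(alpha)
    posterior = [a / likelihood for a in alpha]
    return posterior, likelihood

def predict_next_output(posterior, transition, emission):
    """Distribution over the next output symbol, given the current state distribution."""
    n = len(posterior)
    next_state = [sum(posterior[i] * transition[i][j] for i in range(n)) for j in range(n)]
    n_symbols = len(emission[0])
    return [sum(next_state[j] * emission[j][k] for j in range(n)) for k in range(n_symbols)]
```

The whole belief about the past is summarized by N numbers, one per state.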
Then we have the RNNs
Combine two properties:
– Distributed hidden state, storing info about the past!
– Non-linear dynamics, updating hidden states in complicated ways.
– Can settle to point attractors -> good for retrieving memories
– Can behave chaotically -> bad for info processing
– Computation is expensive (Tony Robinson’s speech recognizer.)
(Interesting) An example: binary addition
Using a feed-forward net alone imposes a few restrictions:
– We must decide the max number of digits in advance.
– Different bits are processed with different weights, so what is learned at the beginning of a long number doesn’t generalize to its end.
How does it determine the next state?
For instance, if the current state is “no carry print 0” and the next column is “01”, then the next state is “no carry print 1”.
So basically we move from right to left:
Two input units and one output unit.
The output for a column appears two time steps later (one time step for the inputs to update the hidden units, another for the hidden units to produce the output).
The network learns distinct patterns of hidden activity that play the role of the states, and the pattern at one time step votes for the hidden activity pattern at the next time step.
Powerful: with N hidden neurons, 2^N possible binary activity vectors, but only N^2 weights (fully interconnected).
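The state machine that the network has to emulate for binary addition can be written out directly. This is a hand-coded sketch, not a trained RNN. The state is a `(carry, printed_bit)` pair, and the function names and the string representation of the numbers are my own choices.

```python
def next_state(state, column):
    """state = (carry, printed_bit); column = the two input bits of the current column."""
    carry, _printed = state
    total = column[0] + column[1] + carry
    return (total // 2, total % 2)

def add_binary(a_bits, b_bits):
    """Add two equal-length bit strings (most significant bit first), right to left."""
    if len(a_bits) != len(b_bits):
        raise ValueError("pad the shorter number with leading zeros")
    state = (0, 0)  # start with no carry
    digits = []
    for x, y in zip(reversed(a_bits), reversed(b_bits)):
        state = next_state(state, (int(x), int(y)))
        digits.append(str(state[1]))
    if state[0]:  # a carry left over after the last column
        digits.append("1")
    return "".join(reversed(digits))

# "no carry print 0" followed by column "01" gives "no carry print 1"
example = next_state((0, 0), (0, 1))
```

The same transition rule is reused at every column, which is the weight sharing that a feed-forward net over a fixed number of digits lacks.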
Challenges for Training a RNN
In the forward pass, logistic functions squash the outputs into the range 0 to 1, preventing the activity vectors from exploding -> non-linear
In the backward pass, if you double the error derivatives at the final layer, all the error derivatives will double -> linear!
When we backpropagate:
– small weights -> gradients (propagated back through many time steps) shrink exponentially [vanish]
– big weights -> gradients grow exponentially [explode]
(In a traditional feed-forward NN with only a few hidden layers, we can easily cope with these effects.)
However, when training on long sequences, the gradients can easily explode or vanish. (Recent work shows we can avoid this by initializing the weights carefully.)
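A toy calculation makes the exponential shrinking and growing concrete. It assumes the simplest possible case: a single recurrent unit with weight `w`, and the same local slope of the non-linearity at every step. The function name and both parameters are assumptions for illustration.

```python
def backprop_through_time(error, w, steps, slope=1.0):
    """Scale an error derivative back through `steps` time steps.

    Each step multiplies by the recurrent weight and by the local slope of the
    non-linearity, so the backward pass is linear in `error`.
    """
    grad = error
    for _ in range(steps):
        grad *= w * slope
    return grad

vanished = backprop_through_time(1.0, w=0.5, steps=50)  # about 0.5 ** 50
exploded = backprop_through_time(1.0, w=2.0, steps=50)  # about 2.0 ** 50
```

With a logistic unit the slope is at most 0.25, which pushes the product further towards vanishing unless the weights are large.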
Four effective ways to learn an RNN
-LSTM (Next section)
-Hessian Free Optimization : using a fancy optimizer that can detect directions with a tiny gradient but even smaller curvature.
-Echo State Networks: initialize the input->hidden, hidden->hidden and output->hidden connections very carefully, so that only the hidden->output connections need to be learned.
-Good initialization with momentum: initialize as in Echo State Networks, but then learn all of the connections using momentum. For momentum, go here.
Remember things for a very long time! LSTM cell: a circuit that implements an analog memory cell.
Each gate is controlled by a value between 0 and 1; in this example the gates are simply 0 (off) or 1 (on).
First we set the keep gate to 0 and the write gate to 1, writing the value 1.7 into the cell. Then the keep gate is set to 1, so the value is kept. Next, the keep gate stays at 1 and we set the read gate to 1, reading the value out of the cell. Finally the keep gate is set to 0 and the value is flushed out.
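The gate schedule above can be traced with a tiny simulation. This is a simplified sketch of the memory cell only: the gates are supplied by hand as 0 or 1, whereas in a real LSTM they are learned logistic units. The function name and the choice to read from the freshly updated cell are my assumptions.

```python
def memory_cell_step(stored, keep, write, read, input_value):
    """One time step of a gated memory cell; the gates multiply the values they control."""
    new_stored = keep * stored + write * input_value
    output = read * new_stored
    return new_stored, output

# (keep, write, read, input) for the four steps: write 1.7, keep it, read it, flush it
schedule = [
    (0, 1, 0, 1.7),
    (1, 0, 0, 0.0),
    (1, 0, 1, 0.0),
    (0, 0, 0, 0.0),
]

stored = 0.0
outputs = []
for keep, write, read, x in schedule:
    stored, out = memory_cell_step(stored, keep, write, read, x)
    outputs.append(out)
```

While the keep gate is 1, the value is carried forward unchanged however many steps pass.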
A natural task for an RNN: Reading cursive handwriting 
The input is a sequence of (x, y, p) triples: (x, y) is the location of the pen, and p indicates whether the pen is up or down. The output is a sequence of characters.
Graves & Schmidhuber (2009) used a sequence of small images as input instead of pen coordinates.
A live demo (interesting and relevant) can be found here. It renders typed strings in a chosen handwriting style using an LSTM.