This post contains my learning notes from Prof Hung-Yi Lee‘s lecture; the PDF can be found here (pages 40-52). I have read a few articles on the topic, and I found this one is a must-read. It is simple, and you can easily understand what is going on. I would say it is a good starting point for further reading.
Paper link: Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, “Dynamic Routing Between Capsules”, NIPS, 2017
Capsule
The existing CNNs are good. But people found that more improvements are possible. Let’s take a look at the following example:
In a typical CNN model, we might have a large number of pictures of the first type in the training dataset. Usually, it is a little bit hard for the model to recognize the second type of pictures. Well, sometimes it is hard for a human, too. In CNNs, a neuron is set up to detect a certain pattern, like an edge or a combination of edges.
In Capsule Net, we let each capsule detect a certain type of pattern, which means it accepts any rotation of that pattern. Say the “cat” is a pattern we want to find. We know the two cat images show the same pattern, so only one capsule is needed; in a CNN, however, you would need two neurons to find them. If you meet something like the following, one capsule is still enough…
For each output vector v of the capsule:
1. Each dimension of v represents a characteristic of the pattern.
2. The norm of v represents the existence of the pattern.
In other words, the vector tells us how confident we are that the pattern exists and what it looks like in the input picture (you can even reconstruct the input image from the vector).
How capsule works
Like neurons, capsules can be arranged in multiple levels, and the outputs of the previous level of capsules become the input to the current level. Let us assume we now have two output vectors v1 and v2 from the previous level; the capsule does some computation on them and generates its output v. So the magic happens in the blue area of the screenshot:
It is very straightforward:
$$u_1 = W_1 v_1, \quad u_2 = W_2 v_2, \quad s = c_1 u_1 + c_2 u_2$$

where we have $c_1 + c_2 = 1$; we can think of $c_1$ and $c_2$ as weights.
Then there is a “Squashing” operation:

$$v = \mathrm{Squash}(s) = \frac{\|s\|^2}{1 + \|s\|^2} \, \frac{s}{\|s\|}$$
When the norm of s is small, the norm of v tends to 0; when the norm of s is huge, the norm of v tends to be close to 1 (for example, $\|s\| = 0.1$ gives $\|v\| \approx 0.01$, while $\|s\| = 10$ gives $\|v\| \approx 0.99$). It sounds just like a probability, and that is how the norm reflects how confident we are about whether a pattern exists or not.
During the whole process, the weight matrices $W_1$ and $W_2$ are trained. But $c_1$ and $c_2$ are called coupling coefficients, and they are determined by dynamic routing, even during the testing stage.
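To make the computation above concrete, here is a minimal NumPy sketch of one capsule’s forward pass. The vector dimensions, the random weight matrices, and the fixed values of $c_1$ and $c_2$ are all made up for illustration; in the real model the $W$’s are learned by backpropagation and the $c$’s come from dynamic routing.

```python
import numpy as np

def squash(s):
    # Squashing non-linearity: keeps the direction of s,
    # shrinks the norm into the range (0, 1).
    norm_sq = np.sum(s ** 2)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + 1e-9)

# Two output vectors from the previous level of capsules (toy values).
v1 = np.array([0.8, -0.2, 0.5])
v2 = np.array([0.1,  0.9, -0.3])

# Trainable weight matrices (random placeholders here).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 3))

# Prediction vectors u_i = W_i v_i.
u1, u2 = W1 @ v1, W2 @ v2

# Coupling coefficients (sum to 1); fixed here, but produced by
# dynamic routing in the actual model.
c1, c2 = 0.5, 0.5

s = c1 * u1 + c2 * u2
v = squash(s)
print(np.linalg.norm(v))  # in (0, 1): how confident we are that the pattern exists
```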
How dynamic routing works
Now let us figure out how to get the value of c. The paper gives the following Routing Algorithm:
Routing Algorithm from paper
Prof Hung-Yi provided a “simplified version”:
A simplified Routing Algorithm
After we multiply the outputs of the last level of capsules with the weight matrices, we get a couple of new vectors $u_1$ and $u_2$. We initialize $b_1$ and $b_2$ at iteration 0, i.e. $b_1^0$ and $b_2^0$, and set them to zero. Here, we will do $T$ iterations, and we pre-define $T$ like any other hyperparameter. The superscript $r$ means the current iteration. Because we want all the $c_i^r$ to be in the range of 0 to 1, here comes the first line in the for-loop (a softmax over the $b$’s). Then we calculate $s^r$ from the new $c_i^r$, and naturally calculate the new $a^r$ using the Squashing operation. Finally, $a^r$ is the output of the Squashing, and we use it to update the $b_i$ (after the last iteration, $a^T$ is the output $v$).

To understand the last line: let us assume that in iteration $r$, the prediction vectors $u_1$ and $u_2$ are close to each other, while $u_3$ is far from them. After the calculation, we find that $a^r$ is somewhere closer to $u_1$ and $u_2$. This means we should rely more on $u_1$ and $u_2$ to represent the output, and less on $u_3$, because it is far away. So we should give higher values to the weights of $u_1$ and $u_2$, and those weights are exactly $c_1$ and $c_2$, so they should be increased. By the operation of the last line in the algorithm, $b_i^r = b_i^{r-1} + a^r \cdot u_i$, we update all the values of $b_i$. In this case, although we “add” something to $b_3$ as well, after the softmax (remember $b_1$ and $b_2$ are increased more, and all the $c_i$ sum up to 1), $c_3$ is reduced from the previous iteration.
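To tie the description together, here is a minimal NumPy sketch of the simplified routing loop. The three toy prediction vectors and their dimension are made up; only the structure (softmax, weighted sum, Squash, agreement update) follows the algorithm above.

```python
import numpy as np

def squash(s):
    norm_sq = np.sum(s ** 2)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + 1e-9)

def dynamic_routing(u, T=3):
    """Simplified routing over prediction vectors u (shape: n_inputs x dim)."""
    b = np.zeros(u.shape[0])                 # b_i, all zero at iteration 0
    for _ in range(T):
        c = np.exp(b) / np.sum(np.exp(b))    # softmax: each c_i in (0, 1), sum to 1
        s = c @ u                            # s = sum_i c_i * u_i
        a = squash(s)                        # a = Squash(s)
        b = b + u @ a                        # b_i <- b_i + u_i . a  (agreement)
    return a                                 # v = a after the last iteration

# Toy example: u1 and u2 roughly agree, u3 points elsewhere.
u = np.array([[ 1.0, 0.0],
              [ 0.9, 0.1],
              [-1.0, 0.5]])
print(dynamic_routing(u, T=3))
```

If you print the intermediate c values, you can check that $c_1$ and $c_2$ (the vectors that agree with each other) grow over the iterations while $c_3$ shrinks, exactly the behavior described above.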
Some people say that Hinton is attempting to drop the BP algorithm, and dynamic routing somehow tries to do that. For the choice of $T$, the paper used 3, which is enough for the model to converge. Is there any theory supporting this choice? We still need to figure that out.
Comparison with NNs, CNNs and RNNs
From Scalar to Vector
A particularly interesting point of CapsNet is “vector in, vector out”. Some people say that a capsule is a “series” of neurons, but we can compare it with a traditional neuron as well. The next screenshot is from https://github.com/naturomics/CapsNet-Tensorflow, where you can also find TensorFlow code for CapsNets:
Let’s recall the pooling operation in CNNs: we have scalars as the output (picture from the Stanford CS231n lecture):
In CapsNets, we keep a vector whose norm represents the existence of a pattern, while the elements inside the vector represent the characteristics of the pattern. Compared with CNNs, CapsNets provide richer information.
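As a small illustration of “richer information”, here is a hedged NumPy sketch contrasting the two: max pooling keeps a single scalar per patch, while a capsule keeps a whole vector whose norm plays the role of that scalar. The toy values are made up.

```python
import numpy as np

# Max pooling over a 2x2 patch keeps a single scalar:
patch = np.array([[0.2, 0.9],
                  [0.4, 0.1]])
pooled = patch.max()            # 0.9 -- only "how strongly" the pattern fired

# A capsule keeps a vector: the norm says whether the pattern exists,
# the individual elements encode its characteristics (pose, thickness, ...).
v = np.array([0.6, -0.3, 0.2])
existence = np.linalg.norm(v)   # ~0.7
print(pooled, existence, v)
```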
From Vector to Matrix
It is possible to expand the “Vector in Vector out” idea to the matrix level. There is a new submission to ICLR 2018:
Matrix capsules with EM routing
It turned out to be another paper by Hinton.
Dynamic Routing is the highlight of the article. Think about how we find the values of c: we set T = 3 and iterate, and backpropagation still has to flow through these iterations during training. So in Prof Hung-Yi Lee‘s lecture, he mentioned that this is somewhat similar to an RNN: we are feeding one value to the next timestep.
Pros and Cons
Explainable AI. A big difference is that we moved from a scalar to a vector with meaningful features. Also, in the experiments, we can see that some elements of the vector control properties that a human would understand. I think we are one step closer to “Explainable AI” now.
Robustness to Affine Transformations. Stronger generalization ability, and possibly applicable to Transfer Learning.
Dynamic Routing, as stated in the paper title, is the main new idea. However, people may complain about its efficiency. Also, it has not been tested in a large-scale setting, say on the ImageNet dataset; it would be very impressive to see whether it scales and still performs well.
References
https://arxiv.org/abs/1710.09829
https://openreview.net/forum?id=HJWLfGWRb
http://cs231n.github.io/convolutional-networks/#pool
http://speech.ee.ntu.edu.tw/~tlkagk/index.html
https://zhuanlan.zhihu.com/p/33244896?group_id=939479988387405824
http://blog.csdn.net/yangdelong/article/details/78443872