These are my learning notes from Prof. Hung-Yi Lee’s lecture; the PDF can be found here (pages 40-52). I have read a few articles, and I found this one a must-read. It is simple and easy to follow, and I would say it is a good starting point for further reading.
Existing CNNs work well, but people have found that further improvements are possible. Let’s take a look at the following example:
In a typical CNN model, we might have a large number of the first type of picture in the training dataset. Usually, it is a bit hard for the model to recognize the second type of picture. Well, sometimes it is hard for a human, too. In CNNs, a neuron is set to detect a certain pattern, like an edge or a combination of edges.
In a Capsule Net, we let each capsule detect a certain type of pattern, which means it also accepts rotated versions of that pattern. Say the “cat” is the pattern we want to find. The two cat images show the same pattern, so only one capsule is needed. In a CNN, however, you would need two neurons to find them. If you meet something like the following, one capsule is still enough:
For each output vector v of the capsule:
1. Each dimension of v represents a characteristic of the pattern.
2. The norm of v represents whether the pattern exists.
In other words, the vector encodes both how confident we are that the pattern exists and what the pattern in the input picture looks like (you can reconstruct the input image from the vector).
How a capsule works
Like neurons, capsules can be stacked in multiple levels, so the outputs of the previous level of capsules become the inputs to the current level. Let us assume we have two output vectors from the previous level, $v_1$ and $v_2$; we then do some computation in the capsule and generate our output $v$. The magic happens in the blue area in the screenshot:
It is very straightforward:
$$u_1 = W_1 v_1, \quad u_2 = W_2 v_2, \quad s = c_1 u_1 + c_2 u_2$$
where we have $c_1$ and $c_2$, which we can think of as weights.
Then there is a “Squashing” operation:
$$v = \frac{\|s\|^2}{1 + \|s\|^2} \, \frac{s}{\|s\|}$$
When the norm of s is small, the norm of v tends to 0; when the norm of s is huge, the norm of v tends to 1, which sounds just like a probability. That is how the norm reflects how confident we are that a pattern exists.
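The behaviour of the squashing operation can be checked numerically. The following is a minimal NumPy sketch, assuming the squashing formula from the dynamic routing paper, v = (‖s‖² / (1 + ‖s‖²)) · s / ‖s‖. The function name `squash` and the small constant `eps` (added only to avoid division by zero) are illustrative choices, not part of the lecture.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Keep the direction of s, but map its norm into the range [0, 1)."""
    sq_norm = np.sum(s ** 2)          # ||s||^2
    norm = np.sqrt(sq_norm)           # ||s||
    # eps is an assumption of this sketch; it only guards against ||s|| = 0.
    return (sq_norm / (1.0 + sq_norm)) * s / (norm + eps)

short = squash(np.array([0.01, 0.02]))   # input with a tiny norm
long_ = squash(np.array([30.0, 40.0]))   # input with a large norm (50)

print("norm of squashed short vector:", np.linalg.norm(short))
print("norm of squashed long vector: ", np.linalg.norm(long_))
```

The short vector is squashed to a norm near 0 and the long one to a norm just below 1, while the direction of the input is preserved.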
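During the whole process, the weight matrices $W_1$ and $W_2$ are learned by training. In contrast, $c_1$ and $c_2$, which are called coupling coefficients, are not learned parameters: they are determined by dynamic routing each time an input passes through the capsule, including during the testing stage.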
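As a rough illustration, one capsule's computation with two inputs might look like the sketch below. The dimensions (4-dimensional inputs, 3-dimensional output), the random values, and the hand-picked coupling coefficients `c1 = c2 = 0.5` are all assumptions made for the example; in a real CapsNet the weight matrices come from training and the coupling coefficients come from dynamic routing.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Keep the direction of s, but map its norm into the range [0, 1)."""
    sq_norm = np.sum(s ** 2)
    return (sq_norm / (1.0 + sq_norm)) * s / (np.sqrt(sq_norm) + eps)

rng = np.random.default_rng(0)

# Outputs of two capsules in the previous level (illustrative random values).
v1, v2 = rng.normal(size=4), rng.normal(size=4)

# Weight matrices: these are the parameters that training would learn.
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))

# Coupling coefficients: fixed by hand here, normally set by dynamic routing.
c1, c2 = 0.5, 0.5

u1, u2 = W1 @ v1, W2 @ v2      # transform each input vector
s = c1 * u1 + c2 * u2          # weighted sum of the transformed vectors
v = squash(s)                  # output vector of the capsule

print("output vector:", v)
print("output norm (confidence):", np.linalg.norm(v))
```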
How dynamic routing works
Now let us figure out how to get the value of c. In the paper, there is the Routing Algorithm.
Routing Algorithm from paper
Prof Hung-Yi provided a “simplified version”:
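A simplified Routing Algorithm
$$b_1^0 = b_2^0 = b_3^0 = 0$$
For $r = 1$ to $T$:
$$c_1^r, c_2^r, c_3^r = \mathrm{softmax}(b_1^{r-1}, b_2^{r-1}, b_3^{r-1})$$
$$s^r = c_1^r u_1 + c_2^r u_2 + c_3^r u_3$$
$$a^r = \mathrm{Squash}(s^r)$$
$$b_i^r = b_i^{r-1} + a^r \cdot u_i$$
After we multiply the outputs of the previous level of capsules by the weight matrices, we get a set of new vectors $u_1$, $u_2$, $u_3$. We initialize $b_1^0$, $b_2^0$, $b_3^0$ at iteration 0 and set them to zero. We then run $T$ iterations, where $T$ is predefined like any other hyperparameter. The superscript $r$ denotes the current iteration. Because we want all the $c_i$ to lie in the range of 0 to 1 and to sum to 1, the first line in the for-loop applies a softmax. Then we calculate $s^r$ from the new $c_i^r$, and naturally calculate the new $a^r$ using the squashing operation. Finally, with $a^r$ as the output of the squashing, we update the $b_i$. After the last iteration, $a^T$ is the output $v$ of the capsule. To understand the last line: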
Let us assume that in iteration $r$, $u_1$ and $u_2$ are close to each other, while $u_3$ is far from them. After the calculation, we find that $a^r$ is somewhere close to $u_1$ and $u_2$. This means we should rely more on $u_1$ and $u_2$ to represent $a^r$, and less on $u_3$, because it is far away. So we should give higher values to the weights of $u_1$ and $u_2$; those weights are exactly $c_1$ and $c_2$, and they should be increased. The last line of the algorithm updates all the values of $b_i$. In this case, although we “add” something to $b_3$ as well, after the softmax $c_3$ is smaller than in the previous iteration (remember that $c_1$ and $c_2$ have increased, and all the $c_i$ sum to 1).
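The routing loop and the "two close, one far" scenario can be tried out with a small NumPy sketch. This follows the simplified version of the algorithm (zero-initialised logits, softmax, weighted sum, squashing, dot-product update), not the full batched version from the paper. The function names `squash`, `softmax`, and `dynamic_routing` and the three example vectors are assumptions chosen for illustration.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Keep the direction of s, but map its norm into the range [0, 1)."""
    sq_norm = np.sum(s ** 2)
    return (sq_norm / (1.0 + sq_norm)) * s / (np.sqrt(sq_norm) + eps)

def softmax(b):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(b - np.max(b))
    return e / e.sum()

def dynamic_routing(u, T=3):
    """u has shape (n, d): one row per transformed input vector.

    Returns the capsule output a^T and the coupling coefficients c^T.
    """
    b = np.zeros(u.shape[0])      # routing logits, all zero at iteration 0
    for _ in range(T):
        c = softmax(b)            # coupling coefficients: in (0, 1), sum to 1
        s = c @ u                 # weighted sum of the input vectors
        a = squash(s)             # squashed candidate output
        b = b + u @ a             # raise b_i when u_i agrees with a
    return a, c

# Illustrative inputs: the first two vectors are close, the third is far away.
u = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [-1.0, 0.0]])
v, c = dynamic_routing(u, T=3)
print("coupling coefficients:", c)
print("capsule output:", v)
```

With these inputs, the coefficients of the first two vectors end up above 1/3 and the coefficient of the third ends up below 1/3, which matches the intuition described above.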
Some people say that Hinton is attempting to move away from the backpropagation (BP) algorithm, and that dynamic routing is a step in that direction. For the choice of $T$, the paper used 3, which is reported to be enough for the model to converge. Is there any theory supporting this choice? That still needs to be figured out.
Compare with NNs, CNNs and RNNs
From Scalar to Vector
A particularly interesting point about CapsNet is “vector in, vector out”. Some people say that a capsule is a group of neurons, but we can also compare it with a traditional neuron. The next screenshot is from https://github.com/naturomics/CapsNet-Tensorflow, where you can also find TensorFlow code for CapsNets:
Let’s recall the pooling operations in CNNs, where the outputs are scalars (picture from the Stanford CS231n lecture):
In CapsNets, we are keeping a vector, whose norm represents the existence, while the elements inside the vector represent the characteristics of patterns. Compared with CNNs, CapsNets provide richer information.
It came out to be another paper by Hinton.
Dynamic routing is the highlight of the article. Think about how we find the values of c: we iterate with T = 3. Training still needs backpropagation, which has to pass through these iterations to learn the weight matrices. So in Prof. Hung-Yi Lee’s lecture, he mentioned that this is somewhat similar to an RNN, where the value from one time step is fed to the next.
Pros and Cons
Explainable AI. A big difference is that we moved from a scalar to a vector with meaningful features. Also, in the experiments, we can see that some elements of the vector control properties that a human would understand. I think we are one step closer to “Explainable AI” now.
Robustness to affine transformations. Stronger generalization ability. Potentially applicable to transfer learning.
Dynamic routing, as stated in the paper’s title, is the main new idea. However, people may criticize its efficiency, since the routing is iterative. It also has not been tested in a large-scale setting, say on the ImageNet dataset. It would be very impressive if it scales and performs well there.