Posted in Deep Learning, Theory

Deep Learning 16: Understanding Capsule Nets

This post is the learning notes from Prof Hung-Yi Lee‘s lecture, the pdf could be found here (page40-52). I have read few articles, and I found this is a must-read. It is simple, and you can easily understand what is going on. I would say it is a good starting point for further readings.

Paper link: Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, “Dynamic Routing Between Capsules”, NIPS, 2017

Capsule

The existing CNNs are good. But people found that more improvements are possible. Let’s take a look at the following example:

 


In a typical CNN model, we might have a large number of the first type of pictures in training dataset. Usually, it is a little bit hard for the model to recognize the second type of pictures. Well, sometimes it is hard for a human, too. In CNNs, a neuron is set to detect a certain pattern, like an edge or the combinations of edges.

In Capsule Net, we let each capsule to detect a certain type of patterns, which means it accepts any rotations. If we say, the “cat” is a pattern we want to find. We know the two cat images are the same pattern, and one capsule is needed. However, in CNN, you will need two neurons to find them. If you meet something like the following, one capsule is enough….

21742485_1566984729991019_2964233363654705152_n

For each output vector v of the capsule:

27292512_1176795339122699_415695806_n

1. Each dimension of v represents the characteristics of patterns.
2. The norm of v represents the existence.

In another word, the vector stands for how confident the pattern exists, and what the input picture looks like? (you can reconstruct the input image using the vector)

How capsule works

Like the neurons, we might have multiple levels of capsules. Then outputs of the last level of capsules could be the input to the current level. Let us assume now we have two output vectors from the last level v1 and v2, then we do some computation in the capsule, and generate our output v. So the magic happens in the blue area in the screenshot:

Capsule

It is very straightforward:
u^1=W^1v^1,u^2=W^2v^2\\s=c_1u^1+c_2u^2
where we have c_1+c_2=1, we can think that they are weights.

Then there is a “Squashing” operation:

v=\frac { \left\| s \right\| ^{ 2 } }{ 1+\left\| s \right\| ^{ 2 } } \frac { s }{ \left\| s \right\| }

When the norm of s is small, then v trends to be 0; when the norm is huge, then v trends to be close to 1 — sounds just like probability, that is how the norm reflects how confident we are about if a pattern exists or not.

During the whole process, the weight matrices W1 and W2 are trained. But c_1 and c_2 are called coupling coefficients and are determined by dynamic routing during the testing stage.

How dynamic routing works

Now let us figure out how to get the value of c. In the paper, there is the Routing Algorithm.
Routing Algorithm

Routing Algorithm from paper

Prof Hung-Yi provided a “simplified version”:
https://scontent.fbed1-2.fna.fbcdn.net/v/t35.0-12/27144589_1177148749087358_1946096447_o.png?oh=292a4481674b793496fd3f9bf7367d42&oe=5A6A430B

27267037_1177150705753829_842833390_o

A simplified Routing Algorithm

After we muliply the outputs v_i with weight matrics from last level of Capsules, we could get a couple of new vector u_i. We initialize c_i at iteration 0 with b_i^0 and set them to be zero. Here, we will do T iterations, and we pre-define it as any other hyperparameter. The superscipt r means the current value of the iteration. Because we want all the c_i^r to be in the range of 0 to 1, here comes the first line in the for-loop. Then we calculate s^r by the new c_i^r, and naturally calculate the new a^r using the Squashing operation. Finally, a^r is the output of the Squashing, and we will update the b_i^r. To understand the last line:
27044559_1177164879085745_739904922_n

Let us assume that in iteration r, the u_1 and u_2 are closer, while u_3 is far from them. After calculation, we find that a is somewhere closer to u_1 and u_2. Which means, we should condsider more to use u_1 and u_2 to represent a, considerless on u_3, because it is far. So, we should give higher value to the weights of u_1 and u_2, and the weights are exactly b_1 and b_2 and they should be improved. By the operation of the last line in the algorithm, we updated all value of b_i. In this case, although we “add” something to b_3 as well, after a softmax, (remember c_1 and c_2 are improved, and all c sum up to 1), c_3 is reduced from the previous iteration.

There are people saying that Hinton attempts to drop BP Algorithm, and dynamic routing is somehow trying to do that. For the choice of T, the paper used 3, which is enough for the model to converge. Is there any theory supporting this idea? We need to figure out.

Compare with NNs, CNNs and RNNs

From Scalar to Vector
A special interesting point for CapsNet is “Vector in vector out”. Some people say that capsule is a “series” neurons. But we can compare it with traditional neuron as well. The next screenshot is from https://github.com/naturomics/CapsNet-Tensorflow, you can also find TensorFlow code for CapsNets:
Comparison

Let’s recall pooling operations in CNNs, we have scalars as output (picture from Stanford CS231n lecture):

maxpool

In CapsNets, we are keeping a vector, whose norm represents the existence, while the elements inside the vector represent the characteristics of patterns. Compared with CNNs, CapsNets provide richer information.

From Vector to Matrix
It is possible to expand the ‘Vector in Vector out’ idea to matrix level. There is a new submission for ICLR 2018:
Matrix capsules with EM routing

It came out to be another paper by Hinton.

Dynamic Routing is the highlight of the article. Think about how we find values of c, we set T = 3. We also need backpropagation to learn their values. So in Prof Hung-Yi Lee‘s lecture, he mentioned that this is somehow similar with RNN, we are feeding one value to the next timestamp.

ttt

Pros and Cons

Explainable AI. A big difference is we moved from a scalar to a vector with meaningful features. Also, in the experiments, we could see some elements are controlling something that human would understand. I think we are one step closer to the “Explainable AI” now.
27480091_1178637412271825_79524045_o

Robustness to Affine Transformations. Stronger generalization ability. Possible to be applied to Transfer Learning.

Dynamic Routing, as stated in the paper title is the main new idea. However, people may blame the efficiency. Well, it has not been tested on a large scale setting, say the ImageNet dataset. It would be very impressive to see if it scales and performs well.

References

https://arxiv.org/pdf/1710.09829.pdf
https://openreview.net/forum?id=HJWLfGWRb
http://cs231n.github.io/convolutional-networks/#pool
http://speech.ee.ntu.edu.tw/~tlkagk/index.html
https://zhuanlan.zhihu.com/p/33244896?group_id=939479988387405824
http://blog.csdn.net/yangdelong/article/details/78443872

Author:

Keep calm and update blog.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s