# Deep Learning 16: Understanding Capsule Nets

This post is the learning notes from Prof Hung-Yi Lee‘s lecture, the pdf could be found here (page40-52). I have read few articles, and I found this is a must-read. It is simple, and you can easily understand what is going on. I would say it is a good starting point for further readings.

## Capsule

The existing CNNs are good. But people found that more improvements are possible. Let’s take a look at the following example:

In a typical CNN model, we might have a large number of the first type of pictures in training dataset. Usually, it is a little bit hard for the model to recognize the second type of pictures. Well, sometimes it is hard for a human, too. In CNNs, a neuron is set to detect a certain pattern, like an edge or the combinations of edges.

In Capsule Net, we let each capsule to detect a certain type of patterns, which means it accepts any rotations. If we say, the “cat” is a pattern we want to find. We know the two cat images are the same pattern, and one capsule is needed. However, in CNN, you will need two neurons to find them. If you meet something like the following, one capsule is enough…. For each output vector v of the capsule: 1. Each dimension of v represents the characteristics of patterns.
2. The norm of v represents the existence.

In another word, the vector stands for how confident the pattern exists, and what the input picture looks like? (you can reconstruct the input image using the vector)

## How capsule works

Like the neurons, we might have multiple levels of capsules. Then outputs of the last level of capsules could be the input to the current level. Let us assume now we have two output vectors from the last level v1 and v2, then we do some computation in the capsule, and generate our output v. So the magic happens in the blue area in the screenshot: It is very straightforward: $u^1=W^1v^1,u^2=W^2v^2\\s=c_1u^1+c_2u^2$
where we have $c_1+c_2=1$, we can think that they are weights.

Then there is a “Squashing” operation: $v=\frac { \left\| s \right\| ^{ 2 } }{ 1+\left\| s \right\| ^{ 2 } } \frac { s }{ \left\| s \right\| }$

When the norm of s is small, then v trends to be 0; when the norm is huge, then v trends to be close to 1 — sounds just like probability, that is how the norm reflects how confident we are about if a pattern exists or not.

During the whole process, the weight matrices W1 and W2 are trained. But $c_1$ and $c_2$ are called coupling coefficients and are determined by dynamic routing during the testing stage.

## How dynamic routing works

Now let us figure out how to get the value of c. In the paper, there is the Routing Algorithm. Routing Algorithm from paper

Prof Hung-Yi provided a “simplified version”:  A simplified Routing Algorithm

After we muliply the outputs $v_i$ with weight matrics from last level of Capsules, we could get a couple of new vector $u_i$. We initialize $c_i$ at iteration 0 with $b_i^0$ and set them to be zero. Here, we will do $T$ iterations, and we pre-define it as any other hyperparameter. The superscipt $r$ means the current value of the iteration. Because we want all the $c_i^r$ to be in the range of 0 to 1, here comes the first line in the for-loop. Then we calculate $s^r$ by the new $c_i^r$, and naturally calculate the new $a^r$ using the Squashing operation. Finally, $a^r$ is the output of the Squashing, and we will update the $b_i^r$. To understand the last line: Let us assume that in iteration $r$, the $u_1$ and $u_2$ are closer, while $u_3$ is far from them. After calculation, we find that $a$ is somewhere closer to $u_1$ and $u_2$. Which means, we should condsider more to use $u_1$ and $u_2$ to represent $a$, considerless on $u_3$, because it is far. So, we should give higher value to the weights of $u_1$ and $u_2$, and the weights are exactly $b_1$ and $b_2$ and they should be improved. By the operation of the last line in the algorithm, we updated all value of $b_i$. In this case, although we “add” something to $b_3$ as well, after a softmax, (remember $c_1$ and $c_2$ are improved, and all $c$ sum up to 1), $c_3$ is reduced from the previous iteration.

There are people saying that Hinton attempts to drop BP Algorithm, and dynamic routing is somehow trying to do that. For the choice of $T$, the paper used 3, which is enough for the model to converge. Is there any theory supporting this idea? We need to figure out.

## Compare with NNs, CNNs and RNNs

From Scalar to Vector
A special interesting point for CapsNet is “Vector in vector out”. Some people say that capsule is a “series” neurons. But we can compare it with traditional neuron as well. The next screenshot is from https://github.com/naturomics/CapsNet-Tensorflow, you can also find TensorFlow code for CapsNets: Let’s recall pooling operations in CNNs, we have scalars as output (picture from Stanford CS231n lecture): In CapsNets, we are keeping a vector, whose norm represents the existence, while the elements inside the vector represent the characteristics of patterns. Compared with CNNs, CapsNets provide richer information.

From Vector to Matrix
It is possible to expand the ‘Vector in Vector out’ idea to matrix level. There is a new submission for ICLR 2018:
Matrix capsules with EM routing

It came out to be another paper by Hinton.

Dynamic Routing is the highlight of the article. Think about how we find values of c, we set T = 3. We also need backpropagation to learn their values. So in Prof Hung-Yi Lee‘s lecture, he mentioned that this is somehow similar with RNN, we are feeding one value to the next timestamp. ## Pros and Cons

Explainable AI. A big difference is we moved from a scalar to a vector with meaningful features. Also, in the experiments, we could see some elements are controlling something that human would understand. I think we are one step closer to the “Explainable AI” now. Robustness to Affine Transformations. Stronger generalization ability. Possible to be applied to Transfer Learning.

Dynamic Routing, as stated in the paper title is the main new idea. However, people may blame the efficiency. Well, it has not been tested on a large scale setting, say the ImageNet dataset. It would be very impressive to see if it scales and performs well.