A multi-layer neural network that can process 2-D signals such as images and voice spectrograms, and that stays stable under shifts and small distortions of the input.

**Weight Sharing**

Weight sharing is one of the most impressive features of CNNs.

Take a 1000 × 1000 pixel image as the input, with 1 million hidden units:

Fully connected (top left): each hidden unit is connected to every pixel. That is 10^12 connections in total, which means 10^12 weights to train.

Locally connected (top right): each hidden unit is connected only to a 10 × 10 region of the input (we call it a *patch* or *kernel*). That is 10^8 connections and the same number of weights, clearly far more efficient than the fully connected case.

More importantly, because the statistics of natural images are stationary (a feature that is useful in one region is likely useful in every other region), we can guess that the 100 weights connecting each hidden unit to its patch are the same for every hidden unit. In other words, all units detect very similar or even identical features. So, eventually, the locally connected NN has only 100 weights to train. That is the main idea of *convolution*.
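The three parameter counts above can be checked with a bit of arithmetic (a sketch using the 1000 × 1000 image and 10 × 10 patch from the text):

```python
# Parameter counts for a 1000x1000 input image with 1,000,000 hidden units.
pixels = 1000 * 1000      # input size
hidden = 1_000_000        # number of hidden units
patch = 10 * 10           # local receptive field ("patch" / "kernel")

fully_connected = pixels * hidden    # every unit sees every pixel
locally_connected = hidden * patch   # every unit sees only a 10x10 patch
weight_shared = patch                # one 10x10 kernel shared by all units

print(fully_connected)    # 1000000000000  (10^12)
print(locally_connected)  # 100000000      (10^8)
print(weight_shared)      # 100
```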

**Filters**

(See the terms in the picture below.)

The previous 100 weights extract just one feature.

We need to train more features. An image has a size, its *width* and *height*, plus 3 "features" (the *depth*): red, green, and blue. So initially we have 3 *feature maps*, and we can extract k features, where k is also the number of filters. The input x is a patch of a pre-defined size; the output y, after feature extraction, is k 2-D matrices. So filters help the network learn different features.

If we set 100 filters, there will be 100 × 100 = 10,000 parameters (weights) to train in total.
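As a sketch of k filters producing k 2-D feature maps (pure NumPy, "valid" convolution, with a hypothetical 28 × 28 input and 3 × 3 filters):

```python
import numpy as np

def conv2d_single(image, kernel):
    """'Valid' 2-D convolution (cross-correlation) over one feature map."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))     # one input feature map
filters = rng.standard_normal((5, 3, 3))  # k = 5 filters of size 3x3

# Each filter slides over the input and yields one 2-D feature map.
feature_maps = np.stack([conv2d_single(image, f) for f in filters])
print(feature_maps.shape)  # (5, 26, 26): k 2-D matrices
```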

**Stride**

Now you have the patch; I like to think of it as a "sliding window" over the input layer. After initialising the window, you move it to the next patch. You can think of the stride as the size of that moving step.

Personally, I suspect every unit or filter could be trained at the same time to increase efficiency, but I have not checked whether there are papers on parallelised training of CNNs.

When implementing (in TensorFlow, for example), the input is a 4-D tensor (batch, height, width, depth), and the strides are also given as a 4-D vector with one value per dimension.
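The effect of the stride on the output size follows a simple formula, floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, s the stride, and p the padding. A small sketch:

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# The 1000x1000 image with the 10x10 patch from earlier:
print(conv_output_size(1000, 10, 1))  # 991: stride 1
print(conv_output_size(1000, 10, 2))  # 496: stride 2 roughly halves the size
```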

**Architecture – Pyramid**

From left to right, the spatial size becomes smaller while the depth becomes larger: the picture gets blurrier, but more features are learnt. It looks like a pyramid.

Two basic steps in LeNet-5: convolution and subsampling.

LeNet-5 is Yann LeCun's work from around 1998 (he started researching neural nets even before 1989, before I was born!).

The whole design looks like the figure below; including the input and output layers, there are 7 layers:

We can see that convolutional and subsampling layers alternate. How do they work in detail?

When the input image is given, a self-defined function f acts as a convolutional filter, followed by adding a bias vector bx. Everything then becomes a large matrix, which we mark as layer Cx. Notice that Cx holds numerical results. That is the convolution process.

Then we use pooling to change 4 pixels into 1 (for example), choosing either max or average pooling; this is the feature extraction step. Next we add a linear model: multiply by weights and add biases. Finally, the numerical values are converted into activations via a sigmoid (or softmax) function. Now we have layer Sx. That is the subsampling process.
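The "4 pixels into 1" step can be sketched as non-overlapping 2 × 2 max pooling in NumPy (the 4 × 4 feature map here is hypothetical):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling: each 2x2 block becomes one value."""
    H, W = feature_map.shape
    assert H % 2 == 0 and W % 2 == 0
    # Reshape so each 2x2 block sits on axes 1 and 3, then take its max.
    blocks = feature_map.reshape(H // 2, 2, W // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1., 2., 5., 6.],
               [3., 4., 7., 8.],
               [0., 1., 2., 3.],
               [1., 0., 4., 2.]])
print(max_pool_2x2(fm))
# [[4. 8.]
#  [1. 4.]]
```

Average pooling would be the same reshape with `.mean(axis=(1, 3))` instead of `.max(...)`.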

The difference: S layers are always feature pictures, while C layers are numerical matrices. The Sx layer then becomes the input of the next C layer, and so on. That is why C and S layers alternate.

If you went through Andrew Ng's Machine Learning on Coursera, you will be familiar with the picture above, a demo of Yann LeCun's work. Now you can understand what is going on in the left part of the slide: it illustrates the convolutional layers. The deeper the layer, the more abstract the features it learns.

Let’s go back to Yann’s model.

After the S4 layer, there are two fully connected layers, C5 and F6.

C5 does a 1 × 1 convolution, nothing special to my current understanding. But there is actually an optimisation method based on it, mentioned in a paper by Google; maybe a topic for the future.

F6 has 12 × 7 (84) units, matching the output design. It is a standard NN layer, whose output is passed into a sigmoid function as a classifier.

**Dropout** [*]

Training large neural nets tends to overfit; combining the predictions of many large nets would help, but is expensive. The main idea of dropout is to randomly drop some units from the net during training.
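A minimal sketch of (inverted) dropout at training time, assuming a hypothetical keep probability of 0.5:

```python
import numpy as np

def dropout(activations, keep_prob=0.5, rng=None):
    """Inverted dropout: zero out each unit with probability 1 - keep_prob,
    and scale the survivors by 1/keep_prob so the expected value is unchanged."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

a = np.ones((2, 4))
out = dropout(a, keep_prob=0.5, rng=np.random.default_rng(0))
print(out)  # dropped units become 0.0, surviving units are scaled to 2.0
```

At test time no units are dropped; thanks to the 1/keep_prob scaling above, the network can be used as-is.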

Next, I will talk about the official TensorFlow tutorial and how to use a CNN to do the same job –> Go here!

More advanced: object detection. Let's explore more together!

**R-CNN (Regions with CNN features)**

Special thanks to my friend BaoChen Sun, for his patient suggestions 😀

http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/

http://web.engr.illinois.edu/~slazebni/spring14/lec24_cnn.pdf

http://blog.csdn.net/zouxy09/article/details/8782018 (Chinese)

[*] http://www.deeplearningbook.org/contents/convnets.html