## Deep Learning 15: Unsupervised learning in DL? Try Autoencoder!

Unsupervised learning models also exist among multi-level (deep) learning methods, for example RBMs and autoencoders. In brief, an autoencoder tries to find a way to reconstruct its original inputs, that is, another way for the data to represent itself. It is also useful for dimensionality reduction: a 32 × 32 image, for instance, can be represented with far fewer parameters, and this is what it means to "encode" an image. Since the goal is to learn a new representation, autoencoders are also used for pre-training; a traditional machine learning model can then be applied depending on the task, a typical "two-stage" way of solving problems. Hinton's work [3] is a great study of this problem, demonstrating the "compressing" ability of neural networks and addressing the bottleneck of massive information.

### Typical Autoencoder Models

Suppose we have only input features, denoted by $x = \{x_1, x_2, \ldots, x_n\}$. Think about PCA: it gives us an easy way to project the data into a lower-dimensional representation. An autoencoder can achieve a similar goal, but it is not restricted to a lower dimension.
The figure, from Andrew Ng's lecture notes [1], shows a simple autoencoder model. It is a 3-layer fully-connected neural network with bias units. The model learns a function such that $h_{W,b}(x) \approx x$; demanding exact equality would make no sense, since the identity function learns nothing useful. Usually, we add some noise during training, which will be explained later in the post.

Usually, we can set a different number of units in the hidden layer: when it is smaller than that of the input layer, we get a compressed representation; when it is equal to or greater than that of the input layer, we still get a reconstruction, but we add a sparsity constraint.
Informally, the process L1 -> L2 is called "encoding" and L2 -> L3 is called "decoding". Think of sending a high-quality picture to a friend: you might compress it first to make it small enough for an email attachment (encoding); after receiving it, your friend decompresses and views the picture (decoding). The figure below, from [2], shows the model:

We focus on the hidden-layer representation $z$. Here the input is denoted by $x$ and the output by $x'$. The reconstruction error is $L = ||x-x'||^2$, which is minimized during training.
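As a concrete illustration, here is a minimal sketch (not from the post; sizes, learning rate, and data are all illustrative assumptions) of a single-hidden-layer linear autoencoder trained by plain gradient descent on the squared reconstruction error $L = ||x-x'||^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def recon_loss(X, W_enc, W_dec):
    """Mean squared reconstruction error ||x - x'||^2 over a batch."""
    X_rec = (X @ W_enc) @ W_dec
    return np.mean(np.sum((X - X_rec) ** 2, axis=1))

X = rng.normal(size=(200, 8))                         # toy data: 200 samples, 8-dim
W_enc = rng.normal(scale=0.1, size=(8, 3))            # encoder: x -> z (3-dim code)
W_dec = rng.normal(scale=0.1, size=(3, 8))            # decoder: z -> x'
lr = 0.01

loss_before = recon_loss(X, W_enc, W_dec)
for _ in range(500):
    Z = X @ W_enc                                     # encode
    err = Z @ W_dec - X                               # gradient of 0.5 * ||x - x'||^2
    W_dec -= lr * Z.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)
loss_after = recon_loss(X, W_enc, W_dec)
```

Because the hidden layer here has fewer units (3) than the input (8), the learned code is a compressed representation, much like PCA but found by gradient descent.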

### Autoencoder Variations

#### Denoising autoencoder

First proposed by Vincent et al. in *Extracting and Composing Robust Features with Denoising Autoencoders*. Denoising means we add noise: say the original input is 64-dimensional; we randomly set a few of its elements to zero.

##### Model

This approach [4] can be used to train autoencoders, and these denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning and information-theoretic perspective or from a generative-model perspective. Comparative experiments clearly show the surprising advantage of corrupting the input of autoencoders on a pattern classification benchmark suite.

“Our training procedure for the denoising autoencoder involves learning to recover a clean input from a corrupted version, a task known as denoising.” The denoising autoencoder is based on the idea of “unsupervised initialization by explicit fill-in-the-blanks training” of a deep learning model. Think about how our brain works: even if a small part of an image is covered, we can still recognize the object. In other words, given clean inputs, we manually create some noise or blanks and try to train an autoencoder with good robustness.

The model is shown in the above figure from [4]. Given a clean input $\textbf{x}$, we corrupt it by partial destruction: by means of the mapping $\widetilde{\textbf{x}} \sim q_D(\widetilde{\textbf{x}} \mid \textbf{x})$, we obtain a "noisy" version $\widetilde{\textbf{x}}$ of the original input. For example, some elements are simply set to zero, as if that information were blank. A hidden representation $\textbf{y}$ is then mapped from $\widetilde{\textbf{x}}$ through $f_\theta$. The next step is to reconstruct a clean input $\textbf{z} = g_{\theta'}(\textbf{y})$, known as denoising. We expect the reconstructed output $\textbf{z}$ to be close to the uncorrupted input $\textbf{x}$. The parameters are optimized to minimize the reconstruction cross-entropy, defined as: $L_{H}(\textbf{x},\textbf{z}) = H(B_{\textbf{x}} \| B_{\textbf{z}}) = -\sum_{k=1}^{d} \left[ x_{k} \log z_{k} + (1 - x_{k}) \log (1 - z_{k}) \right]$
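The corruption $q_D$ and the cross-entropy loss above can be sketched in a few lines (a minimal illustration, not the paper's code; the destruction fraction and the binary 64-dim input are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def corrupt(x, destroy_frac=0.3, rng=rng):
    """q_D: randomly force a fraction of the input components to zero."""
    mask = rng.random(x.shape) >= destroy_frac
    return x * mask

def cross_entropy(x, z, eps=1e-12):
    """L_H(x, z) = -sum_k [x_k log z_k + (1 - x_k) log(1 - z_k)]."""
    z = np.clip(z, eps, 1 - eps)          # avoid log(0)
    return -np.sum(x * np.log(z) + (1 - x) * np.log(1 - z))

x = rng.integers(0, 2, size=64).astype(float)   # clean 64-dim binary input
x_tilde = corrupt(x)                            # "noisy" version, some entries zeroed
```

The autoencoder sees only `x_tilde` but is scored against the clean `x`, which is exactly what forces it to fill in the blanks.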

Keeping the trained parameters of $f_\theta$ and applying it directly to the clean input $\textbf{x}$ yields a higher-level representation, which finishes the first layer of denoising autoencoding. Treating the new representation $\textbf{y}$ as the input of the next denoising autoencoder gives the second layer. Following the same procedure, the layers are stacked one by one; after a number of layers, the stacked denoising autoencoder has been trained in an unsupervised way. The figure below is from [4].
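The greedy layer-wise procedure can be sketched as follows (a toy illustration with assumed sizes and hyperparameters, using linear layers for brevity; the paper uses nonlinear mappings): train one denoising layer, keep only its encoder, feed the clean representation forward, and repeat.

```python
import numpy as np

rng = np.random.default_rng(2)

def train_dae_layer(X, n_hidden, destroy_frac=0.3, lr=0.01, steps=300):
    """Train one linear denoising layer; return the encoder weights f_theta."""
    n_in = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))
    for _ in range(steps):
        X_tilde = X * (rng.random(X.shape) >= destroy_frac)  # corrupt the input
        Z = X_tilde @ W_enc                                  # y = f_theta(x~)
        err = Z @ W_dec - X                                  # reconstruct the CLEAN x
        W_dec -= lr * Z.T @ err / len(X)
        W_enc -= lr * X_tilde.T @ (err @ W_dec.T) / len(X)
    return W_enc

X = rng.normal(size=(200, 16))     # toy dataset
encoders = []
H = X
for n_hidden in (8, 4):            # two stacked layers, sizes chosen for illustration
    W = train_dae_layer(H, n_hidden)
    encoders.append(W)
    H = H @ W                      # push the UNCORRUPTED input through trained f_theta
```

Note that corruption is used only while training each layer; once trained, the encoder always receives the clean representation from the layer below.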

In the second stage, the trained stacked dAE can be connected with a classifier, such as a linear Support Vector Machine (SVM); the choice of classifier depends on the classification task. During training, it is possible to fine-tune the stacked dAE parameters $f_\theta^{(i)}$ of each layer by gradient descent.

##### Comments

Unsupervised: the function of the stacked dAE is to learn representations shared across all domains. Since no labels are required during training, data from every domain can be helpful. Readers may also think of the Word2vec model, which is trained on free text and eventually yields embeddings as representations. In addition, only a single round of training is required for the stacked dAE, which makes it attractive for model reuse: the pre-trained stacked dAE can then be fine-tuned and combined with other models for any specific task.
Scalability: training the networks requires massive amounts of data from the domains, so scalability is always a consideration when measuring the task complexity. The stacked-dAE-based deep learning model has been shown to scale to an industrial-scale dataset of up to 22 domains of Amazon reviews [4].

#### Sparse autoencoder

Another variant is the sparse autoencoder, where we consider the sparsity of the hidden layer. Compared with the original autoencoder, we keep the loss but add one more penalty term (often the K-L divergence) during training. The penalty term works either by pushing the distribution of hidden-unit activations toward some low desired value, or by manually zeroing all but the few strongest hidden-unit activations (see [1, 5] for more).
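The K-L penalty from [1] can be sketched as follows (a minimal illustration; the target sparsity $\rho = 0.05$ and the random activations are assumed values, not from the post). It compares the average activation $\hat{\rho}_j$ of each hidden unit against a small target $\rho$:

```python
import numpy as np

def kl_sparsity_penalty(activations, rho=0.05):
    """sum_j KL(rho || rho_hat_j) over hidden units; activations must lie in (0, 1)."""
    rho_hat = np.clip(activations.mean(axis=0), 1e-8, 1 - 1e-8)  # avg activation per unit
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rng = np.random.default_rng(3)
A = rng.random((100, 20))          # hidden activations: 100 samples, 20 units
penalty = kl_sparsity_penalty(A)   # large when rho_hat is far from the target rho
```

The penalty is zero exactly when every unit's average activation equals $\rho$, so adding it (weighted) to the reconstruction loss drives most hidden units toward being inactive most of the time.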