Understanding Variational Graph Auto-Encoders

Variational Auto-Encoders

My post about Auto-encoder.
For Variational Auto-Encoders (VAE) (from paper Auto-Encoding Variational Bayes), we actually add latent variables to the existing Autoencoders. The main idea is, we want to restrict the parameters from a known distribution. Why we want this? We wish the generative model to provide more “creative” things. If the model only sees the trained samples, it will eventually lose the ability to “create” more! So we add some “noises” to the parameters by forcing the parameters to adapt to a known distribution.

vgae1

As shown in the above picture, in the encoder part, things are more complex than the original Autoencoders. We keep the decoder to be the same with that of Autoencoders. The goal is to let the hidden representation $Z_1$ to be sampled from a normal distribution, aka., $p(Z)=N(0,I)$ . However, in practice, we are able to make $p(Z|X)=N(\mu,\sigma^2)$ . For each input Xi, we have:

$\log q _ { \phi } ( \mathbf { z } | \mathbf { x } ^ { ( i ) } ) = \log \mathcal { N } \left( \mathbf { z } ; \boldsymbol { \mu } ^ { ( i ) } , \boldsymbol { \sigma } ^ { 2 ( i ) } \mathbf { I } \right)$

In fact, if we force $p(Z_i|X_i)=N(0|1)$ , thus

$p ( Z ) = \sum _ { X } p ( Z | X ) p ( X ) = \sum _ { X } \mathcal { N } ( 0 , I ) p ( X ) = \mathcal { N } ( 0 , I ) \sum _ { X } p ( X ) = \mathcal { N } ( 0 , I )$

When we have a input $X_1$ , the encoder first generate two outputs: the mean vector $\mu$ and the std vector $\sigma ^2$ through a function $f$ (can be a neural net, you can define it). We want to force the generated parameters to be $N(0|1)$ , so we use KL-divergence to caculate the difference of these two, then becomes a part of the loss function.

$K L \left( N \left( \mu , \sigma ^ { 2 } \right) \| N ( 0,1 ) \right) = \frac { 1 } { 2 } \left( - \log \sigma ^ { 2 } + \mu ^ { 2 } + \sigma ^ { 2 } - 1 \right)$

Variational Graph Auto-Encoders (VGAE)

The post about Graph Convolutional Networks.
As you may have guessed from the title, the input will be the whole graph, and the output will be a reconstructed graph. Let us formulate the task. The adjacency matrix is defined as A, and the node feature matrix is X. At the same time, Z is the hidden variables. Here we are going to show the model that only reconstruct the adjacency matrix which has the information about the graph structure.

Inference Step (or encoding) In the paper, you can see the author actually provided different names rather than encode and decode. The first step is to do inference to get latent variables Z. From the previous section, we know that the encoder will generate the mean vector $\mu$ and the std vector $\sigma ^2$ through a function $f$ . The main idea is we just define the function $f$ to be a two-layer GCN! Where we will have:

$\mu=GCN_{\mu}(X,A)$
$\sigma=GCN_{\sigma}(X,A)$
$GCN(X,A)=\tilde { A } ReLU(\tilde { A } XW_0)W_1$

Similarly, for each input i we want:

$q(Z_i|X,A) = N(Z_i|\mu_i,diag(\sigma_i^2))$

Generative Model (or decoding) After getting Z, we will reconstruct $\tilde { A }$ . For this part, we only use an inner product to reconstruct each element in matrix $\tilde { A }$

$p ( \mathbf { A } | \mathbf { Z } ) = \prod _ { i = 1 } ^ { N } \prod _ { j = 1 } ^ { N } p \left( A _ { i j } | \mathbf { z } _ { i } , \mathbf { z } _ { j } \right)$

$\text { with } \quad p \left( A _ { i j } = 1 | \mathbf { z } _ { i } , \mathbf { z } _ { j } \right) = \sigma \left( \mathbf { z } _ { i } ^ { \top } \mathbf { z } _ { j } \right)$

Learning The loss function during learning will be two parts: constructing loss and the latent variable restriction loss. Constructing loss is to see if the constructed adjacency matrix is similar to the input one; the other loss is to apply KL-divergence to measure how similar the distribution of the latent variable and a normal distribution.

$\mathcal { L } = \mathbb { E } _ { q ( \mathbf { Z } | \mathbf { X } , \mathbf { A } ) } [ \log p ( \mathbf { A } | \mathbf { Z } ) ] - \mathrm { KL } [ q ( \mathbf { Z } | \mathbf { X } , \mathbf { A } ) \| p ( \mathbf { Z } ) ]$

Here we have $p ( \mathbf { Z } ) = \prod _ { i } p \left( \mathbf { z } _ { \mathbf { i } } \right) = \prod _ { i } \mathcal { N } \left( \mathbf { z } _ { i } | 0 , \mathbf { I } \right)$ .

Graph Auto-Encoders (GAE)

For a simple GAE, we will get rid of the distribution restrictions, simply take a GCN as the encoder and an inner product function as the decoder:

$\hat { \mathbf { A } } = \sigma \left( \mathbf { Z Z } ^ { \top } \right)$
$\mathbf{Z}=GCN(\mathbf{X},\mathbf{A})$

References

https://zhuanlan.zhihu.com/p/34998569 (Chinese)
https://arxiv.org/abs/1312.6114
https://arxiv.org/pdf/1611.07308.pdf

3 thoughts on “Understanding Variational Graph Auto-Encoders”

thechainplayer says:

March 7, 2019 at 3:29 am

i am not a cs major student but I like your blog!

LikeLike

1. Irene says:
  
  March 7, 2019 at 5:41 pm
  
  thank you
  
  LikeLike
  
Pingback: Deep Learning 21: Diffusion Models (1) – IreneLi

Understanding Variational Graph Auto-Encoders

Variational Auto-Encoders

Variational Graph Auto-Encoders (VGAE)

Graph Auto-Encoders (GAE)

References

Published by Irene

3 thoughts on “Understanding Variational Graph Auto-Encoders”

Leave a reply to thechainplayer Cancel reply

Variational Auto-Encoders

Variational Graph Auto-Encoders (VGAE)

Graph Auto-Encoders (GAE)

References

Share this:

Related

Published by Irene

3 thoughts on “Understanding Variational Graph Auto-Encoders”

Leave a reply to thechainplayer Cancel reply