Understanding Variational Graph Auto-Encoders

Variational Auto-Encoders

See my earlier post about Auto-encoders.
For Variational Auto-Encoders (VAE) (from the paper Auto-Encoding Variational Bayes), we add latent variables to the plain Autoencoder. The main idea is that we want the latent representation to follow a known distribution. Why do we want this? We wish the generative model to produce more “creative” things. If the model only memorizes the training samples, it will eventually lose the ability to “create” anything new! So we add some “noise” by forcing the latent variables to follow a known distribution.

[Figure: VAE architecture, where the encoder outputs \mu and \sigma and the latent code is sampled before decoding]

As shown in the picture above, the encoder part is more complex than in the original Autoencoder, while the decoder is kept the same. The goal is for the hidden representation Z to be sampled from a standard normal distribution, i.e., p(Z)=N(0,I). In practice, however, we model the per-input posterior as a Gaussian, q(Z|X)=N(\mu,\sigma^2 I). For each input X^{(i)}, we have:

\log q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)}) = \log \mathcal{N}\left(\mathbf{z}; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)}\mathbf{I}\right)

In fact, if we force p(Z|X)=N(0,I) for every input X, then

p(Z) = \sum_X p(Z|X)\,p(X) = \sum_X \mathcal{N}(0,I)\,p(X) = \mathcal{N}(0,I)\sum_X p(X) = \mathcal{N}(0,I)

When we have an input X^{(i)}, the encoder first generates two outputs, the mean vector \mu and the variance \sigma^2, through a function f (which can be a neural net; you can define it). We want to push the resulting N(\mu,\sigma^2) towards N(0,1), so we use the KL-divergence to measure the difference between the two distributions, which then becomes part of the loss function:

KL\left(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0,1)\right) = \frac{1}{2}\left(-\log\sigma^2 + \mu^2 + \sigma^2 - 1\right)
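To make this concrete, here is a minimal PyTorch-style sketch of the reparameterization trick and the closed-form KL term above. The function names are my own, and I assume the encoder outputs \log\sigma^2 rather than \sigma itself, which is a common choice for numerical stability.

import torch

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps with eps ~ N(0, I); written this way, gradients flow through mu and logvar.
    std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)
    return mu + std * eps

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * (-log sigma^2 + mu^2 + sigma^2 - 1), summed over dimensions.
    return 0.5 * torch.sum(-logvar + mu.pow(2) + logvar.exp() - 1.0)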

Variational Graph Auto-Encoders (VGAE)

See my post about Graph Convolutional Networks.
As you may have guessed from the title, the input will be a whole graph, and the output will be a reconstructed graph. Let us formulate the task: A is the adjacency matrix, X is the node feature matrix, and Z denotes the latent variables. Here we show the model that only reconstructs the adjacency matrix, which carries the information about the graph structure.

[Figure: VGAE architecture, a GCN encoder followed by an inner-product decoder]

Inference Step (or encoding). In the paper, the authors actually use different names rather than encoder and decoder. The first step is to do inference to get the latent variables Z. From the previous section, we know that the encoder generates the mean vector \mu and the standard-deviation vector \sigma through a function f. The main idea is that we simply define the function f to be a two-layer GCN, where:

\mu=GCN_{\mu}(X,A)
\sigma=GCN_{\sigma}(X,A)
GCN(X,A)=\tilde { A } ReLU(\tilde { A } XW_0)W_1

Similarly, for each node i we have:

q(Z_i|X,A) = N(Z_i|\mu_i,diag(\sigma_i^2))
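Here is a minimal sketch of this two-layer GCN encoder in PyTorch. I assume A_norm is a pre-computed, symmetrically normalized adjacency matrix \tilde{A} stored as a dense tensor, and, as in the paper, the first-layer weights W_0 are shared between GCN_\mu and GCN_\sigma; the class and variable names are my own.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNEncoder(nn.Module):
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)           # shared first layer
        self.W_mu = nn.Linear(hidden_dim, latent_dim, bias=False)     # second layer of GCN_mu
        self.W_sigma = nn.Linear(hidden_dim, latent_dim, bias=False)  # second layer of GCN_sigma

    def forward(self, X, A_norm):
        # First layer: ReLU( \tilde{A} X W_0 ), shared by the mu and sigma branches.
        H = F.relu(A_norm @ self.W0(X))
        # Second layers: per-node mean and log standard deviation of q(z_i | X, A).
        mu = A_norm @ self.W_mu(H)
        logstd = A_norm @ self.W_sigma(H)
        return mu, logstd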

Generative Model (or decoding). After getting Z, we reconstruct the adjacency matrix \hat{A}. For this part, we only use an inner product to reconstruct each element of \hat{A}:

p(\mathbf{A}|\mathbf{Z}) = \prod_{i=1}^{N}\prod_{j=1}^{N} p\left(A_{ij}|\mathbf{z}_i,\mathbf{z}_j\right)

\text{with}\quad p\left(A_{ij}=1|\mathbf{z}_i,\mathbf{z}_j\right) = \sigma\left(\mathbf{z}_i^{\top}\mathbf{z}_j\right)
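The decoder itself has no trainable parameters; continuing the PyTorch notation from the encoder sketch, it is a one-liner:

def inner_product_decoder(Z):
    # p(A_ij = 1 | z_i, z_j) = sigmoid(z_i^T z_j), computed for all pairs at once as sigmoid(Z Z^T).
    return torch.sigmoid(Z @ Z.t())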

Learning. The loss function consists of two parts: a reconstruction loss and a latent-variable restriction loss. The reconstruction loss measures whether the reconstructed adjacency matrix is similar to the input one; the other part applies the KL-divergence to measure how close the distribution of the latent variables is to a standard normal distribution. Together they form the variational lower bound:

\mathcal{L} = \mathbb{E}_{q(\mathbf{Z}|\mathbf{X},\mathbf{A})}\left[\log p(\mathbf{A}|\mathbf{Z})\right] - \mathrm{KL}\left[q(\mathbf{Z}|\mathbf{X},\mathbf{A})\,\|\,p(\mathbf{Z})\right]

Here we have p(\mathbf{Z}) = \prod_i p(\mathbf{z}_i) = \prod_i \mathcal{N}(\mathbf{z}_i|0,\mathbf{I}).
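Putting the pieces together, here is a hedged sketch of one training step that maximizes the lower bound above (i.e., minimizes its negative), reusing the GCNEncoder and inner_product_decoder defined earlier. How the two terms are weighted and how sparse adjacency matrices are handled vary between implementations; I keep the simplest version here.

def vgae_step(encoder, X, A_norm, A_label, optimizer):
    # A_label: the target adjacency matrix as a dense float tensor of 0s and 1s.
    optimizer.zero_grad()
    mu, logstd = encoder(X, A_norm)
    # Reparameterized sample z = mu + sigma * eps, with sigma = exp(logstd).
    Z = mu + torch.exp(logstd) * torch.randn_like(mu)
    A_hat = inner_product_decoder(Z)
    # -E_q[log p(A|Z)]: binary cross-entropy over all entries of the adjacency matrix.
    recon = F.binary_cross_entropy(A_hat, A_label)
    # KL[q(Z|X,A) || p(Z)], written in terms of logstd = log sigma and averaged over nodes.
    kl = -0.5 * torch.mean(
        torch.sum(1 + 2 * logstd - mu.pow(2) - torch.exp(2 * logstd), dim=1))
    loss = recon + kl
    loss.backward()
    optimizer.step()
    return loss.item()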

Graph Auto-Encoders (GAE)

For a simple GAE, we get rid of the distribution restriction and simply take a GCN as the encoder and an inner-product function as the decoder:

\hat{\mathbf{A}} = \sigma\left(\mathbf{Z}\mathbf{Z}^{\top}\right)
\mathbf{Z}=GCN(\mathbf{X},\mathbf{A})
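Continuing with the imports and shapes from the sketches above, a minimal non-variational GAE could look like this; training then minimizes only the reconstruction loss, with no KL term. Again, this is a sketch with names of my own choosing, not the reference implementation.

class GAE(nn.Module):
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W1 = nn.Linear(hidden_dim, latent_dim, bias=False)

    def forward(self, X, A_norm):
        # Z = GCN(X, A): deterministic node embeddings, no mean/variance split.
        Z = A_norm @ self.W1(F.relu(A_norm @ self.W0(X)))
        # A_hat = sigmoid(Z Z^T): reconstructed adjacency matrix.
        A_hat = torch.sigmoid(Z @ Z.t())
        return A_hat, Z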

References

https://zhuanlan.zhihu.com/p/34998569 (in Chinese)
Kingma and Welling, Auto-Encoding Variational Bayes: https://arxiv.org/abs/1312.6114
Kipf and Welling, Variational Graph Auto-Encoders: https://arxiv.org/pdf/1611.07308.pdf
