Deep Learning 21: Diffusion Models (1)

Background: Autoencoders

Diffusion models belong to the family of generative models, whose goal is to learn the true data distribution p(x) given some data samples $x$. In this blog, we focus on the autoencoder background behind diffusion models.

Let’s first talk about variational autoencoders (VAEs). The original data is denoted x, and the latent representation (variable) is denoted z. What we actually observe is only the data x; the model defines the joint probability p(x,z). Likelihood-based generative models aim to learn a model that maximizes the likelihood p(x). Since it is difficult to do this directly, one way is to use the identity: p(x)=\frac{p(x,z)}{p(z|x)}.
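This identity can be turned into a tractable objective. Taking the log and an expectation over q(z|x), a standard derivation (as in [1]) gives:

\log p(x)=\mathbf{E}_{q(z|x)}\left[\log\frac{p(x,z)}{q(z|x)}\right]+D_{KL}(q(z|x)\parallel p(z|x))\geq\mathbf{E}_{q(z|x)}\left[\log\frac{p(x,z)}{q(z|x)}\right]

Since the KL divergence is non-negative, the expectation on the right is a lower bound on \log p(x), known as the evidence lower bound (ELBO).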

In the following figure, we show the relationship between the true data and the latent variables. We treat q(z|x) as the encoder and p(x|z) as the decoder.


Autoencoder: based on [1].

We let the latent variable z follow a standard normal distribution, and we measure how similar the generated data x' is to the true data x. The objective is to maximize the evidence lower bound (ELBO):

ELBO=\mathbf{E}_{q(z|x)}[\log p(x|z)]-D_{KL}(q(z|x)\parallel p(z))

The first term is the reconstruction loss between the generated data and the true data, and the second term regularizes the variational distribution toward the prior. Remember that p(z)=\mathcal{N}(\mathbf{0},\mathbf{I}).
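The two terms above can be computed in closed form under common assumptions. Here is a minimal sketch, assuming a Gaussian encoder q(z|x)=\mathcal{N}(\mu,\mathrm{diag}(\sigma^2)) and a Gaussian likelihood p(x|z), so the reconstruction term reduces to a mean-squared error (the function names are illustrative, not from [1]):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def vae_loss(x, x_recon, mu, log_var):
    """Negative ELBO: reconstruction error plus KL regularizer."""
    recon = np.sum((x - x_recon) ** 2)  # -E_q[log p(x|z)] up to constants
    return recon + kl_to_standard_normal(mu, log_var)

x = np.array([0.5, -0.2])
# With a perfect reconstruction and q(z|x) = N(0, I), the loss is zero.
print(vae_loss(x, x, np.zeros(2), np.zeros(2)))  # → 0.0
```

In practice the reconstruction error is averaged over samples of z drawn from the encoder; the single-sample version above keeps the sketch short.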

Hierarchical Variational Autoencoders

It is possible to extend this process into multiple steps, or “layers.” If we sample latent variables over T steps instead of just once, we end up with the Hierarchical Variational Autoencoder (HVAE).

The major difference is that in the HVAE, we want the latent variables at the last time step to follow a standard normal distribution: p(z_{T})=\mathcal{N}(\mathbf{0},\mathbf{I}), as highlighted in the following image.

Hierarchical Autoencoder: based on [1].

The objective is more complex, as it comes from three parts: 1) the reconstruction term, similar to the first term of the VAE; 2) the prior matching term, similar to the second term of the VAE, which involves the latent variables at the last step; and 3) the consistency term, which constrains the intermediate steps, highlighted in the following figure (when t=1):

ELBO=\mathbf{E}_{q(z_1|x)}[\log p(x|z_1)]-\mathbf{E}_{q(z_{T-1}|x)}[D_{KL}(q(z_{T}|z_{T-1})\parallel p(z_T))]-\\ \Sigma_{t=1}^{T-1}\mathbf{E}_{q(z_{t-1},z_{t+1}|x)}[D_{KL}(q(z_{t}|z_{t-1})\parallel p(z_t|z_{t+1}))]

Consistency term when t=1: based on [1].
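Each summand of the consistency term is a KL divergence between two step-wise distributions. As a hypothetical sketch, assume both q(z_t|z_{t-1}) and p(z_t|z_{t+1}) are diagonal Gaussians, so the divergence has a closed form:

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form D_KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Identical distributions give zero divergence; any mismatch is penalized.
mu = np.array([0.1, -0.3])
var = np.array([1.0, 2.0])
print(gaussian_kl(mu, var, mu, var))  # → 0.0
```

During training, the means and variances of both Gaussians come from the encoder and decoder networks, and the expectation over q(z_{t-1},z_{t+1}|x) is approximated by sampling.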

In later posts, we will cover Denoising Diffusion Probabilistic Models (DDPM), and recent applications.

More readings:

My previous blog on auto-encoders, and variational graph auto-encoders.


[1] Understanding Diffusion Models: A Unified Perspective (Calvin Luo)
[2] Denoising Diffusion-based Generative Modeling: Foundations and Applications (CVPR 2022 Tutorial)

Published by Irene
