Deep Learning 22: Diffusion Models (2)

Previously, we introduced Autoencoders and Hierarchical Variational Autoencoders (HVAEs). In this post, we will cover the details of Denoising Diffusion Probabilistic Models (DDPM).

Diffusion Models

We can treat DDPM as a restricted HVAE in which each x_t depends only on x_{t-1}. In DDPM, the noising process has no learnable parameters: it is a predefined linear Gaussian model. As we will see, this brings a computational convenience: we can obtain any arbitrary x_t quickly.

As shown in the following image, DDPM has two phases: 1) forward diffusion: noise is gradually added to an input image, and after T steps the image becomes pure noise; 2) reverse process: we try to recover the original input image from the fully noised version at step T.

Illustration of VDM, based on [2].
Forward Diffusion: x_0\to x_T

As in an HVAE, each step uses a linear Gaussian to add noise to the output of the previous step. So the forward process from time step 0 to T is:

q(x_{1:T}|x_0)=\prod_{t=1}^{T} q(x_t|x_{t-1}),

And for each q(x_t|x_{t-1}) we have:

q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1}, \beta_t\mathbf{I}).

Note that \sqrt{1-\beta_t}x_{t-1} is the mean and \beta_t is the variance at step t, ranging between 0 and 1. Let’s first look at this variance value \beta_t. Because at time step T we should have exactly q(x_T|x_0)\approx\mathcal{N}(\mathbf{0},\mathbf{I}), one way is to start with a small value and increase it: \beta_1<\beta_2<...<\beta_T. The scaling factor \sqrt{1-\beta_t} on the mean then follows the reverse trend.
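To make this concrete, here is a minimal sketch of one forward step in PyTorch, assuming the linear schedule with T = 1000 and \beta ranging from 1e-4 to 0.02 used in the original DDPM paper (the schedule values and the helper name forward_step are just illustrative choices):

import torch

# Linear variance schedule: beta_1 < beta_2 < ... < beta_T.
# T = 1000 and the range 1e-4 to 0.02 follow the original DDPM setup.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    # Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).
    beta_t = betas[t - 1]               # t is 1-indexed, as in the text
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise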

As I mentioned before, such a definition makes it possible to obtain a sample x_t at any arbitrary forward step directly from x_0, because the sum of independent Gaussians is still a Gaussian:

q(x_t|x_0)=\mathcal{N}(x_t;\sqrt{\bar{\alpha}_t}x_0,(1-\bar{\alpha}_t)\mathbf{I}),

where \alpha_t=1-\beta_t, and \bar{\alpha}_t=\prod_{s=1}^t\alpha_s.
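This closed form means we can jump to any step t in one shot instead of looping through all previous steps. A minimal sketch, reusing the same (assumed) linear schedule as above:

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas                        # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)   # alpha_bar_t = prod_{s=1}^{t} alpha_s

def q_sample(x0, t):
    # Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).
    a_bar = alpha_bars[t - 1]
    noise = torch.randn_like(x0)
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise

x0 = torch.randn(1, 3, 32, 32)   # a dummy "image"
xT = q_sample(x0, T)             # alpha_bar_T is close to 0, so x_T is nearly pure noise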

Reverse Process: x_T\to x_0

While the forward diffusion is unparameterized, the reverse process is parameterized with \theta (omitted in the image above). So we define the following:

p_\theta(x_{0:T})=p(x_T)\prod_{t=1}^Tp_\theta(x_{t-1}|x_t).

Since x_T is pure noise, p(x_T)=\mathcal{N}(x_T; \mathbf{0},\mathbf{I}) is fixed and has no parameters \theta.
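Structurally, generation is just ancestral sampling through this chain. Below is a sketch under two assumptions (neither is fixed by the text): a trained network model(x, t) that outputs the mean \mu_\theta(x_t, t), and a fixed per-step standard deviation sigmas[t-1].

import torch

@torch.no_grad()
def ancestral_sample(model, sigmas, T, shape):
    # x_T ~ p(x_T) = N(0, I): a fixed prior with no parameters theta.
    x = torch.randn(shape)
    for t in range(T, 0, -1):
        mean = model(x, t)                  # mu_theta(x_t, t)
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + sigmas[t - 1] * noise    # x_{t-1} ~ N(mu_theta, sigma_t^2 I)
    return x                                # the generated sample x_0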

The issue is how to optimize the objective \log{p_\theta(x_0)}: marginalizing over all possible trajectories x_{1:T} is intractable. As with VAEs, we instead maximize an evidence lower bound (ELBO):

\log{p_\theta(x_0)}\geq \mathbb{E}_{q(x_{1:T}|x_0)}[\log p_\theta(x_0|x_{1:T})]-D_{KL}(q(x_{1:T}|x_0)\parallel p_\theta(x_{1:T})).
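For reference, [2] (and the DDPM paper) rewrite this bound into per-step terms; written as a loss (the negative bound), the decomposition is roughly:

\mathcal{L}=\mathbb{E}_q\Big[\underbrace{-\log p_\theta(x_0|x_1)}_{\text{reconstruction}}+\underbrace{D_{KL}(q(x_T|x_0)\parallel p(x_T))}_{\text{prior matching at }T}+\sum_{t=2}^T\underbrace{D_{KL}(q(x_{t-1}|x_t,x_0)\parallel p_\theta(x_{t-1}|x_t))}_{\text{consistency}}\Big].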

Similar to VAEs, the first term is a reconstruction term, while the KL term further splits into a prior matching term at step T and per-step consistency terms (see the decomposition above). The term at step T is eliminated: p(x_T) is a known fixed distribution with no learnable parameters, so we simply ignore it. Since the reverse steps are also Gaussians, we have:

p_\theta\left(x_{t-1} \mid x_t\right)=\mathcal{N}\left(x_{t-1} ; \mu_\theta\left(x_t, t\right), \Sigma_\theta\left(x_t, t\right)\right).

Unlike in VAEs, we learn the mean \mu_\theta but fix the variance \Sigma_\theta. After reparameterization, the loss turns into predicting the noise \epsilon (w_t is a weight at step t):

x_t=\sqrt{\bar{\alpha}_t} x_0+\sqrt{1-\bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0,\mathbf{I}),

\text{loss} =\mathbb{E}_{x_0, \epsilon, t}\left[w_t\left\|\epsilon-\epsilon_\theta\left(x_t, t\right)\right\|^2\right].
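A minimal PyTorch-style sketch of this objective, assuming image-shaped inputs x_0, a noise-prediction network eps_model(x_t, t) (an assumed interface), and the common simplification w_t = 1:

import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(eps_model, x0):
    # Noise-prediction loss with w_t = 1: || eps - eps_theta(x_t, t) ||^2.
    b = x0.shape[0]
    t = torch.randint(1, T + 1, (b,))            # uniform t in {1, ..., T}
    a_bar = alpha_bars[t - 1].view(b, 1, 1, 1)   # assumes x0 has shape (B, C, H, W)
    eps = torch.randn_like(x0)                   # eps ~ N(0, I)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return F.mse_loss(eps_model(x_t, t), eps)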

There are some small tricks behind this simplification, and we will include more details in a slide file. Stay tuned!

References

[1] https://youtu.be/fbLgFrlTnGU

[2] Understanding Diffusion Models: A Unified Perspective (Calvin Luo)

 

