# Deep Learning 22: Diffusion Models (2)

Previously, we introduced Autoencoders and Hierarchical Variational Autoencoders (HVAEs). In this post, we will cover the details of Denoising Diffusion Probabilistic Models (DDPM).

#### Diffusion Models

We can treat a DDPM as a restricted HVAE in which each $x_t$ depends only on $x_{t-1}$. Unlike an HVAE, the DDPM forward process has no learnable parameters for adding noise: it is a predefined linear Gaussian model. This brings a computational convenience: we can obtain any arbitrary $x_t$ quickly.

As shown in the following image, a DDPM has two phases: 1) forward diffusion: noise is gradually added to an input image until, after $T$ steps, it becomes pure noise; 2) reverse process: we try to regenerate the original input image from the noised version at step $T$.

##### Forward Diffusion: $x_0\to x_T$

As in an HVAE, each step uses a linear Gaussian to add noise to the output of the previous step. The forward process from time step $0$ to $T$ is: $q(x_{1:T}|x_0)=\prod_{t=1}^{T} q(x_t|x_{t-1})$,

And for each $q(x_t|x_{t-1})$ we have: $q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1}, \beta_t\mathbf{I})$.

Note that $\sqrt{1-\beta_t}\,x_{t-1}$ is the mean and $\beta_t$ is the variance at step $t$, with $\beta_t$ ranging from 0 to 1. Let’s first look at this variance value $\beta_t$. Because at time step $T$ we should have exactly $q(x_T|x_0)\approx\mathcal{N}(\mathbf{0},\mathbf{I})$, one common choice is a schedule that starts with a small value and increases it: $\beta_1<\beta_2<\dots<\beta_T$. The mean coefficient $\sqrt{1-\beta_t}$ then follows the reversed trend.
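To make the schedule concrete, here is a minimal sketch of one forward step in PyTorch. The linear schedule and its endpoints ($10^{-4}$ to $0.02$ over $T=1000$ steps) are common DDPM defaults, used here purely for illustration:

```python
import torch

T = 1000
# Linearly increasing variance schedule: beta_1 < beta_2 < ... < beta_T.
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).

    Note: t is 0-based here, so betas[t] corresponds to beta_{t+1} in the text.
    """
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise
```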

As I mentioned before, this definition makes it possible to sample $x_t$ at any arbitrary forward step directly from $x_0$, because a linear combination of independent Gaussians is still Gaussian: $q(x_t|x_0)=\mathcal{N}(x_t;\sqrt{\bar{\alpha}_t}x_0,(1-\bar{\alpha}_t)\mathbf{I})$,

where $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{s=1}^t\alpha_s$.
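Reusing the `betas` defined above, this closed form lets us jump from $x_0$ to any $x_t$ in one shot:

```python
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = alpha_1 * ... * alpha_t

def sample_xt(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I) directly."""
    eps = torch.randn_like(x0)
    return torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * eps
```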

##### Reverse Process: $x_T\to x_0$

While the forward diffusion is unparameterized, the reverse process is parameterized by $\theta$ (omitted in the image). We define: $p_\theta(x_{0:T})=p(x_T)\prod_{t=1}^Tp_\theta(x_{t-1}|x_t)$.

Since $x_T$ is pure noise, $p(x_T)=\mathcal{N}(x_T; \mathbf{0},\mathbf{I})$, this term has no parameters $\theta$.

The issue is how to define the objective function $\log{p_\theta(x_0)}$. Computing it exactly would require marginalizing over all possible trajectories $x_{1:T}$, which is intractable, so as with VAEs we maximize an evidence lower bound (ELBO) instead: $\log{p_\theta(x_0)}\geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log p_\theta(x_0|x_{1:T})\right]-D_{\mathrm{KL}}\left(q(x_{1:T}|x_0)\parallel p_\theta(x_{1:T})\right)$.
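Following the derivation in the Calvin Luo reference below, this bound can be rearranged into three kinds of terms, which is the form we actually work with:

$$
\log p_\theta(x_0)\geq \underbrace{\mathbb{E}_{q(x_1|x_0)}\big[\log p_\theta(x_0|x_1)\big]}_{\text{reconstruction}}-\underbrace{D_{\mathrm{KL}}\big(q(x_T|x_0)\parallel p(x_T)\big)}_{\text{prior matching}}-\sum_{t=2}^{T}\underbrace{\mathbb{E}_{q(x_t|x_0)}\big[D_{\mathrm{KL}}\big(q(x_{t-1}|x_t,x_0)\parallel p_\theta(x_{t-1}|x_t)\big)\big]}_{\text{consistency (denoising matching)}}
$$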

Similar to VAEs, the first term is a reconstruction loss and the sum gives the consistency (denoising matching) losses. The prior matching term at step $T$ contains no learnable parameters (both distributions are known), so we simply ignore it. Since the reverse steps are also Gaussians, we have: $p_\theta\left(x_{t-1} \mid x_t\right)=\mathcal{N}\left(x_{t-1} ; \mu_\theta\left(x_t, t\right), \Sigma_\theta\left(x_t, t\right)\right)$.
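Once the reverse model is trained, generation amounts to ancestral sampling through this chain. Below is a minimal sketch that reuses `T` and `betas` from above, keeps the learned mean as an abstract function `mu_theta`, and fixes the variance to $\beta_t\mathbf{I}$ (one common choice; see the next paragraph):

```python
@torch.no_grad()
def generate(mu_theta, shape):
    """Ancestral sampling: draw x_T ~ N(0, I), then step back through p_theta."""
    x = torch.randn(shape)  # x_T is pure noise
    for t in reversed(range(T)):
        mean = mu_theta(x, t)  # learned mean mu_theta(x_t, t)
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        else:
            x = mean  # no noise is added at the final step
    return x  # approximately a sample of x_0
```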

Unlike in a VAE, we learn the mean $\mu_\theta$ but fix the variance $\Sigma_\theta$. After reparameterization, the loss reduces to predicting the noise $\epsilon$ (here $w_t$ is a weight at step $t$): $x_t=\sqrt{\bar{\alpha}_t}\, x_0+\sqrt{1-\bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I})$, and $\text{loss} =\mathbb{E}_{x_0, \epsilon, t}\left[w_t\left\|\epsilon-\epsilon_\theta\left(x_t, t\right)\right\|^2\right]$.
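To close the loop, here is a sketch of one training step under the common simplification $w_t=1$. The placeholder `model` stands for any noise-prediction network $\epsilon_\theta(x_t, t)$ (typically a U-Net, whose architecture we leave out); it reuses `alpha_bars` from above:

```python
import torch.nn.functional as F

def ddpm_loss(model, x0):
    """One Monte Carlo estimate of the simplified (w_t = 1) noise-prediction loss."""
    t = torch.randint(0, T, (x0.shape[0],))               # random step per sample
    eps = torch.randn_like(x0)
    abar = alpha_bars[t].view(-1, 1, 1, 1).to(x0.device)  # broadcast over (B, C, H, W)
    x_t = torch.sqrt(abar) * x0 + torch.sqrt(1.0 - abar) * eps
    return F.mse_loss(model(x_t, t.to(x0.device)), eps)   # ||eps - eps_theta(x_t, t)||^2
```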

There are a few small tricks behind this simplification; we will include the details in a slide file, so stay tuned!

#### References

Calvin Luo, *Understanding Diffusion Models: A Unified Perspective*