Diffusion models belong to the family of generative models, whose goal is to learn the true data distribution given some data samples $x$. In this blog, we focus on likelihood-based models.
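Let’s first talk about variational autoencoders (VAEs). The original data is denoted $x$, and the latent representation (variable) is denoted $z$. Together they are modeled by the joint probability $p(x, z)$. Likelihood-based generative models aim to learn a model that maximizes the likelihood $p(x)$. Doing this directly is difficult, since it requires marginalizing over $z$, so one way forward is to look at a lower bound on $\log p(x)$:

$$
\log p(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right]
$$

Here $q_\phi(z|x)$ is a variational distribution that approximates the true posterior $p(z|x)$.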
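As a sketch of why a lower bound on $\log p(x)$ is a sensible target, the derivation below takes any distribution $q_\phi(z|x)$ over the latent variable (the parameters $\phi$ are notation assumed here) and applies the chain rule $p(x, z) = p(z|x)\,p(x)$:

$$
\begin{aligned}
\log p(x) &= \mathbb{E}_{q_\phi(z|x)}\left[\log p(x)\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{p(z|x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right] + D_{\mathrm{KL}}\left(q_\phi(z|x)\,\|\,p(z|x)\right) \\
&\ge \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(x, z)}{q_\phi(z|x)}\right]
\end{aligned}
$$

The last step uses the fact that a KL divergence is never negative. The remaining expectation is usually called the evidence lower bound (ELBO). Since $\log p(x)$ does not depend on $\phi$, raising the ELBO with respect to $\phi$ shrinks the gap between $q_\phi(z|x)$ and the true posterior $p(z|x)$.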
The following figure shows the relationship between the observed data and the latent variables. We treat $q_\phi(z|x)$ as the encoder and $p_\theta(x|z)$ as the decoder.
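We let the latent variables follow a normal distribution, and we measure how similar the generated data $\hat{x}$ is to the true data $x$. The loss (the negative of the lower bound on $\log p(x)$) is defined as:

$$
\mathcal{L}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + D_{\mathrm{KL}}\left(q_\phi(z|x)\,\|\,p(z)\right)
$$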
The first term is the reconstruction loss between the generated data and the true data, and the second term regularizes the variational distribution by pulling it toward the prior. Remember that the prior is a standard normal, $p(z) = \mathcal{N}(z; 0, I)$.
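The two terms can be made concrete with a small numerical sketch. It rests on assumptions that are not stated in this post: the encoder outputs a diagonal Gaussian described by `mu` and `log_var`, the prior is a standard normal, and the decoder is a unit-variance Gaussian, so the reconstruction term reduces to half the squared error up to a constant. The function names and `toy_decoder` are hypothetical and chosen only for illustration.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def negative_elbo(x, x_hat, mu, log_var):
    """Single-sample estimate of the loss for one data point.

    Assumes a unit-variance Gaussian decoder, so -log p(x|z) equals
    half the squared error plus a constant, which is dropped here.
    """
    reconstruction = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_to_standard_normal(mu, log_var)


def toy_decoder(z):
    """Hypothetical stand-in for a learned decoder: a fixed linear map."""
    return [2.0 * v for v in z]


rng = random.Random(0)
x = [0.5, -1.0]                      # one observed data point
mu, log_var = [0.2, -0.4], [0.0, 0.0]  # pretend encoder output for x
z = reparameterize(mu, log_var, rng)
x_hat = toy_decoder(z)
print(negative_elbo(x, x_hat, mu, log_var))
```

In a real model, `mu`, `log_var`, and the decoder would come from neural networks trained by gradient descent on this loss. The reparameterization step is what lets gradients flow through the sampling of $z$.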
Hierarchical Variational Autoencoders
This process can be extended to multiple steps, or “layers.” If we sample latent variables over several steps rather than only once, we end up with the Hierarchical Variational Autoencoder (HVAE).
The major difference is that in an HVAE, we want the latent variables at the last time step to form a normal distribution, $p(z_T) = \mathcal{N}(z_T; 0, I)$, as highlighted in the following image.
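To illustrate what a normal distribution at the last step buys us, here is a minimal sketch of generation in a hierarchy: start from a standard normal sample at the top level and step down toward the data. Everything specific in it is an assumption for illustration. `toy_decoder_step` is a hypothetical fixed linear-Gaussian transition standing in for a learned one, and every level is given the same dimension, which a real HVAE does not require.

```python
import random


def sample_standard_normal(dim, rng):
    """Draw the top-level latent z_T ~ N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_decoder_step(z_next, weight, noise_std, rng):
    """Hypothetical transition N(weight * z_next, noise_std^2 I)."""
    return [weight * v + noise_std * rng.gauss(0.0, 1.0) for v in z_next]


def ancestral_sample(num_levels, dim, rng, weight=0.9, noise_std=0.1):
    """Sample z_T from the prior, then walk down the hierarchy.

    With num_levels = T, the loop produces z_{T-1}, ..., z_1 and
    finally a sample that plays the role of the data x.
    """
    z = sample_standard_normal(dim, rng)
    trajectory = [z]
    for _ in range(num_levels):
        z = toy_decoder_step(z, weight, noise_std, rng)
        trajectory.append(z)
    return trajectory


trajectory = ancestral_sample(num_levels=3, dim=4, rng=random.Random(0))
print(len(trajectory))   # T + 1 entries: z_T, ..., z_1, x
print(trajectory[-1])    # the generated sample
```

Because the top-level distribution is fixed and easy to sample from, generation needs only the decoder side of the model. The encoder is used during training.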
The loss function is more complex, as it has three parts: 1) a reconstruction loss, similar to the first term of the VAE loss; 2) a prior matching loss, similar to the second term of the VAE loss, which involves the latent variables at the last step; and 3) a consistency term, which constrains the intermediate steps, highlighted in the following figure (when ):
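For reference, the three parts can be written out explicitly. The following is a sketch under assumptions not spelled out above: a Markovian HVAE with $T$ latent levels $z_1, \dots, z_T$, encoder steps $q_\phi(z_t|z_{t-1})$, decoder steps $p_\theta(z_t|z_{t+1})$, and the convention $z_0 = x$. The loss is the negative of this lower bound:

$$
\begin{aligned}
\log p(x) \ge{}& \underbrace{\mathbb{E}_{q_\phi(z_1|x)}\left[\log p_\theta(x|z_1)\right]}_{\text{reconstruction}} \\
&- \underbrace{\mathbb{E}_{q_\phi(z_{T-1}|x)}\left[D_{\mathrm{KL}}\left(q_\phi(z_T|z_{T-1})\,\|\,p(z_T)\right)\right]}_{\text{prior matching}} \\
&+ \underbrace{\sum_{t=1}^{T-1}\mathbb{E}_{q_\phi(z_{t-1}, z_t, z_{t+1}|x)}\left[\log \frac{p_\theta(z_t|z_{t+1})}{q_\phi(z_t|z_{t-1})}\right]}_{\text{consistency}}
\end{aligned}
$$

The consistency term rewards a decoder step into $z_t$ from above that agrees with the encoder step into $z_t$ from below, and it is often presented as a KL divergence between the two. For $T = 1$ the sum is empty and the bound reduces to the two-term VAE objective.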
In later posts, we will cover Denoising Diffusion Probabilistic Models (DDPM) and recent applications.
 Understanding Diffusion Models: A Unified Perspective (Calvin Luo)
 Denoising Diffusion-based Generative Modeling: Foundations and Applications (CVPR 2022 Tutorial)