What are Diffusion Models?

Diffusion models are a type of generative AI that create data like images or audio by starting from random noise and gradually refining it into meaningful output. They learn this process by adding noise to real data during training and then reversing it step by step. This allows them to generate realistic and high-quality samples from scratch.

In short, the approach rests on four ideas:

During training, noise is gradually added to real data so the model learns how data degrades.
The model is trained to reverse this process by removing noise step-by-step.
This helps it learn complex data patterns and generate high-quality outputs.
At inference, it starts with random noise and gradually denoises it to produce data like images, audio or text.

Key Components

Forward Diffusion Process (Noise Addition): A fixed, predefined process that gradually adds Gaussian noise to data over many timesteps, converting structured data into nearly pure noise.
Reverse Diffusion Process (Denoising): A learnable process where a neural network (often a U-Net) predicts and removes noise step by step to reconstruct realistic data from noise.
Score Function (Noise Estimation): Estimates the gradient of the log-density of the noisy data (or, equivalently up to scaling, predicts the added noise), guiding each denoising step toward more realistic samples.
Time-Step Conditioning: The model is conditioned on the current timestep, allowing it to understand how much noise is present and how to remove it effectively at each stage.
Sampling Strategy: New data is generated by starting from random noise and iteratively applying the learned reverse steps, often using techniques to improve speed and quality.

Architecture of Diffusion Models

Diffusion models are built on a two-stage probabilistic framework that transforms data into noise and then learns to reverse this process to generate new samples.

1. Forward Diffusion Process (Noise Addition)

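The forward process is a fixed (non-learnable) Markov chain that gradually adds Gaussian noise to the data over multiple timesteps until it becomes nearly pure noise. Because every step adds Gaussian noise, the noisy sample at any timestep t can be written directly in terms of the original data x_0 and a single noise draw \epsilon \sim \mathcal{N}(0, I):

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon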

where

\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i

The process follows the Markov property, meaning each step depends only on the previous step. As the number of steps T increases, the data distribution converges to pure Gaussian noise.

\beta_{t} controls how much noise is added at each step and typically increases gradually (linear or cosine schedule), with \alpha_{t} = 1 - \beta_{t}.
Early steps retain most structure, while later steps destroy it completely.
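To make the two schedule families concrete, the sketch below builds a linear and a cosine schedule in plain Python and prints how much signal (\bar{\alpha}_t) remains at a few timesteps. The cosine form, with offset s = 0.008 and betas clipped at 0.999, follows the commonly used improved-DDPM formulation. The function names and T = 1000 are illustrative assumptions and not part of this article's implementation.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (endpoints are assumed values)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Betas derived from a cosine-shaped alpha_bar curve."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped for numerical safety
    return [min(1 - f(t) / f(t - 1), max_beta) for t in range(1, T + 1)]

def alpha_bars(betas):
    """Cumulative product of (1 - beta_i)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

T = 1000
lin = alpha_bars(linear_betas(T))
cos_ = alpha_bars(cosine_betas(T))
for t in (1, T // 2, T):
    print(f"t={t:4d}  linear={lin[t - 1]:.4f}  cosine={cos_[t - 1]:.4f}")
```

Under these assumptions the cosine schedule keeps more of the signal through the middle of the chain, which is the usual motivation for preferring it on small images.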

2. Reverse Diffusion Process (Denoising)

The reverse process is the core learning component, where a neural network learns to remove noise step-by-step. It models the reverse transition:

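p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

where

\mu_\theta and \Sigma_\theta are the mean and covariance produced by the network. In simpler variants the covariance is fixed to a schedule-dependent constant and only the mean is learned.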

Instead of directly predicting x_{t-1}, the model typically predicts the noise \epsilon added at each step. This simplifies training and leads to better stability.
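Because the closed-form forward equation is linear in x_0 and \epsilon, knowing the noise is equivalent to knowing the clean sample. The following sketch uses made-up scalar values to show the inversion; `add_noise` and `predict_x0` are hypothetical helper names.

```python
import math

def add_noise(x0, eps, alpha_bar_t):
    """Closed-form forward step: x_t from x_0 and noise eps."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1 - alpha_bar_t) * eps

def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Invert the forward equation given an estimate of the noise."""
    return (x_t - math.sqrt(1 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_bar_t)

x0, eps, alpha_bar_t = 0.8, -1.3, 0.25  # illustrative values
x_t = add_noise(x0, eps, alpha_bar_t)

exact = predict_x0(x_t, eps, alpha_bar_t)        # true noise recovers x0 up to rounding
shifted = predict_x0(x_t, eps + 0.1, alpha_bar_t)  # a noise error shifts the estimate
print(exact, shifted)
```

The smaller \bar{\alpha}_t is, the more a given error in the predicted noise is amplified in the x_0 estimate, which is one reason generation proceeds in many small steps.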

How Diffusion Models Work

Diffusion models generate realistic data by learning how to gradually convert random noise into structured data through a sequence of small, manageable denoising steps.

1. Forward Process (Diffusion / Noise Addition)

The forward process is the first stage where clean data is slowly corrupted by adding Gaussian noise over many steps.

In the early steps only a small amount of noise is added; in the middle steps the structure starts to break down; in the final steps the data becomes almost entirely random noise.
After enough steps (T), the data loses all its structure and becomes indistinguishable from pure Gaussian noise.
The forward diffusion process is fixed and does not involve any learning.
Each step depends only on the previous one (Markov property), and the noise level is controlled by a gradually increasing variance schedule.

Starting with the original data x_0, noise is added step by step:

x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon

where \beta_{t} controls how much noise is added at step t and \epsilon \sim \mathcal{N}(0, I) is random Gaussian noise.
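Applying this one-step update t times is equivalent to the closed-form expression in terms of \bar{\alpha}_t given earlier. The small sketch below tracks only the coefficient on x_0 and the variance of the accumulated noise for a scalar, and checks the equivalence numerically. The schedule values are assumed for illustration.

```python
import math

T = 200
# Assumed linear schedule from 1e-4 to 0.02
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]

signal_coef = 1.0  # coefficient multiplying x_0 after t steps
noise_var = 0.0    # variance of the accumulated Gaussian noise
alpha_bar = 1.0    # running product of (1 - beta_i)

for beta in betas:
    # x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps
    signal_coef *= math.sqrt(1 - beta)
    noise_var = (1 - beta) * noise_var + beta
    alpha_bar *= 1 - beta

print(signal_coef, math.sqrt(alpha_bar))  # agree up to rounding
print(noise_var, 1 - alpha_bar)           # agree up to rounding
```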

2. Reverse Process (Learning to Denoise)

In this step, the model learns to reverse the noise added during the forward process by gradually denoising the data to recover meaningful structure.

Predicting noise is simpler and more stable since noise follows a known Gaussian distribution, allowing the model to learn effectively.
It enables the model to focus on small corrections at each step instead of generating data in one complex step.
This step-by-step denoising approach makes training more stable and improves the quality of generated outputs.

The reverse process is modeled as:

p_{\theta}(x_{t-1} \mid x_t)

The denoising step can be written as:

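x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z

where \epsilon_{\theta}(x_{t},t) is the neural network that predicts the noise present in x_{t}, \alpha_t = 1 - \beta_t, z \sim \mathcal{N}(0, I) is fresh Gaussian noise (set to zero at the final step) and \sigma_t^2 = \beta_t is a common choice for the step variance.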
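The noise-based mean of the denoising step is algebraically the same as the mean of the forward-process posterior q(x_{t-1} \mid x_t, x_0) once x_0 is expressed through x_t and the noise. The sketch below checks this numerically for one scalar example. The helper names and numeric values are illustrative assumptions.

```python
import math

def posterior_mean_x0_form(x0, x_t, beta_t, alpha_bar_t, alpha_bar_prev):
    """Mean of q(x_{t-1} | x_t, x_0) written in terms of x_0 and x_t."""
    alpha_t = 1 - beta_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1 - alpha_bar_prev) / (1 - alpha_bar_t)
    return coef_x0 * x0 + coef_xt * x_t

def posterior_mean_eps_form(x_t, eps, beta_t, alpha_bar_t):
    """Same mean written in terms of x_t and the noise eps."""
    alpha_t = 1 - beta_t
    return (x_t - beta_t / math.sqrt(1 - alpha_bar_t) * eps) / math.sqrt(alpha_t)

beta_t, alpha_bar_prev = 0.02, 0.5           # illustrative values
alpha_bar_t = alpha_bar_prev * (1 - beta_t)  # alpha_bar_t = alpha_bar_{t-1} * alpha_t
x0, eps = 0.6, 1.1
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1 - alpha_bar_t) * eps

m1 = posterior_mean_x0_form(x0, x_t, beta_t, alpha_bar_t, alpha_bar_prev)
m2 = posterior_mean_eps_form(x_t, eps, beta_t, alpha_bar_t)
print(m1, m2)  # the two forms agree up to rounding
```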

3. Training Objective (Loss Function)

During training, the model learns to predict the noise added at each step by comparing its predictions with the actual noise.

The objective function is:

L(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]

It represents the Mean Squared Error (MSE) between the actual noise and the predicted noise, derived from the variational inference objective (ELBO).

The model minimizes the difference between the true noise \epsilon and the predicted noise \epsilon_{\theta}(x_{t},t).
This simplifies learning by converting a complex generation task into a noise prediction problem.
As training progresses, predictions improve, enabling the model to denoise effectively at every step.
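A useful reference point when reading this loss: because each element of \epsilon has unit variance, a predictor that always outputs zero scores an MSE of about 1. The sketch below illustrates this with synthetic numbers. The "good" predictor is a hypothetical stand-in, not a trained network.

```python
import random

random.seed(0)

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Target noise drawn from N(0, 1)
eps = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# A predictor that has learned nothing
zero_pred = [0.0] * len(eps)

# Hypothetical predictor whose error has standard deviation 0.1
good_pred = [e + 0.1 * random.gauss(0.0, 1.0) for e in eps]

baseline_loss = mse(eps, zero_pred)
good_loss = mse(eps, good_pred)
print(f"zero predictor: {baseline_loss:.3f}, good predictor: {good_loss:.4f}")
```

A training loss that stays close to 1 therefore suggests the network is not yet extracting any information about the noise from x_t.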

4. Sampling and Data Generation

After training, the model generates new data by starting from random noise and gradually refining it into meaningful structure.

The process begins with pure noise and applies the learned denoising steps iteratively until a realistic sample is formed.
Each step removes a small amount of noise, allowing patterns and structure to gradually emerge.
More steps generally produce higher-quality outputs, while fewer steps make generation faster but slightly less detailed.
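One common way to trade steps for speed is to visit only a subsequence of timesteps with a deterministic DDIM-style update. This is a different sampler from the step-by-step one described in this article and is shown here only as a sketch. It uses scalars and the true noise in place of a trained network, and `strided_timesteps` and `ddim_step` are hypothetical helper names.

```python
import math

T = 1000
# Assumed linear schedule from 1e-4 to 0.02
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1 - b
    alpha_bar.append(prod)

def strided_timesteps(T, num_steps):
    """Evenly spaced timesteps, descending from T - 1."""
    stride = T // num_steps
    return list(range(T - 1, -1, -stride))

def ddim_step(x_t, eps_hat, ab_t, ab_prev):
    """Deterministic DDIM-style update (eta = 0) between two noise levels."""
    x0_hat = (x_t - math.sqrt(1 - ab_t) * eps_hat) / math.sqrt(ab_t)
    return math.sqrt(ab_prev) * x0_hat + math.sqrt(1 - ab_prev) * eps_hat

steps = strided_timesteps(T, 50)  # 50 network calls instead of 1000

# With the exact noise, one jump lands on the forward-process sample at t_prev
x0, eps = 0.5, -0.7
t, t_prev = steps[0], steps[1]
x_t = math.sqrt(alpha_bar[t]) * x0 + math.sqrt(1 - alpha_bar[t]) * eps
x_prev = ddim_step(x_t, eps, alpha_bar[t], alpha_bar[t_prev])
print(len(steps), t, t_prev, x_prev)
```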

Step By Step Implementation

Here, we implement a diffusion model to understand how noise is added, learned and removed to generate new data.

Step 1: Import Necessary Libraries

Import all the necessary libraries required for building and training the diffusion model.

Python

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision import transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np

Step 2: Beta Schedule and Noise Schedule

Define a linear noise schedule that controls how much noise is added at each timestep and compute related parameters like \alpha_{t} and cumulative products used in the diffusion process.

Python

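def linear_beta_schedule(timesteps):
    # Noise variances increase linearly from beta_start to beta_end
    beta_start = 0.0001
    beta_end = 0.02
    return torch.linspace(beta_start, beta_end, timesteps)

T = 200  # number of diffusion timesteps
betas = linear_beta_schedule(T)

alphas = 1. - betas
# alpha_bar_t: cumulative product of alphas up to step t
alphas_cumprod = torch.cumprod(alphas, dim=0)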
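It can be worth checking how close the final step gets to pure noise for a given T. The sketch below recomputes the cumulative product in plain Python for the same linear endpoints and compares T = 200 with T = 1000. The latter value is an assumption chosen for comparison and is not used in this tutorial, and `final_alpha_bar` is a hypothetical helper.

```python
import math

def final_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar at the last timestep of a linear schedule with T steps."""
    prod = 1.0
    for i in range(T):
        beta = beta_start + i * (beta_end - beta_start) / (T - 1)
        prod *= 1 - beta
    return prod

for steps_count in (200, 1000):
    ab = final_alpha_bar(steps_count)
    print(f"T={steps_count}: alpha_bar_T={ab:.5f}, signal coefficient={math.sqrt(ab):.3f}")
```

With these endpoints, the T = 200 schedule leaves a noticeable fraction of the original signal in the final noisy sample, so starting generation from pure Gaussian noise is only an approximation. A larger T or larger betas reduce that mismatch.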

Step 3: Forward Diffusion Process

Here we define a function that adds noise to the original image at a given timestep using precomputed parameters, returning both the noisy image and the noise added for training.

Python

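def forward_diffusion_sample(x_0, t, noise=None):
    """
    Add noise to the image batch x_0 at timesteps t (one timestep per image).
    Returns the noisy batch and the noise that was added.
    """
    if noise is None:
        noise = torch.randn_like(x_0)
    # Move the schedule to the device of t before indexing (required on GPU),
    # then reshape to (B, 1, 1, 1) so it broadcasts over C, H and W.
    alpha_bar_t = alphas_cumprod.to(t.device)[t][:, None, None, None]
    sqrt_alphas_cumprod = torch.sqrt(alpha_bar_t)
    sqrt_one_minus_alphas_cumprod = torch.sqrt(1 - alpha_bar_t)
    return sqrt_alphas_cumprod * x_0 + sqrt_one_minus_alphas_cumprod * noise, noise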

Step 4: Neural Network (U Net or simple CNN)

Here, a simple convolutional neural network is built to take a noisy image as input and predict the noise present in it, which is then used for the denoising process.

Python

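class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Three 3x3 convolutions; padding=1 keeps the 28x28 spatial size
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x, t):
        # t is accepted for interface compatibility but ignored by this toy network
        return self.net(x)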
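Since this toy network does not use `t`, a natural extension is to feed it an embedding of the timestep. One common choice is a sinusoidal embedding in the style of Transformer positional encodings. The sketch below computes one in plain Python. The function name, `dim` and `max_period` are illustrative assumptions, and in a PyTorch model the resulting vector would typically be passed through a small MLP and added to the convolutional feature maps.

```python
import math

def sinusoidal_embedding(t, dim=8, max_period=10000.0):
    """Return a list of `dim` floats encoding the integer timestep t."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to about 1 / max_period
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_0 = sinusoidal_embedding(0)
emb_50 = sinusoidal_embedding(50)
print(emb_0)
print(emb_50)
```

Each timestep maps to a distinct, bounded vector, which gives the network a smooth signal for how much noise to expect.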

Step 5: Load Data and Train the Model

Prepares the dataset and trains the model to learn noise prediction.

Loads the MNIST dataset, applies normalization and creates batches using a DataLoader.
Randomly selects a timestep and adds noise to images using the forward diffusion process.
Trains the model by predicting noise and minimizing the MSE loss using backpropagation.

Python

def get_data():
    transform = transforms.Compose([
        transforms.ToTensor(),  # pixel values in [0, 1]
        lambda x: x * 2 - 1     # rescale to [-1, 1]
    ])
    dataset = MNIST(root="./data", train=True, download=True, transform=transform)
    dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
    return dataloader

def train(model, dataloader, optimizer, epochs=5):
    model.train()
    for epoch in range(epochs):
        for step, (x, _) in enumerate(dataloader):
            x = x.to(device)
            # Sample one random timestep per image in the batch
            t = torch.randint(0, T, (x.shape[0],), device=device).long()
            x_noisy, noise = forward_diffusion_sample(x, t)
            # Predict the added noise and regress it with MSE
            noise_pred = model(x_noisy, t)
            loss = F.mse_loss(noise_pred, noise)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Reports the loss of the last batch in the epoch
        print(f"Epoch {epoch}: Loss {loss.item():.4f}")

Step 6: Sampling (Generating New Images)

This step generates new images by starting from random noise and gradually denoising it using the trained model.

Initializes random noise and iteratively applies the reverse diffusion process from timestep T - 1 down to 0.
At each step, the model predicts noise and removes it using the learned denoising formula.
Gradually refines the noisy input into a realistic image sample.

Python

@torch.no_grad()
def sample(model, image_size, num_samples):
    model.eval()
    # Start from pure Gaussian noise
    x = torch.randn((num_samples, 1, image_size, image_size), device=device)
    for t in reversed(range(T)):  # t = T-1, ..., 0
        t_tensor = torch.full((num_samples,), t, device=device, dtype=torch.long)
        pred_noise = model(x, t_tensor)
        alpha = alphas[t]
        alpha_bar = alphas_cumprod[t]
        beta = betas[t]
        # Add fresh noise at every step except the last one
        if t > 0:
            noise = torch.randn_like(x)
        else:
            noise = torch.zeros_like(x)
        # Mean computed from the predicted noise, plus sigma_t * z with sigma_t^2 = beta_t
        x = (1 / torch.sqrt(alpha)) * (
            x - (beta / torch.sqrt(1 - alpha_bar)) * pred_noise
        ) + torch.sqrt(beta) * noise
    return x
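The loop above scales the added noise by \sqrt{\beta_t}. Another choice discussed in the DDPM literature is the forward-posterior variance \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t, which is never larger than \beta_t. The sketch below compares the two in plain Python for the same linear schedule; `posterior_variance` is a hypothetical helper, and swapping it into `sample` is left as an experiment.

```python
T = 200
# Same linear endpoints as the tutorial's schedule
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1 - b
    alpha_bar.append(prod)

def posterior_variance(t):
    """beta_tilde for 0-based timestep t; alpha_bar before step 0 is taken as 1."""
    ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
    return (1 - ab_prev) / (1 - alpha_bar[t]) * betas[t]

for t in (0, 1, T // 2, T - 1):
    print(f"t={t:3d}  beta={betas[t]:.6f}  beta_tilde={posterior_variance(t):.6f}")
```

The two choices differ most at small t and become close at large t, so the difference mainly affects the last few denoising steps.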

Step 7: Running the Model

Initializes the model and optimizer, trains the diffusion model on the dataset and then generates and visualizes new image samples produced by the model.

Python

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SimpleModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataloader = get_data()

train(model, dataloader, optimizer)

samples = sample(model, 28, 16)
grid = torchvision.utils.make_grid(samples.cpu(), nrow=4, normalize=True)
plt.imshow(grid.permute(1, 2, 0))
plt.axis("off")
plt.show()


Applications

Image Generation: Diffusion models are widely used to generate realistic images from random noise. This application is especially popular in fields like art, gaming, advertising and graphic design where high quality visuals are essential.
Image Editing and Inpainting: They enable advanced editing by filling in missing or damaged parts of an image. This is useful in photo restoration, object removal or editing specific regions without affecting the whole image.
Text to Image Generation: By converting written prompts into images, diffusion models allow creators to bring their ideas to life visually. This is used in storytelling, concept design, marketing and more.
Super Resolution: Diffusion models can improve the quality of low resolution images by enhancing details. This application benefits medical imaging, satellite photos and surveillance footage.

Advantages

Flexibility: They can model complex data distributions without requiring explicit likelihood estimation.
High Quality Generation: Diffusion models generate high quality samples often surpassing other generative models like GANs.
Stable Training: Unlike GANs diffusion models avoid issues like mode collapse and unstable training dynamics.
Theoretical Foundations: Based on well understood principles from stochastic processes and statistical mechanics.

Limitations

Slow Sampling: Generating samples can be slow because of the many steps needed for the reverse diffusion process.
Complexity: The architecture and training process can be complex making them challenging to implement and understand.
Memory Usage: High memory and compute consumption, since large denoising networks must be trained on high-resolution data and evaluated many times per generated sample.
Fine Tuning: Requires careful tuning of noise schedules and other hyperparameters to achieve optimal performance.
