ℹ️ Definition Diffusion Models are generative models that create data by gradually denoising random noise through a learned reverse diffusion process. They are trained to invert a forward process that progressively adds Gaussian noise to data until only pure noise remains.
By the end of this lesson, you will:
- Understand the forward (noising) and reverse (denoising) diffusion processes
- Train a simple DDPM with a U-Net noise-prediction network
- Speed up sampling with DDIM and steer generation with classifier-free guidance
- Use Stable Diffusion and its variants (img2img, inpainting, ControlNet) via the diffusers library
In Lessons 10-12, we explored VAEs and GANs for generative modeling. Now we'll learn Diffusion Models - the current state-of-the-art for image generation!
What makes diffusion models special:
- State-of-the-art sample quality for image generation
- Very stable training: a simple regression loss, no adversarial game
- Excellent mode coverage (no mode collapse)
- Flexible conditioning and control (classifier-free guidance, ControlNet)
Real-world applications:
- Text-to-image generation (e.g., Stable Diffusion)
- Image editing: image-to-image translation and inpainting
- Controllable generation from edge maps, poses, and depth maps (ControlNet)
- Video generation
Idea: Gradually add noise to an image until it becomes pure noise
Process:
Clean image x₀ → Noisy x₁ → Noisier x₂ → ... → Pure noise x_T
Example:
[Photo of cat] → [Slightly noisy] → [Very noisy] → [Random noise]
Mathematical formulation:
q(xₜ | xₜ₋₁) = N(xₜ; √(1-βₜ) xₜ₋₁, βₜI)
Where:
- βₜ is the noise variance added at step t (a small value set by the noise schedule)
- I is the identity matrix (independent Gaussian noise per pixel)
- t runs from 1 to T, the total number of diffusion steps (typically T = 1000)
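To make the formula concrete, here is a minimal sketch of a single forward step in PyTorch (the beta value and tensor shapes are illustrative, not taken from any library):
import torch

beta_t = 0.02                        # noise variance at this step (illustrative value)
x_prev = torch.randn(1, 3, 64, 64)   # stands in for x_{t-1}

# Sample from q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
noise = torch.randn_like(x_prev)
x_t = (1 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise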
Idea: Learn to reverse the forward process - denoise step by step
Process:
Pure noise x_T → Denoised xₜ₋₁ → ... → Clean image x₀
Challenge: We need a model to predict how to denoise!
Solution: Train a neural network εθ(xₜ, t) to predict the noise added at each step

Key insight: We can jump to any timestep t directly!
Direct sampling formula:
xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε, where ε ~ N(0, I)
Where:
- αₜ = 1 - βₜ
- ᾱₜ = ∏ₛ₌₁ᵗ αₛ (the cumulative product of the αₛ up to step t)
- ε is standard Gaussian noise
Benefit: Training is efficient - sample any timestep directly!
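A minimal sketch of this closed-form jump (the schedule values and shapes are illustrative):
import torch

T = 1000
betas = torch.linspace(0.0001, 0.02, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)   # alpha_bar_t for every t

x_0 = torch.randn(1, 3, 64, 64)   # stands in for a normalized training image
t = 500                           # any timestep, no iteration needed

eps = torch.randn_like(x_0)
x_t = torch.sqrt(alphas_cumprod[t]) * x_0 + torch.sqrt(1 - alphas_cumprod[t]) * eps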
Common schedules:
1. Linear Schedule (original DDPM):
import numpy as np

T = 1000  # total number of diffusion steps
beta_start = 0.0001
beta_end = 0.02
betas = np.linspace(beta_start, beta_end, T)
2. Cosine Schedule (improved):
def cosine_beta_schedule(timesteps):
    s = 0.008
    steps = timesteps + 1
    x = np.linspace(0, timesteps, steps)
    alphas_cumprod = np.cos(((x / timesteps) + s) / (1 + s) * np.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return np.clip(betas, 0.0001, 0.9999)
Why cosine? Noise is added more gradually, which tends to improve sample quality.
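As a quick check of that claim, the sketch below (using the cosine_beta_schedule function above and NumPy) prints how much of the original signal, ᾱₜ, survives at a few timesteps under each schedule; the cosine schedule keeps more signal at intermediate steps:
import numpy as np

T = 1000
linear_alphas_cumprod = np.cumprod(1 - np.linspace(0.0001, 0.02, T))
cosine_alphas_cumprod = np.cumprod(1 - cosine_beta_schedule(T))

# Compare how much signal survives at a few timesteps
for t in [100, 500, 900]:
    print(f"t={t}: linear alpha_bar={linear_alphas_cumprod[t]:.4f}, "
          f"cosine alpha_bar={cosine_alphas_cumprod[t]:.4f}")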
Goal: Train model εθ to predict noise ε given noisy image xₜ and timestep t
Loss function (simplified):
L = E_{x₀, ε, t} [ ||ε - εθ(xₜ, t)||² ]
Where:
- ε is the true noise added to x₀
- εθ(xₜ, t) is the model's prediction of that noise
- the expectation is taken over training images x₀, noise samples ε, and timesteps t
Training algorithm:
1. Sample x₀ from dataset
2. Sample timestep t ~ Uniform(1, T)
3. Sample noise ε ~ N(0, I)
4. Compute xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε
5. Predict noise: ε_pred = εθ(xₜ, t)
6. Compute loss: L = ||ε - ε_pred||²
7. Backpropagate and update θ
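As a compact sketch, the seven steps above collapse into one function (the name diffusion_loss and the argument layout are illustrative; alphas_cumprod is the cumulative-product array from the noise schedule):
import torch
import torch.nn as nn

def diffusion_loss(model, x_0, alphas_cumprod, T):
    """Noise-prediction loss for one batch x_0 (steps 1-7 above)."""
    batch_size = x_0.shape[0]

    # 2. Sample a random timestep for each example
    t = torch.randint(0, T, (batch_size,), device=x_0.device)

    # 3. Sample Gaussian noise
    eps = torch.randn_like(x_0)

    # 4. Jump directly to x_t with the closed-form formula
    sqrt_alpha_bar = torch.sqrt(alphas_cumprod[t])[:, None, None, None]
    sqrt_one_minus = torch.sqrt(1 - alphas_cumprod[t])[:, None, None, None]
    x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus * eps

    # 5-6. Predict the noise and compare it to the true noise
    eps_pred = model(x_t, t)
    return nn.functional.mse_loss(eps_pred, eps)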
Reverse process (iterative denoising):
# Start from pure noise
x_t = torch.randn(batch_size, 3, 64, 64)

# Iteratively denoise
for t in reversed(range(T)):
    # Predict noise (one timestep index per image in the batch)
    t_batch = torch.full((batch_size,), t, dtype=torch.long)
    eps_pred = model(x_t, t_batch)

    # Schedule terms for this step
    alpha_t = alphas[t]
    alpha_t_bar = alphas_cumprod[t]

    # Mean of the reverse distribution p(x_{t-1} | x_t)
    mu = (x_t - ((1 - alpha_t) / torch.sqrt(1 - alpha_t_bar)) * eps_pred) / torch.sqrt(alpha_t)

    # Add noise (except at the last step)
    if t > 0:
        noise = torch.randn_like(x_t)
        sigma_t = torch.sqrt(betas[t])
        x_t = mu + sigma_t * noise
    else:
        x_t = mu

# x_t is now x_0, the generated image
Challenge: 1000 steps = slow sampling! (Solutions: DDIM, Latent Diffusion)
Requirements:
- Input: the noisy image xₜ and the timestep t
- Output: a predicted noise tensor with the same shape as the input
- Must capture both global structure and fine detail (multi-scale features)
U-Net design:
Encoder (downsampling) ──┐
         ↓               │ skip connections
     Bottleneck          │
         ↓               │
Decoder (upsampling) ←───┘
import torch
import torch.nn as nn
class SinusoidalPositionEmbeddings(nn.Module):
"""Encode timestep t as vector"""
def __init__(self, dim):
super().__init__()
self.dim = dim
def forward(self, time):
device = time.device
half_dim = self.dim // 2
embeddings = torch.log(torch.tensor(10000.0)) / (half_dim - 1)
embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
embeddings = time[:, None] * embeddings[None, :]
embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
return embeddings
class ResidualBlock(nn.Module):
"""Residual block with timestep conditioning"""
def __init__(self, in_channels, out_channels, time_emb_dim):
super().__init__()
self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
# Time embedding projection
self.time_mlp = nn.Linear(time_emb_dim, out_channels)
# Residual connection
self.residual = nn.Conv2d(in_channels, out_channels, 1) if in_channels != out_channels else nn.Identity()
self.norm1 = nn.GroupNorm(8, out_channels)
self.norm2 = nn.GroupNorm(8, out_channels)
self.act = nn.SiLU()
def forward(self, x, time_emb):
h = self.conv1(x)
h = self.norm1(h)
h = self.act(h)
# Add timestep conditioning
time_emb = self.act(self.time_mlp(time_emb))
h = h + time_emb[:, :, None, None] # Broadcast to spatial dims
h = self.conv2(h)
h = self.norm2(h)
h = self.act(h)
return h + self.residual(x)
class UNet(nn.Module):
def __init__(self, in_channels=3, out_channels=3, base_channels=64, time_emb_dim=128):
super().__init__()
# Timestep embedding
self.time_mlp = nn.Sequential(
SinusoidalPositionEmbeddings(time_emb_dim),
nn.Linear(time_emb_dim, time_emb_dim),
nn.SiLU(),
)
# Encoder (downsampling)
self.enc1 = ResidualBlock(in_channels, base_channels, time_emb_dim)
self.enc2 = ResidualBlock(base_channels, base_channels * 2, time_emb_dim)
self.enc3 = ResidualBlock(base_channels * 2, base_channels * 4, time_emb_dim)
self.pool = nn.MaxPool2d(2)
# Bottleneck
self.bottleneck = ResidualBlock(base_channels * 4, base_channels * 8, time_emb_dim)
# Decoder (upsampling)
self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
self.dec3 = ResidualBlock(base_channels * 8 + base_channels * 4, base_channels * 4, time_emb_dim)
self.dec2 = ResidualBlock(base_channels * 4 + base_channels * 2, base_channels * 2, time_emb_dim)
self.dec1 = ResidualBlock(base_channels * 2 + base_channels, base_channels, time_emb_dim)
# Output
self.out = nn.Conv2d(base_channels, out_channels, 1)
def forward(self, x, t):
# Timestep embedding
time_emb = self.time_mlp(t)
# Encoder
enc1 = self.enc1(x, time_emb)
enc2 = self.enc2(self.pool(enc1), time_emb)
enc3 = self.enc3(self.pool(enc2), time_emb)
# Bottleneck
bottleneck = self.bottleneck(self.pool(enc3), time_emb)
# Decoder with skip connections
dec3 = self.dec3(torch.cat([self.upsample(bottleneck), enc3], dim=1), time_emb)
dec2 = self.dec2(torch.cat([self.upsample(dec3), enc2], dim=1), time_emb)
dec1 = self.dec1(torch.cat([self.upsample(dec2), enc1], dim=1), time_emb)
# Output
return self.out(dec1)
1. Sinusoidal Embeddings: Encode timestep t as continuous vector
2. Residual Blocks: Conv + GroupNorm + SiLU activation + skip connection
3. Time Conditioning: Inject timestep embedding into each block
4. Skip Connections: Preserve high-frequency details from encoder
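A quick shape check of the model above (the batch size and timesteps are arbitrary):
model = UNet()
x = torch.randn(4, 3, 64, 64)       # a batch of noisy images
t = torch.randint(0, 1000, (4,))    # one timestep per image
eps_pred = model(x, t)
print(eps_pred.shape)               # torch.Size([4, 3, 64, 64]), same shape as the input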

DDPM sampling: Requires 1000 steps (slow!)
Key insight: Deterministic sampling with fewer steps
DDIM sampling (non-Markovian):
# DDIM can skip timesteps: e.g. ~10 evenly spaced steps instead of 1000
timesteps = [1000, 900, 800, ..., 100, 0]

for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    # Predict noise
    eps_pred = model(x_t, t)

    # Predict x_0 directly from x_t and the predicted noise
    alpha_t_bar = alphas_cumprod[t]
    x_0_pred = (x_t - torch.sqrt(1 - alpha_t_bar) * eps_pred) / torch.sqrt(alpha_t_bar)

    # Step to x_{t_prev} deterministically (eta = 0)
    alpha_t_prev_bar = alphas_cumprod[t_prev]
    x_t_prev = torch.sqrt(alpha_t_prev_bar) * x_0_pred + torch.sqrt(1 - alpha_t_prev_bar) * eps_pred
    x_t = x_t_prev
Benefits:
- 10-50× fewer sampling steps with little loss in quality
- Deterministic: the same starting noise always maps to the same image
- Enables meaningful interpolation between samples in noise space
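In the diffusers library, DDIM is available as a scheduler that selects a subset of the training timesteps; a small sketch (the 50-step choice is arbitrary):
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)      # sample with only 50 of the 1000 training steps
print(scheduler.timesteps[:5])   # e.g. tensor([980, 960, 940, 920, 900])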
Goal: Generate images conditioned on class label y
Naive approach: Train class-conditional model pθ(x|y)
Classifier guidance approach:
ε̂(xₜ, t) = εθ(xₜ, t) - √(1-ᾱₜ) ∇ₓₜ log p(y|xₜ)   (the guided noise prediction)
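A hedged sketch of that gradient term, assuming a hypothetical noise-aware classifier(x_t, t) that returns class logits:
import torch

def classifier_grad(classifier, x_t, t, y):
    """Gradient of log p(y | x_t) with respect to x_t."""
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
    selected = log_probs[torch.arange(x_t.shape[0]), y].sum()
    return torch.autograd.grad(selected, x_in)[0]

# Guided noise prediction (alpha_bar_t comes from the noise schedule):
# eps_guided = eps_pred - torch.sqrt(1 - alpha_bar_t) * classifier_grad(classifier, x_t, t, y)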
Problem: Requires separate classifier p(y|xₜ)
Better approach: Train single model for both conditional and unconditional
Training:
# Randomly drop the conditioning ~10% of the time
if torch.rand(1).item() < 0.1:
    cond = None  # unconditional
else:
    cond = y     # conditional

# Train the model eps_theta(x_t, t, cond) with the usual noise-prediction loss
Sampling (guidance):
# Predict noise both ways
eps_uncond = model(x_t, t, cond=None)
eps_cond = model(x_t, t, cond=y)

# Guidance scale (w > 1 amplifies conditioning)
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
Effect of guidance scale w:
- w = 1: plain conditional sampling, no extra guidance
- w ≈ 7-10: good trade-off between prompt fidelity and image quality (typical default)
- Large w (> 15): very strong prompt adherence, but less diversity and often over-saturated images
Challenge: High-resolution images are expensive
Idea: Run diffusion in compressed latent space!
Architecture:
Text → CLIP Encoder → Text Embedding (conditions the U-Net via cross-attention)

Image → VAE Encoder → Latent z (64×64×4)
             ↓
   U-Net Diffusion (in latent space)
             ↓
   VAE Decoder → Image (512×512×3)
Benefits:
- The U-Net runs on a 64×64×4 latent instead of a 512×512×3 image (roughly 48× fewer values per sample)
- Much faster training and sampling, and far lower memory use
- Makes high-resolution text-to-image generation practical on a single GPU
Components:
1. VAE (Variational Autoencoder): compresses the image into the latent z and decodes the final latent back into a full-resolution image
2. CLIP Text Encoder: turns the text prompt into embeddings that condition the generation
3. U-Net with Cross-Attention: denoises the latent while attending to the text embeddings at each block
4. Scheduler: defines the noise schedule and the update rule for each denoising step (e.g., DDIM)
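With the diffusers library, all four components are attributes of the loaded pipeline; a short sketch for inspecting them (the printed class names are what current versions typically report):
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)

print(type(pipe.vae).__name__)           # AutoencoderKL (1. VAE)
print(type(pipe.text_encoder).__name__)  # CLIPTextModel (2. text encoder)
print(type(pipe.unet).__name__)          # UNet2DConditionModel (3. U-Net with cross-attention)
print(type(pipe.scheduler).__name__)     # e.g. DDIMScheduler (4. scheduler)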

from diffusers import StableDiffusionPipeline
import torch
# Load model (Stable Diffusion 2.1)
model_id = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
# Generate image
prompt = "A beautiful landscape with mountains and a lake, digital art, highly detailed"
negative_prompt = "blurry, low quality, distorted"
image = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
num_inference_steps=50, # DDIM steps (lower = faster)
guidance_scale=7.5, # CFG scale
height=512,
width=512,
).images[0]
image.save("generated.png")
Key parameters:
- num_inference_steps: number of denoising steps (lower = faster, higher = usually better quality)
- guidance_scale: classifier-free guidance strength (7-10 is a common range)
- negative_prompt: describes what the image should not contain
- height / width: output resolution in pixels
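One useful habit is to fix the random seed and vary a single parameter at a time; a sketch sweeping the guidance scale (the prompt and seed are arbitrary, and pipe is the pipeline loaded above):
import torch

prompt = "A beautiful landscape with mountains and a lake, digital art"

for scale in [1.0, 7.5, 15.0]:
    # Same seed => same starting noise, so only the guidance strength changes
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(prompt, num_inference_steps=50, guidance_scale=scale,
                 generator=generator).images[0]
    image.save(f"landscape_cfg_{scale}.png")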
Use case: Transform existing images based on prompt
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(model_id)
pipe = pipe.to("cuda")

# Load initial image
init_image = Image.open("photo.jpg").convert("RGB").resize((512, 512))

# Generate variation
image = pipe(
    prompt="Turn this into a watercolor painting",
    image=init_image,
    strength=0.75,       # How much to transform (0-1)
    guidance_scale=7.5,
).images[0]
Strength parameter:
- Low (≈0.2-0.4): stays close to the original image, light restyling only
- Medium (≈0.5-0.8): keeps the overall composition but changes style and details
- 1.0: essentially ignores the input image and generates from scratch
Use case: Fill masked regions of image
from diffusers import StableDiffusionInpaintPipeline

# Inpainting needs an inpainting-specific checkpoint (its U-Net takes the mask as extra input)
pipe = StableDiffusionInpaintPipeline.from_pretrained("stabilityai/stable-diffusion-2-inpainting")
# Load image and mask
image = Image.open("photo.jpg")
mask = Image.open("mask.png") # White = inpaint, Black = keep
# Inpaint
result = pipe(
prompt="A red apple on a table",
image=image,
mask_image=mask,
).images[0]
Use case: Control generation with edge maps, poses, depth maps
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# ControlNet checkpoints are tied to a base model family; sd-controlnet-canny
# was trained for Stable Diffusion 1.5, so pair it with a matching 1.5 base model
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    model_id,  # should point to a Stable Diffusion 1.5 checkpoint here
    controlnet=controlnet,
)
# Generate with edge map guidance
image = pipe(
prompt="A beautiful house",
image=edge_map, # Canny edge detection of layout
).images[0]
| Aspect | VAE | GAN | Diffusion |
|---|---|---|---|
| Training Stability | ✅ Stable | ❌ Unstable | ✅ Very Stable |
| Sample Quality | ⚠️ Blurry | ✅ Sharp | ✅ Sharp |
| Mode Coverage | ✅ Good | ❌ Mode collapse | ✅ Excellent |
| Likelihood | ✅ Computable | ❌ No | ⚠️ Lower bound |
| Sampling Speed | ✅ Fast (1 step) | ✅ Fast (1 step) | ❌ Slow (50-1000 steps) |
| Controllability | ⚠️ Limited | ⚠️ Limited | ✅ Excellent (CFG) |
| Training Time | ✅ Fast | ⚠️ Medium | ❌ Slow |
When to use each:
- VAE: when you need fast sampling, a smooth latent space, or an explicit likelihood
- GAN: when you need sharp samples generated in a single fast forward pass
- Diffusion: when sample quality, mode coverage, and controllability matter more than sampling speed
Models:
Use cases:
Techniques:
Recent advances:
Challenge: Temporal consistency across frames
Methods:
Models:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Dataset: CIFAR-10 images scaled to [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
train_dataset = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

# Initialize model
model = UNet(in_channels=3, out_channels=3).to("cuda")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Noise schedule
T = 1000
betas = torch.linspace(0.0001, 0.02, T).to("cuda")
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

# Training loop
for epoch in range(100):
    for batch_idx, (x_0, _) in enumerate(train_loader):
        x_0 = x_0.to("cuda")

        # Sample random timestep
        t = torch.randint(0, T, (x_0.shape[0],), device="cuda")

        # Sample noise
        eps = torch.randn_like(x_0)

        # Add noise to x_0 (closed-form forward process)
        sqrt_alpha_bar = torch.sqrt(alphas_cumprod[t])[:, None, None, None]
        sqrt_one_minus_alpha_bar = torch.sqrt(1 - alphas_cumprod[t])[:, None, None, None]
        x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * eps

        # Predict noise
        eps_pred = model(x_t, t)

        # Compute loss
        loss = nn.MSELoss()(eps_pred, eps)

        # Backprop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch_idx % 100 == 0:
            print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")
1. Dataset: Start with CIFAR-10 (32x32) or CelebA (64x64)
2. Noise schedule: Cosine schedule works better than linear
3. Learning rate: 1e-4 with warmup (see the sketch after this list)
4. Batch size: As large as GPU allows (64-128)
5. Training time: 100K-1M steps for good results
6. GPU: T4 or better recommended
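For the warmup in tip 3, a minimal sketch using PyTorch's built-in LambdaLR (the 1,000-step warmup length is an assumption, not a recommendation from a specific paper):
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

warmup_steps = 1000  # assumed warmup length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # linear ramp up to 1e-4
)

# Call scheduler.step() after each optimizer.step() in the training loop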
| Parameter | Typical Value | Effect |
|---|---|---|
| Timesteps (T) | 1000 | More steps = smoother process |
| Noise schedule | Cosine | Controls noise addition rate |
| Inference steps | 20-100 | More steps = better quality |
| Guidance scale | 7-10 | Higher = stronger conditioning |
| Learning rate | 1e-4 | Standard for diffusion |
| Batch size | 64-128 | Larger = more stable |