ℹ️ Definition Diffusion Models are generative models that create data by gradually denoising random noise through a learned reverse diffusion process. They are trained to invert a forward process that progressively adds Gaussian noise to data until only pure noise remains.
By the end of this lesson, you will:
- Understand the forward (noising) and reverse (denoising) diffusion processes
- Train a simple DDPM with a U-Net noise-prediction network
- Speed up sampling with DDIM and steer generation with classifier-free guidance
- Use Stable Diffusion and its variants (img2img, inpainting, ControlNet) via the diffusers library
In Lessons 10-12, we explored VAEs and GANs for generative modeling. Now we'll learn Diffusion Models - the current state-of-the-art for image generation!
What makes diffusion models special:
- State-of-the-art sample quality for image generation
- Very stable training: a simple regression loss, no adversarial game
- Excellent mode coverage (no mode collapse)
- Flexible conditioning and control (classifier-free guidance, ControlNet)
Real-world applications:
- Text-to-image generation (e.g., Stable Diffusion)
- Image editing: image-to-image translation and inpainting
- Controllable generation from edge maps, poses, and depth maps (ControlNet)
- Video generation
Idea: Gradually add noise to an image until it becomes pure noise
Process:
Clean image x₀ → Noisy x₁ → Noisier x₂ → ... → Pure noise x_T
Example:
[Photo of cat] → [Slightly noisy] → [Very noisy] → [Random noise]
Mathematical formulation:
q(xₜ | xₜ₋₁) = N(xₜ; √(1-βₜ) xₜ₋₁, βₜI)
Where:
- βₜ is the noise variance added at step t (a small value set by the noise schedule)
- I is the identity matrix (independent Gaussian noise per pixel)
- t runs from 1 to T, the total number of diffusion steps (typically T = 1000)
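To make the formula concrete, here is a minimal sketch of a single forward step in PyTorch (the beta value and tensor shapes are illustrative, not taken from any library):
import torch

beta_t = 0.02                        # noise variance at this step (illustrative value)
x_prev = torch.randn(1, 3, 64, 64)   # stands in for x_{t-1}

# Sample from q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
noise = torch.randn_like(x_prev)
x_t = (1 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise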
Idea: Learn to reverse the forward process - denoise step by step
Process:
Pure noise x_T → Denoised xₜ₋₁ → ... → Clean image x₀
Challenge: We need a model to predict how to denoise!
Solution: Train a neural network εθ(xₜ, t) to predict the noise added at each step

Key insight: We can jump to any timestep t directly!
Direct sampling formula:
xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε, where ε ~ N(0, I)
Where:
- αₜ = 1 - βₜ
- ᾱₜ = ∏ₛ₌₁ᵗ αₛ (the cumulative product of the αₛ up to step t)
- ε is standard Gaussian noise
Benefit: Training is efficient - sample any timestep directly!
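A minimal sketch of this closed-form jump (the schedule values and shapes are illustrative):
import torch

T = 1000
betas = torch.linspace(0.0001, 0.02, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)   # alpha_bar_t for every t

x_0 = torch.randn(1, 3, 64, 64)   # stands in for a normalized training image
t = 500                           # any timestep, no iteration needed

eps = torch.randn_like(x_0)
x_t = torch.sqrt(alphas_cumprod[t]) * x_0 + torch.sqrt(1 - alphas_cumprod[t]) * eps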
Common schedules:
1. Linear Schedule (original DDPM):
import numpy as np

T = 1000  # total number of diffusion steps
beta_start = 0.0001
beta_end = 0.02
betas = np.linspace(beta_start, beta_end, T)
2. Cosine Schedule (improved):
def cosine_beta_schedule(timesteps):
    s = 0.008
    steps = timesteps + 1
    x = np.linspace(0, timesteps, steps)
    alphas_cumprod = np.cos(((x / timesteps) + s) / (1 + s) * np.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return np.clip(betas, 0.0001, 0.9999)
Why cosine? Noise is added more gradually, which tends to improve sample quality.
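As a quick check of that claim, the sketch below (using the cosine_beta_schedule function above and NumPy) prints how much of the original signal, ᾱₜ, survives at a few timesteps under each schedule; the cosine schedule keeps more signal at intermediate steps:
import numpy as np

T = 1000
linear_alphas_cumprod = np.cumprod(1 - np.linspace(0.0001, 0.02, T))
cosine_alphas_cumprod = np.cumprod(1 - cosine_beta_schedule(T))

# Compare how much signal survives at a few timesteps
for t in [100, 500, 900]:
    print(f"t={t}: linear alpha_bar={linear_alphas_cumprod[t]:.4f}, "
          f"cosine alpha_bar={cosine_alphas_cumprod[t]:.4f}")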
Goal: Train model εθ to predict noise ε given noisy image xₜ and timestep t
Loss function (simplified):
L = E_{x₀, ε, t} [ ||ε - εθ(xₜ, t)||² ]
Where:
- ε is the true noise added to x₀
- εθ(xₜ, t) is the model's prediction of that noise
- the expectation is taken over training images x₀, noise samples ε, and timesteps t
Training algorithm:
1. Sample x₀ from dataset
2. Sample timestep t ~ Uniform(1, T)
3. Sample noise ε ~ N(0, I)
4. Compute xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε
5. Predict noise: ε_pred = εθ(xₜ, t)
6. Compute loss: L = ||ε - ε_pred||²
7. Backpropagate and update θ
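As a compact sketch, the seven steps above collapse into one function (the name diffusion_loss and the argument layout are illustrative; alphas_cumprod is the cumulative-product array from the noise schedule):
import torch
import torch.nn as nn

def diffusion_loss(model, x_0, alphas_cumprod, T):
    """Noise-prediction loss for one batch x_0 (steps 1-7 above)."""
    batch_size = x_0.shape[0]

    # 2. Sample a random timestep for each example
    t = torch.randint(0, T, (batch_size,), device=x_0.device)

    # 3. Sample Gaussian noise
    eps = torch.randn_like(x_0)

    # 4. Jump directly to x_t with the closed-form formula
    sqrt_alpha_bar = torch.sqrt(alphas_cumprod[t])[:, None, None, None]
    sqrt_one_minus = torch.sqrt(1 - alphas_cumprod[t])[:, None, None, None]
    x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus * eps

    # 5-6. Predict the noise and compare it to the true noise
    eps_pred = model(x_t, t)
    return nn.functional.mse_loss(eps_pred, eps)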
Reverse process (iterative denoising):
# Start from pure noise
x_t = torch.randn(batch_size, 3, 64, 64)

# Iteratively denoise
for t in reversed(range(T)):
    # Predict noise (one timestep index per image in the batch)
    t_batch = torch.full((batch_size,), t, dtype=torch.long)
    eps_pred = model(x_t, t_batch)

    # Schedule terms for this step
    alpha_t = alphas[t]
    alpha_t_bar = alphas_cumprod[t]

    # Mean of the reverse distribution p(x_{t-1} | x_t)
    mu = (x_t - ((1 - alpha_t) / torch.sqrt(1 - alpha_t_bar)) * eps_pred) / torch.sqrt(alpha_t)

    # Add noise (except at the last step)
    if t > 0:
        noise = torch.randn_like(x_t)
        sigma_t = torch.sqrt(betas[t])
        x_t = mu + sigma_t * noise
    else:
        x_t = mu

# x_t is now x_0, the generated image
Challenge: 1000 steps = slow sampling! (Solutions: DDIM, Latent Diffusion)
Requirements:
- Input: the noisy image xₜ and the timestep t
- Output: a predicted noise tensor with the same shape as the input
- Must capture both global structure and fine detail (multi-scale features)
U-Net design:
Encoder (downsampling) ──┐
         ↓               │ skip connections
     Bottleneck          │
         ↓               │
Decoder (upsampling) ←───┘
import torch
import torch.nn as nn
class SinusoidalPositionEmbeddings(nn.Module):
"""Encode timestep t as vector"""
def __init__(self, dim):
super().__init__()
self.dim = dim
def forward(self, time):
device = time.device
half_dim = self.dim // 2
embeddings = torch.log(torch.tensor(10000.0)) / (half_dim - 1)
embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
embeddings = time[:, None] * embeddings[None, :]
embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
return embeddings
class ResidualBlock(nn.Module):
"""Residual block with timestep conditioning"""
def __init__(self, in_channels, out_channels, time_emb_dim):
super().__init__()
self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
# Time embedding projection
self.time_mlp = nn.Linear(time_emb_dim, out_channels)
# Residual connection
self.residual = nn.Conv2d(in_channels, out_channels, 1) if in_channels != out_channels else nn.Identity()
self.norm1 = nn.GroupNorm(8, out_channels)
self.norm2 = nn.GroupNorm(8, out_channels)
self.act = nn.SiLU()
def forward(self, x, time_emb):
h = self.conv1(x)
h = self.norm1(h)
h = self.act(h)
# Add timestep conditioning
time_emb = self.act(self.time_mlp(time_emb))
h = h + time_emb[:, :, None, None] # Broadcast to spatial dims
h = self.conv2(h)
h = self.norm2(h)
h = self.act(h)
return h + self.residual(x)
class UNet(nn.Module):
def __init__(self, in_channels=3, out_channels=3, base_channels=64, time_emb_dim=128):
super().__init__()
# Timestep embedding
self.time_mlp = nn.Sequential(
SinusoidalPositionEmbeddings(time_emb_dim),
nn.Linear(time_emb_dim, time_emb_dim),
nn.SiLU(),
)
# Encoder (downsampling)
self.enc1 = ResidualBlock(in_channels, base_channels, time_emb_dim)
self.enc2 = ResidualBlock(base_channels, base_channels * 2, time_emb_dim)
self.enc3 = ResidualBlock(base_channels * 2, base_channels * 4, time_emb_dim)
self.pool = nn.MaxPool2d(2)
# Bottleneck
self.bottleneck = ResidualBlock(base_channels * 4, base_channels * 8, time_emb_dim)
# Decoder (upsampling)
self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
self.dec3 = ResidualBlock(base_channels * 8 + base_channels * 4, base_channels * 4, time_emb_dim)
self.dec2 = ResidualBlock(base_channels * 4 + base_channels * 2, base_channels * 2, time_emb_dim)
self.dec1 = ResidualBlock(base_channels * 2 + base_channels, base_channels, time_emb_dim)
# Output
self.out = nn.Conv2d(base_channels, out_channels, 1)
def forward(self, x, t):
# Timestep embedding
time_emb = self.time_mlp(t)
# Encoder
enc1 = self.enc1(x, time_emb)
enc2 = self.enc2(self.pool(enc1), time_emb)
enc3 = self.enc3(self.pool(enc2), time_emb)
# Bottleneck
bottleneck = self.bottleneck(self.pool(enc3), time_emb)
# Decoder with skip connections
dec3 = self.dec3(torch.cat([self.upsample(bottleneck), enc3], dim=1), time_emb)
dec2 = self.dec2(torch.cat([self.upsample(dec3), enc2], dim=1), time_emb)
dec1 = self.dec1(torch.cat([self.upsample(dec2), enc1], dim=1), time_emb)
# Output
return self.out(dec1)
1. Sinusoidal Embeddings: Encode timestep t as continuous vector
2. Residual Blocks: Conv + GroupNorm + SiLU activation + skip connection
3. Time Conditioning: Inject timestep embedding into each block
4. Skip Connections: Preserve high-frequency details from encoder
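A quick shape check of the model above (the batch size and timesteps are arbitrary):
model = UNet()
x = torch.randn(4, 3, 64, 64)       # a batch of noisy images
t = torch.randint(0, 1000, (4,))    # one timestep per image
eps_pred = model(x, t)
print(eps_pred.shape)               # torch.Size([4, 3, 64, 64]), same shape as the input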

DDPM sampling: Requires 1000 steps (slow!)
Key insight: Deterministic sampling with fewer steps
DDIM sampling (non-Markovian):
# DDIM can skip timesteps: e.g. ~10 evenly spaced steps instead of 1000
timesteps = [1000, 900, 800, ..., 100, 0]

for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    # Predict noise
    eps_pred = model(x_t, t)

    # Predict x_0 directly from x_t and the predicted noise
    alpha_t_bar = alphas_cumprod[t]
    x_0_pred = (x_t - torch.sqrt(1 - alpha_t_bar) * eps_pred) / torch.sqrt(alpha_t_bar)

    # Step to x_{t_prev} deterministically (eta = 0)
    alpha_t_prev_bar = alphas_cumprod[t_prev]
    x_t_prev = torch.sqrt(alpha_t_prev_bar) * x_0_pred + torch.sqrt(1 - alpha_t_prev_bar) * eps_pred
    x_t = x_t_prev
Benefits:
- 10-50× fewer sampling steps with little loss in quality
- Deterministic: the same starting noise always maps to the same image
- Enables meaningful interpolation between samples in noise space
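In the diffusers library, DDIM is available as a scheduler that selects a subset of the training timesteps; a small sketch (the 50-step choice is arbitrary):
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)      # sample with only 50 of the 1000 training steps
print(scheduler.timesteps[:5])   # e.g. tensor([980, 960, 940, 920, 900])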
Goal: Generate images conditioned on class label y
Naive approach: Train class-conditional model pθ(x|y)
Classifier guidance approach:
ε̂(xₜ, t) = εθ(xₜ, t) - √(1-ᾱₜ) ∇ₓₜ log p(y|xₜ)   (the guided noise prediction)
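A hedged sketch of that gradient term, assuming a hypothetical noise-aware classifier(x_t, t) that returns class logits:
import torch

def classifier_grad(classifier, x_t, t, y):
    """Gradient of log p(y | x_t) with respect to x_t."""
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
    selected = log_probs[torch.arange(x_t.shape[0]), y].sum()
    return torch.autograd.grad(selected, x_in)[0]

# Guided noise prediction (alpha_bar_t comes from the noise schedule):
# eps_guided = eps_pred - torch.sqrt(1 - alpha_bar_t) * classifier_grad(classifier, x_t, t, y)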
Problem: Requires separate classifier p(y|xₜ)
Better approach: Train single model for both conditional and unconditional
Training:
# Randomly drop the conditioning ~10% of the time
if torch.rand(1).item() < 0.1:
    cond = None  # unconditional
else:
    cond = y     # conditional

# Train the model eps_theta(x_t, t, cond) with the usual noise-prediction loss
Sampling (guidance):
# Predict noise both ways
eps_uncond = model(x_t, t, cond=None)
eps_cond = model(x_t, t, cond=y)

# Guidance scale (w > 1 amplifies conditioning)
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
Effect of guidance scale w:
- w = 1: plain conditional sampling, no extra guidance
- w ≈ 7-10: good trade-off between prompt fidelity and image quality (typical default)
- Large w (> 15): very strong prompt adherence, but less diversity and often over-saturated images
Challenge: High-resolution images are expensive
Idea: Run diffusion in compressed latent space!
Architecture:
Text → CLIP Encoder → Text Embedding (conditions the U-Net via cross-attention)

Image → VAE Encoder → Latent z (64×64×4)
             ↓
   U-Net Diffusion (in latent space)
             ↓
   VAE Decoder → Image (512×512×3)
Benefits:
- The U-Net runs on a 64×64×4 latent instead of a 512×512×3 image (roughly 48× fewer values per sample)
- Much faster training and sampling, and far lower memory use
- Makes high-resolution text-to-image generation practical on a single GPU
Components:
1. VAE (Variational Autoencoder): compresses the image into the latent z and decodes the final latent back into a full-resolution image
2. CLIP Text Encoder: turns the text prompt into embeddings that condition the generation
3. U-Net with Cross-Attention: denoises the latent while attending to the text embeddings at each block
4. Scheduler: defines the noise schedule and the update rule for each denoising step (e.g., DDIM)
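With the diffusers library, all four components are attributes of the loaded pipeline; a short sketch for inspecting them (the printed class names are what current versions typically report):
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
)

print(type(pipe.vae).__name__)           # AutoencoderKL (1. VAE)
print(type(pipe.text_encoder).__name__)  # CLIPTextModel (2. text encoder)
print(type(pipe.unet).__name__)          # UNet2DConditionModel (3. U-Net with cross-attention)
print(type(pipe.scheduler).__name__)     # e.g. DDIMScheduler (4. scheduler)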

from diffusers import StableDiffusionPipeline
import torch
# Load model (Stable Diffusion 2.1)
model_id = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
# Generate image
prompt = "A beautiful landscape with mountains and a lake, digital art, highly detailed"
negative_prompt = "blurry, low quality, distorted"
image = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
num_inference_steps=50, # DDIM steps (lower = faster)
guidance_scale=7.5, # CFG scale
height=512,
width=512,
).images[0]
image.save("generated.png")
Key parameters:
- num_inference_steps: number of denoising steps (lower = faster, higher = usually better quality)
- guidance_scale: classifier-free guidance strength (7-10 is a common range)
- negative_prompt: describes what the image should not contain
- height / width: output resolution in pixels
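One useful habit is to fix the random seed and vary a single parameter at a time; a sketch sweeping the guidance scale (the prompt and seed are arbitrary, and pipe is the pipeline loaded above):
import torch

prompt = "A beautiful landscape with mountains and a lake, digital art"

for scale in [1.0, 7.5, 15.0]:
    # Same seed => same starting noise, so only the guidance strength changes
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(prompt, num_inference_steps=50, guidance_scale=scale,
                 generator=generator).images[0]
    image.save(f"landscape_cfg_{scale}.png")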
Use case: Transform existing images based on prompt
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(model_id)
pipe = pipe.to("cuda")

# Load initial image
init_image = Image.open("photo.jpg").convert("RGB").resize((512, 512))

# Generate variation
image = pipe(
    prompt="Turn this into a watercolor painting",
    image=init_image,
    strength=0.75,       # How much to transform (0-1)
    guidance_scale=7.5,
).images[0]
Strength parameter:
- Low (≈0.2-0.4): stays close to the original image, light restyling only
- Medium (≈0.5-0.8): keeps the overall composition but changes style and details
- 1.0: essentially ignores the input image and generates from scratch
Use case: Fill masked regions of image
from diffusers import StableDiffusionInpaintPipeline

# Inpainting needs an inpainting-specific checkpoint (its U-Net takes the mask as extra input)
pipe = StableDiffusionInpaintPipeline.from_pretrained("stabilityai/stable-diffusion-2-inpainting")
# Load image and mask
image = Image.open("photo.jpg")
mask = Image.open("mask.png") # White = inpaint, Black = keep
# Inpaint
result = pipe(
prompt="A red apple on a table",
image=image,
mask_image=mask,
).images[0]
Use case: Control generation with edge maps, poses, depth maps
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# ControlNet checkpoints are tied to a base model family; sd-controlnet-canny
# was trained for Stable Diffusion 1.5, so pair it with a matching 1.5 base model
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    model_id,  # should point to a Stable Diffusion 1.5 checkpoint here
    controlnet=controlnet,
)
# Generate with edge map guidance
image = pipe(
prompt="A beautiful house",
image=edge_map, # Canny edge detection of layout
).images[0]
| Aspect | VAE | GAN | Diffusion |
|---|---|---|---|
| Training Stability | ✅ Stable | ❌ Unstable | ✅ Very Stable |
| Sample Quality | ⚠️ Blurry | ✅ Sharp | ✅ Sharp |
| Mode Coverage | ✅ Good | ❌ Mode collapse | ✅ Excellent |
| Likelihood | ✅ Computable | ❌ No | ⚠️ Lower bound |
| Sampling Speed | ✅ Fast (1 step) | ✅ Fast (1 step) | ❌ Slow (50-1000 steps) |
| Controllability | ⚠️ Limited | ⚠️ Limited | ✅ Excellent (CFG) |
| Training Time | ✅ Fast | ⚠️ Medium | ❌ Slow |
When to use each:
- VAE: when you need fast sampling, a smooth latent space, or an explicit likelihood
- GAN: when you need sharp samples generated in a single fast forward pass
- Diffusion: when sample quality, mode coverage, and controllability matter more than sampling speed
Models:
Use cases:
Techniques:
Recent advances:
Challenge: Temporal consistency across frames
Methods:
Models:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Dataset: CIFAR-10 images scaled to [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])
train_dataset = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

# Initialize model
model = UNet(in_channels=3, out_channels=3).to("cuda")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Noise schedule
T = 1000
betas = torch.linspace(0.0001, 0.02, T).to("cuda")
alphas = 1 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

# Training loop
for epoch in range(100):
    for batch_idx, (x_0, _) in enumerate(train_loader):
        x_0 = x_0.to("cuda")

        # Sample random timestep
        t = torch.randint(0, T, (x_0.shape[0],), device="cuda")

        # Sample noise
        eps = torch.randn_like(x_0)

        # Add noise to x_0 (closed-form forward process)
        sqrt_alpha_bar = torch.sqrt(alphas_cumprod[t])[:, None, None, None]
        sqrt_one_minus_alpha_bar = torch.sqrt(1 - alphas_cumprod[t])[:, None, None, None]
        x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * eps

        # Predict noise
        eps_pred = model(x_t, t)

        # Compute loss
        loss = nn.MSELoss()(eps_pred, eps)

        # Backprop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch_idx % 100 == 0:
            print(f"Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")
1. Dataset: Start with CIFAR-10 (32x32) or CelebA (64x64)
2. Noise schedule: Cosine schedule works better than linear
3. Learning rate: 1e-4 with warmup (see the sketch after this list)
4. Batch size: As large as GPU allows (64-128)
5. Training time: 100K-1M steps for good results
6. GPU: T4 or better recommended
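For the warmup in tip 3, a minimal sketch using PyTorch's built-in LambdaLR (the 1,000-step warmup length is an assumption, not a recommendation from a specific paper):
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

warmup_steps = 1000  # assumed warmup length
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # linear ramp up to 1e-4
)

# Call scheduler.step() after each optimizer.step() in the training loop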
| Parameter | Typical Value | Effect |
|---|---|---|
| Timesteps (T) | 1000 | More steps = smoother process |
| Noise schedule | Cosine | Controls noise addition rate |
| Inference steps | 20-100 | More steps = better quality |
| Guidance scale | 7-10 | Higher = stronger conditioning |
| Learning rate | 1e-4 | Standard for diffusion |
| Batch size | 64-128 | Larger = more stable |