By completing this activity, you will:
Understand diffusion processes (forward and reverse)
Implement DDPM (Denoising Diffusion Probabilistic Models) from scratch
Master noise scheduling (β variance schedule)
Apply U-Net architecture for denoising
Implement ancestral sampling from pure noise
Explore classifier-free guidance for conditional generation
Open in Google Colab : Upload this notebook to Google Colab
Run All Cells : Click Runtime -> Run all (or press Ctrl+F9)
Watch the Magic : You'll see:
✅ MNIST dataset loaded (handwritten digits)
✅ Noise scheduler initialized (linear β schedule)
✅ U-Net architecture built (time-conditional denoising)
✅ Denoising visualization (step-by-step noise removal)
Expected First Run Time : ~90 seconds (CPU), ~30 seconds (GPU)
The template comes with 65-70% working code :
✅ Noise Scheduler : Linear β schedule from 1e-4 to 0.02 over 1000 timesteps (see the sketch after this list)
✅ U-Net Architecture : Time-conditional denoising network with skip connections
✅ MNIST Dataset : 60K training images (28x28 grayscale digits)
✅ Visualization Tools : Step-by-step denoising animations
✅ Sampling Loop : Framework for generating images from noise
✅ Training Framework : Loss calculation structure and optimizer setup
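For reference, the provided linear schedule boils down to a few lines of PyTorch. A minimal sketch (names such as `T`, `betas`, and `alphas_cumprod` are assumptions; the notebook's variables may differ):

```python
import torch

T = 1000  # number of diffusion timesteps

# Linear beta schedule from 1e-4 to 0.02, as in the DDPM paper
betas = torch.linspace(1e-4, 0.02, T)

# alpha_t = 1 - beta_t, and alpha_bar_t = prod of alpha_s for s <= t
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

print(alphas_cumprod[0])   # close to 1: almost no noise at t=0
print(alphas_cumprod[-1])  # close to 0: nearly pure noise at t=T-1
```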
⚠️ TODO 1 : Implement forward diffusion process (add noise at timestep t)
⚠️ TODO 2 : Implement reverse diffusion process (denoise from t to t-1)
⚠️ TODO 3 : Implement denoising loss (MSE between predicted and actual noise)
⚠️ TODO 4 : Add classifier-free guidance (conditional generation)
Location : Section 3 - "Forward Diffusion"
Current State : Noise scheduler exists but forward process not implemented
Your Task : Implement the forward diffusion equation to add noise at timestep t:
```
x_t = sqrt(α_bar_t) * x_0 + sqrt(1 - α_bar_t) * ε
```
Where:
x_0 = original clean image
x_t = noisy image at timestep t
α_bar_t = cumulative product of α_s = (1 - β_s) for s = 1 to t
ε = random Gaussian noise N(0, I)
Starter Code Provided :
```python
def forward_diffusion(x_0, t, alphas_cumprod, noise=None):
    """
    Add noise to image at timestep t.

    Args:
        x_0: Clean image [batch, channels, height, width]
        t: Timestep (0 to T-1)
        alphas_cumprod: Cumulative product of (1 - beta)
        noise: Optional pre-generated noise

    Returns:
        x_t: Noisy image at timestep t
        noise: Noise that was added
    """
    pass
```
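If you get stuck, the equation above maps onto code fairly directly. A possible sketch, not the reference solution, assuming `x_0` is normalized to [-1, 1] and `t` is a LongTensor of per-sample timesteps:

```python
import torch

def forward_diffusion(x_0, t, alphas_cumprod, noise=None):
    if noise is None:
        noise = torch.randn_like(x_0)  # epsilon ~ N(0, I)

    # Gather alpha_bar_t per sample and reshape for broadcasting over [B, C, H, W]
    alpha_bar_t = alphas_cumprod[t].view(-1, 1, 1, 1)

    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1.0 - alpha_bar_t) * noise
    return x_t, noise
```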
Success Criteria :
Location : Section 4 - "Reverse Diffusion"
Your Task : Implement the reverse diffusion equation to remove noise:
```
x_{t-1} = (1 / sqrt(α_t)) * (x_t - (β_t / sqrt(1 - α_bar_t)) * ε_θ(x_t, t)) + sqrt(β_t) * z
```
Where:
x_t = noisy image at timestep t
x_{t-1} = slightly less noisy image at t-1
ε_θ(x_t, t) = noise predicted by U-Net
β_t = noise variance at timestep t
α_t = 1 - β_t
z = random noise N(0, I) for stochasticity (z=0 for t=0)
Requirements :
Implement the denoising formula correctly
Handle t=0 special case (no added noise)
Use predicted noise from U-Net model
Apply correct variance schedule
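A sketch of one denoising step under these requirements (assumes `model` is the trained U-Net called as `model(x_t, t_batch)`; the notebook's interface may differ):

```python
import torch

@torch.no_grad()
def reverse_diffusion_step(model, x_t, t, betas, alphas, alphas_cumprod):
    """One denoising step x_t -> x_{t-1} for a single integer timestep t (sketch)."""
    beta_t = betas[t]
    alpha_t = alphas[t]
    alpha_bar_t = alphas_cumprod[t]

    # Predict the noise in x_t with the U-Net
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps_theta = model(x_t, t_batch)

    # Mean: (1 / sqrt(alpha_t)) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta)
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps_theta) / torch.sqrt(alpha_t)

    if t == 0:
        return mean  # special case: no noise added at the final step
    z = torch.randn_like(x_t)
    return mean + torch.sqrt(beta_t) * z
```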
Success Criteria :
Location : Section 5 - "Training Loss"
Your Task : Implement the loss function for training the denoising U-Net:
Loss Function :
```
L = MSE(ε, ε_θ(x_t, t))
```
Where:
ε = actual noise added in forward diffusion
ε_θ(x_t, t) = noise predicted by U-Net
MSE = Mean Squared Error
Training Procedure :
Sample random timestep t ~ Uniform(0, T-1)
Generate noise ε ~ N(0, I)
Create noisy image x_t using forward diffusion (TODO 1)
Predict noise ε_θ using U-Net
Calculate loss: MSE between ε and ε_θ
Backpropagate and update U-Net weights
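Put together, one training iteration following this procedure might look roughly like the sketch below (`unet`, `forward_diffusion`, and the call signatures are assumptions about the notebook's code):

```python
import torch
import torch.nn.functional as F

def training_step(unet, optimizer, x_0, alphas_cumprod, T=1000):
    """One DDPM training step (sketch): predict the added noise and regress it with MSE."""
    batch_size = x_0.shape[0]

    # 1) Sample a random timestep per image
    t = torch.randint(0, T, (batch_size,), device=x_0.device)

    # 2-3) Create the noisy image with the forward process (TODO 1)
    x_t, noise = forward_diffusion(x_0, t, alphas_cumprod)

    # 4) Predict the noise with the U-Net
    noise_pred = unet(x_t, t)

    # 5) MSE between actual and predicted noise
    loss = F.mse_loss(noise_pred, noise)

    # 6) Backpropagate and update weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```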
Success Criteria :
Location : New section you'll create
Your Task : Implement classifier-free guidance for conditional generation
Concept : Generate specific digits (e.g., "generate a 7") by:
Training with class labels (digit 0-9)
Randomly drop labels 10% of time (unconditional training)
At sampling, blend conditional and unconditional predictions:
```
ε_guided = ε_uncond + w * (ε_cond - ε_uncond)
```
Where:
w = guidance scale (higher = stronger conditioning, typical w=2-7)
ε_cond = noise prediction conditioned on class label
ε_uncond = noise prediction with label dropped
Requirements :
Modify U-Net to accept class embeddings
Randomly drop labels during training (p=0.1)
Implement guidance formula in sampling loop
Test different guidance scales
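A sketch of how the guidance formula could be applied at sampling time, assuming the U-Net accepts an optional class label `y` where `y=None` means the label was dropped:

```python
import torch

def guided_noise_prediction(unet, x_t, t, label, w):
    """Classifier-free guidance at sampling time (sketch).

    Assumes the U-Net accepts a class label `y`, with y=None meaning
    "unconditional" (e.g., a learned null embedding).
    """
    eps_uncond = unet(x_t, t, y=None)   # prediction with the label dropped
    eps_cond = unet(x_t, t, y=label)    # prediction conditioned on the digit class
    return eps_uncond + w * (eps_cond - eps_uncond)

# During training, drop labels with probability 0.1 so the same network also
# learns the unconditional prediction, e.g.:
#   mask = torch.rand(labels.shape[0]) < 0.1
#   labels[mask] = NULL_CLASS   # an extra "no label" class index (assumption)
```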
Success Criteria :
Replace linear β schedule with cosine schedule (better quality):
Paper: "Improved DDPM" by Nichol & Dhariwal (2021)
Formula: α_bar_t = f(t) / f(0) where f(t) = cos((t/T + s) / (1 + s) * π/2)^2
Compare: Visual quality, training stability, convergence speed
Expected: Smoother noise addition, better image quality
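A sketch of the cosine schedule from the formula above (s=0.008 as in the paper; the betas are clipped near the end of the schedule for numerical stability):

```python
import math
import torch

def cosine_beta_schedule(T, s=0.008):
    """Cosine alpha_bar schedule (Improved DDPM), converted to per-step betas (sketch)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T + s) / (1 + s)) * math.pi / 2) ** 2
    alphas_cumprod = f / f[0]
    betas = 1.0 - alphas_cumprod[1:] / alphas_cumprod[:-1]
    return betas.clamp(max=0.999).float()
```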
Implement DDIM (Denoising Diffusion Implicit Models) for faster sampling:
Paper: Song et al. (2021)
Key: Deterministic sampling (remove stochastic term z)
Benefit: Generate images in 50 steps instead of 1000 (20x faster)
Formula: Use DDIM update rule instead of DDPM
Compare: Speed vs quality tradeoff
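A sketch of the deterministic (η=0) DDIM update, which replaces the stochastic DDPM step and allows jumping between a sparse subset of timesteps (the model call signature is an assumption):

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alphas_cumprod):
    """One deterministic DDIM update from timestep t to t_prev (eta=0, sketch)."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.ones_like(a_t)

    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = model(x_t, t_batch)

    # Predict x_0 from x_t and the predicted noise, then move it to timestep t_prev
    x0_pred = (x_t - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt(1.0 - a_prev) * eps
```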
Apply diffusion in latent space instead of pixel space:
Train VAE to compress images to 7x7 latent
Run diffusion on the latents (16x fewer spatial positions than 28x28 pixels)
Decode latents back to images
Compare: Training speed, memory usage, image quality
This is how Stable Diffusion works!
Replace digit labels with text captions:
Use CLIP to encode text prompts ("handwritten digit seven")
Condition U-Net on CLIP embeddings instead of class labels
Generate images from text descriptions
Requires: Understanding of CLIP, text encoding, cross-attention
Mini Stable Diffusion on MNIST!
Visualization: Pure Gaussian noise at t=1000
Mean: ~0, Std: ~1
No structure visible
t=0: Clear digit (original image)
t=250: Slightly noisy but recognizable
t=500: Very noisy, barely visible structure
t=1000: Pure noise, no structure
Start: Pure noise (t=1000)
t=750: Faint structure emerges
t=500: Digit shape visible
t=250: Clear digit with minor noise
t=0: Clean, recognizable digit
Epoch 1: Loss ~0.5, blurry generations
Epoch 5: Loss ~0.1, recognizable digits
Epoch 10: Loss ~0.05, high-quality digits
Success Rate: >80% of samples look like real MNIST
w=0: Random digits (unconditional)
w=2: Correct digit, natural looking
w=5: Very clear digit, slightly over-sharpened
w=10: Distorted, unnatural artifacts
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution : Check your noise scale. Ensure:
noise = torch.randn_like(x_0) (standard normal, not uniform)
Images are normalized to [-1, 1] range before diffusion
alpha_cumprod values are computed correctly
At t=1000, alpha_cumprod should be essentially zero (≈4e-5 with this linear schedule)
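A few quick asserts can confirm these points (names follow the sketches earlier in this guide and are assumptions about your notebook's variables):

```python
# Quick sanity checks for the forward process
assert x_0.min() >= -1.0 and x_0.max() <= 1.0      # images normalized to [-1, 1]
assert abs(alphas_cumprod[0].item() - 1.0) < 0.01  # almost no noise at t=0
assert alphas_cumprod[-1].item() < 1e-3            # essentially pure noise at the last step
```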
Solution :
Verify U-Net is actually trained (not random weights)
Check variance schedule: beta should be small (1e-4 to 0.02)
Ensure you're using predicted noise correctly in formula
Try more sampling steps (use all 1000 steps, don't skip)
Visualize intermediate steps to see where it fails
Solution :
Reduce learning rate (try 1e-4 instead of 1e-3)
Check loss calculation: Should be MSE between predicted and actual noise
Verify forward diffusion is working (plot noisy images)
Ensure U-Net output shape matches noise shape
Check for NaN gradients (gradient clipping may help)
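If NaNs do appear, gradient clipping is a one-line addition between `loss.backward()` and `optimizer.step()` (assuming the model is named `unet`):

```python
torch.nn.utils.clip_grad_norm_(unet.parameters(), max_norm=1.0)
```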
Solution :
Train longer (more epochs)
Use larger U-Net (more channels/layers)
Try cosine schedule instead of linear
Reduce noise variance (smaller beta_max)
Use DDIM sampling for deterministic generation
Solution :
Verify label dropping during training (10% unconditional)
Check guidance scale: Start with w=3, adjust from there
Ensure unconditional embedding is actually different (not just zeros)
Train longer with conditional data
Visualize conditional vs unconditional predictions
Solution :
Reduce batch size (try 32 instead of 128)
Use gradient checkpointing in U-Net
Train on GPU (Colab: Runtime -> Change runtime type -> GPU)
Reduce image size (use 16x16 instead of 28x28 for testing)
Reduce U-Net channels (use 32 base channels instead of 64)
Concept 13 : Diffusion Models (theory of forward/reverse processes)
Concept 14 : Stable Diffusion (latent diffusion architecture)
Concept 15 : Text-to-Image Models (CLIP, cross-attention)
Sohl-Dickstein et al. (2015): "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" - original diffusion paper
Ho et al. (2020): "Denoising Diffusion Probabilistic Models (DDPM)" - foundational DDPM paper (this activity)
Song et al. (2021): "Denoising Diffusion Implicit Models (DDIM)" - faster sampling
Nichol & Dhariwal (2021): "Improved Denoising Diffusion Probabilistic Models" - cosine schedule
Rombach et al. (2022): "High-Resolution Image Synthesis with Latent Diffusion Models" - Stable Diffusion architecture
Ho & Salimans (2022): "Classifier-Free Diffusion Guidance" - conditional generation (TODO 4)
- clean reference
HuggingFace DDPM Training - production code
- U-Net variants
Complete required TODOs (minimum: TODO 1-3)
Run entire notebook to generate all outputs
Export visualizations : Save forward/reverse diffusion animations
Generate samples : Create grid of 16 generated digits
Download notebook : File -> Download -> Download .ipynb
Submit via portal : Upload .ipynb and sample images
Submission Checklist :
After mastering diffusion models:
Move to Activity 14: Stable Diffusion
Learn latent diffusion and text conditioning
Build text-to-image generator (Project 5)
Key Insight : Diffusion models generate images by learning to remove noise. DDPM works on pixels (this activity), but Stable Diffusion works on latents (64x more efficient). Master DDPM first, then scale to Stable Diffusion!
Important Concepts to Remember :
Forward process : Gradually add noise (Markov chain)
Reverse process : Learned denoising (trained U-Net)
Noise prediction : Model predicts ε, not x_0
Reparameterization : Closed-form forward diffusion (any timestep)
Variance schedule : Controls noise addition rate
Ancestral sampling : Stochastic reverse process
Classifier-free guidance : Conditional generation without classifier
Good luck! Diffusion models are the state-of-the-art in image generation. Master DDPM, and you'll understand the foundation of Stable Diffusion, DALL-E 2, and Imagen! 🚀