By completing this activity, you will:
Understand diffusion processes (forward and reverse)
Implement DDPM (Denoising Diffusion Probabilistic Models) from scratch
Master noise scheduling (β variance schedule)
Apply U-Net architecture for denoising
Implement ancestral sampling from pure noise
Explore classifier-free guidance for conditional generation
Open in Google Colab : Upload this notebook to Google Colab
Run All Cells : Click Runtime -> Run all (or press Ctrl+F9)
Watch the Magic : You'll see:
✅ MNIST dataset loaded (handwritten digits)
✅ Noise scheduler initialized (linear β schedule)
✅ U-Net architecture built (time-conditional denoising)
✅ Denoising visualization (step-by-step noise removal)
Expected First Run Time : ~90 seconds (CPU), ~30 seconds (GPU)
The template comes with 65-70% working code :
✅ Noise Scheduler : Linear β schedule from 1e-4 to 0.02 over 1000 timesteps (see the sketch after this list)
✅ U-Net Architecture : Time-conditional denoising network with skip connections
✅ MNIST Dataset : 60K training images (28x28 grayscale digits)
✅ Visualization Tools : Step-by-step denoising animations
✅ Sampling Loop : Framework for generating images from noise
✅ Training Framework : Loss calculation structure and optimizer setup
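For reference, the provided linear schedule boils down to a few lines of PyTorch. A minimal sketch (names such as `T`, `betas`, and `alphas_cumprod` are assumptions; the notebook's variables may differ):

```python
import torch

T = 1000  # number of diffusion timesteps

# Linear beta schedule from 1e-4 to 0.02, as in the DDPM paper
betas = torch.linspace(1e-4, 0.02, T)

# alpha_t = 1 - beta_t, and alpha_bar_t = prod of alpha_s for s <= t
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

print(alphas_cumprod[0])   # close to 1: almost no noise at t=0
print(alphas_cumprod[-1])  # close to 0: nearly pure noise at t=T-1
```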
⚠️ TODO 1 : Implement forward diffusion process (add noise at timestep t)
⚠️ TODO 2 : Implement reverse diffusion process (denoise from t to t-1)
⚠️ TODO 3 : Implement denoising loss (MSE between predicted and actual noise)
⚠️ TODO 4 : Add classifier-free guidance (conditional generation)
Location : Section 3 - "Forward Diffusion"
Current State : Noise scheduler exists but forward process not implemented
Your Task : Implement the forward diffusion equation to add noise at timestep t:
```
x_t = sqrt(α_bar_t) * x_0 + sqrt(1 - α_bar_t) * ε
```
Where:
x_0 = original clean image
x_t = noisy image at timestep t
α_bar_t = cumulative product of α_s = (1 - β_s) for s = 1 to t
ε = random Gaussian noise N(0, I)
Starter Code Provided :
```python
def forward_diffusion(x_0, t, alphas_cumprod, noise=None):
    """
    Add noise to image at timestep t.

    Args:
        x_0: Clean image [batch, channels, height, width]
        t: Timestep (0 to T-1)
        alphas_cumprod: Cumulative product of (1 - beta)
        noise: Optional pre-generated noise

    Returns:
        x_t: Noisy image at timestep t
        noise: Noise that was added
    """
    pass
```
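If you get stuck, the equation above maps onto code fairly directly. A possible sketch, not the reference solution, assuming `x_0` is normalized to [-1, 1] and `t` is a LongTensor of per-sample timesteps:

```python
import torch

def forward_diffusion(x_0, t, alphas_cumprod, noise=None):
    if noise is None:
        noise = torch.randn_like(x_0)  # epsilon ~ N(0, I)

    # Gather alpha_bar_t per sample and reshape for broadcasting over [B, C, H, W]
    alpha_bar_t = alphas_cumprod[t].view(-1, 1, 1, 1)

    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1.0 - alpha_bar_t) * noise
    return x_t, noise
```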
Success Criteria :
Location : Section 4 - "Reverse Diffusion"
Your Task : Implement the reverse diffusion equation to remove noise:
```
x_{t-1} = (1 / sqrt(α_t)) * (x_t - (β_t / sqrt(1 - α_bar_t)) * ε_θ(x_t, t)) + sqrt(β_t) * z
```
Where:
x_t = noisy image at timestep t
x_{t-1} = slightly less noisy image at t-1
ε_θ(x_t, t) = noise predicted by U-Net
β_t = noise variance at timestep t
α_t = 1 - β_t
z = random noise N(0, I) for stochasticity (z=0 for t=0)
Requirements :
Implement the denoising formula correctly
Handle t=0 special case (no added noise)
Use predicted noise from U-Net model
Apply correct variance schedule
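A sketch of one denoising step under these requirements (assumes `model` is the trained U-Net called as `model(x_t, t_batch)`; the notebook's interface may differ):

```python
import torch

@torch.no_grad()
def reverse_diffusion_step(model, x_t, t, betas, alphas, alphas_cumprod):
    """One denoising step x_t -> x_{t-1} for a single integer timestep t (sketch)."""
    beta_t = betas[t]
    alpha_t = alphas[t]
    alpha_bar_t = alphas_cumprod[t]

    # Predict the noise in x_t with the U-Net
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps_theta = model(x_t, t_batch)

    # Mean: (1 / sqrt(alpha_t)) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta)
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps_theta) / torch.sqrt(alpha_t)

    if t == 0:
        return mean  # special case: no noise added at the final step
    z = torch.randn_like(x_t)
    return mean + torch.sqrt(beta_t) * z
```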
Success Criteria :
Location : Section 5 - "Training Loss"
Your Task : Implement the loss function for training the denoising U-Net:
Loss Function :
```
L = MSE(ε, ε_θ(x_t, t))
```
Where:
ε = actual noise added in forward diffusion
ε_θ(x_t, t) = noise predicted by U-Net
MSE = Mean Squared Error
Training Procedure :
Sample random timestep t ~ Uniform(0, T-1)
Generate noise ε ~ N(0, I)
Create noisy image x_t using forward diffusion (TODO 1)
Predict noise ε_θ using U-Net
Calculate loss: MSE between ε and ε_θ
Backpropagate and update U-Net weights
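Put together, one training iteration following this procedure might look roughly like the sketch below (`unet`, `forward_diffusion`, and the call signatures are assumptions about the notebook's code):

```python
import torch
import torch.nn.functional as F

def training_step(unet, optimizer, x_0, alphas_cumprod, T=1000):
    """One DDPM training step (sketch): predict the added noise and regress it with MSE."""
    batch_size = x_0.shape[0]

    # 1) Sample a random timestep per image
    t = torch.randint(0, T, (batch_size,), device=x_0.device)

    # 2-3) Create the noisy image with the forward process (TODO 1)
    x_t, noise = forward_diffusion(x_0, t, alphas_cumprod)

    # 4) Predict the noise with the U-Net
    noise_pred = unet(x_t, t)

    # 5) MSE between actual and predicted noise
    loss = F.mse_loss(noise_pred, noise)

    # 6) Backpropagate and update weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```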
Success Criteria :
Location : New section you'll create
Your Task : Implement classifier-free guidance for conditional generation
Concept : Generate specific digits (e.g., "generate a 7") by:
Training with class labels (digit 0-9)
Randomly drop labels 10% of time (unconditional training)
At sampling, blend conditional and unconditional predictions:
```
ε_guided = ε_uncond + w * (ε_cond - ε_uncond)
```
Where:
w = guidance scale (higher = stronger conditioning, typical w=2-7)
ε_cond = noise prediction conditioned on class label
ε_uncond = noise prediction with label dropped
Requirements :
Modify U-Net to accept class embeddings
Randomly drop labels during training (p=0.1)
Implement guidance formula in sampling loop
Test different guidance scales
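A sketch of how the guidance formula could be applied at sampling time, assuming the U-Net accepts an optional class label `y` where `y=None` means the label was dropped:

```python
import torch

def guided_noise_prediction(unet, x_t, t, label, w):
    """Classifier-free guidance at sampling time (sketch).

    Assumes the U-Net accepts a class label `y`, with y=None meaning
    "unconditional" (e.g., a learned null embedding).
    """
    eps_uncond = unet(x_t, t, y=None)   # prediction with the label dropped
    eps_cond = unet(x_t, t, y=label)    # prediction conditioned on the digit class
    return eps_uncond + w * (eps_cond - eps_uncond)

# During training, drop labels with probability 0.1 so the same network also
# learns the unconditional prediction, e.g.:
#   mask = torch.rand(labels.shape[0]) < 0.1
#   labels[mask] = NULL_CLASS   # an extra "no label" class index (assumption)
```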
Success Criteria :
Replace linear β schedule with cosine schedule (better quality):
Paper: "Improved DDPM" by Nichol & Dhariwal (2021)
Formula: α_bar_t = f(t) / f(0) where f(t) = cos((t/T + s) / (1 + s) * π/2)^2
Compare: Visual quality, training stability, convergence speed
Expected: Smoother noise addition, better image quality
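A sketch of the cosine schedule from the formula above (s=0.008 as in the paper; the betas are clipped near the end of the schedule for numerical stability):

```python
import math
import torch

def cosine_beta_schedule(T, s=0.008):
    """Cosine alpha_bar schedule (Improved DDPM), converted to per-step betas (sketch)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T + s) / (1 + s)) * math.pi / 2) ** 2
    alphas_cumprod = f / f[0]
    betas = 1.0 - alphas_cumprod[1:] / alphas_cumprod[:-1]
    return betas.clamp(max=0.999).float()
```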
Implement DDIM (Denoising Diffusion Implicit Models) for faster sampling:
Paper: Song et al. (2021)
Key: Deterministic sampling (remove stochastic term z)
Benefit: Generate images in 50 steps instead of 1000 (20x faster)
Formula: Use DDIM update rule instead of DDPM
Compare: Speed vs quality tradeoff
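A sketch of the deterministic (η=0) DDIM update, which replaces the stochastic DDPM step and allows jumping between a sparse subset of timesteps (the model call signature is an assumption):

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, alphas_cumprod):
    """One deterministic DDIM update from timestep t to t_prev (eta=0, sketch)."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.ones_like(a_t)

    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = model(x_t, t_batch)

    # Predict x_0 from x_t and the predicted noise, then move it to timestep t_prev
    x0_pred = (x_t - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt(1.0 - a_prev) * eps
```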
Apply diffusion in latent space instead of pixel space:
Train VAE to compress images to 7x7 latent
Run diffusion on the latents (16x fewer spatial positions than 28x28 pixels)
Decode latents back to images
Compare: Training speed, memory usage, image quality
This is how Stable Diffusion works!
Replace digit labels with text captions:
Use CLIP to encode text prompts ("handwritten digit seven")
Condition U-Net on CLIP embeddings instead of class labels
Generate images from text descriptions
Requires: Understanding of CLIP, text encoding, cross-attention
Mini Stable Diffusion on MNIST!
Visualization: Pure Gaussian noise at t=1000
Mean: ~0, Std: ~1
No structure visible
t=0: Clear digit (original image)
t=250: Slightly noisy but recognizable
t=500: Very noisy, barely visible structure
t=1000: Pure noise, no structure
Start: Pure noise (t=1000)
t=750: Faint structure emerges
t=500: Digit shape visible
t=250: Clear digit with minor noise
t=0: Clean, recognizable digit
Epoch 1: Loss ~0.5, blurry generations
Epoch 5: Loss ~0.1, recognizable digits
Epoch 10: Loss ~0.05, high-quality digits
Success Rate: >80% of samples look like real MNIST
w=0: Random digits (unconditional)
w=2: Correct digit, natural looking
w=5: Very clear digit, slightly over-sharpened
w=10: Distorted, unnatural artifacts
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution : Check your noise scale. Ensure:
noise = torch.randn_like(x_0) (standard normal, not uniform)
Images are normalized to [-1, 1] range before diffusion
alpha_cumprod values are computed correctly
At t=1000, alpha_cumprod should be essentially zero (≈4e-5 with this linear schedule)
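A few quick asserts can confirm these points (names follow the sketches earlier in this guide and are assumptions about your notebook's variables):

```python
# Quick sanity checks for the forward process
assert x_0.min() >= -1.0 and x_0.max() <= 1.0      # images normalized to [-1, 1]
assert abs(alphas_cumprod[0].item() - 1.0) < 0.01  # almost no noise at t=0
assert alphas_cumprod[-1].item() < 1e-3            # essentially pure noise at the last step
```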
Solution :
Verify U-Net is actually trained (not random weights)
Check variance schedule: beta should be small (1e-4 to 0.02)
Ensure you're using predicted noise correctly in formula
Try more sampling steps (use all 1000 steps, don't skip)
Visualize intermediate steps to see where it fails
Solution :
Reduce learning rate (try 1e-4 instead of 1e-3)
Check loss calculation: Should be MSE between predicted and actual noise
Verify forward diffusion is working (plot noisy images)
Ensure U-Net output shape matches noise shape
Check for NaN gradients (gradient clipping may help)
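If NaNs do appear, gradient clipping is a one-line addition between `loss.backward()` and `optimizer.step()` (assuming the model is named `unet`):

```python
torch.nn.utils.clip_grad_norm_(unet.parameters(), max_norm=1.0)
```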
Solution :
Train longer (more epochs)
Use larger U-Net (more channels/layers)
Try cosine schedule instead of linear
Reduce noise variance (smaller beta_max)
Use DDIM sampling for deterministic generation
Solution :
Verify label dropping during training (10% unconditional)
Check guidance scale: Start with w=3, adjust from there
Ensure unconditional embedding is actually different (not just zeros)
Train longer with conditional data
Visualize conditional vs unconditional predictions
Solution :
Reduce batch size (try 32 instead of 128)
Use gradient checkpointing in U-Net
Train on GPU (Colab: Runtime -> Change runtime type -> GPU)
Reduce image size (use 16x16 instead of 28x28 for testing)
Reduce U-Net channels (use 32 base channels instead of 64)
Concept 13 : Diffusion Models (theory of forward/reverse processes)
Concept 14 : Stable Diffusion (latent diffusion architecture)
Concept 15 : Text-to-Image Models (CLIP, cross-attention)
Sohl-Dickstein et al. (2015): "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" - original diffusion paper
Ho et al. (2020): "Denoising Diffusion Probabilistic Models (DDPM)" - foundational DDPM paper (this activity)
Song et al. (2021): "Denoising Diffusion Implicit Models (DDIM)" - faster sampling
Nichol & Dhariwal (2021): "Improved Denoising Diffusion Probabilistic Models" - cosine schedule
Rombach et al. (2022): "High-Resolution Image Synthesis with Latent Diffusion Models" - Stable Diffusion architecture
Ho & Salimans (2022): "Classifier-Free Diffusion Guidance" - conditional generation (TODO 4)
- clean reference
HuggingFace DDPM Training - production code
- U-Net variants
Complete required TODOs (minimum: TODO 1-3)
Run entire notebook to generate all outputs
Export visualizations : Save forward/reverse diffusion animations
Generate samples : Create grid of 16 generated digits
Download notebook : File -> Download -> Download .ipynb
Submit via portal : Upload .ipynb and sample images
Submission Checklist :
After mastering diffusion models:
Move to Activity 14: Stable Diffusion
Learn latent diffusion and text conditioning
Build text-to-image generator (Project 5)
Key Insight : Diffusion models generate images by learning to remove noise. DDPM works on pixels (this activity), but Stable Diffusion works on latents (64x more efficient). Master DDPM first, then scale to Stable Diffusion!
Important Concepts to Remember :
Forward process : Gradually add noise (Markov chain)
Reverse process : Learned denoising (trained U-Net)
Noise prediction : Model predicts ε, not x_0
Reparameterization : Closed-form forward diffusion (any timestep)
Variance schedule : Controls noise addition rate
Ancestral sampling : Stochastic reverse process
Classifier-free guidance : Conditional generation without classifier
Good luck! Diffusion models are the state-of-the-art in image generation. Master DDPM, and you'll understand the foundation of Stable Diffusion, DALL-E 2, and Imagen! 🚀