Practice and reinforce the concepts from Lesson 13
In this activity, you'll build a complete diffusion model from scratch, implementing the forward diffusion process, a U-Net denoising architecture, DDPM sampling, and DDIM acceleration. You'll train on MNIST digits, explore advanced techniques such as classifier-free guidance, and then experiment with a pretrained Stable Diffusion model.
By completing this activity, you will:
Download the activity template from the Templates folder:
AI25-Template-activity-13-diffusion-models.zip (Templates/AI25-Template-activity-13-diffusion-models.zip)
Upload activity-13-diffusion-models.ipynb to Google Colab.
Execute the first few cells to:
TODO 1: Implement linear and cosine noise schedules
```python
def linear_beta_schedule(timesteps, beta_start=0.0001, beta_end=0.02):
    """
    Linear noise schedule (original DDPM).

    Args:
        timesteps: Total number of timesteps (T)
        beta_start: Starting noise level
        beta_end: Ending noise level
    Returns:
        betas: (T,) array of noise levels
    """
    # TODO 1a: Implement linear schedule
    # betas = linspace(beta_start, beta_end, timesteps)
    # Your code here
    pass


def cosine_beta_schedule(timesteps, s=0.008):
    """
    Cosine noise schedule (improved).
    More gradual noise addition at the beginning and end.

    Args:
        timesteps: Total number of timesteps (T)
        s: Offset parameter
    Returns:
        betas: (T,) array of noise levels
    """
    # TODO 1b: Implement cosine schedule
    # Formula:
    #   f(t) = cos((t/T + s) / (1 + s) * π/2)^2
    #   ᾱ_t = f(t) / f(0)
    #   β_t = 1 - ᾱ_t / ᾱ_{t-1}
    # Your code here
    pass


def compute_alpha_bars(betas):
    """
    Compute the cumulative product of alphas.

    Args:
        betas: (T,) noise schedule
    Returns:
        alphas: 1 - betas
        alphas_cumprod: Cumulative product of alphas
    """
    # TODO 1c: Implement alpha computation
    # alphas = 1 - betas
    # alphas_cumprod = cumprod(alphas)
    # Your code here
    pass
```
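For reference, a minimal NumPy sketch of both schedules, following the formulas above (the function names here are illustrative, not part of the template):

```python
import numpy as np

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    # Evenly spaced noise levels from beta_start to beta_end
    return np.linspace(beta_start, beta_end, T)

def cosine_betas(T, s=0.008):
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2   # f(t)
    alpha_bar = f / f[0]                                  # ᾱ_t = f(t) / f(0)
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]            # β_t = 1 - ᾱ_t / ᾱ_{t-1}
    return np.clip(betas, 0, 0.999)                       # keep betas in a valid range

betas = cosine_betas(1000)
alphas = 1 - betas
alphas_cumprod = np.cumprod(alphas)
```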
TODO 2: Implement q(x_t | x_0) - add noise to images
```python
def forward_diffusion_sample(x_0, t, alphas_cumprod, device):
    """
    Sample from q(x_t | x_0) using the closed-form formula:
        x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε,   where ε ~ N(0, I)

    Args:
        x_0: Clean images (batch, channels, H, W)
        t: Timesteps (batch,) - indices from 0 to T-1
        alphas_cumprod: Cumulative alphas (T,)
        device: 'cuda' or 'cpu'
    Returns:
        x_t: Noisy images (same shape as x_0)
        noise: The noise that was added (same shape as x_0)
    """
    # TODO 2: Implement forward diffusion
    # Step 1: Gather sqrt(alphas_cumprod[t]) and sqrt(1 - alphas_cumprod[t])
    # Step 2: Sample noise ε ~ N(0, I)
    # Step 3: Compute x_t = sqrt_alphas * x_0 + sqrt_one_minus_alphas * ε
    # Your code here
    pass
```
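For reference, a minimal PyTorch sketch of the closed-form forward process (the function name is illustrative; the reshape broadcasts the per-sample ᾱ_t over the image dimensions):

```python
import torch

def forward_diffusion_sketch(x_0, t, alphas_cumprod):
    a_bar = alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)   # ᾱ_t per sample, (batch, 1, 1, 1)
    noise = torch.randn_like(x_0)                                # ε ~ N(0, I)
    x_t = a_bar.sqrt() * x_0 + (1 - a_bar).sqrt() * noise        # √ᾱ_t·x_0 + √(1-ᾱ_t)·ε
    return x_t, noise
```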
TODO 3: Implement sinusoidal timestep embeddings
```python
class SinusoidalPositionEmbeddings(nn.Module):
    """
    Encode timestep t as a continuous vector using sinusoidal functions.
    """
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        """
        Args:
            time: (batch,) - timestep indices
        Returns:
            Embeddings (batch, dim)
        """
        # TODO 3a: Implement sinusoidal embeddings
        # Formula:
        #   emb[i, 2k]   = sin(t / 10000^(2k/dim))
        #   emb[i, 2k+1] = cos(t / 10000^(2k/dim))
        # Your code here
        pass
```
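For reference, a minimal sketch of the embedding computation, assuming an even `dim`; it concatenates the sine and cosine halves rather than strictly interleaving them:

```python
import math
import torch

def sinusoidal_embedding_sketch(time, dim):
    half = dim // 2
    # Geometric frequency ladder from 1 down to ~1/10000
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=time.device) / half)
    args = time[:, None].float() * freqs[None, :]          # (batch, dim/2)
    return torch.cat([args.sin(), args.cos()], dim=-1)     # (batch, dim)
```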
TODO 4: Implement U-Net denoising model
```python
class DiffusionUNet(nn.Module):
    """
    U-Net architecture for diffusion models.
    Predicts the noise ε given the noisy image x_t and timestep t.
    """
    def __init__(self, in_channels=1, out_channels=1, time_emb_dim=128):
        super().__init__()
        # TODO 4: Implement U-Net architecture
        # Components:
        #   1. Timestep embedding MLP
        #   2. Encoder blocks (with timestep conditioning)
        #   3. Bottleneck
        #   4. Decoder blocks (with skip connections + timestep conditioning)
        #   5. Output projection
        # Your code here
        pass

    def forward(self, x, t):
        """
        Args:
            x: Noisy images (batch, channels, H, W)
            t: Timesteps (batch,)
        Returns:
            Predicted noise (batch, channels, H, W)
        """
        # TODO 4: Implement forward pass
        #   1. Encode timestep
        #   2. Encoder (save activations for skip connections)
        #   3. Bottleneck
        #   4. Decoder (concatenate skip connections, add timestep conditioning)
        #   5. Output
        # Your code here
        pass
```
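A full U-Net is too long to reproduce here, but the sketch below shows one common way to inject the timestep embedding into a convolutional block: project it to the channel dimension and add it as a per-channel bias. The block name and layer sizes are illustrative, not prescribed by the template.

```python
import torch
import torch.nn as nn

class TimeConditionedBlock(nn.Module):
    """Illustrative encoder/decoder block: two convolutions with the timestep
    embedding projected and added per-channel between them."""
    def __init__(self, in_ch, out_ch, time_emb_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.time_proj = nn.Linear(time_emb_dim, out_ch)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        h = h + self.time_proj(t_emb)[:, :, None, None]   # broadcast over H, W
        return self.act(self.conv2(h))
```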
TODO 5: Implement DDPM training loss
```python
def ddpm_loss(model, x_0, alphas_cumprod, device):
    """
    Compute the DDPM training loss:
        Loss = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]

    Args:
        model: Denoising U-Net
        x_0: Clean images (batch, channels, H, W)
        alphas_cumprod: Cumulative alphas
        device: 'cuda' or 'cpu'
    Returns:
        loss: MSE between true noise and predicted noise
    """
    # TODO 5: Implement training loss
    # Step 1: Sample random timesteps t ~ Uniform(0, T-1)
    # Step 2: Sample noise and create x_t using forward_diffusion_sample
    # Step 3: Predict noise: ε_pred = model(x_t, t)
    # Step 4: Compute MSE loss: ||ε - ε_pred||²
    # Your code here
    pass
```
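For reference, a minimal sketch of the simplified DDPM objective; it reuses the forward-diffusion sketch shown earlier and assumes `model` takes `(x_t, t)` and returns predicted noise:

```python
import torch
import torch.nn.functional as F

def ddpm_loss_sketch(model, x_0, alphas_cumprod):
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x_0.shape[0],), device=x_0.device)   # t ~ Uniform{0, ..., T-1}
    x_t, noise = forward_diffusion_sketch(x_0, t, alphas_cumprod)  # noisy images + true noise
    noise_pred = model(x_t, t)
    return F.mse_loss(noise_pred, noise)                           # ||ε - ε_pred||²
```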
TODO 6: Implement reverse diffusion sampling (1000 steps)
```python
@torch.no_grad()
def ddpm_sample(model, image_size, num_samples, timesteps, alphas, alphas_cumprod, betas, device):
    """
    Sample from the model using DDPM (1000 steps).
    Reverse process: x_T → x_{T-1} → ... → x_1 → x_0

    Args:
        model: Trained denoising U-Net
        image_size: (H, W) of output images
        num_samples: Number of images to generate
        timesteps: Total timesteps (T)
        alphas, alphas_cumprod, betas: Noise schedule parameters
        device: 'cuda' or 'cpu'
    Returns:
        Generated images (num_samples, channels, H, W)
    """
    # TODO 6: Implement DDPM sampling
    # Step 1: Start from pure noise x_T ~ N(0, I)
    # Step 2: For t = T-1 down to 0:
    #   a. Predict noise: ε_pred = model(x_t, t)
    #   b. Compute the mean of the reverse distribution
    #   c. Add noise (if t > 0)
    #   d. x_{t-1} = mean + noise
    # Your code here
    pass
```
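For reference, a minimal sketch of the reverse loop, assuming `betas`, `alphas`, and `alphas_cumprod` are PyTorch tensors on `device`; it uses β_t as the sampling variance (the posterior variance is another common choice):

```python
import torch

@torch.no_grad()
def ddpm_sample_sketch(model, shape, betas, alphas, alphas_cumprod, device):
    x = torch.randn(shape, device=device)                        # x_T ~ N(0, I)
    T = betas.shape[0]
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        eps = model(x, t)                                        # ε_pred
        coef = (1 - alphas[i]) / (1 - alphas_cumprod[i]).sqrt()
        mean = (x - coef * eps) / alphas[i].sqrt()               # mean of p(x_{t-1} | x_t)
        if i > 0:
            x = mean + betas[i].sqrt() * torch.randn_like(x)     # add noise except at the last step
        else:
            x = mean
    return x
```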
TODO 7: Implement DDIM for fast sampling (10-50 steps)
```python
@torch.no_grad()
def ddim_sample(model, image_size, num_samples, ddim_steps, alphas_cumprod, device):
    """
    Sample using DDIM (10-50 steps instead of 1000).
    Key insight: a non-Markovian, deterministic sampling process.

    Args:
        model: Trained denoising U-Net
        image_size: (H, W)
        num_samples: Number of images
        ddim_steps: Number of sampling steps (e.g., 50)
        alphas_cumprod: Cumulative alphas
        device: 'cuda' or 'cpu'
    Returns:
        Generated images (num_samples, channels, H, W)
    """
    # TODO 7: Implement DDIM sampling
    # Step 1: Create a timestep sequence that skips timesteps,
    #         e.g., [999, 949, 899, ..., 49, 0]
    # Step 2: Start from pure noise x_T ~ N(0, I)
    # Step 3: For each timestep pair (t, t_prev):
    #   a. Predict noise: ε_pred = model(x_t, t)
    #   b. Predict x_0: x_0_pred = (x_t - √(1-ᾱ_t) · ε_pred) / √ᾱ_t
    #   c. Compute x_{t_prev} deterministically
    # Step 4: Return x_0
    # Your code here
    pass
```
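For reference, a minimal sketch of deterministic DDIM (η = 0), assuming `alphas_cumprod` is a PyTorch tensor of length T:

```python
import torch

@torch.no_grad()
def ddim_sample_sketch(model, shape, ddim_steps, alphas_cumprod, device):
    T = alphas_cumprod.shape[0]
    seq = torch.linspace(T - 1, 0, ddim_steps).long().tolist()       # e.g. [999, ..., 0]
    x = torch.randn(shape, device=device)                            # x_T ~ N(0, I)
    for i, t in enumerate(seq):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[seq[i + 1]] if i + 1 < len(seq) else torch.ones_like(a_t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()           # predict the clean image
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps       # deterministic (η = 0) step
    return x
```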
Pre-built interface to Stable Diffusion:
Model: stabilityai/stable-diffusion-2-1
Features:
TODO 8: Implement classifier-free guidance for conditional generation
```python
class ConditionalDiffusionUNet(nn.Module):
    """
    U-Net with class conditioning for classifier-free guidance (CFG).
    """
    def __init__(self, num_classes=10, class_emb_dim=64, **kwargs):
        super().__init__()
        # TODO 8a: Add class conditioning to the U-Net
        #   1. Class embedding layer
        #   2. Combine with the timestep embedding
        #   3. Modify the U-Net to accept the combined embedding
        # Your code here
        pass

    def forward(self, x, t, class_labels=None):
        """
        Args:
            x: Noisy images (batch, channels, H, W)
            t: Timesteps (batch,)
            class_labels: (batch,) or None for unconditional
        Returns:
            Predicted noise (batch, channels, H, W)
        """
        # TODO 8a: Implement conditional forward pass
        # Your code here
        pass


@torch.no_grad()
def classifier_free_guidance_sample(model, class_label, guidance_scale=7.5, **kwargs):
    """
    Sample with classifier-free guidance.
    Formula: ε_guided = ε_uncond + w · (ε_cond - ε_uncond)

    Args:
        model: Conditional U-Net
        class_label: Target class (0-9 for MNIST)
        guidance_scale: w (higher = stronger conditioning)
    Returns:
        Generated image
    """
    # TODO 8b: Implement CFG sampling
    # For each timestep:
    #   1. Predict noise unconditionally: ε_uncond = model(x_t, t, class_labels=None)
    #   2. Predict noise conditionally:   ε_cond   = model(x_t, t, class_labels=class_label)
    #   3. Combine: ε_guided = ε_uncond + w · (ε_cond - ε_uncond)
    #   4. Use ε_guided for the denoising step
    # Your code here
    pass
```
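For reference, a minimal sketch of the per-step guidance combination, assuming the conditional model signature `model(x, t, class_labels=None)` from the stub above:

```python
import torch

@torch.no_grad()
def guided_noise(model, x_t, t, class_labels, guidance_scale=7.5):
    eps_uncond = model(x_t, t, class_labels=None)          # unconditional prediction
    eps_cond = model(x_t, t, class_labels=class_labels)    # class-conditional prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)   # ε_uncond + w·(ε_cond − ε_uncond)
```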
Linear vs Cosine Comparison:
Timestep | Linear β | Cosine β
---------|----------|----------
0 | 0.0001 | 0.0001
250 | 0.0050 | 0.0015
500 | 0.0100 | 0.0035
750 | 0.0150 | 0.0105
1000 | 0.0200 | 0.0200
✓ Cosine more gradual early on
✓ Both reach same endpoint
Forward Diffusion Visualization:
t=0: Clear digit "7"
t=250: Slightly noisy
t=500: Very noisy
t=750: Mostly noise
t=1000: Pure noise (indistinguishable from random)
✓ Gradual noise addition
Training Progress (10 epochs on MNIST):
Epoch 1: Loss = 0.152
Epoch 5: Loss = 0.045
Epoch 10: Loss = 0.032
✓ Loss decreases smoothly
✓ Model learns to denoise
Generation Quality (1000 steps):
Generated 64 MNIST digits:
✓ All recognizable (0-9)
✓ High quality, minimal artifacts
✓ Diverse within each class
Sampling time: 45 seconds (T4 GPU)
Speed vs Quality Trade-off:
Steps | Time | Quality (FID)
------|----------|---------------
1000 | 45s | 8.2 (excellent)
100 | 5s | 9.1 (excellent)
50 | 2.5s | 11.3 (good)
10 | 0.5s | 18.7 (moderate)
✓ 50 steps: 18× faster, minimal quality loss
✓ 100 steps: 9× faster, negligible quality loss
Guidance Scale Effects (class "7"):
w = 0: Random digits (ignores class)
w = 1: Recognizable "7"
w = 3: Clear, typical "7"
w = 7: Very clear, canonical "7"
w = 15: Oversaturated "7" (artifacts)
✓ Optimal guidance scale: 5-10
Your implementation is complete when:
Common Issues:
1. Loss Not Decreasing:
2. Generated Samples are Noise:
3. Samples Have Artifacts:
Effective Prompts:
✅ "A beautiful landscape with mountains and a lake, digital art, highly detailed, 4k"
✅ "Portrait of a cat wearing a wizard hat, oil painting, warm lighting"
✅ "Futuristic city at sunset, cyberpunk, neon lights, photorealistic"
❌ "cat" (too vague)
❌ "make it good" (not descriptive)
Negative Prompts:
Common negatives: "blurry, low quality, distorted, ugly, bad anatomy"
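The snippet below is a minimal sketch of how prompts and negative prompts are typically passed to a Hugging Face `diffusers` pipeline; the parameter values shown are illustrative, not required settings:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion 2.1 pipeline referenced above (fp16 to fit a T4 GPU).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="Portrait of a cat wearing a wizard hat, oil painting, warm lighting",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("cat_wizard.png")
```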
| Parameter | Recommended | Effect |
|---|---|---|
| Timesteps (T) | 1000 | Standard for DDPM |
| DDIM steps | 50-100 | Balance speed/quality |
| Learning rate | 1e-4 | Standard for diffusion |
| Guidance scale | 7-10 | Higher = stronger conditioning |
| Batch size | 128 | Larger = more stable |
Run diffusion in VAE latent space:
```python
# Encode to latent
z = vae.encode(x)
# Diffusion in latent space (64×64×4 instead of 512×512×3)
z_noisy = forward_diffusion(z, t)
z_denoised = reverse_diffusion(z_noisy)
# Decode to image
x_generated = vae.decode(z_denoised)
```
Benefit: the latent grid has 48× fewer elements than the pixel grid (64×64×4 vs 512×512×3), so diffusion runs far faster.
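As a sketch of what this looks like with a pretrained `diffusers` VAE (the checkpoint name, the Stable Diffusion 0.18215 latent scaling factor, and the `denoise_in_latent_space` placeholder are assumptions for illustration):

```python
import torch
from diffusers import AutoencoderKL

# Encode to the VAE latent space, run diffusion there, then decode back to pixels.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda").eval()

@torch.no_grad()
def latent_roundtrip(images, denoise_in_latent_space):
    z = vae.encode(images).latent_dist.sample() * 0.18215   # (B, 4, H/8, W/8)
    z = denoise_in_latent_space(z)                           # your diffusion loop, in latent space
    return vae.decode(z / 0.18215).sample                    # back to pixel space
```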
Fill masked regions:
```python
def inpaint(image, mask, model):
    """
    Fill the masked region using diffusion.

    During sampling:
        - Keep the unmasked region fixed
        - Only denoise the masked region
    """
    pass
```
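One common approach (in the spirit of RePaint) is sketched below: at each sampling step, the known region is replaced with a freshly noised copy of the original image, so only the masked region is actually generated. It reuses the forward-diffusion sketch from earlier; the mask convention is an assumption (1 = generate, 0 = keep).

```python
import torch

@torch.no_grad()
def inpaint_step(x_t_denoised, image, mask, t, alphas_cumprod):
    # Noise the known image to the current level t, then keep it outside the mask.
    known_noisy, _ = forward_diffusion_sketch(image, t, alphas_cumprod)
    return mask * x_t_denoised + (1 - mask) * known_noisy
```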
Add spatial control (edges, pose, depth):
```python
class ControlNet(nn.Module):
    """
    Add a spatial condition (edge map, pose) to diffusion.
    """
    pass
```
Extend to video generation:
```python
class VideoDiffusionUNet(nn.Module):
    """
    3D U-Net for temporal consistency.
    Input: (batch, channels, frames, H, W)
    """
    pass
```
Completed Notebook: activity-13-diffusion-models.ipynb
Generated Samples:
Speed Comparison:
Stable Diffusion Experiments:
Analysis (7-10 sentences):
Next Activity: Activity 14 - Transformer Architectures for Generation
This activity is graded on:
Passing Grade: 70% or higher
Congratulations on mastering diffusion models! 🎉🎨