Practice and reinforce the concepts from Lesson 13
In this activity, you'll build a complete diffusion model from scratch, implementing the forward diffusion process, a U-Net denoising architecture, DDPM sampling, and DDIM acceleration. You'll train on MNIST digits, explore advanced techniques such as classifier-free guidance, and then experiment with a pretrained Stable Diffusion model.
By completing this activity, you will:
Download the activity template from the Templates folder:
AI25-Template-activity-13-diffusion-models.zip (Templates/AI25-Template-activity-13-diffusion-models.zip)
Upload activity-13-diffusion-models.ipynb to Google Colab.
Execute the first few cells to:
TODO 1: Implement linear and cosine noise schedules
```python
def linear_beta_schedule(timesteps, beta_start=0.0001, beta_end=0.02):
    """
    Linear noise schedule (original DDPM).

    Args:
        timesteps: Total number of timesteps (T)
        beta_start: Starting noise level
        beta_end: Ending noise level
    Returns:
        betas: (T,) array of noise levels
    """
    # TODO 1a: Implement linear schedule
    # betas = linspace(beta_start, beta_end, timesteps)
    # Your code here
    pass


def cosine_beta_schedule(timesteps, s=0.008):
    """
    Cosine noise schedule (improved).
    More gradual noise addition at the beginning and end.

    Args:
        timesteps: Total number of timesteps (T)
        s: Offset parameter
    Returns:
        betas: (T,) array of noise levels
    """
    # TODO 1b: Implement cosine schedule
    # Formula:
    #   f(t) = cos((t/T + s) / (1 + s) * π/2)^2
    #   ᾱ_t = f(t) / f(0)
    #   β_t = 1 - ᾱ_t / ᾱ_{t-1}
    # Your code here
    pass


def compute_alpha_bars(betas):
    """
    Compute the cumulative product of alphas.

    Args:
        betas: (T,) noise schedule
    Returns:
        alphas: 1 - betas
        alphas_cumprod: Cumulative product of alphas
    """
    # TODO 1c: Implement alpha computation
    # alphas = 1 - betas
    # alphas_cumprod = cumprod(alphas)
    # Your code here
    pass
```
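For reference, a minimal NumPy sketch of both schedules, following the formulas above (the function names here are illustrative, not part of the template):

```python
import numpy as np

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    # Evenly spaced noise levels from beta_start to beta_end
    return np.linspace(beta_start, beta_end, T)

def cosine_betas(T, s=0.008):
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2   # f(t)
    alpha_bar = f / f[0]                                  # ᾱ_t = f(t) / f(0)
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]            # β_t = 1 - ᾱ_t / ᾱ_{t-1}
    return np.clip(betas, 0, 0.999)                       # keep betas in a valid range

betas = cosine_betas(1000)
alphas = 1 - betas
alphas_cumprod = np.cumprod(alphas)
```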
TODO 2: Implement q(x_t | x_0) - add noise to images
```python
def forward_diffusion_sample(x_0, t, alphas_cumprod, device):
    """
    Sample from q(x_t | x_0) using the closed-form formula:
        x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε,   where ε ~ N(0, I)

    Args:
        x_0: Clean images (batch, channels, H, W)
        t: Timesteps (batch,) - indices from 0 to T-1
        alphas_cumprod: Cumulative alphas (T,)
        device: 'cuda' or 'cpu'
    Returns:
        x_t: Noisy images (same shape as x_0)
        noise: The noise that was added (same shape as x_0)
    """
    # TODO 2: Implement forward diffusion
    # Step 1: Gather sqrt(alphas_cumprod[t]) and sqrt(1 - alphas_cumprod[t])
    # Step 2: Sample noise ε ~ N(0, I)
    # Step 3: Compute x_t = sqrt_alphas * x_0 + sqrt_one_minus_alphas * ε
    # Your code here
    pass
```
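For reference, a minimal PyTorch sketch of the closed-form forward process (the function name is illustrative; the reshape broadcasts the per-sample ᾱ_t over the image dimensions):

```python
import torch

def forward_diffusion_sketch(x_0, t, alphas_cumprod):
    a_bar = alphas_cumprod.to(x_0.device)[t].view(-1, 1, 1, 1)   # ᾱ_t per sample, (batch, 1, 1, 1)
    noise = torch.randn_like(x_0)                                # ε ~ N(0, I)
    x_t = a_bar.sqrt() * x_0 + (1 - a_bar).sqrt() * noise        # √ᾱ_t·x_0 + √(1-ᾱ_t)·ε
    return x_t, noise
```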
TODO 3: Implement sinusoidal timestep embeddings
```python
class SinusoidalPositionEmbeddings(nn.Module):
    """
    Encode timestep t as a continuous vector using sinusoidal functions.
    """
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, time):
        """
        Args:
            time: (batch,) - timestep indices
        Returns:
            Embeddings (batch, dim)
        """
        # TODO 3a: Implement sinusoidal embeddings
        # Formula:
        #   emb[i, 2k]   = sin(t / 10000^(2k/dim))
        #   emb[i, 2k+1] = cos(t / 10000^(2k/dim))
        # Your code here
        pass
```
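For reference, a minimal sketch of the embedding computation, assuming an even `dim`; it concatenates the sine and cosine halves rather than strictly interleaving them:

```python
import math
import torch

def sinusoidal_embedding_sketch(time, dim):
    half = dim // 2
    # Geometric frequency ladder from 1 down to ~1/10000
    freqs = torch.exp(-math.log(10000) * torch.arange(half, device=time.device) / half)
    args = time[:, None].float() * freqs[None, :]          # (batch, dim/2)
    return torch.cat([args.sin(), args.cos()], dim=-1)     # (batch, dim)
```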
TODO 4: Implement U-Net denoising model
```python
class DiffusionUNet(nn.Module):
    """
    U-Net architecture for diffusion models.
    Predicts the noise ε given the noisy image x_t and timestep t.
    """
    def __init__(self, in_channels=1, out_channels=1, time_emb_dim=128):
        super().__init__()
        # TODO 4: Implement U-Net architecture
        # Components:
        #   1. Timestep embedding MLP
        #   2. Encoder blocks (with timestep conditioning)
        #   3. Bottleneck
        #   4. Decoder blocks (with skip connections + timestep conditioning)
        #   5. Output projection
        # Your code here
        pass

    def forward(self, x, t):
        """
        Args:
            x: Noisy images (batch, channels, H, W)
            t: Timesteps (batch,)
        Returns:
            Predicted noise (batch, channels, H, W)
        """
        # TODO 4: Implement forward pass
        #   1. Encode timestep
        #   2. Encoder (save activations for skip connections)
        #   3. Bottleneck
        #   4. Decoder (concatenate skip connections, add timestep conditioning)
        #   5. Output
        # Your code here
        pass
```
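A full U-Net is too long to reproduce here, but the sketch below shows one common way to inject the timestep embedding into a convolutional block: project it to the channel dimension and add it as a per-channel bias. The block name and layer sizes are illustrative, not prescribed by the template.

```python
import torch
import torch.nn as nn

class TimeConditionedBlock(nn.Module):
    """Illustrative encoder/decoder block: two convolutions with the timestep
    embedding projected and added per-channel between them."""
    def __init__(self, in_ch, out_ch, time_emb_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.time_proj = nn.Linear(time_emb_dim, out_ch)
        self.act = nn.SiLU()

    def forward(self, x, t_emb):
        h = self.act(self.conv1(x))
        h = h + self.time_proj(t_emb)[:, :, None, None]   # broadcast over H, W
        return self.act(self.conv2(h))
```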
TODO 5: Implement DDPM training loss
```python
def ddpm_loss(model, x_0, alphas_cumprod, device):
    """
    Compute the DDPM training loss:
        Loss = E_{t, x_0, ε} [ ||ε - ε_θ(x_t, t)||² ]

    Args:
        model: Denoising U-Net
        x_0: Clean images (batch, channels, H, W)
        alphas_cumprod: Cumulative alphas
        device: 'cuda' or 'cpu'
    Returns:
        loss: MSE between true noise and predicted noise
    """
    # TODO 5: Implement training loss
    # Step 1: Sample random timesteps t ~ Uniform(0, T-1)
    # Step 2: Sample noise and create x_t using forward_diffusion_sample
    # Step 3: Predict noise: ε_pred = model(x_t, t)
    # Step 4: Compute MSE loss: ||ε - ε_pred||²
    # Your code here
    pass
```
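For reference, a minimal sketch of the simplified DDPM objective; it reuses the forward-diffusion sketch shown earlier and assumes `model` takes `(x_t, t)` and returns predicted noise:

```python
import torch
import torch.nn.functional as F

def ddpm_loss_sketch(model, x_0, alphas_cumprod):
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (x_0.shape[0],), device=x_0.device)   # t ~ Uniform{0, ..., T-1}
    x_t, noise = forward_diffusion_sketch(x_0, t, alphas_cumprod)  # noisy images + true noise
    noise_pred = model(x_t, t)
    return F.mse_loss(noise_pred, noise)                           # ||ε - ε_pred||²
```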
TODO 6: Implement reverse diffusion sampling (1000 steps)
```python
@torch.no_grad()
def ddpm_sample(model, image_size, num_samples, timesteps, alphas, alphas_cumprod, betas, device):
    """
    Sample from the model using DDPM (1000 steps).
    Reverse process: x_T → x_{T-1} → ... → x_1 → x_0

    Args:
        model: Trained denoising U-Net
        image_size: (H, W) of output images
        num_samples: Number of images to generate
        timesteps: Total timesteps (T)
        alphas, alphas_cumprod, betas: Noise schedule parameters
        device: 'cuda' or 'cpu'
    Returns:
        Generated images (num_samples, channels, H, W)
    """
    # TODO 6: Implement DDPM sampling
    # Step 1: Start from pure noise x_T ~ N(0, I)
    # Step 2: For t = T-1 down to 0:
    #   a. Predict noise: ε_pred = model(x_t, t)
    #   b. Compute the mean of the reverse distribution
    #   c. Add noise (if t > 0)
    #   d. x_{t-1} = mean + noise
    # Your code here
    pass
```
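For reference, a minimal sketch of the reverse loop, assuming `betas`, `alphas`, and `alphas_cumprod` are PyTorch tensors on `device`; it uses β_t as the sampling variance (the posterior variance is another common choice):

```python
import torch

@torch.no_grad()
def ddpm_sample_sketch(model, shape, betas, alphas, alphas_cumprod, device):
    x = torch.randn(shape, device=device)                        # x_T ~ N(0, I)
    T = betas.shape[0]
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, device=device, dtype=torch.long)
        eps = model(x, t)                                        # ε_pred
        coef = (1 - alphas[i]) / (1 - alphas_cumprod[i]).sqrt()
        mean = (x - coef * eps) / alphas[i].sqrt()               # mean of p(x_{t-1} | x_t)
        if i > 0:
            x = mean + betas[i].sqrt() * torch.randn_like(x)     # add noise except at the last step
        else:
            x = mean
    return x
```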
TODO 7: Implement DDIM for fast sampling (10-50 steps)
```python
@torch.no_grad()
def ddim_sample(model, image_size, num_samples, ddim_steps, alphas_cumprod, device):
    """
    Sample using DDIM (10-50 steps instead of 1000).
    Key insight: a non-Markovian, deterministic sampling process.

    Args:
        model: Trained denoising U-Net
        image_size: (H, W)
        num_samples: Number of images
        ddim_steps: Number of sampling steps (e.g., 50)
        alphas_cumprod: Cumulative alphas
        device: 'cuda' or 'cpu'
    Returns:
        Generated images (num_samples, channels, H, W)
    """
    # TODO 7: Implement DDIM sampling
    # Step 1: Create a timestep sequence that skips timesteps,
    #         e.g., [999, 949, 899, ..., 49, 0]
    # Step 2: Start from pure noise x_T ~ N(0, I)
    # Step 3: For each timestep pair (t, t_prev):
    #   a. Predict noise: ε_pred = model(x_t, t)
    #   b. Predict x_0: x_0_pred = (x_t - √(1-ᾱ_t) · ε_pred) / √ᾱ_t
    #   c. Compute x_{t_prev} deterministically
    # Step 4: Return x_0
    # Your code here
    pass
```
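For reference, a minimal sketch of deterministic DDIM (η = 0), assuming `alphas_cumprod` is a PyTorch tensor of length T:

```python
import torch

@torch.no_grad()
def ddim_sample_sketch(model, shape, ddim_steps, alphas_cumprod, device):
    T = alphas_cumprod.shape[0]
    seq = torch.linspace(T - 1, 0, ddim_steps).long().tolist()       # e.g. [999, ..., 0]
    x = torch.randn(shape, device=device)                            # x_T ~ N(0, I)
    for i, t in enumerate(seq):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, t_batch)
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[seq[i + 1]] if i + 1 < len(seq) else torch.ones_like(a_t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()           # predict the clean image
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps       # deterministic (η = 0) step
    return x
```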
Pre-built interface to Stable Diffusion:
Model: stabilityai/stable-diffusion-2-1
Features:
TODO 8: Implement classifier-free guidance for conditional generation
```python
class ConditionalDiffusionUNet(nn.Module):
    """
    U-Net with class conditioning for classifier-free guidance (CFG).
    """
    def __init__(self, num_classes=10, class_emb_dim=64, **kwargs):
        super().__init__()
        # TODO 8a: Add class conditioning to the U-Net
        #   1. Class embedding layer
        #   2. Combine with the timestep embedding
        #   3. Modify the U-Net to accept the combined embedding
        # Your code here
        pass

    def forward(self, x, t, class_labels=None):
        """
        Args:
            x: Noisy images (batch, channels, H, W)
            t: Timesteps (batch,)
            class_labels: (batch,) or None for unconditional
        Returns:
            Predicted noise (batch, channels, H, W)
        """
        # TODO 8a: Implement conditional forward pass
        # Your code here
        pass


@torch.no_grad()
def classifier_free_guidance_sample(model, class_label, guidance_scale=7.5, **kwargs):
    """
    Sample with classifier-free guidance.
    Formula: ε_guided = ε_uncond + w · (ε_cond - ε_uncond)

    Args:
        model: Conditional U-Net
        class_label: Target class (0-9 for MNIST)
        guidance_scale: w (higher = stronger conditioning)
    Returns:
        Generated image
    """
    # TODO 8b: Implement CFG sampling
    # For each timestep:
    #   1. Predict noise unconditionally: ε_uncond = model(x_t, t, class_labels=None)
    #   2. Predict noise conditionally:   ε_cond   = model(x_t, t, class_labels=class_label)
    #   3. Combine: ε_guided = ε_uncond + w · (ε_cond - ε_uncond)
    #   4. Use ε_guided for the denoising step
    # Your code here
    pass
```
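For reference, a minimal sketch of the per-step guidance combination, assuming the conditional model signature `model(x, t, class_labels=None)` from the stub above:

```python
import torch

@torch.no_grad()
def guided_noise(model, x_t, t, class_labels, guidance_scale=7.5):
    eps_uncond = model(x_t, t, class_labels=None)          # unconditional prediction
    eps_cond = model(x_t, t, class_labels=class_labels)    # class-conditional prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)   # ε_uncond + w·(ε_cond − ε_uncond)
```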
Linear vs Cosine Comparison:
Timestep | Linear β | Cosine β
---------|----------|----------
0 | 0.0001 | 0.0001
250 | 0.0050 | 0.0015
500 | 0.0100 | 0.0035
750 | 0.0150 | 0.0105
1000 | 0.0200 | 0.0200
✓ Cosine more gradual early on
✓ Both reach same endpoint
Forward Diffusion Visualization:
t=0: Clear digit "7"
t=250: Slightly noisy
t=500: Very noisy
t=750: Mostly noise
t=1000: Pure noise (indistinguishable from random)
✓ Gradual noise addition
Training Progress (10 epochs on MNIST):
Epoch 1: Loss = 0.152
Epoch 5: Loss = 0.045
Epoch 10: Loss = 0.032
✓ Loss decreases smoothly
✓ Model learns to denoise
Generation Quality (1000 steps):
Generated 64 MNIST digits:
✓ All recognizable (0-9)
✓ High quality, minimal artifacts
✓ Diverse within each class
Sampling time: 45 seconds (T4 GPU)
Speed vs Quality Trade-off:
Steps | Time | Quality (FID)
------|----------|---------------
1000 | 45s | 8.2 (excellent)
100 | 5s | 9.1 (excellent)
50 | 2.5s | 11.3 (good)
10 | 0.5s | 18.7 (moderate)
✓ 50 steps: 18× faster, minimal quality loss
✓ 100 steps: 9× faster, negligible quality loss
Guidance Scale Effects (class "7"):
w = 0: Random digits (ignores class)
w = 1: Recognizable "7"
w = 3: Clear, typical "7"
w = 7: Very clear, canonical "7"
w = 15: Oversaturated "7" (artifacts)
✓ Optimal guidance scale: 5-10
Your implementation is complete when:
Common Issues:
1. Loss Not Decreasing:
2. Generated Samples are Noise:
3. Samples Have Artifacts:
Effective Prompts:
✅ "A beautiful landscape with mountains and a lake, digital art, highly detailed, 4k"
✅ "Portrait of a cat wearing a wizard hat, oil painting, warm lighting"
✅ "Futuristic city at sunset, cyberpunk, neon lights, photorealistic"
❌ "cat" (too vague)
❌ "make it good" (not descriptive)
Negative Prompts:
Common negatives: "blurry, low quality, distorted, ugly, bad anatomy"
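The snippet below is a minimal sketch of how prompts and negative prompts are typically passed to a Hugging Face `diffusers` pipeline; the parameter values shown are illustrative, not required settings:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the Stable Diffusion 2.1 pipeline referenced above (fp16 to fit a T4 GPU).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="Portrait of a cat wearing a wizard hat, oil painting, warm lighting",
    negative_prompt="blurry, low quality, distorted, ugly, bad anatomy",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("cat_wizard.png")
```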
| Parameter | Recommended | Effect |
|---|---|---|
| Timesteps (T) | 1000 | Standard for DDPM |
| DDIM steps | 50-100 | Balance speed/quality |
| Learning rate | 1e-4 | Standard for diffusion |
| Guidance scale | 7-10 | Higher = stronger conditioning |
| Batch size | 128 | Larger = more stable |
Run diffusion in VAE latent space:
```python
# Encode to latent
z = vae.encode(x)
# Diffusion in latent space (64×64×4 instead of 512×512×3)
z_noisy = forward_diffusion(z, t)
z_denoised = reverse_diffusion(z_noisy)
# Decode to image
x_generated = vae.decode(z_denoised)
```
Benefit: the latent grid has 48× fewer elements than the pixel grid (64×64×4 vs 512×512×3), so diffusion runs far faster.
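As a sketch of what this looks like with a pretrained `diffusers` VAE (the checkpoint name, the Stable Diffusion 0.18215 latent scaling factor, and the `denoise_in_latent_space` placeholder are assumptions for illustration):

```python
import torch
from diffusers import AutoencoderKL

# Encode to the VAE latent space, run diffusion there, then decode back to pixels.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to("cuda").eval()

@torch.no_grad()
def latent_roundtrip(images, denoise_in_latent_space):
    z = vae.encode(images).latent_dist.sample() * 0.18215   # (B, 4, H/8, W/8)
    z = denoise_in_latent_space(z)                           # your diffusion loop, in latent space
    return vae.decode(z / 0.18215).sample                    # back to pixel space
```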
Fill masked regions:
```python
def inpaint(image, mask, model):
    """
    Fill the masked region using diffusion.

    During sampling:
        - Keep the unmasked region fixed
        - Only denoise the masked region
    """
    pass
```
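One common approach (in the spirit of RePaint) is sketched below: at each sampling step, the known region is replaced with a freshly noised copy of the original image, so only the masked region is actually generated. It reuses the forward-diffusion sketch from earlier; the mask convention is an assumption (1 = generate, 0 = keep).

```python
import torch

@torch.no_grad()
def inpaint_step(x_t_denoised, image, mask, t, alphas_cumprod):
    # Noise the known image to the current level t, then keep it outside the mask.
    known_noisy, _ = forward_diffusion_sketch(image, t, alphas_cumprod)
    return mask * x_t_denoised + (1 - mask) * known_noisy
```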
Add spatial control (edges, pose, depth):
```python
class ControlNet(nn.Module):
    """
    Add a spatial condition (edge map, pose) to diffusion.
    """
    pass
```
Extend to video generation:
```python
class VideoDiffusionUNet(nn.Module):
    """
    3D U-Net for temporal consistency.
    Input: (batch, channels, frames, H, W)
    """
    pass
```
Completed Notebook: activity-13-diffusion-models.ipynb
Generated Samples:
Speed Comparison:
Stable Diffusion Experiments:
Analysis (7-10 sentences):
Next Activity: Activity 14 - Transformer Architectures for Generation
This activity is graded on:
Passing Grade: 70% or higher
Congratulations on mastering diffusion models! 🎉🎨