Apply your knowledge to build something amazing!
Duration: 2 weeks
Points: 100
Prerequisites: Complete Lesson 10 (Variational Autoencoders)
Difficulty: Intermediate
In this project, you'll build an interactive latent space exploration system using Variational Autoencoders (VAEs). You'll train a VAE on image data, create tools for latent space traversal and interpolation, analyze disentangled representations, and deploy an interactive web application. This project demonstrates how VAEs enable controlled image manipulation and semantic editing.
Why This Matters: VAEs are the foundation of many modern generative systems, including Stable Diffusion's latent diffusion architecture. Understanding latent space manipulation is crucial for controllable AI generation.
What You'll Build:
By completing this project, you will:
Your Latent Space Explorer must:
Your implementation must include:
project-04-vae-latent-explorer/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── vae_model.py # VAE architecture
├── train.py # Training script
├── explorer_app.py # Interactive Gradio/Streamlit app
├── analysis.py # Disentanglement analysis
├── utils.py # Helper functions
├── models/ # Saved checkpoints
│ └── vae_best.pth
├── results/ # Generated visualizations
│ ├── latent_traversals/
│ ├── interpolations/
│ └── arithmetic/
└── logs/ # Training logs
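The requirements.txt might look like the following (one plausible set of dependencies, not prescribed by the spec; pin versions as you see fit):

torch
torchvision
numpy
gradio
scikit-learn
matplotlib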
| Criterion | Points | Description |
|---|---|---|
| VAE Implementation | 30 | Correct VAE with ELBO loss and reparameterization |
| Reconstruction Quality | 20 | PSNR ≥ 20 dB, visually accurate reconstructions |
| Disentanglement | 20 | At least 3 interpretable latent dimensions |
| Interactive Explorer | 15 | Real-time UI with smooth latent manipulation |
| Analysis | 10 | Comprehensive disentanglement analysis |
| Documentation | 5 | Clear README with examples |
| Total | 100 | |
Bonus Points (+10 each):
Day 1-2: Dataset Preparation (see the data-loading sketch after this timeline)
Day 3-5: VAE Architecture
Day 6-7: Training
Deliverable: Trained VAE with good reconstruction quality
Day 8-9: Latent Space Analysis
Day 10-11: Interactive Features
Day 12-13: β-VAE Experiments
Day 14: Documentation and Portfolio
Deliverable: Complete latent space explorer with analysis
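A minimal data-loading sketch for the Day 1-2 dataset preparation, assuming CelebA resized to 64×64 (the resolution the encoder comments below assume); any folder of images works the same way via torchvision's datasets.ImageFolder:

import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),  # scales pixels to [0, 1], matching the Sigmoid decoder
])
# torchvision's CelebA downloader can be flaky; a local ImageFolder is a fine substitute
dataset = datasets.CelebA(root="data", split="train", transform=transform, download=True)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True, num_workers=4)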
import torch
import torch.nn as nn
import torch.nn.functional as F
class VAE(nn.Module):
def __init__(self, latent_dim=64):
super().__init__()
        # Encoder: image → μ, log σ²
self.encoder = nn.Sequential(
nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1), # 64 → 32
nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), # 32 → 16
nn.ReLU(),
nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), # 16 → 8
nn.ReLU(),
nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), # 8 → 4
nn.ReLU(),
nn.Flatten(),
)
# Latent space
self.fc_mu = nn.Linear(256 * 4 * 4, latent_dim)
self.fc_logvar = nn.Linear(256 * 4 * 4, latent_dim)
# Decoder: z → Image
self.fc_decode = nn.Linear(latent_dim, 256 * 4 * 4)
self.decoder = nn.Sequential(
nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), # 4 → 8
nn.ReLU(),
nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), # 8 → 16
nn.ReLU(),
nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), # 16 → 32
nn.ReLU(),
nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1), # 32 → 64
nn.Sigmoid(), # Output in [0, 1]
)
def encode(self, x):
"""Encode image to latent distribution parameters"""
h = self.encoder(x)
mu = self.fc_mu(h)
logvar = self.fc_logvar(h)
return mu, logvar
def reparameterize(self, mu, logvar):
"""Reparameterization trick: z = μ + σ * ε"""
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
z = mu + eps * std
return z
def decode(self, z):
"""Decode latent code to image"""
h = self.fc_decode(z)
h = h.view(-1, 256, 4, 4)
reconstruction = self.decoder(h)
return reconstruction
def forward(self, x):
mu, logvar = self.encode(x)
z = self.reparameterize(mu, logvar)
reconstruction = self.decode(z)
return reconstruction, mu, logvar
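# Quick shape check (a minimal sketch; the encoder comments above assume
# 64×64 RGB inputs):
vae = VAE(latent_dim=64)
x = torch.randn(8, 3, 64, 64)
recon, mu, logvar = vae(x)
print(recon.shape, mu.shape)  # torch.Size([8, 3, 64, 64]) torch.Size([8, 64])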
def vae_loss(reconstruction, x, mu, logvar, beta=1.0):
    """
    Loss = Reconstruction Loss + β * KL Divergence (the negative ELBO)
    Beta-VAE: β > 1 encourages disentanglement at some cost to reconstruction
    """
    # Reconstruction loss (MSE here; BCE is common for binary images)
    recon_loss = F.mse_loss(reconstruction, x, reduction='sum')
    # KL divergence KL(q(z|x) || p(z)) for a diagonal Gaussian vs. N(0, I):
    # -0.5 * sum(1 + log(σ²) - μ² - σ²)
    kl_divergence = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Total loss (negative ELBO; minimizing it maximizes the ELBO)
    loss = recon_loss + beta * kl_divergence
    return loss, recon_loss, kl_divergence
# Training loop (a minimal sketch; `device` and `num_epochs` are assumed defined)
vae = VAE(latent_dim=64).to(device)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-4)

for epoch in range(num_epochs):
    for images, _ in dataloader:  # unpack (image, label) batches; drop the labels
        images = images.to(device)
        # Forward pass
        reconstruction, mu, logvar = vae(images)
        # Compute loss
        loss, recon_loss, kl_loss = vae_loss(reconstruction, images, mu, logvar, beta=1.0)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Log metrics once per epoch
    print(f"Epoch {epoch}: Loss: {loss.item():.4f}, Recon: {recon_loss.item():.4f}, KL: {kl_loss.item():.4f}")
import numpy as np

def latent_traversal(vae, image, dimension, values=np.linspace(-3, 3, 11)):
    """
    Traverse latent space along a single dimension
    Shows what that dimension controls
    """
    # Encode image to latent code (expects a batched tensor, e.g. (1, 3, 64, 64))
    with torch.no_grad():
        mu, logvar = vae.encode(image)
        z = mu  # Use the mean (no sampling) for deterministic results
        results = []
        for value in values:
            # Modify the chosen dimension, keeping all others fixed
            z_modified = z.clone()
            z_modified[0, dimension] = value
            # Decode
            results.append(vae.decode(z_modified))
    return results
# Example: Traverse dimension 5
traversal_images = latent_traversal(vae, test_image, dimension=5)
# If dimension 5 controls "smile", you'll see smile intensity changing
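# To scan every dimension and save one strip per dimension into
# results/latent_traversals/ (the folder from the project layout above),
# a sketch using torchvision:
from torchvision.utils import save_image

for dim in range(64):  # one strip per latent dimension
    frames = latent_traversal(vae, test_image, dimension=dim)
    strip = torch.cat(frames, dim=0)  # stack the 11 frames into (11, 3, 64, 64)
    save_image(strip, f"results/latent_traversals/dim_{dim:02d}.png", nrow=len(frames))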
def interpolate_images(vae, img1, img2, steps=10):
"""Smoothly interpolate between two images"""
# Encode both images
mu1, _ = vae.encode(img1)
mu2, _ = vae.encode(img2)
# Linear interpolation in latent space
alphas = torch.linspace(0, 1, steps)
interpolations = []
for alpha in alphas:
z_interp = (1 - alpha) * mu1 + alpha * mu2
reconstruction = vae.decode(z_interp)
interpolations.append(reconstruction)
return interpolations
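# Linear interpolation can pass through low-probability regions of the
# Gaussian prior; spherical interpolation (slerp) is a common alternative
# (a sketch, not required by the project spec):
def slerp(z1, z2, alpha):
    """Spherically interpolate between two latent codes."""
    z1_n = z1 / z1.norm(dim=-1, keepdim=True)
    z2_n = z2 / z2.norm(dim=-1, keepdim=True)
    omega = torch.acos((z1_n * z2_n).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    return (torch.sin((1 - alpha) * omega) / so) * z1 + (torch.sin(alpha * omega) / so) * z2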
def latent_arithmetic(vae, img_source, img_add, img_subtract):
"""
Latent arithmetic: source + add - subtract
Example: man + smile - neutral = smiling man
"""
# Encode all images
z_source, _ = vae.encode(img_source)
z_add, _ = vae.encode(img_add)
z_subtract, _ = vae.encode(img_subtract)
# Arithmetic operation
z_result = z_source + z_add - z_subtract
# Decode
result_image = vae.decode(z_result)
return result_image
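# Single-image arithmetic is noisy; averaging latent codes over many examples
# with and without an attribute gives a cleaner direction (a sketch, assuming
# lists of preprocessed (1, 3, 64, 64) tensors):
def attribute_vector(vae, imgs_with, imgs_without):
    """Mean latent difference between images with/without an attribute."""
    with torch.no_grad():
        z_with = torch.stack([vae.encode(img)[0] for img in imgs_with]).mean(dim=0)
        z_without = torch.stack([vae.encode(img)[0] for img in imgs_without]).mean(dim=0)
    return z_with - z_without

# Example: smile = attribute_vector(vae, smiling_faces, neutral_faces)
#          result = vae.decode(vae.encode(img)[0] + smile)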
import gradio as gr
import torch
# Load trained VAE
vae = VAE(latent_dim=64)
vae.load_state_dict(torch.load("models/vae_best.pth", map_location='cuda'))
vae = vae.to('cuda')
vae.eval()  # inference mode for serving
def explore_latent_space(*sliders):
"""Generate image from latent code controlled by sliders"""
# Convert sliders to latent code
z = torch.tensor([sliders]).float().to('cuda')
# Decode
with torch.no_grad():
img = vae.decode(z)
# Convert to numpy
img = img[0].permute(1, 2, 0).cpu().numpy()
return (img * 255).astype('uint8')
# Create 64 sliders (one per latent dimension)
sliders = [gr.Slider(-3, 3, value=0, step=0.1, label=f"Z{i}") for i in range(64)]
interface = gr.Interface(
fn=explore_latent_space,
inputs=sliders,
outputs=gr.Image(label="Generated Image"),
title="VAE Latent Space Explorer",
description="Adjust sliders to explore the latent space!",
)
interface.launch()
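To let users start from a real photo instead of all-zero sliders, you can encode an uploaded image into initial slider values (a hypothetical helper, not part of the spec; wiring it into the UI, e.g. via gr.Blocks, is left to you):

def encode_to_sliders(image_np):
    """Encode a 64×64 RGB uint8 array of shape (H, W, 3) into 64 slider values."""
    x = torch.from_numpy(image_np).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        mu, _ = vae.encode(x.to('cuda'))
    return mu[0].cpu().tolist()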
import numpy as np

def compute_mig(vae, dataset, num_samples=10000):
    """
    Compute the Mutual Information Gap (MIG)
    Measures how well individual latent dimensions capture ground-truth factors
    """
    # Encode batches from a loader yielding (images, labels)
    latent_codes = []
    labels = []  # Ground-truth factors (e.g., pose, lighting)
    with torch.no_grad():
        for images, label in dataset:
            mu, _ = vae.encode(images)
            latent_codes.append(mu.cpu().numpy())
            labels.append(label.numpy())
    latent_codes = np.concatenate(latent_codes)
    labels = np.concatenate(labels)
    # Estimate mutual information between each latent dimension and each
    # ground-truth factor, then take the normalized gap (one possible
    # estimator is sketched below)
    return mig_from_codes(latent_codes, labels)
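One simple estimator discretizes each latent dimension and uses scikit-learn's mutual_info_score; a sketch, assuming factors is an integer array of shape (N, num_factors):

from sklearn.metrics import mutual_info_score

def mig_from_codes(latent_codes, factors, num_bins=20):
    """MIG: for each factor, the gap between the two most informative latent
    dimensions, normalized by the factor's entropy, averaged over factors."""
    # Discretize continuous latent codes for MI estimation
    binned = np.zeros_like(latent_codes, dtype=int)
    for j in range(latent_codes.shape[1]):
        edges = np.histogram(latent_codes[:, j], bins=num_bins)[1]
        binned[:, j] = np.digitize(latent_codes[:, j], edges[1:-1])
    gaps = []
    for k in range(factors.shape[1]):
        mi = np.array([mutual_info_score(factors[:, k], binned[:, j])
                       for j in range(latent_codes.shape[1])])
        entropy = mutual_info_score(factors[:, k], factors[:, k])  # H(v_k) = I(v_k; v_k)
        top2 = np.sort(mi)[-2:]
        gaps.append((top2[1] - top2[0]) / entropy)
    return float(np.mean(gaps))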
Best for Disentanglement Analysis:
Best for Visual Quality:
| Hyperparameter | Value | Notes |
|---|---|---|
| Latent dim | 32-128 | Depends on data complexity |
| β (beta) | 1-16 | Higher β → more disentanglement |
| Learning rate | 1e-4 | Adam optimizer |
| Batch size | 64-128 | Larger batches give more stable gradients |
| Reconstruction loss | MSE or BCE | BCE for binary images |
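For the Day 12-13 β-VAE experiments, a simple sweep over β values, reusing the training loop above (train_vae and evaluate_psnr are hypothetical wrappers you would write):

for beta in [1.0, 2.0, 4.0, 8.0, 16.0]:
    model = VAE(latent_dim=64).to(device)
    train_vae(model, dataloader, beta=beta)        # hypothetical wrapper around the training loop
    print(beta, evaluate_psnr(model, val_loader))  # expect reconstruction PSNR to fall as β grows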
Reference repositories: pytorch/examples, DLR-RM/stable-baselines3

Required Deliverables:
Deadline: 2 weeks from project start
Demo Website:
LinkedIn/Resume:
"Built interactive latent space exploration system using Variational Autoencoders. Achieved 23.5 dB PSNR reconstruction quality with 5 disentangled semantic dimensions for controllable image generation."
Good luck! VAEs are the foundation of modern generative AI, and this project will give you a deep understanding of latent representations.
Related Projects: