Apply your knowledge to build something amazing!
Duration: 2 weeks
Points: 100
Prerequisites: Complete Lesson 10 (Variational Autoencoders)
Difficulty: Intermediate
In this project, you'll build an interactive latent space exploration system using Variational Autoencoders (VAEs). You'll train a VAE on image data, create tools for latent space traversal and interpolation, analyze disentangled representations, and deploy an interactive web application. This project demonstrates how VAEs enable controlled image manipulation and semantic editing.
Why This Matters: VAEs are the foundation of many modern generative systems, including Stable Diffusion's latent diffusion architecture. Understanding latent space manipulation is crucial for controllable AI generation.
What You'll Build:
By completing this project, you will:
Your Latent Space Explorer must:
Your implementation must include:
project-04-vae-latent-explorer/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── vae_model.py # VAE architecture
├── train.py # Training script
├── explorer_app.py # Interactive Gradio/Streamlit app
├── analysis.py # Disentanglement analysis
├── utils.py # Helper functions
├── models/ # Saved checkpoints
│ └── vae_best.pth
├── results/ # Generated visualizations
│ ├── latent_traversals/
│ ├── interpolations/
│ └── arithmetic/
└── logs/ # Training logs
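The requirements.txt might look like the following (one plausible set of dependencies, not prescribed by the spec; pin versions as you see fit):

torch
torchvision
numpy
gradio
scikit-learn
matplotlib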
| Criterion | Points | Description |
|---|---|---|
| VAE Implementation | 30 | Correct VAE with ELBO loss and reparameterization |
| Reconstruction Quality | 20 | PSNR ≥ 20 dB, visually accurate reconstructions |
| Disentanglement | 20 | At least 3 interpretable latent dimensions |
| Interactive Explorer | 15 | Real-time UI with smooth latent manipulation |
| Analysis | 10 | Comprehensive disentanglement analysis |
| Documentation | 5 | Clear README with examples |
| Total | 100 | |
Bonus Points (+10 each):
Day 1-2: Dataset Preparation (see the data-loading sketch after this timeline)
Day 3-5: VAE Architecture
Day 6-7: Training
Deliverable: Trained VAE with good reconstruction quality
Day 8-9: Latent Space Analysis
Day 10-11: Interactive Features
Day 12-13: β-VAE Experiments
Day 14: Documentation and Portfolio
Deliverable: Complete latent space explorer with analysis
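A minimal data-loading sketch for the Day 1-2 dataset preparation, assuming CelebA resized to 64×64 (the resolution the encoder comments below assume); any folder of images works the same way via torchvision's datasets.ImageFolder:

import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),  # scales pixels to [0, 1], matching the Sigmoid decoder
])
# torchvision's CelebA downloader can be flaky; a local ImageFolder is a fine substitute
dataset = datasets.CelebA(root="data", split="train", transform=transform, download=True)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True, num_workers=4)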
import torch
import torch.nn as nn
import torch.nn.functional as F
class VAE(nn.Module):
def __init__(self, latent_dim=64):
super().__init__()
        # Encoder: image → μ, log σ²
self.encoder = nn.Sequential(
nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1), # 64 → 32
nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), # 32 → 16
nn.ReLU(),
nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), # 16 → 8
nn.ReLU(),
nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), # 8 → 4
nn.ReLU(),
nn.Flatten(),
)
# Latent space
self.fc_mu = nn.Linear(256 * 4 * 4, latent_dim)
self.fc_logvar = nn.Linear(256 * 4 * 4, latent_dim)
# Decoder: z → Image
self.fc_decode = nn.Linear(latent_dim, 256 * 4 * 4)
self.decoder = nn.Sequential(
nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), # 4 → 8
nn.ReLU(),
nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), # 8 → 16
nn.ReLU(),
nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), # 16 → 32
nn.ReLU(),
nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1), # 32 → 64
nn.Sigmoid(), # Output in [0, 1]
)
def encode(self, x):
"""Encode image to latent distribution parameters"""
h = self.encoder(x)
mu = self.fc_mu(h)
logvar = self.fc_logvar(h)
return mu, logvar
def reparameterize(self, mu, logvar):
"""Reparameterization trick: z = μ + σ * ε"""
std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)
z = mu + eps * std
return z
def decode(self, z):
"""Decode latent code to image"""
h = self.fc_decode(z)
h = h.view(-1, 256, 4, 4)
reconstruction = self.decoder(h)
return reconstruction
def forward(self, x):
mu, logvar = self.encode(x)
z = self.reparameterize(mu, logvar)
reconstruction = self.decode(z)
return reconstruction, mu, logvar
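# Quick shape check (a minimal sketch; the encoder comments above assume
# 64×64 RGB inputs):
vae = VAE(latent_dim=64)
x = torch.randn(8, 3, 64, 64)
recon, mu, logvar = vae(x)
print(recon.shape, mu.shape)  # torch.Size([8, 3, 64, 64]) torch.Size([8, 64])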
def vae_loss(reconstruction, x, mu, logvar, beta=1.0):
    """
    Loss = Reconstruction Loss + β * KL Divergence (the negative ELBO)
    Beta-VAE: β > 1 encourages disentanglement at some cost to reconstruction
    """
    # Reconstruction loss (MSE here; BCE is common for binary images)
    recon_loss = F.mse_loss(reconstruction, x, reduction='sum')
    # KL divergence KL(q(z|x) || p(z)) for a diagonal Gaussian vs. N(0, I):
    # -0.5 * sum(1 + log(σ²) - μ² - σ²)
    kl_divergence = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Total loss (negative ELBO; minimizing it maximizes the ELBO)
    loss = recon_loss + beta * kl_divergence
    return loss, recon_loss, kl_divergence
# Training loop (a minimal sketch; `device` and `num_epochs` are assumed defined)
vae = VAE(latent_dim=64).to(device)
optimizer = torch.optim.Adam(vae.parameters(), lr=1e-4)

for epoch in range(num_epochs):
    for images, _ in dataloader:  # unpack (image, label) batches; drop the labels
        images = images.to(device)
        # Forward pass
        reconstruction, mu, logvar = vae(images)
        # Compute loss
        loss, recon_loss, kl_loss = vae_loss(reconstruction, images, mu, logvar, beta=1.0)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Log metrics once per epoch
    print(f"Epoch {epoch}: Loss: {loss.item():.4f}, Recon: {recon_loss.item():.4f}, KL: {kl_loss.item():.4f}")
import numpy as np

def latent_traversal(vae, image, dimension, values=np.linspace(-3, 3, 11)):
    """
    Traverse latent space along a single dimension
    Shows what that dimension controls
    """
    # Encode image to latent code (expects a batched tensor, e.g. (1, 3, 64, 64))
    with torch.no_grad():
        mu, logvar = vae.encode(image)
        z = mu  # Use the mean (no sampling) for deterministic results
        results = []
        for value in values:
            # Modify the chosen dimension, keeping all others fixed
            z_modified = z.clone()
            z_modified[0, dimension] = value
            # Decode
            results.append(vae.decode(z_modified))
    return results
# Example: Traverse dimension 5
traversal_images = latent_traversal(vae, test_image, dimension=5)
# If dimension 5 controls "smile", you'll see smile intensity changing
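# To scan every dimension and save one strip per dimension into
# results/latent_traversals/ (the folder from the project layout above),
# a sketch using torchvision:
from torchvision.utils import save_image

for dim in range(64):  # one strip per latent dimension
    frames = latent_traversal(vae, test_image, dimension=dim)
    strip = torch.cat(frames, dim=0)  # stack the 11 frames into (11, 3, 64, 64)
    save_image(strip, f"results/latent_traversals/dim_{dim:02d}.png", nrow=len(frames))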
def interpolate_images(vae, img1, img2, steps=10):
"""Smoothly interpolate between two images"""
# Encode both images
mu1, _ = vae.encode(img1)
mu2, _ = vae.encode(img2)
# Linear interpolation in latent space
alphas = torch.linspace(0, 1, steps)
interpolations = []
for alpha in alphas:
z_interp = (1 - alpha) * mu1 + alpha * mu2
reconstruction = vae.decode(z_interp)
interpolations.append(reconstruction)
return interpolations
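# Linear interpolation can pass through low-probability regions of the
# Gaussian prior; spherical interpolation (slerp) is a common alternative
# (a sketch, not required by the project spec):
def slerp(z1, z2, alpha):
    """Spherically interpolate between two latent codes."""
    z1_n = z1 / z1.norm(dim=-1, keepdim=True)
    z2_n = z2 / z2.norm(dim=-1, keepdim=True)
    omega = torch.acos((z1_n * z2_n).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    return (torch.sin((1 - alpha) * omega) / so) * z1 + (torch.sin(alpha * omega) / so) * z2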
def latent_arithmetic(vae, img_source, img_add, img_subtract):
"""
Latent arithmetic: source + add - subtract
Example: man + smile - neutral = smiling man
"""
# Encode all images
z_source, _ = vae.encode(img_source)
z_add, _ = vae.encode(img_add)
z_subtract, _ = vae.encode(img_subtract)
# Arithmetic operation
z_result = z_source + z_add - z_subtract
# Decode
result_image = vae.decode(z_result)
return result_image
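# Single-image arithmetic is noisy; averaging latent codes over many examples
# with and without an attribute gives a cleaner direction (a sketch, assuming
# lists of preprocessed (1, 3, 64, 64) tensors):
def attribute_vector(vae, imgs_with, imgs_without):
    """Mean latent difference between images with/without an attribute."""
    with torch.no_grad():
        z_with = torch.stack([vae.encode(img)[0] for img in imgs_with]).mean(dim=0)
        z_without = torch.stack([vae.encode(img)[0] for img in imgs_without]).mean(dim=0)
    return z_with - z_without

# Example: smile = attribute_vector(vae, smiling_faces, neutral_faces)
#          result = vae.decode(vae.encode(img)[0] + smile)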
import gradio as gr
import torch
# Load trained VAE
vae = VAE(latent_dim=64)
vae.load_state_dict(torch.load("models/vae_best.pth", map_location='cuda'))
vae = vae.to('cuda')
vae.eval()  # inference mode for serving
def explore_latent_space(*sliders):
"""Generate image from latent code controlled by sliders"""
# Convert sliders to latent code
z = torch.tensor([sliders]).float().to('cuda')
# Decode
with torch.no_grad():
img = vae.decode(z)
# Convert to numpy
img = img[0].permute(1, 2, 0).cpu().numpy()
return (img * 255).astype('uint8')
# Create 64 sliders (one per latent dimension)
sliders = [gr.Slider(-3, 3, value=0, step=0.1, label=f"Z{i}") for i in range(64)]
interface = gr.Interface(
fn=explore_latent_space,
inputs=sliders,
outputs=gr.Image(label="Generated Image"),
title="VAE Latent Space Explorer",
description="Adjust sliders to explore the latent space!",
)
interface.launch()
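To let users start from a real photo instead of all-zero sliders, you can encode an uploaded image into initial slider values (a hypothetical helper, not part of the spec; wiring it into the UI, e.g. via gr.Blocks, is left to you):

def encode_to_sliders(image_np):
    """Encode a 64×64 RGB uint8 array of shape (H, W, 3) into 64 slider values."""
    x = torch.from_numpy(image_np).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        mu, _ = vae.encode(x.to('cuda'))
    return mu[0].cpu().tolist()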
import numpy as np

def compute_mig(vae, dataset, num_samples=10000):
    """
    Compute the Mutual Information Gap (MIG)
    Measures how well individual latent dimensions capture ground-truth factors
    """
    # Encode batches from a loader yielding (images, labels)
    latent_codes = []
    labels = []  # Ground-truth factors (e.g., pose, lighting)
    with torch.no_grad():
        for images, label in dataset:
            mu, _ = vae.encode(images)
            latent_codes.append(mu.cpu().numpy())
            labels.append(label.numpy())
    latent_codes = np.concatenate(latent_codes)
    labels = np.concatenate(labels)
    # Estimate mutual information between each latent dimension and each
    # ground-truth factor, then take the normalized gap (one possible
    # estimator is sketched below)
    return mig_from_codes(latent_codes, labels)
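One simple estimator discretizes each latent dimension and uses scikit-learn's mutual_info_score; a sketch, assuming factors is an integer array of shape (N, num_factors):

from sklearn.metrics import mutual_info_score

def mig_from_codes(latent_codes, factors, num_bins=20):
    """MIG: for each factor, the gap between the two most informative latent
    dimensions, normalized by the factor's entropy, averaged over factors."""
    # Discretize continuous latent codes for MI estimation
    binned = np.zeros_like(latent_codes, dtype=int)
    for j in range(latent_codes.shape[1]):
        edges = np.histogram(latent_codes[:, j], bins=num_bins)[1]
        binned[:, j] = np.digitize(latent_codes[:, j], edges[1:-1])
    gaps = []
    for k in range(factors.shape[1]):
        mi = np.array([mutual_info_score(factors[:, k], binned[:, j])
                       for j in range(latent_codes.shape[1])])
        entropy = mutual_info_score(factors[:, k], factors[:, k])  # H(v_k) = I(v_k; v_k)
        top2 = np.sort(mi)[-2:]
        gaps.append((top2[1] - top2[0]) / entropy)
    return float(np.mean(gaps))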
Best for Disentanglement Analysis:
Best for Visual Quality:
| Hyperparameter | Value | Notes |
|---|---|---|
| Latent dim | 32-128 | Depends on data complexity |
| β (beta) | 1-16 | Higher β → more disentanglement |
| Learning rate | 1e-4 | Adam optimizer |
| Batch size | 64-128 | Larger batches give more stable gradients |
| Reconstruction loss | MSE or BCE | BCE for binary images |
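For the Day 12-13 β-VAE experiments, a simple sweep over β values, reusing the training loop above (train_vae and evaluate_psnr are hypothetical wrappers you would write):

for beta in [1.0, 2.0, 4.0, 8.0, 16.0]:
    model = VAE(latent_dim=64).to(device)
    train_vae(model, dataloader, beta=beta)        # hypothetical wrapper around the training loop
    print(beta, evaluate_psnr(model, val_loader))  # expect reconstruction PSNR to fall as β grows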
Reference repositories: pytorch/examples, DLR-RM/stable-baselines3

Required Deliverables:
Deadline: 2 weeks from project start
Demo Website:
LinkedIn/Resume:
"Built interactive latent space exploration system using Variational Autoencoders. Achieved 23.5 dB PSNR reconstruction quality with 5 disentangled semantic dimensions for controllable image generation."
Good luck! VAEs are the foundation of modern generative AI, and this project will give you a deep understanding of latent representations.
Related Projects: