ℹ️ Definition Variational Autoencoders (VAEs) are generative models that learn a continuous latent space representation of data by combining an encoder (inference network) and decoder (generative network), trained using variational inference to maximize the Evidence Lower Bound (ELBO).
By the end of this lesson, you will understand how VAEs encode data as distributions, derive and implement the ELBO loss, and train a VAE (and a conditional VAE) in PyTorch.
In Lesson 9, we introduced generative models. Now we'll learn our first deep generative model: Variational Autoencoders (VAEs).
VAEs combine:
- An autoencoder architecture (encoder + decoder networks)
- Variational inference (a probabilistic, continuous latent space)
Result: A model that can generate new, realistic samples!
Standard autoencoder architecture (for comparison):
Input x → Encoder → Latent code z → Decoder → Reconstruction x̂
Training objective:
Minimize reconstruction error: ||x - x̂||²
Problem: Latent space is deterministic and discontinuous
Example:
Encode digit "3" → z₁ = [0.5, 0.2, ...]
Encode digit "8" → z₂ = [0.6, 0.3, ...]
What about z = [0.55, 0.25, ...]? (midpoint)
→ Might generate garbage, not a valid digit!
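For reference, a minimal deterministic autoencoder of the kind described above might look like the following sketch (layer sizes are chosen to match the VAE code later in this lesson):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """Plain autoencoder: each input maps to a single latent point (no distribution)."""
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),            # deterministic code z
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)       # single point in latent space
        return self.decoder(z)    # reconstruction x̂

# Trained by minimizing ||x - x̂||², e.g. with nn.MSELoss()
```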

VAE insight: Encode data into a probability distribution in latent space, not a single point!
Architecture:
Encoder: x → μ, σ (distribution parameters)
Latent: z ~ N(μ, σ²) (sample from distribution)
Decoder: z → x̂ (reconstruct)
Benefit: Smooth, continuous latent space
Goal: Learn a generative model P(x)
Latent variable model:
P(x) = ∫ P(x|z) P(z) dz
Where:
- z is the latent variable
- P(z) is the prior over latent codes, typically N(0, I)
- P(x|z) is the likelihood, parameterized by the decoder
Challenge: Computing P(x) requires an intractable integral over all z!
VAE solution: Use variational inference to approximate
Encoder learns: Q(z|x) ≈ P(z|x)
Parametrization:
Q(z|x) = N(μ(x), σ²(x))
Where:
- μ(x) and σ²(x) are outputs of the encoder network
- In practice the encoder outputs log σ² for numerical stability
Example (MNIST):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 400)
        self.fc_mu = nn.Linear(400, latent_dim)      # Mean
        self.fc_logvar = nn.Linear(400, latent_dim)  # Log variance

    def forward(self, x):
        h = F.relu(self.fc1(x))
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar
```

Decoder learns: P(x|z)
Parametrization:
P(x|z) = Bernoulli(π(z)) # For binary data
P(x|z) = N(μ(z), σ²) # For continuous data
Example (MNIST):
```python
class Decoder(nn.Module):
    def __init__(self, latent_dim=20, output_dim=784):
        super().__init__()
        self.fc1 = nn.Linear(latent_dim, 400)
        self.fc2 = nn.Linear(400, output_dim)

    def forward(self, z):
        h = F.relu(self.fc1(z))
        recon = torch.sigmoid(self.fc2(h))  # Bernoulli parameter π(z)
        return recon
```
Naive sampling:
z ~ N(μ, σ²)
Issue: Sampling is not differentiable! Can't backpropagate through random sampling.
Reparameterize:
z = μ + σ * ε, where ε ~ N(0, 1)
Why this works:
- The randomness is isolated in ε, which does not depend on the model parameters
- z is a deterministic, differentiable function of μ and σ, so gradients can flow back into the encoder
Implementation:
```python
def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)   # Convert log variance to standard deviation
    eps = torch.randn_like(std)     # Sample ε ~ N(0, I)
    z = mu + std * eps              # Reparameterized sample
    return z
```
Gradient flow:
```
Loss → Decoder → z → (μ, σ) → Encoder parameters ✓
                 ↑
                 ε ~ N(0, I)  (no gradient needed)
```
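As a quick sanity check (not part of the lesson's model code), you can verify that gradients reach μ and log σ² through a reparameterized sample:

```python
import torch

mu = torch.zeros(3, requires_grad=True)
logvar = torch.zeros(3, requires_grad=True)

std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)    # randomness isolated here; receives no gradient
z = mu + std * eps             # differentiable w.r.t. mu and logvar

z.sum().backward()
print(mu.grad)       # tensor([1., 1., 1.]) -> gradient flows to μ
print(logvar.grad)   # equals 0.5 * eps here -> gradient flows to log σ²
```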

Goal: Maximize log P(x)
Variational inference:
log P(x) = ELBO + KL(Q(z|x) || P(z|x))
≥ ELBO (since KL ≥ 0)
Evidence Lower Bound (ELBO):
```
ELBO = E_Q[log P(x|z)]  -  KL(Q(z|x) || P(z))
             ↑                      ↑
       Reconstruction         Regularization
      (decoder quality)  (latent prior matching)
```
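For completeness, here is a short derivation of this decomposition (standard variational-inference algebra, written in LaTeX):

```latex
\begin{aligned}
\log P(x)
  &= \mathbb{E}_{Q(z|x)}\big[\log P(x)\big]
   = \mathbb{E}_{Q}\!\left[\log \frac{P(x|z)\,P(z)}{P(z|x)}\right] \\
  &= \mathbb{E}_{Q}\!\left[\log \frac{P(x|z)\,P(z)}{Q(z|x)}\right]
   + \mathbb{E}_{Q}\!\left[\log \frac{Q(z|x)}{P(z|x)}\right] \\
  &= \underbrace{\mathbb{E}_{Q}\big[\log P(x|z)\big]
     - \mathrm{KL}\big(Q(z|x)\,\|\,P(z)\big)}_{\text{ELBO}}
   + \underbrace{\mathrm{KL}\big(Q(z|x)\,\|\,P(z|x)\big)}_{\ge\,0}
\end{aligned}
```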
1. Reconstruction Term
E_Q[log P(x|z)] = E_ε[log P(x | z = μ + σ·ε)], estimated in practice with a single sample of ε per data point
For binary data (MNIST):
Reconstruction loss = Binary Cross-Entropy(x, x̂)
For continuous data:
Reconstruction loss = MSE(x, x̂)
2. KL Divergence Term
KL(Q(z|x) || P(z)) = KL(N(μ,σ²) || N(0,I))
Closed-form solution:
KL = -0.5 * Σ(1 + log(σ²) - μ² - σ²)   (sum over latent dimensions)
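You can check this closed form numerically against torch.distributions (a small verification sketch, not part of the lesson code):

```python
import torch
from torch.distributions import Normal, kl_divergence

# Compare the closed-form KL for diagonal Gaussians with PyTorch's reference implementation
mu = torch.randn(20)
logvar = torch.randn(20)
std = torch.exp(0.5 * logvar)

closed_form = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
reference = kl_divergence(Normal(mu, std), Normal(torch.zeros(20), torch.ones(20))).sum()

print(torch.allclose(closed_form, reference, atol=1e-4))  # expected: True
```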
VAE loss (negative ELBO):
Loss = Reconstruction Loss + KL Divergence
PyTorch implementation:
```python
def vae_loss(x, recon_x, mu, logvar):
    # Reconstruction loss (BCE for binary data)
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # KL divergence: -0.5 * Σ(1 + log σ² - μ² - σ²)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Total loss = negative ELBO
    return recon_loss + kl_loss
```

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.latent_dim = latent_dim  # needed by sample()
        # Encoder
        self.encoder_fc1 = nn.Linear(input_dim, hidden_dim)
        self.encoder_mu = nn.Linear(hidden_dim, latent_dim)
        self.encoder_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.decoder_fc1 = nn.Linear(latent_dim, hidden_dim)
        self.decoder_fc2 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.encoder_fc1(x))
        mu = self.encoder_mu(h)
        logvar = self.encoder_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.decoder_fc1(z))
        recon = torch.sigmoid(self.decoder_fc2(h))
        return recon

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        recon_x = self.decode(z)
        return recon_x, mu, logvar

    def sample(self, num_samples):
        # Sample from the prior N(0, I) and decode
        z = torch.randn(num_samples, self.latent_dim)
        samples = self.decode(z)
        return samples
```
```python
# Training loop (assumes train_loader yields batches of MNIST images in [0, 1];
# see the DataLoader sketch below)
model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for batch_idx, (data, _) in enumerate(train_loader):
        optimizer.zero_grad()
        # Forward pass
        recon_batch, mu, logvar = model(data)
        # Compute loss
        loss = vae_loss(data.view(-1, 784), recon_batch, mu, logvar)
        # Backward pass
        loss.backward()
        optimizer.step()
```
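The loop above assumes a `train_loader` over MNIST. A typical setup (my assumption, not shown in the original) would be:

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# MNIST pixels land in [0, 1] with ToTensor(), so binary cross-entropy is a valid reconstruction loss
train_dataset = datasets.MNIST(root='./data', train=True, download=True,
                               transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
```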
Method 1: Sample from prior
```python
z = torch.randn(num_samples, latent_dim)
generated_images = vae.decode(z)
```
Method 2: Interpolation
```python
import numpy as np

# Encode two images
mu1, _ = vae.encode(image1)
mu2, _ = vae.encode(image2)

# Interpolate in latent space
alpha = np.linspace(0, 1, num=10)
interpolated_codes = [(1 - a) * mu1 + a * mu2 for a in alpha]

# Decode the interpolated codes
interpolated_images = [vae.decode(z) for z in interpolated_codes]
```
Famous example (faces): arithmetic on mean latent codes, where attribute vectors are differences of class means:
```
man_with_glasses = man + (glasses_faces - average_face)
woman_smiling    = woman + (smiling_faces - neutral_faces)
```
MNIST example:
```python
# Average latent codes for each digit (see the sketch below for a runnable version)
z_3 = mean(latent_codes_for_digit_3)
z_8 = mean(latent_codes_for_digit_8)

# Arithmetic
z_blend = 0.5 * z_3 + 0.5 * z_8
blended_digit = vae.decode(z_blend)  # might look like a 3/8 hybrid
```
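A runnable version of the pseudocode above might compute the class means over a labeled loader; this sketch assumes the trained `model` and `train_loader` from earlier:

```python
import torch

@torch.no_grad()
def class_mean_latents(model, loader, num_classes=10):
    """Average the encoder means μ(x) for each digit class."""
    sums = torch.zeros(num_classes, model.latent_dim)
    counts = torch.zeros(num_classes)
    for data, labels in loader:
        mu, _ = model.encode(data.view(-1, 784))
        for c in range(num_classes):
            mask = labels == c
            sums[c] += mu[mask].sum(dim=0)
            counts[c] += mask.sum()
    return sums / counts.unsqueeze(1)

means = class_mean_latents(model, train_loader)
z_blend = 0.5 * means[3] + 0.5 * means[8]
blended_digit = model.decode(z_blend.unsqueeze(0))  # might look like a 3/8 hybrid
```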
Explore individual latent dimensions:
```python
z = torch.zeros(1, latent_dim)

# Vary one dimension while holding the others at 0
for value in np.linspace(-3, 3, num=10):
    z[0, 5] = value            # vary dimension 5
    image = vae.decode(z)
    # Display image - see how dimension 5 affects the output
```
Observation: Some dimensions control specific attributes (rotation, thickness, style)
Standard VAE: Latent dimensions may be entangled (correlated factors)
β-VAE: Encourage disentanglement (independent factors)
β-VAE loss:
Loss = Reconstruction Loss + β * KL Divergence
Where:
- β = 1 recovers the standard VAE
- β > 1 puts more weight on the KL term, encouraging disentangled (but often blurrier) reconstructions

Typical β values: 1-10 (a loss sketch follows this list)
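A minimal β-VAE loss function, reusing the imports and the two loss terms from `vae_loss` above (β = 4 here is just an illustrative default, not a value from the lesson):

```python
def beta_vae_loss(x, recon_x, mu, logvar, beta=4.0):
    # Same two terms as vae_loss, with the KL term weighted by β
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction='sum')
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl_loss
```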
Interpretable latent dimensions: with β > 1, individual dimensions tend to align with single factors of variation (e.g., rotation, stroke thickness, or style on MNIST)
Applications: controllable generation, interpretable representations, easier transfer to downstream tasks
Standard VAE: Can't control what to generate
CVAE: Condition generation on labels/attributes
Encoder:
Q(z | x, y) # Condition on both image x and label y
Decoder:
P(x | z, y) # Generate image conditioned on label y
```python
class CVAE(nn.Module):
    def __init__(self, input_dim=784, label_dim=10, latent_dim=20):
        super().__init__()
        self.latent_dim = latent_dim  # needed by sample()
        # Encoder (input = image + label)
        self.encoder_fc1 = nn.Linear(input_dim + label_dim, 400)
        self.encoder_mu = nn.Linear(400, latent_dim)
        self.encoder_logvar = nn.Linear(400, latent_dim)
        # Decoder (input = latent + label)
        self.decoder_fc1 = nn.Linear(latent_dim + label_dim, 400)
        self.decoder_fc2 = nn.Linear(400, input_dim)

    def encode(self, x, y):
        # Concatenate image and label
        xy = torch.cat([x, y], dim=1)
        h = F.relu(self.encoder_fc1(xy))
        mu = self.encoder_mu(h)
        logvar = self.encoder_logvar(h)
        return mu, logvar

    def decode(self, z, y):
        # Concatenate latent code and label
        zy = torch.cat([z, y], dim=1)
        h = F.relu(self.decoder_fc1(zy))
        recon = torch.sigmoid(self.decoder_fc2(h))
        return recon

    def sample(self, y, num_samples):
        # Generate num_samples images of the class encoded by label y
        z = torch.randn(num_samples, self.latent_dim)
        y_repeated = y.repeat(num_samples, 1)
        samples = self.decode(z, y_repeated)
        return samples
```
Generate specific digits:
```python
# Generate 100 samples of the digit "7"
y = F.one_hot(torch.tensor([7]), num_classes=10).float()
samples = cvae.sample(y, num_samples=100)
```
Image generation: sample from the prior and decode to generate new faces, bedrooms, or objects
Style transfer: encode an image, shift its latent code along attribute directions (as in the latent arithmetic above), and decode
Anomaly detection: use the reconstruction error as an anomaly score:
```
If ||x - x̂||² > threshold → flag x as an anomaly
```
Applications: fraud detection, manufacturing defect detection, medical screening
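A minimal anomaly-scoring sketch under these assumptions (the trained `model` from earlier; the threshold value is hypothetical and should be tuned on held-out normal data):

```python
import torch

@torch.no_grad()
def reconstruction_errors(model, x):
    """Per-sample squared reconstruction error ||x - x̂||²."""
    x = x.view(-1, 784)
    recon_x, _, _ = model(x)
    return ((x - recon_x) ** 2).sum(dim=1)

errors = reconstruction_errors(model, data)   # data: a batch of images
threshold = 100.0                              # hypothetical; choose from normal validation data
anomalies = errors > threshold                 # boolean mask of flagged samples
```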
Compression: compress images to a latent code (20-200 dimensions) and reconstruct with the decoder
Feature learning: train the VAE without labels, then use the encoder mean μ(x) as a feature vector for downstream tasks (see the sketch below)
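One common recipe (a sketch under my assumptions: the trained `model`, the `train_loader` from earlier, and scikit-learn for the downstream classifier):

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(model, loader):
    feats, labels = [], []
    for data, y in loader:
        mu, _ = model.encode(data.view(-1, 784))   # use the encoder mean μ(x) as features
        feats.append(mu)
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

X_train, y_train = extract_features(model, train_loader)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```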
| Parameter | Typical Value | Effect |
|---|---|---|
| Latent dimension | 10-200 | Smaller: compressed; Larger: expressive |
| Learning rate | 1e-3 to 1e-4 | Standard for VAEs |
| β (for β-VAE) | 1-10 | Higher: more disentanglement |
| Hidden layers | 1-3 layers | Deeper: more expressive |
| Hidden units | 128-512 | Larger: more capacity |