ℹ️ Definition Variational Autoencoders (VAEs) are generative models that learn a continuous latent space representation of data by combining an encoder (inference network) and decoder (generative network), trained using variational inference to maximize the Evidence Lower Bound (ELBO).
By the end of this lesson, you will understand how VAEs encode data as distributions, derive and implement the ELBO loss, and train a VAE (and a conditional VAE) in PyTorch.
In Lesson 9, we introduced generative models. Now we'll learn our first deep generative model: Variational Autoencoders (VAEs).
VAEs combine:
- An autoencoder architecture (encoder + decoder networks)
- Variational inference (a probabilistic, continuous latent space)
Result: A model that can generate new, realistic samples!
Standard autoencoder architecture (for comparison):
Input x → Encoder → Latent code z → Decoder → Reconstruction x̂
Training objective:
Minimize reconstruction error: ||x - x̂||²
Problem: Latent space is deterministic and discontinuous
Example:
Encode digit "3" → z₁ = [0.5, 0.2, ...]
Encode digit "8" → z₂ = [0.6, 0.3, ...]
What about z = [0.55, 0.25, ...]? (midpoint)
→ Might generate garbage, not a valid digit!
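For reference, a minimal deterministic autoencoder of the kind described above might look like the following sketch (layer sizes are chosen to match the VAE code later in this lesson):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """Plain autoencoder: each input maps to a single latent point (no distribution)."""
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),            # deterministic code z
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)       # single point in latent space
        return self.decoder(z)    # reconstruction x̂

# Trained by minimizing ||x - x̂||², e.g. with nn.MSELoss()
```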

VAE insight: Encode data into a probability distribution in latent space, not a single point!
Architecture:
Encoder: x → μ, σ (distribution parameters)
Latent: z ~ N(μ, σ²) (sample from distribution)
Decoder: z → x̂ (reconstruct)
Benefit: Smooth, continuous latent space
Goal: Learn a generative model P(x)
Latent variable model:
P(x) = ∫ P(x|z) P(z) dz
Where:
- z is the latent variable
- P(z) is the prior over latent codes, typically N(0, I)
- P(x|z) is the likelihood, parameterized by the decoder
Challenge: Computing P(x) requires an intractable integral over all z!
VAE solution: Use variational inference to approximate
Encoder learns: Q(z|x) ≈ P(z|x)
Parametrization:
Q(z|x) = N(μ(x), σ²(x))
Where:
- μ(x) and σ²(x) are outputs of the encoder network
- In practice the encoder outputs log σ² for numerical stability
Example (MNIST):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 400)
        self.fc_mu = nn.Linear(400, latent_dim)      # Mean
        self.fc_logvar = nn.Linear(400, latent_dim)  # Log variance

    def forward(self, x):
        h = F.relu(self.fc1(x))
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar
```

Decoder learns: P(x|z)
Parametrization:
P(x|z) = Bernoulli(π(z)) # For binary data
P(x|z) = N(μ(z), σ²) # For continuous data
Example (MNIST):
```python
class Decoder(nn.Module):
    def __init__(self, latent_dim=20, output_dim=784):
        super().__init__()
        self.fc1 = nn.Linear(latent_dim, 400)
        self.fc2 = nn.Linear(400, output_dim)

    def forward(self, z):
        h = F.relu(self.fc1(z))
        recon = torch.sigmoid(self.fc2(h))  # Bernoulli parameter π(z)
        return recon
```
Naive sampling:
z ~ N(μ, σ²)
Issue: Sampling is not differentiable! Can't backpropagate through random sampling.
Reparameterize:
z = μ + σ * ε, where ε ~ N(0, 1)
Why this works:
- The randomness is isolated in ε, which does not depend on the model parameters
- z is a deterministic, differentiable function of μ and σ, so gradients can flow back into the encoder
Implementation:
```python
def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)   # Convert log variance to standard deviation
    eps = torch.randn_like(std)     # Sample ε ~ N(0, I)
    z = mu + std * eps              # Reparameterized sample
    return z
```
Gradient flow:
```
Loss → Decoder → z → (μ, σ) → Encoder parameters ✓
                 ↑
                 ε ~ N(0, I)  (no gradient needed)
```
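As a quick sanity check (not part of the lesson's model code), you can verify that gradients reach μ and log σ² through a reparameterized sample:

```python
import torch

mu = torch.zeros(3, requires_grad=True)
logvar = torch.zeros(3, requires_grad=True)

std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)    # randomness isolated here; receives no gradient
z = mu + std * eps             # differentiable w.r.t. mu and logvar

z.sum().backward()
print(mu.grad)       # tensor([1., 1., 1.]) -> gradient flows to μ
print(logvar.grad)   # equals 0.5 * eps here -> gradient flows to log σ²
```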

Goal: Maximize log P(x)
Variational inference:
log P(x) = ELBO + KL(Q(z|x) || P(z|x))
≥ ELBO (since KL ≥ 0)
Evidence Lower Bound (ELBO):
```
ELBO = E_Q[log P(x|z)]  -  KL(Q(z|x) || P(z))
             ↑                      ↑
       Reconstruction         Regularization
      (decoder quality)  (latent prior matching)
```
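For completeness, here is a short derivation of this decomposition (standard variational-inference algebra, written in LaTeX):

```latex
\begin{aligned}
\log P(x)
  &= \mathbb{E}_{Q(z|x)}\big[\log P(x)\big]
   = \mathbb{E}_{Q}\!\left[\log \frac{P(x|z)\,P(z)}{P(z|x)}\right] \\
  &= \mathbb{E}_{Q}\!\left[\log \frac{P(x|z)\,P(z)}{Q(z|x)}\right]
   + \mathbb{E}_{Q}\!\left[\log \frac{Q(z|x)}{P(z|x)}\right] \\
  &= \underbrace{\mathbb{E}_{Q}\big[\log P(x|z)\big]
     - \mathrm{KL}\big(Q(z|x)\,\|\,P(z)\big)}_{\text{ELBO}}
   + \underbrace{\mathrm{KL}\big(Q(z|x)\,\|\,P(z|x)\big)}_{\ge\,0}
\end{aligned}
```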
1. Reconstruction Term
E_Q[log P(x|z)] = E_ε[log P(x | z = μ + σ·ε)], estimated in practice with a single sample of ε per data point
For binary data (MNIST):
Reconstruction loss = Binary Cross-Entropy(x, x̂)
For continuous data:
Reconstruction loss = MSE(x, x̂)
2. KL Divergence Term
KL(Q(z|x) || P(z)) = KL(N(μ,σ²) || N(0,I))
Closed-form solution:
KL = -0.5 * Σ(1 + log(σ²) - μ² - σ²)   (sum over latent dimensions)
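You can check this closed form numerically against torch.distributions (a small verification sketch, not part of the lesson code):

```python
import torch
from torch.distributions import Normal, kl_divergence

# Compare the closed-form KL for diagonal Gaussians with PyTorch's reference implementation
mu = torch.randn(20)
logvar = torch.randn(20)
std = torch.exp(0.5 * logvar)

closed_form = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
reference = kl_divergence(Normal(mu, std), Normal(torch.zeros(20), torch.ones(20))).sum()

print(torch.allclose(closed_form, reference, atol=1e-4))  # expected: True
```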
VAE loss (negative ELBO):
Loss = Reconstruction Loss + KL Divergence
PyTorch implementation:
```python
def vae_loss(x, recon_x, mu, logvar):
    # Reconstruction loss (BCE for binary data)
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction='sum')
    # KL divergence: -0.5 * Σ(1 + log σ² - μ² - σ²)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Total loss = negative ELBO
    return recon_loss + kl_loss
```

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        self.latent_dim = latent_dim  # needed by sample()
        # Encoder
        self.encoder_fc1 = nn.Linear(input_dim, hidden_dim)
        self.encoder_mu = nn.Linear(hidden_dim, latent_dim)
        self.encoder_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder
        self.decoder_fc1 = nn.Linear(latent_dim, hidden_dim)
        self.decoder_fc2 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.encoder_fc1(x))
        mu = self.encoder_mu(h)
        logvar = self.encoder_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.decoder_fc1(z))
        recon = torch.sigmoid(self.decoder_fc2(h))
        return recon

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        recon_x = self.decode(z)
        return recon_x, mu, logvar

    def sample(self, num_samples):
        # Sample from the prior N(0, I) and decode
        z = torch.randn(num_samples, self.latent_dim)
        samples = self.decode(z)
        return samples
```
```python
# Training loop (assumes train_loader yields batches of MNIST images in [0, 1];
# see the DataLoader sketch below)
model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for batch_idx, (data, _) in enumerate(train_loader):
        optimizer.zero_grad()
        # Forward pass
        recon_batch, mu, logvar = model(data)
        # Compute loss
        loss = vae_loss(data.view(-1, 784), recon_batch, mu, logvar)
        # Backward pass
        loss.backward()
        optimizer.step()
```
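The loop above assumes a `train_loader` over MNIST. A typical setup (my assumption, not shown in the original) would be:

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# MNIST pixels land in [0, 1] with ToTensor(), so binary cross-entropy is a valid reconstruction loss
train_dataset = datasets.MNIST(root='./data', train=True, download=True,
                               transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
```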
Method 1: Sample from prior
```python
z = torch.randn(num_samples, latent_dim)
generated_images = vae.decode(z)
```
Method 2: Interpolation
```python
import numpy as np

# Encode two images
mu1, _ = vae.encode(image1)
mu2, _ = vae.encode(image2)

# Interpolate in latent space
alpha = np.linspace(0, 1, num=10)
interpolated_codes = [(1 - a) * mu1 + a * mu2 for a in alpha]

# Decode the interpolated codes
interpolated_images = [vae.decode(z) for z in interpolated_codes]
```
Famous example (faces): arithmetic on mean latent codes, where attribute vectors are differences of class means:
```
man_with_glasses = man + (glasses_faces - average_face)
woman_smiling    = woman + (smiling_faces - neutral_faces)
```
MNIST example:
```python
# Average latent codes for each digit (see the sketch below for a runnable version)
z_3 = mean(latent_codes_for_digit_3)
z_8 = mean(latent_codes_for_digit_8)

# Arithmetic
z_blend = 0.5 * z_3 + 0.5 * z_8
blended_digit = vae.decode(z_blend)  # might look like a 3/8 hybrid
```
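A runnable version of the pseudocode above might compute the class means over a labeled loader; this sketch assumes the trained `model` and `train_loader` from earlier:

```python
import torch

@torch.no_grad()
def class_mean_latents(model, loader, num_classes=10):
    """Average the encoder means μ(x) for each digit class."""
    sums = torch.zeros(num_classes, model.latent_dim)
    counts = torch.zeros(num_classes)
    for data, labels in loader:
        mu, _ = model.encode(data.view(-1, 784))
        for c in range(num_classes):
            mask = labels == c
            sums[c] += mu[mask].sum(dim=0)
            counts[c] += mask.sum()
    return sums / counts.unsqueeze(1)

means = class_mean_latents(model, train_loader)
z_blend = 0.5 * means[3] + 0.5 * means[8]
blended_digit = model.decode(z_blend.unsqueeze(0))  # might look like a 3/8 hybrid
```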
Explore individual latent dimensions:
```python
z = torch.zeros(1, latent_dim)

# Vary one dimension while holding the others at 0
for value in np.linspace(-3, 3, num=10):
    z[0, 5] = value            # vary dimension 5
    image = vae.decode(z)
    # Display image - see how dimension 5 affects the output
```
Observation: Some dimensions control specific attributes (rotation, thickness, style)
Standard VAE: Latent dimensions may be entangled (correlated factors)
β-VAE: Encourage disentanglement (independent factors)
β-VAE loss:
Loss = Reconstruction Loss + β * KL Divergence
Where:
- β = 1 recovers the standard VAE
- β > 1 puts more weight on the KL term, encouraging disentangled (but often blurrier) reconstructions

Typical β values: 1-10 (a loss sketch follows this list)
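A minimal β-VAE loss function, reusing the imports and the two loss terms from `vae_loss` above (β = 4 here is just an illustrative default, not a value from the lesson):

```python
def beta_vae_loss(x, recon_x, mu, logvar, beta=4.0):
    # Same two terms as vae_loss, with the KL term weighted by β
    recon_loss = F.binary_cross_entropy(recon_x, x, reduction='sum')
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl_loss
```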
Interpretable latent dimensions: with β > 1, individual dimensions tend to align with single factors of variation (e.g., rotation, stroke thickness, or style on MNIST)
Applications: controllable generation, interpretable representations, easier transfer to downstream tasks
Standard VAE: Can't control what to generate
CVAE: Condition generation on labels/attributes
Encoder:
Q(z | x, y) # Condition on both image x and label y
Decoder:
P(x | z, y) # Generate image conditioned on label y
```python
class CVAE(nn.Module):
    def __init__(self, input_dim=784, label_dim=10, latent_dim=20):
        super().__init__()
        self.latent_dim = latent_dim  # needed by sample()
        # Encoder (input = image + label)
        self.encoder_fc1 = nn.Linear(input_dim + label_dim, 400)
        self.encoder_mu = nn.Linear(400, latent_dim)
        self.encoder_logvar = nn.Linear(400, latent_dim)
        # Decoder (input = latent + label)
        self.decoder_fc1 = nn.Linear(latent_dim + label_dim, 400)
        self.decoder_fc2 = nn.Linear(400, input_dim)

    def encode(self, x, y):
        # Concatenate image and label
        xy = torch.cat([x, y], dim=1)
        h = F.relu(self.encoder_fc1(xy))
        mu = self.encoder_mu(h)
        logvar = self.encoder_logvar(h)
        return mu, logvar

    def decode(self, z, y):
        # Concatenate latent code and label
        zy = torch.cat([z, y], dim=1)
        h = F.relu(self.decoder_fc1(zy))
        recon = torch.sigmoid(self.decoder_fc2(h))
        return recon

    def sample(self, y, num_samples):
        # Generate num_samples images of the class encoded by label y
        z = torch.randn(num_samples, self.latent_dim)
        y_repeated = y.repeat(num_samples, 1)
        samples = self.decode(z, y_repeated)
        return samples
```
Generate specific digits:
```python
# Generate 100 samples of the digit "7"
y = F.one_hot(torch.tensor([7]), num_classes=10).float()
samples = cvae.sample(y, num_samples=100)
```
Image generation: sample from the prior and decode to generate new faces, bedrooms, or objects
Style transfer: encode an image, shift its latent code along attribute directions (as in the latent arithmetic above), and decode
Anomaly detection: use the reconstruction error as an anomaly score:
```
If ||x - x̂||² > threshold → flag x as an anomaly
```
Applications: fraud detection, manufacturing defect detection, medical screening
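A minimal anomaly-scoring sketch under these assumptions (the trained `model` from earlier; the threshold value is hypothetical and should be tuned on held-out normal data):

```python
import torch

@torch.no_grad()
def reconstruction_errors(model, x):
    """Per-sample squared reconstruction error ||x - x̂||²."""
    x = x.view(-1, 784)
    recon_x, _, _ = model(x)
    return ((x - recon_x) ** 2).sum(dim=1)

errors = reconstruction_errors(model, data)   # data: a batch of images
threshold = 100.0                              # hypothetical; choose from normal validation data
anomalies = errors > threshold                 # boolean mask of flagged samples
```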
Compression: compress images to a latent code (20-200 dimensions) and reconstruct with the decoder
Feature learning: train the VAE without labels, then use the encoder mean μ(x) as a feature vector for downstream tasks (see the sketch below)
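One common recipe (a sketch under my assumptions: the trained `model`, the `train_loader` from earlier, and scikit-learn for the downstream classifier):

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(model, loader):
    feats, labels = [], []
    for data, y in loader:
        mu, _ = model.encode(data.view(-1, 784))   # use the encoder mean μ(x) as features
        feats.append(mu)
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

X_train, y_train = extract_features(model, train_loader)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```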
| Parameter | Typical Value | Effect |
|---|---|---|
| Latent dimension | 10-200 | Smaller: compressed; Larger: expressive |
| Learning rate | 1e-3 to 1e-4 | Standard for VAEs |
| β (for β-VAE) | 1-10 | Higher: more disentanglement |
| Hidden layers | 1-3 layers | Deeper: more expressive |
| Hidden units | 128-512 | Larger: more capacity |