ℹ️ Definition Advanced GAN Architectures extend the basic GAN framework with innovations in network design, training procedures, and loss functions to achieve photorealistic image generation, style control, and robust training stability.
By the end of this lesson, you will:
In Lesson 11, we learned basic GANs - powerful but challenging to train. Advanced GAN architectures solve key problems:
These innovations enable photorealistic face generation, artistic style transfer, and more.
Problem: Training GANs on high-resolution images (1024×1024) is unstable
Why difficult:
Idea: Start with low resolution, gradually increase
Training stages:
Stage 1: 4×4 resolution (train until stable)
Stage 2: 8×8 resolution (add layers, continue training)
Stage 3: 16×16 resolution
...
Stage N: 1024×1024 resolution
Benefits:
Generator:
Latent z → 4×4 block → 8×8 block → 16×16 block → ... → 1024×1024
           (the blocks after 4×4 are gradually added during training)
Each block:
Input → Upsample (2×) → Conv → Conv → Output
Discriminator: Mirror structure (downsampling instead of upsampling)
Problem: Abruptly adding layers can destabilize training
Solution: Gradually fade in new layers
# α = 0: Only use lower resolution
# α = 1: Fully use higher resolution
# α ∈ (0,1): Blend both resolutions
def fade_in(alpha, low_res, high_res):
    return alpha * high_res + (1 - alpha) * low_res

# During transition phase, gradually increase α from 0 to 1
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveGenerator(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        # Initial 4×4 block
        self.initial = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0),
            nn.LeakyReLU(0.2),
            nn.Conv2d(512, 512, 3, 1, 1),
            nn.LeakyReLU(0.2)
        )
        # Progressive blocks (8×8, 16×16, 32×32, ...)
        self.blocks = nn.ModuleList([
            self._make_block(512, 512),  # 8×8
            self._make_block(512, 512),  # 16×16
            self._make_block(512, 256),  # 32×32
            self._make_block(256, 128),  # 64×64
            # ... more blocks
        ])
        # Output layers (RGB conversion), one for each resolution
        self.to_rgb = nn.ModuleList([
            nn.Conv2d(512, 3, 1),  # 4×4
            nn.Conv2d(512, 3, 1),  # 8×8
            # ... one for each resolution
        ])

    def _make_block(self, in_channels, out_channels):
        return nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(in_channels, out_channels, 3, 1, 1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(out_channels, out_channels, 3, 1, 1),
            nn.LeakyReLU(0.2)
        )
    def forward(self, z, stage, alpha):
        # stage: current resolution stage (0 = 4×4, 1 = 8×8, ...)
        # alpha: fade-in weight (0 to 1)
        x = self.initial(z.view(-1, z.size(1), 1, 1))
        if stage == 0:
            return self.to_rgb[0](x)

        # Apply all blocks except the newest one
        for i in range(stage - 1):
            x = self.blocks[i](x)

        # Keep the previous-resolution features, then apply the newest block
        x_prev = x
        x = self.blocks[stage - 1](x)

        # Fade-in logic: blend the new block's RGB output with the
        # upsampled RGB output of the previous resolution
        if alpha < 1.0:
            rgb_prev = F.interpolate(self.to_rgb[stage - 1](x_prev), scale_factor=2)
            rgb_curr = self.to_rgb[stage](x)
            return alpha * rgb_curr + (1 - alpha) * rgb_prev

        return self.to_rgb[stage](x)
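A quick sanity check of the fade-in behavior (hypothetical usage, limited to stage 1 because the sketch above only defines two to_rgb heads):

gen = ProgressiveGenerator()
z = torch.randn(8, 512)
# Midway through the fade-in of stage 1 (8×8)
imgs = gen(z, stage=1, alpha=0.5)
print(imgs.shape)  # torch.Size([8, 3, 8, 8])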
Example training schedule for 1024×1024:
Stage 0 (4×4): Train for 600k images
Stage 1 (8×8): Train for 600k images (300k fade-in + 300k stable)
Stage 2 (16×16): Train for 600k images
Stage 3 (32×32): Train for 600k images
Stage 4 (64×64): Train for 600k images
Stage 5 (128×128): Train for 600k images
Stage 6 (256×256): Train for 600k images
Stage 7 (512×512): Train for 600k images
Stage 8 (1024×1024): Train for 600k images
Total: ~5.4M images shown to the networks (9 stages × 600k)
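A hedged sketch of how this schedule can drive training, assuming a hypothetical train_step(batch, stage, alpha) update function and a data_loader iterator (neither is defined above):

images_per_stage = 600_000
num_stages = 9  # 4×4 up to 1024×1024

for stage in range(num_stages):
    seen = 0
    while seen < images_per_stage:
        if stage == 0:
            alpha = 1.0  # the initial 4×4 stage has no fade-in
        else:
            # α ramps from 0 to 1 over the first 300k images of the stage
            alpha = min(1.0, seen / (images_per_stage / 2))
        batch = next(data_loader)        # hypothetical data source
        train_step(batch, stage, alpha)  # hypothetical G/D update
        seen += batch.size(0)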
Problem: Traditional generators lack fine-grained control
Desired: Control different aspects independently
1. Style-Based Generator
2. Adaptive Instance Normalization (AdaIN)
3. Mapping Network
Latent z → Mapping Network (8 FC layers) → w
↓
Constant 4×4 → Block (style w₁) → 8×8
↓
Block (style w₂) → 16×16
↓
Block (style w₃) → 32×32
↓
... → 1024×1024
Each style injection point controls different features.
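As a concrete illustration, here is a minimal sketch of the mapping network, assuming 8 fully connected layers with LeakyReLU and a 512-dimensional latent space as in the diagram above (layer sizes are illustrative):

import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, latent_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Maps z (sampled from a Gaussian) to the intermediate latent w
        return self.net(z)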
AdaIN formula:
AdaIN(x, y) = σ_y * (x - μ_x) / σ_x + μ_y
Where μ_x, σ_x are the per-channel mean and standard deviation of the content features x, and μ_y, σ_y are the style scale and shift derived from the latent code w.
Effect: Normalize features, then apply learned style
Implementation:
class AdaIN(nn.Module):
    def __init__(self, in_channels, latent_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(in_channels)
        # Learn scale and bias from latent code
        self.style = nn.Linear(latent_dim, in_channels * 2)

    def forward(self, x, w):
        style = self.style(w).unsqueeze(2).unsqueeze(3)
        scale, bias = style.chunk(2, dim=1)  # Split into scale and bias
        normalized = self.norm(x)
        return scale * normalized + bias
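A quick shape check with hypothetical values: styling a batch of 512-channel 8×8 feature maps with a 512-dimensional latent code leaves the spatial shape unchanged.

adain = AdaIN(in_channels=512, latent_dim=512)
x = torch.randn(4, 512, 8, 8)   # feature maps
w = torch.randn(4, 512)         # intermediate latent codes
print(adain(x, w).shape)        # torch.Size([4, 512, 8, 8])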
Powerful feature: Mix styles from different sources
Example:
# Generate two latent codes
w1 = mapping_network(z1)
w2 = mapping_network(z2)
# Use w1 for coarse layers, w2 for fine layers
# Result: Coarse structure from z1, fine details from z2
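One way this could look in code, sketched under the assumption of a synthesis network whose forward pass accepts one w per style-injection layer (synthesis and num_layers are illustrative names, not a specific library API):

crossover = 4  # layers 0..3 use w1 (coarse), later layers use w2 (fine)
styles = [w1 if i < crossover else w2 for i in range(num_layers)]
mixed_image = synthesis(styles)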
Applications:
Add stochastic variation:
Implementation:
# Add noise to feature maps
noise = torch.randn(batch_size, 1, height, width)
features = features + noise_scale * noise
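A minimal sketch of this as a reusable layer with a learned per-channel noise scale (names are illustrative, not StyleGAN's exact implementation):

import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Learned scaling factor per feature channel, initialized to zero
        self.scale = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, features):
        b, _, h, w = features.shape
        noise = torch.randn(b, 1, h, w, device=features.device)
        return features + self.scale * noise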
Problem: Standard GAN loss can cause vanishing gradients and mode collapse
Why: Jensen-Shannon divergence is not ideal for distributions with non-overlapping support
Intuition: the Wasserstein distance measures the minimum "cost" of moving probability mass to transform one distribution into another
Properties:
Standard GAN:
min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]
WGAN:
min_G max_D E[D(x)] - E[D(G(z))]
Key differences:
Problem: WGAN requires Lipschitz constraint on critic
Original WGAN solution: Weight clipping (clip weights to [-c, c])
WGAN-GP solution: Gradient penalty
λ * E[(||∇D(x̂)||₂ - 1)²]
Where x̂ is interpolated between real and fake samples
Implementation:
def gradient_penalty(critic, real_samples, fake_samples, device):
    batch_size = real_samples.size(0)
    # Random interpolation
    alpha = torch.rand(batch_size, 1, 1, 1).to(device)
    interpolates = (alpha * real_samples + (1 - alpha) * fake_samples).requires_grad_(True)
    # Critic scores
    d_interpolates = critic(interpolates)
    # Gradients
    gradients = torch.autograd.grad(
        outputs=d_interpolates,
        inputs=interpolates,
        grad_outputs=torch.ones_like(d_interpolates),
        create_graph=True,
        retain_graph=True,
    )[0]
    # Gradient penalty
    gradients = gradients.view(batch_size, -1)
    gradient_norm = gradients.norm(2, dim=1)
    penalty = ((gradient_norm - 1) ** 2).mean()
    return penalty

# Training (critic loss)
d_loss = -torch.mean(critic(real_images)) + torch.mean(critic(fake_images)) + \
         lambda_gp * gradient_penalty(critic, real_images, fake_images, device)
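Putting it together, a hedged sketch of one full WGAN-GP update, assuming critic, generator, their optimizers, latent_dim, device, and a dataloader are already defined; n_critic = 5 and lambda_gp = 10 are the values commonly used in the WGAN-GP paper:

n_critic = 5      # critic updates per generator update
lambda_gp = 10    # gradient penalty weight

for real_images in dataloader:
    real_images = real_images.to(device)

    # Critic updates
    for _ in range(n_critic):
        z = torch.randn(real_images.size(0), latent_dim, device=device)
        fake_images = generator(z).detach()
        d_loss = -torch.mean(critic(real_images)) + torch.mean(critic(fake_images)) \
                 + lambda_gp * gradient_penalty(critic, real_images, fake_images, device)
        critic_optimizer.zero_grad()
        d_loss.backward()
        critic_optimizer.step()

    # Generator update
    z = torch.randn(real_images.size(0), latent_dim, device=device)
    g_loss = -torch.mean(critic(generator(z)))
    generator_optimizer.zero_grad()
    g_loss.backward()
    generator_optimizer.step()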
Problem: Image-to-image translation usually requires paired data
Example - Horse to Zebra:
CycleGAN: Learn translation without paired data!
Idea: If we translate A → B → A, we should get back to the original A
Two generators: G: X → Y (e.g., horse → zebra) and F: Y → X (zebra → horse)
Cycle consistency loss:
L_cyc = E[||F(G(x)) - x||₁] + E[||G(F(y)) - y||₁]
Interpretation:
Total loss:
L = L_GAN(G) + L_GAN(F) + λ * L_cyc(G, F)
Where λ weights the cycle-consistency term against the two adversarial losses (the CycleGAN paper uses λ = 10)
Four networks: the generators G and F, plus the discriminators D_Y and D_X
Training:
# Note: torch.nn.functional is fully qualified below because F names the
# Y → X generator in this example

# Forward cycle: x → G(x) → F(G(x))
fake_y = G(x)
reconstructed_x = F(fake_y)
cycle_loss_x = torch.nn.functional.l1_loss(reconstructed_x, x)

# Backward cycle: y → F(y) → G(F(y))
fake_x = F(y)
reconstructed_y = G(fake_x)
cycle_loss_y = torch.nn.functional.l1_loss(reconstructed_y, y)

# Adversarial losses (generators try to fool the discriminators)
g_loss = adversarial_loss(D_Y(fake_y), real_labels)
f_loss = adversarial_loss(D_X(fake_x), real_labels)

# Total generator loss
total_loss = g_loss + f_loss + lambda_cyc * (cycle_loss_x + cycle_loss_y)
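The snippet above covers only the generator objective; here is a minimal sketch of the corresponding discriminator updates (fakes are detached so generator gradients do not flow through them; fake_labels is assumed to be defined alongside real_labels):

# Discriminator for domain Y: real y vs. generated G(x)
d_y_loss = adversarial_loss(D_Y(y), real_labels) + \
           adversarial_loss(D_Y(fake_y.detach()), fake_labels)

# Discriminator for domain X: real x vs. generated F(y)
d_x_loss = adversarial_loss(D_X(x), real_labels) + \
           adversarial_loss(D_X(fake_x.detach()), fake_labels)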
Style Transfer:
Object Transfiguration:
Domain Adaptation:
Problem: Image-to-image translation with paired training data
Examples:
1. Conditional GAN:
2. U-Net Generator:
3. PatchGAN Discriminator:
Combined loss:
L = L_GAN + λ * L_L1
Where L_GAN is the conditional adversarial loss, L_L1 = E[||y - G(x)||₁] is the pixel-wise reconstruction error against the ground truth, and λ balances the two (the pix2pix paper uses λ = 100)
Why L1 loss:
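A hedged sketch of how this objective could be computed for one batch, assuming generator, discriminator, adversarial_loss, real_labels, lambda_l1, and a paired (input_img, target_img) batch are already defined (all names are illustrative):

fake_img = generator(input_img)

# Adversarial term: the discriminator judges the (input, output) pair
gan_loss = adversarial_loss(discriminator(input_img, fake_img), real_labels)

# L1 reconstruction term against the ground-truth target
recon_loss = torch.nn.functional.l1_loss(fake_img, target_img)

# Combined generator loss: L_GAN + λ * L_L1
g_loss = gan_loss + lambda_l1 * recon_loss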
Architecture:
Encoder (downsample):
Input → Conv(64) → Conv(128) → Conv(256) → Conv(512) → Latent
Decoder (upsample):
Latent → ConvTranspose(512) → ConvTranspose(256) → ConvTranspose(128) → ConvTranspose(64) → Output
(skip connections link each encoder stage to the decoder stage at the same resolution)
Skip connections: Preserve spatial information
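A minimal sketch of this encoder/decoder structure with skip connections; the channel sizes follow the diagram above, while normalization and dropout details are omitted and the class name is illustrative:

import torch
import torch.nn as nn

class SmallUNetGenerator(nn.Module):
    def __init__(self, in_channels=3, out_channels=3):
        super().__init__()
        # Encoder: each step halves the spatial resolution
        self.enc1 = nn.Sequential(nn.Conv2d(in_channels, 64, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc4 = nn.Sequential(nn.Conv2d(256, 512, 4, 2, 1), nn.LeakyReLU(0.2))
        # Decoder: each step doubles the resolution; the doubled input channel
        # counts come from concatenating the matching encoder features
        self.dec4 = nn.Sequential(nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU())
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(256 + 256, 128, 4, 2, 1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(128 + 128, 64, 4, 2, 1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64 + 64, out_channels, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d4 = self.dec4(e4)
        # Skip connections: concatenate decoder features with encoder features
        d3 = self.dec3(torch.cat([d4, e3], dim=1))
        d2 = self.dec2(torch.cat([d3, e2], dim=1))
        return self.dec1(torch.cat([d2, e1], dim=1))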
Idea: Classify N×N patches instead of the entire image
Benefits:
Implementation:
class PatchGANDiscriminator(nn.Module):
    def __init__(self, input_channels=6):  # 3 (input) + 3 (output)
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(input_channels, 64, 4, 2, 1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, 2, 1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 4, 1, 1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1, 4, 1, 1)  # Output: N×N patch predictions
        )

    def forward(self, input_img, target_img):
        x = torch.cat([input_img, target_img], dim=1)
        return self.model(x)
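A quick shape check with hypothetical inputs: with this layer stack, a 256×256 input/target pair produces a 30×30 grid of patch scores, each covering a local receptive field rather than the whole image.

disc = PatchGANDiscriminator()
input_img = torch.randn(1, 3, 256, 256)
target_img = torch.randn(1, 3, 256, 256)
print(disc(input_img, target_img).shape)  # torch.Size([1, 1, 30, 30])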
| Architecture | Key Feature | Use Case | Training Difficulty |
|---|---|---|---|
| Progressive GAN | Gradual resolution increase | High-res generation (1024×1024) | Medium |
| StyleGAN | Style-based control | Controllable face generation | Medium |
| WGAN-GP | Wasserstein loss + gradient penalty | Stable training | Easy |
| CycleGAN | Cycle consistency | Unpaired translation | Medium |
| pix2pix | Paired translation + L1 loss | Paired translation | Easy |
| BigGAN | Large-scale training | High-quality ImageNet | Hard |
| StyleGAN2 | Improved StyleGAN | State-of-the-art faces | Medium |
For high-resolution generation:
For stable training:
For unpaired translation:
For paired translation:
For controllable generation:
1. Use spectral normalization (discriminator stability)
2. Self-attention layers (capture long-range dependencies)
3. Two time-scale update rule (TTUR): different learning rates for G and D
4. Exponential moving average (EMA) of generator weights
5. Progressive training for high resolutions