ℹ️ Definition Advanced GAN Architectures extend the basic GAN framework with innovations in network design, training procedures, and loss functions to achieve photorealistic image generation, style control, and robust training stability.
By the end of this lesson, you will:
In Lesson 11, we learned basic GANs - powerful but challenging to train. Advanced GAN architectures solve key problems:
These innovations enable photorealistic face generation, artistic style transfer, and more.
Problem: Training GANs on high-resolution images (1024×1024) is unstable
Why difficult:
Idea: Start with low resolution, gradually increase
Training stages:
Stage 1: 4×4 resolution (train until stable)
Stage 2: 8×8 resolution (add layers, continue training)
Stage 3: 16×16 resolution
...
Stage N: 1024×1024 resolution
Benefits:
Generator:
Latent z → 4×4 block → 8×8 block → 16×16 block → ... → 1024×1024
           (the blocks after 4×4 are gradually added during training)
Each block:
Input → Upsample (2×) → Conv → Conv → Output
Discriminator: Mirror structure (downsampling instead of upsampling)
Problem: Abruptly adding layers can destabilize training
Solution: Gradually fade in new layers
# α = 0: Only use lower resolution
# α = 1: Fully use higher resolution
# α ∈ (0,1): Blend both resolutions
def fade_in(alpha, low_res, high_res):
    return alpha * high_res + (1 - alpha) * low_res

# During transition phase, gradually increase α from 0 to 1
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveGenerator(nn.Module):
    def __init__(self, latent_dim=512):
        super().__init__()
        # Initial 4×4 block
        self.initial = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0),
            nn.LeakyReLU(0.2),
            nn.Conv2d(512, 512, 3, 1, 1),
            nn.LeakyReLU(0.2)
        )
        # Progressive blocks (8×8, 16×16, 32×32, ...)
        self.blocks = nn.ModuleList([
            self._make_block(512, 512),  # 8×8
            self._make_block(512, 512),  # 16×16
            self._make_block(512, 256),  # 32×32
            self._make_block(256, 128),  # 64×64
            # ... more blocks
        ])
        # Output layers (RGB conversion), one for each resolution
        self.to_rgb = nn.ModuleList([
            nn.Conv2d(512, 3, 1),  # 4×4
            nn.Conv2d(512, 3, 1),  # 8×8
            # ... one for each resolution
        ])

    def _make_block(self, in_channels, out_channels):
        return nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(in_channels, out_channels, 3, 1, 1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(out_channels, out_channels, 3, 1, 1),
            nn.LeakyReLU(0.2)
        )
    def forward(self, z, stage, alpha):
        # stage: current resolution stage (0 = 4×4, 1 = 8×8, ...)
        # alpha: fade-in weight (0 to 1)
        x = self.initial(z.view(-1, z.size(1), 1, 1))
        if stage == 0:
            return self.to_rgb[0](x)

        # Apply all blocks except the newest one
        for i in range(stage - 1):
            x = self.blocks[i](x)

        # Keep the previous-resolution features, then apply the newest block
        x_prev = x
        x = self.blocks[stage - 1](x)

        # Fade-in logic: blend the new block's RGB output with the
        # upsampled RGB output of the previous resolution
        if alpha < 1.0:
            rgb_prev = F.interpolate(self.to_rgb[stage - 1](x_prev), scale_factor=2)
            rgb_curr = self.to_rgb[stage](x)
            return alpha * rgb_curr + (1 - alpha) * rgb_prev

        return self.to_rgb[stage](x)
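A quick sanity check of the fade-in behavior (hypothetical usage, limited to stage 1 because the sketch above only defines two to_rgb heads):

gen = ProgressiveGenerator()
z = torch.randn(8, 512)
# Midway through the fade-in of stage 1 (8×8)
imgs = gen(z, stage=1, alpha=0.5)
print(imgs.shape)  # torch.Size([8, 3, 8, 8])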
Example training schedule for 1024×1024:
Stage 0 (4×4): Train for 600k images
Stage 1 (8×8): Train for 600k images (300k fade-in + 300k stable)
Stage 2 (16×16): Train for 600k images
Stage 3 (32×32): Train for 600k images
Stage 4 (64×64): Train for 600k images
Stage 5 (128×128): Train for 600k images
Stage 6 (256×256): Train for 600k images
Stage 7 (512×512): Train for 600k images
Stage 8 (1024×1024): Train for 600k images
Total: ~5.4M images shown to the networks (9 stages × 600k)
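A hedged sketch of how this schedule can drive training, assuming a hypothetical train_step(batch, stage, alpha) update function and a data_loader iterator (neither is defined above):

images_per_stage = 600_000
num_stages = 9  # 4×4 up to 1024×1024

for stage in range(num_stages):
    seen = 0
    while seen < images_per_stage:
        if stage == 0:
            alpha = 1.0  # the initial 4×4 stage has no fade-in
        else:
            # α ramps from 0 to 1 over the first 300k images of the stage
            alpha = min(1.0, seen / (images_per_stage / 2))
        batch = next(data_loader)        # hypothetical data source
        train_step(batch, stage, alpha)  # hypothetical G/D update
        seen += batch.size(0)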
Problem: Traditional generators lack fine-grained control
Desired: Control different aspects independently
1. Style-Based Generator
2. Adaptive Instance Normalization (AdaIN)
3. Mapping Network
Latent z → Mapping Network (8 FC layers) → w
↓
Constant 4×4 → Block (style w₁) → 8×8
↓
Block (style w₂) → 16×16
↓
Block (style w₃) → 32×32
↓
... → 1024×1024
Each style injection point controls different features.
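As a concrete illustration, here is a minimal sketch of the mapping network, assuming 8 fully connected layers with LeakyReLU and a 512-dimensional latent space as in the diagram above (layer sizes are illustrative):

import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, latent_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Maps z (sampled from a Gaussian) to the intermediate latent w
        return self.net(z)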
AdaIN formula:
AdaIN(x, y) = σ_y * (x - μ_x) / σ_x + μ_y
Where μ_x, σ_x are the per-channel mean and standard deviation of the content features x, and μ_y, σ_y are the style scale and shift derived from the latent code w.
Effect: Normalize features, then apply learned style
Implementation:
class AdaIN(nn.Module):
    def __init__(self, in_channels, latent_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(in_channels)
        # Learn scale and bias from latent code
        self.style = nn.Linear(latent_dim, in_channels * 2)

    def forward(self, x, w):
        style = self.style(w).unsqueeze(2).unsqueeze(3)
        scale, bias = style.chunk(2, dim=1)  # Split into scale and bias
        normalized = self.norm(x)
        return scale * normalized + bias
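A quick shape check with hypothetical values: styling a batch of 512-channel 8×8 feature maps with a 512-dimensional latent code leaves the spatial shape unchanged.

adain = AdaIN(in_channels=512, latent_dim=512)
x = torch.randn(4, 512, 8, 8)   # feature maps
w = torch.randn(4, 512)         # intermediate latent codes
print(adain(x, w).shape)        # torch.Size([4, 512, 8, 8])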
Powerful feature: Mix styles from different sources
Example:
# Generate two latent codes
w1 = mapping_network(z1)
w2 = mapping_network(z2)
# Use w1 for coarse layers, w2 for fine layers
# Result: Coarse structure from z1, fine details from z2
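One way this could look in code, sketched under the assumption of a synthesis network whose forward pass accepts one w per style-injection layer (synthesis and num_layers are illustrative names, not a specific library API):

crossover = 4  # layers 0..3 use w1 (coarse), later layers use w2 (fine)
styles = [w1 if i < crossover else w2 for i in range(num_layers)]
mixed_image = synthesis(styles)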
Applications:
Add stochastic variation:
Implementation:
# Add noise to feature maps
noise = torch.randn(batch_size, 1, height, width)
features = features + noise_scale * noise
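A minimal sketch of this as a reusable layer with a learned per-channel noise scale (names are illustrative, not StyleGAN's exact implementation):

import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Learned scaling factor per feature channel, initialized to zero
        self.scale = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, features):
        b, _, h, w = features.shape
        noise = torch.randn(b, 1, h, w, device=features.device)
        return features + self.scale * noise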
Problem: Standard GAN loss can cause vanishing gradients and mode collapse
Why: Jensen-Shannon divergence is not ideal for distributions with non-overlapping support
Intuition: the Wasserstein distance measures the minimum "cost" of moving probability mass to transform one distribution into another
Properties:
Standard GAN:
min_G max_D E[log D(x)] + E[log(1 - D(G(z)))]
WGAN:
min_G max_D E[D(x)] - E[D(G(z))]
Key differences:
Problem: WGAN requires Lipschitz constraint on critic
Original WGAN solution: Weight clipping (clip weights to [-c, c])
WGAN-GP solution: Gradient penalty
λ * E[(||∇D(x̂)||₂ - 1)²]
Where x̂ is interpolated between real and fake samples
Implementation:
def gradient_penalty(critic, real_samples, fake_samples, device):
    batch_size = real_samples.size(0)
    # Random interpolation
    alpha = torch.rand(batch_size, 1, 1, 1).to(device)
    interpolates = (alpha * real_samples + (1 - alpha) * fake_samples).requires_grad_(True)
    # Critic scores
    d_interpolates = critic(interpolates)
    # Gradients
    gradients = torch.autograd.grad(
        outputs=d_interpolates,
        inputs=interpolates,
        grad_outputs=torch.ones_like(d_interpolates),
        create_graph=True,
        retain_graph=True,
    )[0]
    # Gradient penalty
    gradients = gradients.view(batch_size, -1)
    gradient_norm = gradients.norm(2, dim=1)
    penalty = ((gradient_norm - 1) ** 2).mean()
    return penalty

# Training (critic loss)
d_loss = -torch.mean(critic(real_images)) + torch.mean(critic(fake_images)) + \
         lambda_gp * gradient_penalty(critic, real_images, fake_images, device)
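Putting it together, a hedged sketch of one full WGAN-GP update, assuming critic, generator, their optimizers, latent_dim, device, and a dataloader are already defined; n_critic = 5 and lambda_gp = 10 are the values commonly used in the WGAN-GP paper:

n_critic = 5      # critic updates per generator update
lambda_gp = 10    # gradient penalty weight

for real_images in dataloader:
    real_images = real_images.to(device)

    # Critic updates
    for _ in range(n_critic):
        z = torch.randn(real_images.size(0), latent_dim, device=device)
        fake_images = generator(z).detach()
        d_loss = -torch.mean(critic(real_images)) + torch.mean(critic(fake_images)) \
                 + lambda_gp * gradient_penalty(critic, real_images, fake_images, device)
        critic_optimizer.zero_grad()
        d_loss.backward()
        critic_optimizer.step()

    # Generator update
    z = torch.randn(real_images.size(0), latent_dim, device=device)
    g_loss = -torch.mean(critic(generator(z)))
    generator_optimizer.zero_grad()
    g_loss.backward()
    generator_optimizer.step()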
Problem: Image-to-image translation usually requires paired data
Example - Horse to Zebra:
CycleGAN: Learn translation without paired data!
Idea: If we translate A → B → A, we should get back to the original A
Two generators: G: X → Y (e.g., horse → zebra) and F: Y → X (zebra → horse)
Cycle consistency loss:
L_cyc = E[||F(G(x)) - x||₁] + E[||G(F(y)) - y||₁]
Interpretation:
Total loss:
L = L_GAN(G) + L_GAN(F) + λ * L_cyc(G, F)
Where λ weights the cycle-consistency term against the two adversarial losses (the CycleGAN paper uses λ = 10)
Four networks: the generators G and F, plus the discriminators D_Y and D_X
Training:
# Note: torch.nn.functional is fully qualified below because F names the
# Y → X generator in this example

# Forward cycle: x → G(x) → F(G(x))
fake_y = G(x)
reconstructed_x = F(fake_y)
cycle_loss_x = torch.nn.functional.l1_loss(reconstructed_x, x)

# Backward cycle: y → F(y) → G(F(y))
fake_x = F(y)
reconstructed_y = G(fake_x)
cycle_loss_y = torch.nn.functional.l1_loss(reconstructed_y, y)

# Adversarial losses (generators try to fool the discriminators)
g_loss = adversarial_loss(D_Y(fake_y), real_labels)
f_loss = adversarial_loss(D_X(fake_x), real_labels)

# Total generator loss
total_loss = g_loss + f_loss + lambda_cyc * (cycle_loss_x + cycle_loss_y)
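The snippet above covers only the generator objective; here is a minimal sketch of the corresponding discriminator updates (fakes are detached so generator gradients do not flow through them; fake_labels is assumed to be defined alongside real_labels):

# Discriminator for domain Y: real y vs. generated G(x)
d_y_loss = adversarial_loss(D_Y(y), real_labels) + \
           adversarial_loss(D_Y(fake_y.detach()), fake_labels)

# Discriminator for domain X: real x vs. generated F(y)
d_x_loss = adversarial_loss(D_X(x), real_labels) + \
           adversarial_loss(D_X(fake_x.detach()), fake_labels)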
Style Transfer:
Object Transfiguration:
Domain Adaptation:
Problem: Image-to-image translation with paired training data
Examples:
1. Conditional GAN:
2. U-Net Generator:
3. PatchGAN Discriminator:
Combined loss:
L = L_GAN + λ * L_L1
Where L_GAN is the conditional adversarial loss, L_L1 = E[||y - G(x)||₁] is the pixel-wise reconstruction error against the ground truth, and λ balances the two (the pix2pix paper uses λ = 100)
Why L1 loss:
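A hedged sketch of how this objective could be computed for one batch, assuming generator, discriminator, adversarial_loss, real_labels, lambda_l1, and a paired (input_img, target_img) batch are already defined (all names are illustrative):

fake_img = generator(input_img)

# Adversarial term: the discriminator judges the (input, output) pair
gan_loss = adversarial_loss(discriminator(input_img, fake_img), real_labels)

# L1 reconstruction term against the ground-truth target
recon_loss = torch.nn.functional.l1_loss(fake_img, target_img)

# Combined generator loss: L_GAN + λ * L_L1
g_loss = gan_loss + lambda_l1 * recon_loss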
Architecture:
Encoder (downsample):
Input → Conv(64) → Conv(128) → Conv(256) → Conv(512) → Latent
Decoder (upsample):
Latent → ConvTranspose(512) → ConvTranspose(256) → ConvTranspose(128) → ConvTranspose(64) → Output
(skip connections link each encoder stage to the decoder stage at the same resolution)
Skip connections: Preserve spatial information
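A minimal sketch of this encoder/decoder structure with skip connections; the channel sizes follow the diagram above, while normalization and dropout details are omitted and the class name is illustrative:

import torch
import torch.nn as nn

class SmallUNetGenerator(nn.Module):
    def __init__(self, in_channels=3, out_channels=3):
        super().__init__()
        # Encoder: each step halves the spatial resolution
        self.enc1 = nn.Sequential(nn.Conv2d(in_channels, 64, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc4 = nn.Sequential(nn.Conv2d(256, 512, 4, 2, 1), nn.LeakyReLU(0.2))
        # Decoder: each step doubles the resolution; the doubled input channel
        # counts come from concatenating the matching encoder features
        self.dec4 = nn.Sequential(nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU())
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(256 + 256, 128, 4, 2, 1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(128 + 128, 64, 4, 2, 1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64 + 64, out_channels, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d4 = self.dec4(e4)
        # Skip connections: concatenate decoder features with encoder features
        d3 = self.dec3(torch.cat([d4, e3], dim=1))
        d2 = self.dec2(torch.cat([d3, e2], dim=1))
        return self.dec1(torch.cat([d2, e1], dim=1))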
Idea: Classify N×N patches instead of the entire image
Benefits:
Implementation:
class PatchGANDiscriminator(nn.Module):
    def __init__(self, input_channels=6):  # 3 (input) + 3 (output)
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(input_channels, 64, 4, 2, 1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, 2, 1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 4, 1, 1),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1, 4, 1, 1)  # Output: N×N patch predictions
        )

    def forward(self, input_img, target_img):
        x = torch.cat([input_img, target_img], dim=1)
        return self.model(x)
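A quick shape check with hypothetical inputs: with this layer stack, a 256×256 input/target pair produces a 30×30 grid of patch scores, each covering a local receptive field rather than the whole image.

disc = PatchGANDiscriminator()
input_img = torch.randn(1, 3, 256, 256)
target_img = torch.randn(1, 3, 256, 256)
print(disc(input_img, target_img).shape)  # torch.Size([1, 1, 30, 30])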
| Architecture | Key Feature | Use Case | Training Difficulty |
|---|---|---|---|
| Progressive GAN | Gradual resolution increase | High-res generation (1024×1024) | Medium |
| StyleGAN | Style-based control | Controllable face generation | Medium |
| WGAN-GP | Wasserstein loss + gradient penalty | Stable training | Easy |
| CycleGAN | Cycle consistency | Unpaired translation | Medium |
| pix2pix | Paired translation + L1 loss | Paired translation | Easy |
| BigGAN | Large-scale training | High-quality ImageNet | Hard |
| StyleGAN2 | Improved StyleGAN | State-of-the-art faces | Medium |
For high-resolution generation:
For stable training:
For unpaired translation:
For paired translation:
For controllable generation:
1. Use spectral normalization (discriminator stability)
2. Self-attention layers (capture long-range dependencies)
3. Two time-scale update rule (TTUR): different learning rates for G and D
4. Exponential moving average (EMA) of generator weights
5. Progressive training for high resolutions