By completing this activity, you will:
Understand Deep Q-Networks (DQN) and neural network function approximation
Implement convolutional neural networks (CNNs) for visual state processing
Master experience replay for stable training
Apply target networks to reduce training instability
Train an agent to play Atari Pong using raw pixels
Analyze learning curves and convergence in deep RL
Open in Google Colab : Upload this notebook to Google Colab
Enable GPU : Click Runtime -> Change runtime type -> Select T4 GPU
Run All Cells : Click Runtime -> Run all (or press Ctrl+F9)
Watch the Magic : You'll see:
✅ Pong environment visualization
✅ Random agent baseline (loses every game)
✅ DQN architecture diagram
✅ Experience replay buffer in action
✅ Training progress with GPU acceleration
Expected First Run Time : ~90 seconds (with GPU)
The template comes with 65% working code :
✅ Pong Environment : Atari Pong-v5 with pixel preprocessing
✅ Random Baseline : Agent that loses every game (score: -21)
✅ CNN Architecture : 3 convolutional layers + 2 fully connected layers
✅ Experience Replay Buffer : Stores (s, a, r, s', done) tuples
✅ Target Network : Separate network for stable Q-targets
✅ Training Loop Framework : Episode management, reward tracking
✅ Visualization Tools : Learning curves, Q-value plots, gameplay recording
⚠️ TODO 1 : Implement DQN forward pass (Medium)
⚠️ TODO 2 : Implement experience replay sampling (Easy)
⚠️ TODO 3 : Implement loss calculation and backpropagation (Hard)
⚠️ TODO 4 : Optimize hyperparameters for faster convergence
Location : Section 4 - "DQN Architecture"
Current State : Network structure exists but forward pass is incomplete
Your Task : Implement the CNN forward pass for visual state processing:
```text
Input: 84x84x4 grayscale frames (stacked)
  ↓ Conv1: 32 filters, 8x8, stride 4, ReLU
  ↓ Conv2: 64 filters, 4x4, stride 2, ReLU
  ↓ Conv3: 64 filters, 3x3, stride 1, ReLU
  ↓ Flatten
  ↓ FC1: 512 units, ReLU
  ↓ FC2: 2 units (Q-values for UP/DOWN actions)
```
Architecture Details :
Input shape: (batch, 4, 84, 84) - 4 stacked frames
Conv layers extract visual features (edges, ball, paddles)
Fully connected layers compute Q-values
Output: Q(s, UP) and Q(s, DOWN)
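Before wiring FC1, it helps to confirm the flattened feature size this stack implies (assuming no padding, as in the original DQN). A quick sanity check:

```python
def conv_out(size, kernel, stride):
    # Side length of a square conv output with no padding.
    return (size - kernel) // stride + 1

h = conv_out(conv_out(conv_out(84, 8, 4), 4, 2), 3, 1)  # 84 -> 20 -> 9 -> 7
print(64 * h * h)  # 3136 features feeding FC1
```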
Starter Code Provided :
```python
class DQN(nn.Module):
    def __init__(self, n_actions):
        super(DQN, self).__init__()
        # Define the conv and fully connected layers here (see the architecture above)

    def forward(self, x):
        # TODO 1: implement the forward pass
        pass
```
Success Criteria :
Testing Your Implementation :
```python
test_input = torch.randn(1, 4, 84, 84).to(device)
output = dqn(test_input)
assert output.shape == (1, 2), f"Expected (1, 2), got {output.shape}"
print(f"✅ Forward pass works! Q-values: {output.detach().cpu().numpy()}")
```
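If the shapes trip you up, here is one way the module could be wired. Treat it as a rough sketch of the architecture above with illustrative layer names (conv1 through fc2), not the graded solution:

```python
import torch
import torch.nn as nn

class DQNSketch(nn.Module):
    """Illustrative wiring of the architecture above; adapt names to the template."""
    def __init__(self, n_actions=2):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 512)
        self.fc2 = nn.Linear(512, n_actions)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.flatten(start_dim=1)   # (batch, 3136), batch dimension preserved
        x = torch.relu(self.fc1(x))
        return self.fc2(x)           # (batch, n_actions) Q-values
```

Note that the flatten keeps the batch dimension, which is exactly what the (1, 2) shape test above expects.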
Location : Section 5 - "Experience Replay"
Your Task : Sample random minibatches from replay buffer for training
Why Experience Replay?
Breaks correlation between consecutive samples
Reuses experiences multiple times (data efficiency)
Stabilizes training (reduces oscillation)
Requirements :
Sample batch_size random experiences from buffer
Return tensors: states, actions, rewards, next_states, dones
Handle edge case: buffer smaller than batch size
Starter Code Provided :
```python
def sample(self, batch_size):
    """
    Sample a random batch from the replay buffer.

    Returns:
        states: (batch, 4, 84, 84)
        actions: (batch,)
        rewards: (batch,)
        next_states: (batch, 4, 84, 84)
        dones: (batch,)
    """
    pass
```
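A minimal sketch of the method body, assuming the buffer stores (s, a, r, s', done) tuples in a list or deque called self.buffer (the attribute name is an assumption; use whatever the template defines):

```python
import random
import numpy as np
import torch

def sample(self, batch_size):
    # Never request more transitions than the buffer currently holds.
    batch_size = min(batch_size, len(self.buffer))
    batch = random.sample(self.buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    return (
        torch.as_tensor(np.stack(states), dtype=torch.float32),
        torch.as_tensor(actions, dtype=torch.int64),
        torch.as_tensor(rewards, dtype=torch.float32),
        torch.as_tensor(np.stack(next_states), dtype=torch.float32),
        torch.as_tensor(dones, dtype=torch.float32),
    )
```

Remember to move the returned tensors to the GPU (.to(device)) before feeding them to the network.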
Success Criteria :
Location : Section 6 - "Training Loop"
Your Task : Implement DQN loss (Temporal Difference error) and gradient descent
DQN Loss Function :
```text
L = E[(y - Q(s, a))²]

where:
    y = r + γ · max_a' Q_target(s', a')   [if not done]
    y = r                                 [if done]
```
Key Concepts :
Current Q-value : Q(s, a) from online network
Target Q-value : r + γ * max Q_target(s') from target network
TD Error : target - current
Loss : Mean squared error (MSE)
Requirements :
Use online network for current Q-values
Use target network for next Q-values (stability)
Handle terminal states (done=True -> no future reward)
Apply gradient clipping (prevent exploding gradients)
Starter Code Provided :
```python
def compute_loss(batch, dqn, target_dqn, gamma):
    """
    Compute the DQN loss.

    Args:
        batch: Sampled experiences
        dqn: Online network
        target_dqn: Target network
        gamma: Discount factor

    Returns:
        loss: MSE between Q(s, a) and the TD target
    """
    states, actions, rewards, next_states, dones = batch
    pass
```
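One possible shape for this function, assuming the batch tensors match the sample() signature above (a hedged sketch, not the reference solution):

```python
import torch
import torch.nn.functional as F

def compute_loss_sketch(batch, dqn, target_dqn, gamma):
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) from the online network for the actions actually taken.
    q_values = dqn(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets from the frozen target network (no gradient flow).
    with torch.no_grad():
        next_q = target_dqn(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)  # done -> no future reward

    return F.mse_loss(q_values, targets)
```

Gradient clipping then happens between loss.backward() and optimizer.step() in the training loop, not inside this function.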
Success Criteria :
Location : New section you'll create
Your Task : Find optimal hyperparameters for fastest learning on Pong
Hyperparameters to Tune :
Learning Rate : [1e-4, 5e-4, 1e-3, 5e-3]
Batch Size : [32, 64, 128]
Target Network Update : [1000, 2500, 5000, 10000] frames
Epsilon Decay : [10000, 20000, 50000] frames (ε=1.0 -> 0.1)
Replay Buffer Size : [10K, 50K, 100K]
Requirements :
Run experiments for 50K frames each
Track: final score, learning speed, stability
Create plots comparing configurations
Identify best configuration
Explain why certain parameters work better
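A minimal sketch of how the sweep could be organized; train_dqn and its return value are assumptions standing in for whatever your training loop exposes:

```python
# Hypothetical helper: train_dqn(...) runs one 50K-frame experiment and
# returns a list of per-episode scores. Replace it with your own entry point.
configs = [
    {"lr": 1e-4, "batch_size": 32, "target_update": 1000},
    {"lr": 5e-4, "batch_size": 64, "target_update": 5000},
    {"lr": 1e-3, "batch_size": 128, "target_update": 10000},
]

results = {}
for cfg in configs:
    scores = train_dqn(total_frames=50_000, **cfg)
    results[str(cfg)] = sum(scores[-10:]) / len(scores[-10:])  # mean of last 10 episodes

for name, final in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(final, name)
```

Sweeping one parameter at a time (holding the rest at the template defaults) keeps the number of 50K-frame runs manageable.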
Success Criteria :
Reduce overestimation bias by separating action selection and evaluation:
Select action using online network: argmax_a Q(s',a)
Evaluate action using target network: Q_target(s', argmax_a Q(s',a))
Compare learning curves: DQN vs Double DQN
Does Double DQN learn faster or reach higher scores?
Expected Improvement : 10-20% higher final score, more stable learning
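In code, the only change from the standard target is how the next-state action is chosen. A sketch, reusing the tensor names from the loss sketch above:

```python
import torch

# dqn, target_dqn, next_states, rewards, dones, gamma as in the loss sketch above.
with torch.no_grad():
    next_actions = dqn(next_states).argmax(dim=1, keepdim=True)          # online net selects
    next_q = target_dqn(next_states).gather(1, next_actions).squeeze(1)  # target net evaluates
    targets = rewards + gamma * next_q * (1.0 - dones)
```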
Implement dueling architecture that separates value and advantage:
```text
Q(s, a) = V(s) + A(s, a) - mean_a'(A(s, a'))
```
Split network into value stream and advantage stream
Combine at output layer
Compare to standard DQN
Why does this help? (Hint: some states are bad regardless of action)
Expected Improvement : 15-25% faster convergence
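A sketch of how the two streams could replace the fully connected head of the CNN above (the 512-unit stream width mirrors FC1 and is illustrative):

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Illustrative dueling head: flattened conv features in, Q-values out."""
    def __init__(self, in_features=3136, n_actions=2):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_features, 512), nn.ReLU(),
                                   nn.Linear(512, 1))
        self.advantage = nn.Sequential(nn.Linear(in_features, 512), nn.ReLU(),
                                       nn.Linear(512, n_actions))

    def forward(self, features):
        v = self.value(features)                    # (batch, 1) state value
        a = self.advantage(features)                # (batch, n_actions) advantages
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a) = V(s) + A(s, a) - mean(A)
```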
Sample important experiences more frequently:
Assign priority based on TD error: priority = |δ|
Sample with probability: P(i) ∝ priority_i^α
Use importance sampling weights to correct bias
Compare to uniform sampling
Expected Improvement : 30-50% faster convergence, higher data efficiency
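A minimal sketch of proportional prioritized sampling, keeping priorities in a plain NumPy array (real implementations use a sum tree for speed, omitted here):

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    """Return sampled indices and importance-sampling weights."""
    probs = priorities ** alpha
    probs = probs / probs.sum()
    indices = np.random.choice(len(priorities), batch_size, p=probs)
    # Importance-sampling weights correct the bias of non-uniform sampling.
    weights = (len(priorities) * probs[indices]) ** (-beta)
    weights = weights / weights.max()
    return indices, weights
```

After each update, the sampled transitions' priorities are refreshed with their new |δ| values.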
Combine multiple DQN improvements:
Double DQN
Dueling architecture
Prioritized replay
Multi-step returns (n-step Q-learning)
Noisy networks (exploration)
Distributional RL (C51)
Expected Improvement : State-of-the-art Pong performance (``score >15``)
Score : -21 (loses every game 0-21)
Win Rate : 0%
Frames to First Point : Never scores
Score after 10K frames : -18 to -15 (starting to learn)
Score after 50K frames : -10 to -5 (competitive)
Score after 200K frames : +5 to +15 (superhuman)
Win Rate : 20-40% after 50K, 60-80% after 200K
Score after 50K frames : -5 to 0 (2x faster learning)
Score after 100K frames : +10 to +18
Convergence : Reaches +15 in ~150K frames (vs 200K baseline)
Score : 10-20% higher than standard DQN
Stability : Lower variance in learning curve
Score : +18 to +21 (near-perfect play)
Frames to Master : ~100K (half the time of standard DQN)
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Problem : GPU runs out of memory during training
Solution :
Reduce batch size from 64 to 32
Reduce replay buffer size from 100K to 50K
Clear GPU memory: torch.cuda.empty_cache()
Restart runtime if memory is still full
Problem : Loss explodes or becomes NaN
Solution :
Check gradient clipping is enabled (torch.nn.utils.clip_grad_norm_)
Reduce learning rate (try 1e-4 instead of 5e-4)
Verify reward clipping (rewards should be in [-1, 1])
Check for division by zero in loss calculation
Ensure target network is used (not online network for targets)
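For reference, a typical update step with clipping might look like this; optimizer, loss, and dqn are whatever names your training loop already uses:

```python
import torch

optimizer.zero_grad()
loss.backward()
# Cap the gradient norm before the step; 10.0 is a common choice, not a prescribed value.
torch.nn.utils.clip_grad_norm_(dqn.parameters(), max_norm=10.0)
optimizer.step()
```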
Problem : Agent's score is not improving
Solution :
Verify experience replay is working (check buffer fills up)
Ensure epsilon is decaying (agent should exploit more over time)
Check target network updates every N frames (not every step)
Verify loss is decreasing (plot loss curve)
Increase training frequency (update every 4 frames instead of 10)
Problem : Training is too slow
Solution :
Verify GPU is enabled in Colab (Runtime -> Change runtime type)
Move all tensors to GPU: .to(device)
Reduce frame preprocessing time (use vectorized operations)
Increase update interval (train every 10 frames instead of 4)
Reduce batch size to 32 (fewer computations)
Problem : Q-values barely change during training
Solution :
Check weight initialization (use Xavier or He initialization)
Verify forward pass implementation (ensure all layers are called)
Check that the learning rate is not too small (values below 1e-5 may stall learning)
Ensure gradient flow (print gradients to verify backprop works)
Problem : Agent gets stuck repeating the same action (not exploring)
Solution :
Increase epsilon (more exploration): epsilon_start=1.0, epsilon_end=0.1
Use longer epsilon decay: 50K frames instead of 20K
Initialize replay buffer with random experiences before training
Try different random seeds
Concept 03 : Deep Q-Networks (DQN theory and architecture)
Activity 02 : Q-Learning and Value Functions (tabular foundation)
Project 1 : Atari Game Playing Agent (full DQN implementation)
Mnih et al. (2015): "Human-level control through deep reinforcement learning" - original DQN paper
Van Hasselt et al. (2016): "Deep Reinforcement Learning with Double Q-learning"
Wang et al. (2016): "Dueling Network Architectures for Deep Reinforcement Learning"
Schaul et al. (2016): "Prioritized Experience Replay"
Hessel et al. (2018): "Rainbow: Combining Improvements in Deep Reinforcement Learning"
Complete required TODOs (minimum: TODO 1-3)
Train for at least 10K frames (should take ~5-10 minutes on GPU)
Run entire notebook to generate all outputs
Export learning curve : Save training plot as PNG
Download notebook : File -> Download -> Download .ipynb
Submit via portal : Upload .ipynb and learning_curve.png
Submission Checklist :
After mastering DQN:
Move to Project One: Atari Game Playing Agent
Implement policy gradient methods (Actor-Critic, PPO)
Explore multi-agent RL (competitive/cooperative environments)
Key Insight : DQN scales Q-Learning from 16 states (FrozenLake) to millions of states (Atari pixels). This is the power of deep learning + RL!
From Tabular to Deep :
Activity 02 : Q-table with 16 x 4 = 64 values
Activity 03 : Neural network with 2M parameters
State Space : 16 states -> 84x84x4 = 28,224 pixels
DQN doesn't memorize every state (impossible). Instead, it learns to generalize - similar visual patterns -> similar Q-values. This is why deep RL can play games it's never seen before!
Good luck! You're about to train an AI to play Atari games using only pixels and rewards. This is the same algorithm that started the deep RL revolution in 2015! 🎮