Practice and reinforce the concepts from Lesson 3
In this activity, you'll implement a Deep Q-Network (DQN) agent to solve CartPole-v1, transitioning from tabular Q-Learning to function approximation with neural networks. You'll build experience replay and target networks, and train the agent until it reliably solves the environment.
By completing this activity, you will:
Download the activity template from the Templates folder:
AI25-Template-activity-03-deep-q-networks.zip (Templates/AI25-Template-activity-03-deep-q-networks.zip)
Upload activity-03-deep-q-networks.ipynb to Google Colab.
Execute the first few cells to:
CartPole-v1:
Challenge: the state space is continuous, so you can't use a Q-table; you need function approximation.
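To see why, inspect the environment. A quick sanity check, assuming the Gymnasium API (adjust the import if your notebook uses classic gym):

import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)  # Box of 4 continuous values: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): push left or push right

With four real-valued state variables there is no finite set of states to index a Q-table with, which is why this activity replaces the table with a neural network.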
TODO 1: Complete the DQN network forward pass
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, state):
        # TODO 1: Complete the forward pass
        # Hint: Use F.relu() for activations
        pass
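For reference after you've made your own attempt, one way the completed forward pass can look (same layer names as the template; this is a sketch, not the official solution, and the class name is made up so it doesn't clash with yours):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DQNSolutionSketch(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))   # hidden layer 1 + ReLU
        x = F.relu(self.fc2(x))       # hidden layer 2 + ReLU
        return self.fc3(x)            # raw Q-values, one per action

Note there is no activation on the output layer: Q-values are unbounded estimates of return, not probabilities.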
TODO 2: Implement the replay buffer
You'll implement the add() method and the sample() method:

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = []
        self.capacity = capacity

    def add(self, state, action, reward, next_state, done):
        # TODO 2: Add experience to buffer
        # Hint: Use deque or list with max capacity
        pass

    def sample(self, batch_size):
        # TODO 2: Sample random batch from buffer
        # Return: states, actions, rewards, next_states, dones
        pass
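If you get stuck, here is a minimal reference sketch using collections.deque, whose maxlen evicts old experiences automatically (the class name is made up to avoid clashing with the template's ReplayBuffer):

from collections import deque
import random
import numpy as np

class ReplayBufferSketch:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old experiences are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions),
                np.array(rewards, dtype=np.float32),
                np.array(next_states), np.array(dones, dtype=np.float32))

    def __len__(self):
        return len(self.buffer)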
TODO 3: Implement the DQN update step
def update_dqn(q_network, target_network, replay_buffer, optimizer, batch_size, gamma):
    # TODO 3: Implement DQN update
    # 1. Sample batch from replay buffer
    # 2. Compute current Q-values for taken actions
    # 3. Compute target Q-values using target network
    # 4. Compute loss and update q_network
    pass
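A sketch of the full update, assuming the buffer's sample() returns NumPy arrays as in the replay-buffer sketch above; treat it as one workable pattern rather than the expected solution:

import torch
import torch.nn.functional as F

def update_dqn_sketch(q_network, target_network, replay_buffer, optimizer, batch_size, gamma):
    if len(replay_buffer) < batch_size:
        return None  # wait until there is enough experience to sample

    # 1. Sample a batch and convert to tensors
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
    states = torch.FloatTensor(states)
    actions = torch.LongTensor(actions).unsqueeze(1)   # shape [batch, 1] for gather
    rewards = torch.FloatTensor(rewards)
    next_states = torch.FloatTensor(next_states)
    dones = torch.FloatTensor(dones)

    # 2. Current Q-values for the actions that were actually taken
    q_values = q_network(states).gather(1, actions).squeeze(1)

    # 3. Bootstrapped targets from the frozen target network (no gradients flow here)
    with torch.no_grad():
        next_q = target_network(next_states).max(dim=1)[0]
        targets = rewards + gamma * next_q * (1 - dones)   # no bootstrap past terminal states

    # 4. Regression loss and gradient step on the online network only
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()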
TODO 4: Implement target network updates
if episode % target_update_freq == 0:
    # TODO 4: Update target network
    pass
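The simplest completion is a hard copy of the online network's weights every target_update_freq episodes; a soft (Polyak) update is a common alternative (the tau value below is just an illustrative assumption):

# Hard update: copy all weights from the online network
if episode % target_update_freq == 0:
    target_network.load_state_dict(q_network.state_dict())

# Alternative: soft update, typically done every step instead of every few episodes
# tau = 0.005
# for t_param, param in zip(target_network.parameters(), q_network.parameters()):
#     t_param.data.copy_(tau * param.data + (1 - tau) * t_param.data)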
The main training loop is mostly complete; you'll add:
Dashboards showing:
Episodes 0-50: Random exploration, reward ~20-30
Episodes 50-150: Learning begins, reward increases to ~100
Episodes 150-300: Rapid improvement, reward approaches 200
Episodes 300+: Solved! Consistent 200+ reward
CartPole-v1 is considered "solved" when:
Average reward ≥ 195 over 100 consecutive episodes
Your DQN agent should achieve this in 200-400 episodes.
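One way to check this criterion during training; it assumes you append each episode's total reward to a list named episode_rewards (that name, and the episode counter, are assumptions, not part of the template):

import numpy as np

# episode_rewards: list of per-episode returns you maintain in the training loop (assumed name)
if len(episode_rewards) >= 100:
    avg_100 = np.mean(episode_rewards[-100:])
    if avg_100 >= 195:
        print(f"Solved at episode {episode}: mean reward {avg_100:.1f} over the last 100 episodes")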
Your implementation is complete when:
Why this architecture works:
# Network
learning_rate = 0.001
hidden_size = 128
# Training
batch_size = 64
gamma = 0.99 # Discount factor
replay_buffer_size = 10000
# Exploration
epsilon_start = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995
# Target network
target_update_freq = 10 # episodes
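The exploration settings above are typically paired with epsilon-greedy action selection and a multiplicative decay per episode; a sketch using the names from the config block (not the template's exact code):

import random
import torch

def select_action(q_network, state, epsilon, action_dim):
    if random.random() < epsilon:
        return random.randrange(action_dim)              # explore: random action
    with torch.no_grad():
        q_values = q_network(torch.FloatTensor(state))
        return int(q_values.argmax().item())             # exploit: greedy action

epsilon = epsilon_start
# ... then at the end of each episode:
epsilon = max(epsilon_min, epsilon * epsilon_decay)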
Problem: Agent doesn't learn (reward stays ~20-30)
Check that the replay buffer is actually filling up (print len(replay_buffer)) and that gradients are flowing (print param.grad.norm()).
Problem: Training diverges (reward drops after improving)
Problem: Slow learning
# 1. Check network output shape
state = env.reset()  # if your notebook uses Gymnasium, unpack instead: state, _ = env.reset()
q_values = q_network(torch.FloatTensor(state))
print(f"Q-values shape: {q_values.shape}")  # Should be [2]: one Q-value per CartPole action
# 2. Check replay buffer
replay_buffer.add(state, action, reward, next_state, done)
print(f"Buffer size: {len(replay_buffer)}")
# 3. Check loss is decreasing
print(f"Episode {episode}, Loss: {loss.item():.4f}")
# 4. Check Q-values are updating
print(f"Q-values: {q_values.detach().numpy()}")
Implement Double DQN to reduce overestimation:
# Standard DQN target:
target = reward + gamma * target_network(next_state).max()
# Double DQN target:
best_action = q_network(next_state).argmax()
target = reward + gamma * target_network(next_state)[best_action]
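In the batched update this looks roughly as follows; a sketch where rewards, next_states, and dones are tensors as in the earlier update sketch, and the (1 - dones) factor stops bootstrapping at terminal states:

import torch

def double_dqn_targets(q_network, target_network, rewards, next_states, dones, gamma):
    # rewards, dones: float tensors of shape [batch]; next_states: float tensor [batch, state_dim]
    with torch.no_grad():
        best_actions = q_network(next_states).argmax(dim=1, keepdim=True)        # select actions with the online net
        next_q = target_network(next_states).gather(1, best_actions).squeeze(1)  # evaluate them with the target net
        return rewards + gamma * next_q * (1 - dones)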
Implement Dueling architecture:
# Split into value and advantage streams
value = self.value_stream(features)
advantage = self.advantage_stream(features)
q_values = value + (advantage - advantage.mean(dim=-1, keepdim=True))  # subtract the per-sample mean, not a scalar over the whole batch
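Put together as a module, a dueling head might look like this (a sketch; layer sizes mirror the template's DQN, and the mean advantage is subtracted per sample so that V and A remain identifiable):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingDQNSketch(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.feature = nn.Linear(state_dim, 128)
        self.value_stream = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage_stream = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, action_dim))

    def forward(self, state):
        features = F.relu(self.feature(state))
        value = self.value_stream(features)           # shape [..., 1]
        advantage = self.advantage_stream(features)   # shape [..., action_dim]
        # subtract the per-sample mean advantage so the decomposition is well defined
        return value + (advantage - advantage.mean(dim=-1, keepdim=True))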
Sample important experiences more frequently based on TD error magnitude.
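A minimal proportional-prioritization sketch, without the sum-tree or importance-sampling weights of the full algorithm (so it is simpler, slower, and slightly biased; all names here are made up for illustration):

import numpy as np
from collections import deque

class SimplePrioritizedBuffer:
    def __init__(self, capacity=10000, alpha=0.6):
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.alpha = alpha          # how strongly TD error influences sampling (0 = uniform)

    def add(self, transition, td_error=1.0):
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + 1e-5) ** self.alpha)

    def sample(self, batch_size):
        probs = np.array(self.priorities)
        probs /= probs.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[i] for i in indices], indices

    def update_priorities(self, indices, td_errors):
        # call after the update step with the new TD errors of the sampled transitions
        for i, err in zip(indices, td_errors):
            self.priorities[i] = (abs(err) + 1e-5) ** self.alpha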
Extend your DQN to play Pong from pixels:
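For pixel input you would swap the fully connected trunk for a convolutional one. A sketch in the spirit of the original Atari DQN, assuming preprocessed 84x84 grayscale frames stacked 4 deep (the preprocessing itself is not shown):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvDQNSketch(nn.Module):
    def __init__(self, action_dim, in_channels=4):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 512)   # 84x84 input -> 7x7x64 feature map
        self.fc2 = nn.Linear(512, action_dim)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return self.fc2(x)   # one Q-value per action, as before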
Completed Notebook: activity-03-deep-q-networks.ipynb
Performance Report: Brief summary including:
Visualizations:
After completing this activity:
In the next lesson, you'll learn a completely different approach: directly optimizing policies instead of value functions!
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy building your first deep RL agent! 🚀