By completing this activity, you will:
Understand Deep Q-Networks (DQN) and neural network function approximation
Implement convolutional neural networks (CNNs) for visual state processing
Master experience replay for stable training
Apply target networks to reduce training instability
Train an agent to play Atari Pong using raw pixels
Analyze learning curves and convergence in deep RL
Open in Google Colab : Upload this notebook to Google Colab
Enable GPU : Click Runtime -> Change runtime type -> Select T4 GPU
Run All Cells : Click Runtime -> Run all (or press Ctrl+F9)
Watch the Magic : You'll see:
✅ Pong environment visualization
✅ Random agent baseline (loses every game)
✅ DQN architecture diagram
✅ Experience replay buffer in action
✅ Training progress with GPU acceleration
Expected First Run Time : ~90 seconds (with GPU)
The template comes with 65% working code :
✅ Pong Environment : Atari Pong-v5 with pixel preprocessing
✅ Random Baseline : Agent that loses every game (score: -21)
✅ CNN Architecture : 3 convolutional layers + 2 fully connected layers
✅ Experience Replay Buffer : Stores (s, a, r, s', done) tuples
✅ Target Network : Separate network for stable Q-targets
✅ Training Loop Framework : Episode management, reward tracking
✅ Visualization Tools : Learning curves, Q-value plots, gameplay recording
⚠️ TODO 1 : Implement DQN forward pass (Medium)
⚠️ TODO 2 : Implement experience replay sampling (Easy)
⚠️ TODO 3 : Implement loss calculation and backpropagation (Hard)
⚠️ TODO 4 : Optimize hyperparameters for faster convergence
Location : Section 4 - "DQN Architecture"
Current State : Network structure exists but forward pass is incomplete
Your Task : Implement the CNN forward pass for visual state processing:
```text
Input: 84x84x4 grayscale frames (stacked)
  ↓ Conv1: 32 filters, 8x8, stride 4, ReLU
  ↓ Conv2: 64 filters, 4x4, stride 2, ReLU
  ↓ Conv3: 64 filters, 3x3, stride 1, ReLU
  ↓ Flatten
  ↓ FC1: 512 units, ReLU
  ↓ FC2: 2 units (Q-values for UP/DOWN actions)
```
Architecture Details :
Input shape: (batch, 4, 84, 84) - 4 stacked frames
Conv layers extract visual features (edges, ball, paddles)
Fully connected layers compute Q-values
Output: Q(s, UP) and Q(s, DOWN)
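Before wiring FC1, it helps to confirm the flattened feature size this stack implies (assuming no padding, as in the original DQN). A quick sanity check:

```python
def conv_out(size, kernel, stride):
    # Side length of a square conv output with no padding.
    return (size - kernel) // stride + 1

h = conv_out(conv_out(conv_out(84, 8, 4), 4, 2), 3, 1)  # 84 -> 20 -> 9 -> 7
print(64 * h * h)  # 3136 features feeding FC1
```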
Starter Code Provided :
```python
class DQN(nn.Module):
    def __init__(self, n_actions):
        super(DQN, self).__init__()
        # Define the conv and fully connected layers here (see the architecture above)

    def forward(self, x):
        # TODO 1: implement the forward pass
        pass
```
Success Criteria :
Testing Your Implementation :
```python
test_input = torch.randn(1, 4, 84, 84).to(device)
output = dqn(test_input)
assert output.shape == (1, 2), f"Expected (1, 2), got {output.shape}"
print(f"✅ Forward pass works! Q-values: {output.detach().cpu().numpy()}")
```
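If the shapes trip you up, here is one way the module could be wired. Treat it as a rough sketch of the architecture above with illustrative layer names (conv1 through fc2), not the graded solution:

```python
import torch
import torch.nn as nn

class DQNSketch(nn.Module):
    """Illustrative wiring of the architecture above; adapt names to the template."""
    def __init__(self, n_actions=2):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 512)
        self.fc2 = nn.Linear(512, n_actions)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.flatten(start_dim=1)   # (batch, 3136), batch dimension preserved
        x = torch.relu(self.fc1(x))
        return self.fc2(x)           # (batch, n_actions) Q-values
```

Note that the flatten keeps the batch dimension, which is exactly what the (1, 2) shape test above expects.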
Location : Section 5 - "Experience Replay"
Your Task : Sample random minibatches from replay buffer for training
Why Experience Replay?
Breaks correlation between consecutive samples
Reuses experiences multiple times (data efficiency)
Stabilizes training (reduces oscillation)
Requirements :
Sample batch_size random experiences from buffer
Return tensors: states, actions, rewards, next_states, dones
Handle edge case: buffer smaller than batch size
Starter Code Provided :
```python
def sample(self, batch_size):
    """
    Sample a random batch from the replay buffer.

    Returns:
        states: (batch, 4, 84, 84)
        actions: (batch,)
        rewards: (batch,)
        next_states: (batch, 4, 84, 84)
        dones: (batch,)
    """
    pass
```
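A minimal sketch of the method body, assuming the buffer stores (s, a, r, s', done) tuples in a list or deque called self.buffer (the attribute name is an assumption; use whatever the template defines):

```python
import random
import numpy as np
import torch

def sample(self, batch_size):
    # Never request more transitions than the buffer currently holds.
    batch_size = min(batch_size, len(self.buffer))
    batch = random.sample(self.buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    return (
        torch.as_tensor(np.stack(states), dtype=torch.float32),
        torch.as_tensor(actions, dtype=torch.int64),
        torch.as_tensor(rewards, dtype=torch.float32),
        torch.as_tensor(np.stack(next_states), dtype=torch.float32),
        torch.as_tensor(dones, dtype=torch.float32),
    )
```

Remember to move the returned tensors to the GPU (.to(device)) before feeding them to the network.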
Success Criteria :
Location : Section 6 - "Training Loop"
Your Task : Implement DQN loss (Temporal Difference error) and gradient descent
DQN Loss Function :
```text
L = E[(y - Q(s, a))²]

where:
    y = r + γ · max_a' Q_target(s', a')   [if not done]
    y = r                                 [if done]
```
Key Concepts :
Current Q-value : Q(s, a) from online network
Target Q-value : r + γ * max Q_target(s') from target network
TD Error : target - current
Loss : Mean squared error (MSE)
Requirements :
Use online network for current Q-values
Use target network for next Q-values (stability)
Handle terminal states (done=True -> no future reward)
Apply gradient clipping (prevent exploding gradients)
Starter Code Provided :
```python
def compute_loss(batch, dqn, target_dqn, gamma):
    """
    Compute the DQN loss.

    Args:
        batch: Sampled experiences
        dqn: Online network
        target_dqn: Target network
        gamma: Discount factor

    Returns:
        loss: MSE between Q(s, a) and the TD target
    """
    states, actions, rewards, next_states, dones = batch
    pass
```
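One possible shape for this function, assuming the batch tensors match the sample() signature above (a hedged sketch, not the reference solution):

```python
import torch
import torch.nn.functional as F

def compute_loss_sketch(batch, dqn, target_dqn, gamma):
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) from the online network for the actions actually taken.
    q_values = dqn(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets from the frozen target network (no gradient flow).
    with torch.no_grad():
        next_q = target_dqn(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)  # done -> no future reward

    return F.mse_loss(q_values, targets)
```

Gradient clipping then happens between loss.backward() and optimizer.step() in the training loop, not inside this function.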
Success Criteria :
Location : New section you'll create
Your Task : Find optimal hyperparameters for fastest learning on Pong
Hyperparameters to Tune :
Learning Rate : [1e-4, 5e-4, 1e-3, 5e-3]
Batch Size : [32, 64, 128]
Target Network Update : [1000, 2500, 5000, 10000] frames
Epsilon Decay : [10000, 20000, 50000] frames (ε=1.0 -> 0.1)
Replay Buffer Size : [10K, 50K, 100K]
Requirements :
Run experiments for 50K frames each
Track: final score, learning speed, stability
Create plots comparing configurations
Identify best configuration
Explain why certain parameters work better
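A minimal sketch of how the sweep could be organized; train_dqn and its return value are assumptions standing in for whatever your training loop exposes:

```python
# Hypothetical helper: train_dqn(...) runs one 50K-frame experiment and
# returns a list of per-episode scores. Replace it with your own entry point.
configs = [
    {"lr": 1e-4, "batch_size": 32, "target_update": 1000},
    {"lr": 5e-4, "batch_size": 64, "target_update": 5000},
    {"lr": 1e-3, "batch_size": 128, "target_update": 10000},
]

results = {}
for cfg in configs:
    scores = train_dqn(total_frames=50_000, **cfg)
    results[str(cfg)] = sum(scores[-10:]) / len(scores[-10:])  # mean of last 10 episodes

for name, final in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(final, name)
```

Sweeping one parameter at a time (holding the rest at the template defaults) keeps the number of 50K-frame runs manageable.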
Success Criteria :
Reduce overestimation bias by separating action selection and evaluation:
Select action using online network: argmax_a Q(s',a)
Evaluate action using target network: Q_target(s', argmax_a Q(s',a))
Compare learning curves: DQN vs Double DQN
Does Double DQN learn faster or reach higher scores?
Expected Improvement : 10-20% higher final score, more stable learning
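In code, the only change from the standard target is how the next-state action is chosen. A sketch, reusing the tensor names from the loss sketch above:

```python
import torch

# dqn, target_dqn, next_states, rewards, dones, gamma as in the loss sketch above.
with torch.no_grad():
    next_actions = dqn(next_states).argmax(dim=1, keepdim=True)          # online net selects
    next_q = target_dqn(next_states).gather(1, next_actions).squeeze(1)  # target net evaluates
    targets = rewards + gamma * next_q * (1.0 - dones)
```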
Implement dueling architecture that separates value and advantage:
```text
Q(s, a) = V(s) + A(s, a) - mean_a'(A(s, a'))
```
Split network into value stream and advantage stream
Combine at output layer
Compare to standard DQN
Why does this help? (Hint: some states are bad regardless of action)
Expected Improvement : 15-25% faster convergence
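A sketch of how the two streams could replace the fully connected head of the CNN above (the 512-unit stream width mirrors FC1 and is illustrative):

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Illustrative dueling head: flattened conv features in, Q-values out."""
    def __init__(self, in_features=3136, n_actions=2):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_features, 512), nn.ReLU(),
                                   nn.Linear(512, 1))
        self.advantage = nn.Sequential(nn.Linear(in_features, 512), nn.ReLU(),
                                       nn.Linear(512, n_actions))

    def forward(self, features):
        v = self.value(features)                    # (batch, 1) state value
        a = self.advantage(features)                # (batch, n_actions) advantages
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a) = V(s) + A(s, a) - mean(A)
```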
Sample important experiences more frequently:
Assign priority based on TD error: priority = |δ|
Sample with probability: P(i) ∝ priority_i^α
Use importance sampling weights to correct bias
Compare to uniform sampling
Expected Improvement : 30-50% faster convergence, higher data efficiency
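A minimal sketch of proportional prioritized sampling, keeping priorities in a plain NumPy array (real implementations use a sum tree for speed, omitted here):

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    """Return sampled indices and importance-sampling weights."""
    probs = priorities ** alpha
    probs = probs / probs.sum()
    indices = np.random.choice(len(priorities), batch_size, p=probs)
    # Importance-sampling weights correct the bias of non-uniform sampling.
    weights = (len(priorities) * probs[indices]) ** (-beta)
    weights = weights / weights.max()
    return indices, weights
```

After each update, the sampled transitions' priorities are refreshed with their new |δ| values.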
Combine multiple DQN improvements:
Double DQN
Dueling architecture
Prioritized replay
Multi-step returns (n-step Q-learning)
Noisy networks (exploration)
Distributional RL (C51)
Expected Improvement : State-of-the-art Pong performance (``score >15``)
Score : -21 (loses every game 0-21)
Win Rate : 0%
Frames to First Point : Never scores
Score after 10K frames : -18 to -15 (starting to learn)
Score after 50K frames : -10 to -5 (competitive)
Score after 200K frames : +5 to +15 (superhuman)
Win Rate : 20-40% after 50K, 60-80% after 200K
Score after 50K frames : -5 to 0 (2x faster learning)
Score after 100K frames : +10 to +18
Convergence : Reaches +15 in ~150K frames (vs 200K baseline)
Score : 10-20% higher than standard DQN
Stability : Lower variance in learning curve
Score : +18 to +21 (near-perfect play)
Frames to Master : ~100K (half the time of standard DQN)
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Problem : GPU runs out of memory during training
Solution :
Reduce batch size from 64 to 32
Reduce replay buffer size from 100K to 50K
Clear GPU memory: torch.cuda.empty_cache()
Restart runtime if memory is still full
Problem : Loss explodes or becomes NaN
Solution :
Check gradient clipping is enabled (torch.nn.utils.clip_grad_norm_)
Reduce learning rate (try 1e-4 instead of 5e-4)
Verify reward clipping (rewards should be in [-1, 1])
Check for division by zero in loss calculation
Ensure target network is used (not online network for targets)
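For reference, a typical update step with clipping might look like this; optimizer, loss, and dqn are whatever names your training loop already uses:

```python
import torch

optimizer.zero_grad()
loss.backward()
# Cap the gradient norm before the step; 10.0 is a common choice, not a prescribed value.
torch.nn.utils.clip_grad_norm_(dqn.parameters(), max_norm=10.0)
optimizer.step()
```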
Problem : Agent's score is not improving
Solution :
Verify experience replay is working (check buffer fills up)
Ensure epsilon is decaying (agent should exploit more over time)
Check target network updates every N frames (not every step)
Verify loss is decreasing (plot loss curve)
Increase training frequency (update every 4 frames instead of 10)
Problem : Training is too slow
Solution :
Verify GPU is enabled in Colab (Runtime -> Change runtime type)
Move all tensors to GPU: .to(device)
Reduce frame preprocessing time (use vectorized operations)
Increase update interval (train every 10 frames instead of 4)
Reduce batch size to 32 (fewer computations)
Problem : Q-values barely change during training
Solution :
Check weight initialization (use Xavier or He initialization)
Verify forward pass implementation (ensure all layers are called)
Check that the learning rate is not too small (values below 1e-5 may stall learning)
Ensure gradient flow (print gradients to verify backprop works)
Problem : Agent gets stuck repeating the same action (not exploring)
Solution :
Increase epsilon (more exploration): epsilon_start=1.0, epsilon_end=0.1
Use longer epsilon decay: 50K frames instead of 20K
Initialize replay buffer with random experiences before training
Try different random seeds
Concept 03 : Deep Q-Networks (DQN theory and architecture)
Activity 02 : Q-Learning and Value Functions (tabular foundation)
Project 1 : Atari Game Playing Agent (full DQN implementation)
Mnih et al. (2015): "Human-level control through deep reinforcement learning" - original DQN paper
Van Hasselt et al. (2016): "Deep Reinforcement Learning with Double Q-learning"
Wang et al. (2016): "Dueling Network Architectures for Deep Reinforcement Learning"
Schaul et al. (2016): "Prioritized Experience Replay"
Hessel et al. (2018): "Rainbow: Combining Improvements in Deep Reinforcement Learning"
Complete required TODOs (minimum: TODO 1-3)
Train for at least 10K frames (should take ~5-10 minutes on GPU)
Run entire notebook to generate all outputs
Export learning curve : Save training plot as PNG
Download notebook : File -> Download -> Download .ipynb
Submit via portal : Upload .ipynb and learning_curve.png
Submission Checklist :
After mastering DQN:
Move to Project One: Atari Game Playing Agent
Implement policy gradient methods (Actor-Critic, PPO)
Explore multi-agent RL (competitive/cooperative environments)
Key Insight : DQN scales Q-Learning from 16 states (FrozenLake) to millions of states (Atari pixels). This is the power of deep learning + RL!
From Tabular to Deep :
Activity 02 : Q-table with 16 x 4 = 64 values
Activity 03 : Neural network with 2M parameters
State Space : 16 states -> 84x84x4 = 28,224 pixels
DQN doesn't memorize every state (impossible). Instead, it learns to generalize - similar visual patterns -> similar Q-values. This is why deep RL can play games it's never seen before!
Good luck! You're about to train an AI to play Atari games using only pixels and rewards. This is the same algorithm that started the deep RL revolution in 2015! 🎮