Student starter code (30% baseline)
- index.html - Main HTML page
- script.js - JavaScript logic
- styles.css - Styling and layout
- package.json - Dependencies
- setup.sh - Setup script
- README.md - Instructions (below)

💡 Download the ZIP, extract it, and follow the instructions below to get started!
By completing this activity, you will:
Runtime -> Run all (or press Ctrl+F9)

Expected First Run Time: ~90 seconds
The template comes with 65% working code:
Location: Section 3 - "Bug #1: Non-Learning DQN"
Current State: DQN trains for 50K steps but success rate stays at 0%
Symptoms:
Your Task: Find and fix the bug using diagnostic tools
Debugging Strategy:
Common Bugs to Check:
# Bug candidates:
# 1. Wrong loss sign: loss = -F.mse_loss(...) # Should be positive!
# 2. Target network never updated: target_net = policy_net # Same object, not a copy!
# 3. Actions not from Q-network: action = env.action_space.sample() # Always random!
# 4. Q-values not used: return torch.argmax(random_tensor) # Not using Q!
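One way to narrow these down is to inspect the Q-network directly instead of the loss curve. A minimal diagnostic sketch, assuming policy_net and target_net map a batch of state tensors to per-action Q-values (names follow the starter code; the helper itself is illustrative):

import torch

@torch.no_grad()
def diagnose_dqn(policy_net, target_net, states):
    q = policy_net(states)
    q_target = target_net(states)
    # 1. Greedy actions should vary across states; a single constant action
    #    (or pure randomness) suggests actions are not coming from Q.
    print("distinct greedy actions:", q.argmax(dim=1).unique().tolist())
    # 2. Near-zero spread in Q-values throughout training suggests the network
    #    output (or its gradient path) is broken.
    print("Q mean / std:", q.mean().item(), q.std().item())
    # 3. Policy and target outputs should differ after updates; identical
    #    outputs point to the shallow-copy bug (candidate #2).
    print("max |Q_policy - Q_target|:", (q - q_target).abs().max().item())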
Success Criteria:
Location: Section 4 - "Bug #2: PPO Gradient Explosion"
Current State: PPO training crashes after ~5K steps with NaN losses
Symptoms:
Your Task: Implement gradient clipping and diagnose the root cause
Starter Code Provided:
def train_step(policy, optimizer, batch):
loss = compute_loss(batch)
optimizer.zero_grad()
loss.backward()
# TODO: Add gradient clipping here
# Hint: torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=?)
optimizer.step()
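To answer the debugging questions that follow, it helps to log the gradient norm measured before clipping, which is exactly what clip_grad_norm_ returns. A sketch under the starter-code assumptions (compute_loss comes from the notebook; max_norm=0.5 is only a common default, not the required answer):

import torch

def train_step_with_grad_logging(policy, optimizer, batch, writer=None, step=0, max_norm=0.5):
    loss = compute_loss(batch)
    optimizer.zero_grad()
    loss.backward()
    # Returns the total gradient norm *before* clipping -- the quantity
    # that blows up when PPO training diverges.
    total_norm = torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=max_norm)
    if writer is not None:
        writer.add_scalar('Gradients/pre_clip_norm', total_norm.item(), step)
        writer.add_scalar('Loss/policy', loss.item(), step)
    optimizer.step()
    return loss.item(), total_norm.item()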
Debugging Questions:
Success Criteria:
Location: Section 5 - "Bug #3: Sparse Rewards"
Current State: Agent explores for 100K steps but never discovers goal
Your Task: Add shaped rewards to guide learning in sparse-reward environment
Environment: LunarLander with only sparse rewards (+/-100 at landing)
Requirements: Implement potential-based reward shaping:
def shaped_reward(state, action, next_state, original_reward):
# TODO: Implement potential-based shaping
# Φ(s) = potential function (higher = better states)
# shaped_reward = original_reward + γ*Φ(s') - Φ(s)
# Hint: For LunarLander, good potential functions:
# - Distance to landing pad (closer = higher Φ)
# - Angle alignment (upright = higher Φ)
# - Vertical velocity (slower = higher Φ)
pass
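As a concrete starting point, here is a sketch of one possible potential built from the hinted terms. It assumes the standard LunarLander observation layout [x, y, vx, vy, angle, angular velocity, left contact, right contact] with the landing pad at (0, 0); the weights are placeholders to tune:

import numpy as np

def potential(state, w_dist=1.0, w_angle=0.5, w_vel=0.5):
    x, y, vx, vy, angle = state[0], state[1], state[2], state[3], state[4]
    distance = np.sqrt(x**2 + y**2)   # closer to the pad -> higher potential
    tilt = abs(angle)                 # more upright      -> higher potential
    speed = np.sqrt(vx**2 + vy**2)    # slower            -> higher potential
    return -(w_dist * distance + w_angle * tilt + w_vel * speed)

def shaped_reward(state, action, next_state, original_reward, gamma=0.99):
    # Potential-based shaping: R' = R + γ*Φ(s') - Φ(s)
    return original_reward + gamma * potential(next_state) - potential(state)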
Potential Functions to Try:
- Φ(s) = -||position - landing_pad||
- Φ(s) = w1*distance + w2*angle + w3*velocity

Success Criteria:
Location: Section 6 - "Monitoring and Deployment"
Your Task: Build comprehensive dashboard for monitoring RL training health
Required Metrics (implement logging for each):
1. Learning Progress:
2. Network Health:
3. Algorithm-Specific:
4. Environment Statistics:
Implementation:
def log_diagnostics(agent, episode_data, writer, step):
# TODO: Implement comprehensive logging
# Use TensorBoard: writer.add_scalar(tag, value, step)
# Learning progress
writer.add_scalar('Reward/Episode', ...)
writer.add_scalar('Success/Rate', ...)
# Network health
# TODO: Add gradient norm logging
# TODO: Add weight distribution histograms
# Algorithm-specific
# TODO: Add DQN Q-values or PPO KL divergence
# Environment stats
# TODO: Add observation/action distributions
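A sketch of the network-health portion, assuming the agent exposes a standard torch.nn.Module and that backward() has already run this step so .grad is populated (tag names are only suggestions):

def log_network_health(model, writer, step):
    total_sq = 0.0
    for name, param in model.named_parameters():
        if param.grad is not None:
            total_sq += param.grad.norm(2).item() ** 2
        # Weight histograms make dead or exploding layers easy to spot.
        writer.add_histogram(f'Weights/{name}', param.detach().cpu(), step)
    writer.add_scalar('Gradients/global_norm', total_sq ** 0.5, step)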
Success Criteria:
Create intelligent early stopping based on training diagnostics (a sketch follows this list):
Implement adaptive hyperparameter adjustment:
Scale debugging tools to multi-process training:
Build ML model to predict training failures:
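For the early-stopping extension, here is a sketch of one possible rule driven by the diagnostics above; the thresholds are illustrative and should be tuned per environment:

import math

class EarlyStopper:
    def __init__(self, patience=10, min_improvement=1.0, max_grad_norm=1000.0):
        self.patience = patience
        self.min_improvement = min_improvement
        self.max_grad_norm = max_grad_norm
        self.best_reward = -float('inf')
        self.stale_evals = 0

    def should_stop(self, mean_reward, loss, grad_norm):
        # Hard failures: NaN loss or exploding gradients -> stop immediately.
        if math.isnan(loss) or grad_norm > self.max_grad_norm:
            return True
        # Plateau detection: no meaningful improvement for `patience` evaluations.
        if mean_reward > self.best_reward + self.min_improvement:
            self.best_reward = mean_reward
            self.stale_evals = 0
        else:
            self.stale_evals += 1
        return self.stale_evals >= self.patience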
Root Cause: target_net is assigned a reference to policy_net, so it changes with every policy update
# WRONG:
target_net = policy_net # Both point to same object!
# CORRECT:
target_net = DQN()
target_net.load_state_dict(policy_net.state_dict())  # Independent network with copied weights
Symptoms: Q-values identical, loss decreases but no learning
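Once the copy is fixed, the target network still has to be refreshed during training. A common pattern is a periodic hard copy or Polyak averaging; the interval and τ below are illustrative, not values from the starter code:

import torch

TARGET_UPDATE_EVERY = 1_000   # illustrative interval
TAU = 0.005                   # illustrative soft-update coefficient

def update_target(policy_net, target_net, step, soft=False):
    if soft:
        # Polyak averaging: target <- τ*policy + (1 - τ)*target
        with torch.no_grad():
            for p_t, p in zip(target_net.parameters(), policy_net.parameters()):
                p_t.mul_(1.0 - TAU).add_(TAU * p)
    elif step % TARGET_UPDATE_EVERY == 0:
        # Hard update: copy all weights every N steps.
        target_net.load_state_dict(policy_net.state_dict())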
Root Cause: No gradient clipping with high learning rate
# WRONG:
optimizer.step() # Gradients can be arbitrarily large!
# CORRECT:
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)
optimizer.step()
Symptoms: NaN losses after ~5K steps, gradient norms > 1000
Root Cause: Environment provides no feedback for exploration

Solution: Potential-based reward shaping
def potential(state):
return -distance_to_goal(state)
shaped_reward = reward + gamma * potential(next_state) - potential(state)
Symptoms: Agent explores but never discovers goal
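Why the potential form matters: along any trajectory the added terms telescope (taking Φ = 0 at terminal states), so the shaped return differs from the true return only by a constant that no policy can influence:

(r₀ + γΦ(s₁) - Φ(s₀)) + γ(r₁ + γΦ(s₂) - Φ(s₁)) + ... = (r₀ + γr₁ + ...) - Φ(s₀)

Because the difference is just -Φ(s₀), policy rankings are preserved and the shaping cannot change which behavior is optimal, unlike ad-hoc bonus rewards.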
Root Cause: Loss has wrong sign, maximizing instead of minimizing
# WRONG:
loss = -F.mse_loss(Q_predicted, Q_target) # Maximizes error!
# CORRECT:
loss = F.mse_loss(Q_predicted, Q_target) # Minimizes error
Symptoms: Loss "decreases" (becomes more negative), but Q-values diverge
Root Cause: Storing tensor references instead of values
# WRONG:
buffer.append(state) # Stores reference, gets overwritten!
# CORRECT:
buffer.append(state.clone()) # Stores independent copy
Symptoms: Training unstable, samples change after storage
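The same aliasing bug shows up with NumPy observations, since some environment wrappers reuse the same array between steps; storing explicit copies avoids it in both cases (state and state_tensor here stand in for whatever the buffer receives):

import numpy as np

buffer.append(np.array(state, copy=True))      # NumPy observation: independent copy
buffer.append(state_tensor.detach().clone())   # Torch tensor: copy detached from the graph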
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution: Check Q-value distributions, not just loss curves:
Solution: Multiple factors contribute to explosion:
Solution: Must use potential-based shaping:
R' = R + γΦ(s') - Φ(s)

Solution: Different implementations have different scales:
Submission Checklist:
activity-08-[YourName].ipynb

Bug Report Template:
## Bug #1: Non-Learning DQN
**Root Cause**: [Your explanation]
**How I Found It**: [Debugging process]
**Fix Applied**: [Code changes]
**Verification**: [Results after fix]
## Bug #2: PPO Gradient Explosion
[Same structure...]
After mastering RL debugging:
Key Insight: 90% of RL engineering is debugging. Master these diagnostic tools, and you'll save weeks of frustration on real projects!
Common Pattern in Research:
This activity simulates real RL research workflows. These debugging skills are more valuable than knowing 10 more algorithms!
# Always include:
- Gradient clipping (max_norm=0.5 for PPO, 10.0 for DQN)
- Reward normalization (RunningMeanStd)
- Checkpoint saving (every 10K steps)
- Reproducibility (set seeds: torch, numpy, env)
- Error handling (try/except for env resets)
- Logging (TensorBoard for all metrics)
- Validation (test on held-out environments)
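For the reproducibility item, a minimal seeding sketch (assumes a Gymnasium-style env; full determinism may additionally require cuDNN deterministic settings):

import random
import numpy as np
import torch

def set_seed(seed, env=None):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    if env is not None:
        env.reset(seed=seed)        # Gymnasium-style environment seeding
        env.action_space.seed(seed)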
Check in this order:
90% of bugs are: wrong shapes, wrong signs, wrong indexes
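A quick way to catch the shape and index cases is to assert on tensor shapes right where Q(s, a) is gathered; a self-contained sketch with illustrative names:

import torch

def select_q(q_values, actions):
    # q_values: [batch, num_actions], actions: [batch] of int64 indices.
    assert q_values.dim() == 2, q_values.shape
    assert actions.dtype == torch.long and actions.dim() == 1, actions.shape
    return q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

q = torch.randn(4, 2)
a = torch.tensor([0, 1, 1, 0])
print(select_q(q, a).shape)   # torch.Size([4])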
Good luck! Debugging is a superpower in RL. Master it, and you'll be the person everyone asks for help! 🔧🚀