Student starter code (30% baseline)
- index.html - Main HTML page
- script.js - JavaScript logic
- styles.css - Styling and layout
- package.json - Dependencies
- setup.sh - Setup script
- README.md - Instructions (below)

💡 Download the ZIP, extract it, and follow the instructions below to get started!
By completing this activity, you will:
Runtime -> Run all (or press Ctrl+F9)

Expected First Run Time: ~90 seconds
The template comes with 65% working code:
Location: Section 3 - "Bug #1: Non-Learning DQN"
Current State: DQN trains for 50K steps but success rate stays at 0%
Symptoms:
Your Task: Find and fix the bug using diagnostic tools
Debugging Strategy:
Common Bugs to Check:
# Bug candidates:
# 1. Wrong loss sign: loss = -F.mse_loss(...) # Should be positive!
# 2. Target network never updated: target_net = policy_net # Same object, not a copy!
# 3. Actions not from Q-network: action = env.action_space.sample() # Always random!
# 4. Q-values not used: return torch.argmax(random_tensor) # Not using Q!
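One way to narrow these down is to inspect the Q-network directly instead of the loss curve. A minimal diagnostic sketch, assuming policy_net and target_net map a batch of state tensors to per-action Q-values (names follow the starter code; the helper itself is illustrative):

import torch

@torch.no_grad()
def diagnose_dqn(policy_net, target_net, states):
    q = policy_net(states)
    q_target = target_net(states)
    # 1. Greedy actions should vary across states; a single constant action
    #    (or pure randomness) suggests actions are not coming from Q.
    print("distinct greedy actions:", q.argmax(dim=1).unique().tolist())
    # 2. Near-zero spread in Q-values throughout training suggests the network
    #    output (or its gradient path) is broken.
    print("Q mean / std:", q.mean().item(), q.std().item())
    # 3. Policy and target outputs should differ after updates; identical
    #    outputs point to the shallow-copy bug (candidate #2).
    print("max |Q_policy - Q_target|:", (q - q_target).abs().max().item())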
Success Criteria:
Location: Section 4 - "Bug #2: PPO Gradient Explosion"
Current State: PPO training crashes after ~5K steps with NaN losses
Symptoms:
Your Task: Implement gradient clipping and diagnose the root cause
Starter Code Provided:
def train_step(policy, optimizer, batch):
loss = compute_loss(batch)
optimizer.zero_grad()
loss.backward()
# TODO: Add gradient clipping here
# Hint: torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=?)
optimizer.step()
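To answer the debugging questions that follow, it helps to log the gradient norm measured before clipping, which is exactly what clip_grad_norm_ returns. A sketch under the starter-code assumptions (compute_loss comes from the notebook; max_norm=0.5 is only a common default, not the required answer):

import torch

def train_step_with_grad_logging(policy, optimizer, batch, writer=None, step=0, max_norm=0.5):
    loss = compute_loss(batch)
    optimizer.zero_grad()
    loss.backward()
    # Returns the total gradient norm *before* clipping -- the quantity
    # that blows up when PPO training diverges.
    total_norm = torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=max_norm)
    if writer is not None:
        writer.add_scalar('Gradients/pre_clip_norm', total_norm.item(), step)
        writer.add_scalar('Loss/policy', loss.item(), step)
    optimizer.step()
    return loss.item(), total_norm.item()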
Debugging Questions:
Success Criteria:
Location: Section 5 - "Bug #3: Sparse Rewards"
Current State: Agent explores for 100K steps but never discovers goal
Your Task: Add shaped rewards to guide learning in sparse-reward environment
Environment: LunarLander with only sparse rewards (+/-100 at landing)
Requirements: Implement potential-based reward shaping:
def shaped_reward(state, action, next_state, original_reward):
# TODO: Implement potential-based shaping
# Φ(s) = potential function (higher = better states)
# shaped_reward = original_reward + γ*Φ(s') - Φ(s)
# Hint: For LunarLander, good potential functions:
# - Distance to landing pad (closer = higher Φ)
# - Angle alignment (upright = higher Φ)
# - Vertical velocity (slower = higher Φ)
pass
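As a concrete starting point, here is a sketch of one possible potential built from the hinted terms. It assumes the standard LunarLander observation layout [x, y, vx, vy, angle, angular velocity, left contact, right contact] with the landing pad at (0, 0); the weights are placeholders to tune:

import numpy as np

def potential(state, w_dist=1.0, w_angle=0.5, w_vel=0.5):
    x, y, vx, vy, angle = state[0], state[1], state[2], state[3], state[4]
    distance = np.sqrt(x**2 + y**2)   # closer to the pad -> higher potential
    tilt = abs(angle)                 # more upright      -> higher potential
    speed = np.sqrt(vx**2 + vy**2)    # slower            -> higher potential
    return -(w_dist * distance + w_angle * tilt + w_vel * speed)

def shaped_reward(state, action, next_state, original_reward, gamma=0.99):
    # Potential-based shaping: R' = R + γ*Φ(s') - Φ(s)
    return original_reward + gamma * potential(next_state) - potential(state)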
Potential Functions to Try:
- Φ(s) = -||position - landing_pad||
- Φ(s) = w1*distance + w2*angle + w3*velocity

Success Criteria:
Location: Section 6 - "Monitoring and Deployment"
Your Task: Build comprehensive dashboard for monitoring RL training health
Required Metrics (implement logging for each):
1. Learning Progress:
2. Network Health:
3. Algorithm-Specific:
4. Environment Statistics:
Implementation:
def log_diagnostics(agent, episode_data, writer, step):
# TODO: Implement comprehensive logging
# Use TensorBoard: writer.add_scalar(tag, value, step)
# Learning progress
writer.add_scalar('Reward/Episode', ...)
writer.add_scalar('Success/Rate', ...)
# Network health
# TODO: Add gradient norm logging
# TODO: Add weight distribution histograms
# Algorithm-specific
# TODO: Add DQN Q-values or PPO KL divergence
# Environment stats
# TODO: Add observation/action distributions
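A sketch of the network-health portion, assuming the agent exposes a standard torch.nn.Module and that backward() has already run this step so .grad is populated (tag names are only suggestions):

def log_network_health(model, writer, step):
    total_sq = 0.0
    for name, param in model.named_parameters():
        if param.grad is not None:
            total_sq += param.grad.norm(2).item() ** 2
        # Weight histograms make dead or exploding layers easy to spot.
        writer.add_histogram(f'Weights/{name}', param.detach().cpu(), step)
    writer.add_scalar('Gradients/global_norm', total_sq ** 0.5, step)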
Success Criteria:
Create intelligent early stopping based on training diagnostics (a sketch follows this list):
Implement adaptive hyperparameter adjustment:
Scale debugging tools to multi-process training:
Build ML model to predict training failures:
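For the early-stopping extension, here is a sketch of one possible rule driven by the diagnostics above; the thresholds are illustrative and should be tuned per environment:

import math

class EarlyStopper:
    def __init__(self, patience=10, min_improvement=1.0, max_grad_norm=1000.0):
        self.patience = patience
        self.min_improvement = min_improvement
        self.max_grad_norm = max_grad_norm
        self.best_reward = -float('inf')
        self.stale_evals = 0

    def should_stop(self, mean_reward, loss, grad_norm):
        # Hard failures: NaN loss or exploding gradients -> stop immediately.
        if math.isnan(loss) or grad_norm > self.max_grad_norm:
            return True
        # Plateau detection: no meaningful improvement for `patience` evaluations.
        if mean_reward > self.best_reward + self.min_improvement:
            self.best_reward = mean_reward
            self.stale_evals = 0
        else:
            self.stale_evals += 1
        return self.stale_evals >= self.patience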
Root Cause: target_net is assigned a reference to policy_net, so it changes with every policy update
# WRONG:
target_net = policy_net # Both point to same object!
# CORRECT:
target_net = DQN()
target_net.load_state_dict(policy_net.state_dict())  # Independent network with copied weights
Symptoms: Q-values identical, loss decreases but no learning
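Once the copy is fixed, the target network still has to be refreshed during training. A common pattern is a periodic hard copy or Polyak averaging; the interval and τ below are illustrative, not values from the starter code:

import torch

TARGET_UPDATE_EVERY = 1_000   # illustrative interval
TAU = 0.005                   # illustrative soft-update coefficient

def update_target(policy_net, target_net, step, soft=False):
    if soft:
        # Polyak averaging: target <- τ*policy + (1 - τ)*target
        with torch.no_grad():
            for p_t, p in zip(target_net.parameters(), policy_net.parameters()):
                p_t.mul_(1.0 - TAU).add_(TAU * p)
    elif step % TARGET_UPDATE_EVERY == 0:
        # Hard update: copy all weights every N steps.
        target_net.load_state_dict(policy_net.state_dict())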
Root Cause: No gradient clipping with high learning rate
# WRONG:
optimizer.step() # Gradients can be arbitrarily large!
# CORRECT:
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)
optimizer.step()
Symptoms: NaN losses after ~5K steps, gradient norms > 1000
Root Cause: Environment provides no feedback for exploration

Solution: Potential-based reward shaping
def potential(state):
return -distance_to_goal(state)
shaped_reward = reward + gamma * potential(next_state) - potential(state)
Symptoms: Agent explores but never discovers goal
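Why the potential form matters: along any trajectory the added terms telescope (taking Φ = 0 at terminal states), so the shaped return differs from the true return only by a constant that no policy can influence:

(r₀ + γΦ(s₁) - Φ(s₀)) + γ(r₁ + γΦ(s₂) - Φ(s₁)) + ... = (r₀ + γr₁ + ...) - Φ(s₀)

Because the difference is just -Φ(s₀), policy rankings are preserved and the shaping cannot change which behavior is optimal, unlike ad-hoc bonus rewards.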
Root Cause: Loss has wrong sign, maximizing instead of minimizing
# WRONG:
loss = -F.mse_loss(Q_predicted, Q_target) # Maximizes error!
# CORRECT:
loss = F.mse_loss(Q_predicted, Q_target) # Minimizes error
Symptoms: Loss "decreases" (becomes more negative), but Q-values diverge
Root Cause: Storing tensor references instead of values
# WRONG:
buffer.append(state) # Stores reference, gets overwritten!
# CORRECT:
buffer.append(state.clone()) # Stores independent copy
Symptoms: Training unstable, samples change after storage
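The same aliasing bug shows up with NumPy observations, since some environment wrappers reuse the same array between steps; storing explicit copies avoids it in both cases (state and state_tensor here stand in for whatever the buffer receives):

import numpy as np

buffer.append(np.array(state, copy=True))      # NumPy observation: independent copy
buffer.append(state_tensor.detach().clone())   # Torch tensor: copy detached from the graph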
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution: Check Q-value distributions, not just loss curves:
Solution: Multiple factors contribute to explosion:
Solution: Must use potential-based shaping:
R' = R + γΦ(s') - Φ(s)

Solution: Different implementations have different scales:
Submission Checklist:
activity-08-[YourName].ipynb

Bug Report Template:
## Bug #1: Non-Learning DQN
**Root Cause**: [Your explanation]
**How I Found It**: [Debugging process]
**Fix Applied**: [Code changes]
**Verification**: [Results after fix]
## Bug #2: PPO Gradient Explosion
[Same structure...]
After mastering RL debugging:
Key Insight: 90% of RL engineering is debugging. Master these diagnostic tools, and you'll save weeks of frustration on real projects!
Common Pattern in Research:
This activity simulates real RL research workflows. These debugging skills are more valuable than knowing 10 more algorithms!
# Always include:
- Gradient clipping (max_norm=0.5 for PPO, 10.0 for DQN)
- Reward normalization (RunningMeanStd)
- Checkpoint saving (every 10K steps)
- Reproducibility (set seeds: torch, numpy, env)
- Error handling (try/except for env resets)
- Logging (TensorBoard for all metrics)
- Validation (test on held-out environments)
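For the reproducibility item, a minimal seeding sketch (assumes a Gymnasium-style env; full determinism may additionally require cuDNN deterministic settings):

import random
import numpy as np
import torch

def set_seed(seed, env=None):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    if env is not None:
        env.reset(seed=seed)        # Gymnasium-style environment seeding
        env.action_space.seed(seed)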
Check in this order:
90% of bugs are: wrong shapes, wrong signs, wrong indexes
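A quick way to catch the shape and index cases is to assert on tensor shapes right where Q(s, a) is gathered; a self-contained sketch with illustrative names:

import torch

def select_q(q_values, actions):
    # q_values: [batch, num_actions], actions: [batch] of int64 indices.
    assert q_values.dim() == 2, q_values.shape
    assert actions.dtype == torch.long and actions.dim() == 1, actions.shape
    return q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

q = torch.randn(4, 2)
a = torch.tensor([0, 1, 1, 0])
print(select_q(q, a).shape)   # torch.Size([4])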
Good luck! Debugging is a superpower in RL. Master it, and you'll be the person everyone asks for help! 🔧🚀