Practice and reinforce the concepts from Lesson 8
In this final RL activity, you'll practice essential production RL skills: debugging broken training code, implementing reward shaping, monitoring training with diagnostics, and deploying a trained agent. In practice, most real-world RL effort goes into debugging, and this activity prepares you for exactly that.
By completing this activity, you will:
- Diagnose and fix bugs in a broken DQN training loop
- Implement reward shaping for a sparse-reward environment
- Build training monitoring with TensorBoard logging and anomaly detection
- Run and compare a systematic hyperparameter search
- Deploy a trained agent with production safeguards
Download the activity template from the Templates folder:
1. Download AI25-Template-activity-08-rl-in-practice.zip (Templates/AI25-Template-activity-08-rl-in-practice.zip)
2. Upload activity-08-rl-in-practice.ipynb to Google Colab
3. Execute the first few cells
Scenario: A DQN agent for CartPole-v1 isn't learning (reward stays ~20-30 after 1000 episodes).
TODO 1: Diagnose and fix the bug(s)
Common bugs to check:
# Bug 1: Target network never updated
if episode % target_update_freq == 0:
    # BUG: Missing target network update!
    pass  # TODO 1: Add target_network.load_state_dict(q_network.state_dict())

# Bug 2: Replay buffer not filling
if len(replay_buffer) > batch_size:
    # BUG: Replay buffer size is 100 (too small!)
    pass  # TODO 1: Increase buffer capacity

# Bug 3: Learning rate too low
optimizer = optim.Adam(q_network.parameters(), lr=0.00001)
# BUG: Learning rate 1e-5 is too low for CartPole!
# TODO 1: Change to lr=0.001

# Bug 4: No exploration
epsilon = 0.0  # BUG: No exploration!
# TODO 1: Set epsilon=1.0 initially with decay
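For Bug 4, one possible epsilon-greedy schedule is sketched below; the decay rate, floor, and the select_action helper are illustrative assumptions, not part of the template (it assumes state is already a PyTorch tensor).

import random
import torch

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995  # assumed schedule

def select_action(q_network, state, epsilon, action_space):
    # Explore with probability epsilon, otherwise act greedily on Q-values
    if random.random() < epsilon:
        return action_space.sample()
    with torch.no_grad():
        return int(q_network(state).argmax().item())

# After each episode, decay epsilon toward its floor
epsilon = max(epsilon_min, epsilon * epsilon_decay)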
Scenario: Train an agent on MountainCar-v0 (very sparse rewards).
TODO 2: Implement reward shaping to guide learning
class RewardShapingWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.prev_position = None

    def step(self, action):
        obs, reward, done, truncated, info = self.env.step(action)

        # TODO 2: Implement reward shaping
        # Original: reward = -1 (sparse)
        # Shaped: reward = progress_bonus + height_bonus
        #
        # Hints:
        # - reward for moving toward goal (obs[0] increasing)
        # - reward for reaching higher positions (obs[0] > threshold)
        # - terminal bonus for reaching goal

        position = obs[0]
        velocity = obs[1]

        # Your reward shaping code here
        shaped_reward = reward  # Replace with your shaping

        self.prev_position = position
        return obs, shaped_reward, done, truncated, info
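A usage sketch, assuming Gymnasium and the standard MountainCar-v0 environment id:

import gymnasium as gym

env = RewardShapingWrapper(gym.make("MountainCar-v0"))
obs, info = env.reset()
obs, shaped_reward, done, truncated, info = env.step(env.action_space.sample())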
TODO 3: Implement comprehensive training monitoring
from torch.utils.tensorboard import SummaryWriter

class TrainingMonitor:
    def __init__(self, log_dir="./logs"):
        self.writer = SummaryWriter(log_dir)

    def log_episode(self, episode, reward, length, epsilon):
        # TODO 3: Log episode metrics to TensorBoard
        pass

    def log_training_step(self, step, loss, q_values, gradients):
        # TODO 3: Log training metrics
        # - Loss values
        # - Q-value statistics (mean, max, min)
        # - Gradient norms
        pass

    def check_anomalies(self, loss, q_values, gradients):
        # TODO 3: Detect training anomalies
        # Check for:
        # - NaN values
        # - Exploding gradients (norm > threshold)
        # - Q-value explosion (> threshold)
        # Return: anomaly_detected (bool), anomaly_message (str)
        pass
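As a starting point for log_episode, here is a minimal sketch using SummaryWriter.add_scalar; the tag names are illustrative choices, not required by the template.

def log_episode(self, episode, reward, length, epsilon):
    # One scalar per metric, keyed by episode number
    self.writer.add_scalar("episode/reward", reward, episode)
    self.writer.add_scalar("episode/length", length, episode)
    self.writer.add_scalar("train/epsilon", epsilon, episode)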
TODO 4: Run systematic hyperparameter search
# TODO 4: Define hyperparameter configurations
configs = [
    {
        "name": "baseline",
        "learning_rate": 0.001,
        "gamma": 0.99,
        "epsilon_decay": 0.995,
        "buffer_size": 10000,
        "target_update_freq": 100,
    },
    {
        "name": "high_lr",
        # TODO 4: Your config (experiment with higher LR)
    },
    {
        "name": "large_buffer",
        # TODO 4: Your config (experiment with larger buffer)
    },
    # Add more configs
]

# TODO 4: Run training for each config and compare
results = {}
for config in configs:
    agent = train_agent(config)
    results[config["name"]] = {
        "final_reward": agent.final_reward,
        "episodes_to_solve": agent.episodes_to_solve,
        "training_time": agent.training_time,
    }
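One way to compare the runs, assuming the results dictionary built above:

# Print configs sorted by final reward (best first)
for name, r in sorted(results.items(), key=lambda kv: kv[1]["final_reward"], reverse=True):
    print(f"{name:<15} reward={r['final_reward']:.1f}  "
          f"solved in {r['episodes_to_solve']} episodes  "
          f"({r['training_time']:.0f}s)")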
TODO 5: Deploy trained agent with production safeguards
class ProductionAgent:
    def __init__(self, model_path):
        self.model = self.load_model(model_path)
        self.fallback_controller = self.create_fallback()
        self.anomaly_count = 0

    def load_model(self, model_path):
        # TODO 5: Load trained model safely (try-except)
        pass

    def predict(self, observation):
        try:
            # TODO 5: Get action from model
            action = self.model.predict(observation)

            # TODO 5: Safety checks
            # - Check for NaN
            # - Clip action to valid range
            # - Log anomalies
            return action
        except Exception as e:
            # TODO 5: Handle errors
            # Log error, increment anomaly count
            # Return safe fallback action
            pass

    def create_fallback(self):
        # TODO 5: Create safe fallback controller
        # For CartPole: oscillate left/right
        # For others: zero action or emergency stop
        pass
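An illustrative serving loop; the model path and environment are assumptions, not requirements of the template.

import gymnasium as gym

agent = ProductionAgent("model.zip")
env = gym.make("CartPole-v1")
obs, info = env.reset()
done = truncated = False
while not (done or truncated):
    action = agent.predict(obs)
    obs, reward, done, truncated, info = env.step(action)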
Utilities for reproducible experiments:
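The extensions below call a set_seed helper; a minimal sketch, assuming Python's random, NumPy, and PyTorch are the only RNG sources in play:

import random
import numpy as np
import torch

def set_seed(seed: int):
    # Seed every source of randomness the training loop touches
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)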
Before Fix: average reward plateaus around 20-30 and never improves (the broken scenario above).
After Fix: average reward climbs toward the CartPole-v1 solve threshold of 195+.
Without Shaping (MountainCar): success rate <20%
With Shaping:
Healthy Training:
✓ Episode 100: Reward = 50 ± 10
✓ Loss = 0.5 ± 0.1
✓ Q-values = 10 ± 5
✓ Gradient norm = 2.3 (stable)
✓ No anomalies detected
Unhealthy Training:
✗ Episode 100: Reward = 10 ± 50 (high variance)
✗ Loss = NaN
✗ Q-values = 1000+ (exploding)
✗ Gradient norm = 500 (exploding!)
✗ ANOMALY DETECTED: Gradient explosion
Expected Ranking:
Production Metrics:
✓ Model loaded successfully
✓ 1000 predictions in 0.5s (2000 FPS)
✓ 0 NaN actions
✓ 0 fallback activations
✓ Average reward: 195 ± 5
Your implementation is complete when all five TODOs (debugging, reward shaping, monitoring, hyperparameter search, deployment) run end to end and reproduce results in line with the expectations above.
Good Shaping:
# Guide toward goal without changing optimal policy
progress = current_position - previous_position
reward += 10 * progress # Reward for moving toward goal
Bad Shaping:
# Overshaping: Agent exploits reward signal
reward += 100 * abs(velocity) # Agent just oscillates fast!
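A common way to keep shaping from changing the optimal policy is potential-based shaping, which adds gamma * phi(s') - phi(s) to the reward; the potential function below is an illustrative assumption for MountainCar, not the required solution.

def potential(obs):
    # Illustrative potential: higher car position (closer to the goal flag) is better
    return obs[0]

def potential_shaped_reward(reward, obs, next_obs, gamma=0.99):
    # Potential-based shaping (Ng et al., 1999) preserves the optimal policy
    return reward + gamma * potential(next_obs) - potential(obs)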
✓ Model loaded and tested offline
✓ Action bounds enforced
✓ Fallback controller implemented
✓ Monitoring and logging active
✓ Error handling for all exceptions
✓ Performance profiling (FPS, latency)
✓ Gradual rollout plan (A/B testing)
Run each configuration with 10 different seeds:
results = []
for seed in range(10):
    set_seed(seed)
    result = train_agent(config, seed=seed)
    results.append(result)

# Report mean ± std across seeds
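To report the mean ± std, assuming each result exposes a final_reward attribute as in the earlier snippet:

import numpy as np

final_rewards = [r.final_reward for r in results]
print(f"final reward: {np.mean(final_rewards):.1f} ± {np.std(final_rewards):.1f}")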
Use Optuna for Bayesian optimization:
import optuna
def objective(trial):
    lr = trial.suggest_float('learning_rate', 1e-5, 1e-2, log=True)
    gamma = trial.suggest_float('gamma', 0.9, 0.999)
    # ... more hyperparameters
    agent = train_agent(learning_rate=lr, gamma=gamma)
    return agent.final_reward

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
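Once the study finishes, the best configuration can be read back from standard Optuna attributes:

print("Best reward:", study.best_value)
print("Best hyperparameters:", study.best_params)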
Simulate domain randomization:
class RandomizedEnv(gym.Wrapper):
    def reset(self, **kwargs):
        # Randomize environment parameters before each episode
        self.env.gravity = np.random.uniform(9.0, 10.0)
        self.env.mass = np.random.uniform(0.8, 1.2)
        return self.env.reset(**kwargs)
Train on randomized sim, test on fixed "real" environment.
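A sketch of that split, with CartPole-v1 as an assumed example environment:

import gymnasium as gym

train_env = RandomizedEnv(gym.make("CartPole-v1"))  # randomized "sim"
test_env = gym.make("CartPole-v1")                  # fixed "real" environment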
Set up automated testing:
def test_agent_performance():
    agent = load_trained_agent("model.zip")
    test_reward = evaluate(agent, n_episodes=100)
    assert test_reward > 195, f"Agent performance degraded: {test_reward}"
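This can be run under a test runner such as pytest, or directly via a minimal entry point:

if __name__ == "__main__":
    test_agent_performance()
    print("✓ agent performance regression test passed")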
Completed Notebook: activity-08-rl-in-practice.ipynb
Debugging Report: the bugs you found in TODO 1 and how you fixed them
Hyperparameter Analysis: comparison of your TODO 4 configurations (final reward, episodes to solve, training time)
Deployment Demo: your ProductionAgent from TODO 5 running with its safeguards active
🎉 Congratulations! You've completed the Reinforcement Learning Module!
RL Projects:
Next Module: Generative AI (Lessons 9-15)
This activity is graded on:
Passing Grade: 70% or higher
Excellent work completing the RL module! You're now ready to build production RL systems! 🚀🎯