ℹ️ Definition Practical RL involves debugging training failures, designing effective reward functions, diagnosing common issues, and deploying RL agents to production environments - skills critical for real-world RL applications.
By the end of this lesson, you will be able to debug common RL training failures, design reward functions that resist hacking, monitor training with the right diagnostics, and deploy trained policies safely to production.
In Lessons 1-7, we learned RL algorithms that "work in theory." In practice, reality often looks like this:
❌ Agent learns nothing after 1 million steps
❌ Training diverges after showing progress
❌ Agent exploits reward function bugs
❌ Simulation performance doesn't transfer to real world
❌ Results not reproducible across runs
This lesson teaches you how to debug and deploy real RL systems.
Symptom: Agent performance stays random after millions of steps.
Diagnosis Checklist:
✓ Is the reward signal non-zero?
✓ Can any policy get positive reward?
✓ Is the learning rate reasonable (not 0 or too large)?
✓ Is the network updating? (check gradients)
✓ Is exploration sufficient? (check epsilon/entropy - see the sketch below)
Example - CartPole:
# BUG: Reward scale too small
reward = 0.001 if not done else 0 # Agent can't distinguish good/bad
# FIX: Proper reward scaling
reward = 1.0 if not done else 0
Debugging Code:
# Check if network is updating
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm().item()}")
    else:
        print(f"{name}: NO GRADIENT!")  # Problem!
Symptom: Agent learns well, then suddenly forgets and performs poorly.
Causes: replay buffer too small (useful experience is overwritten too quickly), learning rate too high, or target network updated too often.
Fixes:
# Increase replay buffer
buffer_size = 100_000 # → 1_000_000
# Reduce learning rate
lr = 0.001 # → 0.0001
# Increase target network update frequency
target_update_freq = 100 # → 1000
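It also helps to catch forgetting early by monitoring for sudden collapses. A minimal sketch (the 50% drop threshold is an arbitrary choice and assumes roughly positive episode rewards, as in CartPole):
best_avg = None

def check_for_collapse(recent_rewards, drop_fraction=0.5):
    # Warn when the recent average reward falls far below the best average seen so far
    global best_avg
    avg = sum(recent_rewards) / len(recent_rewards)
    if best_avg is None or avg > best_avg:
        best_avg = avg
    elif avg < drop_fraction * best_avg:
        print(f"WARNING: possible forgetting (recent avg {avg:.1f} vs best {best_avg:.1f})")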
Symptom: Q-values grow unrealistically large (e.g., Q(s,a) = 10,000), training becomes unstable.
Cause: The max operator in the Q-learning target systematically overestimates action values when the Q-estimates are noisy.
Fixes:
# Use Double DQN: select actions with the online network, evaluate them with the target network
target_actions = q_network(next_states).argmax(dim=1)
target_values = target_network(next_states).gather(1, target_actions.unsqueeze(1)).squeeze(1)
targets = rewards + gamma * (1 - dones) * target_values  # TD target
# Clip rewards to [-1, 1]
reward = np.clip(reward, -1, 1)
# Add gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
Symptom: Agent achieves high reward but doesn't solve the task.
Example - CoastRunners (boat-racing game):
Goal: Win the boat race
Reward: Points for hitting targets along the course
Exploit: The agent drives in circles to hit targets repeatedly and never finishes the race!
Prevention:
✓ Watch what the agent actually does, not just the reward curve
✓ Reward the true objective (e.g., race progress and completion), not an easy-to-game proxy
✓ Make bonus rewards one-time or capped so they cannot be farmed
✓ Penalize wasted time so looping behavior is costly
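As an illustration of the last two points, a minimal sketch of a hardened racing reward (the checkpoint bookkeeping and function names are hypothetical, not taken from the actual game):
visited_checkpoints = set()

def race_reward(checkpoint_id, finished):
    reward = -0.01                                   # small time penalty: circling now costs reward
    if checkpoint_id is not None and checkpoint_id not in visited_checkpoints:
        visited_checkpoints.add(checkpoint_id)       # each checkpoint bonus can be earned only once
        reward += 1.0
    if finished:
        reward += 100.0                              # the true objective dominates
    return reward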
Sparse Rewards:
reward = +1 if reached_goal else 0
Pros: Simple, unambiguous goal
Cons: Hard to learn (positive signal is rare)
Dense Rewards:
reward = -distance_to_goal # Negative distance as reward
Pros: Easier to learn (continuous feedback)
Cons: Risk of reward hacking
Principle: Add intermediate rewards that guide agent toward goal without changing optimal policy.
Example - Navigation:
# Sparse (hard to learn)
reward = +100 if at_goal else 0
# Shaped (easier to learn)
distance_old = distance_to_goal(old_state)
distance_new = distance_to_goal(new_state)
reward = (distance_old - distance_new) # Reward for getting closer
# Terminal bonus
if at_goal:
    reward += 100
Potential-Based Shaping (Theoretically Sound):
F(s, s') = γ * Φ(s') - Φ(s)
Where Φ(s) is a potential function (e.g., negative distance to goal).
Guarantee: Adding F doesn't change optimal policy.
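A minimal sketch of potential-based shaping for the navigation example, reusing the distance_to_goal helper from above:
gamma = 0.99

def potential(state):
    # Φ(s): negative distance to goal, so states closer to the goal have higher potential
    return -distance_to_goal(state)

def shaped_reward(env_reward, old_state, new_state):
    # F(s, s') = γ·Φ(s') - Φ(s); adding F leaves the optimal policy unchanged
    shaping = gamma * potential(new_state) - potential(old_state)
    return env_reward + shaping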
1. Start Simple
# Good first attempt
reward = +1 if goal else -1 if collision else 0
2. Avoid Conflicting Signals
# BAD: Conflicting signals with arbitrary weights; the energy penalty overwhelms the speed bonus
reward = 100 * speed - 50 * energy  # agent may learn to do nothing just to save energy
# GOOD: Single weighted objective
reward = 100 * speed - 0.5 * energy
3. Normalize Reward Scale
# Typical range: [-1, 1] or [0, 1]
reward = (raw_reward - mean) / std  # mean/std are usually tracked as running statistics (see the sketch after this list)
4. Test with Random Policy First
# Can a random agent get ANY positive reward?
random_agent_reward = test_random_policy(env, episodes=100)
print(f"Random agent: {random_agent_reward}") # Should be > 0
Essential Metrics:
# Performance
episode_reward # Increasing over time?
episode_length # Reaching longer episodes?
success_rate # % of episodes reaching goal
# Learning dynamics
policy_loss # Decreasing?
value_loss # Decreasing?
explained_variance # > 0? (critic learning)
approx_kl # < 0.05? (PPO: not changing too fast)
entropy # > 0? (still exploring)
# Stability
grad_norm # Not exploding? (< 10)
q_value_mean # Not growing without bound?
clip_fraction # 0.1-0.3 for PPO (reasonable clipping)
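Two of these metrics are easy to compute incorrectly. A minimal sketch of commonly used formulas, assuming you already have value predictions, empirical returns, and old/new action log-probabilities as tensors:
import torch

def explained_variance(values, returns):
    # 1 - Var(returns - values) / Var(returns); close to 1 means the critic tracks the returns well
    var_returns = returns.var()
    return float('nan') if var_returns == 0 else (1 - (returns - values).var() / var_returns).item()

def approx_kl(old_log_probs, new_log_probs):
    # A simple sample-based estimate of how far the updated policy has drifted from the old one
    return (old_log_probs - new_log_probs).mean().item()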
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/experiment')

def log_diagnostics(model, episode, reward, loss):
    # Log scalar metrics
    writer.add_scalar('Performance/reward', reward, episode)
    writer.add_scalar('Loss/policy', loss, episode)

    # Log gradient norms
    total_norm = 0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5
    writer.add_scalar('Diagnostics/grad_norm', total_norm, episode)

    # Log weight histograms (every 100 episodes)
    if episode % 100 == 0:
        for name, param in model.named_parameters():
            writer.add_histogram(name, param, episode)
# Visualize with TensorBoard
# tensorboard --logdir runs/
When training fails, check in order:
1. Environment
✓ Can you manually solve the task?
✓ Does the reward function make sense?
✓ Is the observation space reasonable?
✓ Are action bounds correct?
2. Algorithm
✓ Are hyperparameters in reasonable ranges?
✓ Is the network architecture appropriate?
✓ Is the algorithm correctly implemented?
✓ Are gradients flowing? (check with dummy data - see the sketch after this checklist)
3. Training
✓ Is the agent exploring enough?
✓ Is the replay buffer being used correctly?
✓ Are rewards being normalized?
✓ Is the learning rate schedule appropriate?
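A minimal gradient-flow check with dummy data (a sketch; state_dim and the dummy loss are placeholders for your network's actual interface):
import torch

def check_gradient_flow(model, state_dim):
    # Push a random batch through the network and confirm every parameter receives a gradient
    dummy_states = torch.randn(32, state_dim)
    loss = model(dummy_states).mean()       # any scalar works for this smoke test
    model.zero_grad()
    loss.backward()
    for name, param in model.named_parameters():
        status = param.grad.norm().item() if param.grad is not None else 'MISSING'
        print(f"{name}: grad = {status}")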
Problem: Policies trained in simulation often fail on real robots.
Causes: inaccurate physics modeling (mass, friction, contact dynamics), unmodeled sensor noise and actuation latency, and visual differences between simulation and reality.
Idea: Randomize simulation parameters during training to make policy robust.
Example - Robot Grasping:
import numpy as np

# Randomize object properties
object_mass = np.random.uniform(0.1, 1.0)        # kg
object_friction = np.random.uniform(0.1, 0.9)
object_shape = np.random.choice(['cube', 'cylinder', 'sphere'])

# Randomize robot properties
joint_damping = np.random.uniform(0.01, 0.1)
motor_noise = np.random.normal(0, 0.05)

# Randomize environment
lighting = np.random.uniform(0.5, 1.5)
camera_position = np.random.uniform(-0.1, 0.1, size=3)
Result: Policy learns to handle variation -> transfers better to real world.
Idea: Measure the real system's parameters and adjust the simulation to match them.
Process:
1. Collect trajectories or measurements from the real system.
2. Fit simulation parameters (e.g., mass, friction, delays) so simulated behavior matches the real data, as sketched below.
3. Retrain or fine-tune the policy in the calibrated simulator.
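A minimal sketch of step 2, fitting a single friction parameter by grid search (simulate_trajectory and real_trajectory are hypothetical stand-ins for your simulator rollout and logged real-world data):
import numpy as np

candidate_frictions = np.linspace(0.1, 0.9, 50)
errors = []
for friction in candidate_frictions:
    sim_traj = simulate_trajectory(friction)                    # hypothetical: simulator rollout with this friction
    errors.append(np.mean((sim_traj - real_trajectory) ** 2))   # mismatch against the logged real trajectory

best_friction = candidate_frictions[int(np.argmin(errors))]
print(f"Calibrated friction: {best_friction:.3f}")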
Idea: Gradually transition from simulation to reality.
Steps:
1. Train the policy entirely in simulation (with domain randomization).
2. Fine-tune on the real system with a small learning rate and limited interaction, as sketched below.
3. Deploy with the safety checks described later in this lesson.
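A minimal fine-tuning sketch for step 2 (assumes a PyTorch policy trained in simulation and saved to sim_policy.pt; the rollout and loss helpers are hypothetical placeholders for whatever your training code already uses):
import torch

policy.load_state_dict(torch.load('sim_policy.pt'))       # start from the simulation-trained weights

# Much smaller learning rate: real-world data should nudge the policy, not overwrite it
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

for update in range(100):                                  # far fewer updates than in simulation
    batch = collect_real_rollout(policy)                    # hypothetical: short rollout on the real system
    loss = compute_policy_loss(policy, batch)               # hypothetical: same loss as used in simulation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()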
Problem: RL results are often not reproducible across random seeds, code versions, library versions, and hardware.
Impact: Hard to compare algorithms or build on prior work.
1. Fix Random Seeds
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Make cuDNN deterministic (slower but reproducible)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
2. Log Hyperparameters
import json
hyperparameters = {
    'algorithm': 'PPO',
    'learning_rate': 3e-4,
    'batch_size': 64,
    'gamma': 0.99,
    'n_steps': 2048,
    'n_epochs': 10,
    'clip_range': 0.2,
    'seed': 42
}

with open('hyperparameters.json', 'w') as f:
    json.dump(hyperparameters, f, indent=2)
3. Version Control
# Save exact code version
git rev-parse HEAD > git_commit.txt
# Save library versions
pip freeze > requirements.txt
4. Multiple Seeds
# Run with multiple seeds, report mean ± std
seeds = [42, 123, 456, 789, 1011]
results = []
for seed in seeds:
    set_seed(seed)
    reward = train_agent(seed=seed)
    results.append(reward)

print(f"Mean: {np.mean(results):.2f} ± {np.std(results):.2f}")
Exporting Trained Policy:
# PyTorch: Export to TorchScript
import torch
from stable_baselines3 import PPO   # assuming a Stable-Baselines3 PPO model

model = PPO.load('trained_model.zip')
scripted_model = torch.jit.script(model.policy)
scripted_model.save('policy_scripted.pt')

# Load in production (no Stable-Baselines3 dependency needed)
policy = torch.jit.load('policy_scripted.pt')
action = policy(observation)
ONNX Export (for cross-platform deployment):
import torch
import torch.onnx

dummy_input = torch.randn(1, state_dim)   # state_dim: size of the observation vector
torch.onnx.export(
    model.policy,
    dummy_input,
    'policy.onnx',
    input_names=['state'],
    output_names=['action']
)
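Loading the exported model for inference, e.g. with ONNX Runtime (a sketch assuming the onnxruntime package and a float32 observation array):
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('policy.onnx')
obs = np.zeros((1, state_dim), dtype=np.float32)   # placeholder observation with the exported shape
action = session.run(None, {'state': obs})[0]      # 'state' matches input_names in the export above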
1. Action Clipping
# Ensure actions are in safe range
action = np.clip(action, env.action_space.low, env.action_space.high)
2. Fallback Controller
# Use safe fallback if policy fails
try:
    action = policy.predict(observation)
except Exception as e:
    print(f"Policy failed: {e}")
    action = safe_fallback_controller(observation)
3. Monitoring
# Log anomalies in production
if np.any(np.isnan(action)):
    log_error("NaN action detected!")
    action = default_safe_action
if reward < -100:  # Unusually bad
    log_warning(f"Abnormal reward: {reward}")
Gradual Rollout:
# Route 10% of traffic to new RL policy
if random.random() < 0.1:
    action = rl_policy.predict(observation)
else:
    action = baseline_policy(observation)
# Monitor metrics and gradually increase the RL fraction if performance holds up
Episode Replay:
from gym.wrappers import Monitor
env = Monitor(env, './videos/', force=True)
episode_rewards = []
for episode in range(10):
    observation = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = agent.select_action(observation)
        observation, reward, done, _ = env.step(action)
        episode_reward += reward
    episode_rewards.append(episode_reward)

print(f"Mean reward: {np.mean(episode_rewards)}")
# Videos saved to ./videos/ for review
State-Action Heatmaps:
import matplotlib.pyplot as plt
# Collect state-action pairs
states, actions = [], []
for episode in range(100):
    s, a = collect_episode(env, agent)   # collect_episode: returns the episode's states and actions
    states.extend(s)
    actions.extend(a)
# Plot
plt.scatter(states, actions, alpha=0.1)
plt.xlabel('State')
plt.ylabel('Action')
plt.title('State-Action Distribution')
plt.savefig('state_action_heatmap.png')
Find Performance Bottlenecks:
import time
# Profile training loop
times = {}
start = time.time()
batch = replay_buffer.sample(batch_size)
times['sampling'] = time.time() - start
start = time.time()
loss = compute_loss(batch)
times['forward'] = time.time() - start
start = time.time()
optimizer.zero_grad()
loss.backward()
optimizer.step()
times['backward'] = time.time() - start
print(f"Time breakdown: {times}")
Task: Train CartPole agent with DQN
Expected: Solve in <1000 episodes
Actual: No learning after 5000 episodes
Step 1: Check Reward Signal
# Run random policy
random_rewards = []
for _ in range(100):
    reward = test_random_policy(env)
    random_rewards.append(reward)
print(f"Random policy: {np.mean(random_rewards):.1f} ± {np.std(random_rewards):.1f}")
# Output: 22.3 ± 5.2 (reasonable baseline)
Step 2: Check Network Updates
# Check if gradients exist
for name, param in q_network.named_parameters():
    print(f"{name}: grad = {param.grad.norm() if param.grad is not None else 'None'}")
# Output: All gradients present, norms ~1.0 (okay)
Step 3: Check Q-Values
# Print Q-values every 100 episodes
if episode % 100 == 0:
    sample_state = env.reset()
    q_values = q_network(torch.FloatTensor(sample_state))
    print(f"Episode {episode}: Q-values = {q_values}")
# Output: Q-values stuck at ~0.0 (problem!)
Step 4: Check Target Network Updates
# BUG FOUND: Target network never updated!
if episode % target_update_freq == 0:
    target_network.load_state_dict(q_network.state_dict())
    print(f"Target network updated at episode {episode}")
# Output: No print statements (target_update_freq set incorrectly!)
Fix:
# Was: target_update_freq = 10000 (episodes)
# Should be: target_update_freq = 100 (episodes)
target_update_freq = 100
Result: Agent learns successfully after fix!
You've completed the Reinforcement Learning Module (Lessons 1-8)!
Next:
RL Project Ideas: