ℹ️ Definition Practical RL involves debugging training failures, designing effective reward functions, diagnosing common issues, and deploying RL agents to production environments - skills critical for real-world RL applications.
By the end of this lesson, you will be able to debug common RL training failures, design reward functions that resist hacking, monitor training with the right diagnostics, and deploy trained policies safely to production.
In Lessons 1-7, we learned RL algorithms that "work in theory." In practice, reality often looks like this:
❌ Agent learns nothing after 1 million steps
❌ Training diverges after showing progress
❌ Agent exploits reward function bugs
❌ Simulation performance doesn't transfer to real world
❌ Results not reproducible across runs
This lesson teaches you how to debug and deploy real RL systems.
Symptom: Agent performance stays random after millions of steps.
Diagnosis Checklist:
✓ Is the reward signal non-zero?
✓ Can any policy get positive reward?
✓ Is the learning rate reasonable (not 0 or too large)?
✓ Is the network updating? (check gradients)
✓ Is exploration sufficient? (check epsilon/entropy - see the sketch below)
Example - CartPole:
# BUG: Reward scale too small
reward = 0.001 if not done else 0 # Agent can't distinguish good/bad
# FIX: Proper reward scaling
reward = 1.0 if not done else 0
Debugging Code:
# Check if network is updating
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm().item()}")
    else:
        print(f"{name}: NO GRADIENT!")  # Problem!
Symptom: Agent learns well, then suddenly forgets and performs poorly.
Causes: replay buffer too small (useful experience is overwritten too quickly), learning rate too high, or target network updated too often.
Fixes:
# Increase replay buffer
buffer_size = 100_000 # → 1_000_000
# Reduce learning rate
lr = 0.001 # → 0.0001
# Increase target network update frequency
target_update_freq = 100 # → 1000
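It also helps to catch forgetting early by monitoring for sudden collapses. A minimal sketch (the 50% drop threshold is an arbitrary choice and assumes roughly positive episode rewards, as in CartPole):
best_avg = None

def check_for_collapse(recent_rewards, drop_fraction=0.5):
    # Warn when the recent average reward falls far below the best average seen so far
    global best_avg
    avg = sum(recent_rewards) / len(recent_rewards)
    if best_avg is None or avg > best_avg:
        best_avg = avg
    elif avg < drop_fraction * best_avg:
        print(f"WARNING: possible forgetting (recent avg {avg:.1f} vs best {best_avg:.1f})")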
Symptom: Q-values grow unrealistically large (e.g., Q(s,a) = 10,000), training becomes unstable.
Cause: The max operator in the Q-learning target systematically overestimates action values when the Q-estimates are noisy.
Fixes:
# Use Double DQN: select actions with the online network, evaluate them with the target network
target_actions = q_network(next_states).argmax(dim=1)
target_values = target_network(next_states).gather(1, target_actions.unsqueeze(1)).squeeze(1)
targets = rewards + gamma * (1 - dones) * target_values  # TD target
# Clip rewards to [-1, 1]
reward = np.clip(reward, -1, 1)
# Add gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
Symptom: Agent achieves high reward but doesn't solve the task.
Example - CoastRunners (boat-racing game):
Goal: Win the boat race
Reward: Points for hitting targets along the course
Exploit: The agent drives in circles to hit targets repeatedly and never finishes the race!
Prevention:
✓ Watch what the agent actually does, not just the reward curve
✓ Reward the true objective (e.g., race progress and completion), not an easy-to-game proxy
✓ Make bonus rewards one-time or capped so they cannot be farmed
✓ Penalize wasted time so looping behavior is costly
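As an illustration of the last two points, a minimal sketch of a hardened racing reward (the checkpoint bookkeeping and function names are hypothetical, not taken from the actual game):
visited_checkpoints = set()

def race_reward(checkpoint_id, finished):
    reward = -0.01                                   # small time penalty: circling now costs reward
    if checkpoint_id is not None and checkpoint_id not in visited_checkpoints:
        visited_checkpoints.add(checkpoint_id)       # each checkpoint bonus can be earned only once
        reward += 1.0
    if finished:
        reward += 100.0                              # the true objective dominates
    return reward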
Sparse Rewards:
reward = +1 if reached_goal else 0
Pros: Simple, unambiguous goal
Cons: Hard to learn (positive signal is rare)
Dense Rewards:
reward = -distance_to_goal # Negative distance as reward
Pros: Easier to learn (continuous feedback)
Cons: Risk of reward hacking
Principle: Add intermediate rewards that guide agent toward goal without changing optimal policy.
Example - Navigation:
# Sparse (hard to learn)
reward = +100 if at_goal else 0
# Shaped (easier to learn)
distance_old = distance_to_goal(old_state)
distance_new = distance_to_goal(new_state)
reward = (distance_old - distance_new) # Reward for getting closer
# Terminal bonus
if at_goal:
    reward += 100
Potential-Based Shaping (Theoretically Sound):
F(s, s') = γ * Φ(s') - Φ(s)
Where Φ(s) is a potential function (e.g., negative distance to goal).
Guarantee: Adding F doesn't change optimal policy.
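A minimal sketch of potential-based shaping for the navigation example, reusing the distance_to_goal helper from above:
gamma = 0.99

def potential(state):
    # Φ(s): negative distance to goal, so states closer to the goal have higher potential
    return -distance_to_goal(state)

def shaped_reward(env_reward, old_state, new_state):
    # F(s, s') = γ·Φ(s') - Φ(s); adding F leaves the optimal policy unchanged
    shaping = gamma * potential(new_state) - potential(old_state)
    return env_reward + shaping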
1. Start Simple
# Good first attempt
reward = +1 if goal else -1 if collision else 0
2. Avoid Conflicting Signals
# BAD: Conflicting signals with arbitrary weights; the energy penalty overwhelms the speed bonus
reward = 100 * speed - 50 * energy  # agent may learn to do nothing just to save energy
# GOOD: Single weighted objective
reward = 100 * speed - 0.5 * energy
3. Normalize Reward Scale
# Typical range: [-1, 1] or [0, 1]
reward = (raw_reward - mean) / std  # mean/std are usually tracked as running statistics (see the sketch after this list)
4. Test with Random Policy First
# Can a random agent get ANY positive reward?
random_agent_reward = test_random_policy(env, episodes=100)
print(f"Random agent: {random_agent_reward}") # Should be > 0
Essential Metrics:
# Performance
episode_reward # Increasing over time?
episode_length # Reaching longer episodes?
success_rate # % of episodes reaching goal
# Learning dynamics
policy_loss # Decreasing?
value_loss # Decreasing?
explained_variance # > 0? (critic learning)
approx_kl # < 0.05? (PPO: not changing too fast)
entropy # > 0? (still exploring)
# Stability
grad_norm # Not exploding? (< 10)
q_value_mean # Not growing without bound?
clip_fraction # 0.1-0.3 for PPO (reasonable clipping)
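Two of these metrics are easy to compute incorrectly. A minimal sketch of commonly used formulas, assuming you already have value predictions, empirical returns, and old/new action log-probabilities as tensors:
import torch

def explained_variance(values, returns):
    # 1 - Var(returns - values) / Var(returns); close to 1 means the critic tracks the returns well
    var_returns = returns.var()
    return float('nan') if var_returns == 0 else (1 - (returns - values).var() / var_returns).item()

def approx_kl(old_log_probs, new_log_probs):
    # A simple sample-based estimate of how far the updated policy has drifted from the old one
    return (old_log_probs - new_log_probs).mean().item()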
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/experiment')

def log_diagnostics(model, episode, reward, loss):
    # Log scalar metrics
    writer.add_scalar('Performance/reward', reward, episode)
    writer.add_scalar('Loss/policy', loss, episode)

    # Log gradient norms
    total_norm = 0
    for p in model.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5
    writer.add_scalar('Diagnostics/grad_norm', total_norm, episode)

    # Log weight histograms (every 100 episodes)
    if episode % 100 == 0:
        for name, param in model.named_parameters():
            writer.add_histogram(name, param, episode)
# Visualize with TensorBoard
# tensorboard --logdir runs/
When training fails, check in order:
1. Environment
✓ Can you manually solve the task?
✓ Does the reward function make sense?
✓ Is the observation space reasonable?
✓ Are action bounds correct?
2. Algorithm
✓ Are hyperparameters in reasonable ranges?
✓ Is the network architecture appropriate?
✓ Is the algorithm correctly implemented?
✓ Are gradients flowing? (check with dummy data - see the sketch after this checklist)
3. Training
✓ Is the agent exploring enough?
✓ Is the replay buffer being used correctly?
✓ Are rewards being normalized?
✓ Is the learning rate schedule appropriate?
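A minimal gradient-flow check with dummy data (a sketch; state_dim and the dummy loss are placeholders for your network's actual interface):
import torch

def check_gradient_flow(model, state_dim):
    # Push a random batch through the network and confirm every parameter receives a gradient
    dummy_states = torch.randn(32, state_dim)
    loss = model(dummy_states).mean()       # any scalar works for this smoke test
    model.zero_grad()
    loss.backward()
    for name, param in model.named_parameters():
        status = param.grad.norm().item() if param.grad is not None else 'MISSING'
        print(f"{name}: grad = {status}")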
Problem: Policies trained in simulation often fail on real robots.
Causes: inaccurate physics modeling (mass, friction, contact dynamics), unmodeled sensor noise and actuation latency, and visual differences between simulation and reality.
Idea: Randomize simulation parameters during training to make policy robust.
Example - Robot Grasping:
import numpy as np

# Randomize object properties
object_mass = np.random.uniform(0.1, 1.0)        # kg
object_friction = np.random.uniform(0.1, 0.9)
object_shape = np.random.choice(['cube', 'cylinder', 'sphere'])

# Randomize robot properties
joint_damping = np.random.uniform(0.01, 0.1)
motor_noise = np.random.normal(0, 0.05)

# Randomize environment
lighting = np.random.uniform(0.5, 1.5)
camera_position = np.random.uniform(-0.1, 0.1, size=3)
Result: Policy learns to handle variation -> transfers better to real world.
Idea: Measure the real system's parameters and adjust the simulation to match them.
Process:
1. Collect trajectories or measurements from the real system.
2. Fit simulation parameters (e.g., mass, friction, delays) so simulated behavior matches the real data, as sketched below.
3. Retrain or fine-tune the policy in the calibrated simulator.
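A minimal sketch of step 2, fitting a single friction parameter by grid search (simulate_trajectory and real_trajectory are hypothetical stand-ins for your simulator rollout and logged real-world data):
import numpy as np

candidate_frictions = np.linspace(0.1, 0.9, 50)
errors = []
for friction in candidate_frictions:
    sim_traj = simulate_trajectory(friction)                    # hypothetical: simulator rollout with this friction
    errors.append(np.mean((sim_traj - real_trajectory) ** 2))   # mismatch against the logged real trajectory

best_friction = candidate_frictions[int(np.argmin(errors))]
print(f"Calibrated friction: {best_friction:.3f}")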
Idea: Gradually transition from simulation to reality.
Steps:
1. Train the policy entirely in simulation (with domain randomization).
2. Fine-tune on the real system with a small learning rate and limited interaction, as sketched below.
3. Deploy with the safety checks described later in this lesson.
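A minimal fine-tuning sketch for step 2 (assumes a PyTorch policy trained in simulation and saved to sim_policy.pt; the rollout and loss helpers are hypothetical placeholders for whatever your training code already uses):
import torch

policy.load_state_dict(torch.load('sim_policy.pt'))       # start from the simulation-trained weights

# Much smaller learning rate: real-world data should nudge the policy, not overwrite it
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

for update in range(100):                                  # far fewer updates than in simulation
    batch = collect_real_rollout(policy)                    # hypothetical: short rollout on the real system
    loss = compute_policy_loss(policy, batch)               # hypothetical: same loss as used in simulation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()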
Problem: RL results are often not reproducible across random seeds, code versions, library versions, and hardware.
Impact: Hard to compare algorithms or build on prior work.
1. Fix Random Seeds
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Make cuDNN deterministic (slower but reproducible)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
2. Log Hyperparameters
import json
hyperparameters = {
    'algorithm': 'PPO',
    'learning_rate': 3e-4,
    'batch_size': 64,
    'gamma': 0.99,
    'n_steps': 2048,
    'n_epochs': 10,
    'clip_range': 0.2,
    'seed': 42
}

with open('hyperparameters.json', 'w') as f:
    json.dump(hyperparameters, f, indent=2)
3. Version Control
# Save exact code version
git rev-parse HEAD > git_commit.txt
# Save library versions
pip freeze > requirements.txt
4. Multiple Seeds
# Run with multiple seeds, report mean ± std
seeds = [42, 123, 456, 789, 1011]
results = []
for seed in seeds:
    set_seed(seed)
    reward = train_agent(seed=seed)
    results.append(reward)

print(f"Mean: {np.mean(results):.2f} ± {np.std(results):.2f}")
Exporting Trained Policy:
# PyTorch: Export to TorchScript
import torch
from stable_baselines3 import PPO   # assuming a Stable-Baselines3 PPO model

model = PPO.load('trained_model.zip')
scripted_model = torch.jit.script(model.policy)
scripted_model.save('policy_scripted.pt')

# Load in production (no Stable-Baselines3 dependency needed)
policy = torch.jit.load('policy_scripted.pt')
action = policy(observation)
ONNX Export (for cross-platform deployment):
import torch
import torch.onnx

dummy_input = torch.randn(1, state_dim)   # state_dim: size of the observation vector
torch.onnx.export(
    model.policy,
    dummy_input,
    'policy.onnx',
    input_names=['state'],
    output_names=['action']
)
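Loading the exported model for inference, e.g. with ONNX Runtime (a sketch assuming the onnxruntime package and a float32 observation array):
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('policy.onnx')
obs = np.zeros((1, state_dim), dtype=np.float32)   # placeholder observation with the exported shape
action = session.run(None, {'state': obs})[0]      # 'state' matches input_names in the export above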
1. Action Clipping
# Ensure actions are in safe range
action = np.clip(action, env.action_space.low, env.action_space.high)
2. Fallback Controller
# Use safe fallback if policy fails
try:
    action = policy.predict(observation)
except Exception as e:
    print(f"Policy failed: {e}")
    action = safe_fallback_controller(observation)
3. Monitoring
# Log anomalies in production
if np.any(np.isnan(action)):
    log_error("NaN action detected!")
    action = default_safe_action
if reward < -100:  # Unusually bad
    log_warning(f"Abnormal reward: {reward}")
Gradual Rollout:
# Route 10% of traffic to new RL policy
if random.random() < 0.1:
    action = rl_policy.predict(observation)
else:
    action = baseline_policy(observation)
# Monitor metrics and gradually increase the RL fraction if performance holds up
Episode Replay:
from gym.wrappers import Monitor
env = Monitor(env, './videos/', force=True)
episode_rewards = []
for episode in range(10):
    observation = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = agent.select_action(observation)
        observation, reward, done, _ = env.step(action)
        episode_reward += reward
    episode_rewards.append(episode_reward)

print(f"Mean reward: {np.mean(episode_rewards)}")
# Videos saved to ./videos/ for review
State-Action Heatmaps:
import matplotlib.pyplot as plt
# Collect state-action pairs
states, actions = [], []
for episode in range(100):
    s, a = collect_episode(env, agent)   # collect_episode: returns the episode's states and actions
    states.extend(s)
    actions.extend(a)
# Plot
plt.scatter(states, actions, alpha=0.1)
plt.xlabel('State')
plt.ylabel('Action')
plt.title('State-Action Distribution')
plt.savefig('state_action_heatmap.png')
Find Performance Bottlenecks:
import time
# Profile training loop
times = {}
start = time.time()
batch = replay_buffer.sample(batch_size)
times['sampling'] = time.time() - start
start = time.time()
loss = compute_loss(batch)
times['forward'] = time.time() - start
start = time.time()
optimizer.zero_grad()
loss.backward()
optimizer.step()
times['backward'] = time.time() - start
print(f"Time breakdown: {times}")
Task: Train CartPole agent with DQN
Expected: Solve in <1000 episodes
Actual: No learning after 5000 episodes
Step 1: Check Reward Signal
# Run random policy
random_rewards = []
for _ in range(100):
    reward = test_random_policy(env)
    random_rewards.append(reward)
print(f"Random policy: {np.mean(random_rewards):.1f} ± {np.std(random_rewards):.1f}")
# Output: 22.3 ± 5.2 (reasonable baseline)
Step 2: Check Network Updates
# Check if gradients exist
for name, param in q_network.named_parameters():
    print(f"{name}: grad = {param.grad.norm() if param.grad is not None else 'None'}")
# Output: All gradients present, norms ~1.0 (okay)
Step 3: Check Q-Values
# Print Q-values every 100 episodes
if episode % 100 == 0:
    sample_state = env.reset()
    q_values = q_network(torch.FloatTensor(sample_state))
    print(f"Episode {episode}: Q-values = {q_values}")
# Output: Q-values stuck at ~0.0 (problem!)
Step 4: Check Target Network Updates
# BUG FOUND: Target network never updated!
if episode % target_update_freq == 0:
    target_network.load_state_dict(q_network.state_dict())
    print(f"Target network updated at episode {episode}")
# Output: No print statements (target_update_freq set incorrectly!)
Fix:
# Was: target_update_freq = 10000 (episodes)
# Should be: target_update_freq = 100 (episodes)
target_update_freq = 100
Result: Agent learns successfully after fix!
You've completed the Reinforcement Learning Module (Lessons 1-8)!
Next:
RL Project Ideas: