Practice and reinforce the concepts from Lesson 8
In this final RL activity, you'll practice essential production RL skills: debugging broken training code, implementing reward shaping, monitoring training with diagnostics, and deploying a trained agent. In practice, most real-world RL effort goes into debugging, and this activity prepares you for exactly that.
By completing this activity, you will:
- Diagnose and fix bugs in a broken DQN training loop
- Implement reward shaping for a sparse-reward environment
- Build training monitoring with TensorBoard logging and anomaly detection
- Run and compare a systematic hyperparameter search
- Deploy a trained agent with production safeguards
Download the activity template from the Templates folder:
1. Download AI25-Template-activity-08-rl-in-practice.zip (Templates/AI25-Template-activity-08-rl-in-practice.zip)
2. Upload activity-08-rl-in-practice.ipynb to Google Colab
3. Execute the first few cells
Scenario: A DQN agent for CartPole-v1 isn't learning (reward stays ~20-30 after 1000 episodes).
TODO 1: Diagnose and fix the bug(s)
Common bugs to check:
# Bug 1: Target network never updated
if episode % target_update_freq == 0:
    # BUG: Missing target network update!
    pass  # TODO 1: Add target_network.load_state_dict(q_network.state_dict())

# Bug 2: Replay buffer not filling
if len(replay_buffer) > batch_size:
    # BUG: Replay buffer size is 100 (too small!)
    pass  # TODO 1: Increase buffer capacity

# Bug 3: Learning rate too low
optimizer = optim.Adam(q_network.parameters(), lr=0.00001)
# BUG: Learning rate 1e-5 is too low for CartPole!
# TODO 1: Change to lr=0.001

# Bug 4: No exploration
epsilon = 0.0  # BUG: No exploration!
# TODO 1: Set epsilon=1.0 initially with decay
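For Bug 4, one possible epsilon-greedy schedule is sketched below; the decay rate, floor, and the select_action helper are illustrative assumptions, not part of the template (it assumes state is already a PyTorch tensor).

import random
import torch

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995  # assumed schedule

def select_action(q_network, state, epsilon, action_space):
    # Explore with probability epsilon, otherwise act greedily on Q-values
    if random.random() < epsilon:
        return action_space.sample()
    with torch.no_grad():
        return int(q_network(state).argmax().item())

# After each episode, decay epsilon toward its floor
epsilon = max(epsilon_min, epsilon * epsilon_decay)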
Scenario: Train an agent on MountainCar-v0 (very sparse rewards).
TODO 2: Implement reward shaping to guide learning
class RewardShapingWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.prev_position = None

    def step(self, action):
        obs, reward, done, truncated, info = self.env.step(action)

        # TODO 2: Implement reward shaping
        # Original: reward = -1 (sparse)
        # Shaped: reward = progress_bonus + height_bonus
        #
        # Hints:
        # - reward for moving toward goal (obs[0] increasing)
        # - reward for reaching higher positions (obs[0] > threshold)
        # - terminal bonus for reaching goal

        position = obs[0]
        velocity = obs[1]

        # Your reward shaping code here
        shaped_reward = reward  # Replace with your shaping

        self.prev_position = position
        return obs, shaped_reward, done, truncated, info
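A usage sketch, assuming Gymnasium and the standard MountainCar-v0 environment id:

import gymnasium as gym

env = RewardShapingWrapper(gym.make("MountainCar-v0"))
obs, info = env.reset()
obs, shaped_reward, done, truncated, info = env.step(env.action_space.sample())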
TODO 3: Implement comprehensive training monitoring
from torch.utils.tensorboard import SummaryWriter

class TrainingMonitor:
    def __init__(self, log_dir="./logs"):
        self.writer = SummaryWriter(log_dir)

    def log_episode(self, episode, reward, length, epsilon):
        # TODO 3: Log episode metrics to TensorBoard
        pass

    def log_training_step(self, step, loss, q_values, gradients):
        # TODO 3: Log training metrics
        # - Loss values
        # - Q-value statistics (mean, max, min)
        # - Gradient norms
        pass

    def check_anomalies(self, loss, q_values, gradients):
        # TODO 3: Detect training anomalies
        # Check for:
        # - NaN values
        # - Exploding gradients (norm > threshold)
        # - Q-value explosion (> threshold)
        # Return: anomaly_detected (bool), anomaly_message (str)
        pass
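As a starting point for log_episode, here is a minimal sketch using SummaryWriter.add_scalar; the tag names are illustrative choices, not required by the template.

def log_episode(self, episode, reward, length, epsilon):
    # One scalar per metric, keyed by episode number
    self.writer.add_scalar("episode/reward", reward, episode)
    self.writer.add_scalar("episode/length", length, episode)
    self.writer.add_scalar("train/epsilon", epsilon, episode)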
TODO 4: Run systematic hyperparameter search
# TODO 4: Define hyperparameter configurations
configs = [
    {
        "name": "baseline",
        "learning_rate": 0.001,
        "gamma": 0.99,
        "epsilon_decay": 0.995,
        "buffer_size": 10000,
        "target_update_freq": 100,
    },
    {
        "name": "high_lr",
        # TODO 4: Your config (experiment with higher LR)
    },
    {
        "name": "large_buffer",
        # TODO 4: Your config (experiment with larger buffer)
    },
    # Add more configs
]

# TODO 4: Run training for each config and compare
results = {}
for config in configs:
    agent = train_agent(config)
    results[config["name"]] = {
        "final_reward": agent.final_reward,
        "episodes_to_solve": agent.episodes_to_solve,
        "training_time": agent.training_time,
    }
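One way to compare the runs, assuming the results dictionary built above:

# Print configs sorted by final reward (best first)
for name, r in sorted(results.items(), key=lambda kv: kv[1]["final_reward"], reverse=True):
    print(f"{name:<15} reward={r['final_reward']:.1f}  "
          f"solved in {r['episodes_to_solve']} episodes  "
          f"({r['training_time']:.0f}s)")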
TODO 5: Deploy trained agent with production safeguards
class ProductionAgent:
    def __init__(self, model_path):
        self.model = self.load_model(model_path)
        self.fallback_controller = self.create_fallback()
        self.anomaly_count = 0

    def load_model(self, model_path):
        # TODO 5: Load trained model safely (try-except)
        pass

    def predict(self, observation):
        try:
            # TODO 5: Get action from model
            action = self.model.predict(observation)

            # TODO 5: Safety checks
            # - Check for NaN
            # - Clip action to valid range
            # - Log anomalies
            return action
        except Exception as e:
            # TODO 5: Handle errors
            # Log error, increment anomaly count
            # Return safe fallback action
            pass

    def create_fallback(self):
        # TODO 5: Create safe fallback controller
        # For CartPole: oscillate left/right
        # For others: zero action or emergency stop
        pass
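An illustrative serving loop; the model path and environment are assumptions, not requirements of the template.

import gymnasium as gym

agent = ProductionAgent("model.zip")
env = gym.make("CartPole-v1")
obs, info = env.reset()
done = truncated = False
while not (done or truncated):
    action = agent.predict(obs)
    obs, reward, done, truncated, info = env.step(action)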
Utilities for reproducible experiments:
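The extensions below call a set_seed helper; a minimal sketch, assuming Python's random, NumPy, and PyTorch are the only RNG sources in play:

import random
import numpy as np
import torch

def set_seed(seed: int):
    # Seed every source of randomness the training loop touches
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)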
Before Fix: average reward plateaus around 20-30 and never improves (the broken scenario above).
After Fix: average reward climbs toward the CartPole-v1 solve threshold of 195+.
Without Shaping (MountainCar): success rate <20%
With Shaping:
Healthy Training:
✓ Episode 100: Reward = 50 ± 10
✓ Loss = 0.5 ± 0.1
✓ Q-values = 10 ± 5
✓ Gradient norm = 2.3 (stable)
✓ No anomalies detected
Unhealthy Training:
✗ Episode 100: Reward = 10 ± 50 (high variance)
✗ Loss = NaN
✗ Q-values = 1000+ (exploding)
✗ Gradient norm = 500 (exploding!)
✗ ANOMALY DETECTED: Gradient explosion
Expected Ranking:
Production Metrics:
✓ Model loaded successfully
✓ 1000 predictions in 0.5s (2000 FPS)
✓ 0 NaN actions
✓ 0 fallback activations
✓ Average reward: 195 ± 5
Your implementation is complete when all five TODOs (debugging, reward shaping, monitoring, hyperparameter search, deployment) run end to end and reproduce results in line with the expectations above.
Good Shaping:
# Guide toward goal without changing optimal policy
progress = current_position - previous_position
reward += 10 * progress # Reward for moving toward goal
Bad Shaping:
# Overshaping: Agent exploits reward signal
reward += 100 * abs(velocity) # Agent just oscillates fast!
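A common way to keep shaping from changing the optimal policy is potential-based shaping, which adds gamma * phi(s') - phi(s) to the reward; the potential function below is an illustrative assumption for MountainCar, not the required solution.

def potential(obs):
    # Illustrative potential: higher car position (closer to the goal flag) is better
    return obs[0]

def potential_shaped_reward(reward, obs, next_obs, gamma=0.99):
    # Potential-based shaping (Ng et al., 1999) preserves the optimal policy
    return reward + gamma * potential(next_obs) - potential(obs)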
✓ Model loaded and tested offline
✓ Action bounds enforced
✓ Fallback controller implemented
✓ Monitoring and logging active
✓ Error handling for all exceptions
✓ Performance profiling (FPS, latency)
✓ Gradual rollout plan (A/B testing)
Run each configuration with 10 different seeds:
results = []
for seed in range(10):
    set_seed(seed)
    result = train_agent(config, seed=seed)
    results.append(result)

# Report mean ± std across seeds
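To report the mean ± std, assuming each result exposes a final_reward attribute as in the earlier snippet:

import numpy as np

final_rewards = [r.final_reward for r in results]
print(f"final reward: {np.mean(final_rewards):.1f} ± {np.std(final_rewards):.1f}")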
Use Optuna for Bayesian optimization:
import optuna
def objective(trial):
    lr = trial.suggest_float('learning_rate', 1e-5, 1e-2, log=True)
    gamma = trial.suggest_float('gamma', 0.9, 0.999)
    # ... more hyperparameters
    agent = train_agent(learning_rate=lr, gamma=gamma)
    return agent.final_reward

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
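Once the study finishes, the best configuration can be read back from standard Optuna attributes:

print("Best reward:", study.best_value)
print("Best hyperparameters:", study.best_params)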
Simulate domain randomization:
class RandomizedEnv(gym.Wrapper):
    def reset(self, **kwargs):
        # Randomize environment parameters before each episode
        self.env.gravity = np.random.uniform(9.0, 10.0)
        self.env.mass = np.random.uniform(0.8, 1.2)
        return self.env.reset(**kwargs)
Train on randomized sim, test on fixed "real" environment.
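A sketch of that split, with CartPole-v1 as an assumed example environment:

import gymnasium as gym

train_env = RandomizedEnv(gym.make("CartPole-v1"))  # randomized "sim"
test_env = gym.make("CartPole-v1")                  # fixed "real" environment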
Set up automated testing:
def test_agent_performance():
    agent = load_trained_agent("model.zip")
    test_reward = evaluate(agent, n_episodes=100)
    assert test_reward > 195, f"Agent performance degraded: {test_reward}"
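This can be run under a test runner such as pytest, or directly via a minimal entry point:

if __name__ == "__main__":
    test_agent_performance()
    print("✓ agent performance regression test passed")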
Completed Notebook: activity-08-rl-in-practice.ipynb
Debugging Report: the bugs you found in TODO 1 and how you fixed them
Hyperparameter Analysis: comparison of your TODO 4 configurations (final reward, episodes to solve, training time)
Deployment Demo: your ProductionAgent from TODO 5 running with its safeguards active
🎉 Congratulations! You've completed the Reinforcement Learning Module!
RL Projects:
Next Module: Generative AI (Lessons 9-15)
This activity is graded on:
Passing Grade: 70% or higher
Excellent work completing the RL module! You're now ready to build production RL systems! 🚀🎯