Practice and reinforce the concepts from Lesson 6
In this activity, you'll use Stable-Baselines3 to train a PPO agent on a challenging continuous-control robotics task. Along the way you'll practice production-style RL workflows, tune hyperparameters, and learn how to push PPO to strong, stable performance.
By completing this activity, you will:
Download the activity template from the Templates folder:
AI25-Template-activity-06-proximal-policy-optimization.zip (located at Templates/AI25-Template-activity-06-proximal-policy-optimization.zip)
Unzip it and upload activity-06-proximal-policy-optimization.ipynb to Google Colab
Execute the first few cells to:
BipedalWalker-v3:
Challenge: High-dimensional continuous control, complex physics, requires stable learning!
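Before writing any training code, it can help to sanity-check the observation and action spaces you are dealing with. A minimal sketch (assuming gymnasium is installed, as Stable-Baselines3 requires):

import gymnasium as gym

check_env = gym.make("BipedalWalker-v3")
print(check_env.observation_space)  # 24-dimensional continuous state (hull, joints, lidar)
print(check_env.action_space)       # 4 continuous joint torques, each in [-1, 1]
check_env.close()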
TODO 1: Create vectorized environment
Use make_vec_env to create parallel environments:
from stable_baselines3.common.env_util import make_vec_env
# TODO 1: Create vectorized environment
env = make_vec_env(
# Your code here: env_id, n_envs
)
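One possible completion of TODO 1 (the choice of 16 parallel environments is illustrative, not required):

from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("BipedalWalker-v3", n_envs=16)  # 16 parallel copies for faster rollouts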
TODO 2: Configure PPO model
from stable_baselines3 import PPO
# TODO 2: Create PPO model
model = PPO(
policy="MlpPolicy",
env=env,
learning_rate=# TODO: Set learning rate,
n_steps=# TODO: Steps per environment before update,
batch_size=# TODO: Minibatch size,
n_epochs=# TODO: PPO epochs per update,
gamma=# TODO: Discount factor,
gae_lambda=# TODO: GAE lambda,
clip_range=# TODO: Clipping epsilon,
verbose=1,
tensorboard_log="./ppo_tensorboard/"
)
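One hedged completion of TODO 2, reusing the "Default (Good Starting Point)" hyperparameters listed later in this activity:

from stable_baselines3 import PPO

model = PPO(
    policy="MlpPolicy",
    env=env,                      # vectorized env from TODO 1
    learning_rate=3e-4,
    n_steps=2048,                 # rollout steps per environment before each update
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1,
    tensorboard_log="./ppo_tensorboard/"
)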
TODO 3: Implement checkpoint callback
from stable_baselines3.common.callbacks import CheckpointCallback
# TODO 3: Create checkpoint callback
checkpoint_callback = CheckpointCallback(
save_freq=# TODO: Save frequency,
save_path=# TODO: Save directory,
name_prefix=# TODO: Model name prefix
)
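A possible completion of TODO 3 (the frequency and paths are assumptions). Note that save_freq counts calls to the vectorized environment's step(), so with n parallel environments a checkpoint is written every save_freq * n_envs timesteps:

from stable_baselines3.common.callbacks import CheckpointCallback

checkpoint_callback = CheckpointCallback(
    save_freq=50_000,                  # illustrative value
    save_path="./checkpoints/",
    name_prefix="ppo_bipedalwalker"
)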
TODO 4: Implement evaluation callback
from stable_baselines3.common.callbacks import EvalCallback
# TODO 4: Create evaluation callback
eval_callback = EvalCallback(
eval_env, # Separate evaluation environment
best_model_save_path=# TODO: Path to save best model,
log_path=# TODO: Log directory,
eval_freq=# TODO: Evaluation frequency,
n_eval_episodes=# TODO: Episodes per evaluation,
deterministic=True
)
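A possible completion of TODO 4, including creation of the separate evaluation environment and the learn() call that wires both callbacks together (paths, frequencies, and the 1M-step budget are assumptions):

import gymnasium as gym
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.monitor import Monitor

# Separate, Monitor-wrapped environment so episode statistics are recorded
eval_env = Monitor(gym.make("BipedalWalker-v3"))

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./best_model/",
    log_path="./eval_logs/",
    eval_freq=10_000,
    n_eval_episodes=10,
    deterministic=True
)

# Pass both callbacks to training
model.learn(total_timesteps=1_000_000, callback=[checkpoint_callback, eval_callback])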
TODO 5: Experiment with different hyperparameter sets
# TODO 5: Define hyperparameter configurations to test
configs = [
{
"name": "baseline",
"learning_rate": 3e-4,
"n_steps": 2048,
"batch_size": 64,
"n_epochs": 10,
"clip_range": 0.2,
},
{
"name": "aggressive",
# TODO: Your config
},
{
"name": "conservative",
# TODO: Your config
}
]
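Once the remaining configs are filled in, one way to run the sweep is to train a fresh model per configuration and give each run its own TensorBoard name (the 300k-step budget and n_envs=16 are assumptions chosen to keep the sweep short):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

results = {}
for cfg in configs:
    name = cfg["name"]
    params = {k: v for k, v in cfg.items() if k != "name"}  # everything except the label
    model = PPO(
        "MlpPolicy",
        make_vec_env("BipedalWalker-v3", n_envs=16),
        verbose=0,
        tensorboard_log="./ppo_tensorboard/",
        **params
    )
    model.learn(total_timesteps=300_000, tb_log_name=name)
    results[name] = model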
TODO 6 (Optional): Define custom actor-critic architecture
policy_kwargs = dict(
net_arch=[
# TODO: Define network architecture
# Example: dict(pi=[256, 256], vf=[256, 256])
]
)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, ...)
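A hedged completion of TODO 6 with separate 256-256 networks for the policy (pi) and value function (vf); the sizes are an illustrative choice. In recent Stable-Baselines3 versions net_arch takes the dict directly, while older versions expected it wrapped in a list as in the template above:

from stable_baselines3 import PPO

policy_kwargs = dict(net_arch=dict(pi=[256, 256], vf=[256, 256]))

model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=policy_kwargs,
    verbose=1,
    tensorboard_log="./ppo_tensorboard/"
)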
Typical training progression:
Steps 0-100K: Random walking, frequent falls, reward ~ -100 to 0
Steps 100K-500K: Learning basic locomotion, reward reaches 50-150
Steps 500K-1M: Stable walking gait, reward reaches 200-280
Steps 1M+: Optimal walking, reward > 300 (SOLVED!)
BipedalWalker-v3 is considered "solved" when:
Average reward ≥ 300 over 100 consecutive episodes
PPO typically solves this in 1-2M steps (with good hyperparameters).
Your implementation is complete when:
Default (Good Starting Point):
learning_rate = 3e-4
n_steps = 2048
batch_size = 64
n_epochs = 10
gamma = 0.99
gae_lambda = 0.95
clip_range = 0.2
For Faster Learning (More Aggressive):
learning_rate = 5e-4 # Higher learning rate
n_steps = 4096 # More data per update
batch_size = 128 # Larger batches
n_epochs = 15 # More training epochs
clip_range = 0.3 # Allow larger policy changes
For More Stable Learning (Conservative):
learning_rate = 1e-4 # Lower learning rate
n_steps = 1024 # Less data per update
batch_size = 32 # Smaller batches
n_epochs = 5 # Fewer epochs
clip_range = 0.1 # Smaller policy changes
Key metrics to watch in TensorBoard:
# Performance
rollout/ep_rew_mean # Average episode reward (should increase)
rollout/ep_len_mean # Average episode length
# Policy metrics
train/policy_loss # Policy loss (should decrease then stabilize)
train/approx_kl # KL divergence (<0.05 is good)
train/clip_fraction # Fraction clipped (0.1-0.3 is good)
train/entropy_loss # Negative policy entropy (the policy's entropy should decay slowly)
# Value function
train/value_loss # Critic loss (should decrease)
train/explained_variance # How well critic predicts returns (>0.5 good)
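To view these metrics from Colab, TensorBoard can be started inline with the notebook magics (run them in their own cell, pointing at the tensorboard_log directory used above):

%load_ext tensorboard
%tensorboard --logdir ./ppo_tensorboard/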
Problem: Agent doesn't learn (reward stays negative)
Problem: Training unstable (reward oscillates)
Problem: Agent learns then forgets
Problem: Slow training
Add intermediate rewards to guide learning:
import numpy as np
from gymnasium import Wrapper

class RewardShapingWrapper(Wrapper):
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Add a bonus for forward velocity ('forward_velocity' is an illustrative
        # info key; .get() falls back to 0 if the environment doesn't provide it)
        forward_velocity = info.get('forward_velocity', 0)
        reward += 0.1 * forward_velocity
        # Penalize excessive joint torque (energy efficiency)
        torque_penalty = np.sum(np.abs(action))
        reward -= 0.01 * torque_penalty
        return obs, reward, terminated, truncated, info
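Usage sketch: the wrapper can be applied to each parallel environment through make_vec_env's wrapper_class argument (or by wrapping a single gym.make call directly):

from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("BipedalWalker-v3", n_envs=16, wrapper_class=RewardShapingWrapper)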
Gradually increase task difficulty:
import gymnasium as gym

# Start with the easier variant (no hardcore obstacles)
env = gym.make('BipedalWalker-v3', hardcore=False)

# After the agent learns, switch to hardcore
if mean_reward > 250:
    env = gym.make('BipedalWalker-v3', hardcore=True)
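A sketch of the switch in practice, assuming model, make_vec_env, and Monitor are already set up as in the earlier steps; the 250-reward threshold and the extra 1M-step budget are assumptions:

from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

mean_reward, _ = evaluate_policy(model, Monitor(gym.make('BipedalWalker-v3')), n_eval_episodes=10)
if mean_reward > 250:
    # Swap in the hardcore variant and continue training without resetting the step counter
    model.set_env(make_vec_env('BipedalWalker-v3', n_envs=16, env_kwargs={'hardcore': True}))
    model.learn(total_timesteps=1_000_000, reset_num_timesteps=False)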
Use Optuna for automated hyperparameter tuning:
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial):
    # Sample hyperparameters
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical('n_steps', [1024, 2048, 4096])
    # ... more hyperparameters

    # Train model
    model = PPO("MlpPolicy", env, learning_rate=learning_rate, n_steps=n_steps)
    model.learn(total_timesteps=100000)

    # Evaluate (evaluate_policy returns a (mean, std) tuple)
    mean_reward, _ = evaluate_policy(model, eval_env, n_eval_episodes=10)
    return mean_reward
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
Train on multiple environments simultaneously:
env_ids = ['BipedalWalker-v3', 'LunarLanderContinuous-v2', 'Pendulum-v1']
# Create and train on each
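A minimal sketch of that loop (one model per task; the 500k-step budget per environment and n_envs=8 are assumptions):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

models = {}
for env_id in env_ids:
    vec_env = make_vec_env(env_id, n_envs=8)
    model = PPO("MlpPolicy", vec_env, verbose=0, tensorboard_log="./ppo_tensorboard/")
    model.learn(total_timesteps=500_000, tb_log_name=env_id)
    model.save(f"ppo_{env_id}")
    models[env_id] = model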
Completed Notebook: activity-06-proximal-policy-optimization.ipynb
Performance Report: Brief summary including:
TensorBoard Logs: Export training curves
Trained Model: Save best checkpoint
best_model.zip from the evaluation callback

After completing this activity:
You've now mastered the RL module! Next, we'll explore a different paradigm: multi-armed bandits.
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy training state-of-the-art PPO agents! 🤖🚶