By completing this activity, you will:
Understand Proximal Policy Optimization (PPO), the gold standard of modern on-policy RL
Implement the clipped surrogate objective for stable policy updates
Apply Generalized Advantage Estimation (GAE) for variance reduction
Master Actor-Critic architecture with shared neural networks
Train agents on continuous control tasks (LunarLander-v2)
Analyze policy loss, value loss, KL divergence, and entropy during training
Achieve 200+ average reward on LunarLander in under 50K steps
Open in Google Colab : Upload this notebook to Google Colab
Run All Cells : Click Runtime -> Run all (or press Ctrl+F9 )
Watch the Magic : You'll see:
✅ LunarLander-v2 environment setup
✅ Actor-Critic neural network architecture
✅ Random baseline agent (reward ~-200)
✅ GAE (Generalized Advantage Estimation) implementation
✅ Training progress visualization
Expected First Run Time : ~90 seconds
The template comes with 65% working code :
✅ LunarLander-v2 Environment : Continuous control with 8D observations and 2D continuous actions
✅ Random Baseline : Agent with average reward ~-200 (crashes)
✅ Actor-Critic Network : Shared backbone, policy head, value head (PyTorch; see the sketch after this list)
✅ GAE Implementation : Advantage estimation with λ=0.95
✅ Rollout Buffer : Stores trajectories for batch training
✅ Training Loop Framework : Multiple epochs per batch
✅ Visualization Tools : Policy loss, value loss, KL divergence, entropy plots
⚠️ TODO 1 : Implement PPO clipped surrogate objective (Hard)
⚠️ TODO 2 : Implement advantage calculation (Medium)
⚠️ TODO 3 : Implement entropy bonus (Easy)
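For orientation, here is a rough sketch of what a shared-backbone actor-critic network like the pre-built one might look like. The names, layer sizes, and the 2-D continuous action head are illustrative assumptions, not the notebook's exact code:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared backbone with a Gaussian policy head and a value head (illustrative sizes)."""
    def __init__(self, obs_dim=8, act_dim=2, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_mean = nn.Linear(hidden, act_dim)      # mean of the Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std
        self.value_head = nn.Linear(hidden, 1)             # state-value estimate V(s)

    def forward(self, obs):
        features = self.backbone(obs)
        return self.policy_mean(features), self.log_std, self.value_head(features).squeeze(-1)
```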
Location : Section 6 - "PPO Loss Functions"
Current State : Policy loss function exists but doesn't implement PPO's clipped objective
Your Task : Implement PPO's clipped surrogate objective to prevent large policy updates:
```
L^CLIP(θ) = E[ min( r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A ) ]
```
Where:
r(θ) = π_θ(a|s) / π_θ_old(a|s) = probability ratio (new policy / old policy)
A = advantage estimate (from GAE)
ε = clip parameter (typically 0.2)
clip(r, 1-ε, 1+ε) = ratio clipped to [0.8, 1.2] when ε = 0.2
Why Clipping?
Without clipping: Large policy updates can destroy performance
With clipping: Conservative updates ensure stability
Intuition: Don't deviate too far from old policy in one step
Starter Code Provided :
```python
def compute_policy_loss(log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # TODO: compute the probability ratio and return the clipped surrogate loss
    pass
```
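If you want something to compare against once you have attempted the TODO, here is one possible reference sketch. It assumes the log-probabilities are already summed over action dimensions and that the advantages are normalized and detached from the graph:

```python
import torch

def compute_policy_loss(log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # r(θ) = π_θ(a|s) / π_θ_old(a|s), computed stably in log space
    ratio = torch.exp(log_probs - old_log_probs)
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    # PPO maximizes the elementwise minimum; negate to get a loss to minimize
    return -torch.min(surr_unclipped, surr_clipped).mean()
```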
Success Criteria :
Verification Test :
```python
import torch

log_probs = torch.tensor([0.0, 0.1, -0.1])
old_log_probs = torch.tensor([0.0, 0.0, 0.0])
advantages = torch.tensor([1.0, 1.0, -1.0])
loss = compute_policy_loss(log_probs, old_log_probs, advantages, clip_epsilon=0.2)
```
Location : Section 5 - "Generalized Advantage Estimation (GAE)"
Your Task : Normalize the GAE-computed advantages before they are used in the policy update
Background : GAE is already implemented (pre-built), but you need to:
Normalize advantages (mean=0, std=1) for stable training
Handle edge cases (std=0)
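For reference, the pre-built GAE step typically follows this recursion. This is a sketch under the assumption of a single flat trajectory with one bootstrap value appended to `values`; it is not necessarily the notebook's exact code:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: length T; values: length T + 1 (includes the bootstrap value V(s_T))
    advantages = torch.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        # TD error: δ_t = r_t + γ·V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # GAE recursion: A_t = δ_t + γλ·A_{t+1}
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```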
Requirements :
```python
def normalize_advantages(advantages):
    # TODO: return advantages with mean 0 and std 1; guard against std == 0
    pass
```
Why Normalize?
Raw advantages can vary wildly in scale (-100 to +100)
Normalization ensures consistent learning across episodes
Standard practice in all PPO implementations
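One possible solution sketch, assuming `advantages` is a torch tensor. The `eps` argument is an addition beyond the starter signature, used to handle the std = 0 edge case:

```python
import torch

def normalize_advantages(advantages, eps=1e-8):
    # Shift to zero mean and scale to unit std for stable gradient magnitudes.
    # eps guards against std == 0 (e.g. when all advantages are identical).
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```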
Success Criteria :
Location : Section 6 - "PPO Loss Functions"
Your Task : Add entropy bonus to encourage exploration
Background : Entropy measures policy randomness:
High entropy = policy is uncertain (good early in training)
Low entropy = policy is confident (good later in training)
Entropy bonus prevents premature convergence to suboptimal policy
Formula :
```
H(π) = -Σ π(a|s) log π(a|s)
```
For a Gaussian policy (continuous actions), per action dimension:
```
H = 0.5 * log(2πe · σ²)
```
Starter Code :
```python
def compute_entropy_bonus(log_std):
    # TODO: return the entropy of the Gaussian policy from its log standard deviation
    pass
```
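One possible sketch, assuming `log_std` is a tensor whose last dimension indexes the action dimensions:

```python
import math
import torch

def compute_entropy_bonus(log_std):
    # Diagonal Gaussian entropy per action dimension:
    # H = 0.5 * log(2πe σ²) = 0.5 * (1 + log(2π)) + log σ
    entropy_per_dim = 0.5 * (1.0 + math.log(2 * math.pi)) + log_std
    # Sum over action dimensions; average over the batch if log_std is batched.
    return entropy_per_dim.sum(dim=-1).mean()
```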
Requirements :
Entropy bonus added to total loss: loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
Coefficient 0.01 balances exploration vs exploitation
Entropy should decrease over training (policy becomes more confident)
Success Criteria :
Implement early stopping when the KL divergence exceeds a threshold:
Monitor the KL divergence between the old and new policy
If KL > 0.03, stop the epoch early (policy changed too much)
Prevents policy from deviating too far from old policy
Why?
PPO's clipping is first line of defense
KL early stopping is second line of defense
Together they ensure extremely stable training
Implementation Hints :
```python
# Approximate KL divergence between the rollout (old) policy and the updated policy
kl_div = (old_log_probs - log_probs).mean().item()
if kl_div > 0.03:
    print(f"Early stopping at epoch {epoch} due to high KL: {kl_div:.4f}")
    break
```
Train with parallel environments for faster data collection:
Use gym.vector.AsyncVectorEnv for parallel environments (see the sketch below)
Collect rollouts from N=4 environments simultaneously
4x data collection speed -> faster training
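A minimal sketch of the setup, assuming the classic Gym vector API (Gymnasium's reset/step return signatures differ slightly):

```python
import gym

# Build 4 LunarLander environments that step in parallel worker processes.
envs = gym.vector.AsyncVectorEnv(
    [lambda: gym.make("LunarLander-v2") for _ in range(4)]
)

obs = envs.reset()                    # batched observations, shape (4, 8)
actions = envs.action_space.sample()  # one action per environment
obs, rewards, dones, infos = envs.step(actions)
envs.close()
```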
Expected Results :
Training time reduced by ~3x
More stable learning (more diverse data per batch)
Higher sample efficiency
Compare PPO to REINFORCE (Activity 01):
Train both algorithms on LunarLander-v2
Same hyperparameters (where applicable)
Compare: sample efficiency, final performance, stability
Expected Findings :
PPO converges in ~30K steps
REINFORCE needs ~100K+ steps
PPO is more stable (less variance in learning curve)
PPO achieves higher final reward
Adapt PPO for discrete actions (CartPole-v1):
Change policy head to output logits instead of mean/std
Use categorical distribution instead of Gaussian
Compare learning speed: continuous vs discrete
Key Differences :
Discrete: Output logits, sample with Categorical
Continuous: Output mean/std, sample with Normal
Entropy calculation changes (categorical vs Gaussian; see the sketch below)
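A minimal illustration of the two sampling paths (the shapes and values here are illustrative, not the notebook's code):

```python
import torch
from torch.distributions import Categorical, Normal

# Discrete head (e.g. CartPole-v1): the network outputs one logit per action.
logits = torch.randn(1, 2)                        # batch of 1, 2 discrete actions
dist_d = Categorical(logits=logits)
action_d = dist_d.sample()                        # integer action index
log_prob_d = dist_d.log_prob(action_d)
entropy_d = dist_d.entropy()

# Continuous head (the LunarLander setup here): the network outputs mean and std.
mean, std = torch.zeros(1, 2), 0.5 * torch.ones(1, 2)
dist_c = Normal(mean, std)
action_c = dist_c.sample()                        # real-valued action vector
log_prob_c = dist_c.log_prob(action_c).sum(dim=-1)  # sum over action dimensions
entropy_c = dist_c.entropy().sum(dim=-1)
```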
Random Agent (Baseline) :
Average Reward: -200 (crashes immediately)
Episode Length: ~20 steps (early termination)
Success: Never lands safely

Trained PPO Agent :
Average Reward: 150-200 after 30K steps
Average Reward: 200-250 after 50K steps (solved!)
Episode Length: 200+ steps (full episodes)
Success: Lands safely most of the time

Well-Tuned PPO Agent :
Average Reward: 250+ (expert level)
Convergence: ~20K steps (with parallel envs)
Stability: Very smooth learning curve
| Algorithm | Steps to Solve | Final Reward | Stability |
|---|---|---|---|
| REINFORCE | 100K+ | 150-200 | Low (high variance) |
| PPO | 30-50K | 200-250 | High (stable) |
Solved Threshold : Average reward > 200 over 100 consecutive episodes
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution : Check advantage normalization:
Ensure advantages are normalized (mean=0, std=1)
Add small epsilon (1e-8) when dividing by std
Clip advantages to [-10, 10] to prevent extreme values
Reduce learning rate to 3e-4 (from default)
Solution : Multiple possible causes:
PPO clipped objective : Verify ratio calculation is correct
Advantage signs : Check that positive advantages increase action probability
Value function : Ensure value loss is being minimized
Learning rate : Try increasing to 5e-4 if learning is too slow
Batch size : Increase to 2048 if updates are too noisy
Solution : Policy is changing too fast:
Reduce learning rate by 50% (e.g., 3e-4 -> 1.5e-4)
Decrease clip epsilon (0.2 -> 0.1) for more conservative updates
Reduce number of epochs per batch (4 -> 2)
Implement KL early stopping (Challenge 1)
Solution : Agent is becoming overconfident early:
Increase entropy coefficient (0.01 -> 0.02)
Initialize log_std higher (-0.5 -> 0.0)
Clip log_std to prevent it from going too negative
This encourages more exploration throughout training
Solution : Catastrophic forgetting:
Reduce learning rate (more conservative updates)
Decrease PPO epochs (4 -> 3) to prevent over-updating
Increase batch size (more stable gradient estimates)
Check value function isn't over-fitting (value loss should decrease)
Solution : Optimize hyperparameters:
Increase batch size to 2048 (more updates per rollout)
Use parallel environments (Challenge 2)
Increase learning rate carefully (3e-4 -> 5e-4)
GPU acceleration: Ensure PyTorch is using the GPU in Colab (quick check below)
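A quick way to confirm the GPU runtime is active:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)  # prints "cuda" when the Colab GPU runtime is enabled
```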
Concept 05 : Policy Gradient Methods (REINFORCE foundation)
Concept 06 : Actor-Critic Methods (combining policy and value)
Concept 07 : Proximal Policy Optimization (theory and derivation)
Schulman et al. (2017): "Proximal Policy Optimization Algorithms" - Original PPO paper
Schulman et al. (2016): "High-Dimensional Continuous Control Using GAE" - GAE paper
Mnih et al. (2016): "Asynchronous Methods for Deep Reinforcement Learning" - A3C, an earlier on-policy actor-critic method
OpenAI Spinning Up: PPO - Excellent explanation
Arxiv Insights: PPO - Visual intuition
Environment : Land a lunar lander safely on a landing pad
Observation Space (8 dimensions):
Position (x, y)
Velocity (vx, vy)
Angle, angular velocity
Left/right leg contact (boolean)
Action Space (2 dimensions - continuous):
Main engine throttle [0, 1]
Left/right engine throttle [-1, 1]
Reward Structure :
Moving toward landing pad: positive reward
Moving away: negative reward
Crash: -100
Safe landing: +100-140 (bonus for fuel efficiency)
Using engines: small negative reward (fuel cost)
Episode Termination :
Lander crashes (touches ground at bad angle/speed)
Lander goes out of bounds
Episode reaches 1000 steps
Solved Criteria : Average reward > 200 over 100 consecutive episodes
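To inspect these spaces yourself, something like the following works. The continuous-action variant is assumed here, and the classic Gym API is shown (Gymnasium's reset/step signatures differ slightly):

```python
import gym

env = gym.make("LunarLanderContinuous-v2")  # continuous-action variant
print(env.observation_space)                # Box with shape (8,): the state listed above
print(env.action_space)                     # Box with shape (2,): main and lateral throttle

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```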
Sample Efficiency : Reuses each collected batch for multiple gradient epochs (K=4 epochs per batch)
Stability : Clipped objective prevents destructive updates
Simplicity : Easier to implement than TRPO, no KL constraint optimization
Generality : Works on discrete, continuous, multi-discrete action spaces
Performance : State-of-the-art on many benchmarks
| Feature | REINFORCE | A2C | PPO |
|---|---|---|---|
| On/Off Policy | On-policy | On-policy | On-policy |
| Update Rule | Vanilla PG | One-step TD | Clipped objective |
| Data Efficiency | Low | Medium | High |
| Stability | Low | Medium | High |
| Complexity | Simple | Medium | Medium |
✅ Use PPO when :
You need stable, reliable learning
You want good sample efficiency
You have continuous or discrete actions
You want production-ready performance
❌ Don't use PPO when :
You need absolute best sample efficiency (use SAC)
You have very simple discrete tasks (DQN might be simpler)
You need off-policy learning (use SAC/TD3)
Complete required TODOs (minimum: TODO 1-3)
Run entire notebook to generate all outputs
Export training plots : Save learning curves as PNG
Download notebook : File -> Download -> Download .ipynb
Submit via portal : Upload .ipynb and plots.png
Submission Checklist :
After mastering PPO:
Move to Project 2: Train Your Own RL Agent
Apply PPO to custom environments
Explore advanced algorithms: SAC, TD3, DDPG (off-policy alternatives)
Learn multi-agent RL and hierarchical RL
Key Insight : PPO represents the state-of-the-art in on-policy RL. It's used by OpenAI, DeepMind, and industry for:
Robotic control
Game playing (e.g., OpenAI Five for Dota 2)
Autonomous vehicles
Industrial optimization
Master PPO, and you're ready for production RL systems!
Good luck! PPO is the algorithm that powers most real-world RL applications. Understanding PPO deeply will make you dangerous in applied RL! 🚀