By completing this activity, you will:
Understand Proximal Policy Optimization (PPO), the gold standard of modern on-policy RL
Implement the clipped surrogate objective for stable policy updates
Apply Generalized Advantage Estimation (GAE) for variance reduction
Master Actor-Critic architecture with shared neural networks
Train agents on continuous control tasks (LunarLander-v2)
Analyze policy loss, value loss, KL divergence, and entropy during training
Achieve 200+ average reward on LunarLander in under 50K steps
Open in Google Colab : Upload this notebook to Google Colab
Run All Cells : Click Runtime -> Run all (or press Ctrl+F9 )
Watch the Magic : You'll see:
✅ LunarLander-v2 environment setup
✅ Actor-Critic neural network architecture
✅ Random baseline agent (reward ~-200)
✅ GAE (Generalized Advantage Estimation) implementation
✅ Training progress visualization
Expected First Run Time : ~90 seconds
The template comes with 65% working code :
✅ LunarLander-v2 Environment : Continuous control with 8D observations and 2D continuous actions
✅ Random Baseline : Agent with average reward ~-200 (crashes)
✅ Actor-Critic Network : Shared backbone, policy head, value head (PyTorch; see the sketch after this list)
✅ GAE Implementation : Advantage estimation with λ=0.95
✅ Rollout Buffer : Stores trajectories for batch training
✅ Training Loop Framework : Multiple epochs per batch
✅ Visualization Tools : Policy loss, value loss, KL divergence, entropy plots
⚠️ TODO 1 : Implement PPO clipped surrogate objective (Hard)
⚠️ TODO 2 : Implement advantage calculation (Medium)
⚠️ TODO 3 : Implement entropy bonus (Easy)
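For orientation, here is a rough sketch of what a shared-backbone actor-critic network like the pre-built one might look like. The names, layer sizes, and the 2-D continuous action head are illustrative assumptions, not the notebook's exact code:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared backbone with a Gaussian policy head and a value head (illustrative sizes)."""
    def __init__(self, obs_dim=8, act_dim=2, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_mean = nn.Linear(hidden, act_dim)      # mean of the Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std
        self.value_head = nn.Linear(hidden, 1)             # state-value estimate V(s)

    def forward(self, obs):
        features = self.backbone(obs)
        return self.policy_mean(features), self.log_std, self.value_head(features).squeeze(-1)
```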
Location : Section 6 - "PPO Loss Functions"
Current State : Policy loss function exists but doesn't implement PPO's clipped objective
Your Task : Implement PPO's clipped surrogate objective to prevent large policy updates:
```
L^CLIP(θ) = E[ min( r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A ) ]
```
Where:
r(θ) = π_θ(a|s) / π_θ_old(a|s) = probability ratio (new policy / old policy)
A = advantage estimate (from GAE)
ε = clip parameter (typically 0.2)
clip(r, 1-ε, 1+ε) = ratio clipped to [0.8, 1.2] when ε = 0.2
Why Clipping?
Without clipping: Large policy updates can destroy performance
With clipping: Conservative updates ensure stability
Intuition: Don't deviate too far from old policy in one step
Starter Code Provided :
```python
def compute_policy_loss(log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # TODO: compute the probability ratio and return the clipped surrogate loss
    pass
```
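If you want something to compare against once you have attempted the TODO, here is one possible reference sketch. It assumes the log-probabilities are already summed over action dimensions and that the advantages are normalized and detached from the graph:

```python
import torch

def compute_policy_loss(log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # r(θ) = π_θ(a|s) / π_θ_old(a|s), computed stably in log space
    ratio = torch.exp(log_probs - old_log_probs)
    surr_unclipped = ratio * advantages
    surr_clipped = torch.clamp(ratio, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
    # PPO maximizes the elementwise minimum; negate to get a loss to minimize
    return -torch.min(surr_unclipped, surr_clipped).mean()
```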
Success Criteria :
Verification Test :
```python
import torch

log_probs = torch.tensor([0.0, 0.1, -0.1])
old_log_probs = torch.tensor([0.0, 0.0, 0.0])
advantages = torch.tensor([1.0, 1.0, -1.0])
loss = compute_policy_loss(log_probs, old_log_probs, advantages, clip_epsilon=0.2)
```
Location : Section 5 - "Generalized Advantage Estimation (GAE)"
Your Task : Normalize the GAE-computed advantages before they are used in the policy update
Background : GAE is already implemented (pre-built), but you need to:
Normalize advantages (mean=0, std=1) for stable training
Handle edge cases (std=0)
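For reference, the pre-built GAE step typically follows this recursion. This is a sketch under the assumption of a single flat trajectory with one bootstrap value appended to `values`; it is not necessarily the notebook's exact code:

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: length T; values: length T + 1 (includes the bootstrap value V(s_T))
    advantages = torch.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        # TD error: δ_t = r_t + γ·V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # GAE recursion: A_t = δ_t + γλ·A_{t+1}
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```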
Requirements :
```python
def normalize_advantages(advantages):
    # TODO: return advantages with mean 0 and std 1; guard against std == 0
    pass
```
Why Normalize?
Raw advantages can vary wildly in scale (-100 to +100)
Normalization ensures consistent learning across episodes
Standard practice in all PPO implementations
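One possible solution sketch, assuming `advantages` is a torch tensor. The `eps` argument is an addition beyond the starter signature, used to handle the std = 0 edge case:

```python
import torch

def normalize_advantages(advantages, eps=1e-8):
    # Shift to zero mean and scale to unit std for stable gradient magnitudes.
    # eps guards against std == 0 (e.g. when all advantages are identical).
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```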
Success Criteria :
Location : Section 6 - "PPO Loss Functions"
Your Task : Add entropy bonus to encourage exploration
Background : Entropy measures policy randomness:
High entropy = policy is uncertain (good early in training)
Low entropy = policy is confident (good later in training)
Entropy bonus prevents premature convergence to suboptimal policy
Formula :
```
H(π) = -Σ π(a|s) log π(a|s)
```
For a Gaussian policy (continuous actions), per action dimension:
```
H = 0.5 * log(2πe · σ²)
```
Starter Code :
```python
def compute_entropy_bonus(log_std):
    # TODO: return the entropy of the Gaussian policy from its log standard deviation
    pass
```
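One possible sketch, assuming `log_std` is a tensor whose last dimension indexes the action dimensions:

```python
import math
import torch

def compute_entropy_bonus(log_std):
    # Diagonal Gaussian entropy per action dimension:
    # H = 0.5 * log(2πe σ²) = 0.5 * (1 + log(2π)) + log σ
    entropy_per_dim = 0.5 * (1.0 + math.log(2 * math.pi)) + log_std
    # Sum over action dimensions; average over the batch if log_std is batched.
    return entropy_per_dim.sum(dim=-1).mean()
```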
Requirements :
Entropy bonus added to total loss: loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
Coefficient 0.01 balances exploration vs exploitation
Entropy should decrease over training (policy becomes more confident)
Success Criteria :
Implement early stopping when the KL divergence exceeds a threshold:
Monitor the KL divergence between the old and new policy
If KL > 0.03, stop the epoch early (policy changed too much)
Prevents policy from deviating too far from old policy
Why?
PPO's clipping is first line of defense
KL early stopping is second line of defense
Together they ensure extremely stable training
Implementation Hints :
```python
# Approximate KL divergence between the rollout (old) policy and the updated policy
kl_div = (old_log_probs - log_probs).mean().item()
if kl_div > 0.03:
    print(f"Early stopping at epoch {epoch} due to high KL: {kl_div:.4f}")
    break
```
Train with parallel environments for faster data collection:
Use gym.vector.AsyncVectorEnv for parallel environments (see the sketch below)
Collect rollouts from N=4 environments simultaneously
4x data collection speed -> faster training
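A minimal sketch of the setup, assuming the classic Gym vector API (Gymnasium's reset/step return signatures differ slightly):

```python
import gym

# Build 4 LunarLander environments that step in parallel worker processes.
envs = gym.vector.AsyncVectorEnv(
    [lambda: gym.make("LunarLander-v2") for _ in range(4)]
)

obs = envs.reset()                    # batched observations, shape (4, 8)
actions = envs.action_space.sample()  # one action per environment
obs, rewards, dones, infos = envs.step(actions)
envs.close()
```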
Expected Results :
Training time reduced by ~3x
More stable learning (more diverse data per batch)
Higher sample efficiency
Compare PPO to REINFORCE (Activity 01):
Train both algorithms on LunarLander-v2
Same hyperparameters (where applicable)
Compare: sample efficiency, final performance, stability
Expected Findings :
PPO converges in ~30K steps
REINFORCE needs ~100K+ steps
PPO is more stable (less variance in learning curve)
PPO achieves higher final reward
Adapt PPO for discrete actions (CartPole-v1):
Change policy head to output logits instead of mean/std
Use categorical distribution instead of Gaussian
Compare learning speed: continuous vs discrete
Key Differences :
Discrete: Output logits, sample with Categorical
Continuous: Output mean/std, sample with Normal
Entropy calculation changes (categorical vs Gaussian; see the sketch below)
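A minimal illustration of the two sampling paths (the shapes and values here are illustrative, not the notebook's code):

```python
import torch
from torch.distributions import Categorical, Normal

# Discrete head (e.g. CartPole-v1): the network outputs one logit per action.
logits = torch.randn(1, 2)                        # batch of 1, 2 discrete actions
dist_d = Categorical(logits=logits)
action_d = dist_d.sample()                        # integer action index
log_prob_d = dist_d.log_prob(action_d)
entropy_d = dist_d.entropy()

# Continuous head (the LunarLander setup here): the network outputs mean and std.
mean, std = torch.zeros(1, 2), 0.5 * torch.ones(1, 2)
dist_c = Normal(mean, std)
action_c = dist_c.sample()                        # real-valued action vector
log_prob_c = dist_c.log_prob(action_c).sum(dim=-1)  # sum over action dimensions
entropy_c = dist_c.entropy().sum(dim=-1)
```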
Random Agent (Baseline) :
Average Reward: -200 (crashes immediately)
Episode Length: ~20 steps (early termination)
Success: Never lands safely

Trained PPO Agent :
Average Reward: 150-200 after 30K steps
Average Reward: 200-250 after 50K steps (solved!)
Episode Length: 200+ steps (full episodes)
Success: Lands safely most of the time

Well-Tuned PPO Agent :
Average Reward: 250+ (expert level)
Convergence: ~20K steps (with parallel envs)
Stability: Very smooth learning curve
| Algorithm | Steps to Solve | Final Reward | Stability |
|---|---|---|---|
| REINFORCE | 100K+ | 150-200 | Low (high variance) |
| PPO | 30-50K | 200-250 | High (stable) |
Solved Threshold : Average reward > 200 over 100 consecutive episodes
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution : Check advantage normalization:
Ensure advantages are normalized (mean=0, std=1)
Add small epsilon (1e-8) when dividing by std
Clip advantages to [-10, 10] to prevent extreme values
Reduce learning rate to 3e-4 (from default)
Solution : Multiple possible causes:
PPO clipped objective : Verify ratio calculation is correct
Advantage signs : Check that positive advantages increase action probability
Value function : Ensure value loss is being minimized
Learning rate : Try increasing to 5e-4 if learning is too slow
Batch size : Increase to 2048 if updates are too noisy
Solution : Policy is changing too fast:
Reduce learning rate by 50% (e.g., 3e-4 -> 1.5e-4)
Decrease clip epsilon (0.2 -> 0.1) for more conservative updates
Reduce number of epochs per batch (4 -> 2)
Implement KL early stopping (Challenge 1)
Solution : Agent is becoming overconfident early:
Increase entropy coefficient (0.01 -> 0.02)
Initialize log_std higher (-0.5 -> 0.0)
Clip log_std to prevent it from going too negative
This encourages more exploration throughout training
Solution : Catastrophic forgetting:
Reduce learning rate (more conservative updates)
Decrease PPO epochs (4 -> 3) to prevent over-updating
Increase batch size (more stable gradient estimates)
Check value function isn't over-fitting (value loss should decrease)
Solution : Optimize hyperparameters:
Increase batch size to 2048 (more updates per rollout)
Use parallel environments (Challenge 2)
Increase learning rate carefully (3e-4 -> 5e-4)
GPU acceleration: Ensure PyTorch is using the GPU in Colab (quick check below)
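A quick way to confirm the GPU runtime is active:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)  # prints "cuda" when the Colab GPU runtime is enabled
```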
Concept 05 : Policy Gradient Methods (REINFORCE foundation)
Concept 06 : Actor-Critic Methods (combining policy and value)
Concept 07 : Proximal Policy Optimization (theory and derivation)
Schulman et al. (2017): "Proximal Policy Optimization Algorithms" - Original PPO paper
Schulman et al. (2016): "High-Dimensional Continuous Control Using GAE" - GAE paper
Mnih et al. (2016): "Asynchronous Methods for Deep Reinforcement Learning" - A3C, an earlier on-policy actor-critic method
OpenAI Spinning Up: PPO - Excellent explanation
Arxiv Insights: PPO - Visual intuition
Environment : Land a lunar lander safely on a landing pad
Observation Space (8 dimensions):
Position (x, y)
Velocity (vx, vy)
Angle, angular velocity
Left/right leg contact (boolean)
Action Space (2 dimensions - continuous):
Main engine throttle [0, 1]
Left/right engine throttle [-1, 1]
Reward Structure :
Moving toward landing pad: positive reward
Moving away: negative reward
Crash: -100
Safe landing: +100-140 (bonus for fuel efficiency)
Using engines: small negative reward (fuel cost)
Episode Termination :
Lander crashes (touches ground at bad angle/speed)
Lander goes out of bounds
Episode reaches 1000 steps
Solved Criteria : Average reward > 200 over 100 consecutive episodes
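To inspect these spaces yourself, something like the following works. The continuous-action variant is assumed here, and the classic Gym API is shown (Gymnasium's reset/step signatures differ slightly):

```python
import gym

env = gym.make("LunarLanderContinuous-v2")  # continuous-action variant
print(env.observation_space)                # Box with shape (8,): the state listed above
print(env.action_space)                     # Box with shape (2,): main and lateral throttle

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```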
Sample Efficiency : Reuses each collected batch for multiple gradient epochs (K=4 epochs per batch)
Stability : Clipped objective prevents destructive updates
Simplicity : Easier to implement than TRPO, no KL constraint optimization
Generality : Works on discrete, continuous, multi-discrete action spaces
Performance : State-of-the-art on many benchmarks
| Feature | REINFORCE | A2C | PPO |
|---|---|---|---|
| On/Off Policy | On-policy | On-policy | On-policy |
| Update Rule | Vanilla PG | One-step TD | Clipped objective |
| Data Efficiency | Low | Medium | High |
| Stability | Low | Medium | High |
| Complexity | Simple | Medium | Medium |
✅ Use PPO when :
You need stable, reliable learning
You want good sample efficiency
You have continuous or discrete actions
You want production-ready performance
❌ Don't use PPO when :
You need absolute best sample efficiency (use SAC)
You have very simple discrete tasks (DQN might be simpler)
You need off-policy learning (use SAC/TD3)
Complete required TODOs (minimum: TODO 1-3)
Run entire notebook to generate all outputs
Export training plots : Save learning curves as PNG
Download notebook : File -> Download -> Download .ipynb
Submit via portal : Upload .ipynb and plots.png
Submission Checklist :
After mastering PPO:
Move to Project 2: Train Your Own RL Agent
Apply PPO to custom environments
Explore advanced algorithms: SAC, TD3, DDPG (off-policy alternatives)
Learn multi-agent RL and hierarchical RL
Key Insight : PPO represents the state-of-the-art in on-policy RL. It's used by OpenAI, DeepMind, and industry for:
Robotic control
Game playing (e.g., OpenAI Five for Dota 2)
Autonomous vehicles
Industrial optimization
Master PPO, and you're ready for production RL systems!
Good luck! PPO is the algorithm that powers most real-world RL applications. Understanding PPO deeply will make you dangerous in applied RL! 🚀