Apply your knowledge to build something amazing!
Duration: 2 weeks
Points: 100
Prerequisites: Complete Lessons 4-6 (Policy Gradients, Actor-Critic, PPO)
Difficulty: Advanced
In this project, you'll train a PPO (Proximal Policy Optimization) agent to navigate a simulated robot through complex 3D environments. Unlike the discrete Atari games of Project 1, this project tackles continuous control, where actions are real-valued vectors (velocities, forces) rather than discrete button presses. You'll implement PPO with a continuous action space, train on physics-based simulations, and deploy a robust navigation policy.
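To see concretely what a continuous action space looks like, here is a minimal check using a standard Gymnasium task (Pendulum-v1 is just an illustrative stand-in for the Bullet environments used later):

```python
import gymnasium as gym

env = gym.make("Pendulum-v1")
print(env.observation_space)  # Box(...): real-valued observation vector
print(env.action_space)       # Box(-2.0, 2.0, (1,), float32): continuous torque

obs, info = env.reset(seed=0)
action = env.action_space.sample()  # a float vector, not a button index
obs, reward, terminated, truncated, info = env.step(action)
```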
Why This Matters: PPO is a workhorse of modern deep RL and robotics and has been used by labs such as OpenAI and DeepMind in real-world applications. This project gives you hands-on experience with an algorithm used in warehouse robotics, legged and humanoid locomotion, and other autonomous systems.
What You'll Build:
By completing this project, you will:
Your PPO agent must:
Your implementation must include:
project-02-autonomous-navigation/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── ppo_agent.py # PPO agent implementation
├── actor_critic.py # Neural network architectures
├── train.py # Training script with parallel envs
├── evaluate.py # Evaluation and benchmark script
├── models/ # Saved model checkpoints
│ ├── ppo_best.pth
│ ├── reinforce_baseline.pth
│ └── a2c_baseline.pth
├── logs/ # TensorBoard logs
└── videos/ # Recorded navigation videos
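As a rough sketch of how the checkpoints under models/ could be written and restored (the import path and the obs_dim/action_dim values below are illustrative assumptions, not part of the starter code):

```python
import torch
from actor_critic import ActorCritic  # assumed to match actor_critic.py above

# Example sizes roughly matching AntBulletEnv-v0; adjust to your environment
agent = ActorCritic(obs_dim=28, action_dim=8)
torch.save(agent.state_dict(), "models/ppo_best.pth")   # write a checkpoint

restored = ActorCritic(obs_dim=28, action_dim=8)
restored.load_state_dict(torch.load("models/ppo_best.pth", map_location="cpu"))
restored.eval()  # evaluation mode for deterministic rollouts
```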
| Criterion | Points | Description |
|---|---|---|
| PPO Implementation | 30 | Correct PPO with clipped objective and GAE |
| Performance | 25 | Agent achieves ≥85% success rate and the target episodic reward |
| Continuous Actions | 15 | Proper Gaussian policy implementation |
| Benchmarking | 10 | PPO compared to REINFORCE and A2C baselines |
| Code Quality | 10 | Clean, modular, well-documented code |
| Documentation | 10 | Comprehensive README and results analysis |
| Total | 100 | |
Bonus Points (+10 each):
Day 1-2: Environment Setup
Day 3-5: PPO Components
Day 6-7: Training Pipeline
Deliverable: Working PPO agent training on single environment
Day 8-9: Hyperparameter Tuning
Day 10-11: Baseline Comparison
Day 12-13: Robustness Testing
Day 14: Documentation and Presentation
Deliverable: Trained PPO agent, baselines, benchmark results, documentation
import torch
import torch.nn as nn
import torch.nn.functional as F  # used for the value loss in train_ppo below
from torch.distributions import Normal
class ActorCritic(nn.Module):
def __init__(self, obs_dim, action_dim, hidden_size=256):
super().__init__()
# Shared feature extractor (optional)
self.shared = nn.Sequential(
nn.Linear(obs_dim, hidden_size),
nn.Tanh(),
nn.Linear(hidden_size, hidden_size),
nn.Tanh(),
)
# Actor (policy) head: outputs mean and log_std for Gaussian
self.actor_mean = nn.Linear(hidden_size, action_dim)
self.actor_log_std = nn.Parameter(torch.zeros(action_dim)) # Learnable std
# Critic (value) head: outputs state value V(s)
self.critic = nn.Linear(hidden_size, 1)
def forward(self, obs):
features = self.shared(obs)
# Actor: Gaussian policy
action_mean = self.actor_mean(features)
action_std = torch.exp(self.actor_log_std)
# Critic: State value
value = self.critic(features)
return action_mean, action_std, value
    def get_action(self, obs, deterministic=False):
        """Sample an action from the Gaussian policy (or return its mean)."""
        action_mean, action_std, value = self.forward(obs)
        dist = Normal(action_mean, action_std)
        # Use the mean action for evaluation, a sample for exploration
        action = action_mean if deterministic else dist.sample()
        log_prob = dist.log_prob(action).sum(dim=-1)  # Sum over action dimensions
        return action, log_prob, value
def train_ppo(envs, agent, total_steps=500_000, rollout_steps=2048,
              batch_size=64, clip_epsilon=0.2, ppo_epochs=10):
    """
    PPO training loop with parallel environments.
    envs: vectorized environments (e.g., 8 parallel envs)
    """
    optimizer = torch.optim.Adam(agent.parameters(), lr=3e-4)
    num_updates = total_steps // rollout_steps
    for update in range(num_updates):
        # Collect a rollout of experience (e.g., 2048 steps)
        rollout = collect_rollout(envs, agent, rollout_steps=rollout_steps)
# Compute advantages using GAE
advantages = compute_gae(
rollout['rewards'],
rollout['values'],
rollout['dones'],
gamma=0.99,
gae_lambda=0.95
)
# Normalize advantages (important for stability)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        # PPO update epochs (typically 10)
        for epoch in range(ppo_epochs):
# Shuffle data
indices = torch.randperm(len(rollout['obs']))
# Mini-batch updates
for batch_start in range(0, len(indices), batch_size):
batch_indices = indices[batch_start:batch_start + batch_size]
# Get batch data
obs_batch = rollout['obs'][batch_indices]
actions_batch = rollout['actions'][batch_indices]
old_log_probs_batch = rollout['log_probs'][batch_indices]
advantages_batch = advantages[batch_indices]
returns_batch = rollout['returns'][batch_indices]
# Evaluate current policy
action_mean, action_std, values = agent.forward(obs_batch)
dist = Normal(action_mean, action_std)
log_probs = dist.log_prob(actions_batch).sum(dim=-1)
entropy = dist.entropy().sum(dim=-1).mean()
# PPO clipped surrogate loss
ratio = torch.exp(log_probs - old_log_probs_batch)
clipped_ratio = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon)
policy_loss = -torch.min(
ratio * advantages_batch,
clipped_ratio * advantages_batch
).mean()
# Value loss (MSE)
value_loss = F.mse_loss(values.squeeze(), returns_batch)
# Total loss
loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
# Backprop
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(agent.parameters(), max_norm=0.5)
optimizer.step()
def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
"""
Compute GAE advantages
GAE formula:
A_t = δ_t + (γλ)δ_{t+1} + (γλ)^2 δ_{t+2} + ...
where δ_t = r_t + γV(s_{t+1}) - V(s_t)
"""
advantages = []
gae = 0
# Iterate backwards through episode
    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            # Assumes the rollout ends at an episode boundary; otherwise
            # bootstrap here with V(s_T) instead of 0
            next_value = 0
        else:
            next_value = values[t + 1]
# TD error
delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
# GAE accumulation
gae = delta + gamma * gae_lambda * (1 - dones[t]) * gae
advantages.insert(0, gae)
return torch.tensor(advantages)
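The training loop above calls collect_rollout, which is not shown. Below is a minimal sketch of what it might look like, assuming a Gymnasium vector environment and the ActorCritic agent above; the buffer layout and the rollout['returns'] entry (typically advantages + values, computed after GAE) are left for you to finish:

```python
import numpy as np
import torch

def collect_rollout(envs, agent, rollout_steps=2048):
    """Roll the current policy forward and store transitions.

    Buffers have shape [rollout_steps, num_envs, ...]; flatten the first two
    dimensions before mini-batching in the PPO update.
    """
    obs_buf, act_buf, logp_buf, rew_buf, val_buf, done_buf = [], [], [], [], [], []
    obs, _ = envs.reset()
    for _ in range(rollout_steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            action, log_prob, value = agent.get_action(obs_t)
        next_obs, reward, terminated, truncated, _ = envs.step(action.numpy())
        done = np.logical_or(terminated, truncated)

        obs_buf.append(obs_t)
        act_buf.append(action)
        logp_buf.append(log_prob)
        rew_buf.append(torch.as_tensor(reward, dtype=torch.float32))
        val_buf.append(value.squeeze(-1))
        done_buf.append(torch.as_tensor(done, dtype=torch.float32))
        obs = next_obs

    return {
        'obs': torch.stack(obs_buf),
        'actions': torch.stack(act_buf),
        'log_probs': torch.stack(logp_buf),
        'rewards': torch.stack(rew_buf),
        'values': torch.stack(val_buf),
        'dones': torch.stack(done_buf),
    }
```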
| Hyperparameter | Recommended Value | Notes |
|---|---|---|
| Learning rate | 3e-4 | Adam optimizer |
| Clip epsilon | 0.2 | PPO clipping range |
| GAE lambda | 0.95 | Balance bias-variance |
| Discount (γ) | 0.99 | Standard for robotics |
| Rollout length | 2048 | Steps per update |
| Batch size | 64 | Mini-batch size |
| PPO epochs | 10 | Updates per rollout |
| Entropy coefficient | 0.01 | Encourage exploration |
| Value loss coefficient | 0.5 | Balance policy and value |
| Max gradient norm | 0.5 | Gradient clipping |
| Parallel envs | 8 | Vectorized environments |
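For the "Parallel envs: 8" row, one way to build a vectorized environment is Gymnasium's SyncVectorEnv. This is a sketch; Pendulum-v1 is a placeholder task, and AsyncVectorEnv is a drop-in alternative for true process-level parallelism:

```python
import gymnasium as gym

NUM_ENVS = 8
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("Pendulum-v1") for _ in range(NUM_ENVS)]
)
obs, info = envs.reset(seed=0)
print(obs.shape)  # (8, obs_dim): one row per parallel environment
```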
Easy (Start here):
- HalfCheetahBulletEnv-v0 - 2D locomotion
- AntBulletEnv-v0 - 4-legged robot
- HumanoidBulletEnv-v0 - Full humanoid

Medium:
- MinitaurBulletEnv-v0 - Quadruped robot with balance challenge
- RacecarBulletEnv-v0 - Car navigation

Hard:
- HalfCheetah-v4 (MuJoCo)
- Ant-v4 (MuJoCo)
- Humanoid-v4 (MuJoCo)

Note: Recent MuJoCo releases (the `mujoco` package) are free and open source; no separate license is required.
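Creating one of the Bullet tasks depends on your installed packages. As a sketch, the classic pybullet_envs package registers the *BulletEnv-v0 IDs with gym on import (the pybullet_envs_gymnasium fork does the same for Gymnasium):

```python
import gym
import pybullet_envs  # noqa: F401  side effect: registers the *BulletEnv-v0 IDs

env = gym.make("HalfCheetahBulletEnv-v0")
print(env.observation_space.shape, env.action_space.shape)
```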
Your agent will be evaluated on:
Success Rate (25%):
Sample Efficiency (25%):
Robustness (20%):
Benchmark Comparison (15%):
Code and Documentation (15%):
Reference implementations:
- benelot/pybullet-gym
- DLR-RM/stable-baselines3
- vwxyzjn/cleanrl
- openai/spinningup

Code Repository (GitHub)
Trained Models
Benchmark Report (3-4 pages)
Demo Video (3-5 minutes)
Submission Link: [Google Form/LMS Upload Link]
Deadline: 2 weeks from project start
❌ Forgetting observation normalization: Causes unstable training
❌ Too large clipping epsilon: Reduces to vanilla policy gradient
❌ Too small rollout buffer: Not enough data per update
❌ Not using GAE: Reduces sample efficiency significantly
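To avoid the observation-normalization pitfall above, recent Gymnasium versions ship running-statistics wrappers. A sketch (check the wrapper names available in your installed version):

```python
import gymnasium as gym
from gymnasium.wrappers import NormalizeObservation, NormalizeReward

env = gym.make("Pendulum-v1")
env = NormalizeObservation(env)  # running mean/std normalization of observations
env = NormalizeReward(env)       # scales rewards using a running return estimate
```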
Demo Website/GitHub:
LinkedIn/Resume:
"Developed autonomous navigation system using Proximal Policy Optimization (PPO), achieving 92% success rate across varied terrains. Benchmarked against baseline algorithms, demonstrating superior sample efficiency and robustness."
After completing this project:
Good luck! PPO is one of the most widely used algorithms in real-world robotics and RL today. This project will give you the skills to deploy RL in demanding, safety-critical applications.
Related Projects: