Apply your knowledge to build something amazing!
Duration: 2 weeks
Points: 100
Prerequisites: Complete Lessons 4-6 (Policy Gradients, Actor-Critic, PPO)
Difficulty: Advanced
In this project, you'll train a PPO (Proximal Policy Optimization) agent to navigate a simulated robot through complex 3D environments. Unlike the discrete Atari games of Project 1, this project tackles continuous control, where actions are real-valued vectors (velocities, forces) rather than discrete button presses. You'll implement PPO with a continuous action space, train on physics-based simulations, and deploy a robust navigation policy.
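To see concretely what a continuous action space looks like, here is a minimal check using a standard Gymnasium task (Pendulum-v1 is just an illustrative stand-in for the Bullet environments used later):

```python
import gymnasium as gym

env = gym.make("Pendulum-v1")
print(env.observation_space)  # Box(...): real-valued observation vector
print(env.action_space)       # Box(-2.0, 2.0, (1,), float32): continuous torque

obs, info = env.reset(seed=0)
action = env.action_space.sample()  # a float vector, not a button index
obs, reward, terminated, truncated, info = env.step(action)
```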
Why This Matters: PPO is a workhorse of modern deep RL and robotics and has been used by labs such as OpenAI and DeepMind in real-world applications. This project gives you hands-on experience with an algorithm used in warehouse robotics, legged and humanoid locomotion, and other autonomous systems.
What You'll Build:
By completing this project, you will:
Your PPO agent must:
Your implementation must include:
project-02-autonomous-navigation/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── ppo_agent.py # PPO agent implementation
├── actor_critic.py # Neural network architectures
├── train.py # Training script with parallel envs
├── evaluate.py # Evaluation and benchmark script
├── models/ # Saved model checkpoints
│ ├── ppo_best.pth
│ ├── reinforce_baseline.pth
│ └── a2c_baseline.pth
├── logs/ # TensorBoard logs
└── videos/ # Recorded navigation videos
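As a rough sketch of how the checkpoints under models/ could be written and restored (the import path and the obs_dim/action_dim values below are illustrative assumptions, not part of the starter code):

```python
import torch
from actor_critic import ActorCritic  # assumed to match actor_critic.py above

# Example sizes roughly matching AntBulletEnv-v0; adjust to your environment
agent = ActorCritic(obs_dim=28, action_dim=8)
torch.save(agent.state_dict(), "models/ppo_best.pth")   # write a checkpoint

restored = ActorCritic(obs_dim=28, action_dim=8)
restored.load_state_dict(torch.load("models/ppo_best.pth", map_location="cpu"))
restored.eval()  # evaluation mode for deterministic rollouts
```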
| Criterion | Points | Description |
|---|---|---|
| PPO Implementation | 30 | Correct PPO with clipped objective and GAE |
| Performance | 25 | Agent achieves ≥85% success rate and the target episodic reward |
| Continuous Actions | 15 | Proper Gaussian policy implementation |
| Benchmarking | 10 | PPO compared to REINFORCE and A2C baselines |
| Code Quality | 10 | Clean, modular, well-documented code |
| Documentation | 10 | Comprehensive README and results analysis |
| Total | 100 | |
Bonus Points (+10 each):
Day 1-2: Environment Setup
Day 3-5: PPO Components
Day 6-7: Training Pipeline
Deliverable: Working PPO agent training on single environment
Day 8-9: Hyperparameter Tuning
Day 10-11: Baseline Comparison
Day 12-13: Robustness Testing
Day 14: Documentation and Presentation
Deliverable: Trained PPO agent, baselines, benchmark results, documentation
import torch
import torch.nn as nn
import torch.nn.functional as F  # used for the value loss in train_ppo below
from torch.distributions import Normal
class ActorCritic(nn.Module):
def __init__(self, obs_dim, action_dim, hidden_size=256):
super().__init__()
# Shared feature extractor (optional)
self.shared = nn.Sequential(
nn.Linear(obs_dim, hidden_size),
nn.Tanh(),
nn.Linear(hidden_size, hidden_size),
nn.Tanh(),
)
# Actor (policy) head: outputs mean and log_std for Gaussian
self.actor_mean = nn.Linear(hidden_size, action_dim)
self.actor_log_std = nn.Parameter(torch.zeros(action_dim)) # Learnable std
# Critic (value) head: outputs state value V(s)
self.critic = nn.Linear(hidden_size, 1)
def forward(self, obs):
features = self.shared(obs)
# Actor: Gaussian policy
action_mean = self.actor_mean(features)
action_std = torch.exp(self.actor_log_std)
# Critic: State value
value = self.critic(features)
return action_mean, action_std, value
    def get_action(self, obs, deterministic=False):
        """Sample an action from the Gaussian policy (or return its mean)."""
        action_mean, action_std, value = self.forward(obs)
        dist = Normal(action_mean, action_std)
        # Use the mean action for evaluation, a sample for exploration
        action = action_mean if deterministic else dist.sample()
        log_prob = dist.log_prob(action).sum(dim=-1)  # Sum over action dimensions
        return action, log_prob, value
def train_ppo(envs, agent, total_steps=500_000, rollout_steps=2048,
              batch_size=64, clip_epsilon=0.2, ppo_epochs=10):
    """
    PPO training loop with parallel environments.
    envs: vectorized environments (e.g., 8 parallel envs)
    """
    optimizer = torch.optim.Adam(agent.parameters(), lr=3e-4)
    num_updates = total_steps // rollout_steps
    for update in range(num_updates):
        # Collect a rollout of experience (e.g., 2048 steps)
        rollout = collect_rollout(envs, agent, rollout_steps=rollout_steps)
# Compute advantages using GAE
advantages = compute_gae(
rollout['rewards'],
rollout['values'],
rollout['dones'],
gamma=0.99,
gae_lambda=0.95
)
# Normalize advantages (important for stability)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        # PPO update epochs (typically 10)
        for epoch in range(ppo_epochs):
# Shuffle data
indices = torch.randperm(len(rollout['obs']))
# Mini-batch updates
for batch_start in range(0, len(indices), batch_size):
batch_indices = indices[batch_start:batch_start + batch_size]
# Get batch data
obs_batch = rollout['obs'][batch_indices]
actions_batch = rollout['actions'][batch_indices]
old_log_probs_batch = rollout['log_probs'][batch_indices]
advantages_batch = advantages[batch_indices]
returns_batch = rollout['returns'][batch_indices]
# Evaluate current policy
action_mean, action_std, values = agent.forward(obs_batch)
dist = Normal(action_mean, action_std)
log_probs = dist.log_prob(actions_batch).sum(dim=-1)
entropy = dist.entropy().sum(dim=-1).mean()
# PPO clipped surrogate loss
ratio = torch.exp(log_probs - old_log_probs_batch)
clipped_ratio = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon)
policy_loss = -torch.min(
ratio * advantages_batch,
clipped_ratio * advantages_batch
).mean()
# Value loss (MSE)
value_loss = F.mse_loss(values.squeeze(), returns_batch)
# Total loss
loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
# Backprop
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(agent.parameters(), max_norm=0.5)
optimizer.step()
def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
"""
Compute GAE advantages
GAE formula:
A_t = δ_t + (γλ)δ_{t+1} + (γλ)^2 δ_{t+2} + ...
where δ_t = r_t + γV(s_{t+1}) - V(s_t)
"""
advantages = []
gae = 0
# Iterate backwards through episode
    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            # Assumes the rollout ends at an episode boundary; otherwise
            # bootstrap here with V(s_T) instead of 0
            next_value = 0
        else:
            next_value = values[t + 1]
# TD error
delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
# GAE accumulation
gae = delta + gamma * gae_lambda * (1 - dones[t]) * gae
advantages.insert(0, gae)
return torch.tensor(advantages)
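The training loop above calls collect_rollout, which is not shown. Below is a minimal sketch of what it might look like, assuming a Gymnasium vector environment and the ActorCritic agent above; the buffer layout and the rollout['returns'] entry (typically advantages + values, computed after GAE) are left for you to finish:

```python
import numpy as np
import torch

def collect_rollout(envs, agent, rollout_steps=2048):
    """Roll the current policy forward and store transitions.

    Buffers have shape [rollout_steps, num_envs, ...]; flatten the first two
    dimensions before mini-batching in the PPO update.
    """
    obs_buf, act_buf, logp_buf, rew_buf, val_buf, done_buf = [], [], [], [], [], []
    obs, _ = envs.reset()
    for _ in range(rollout_steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            action, log_prob, value = agent.get_action(obs_t)
        next_obs, reward, terminated, truncated, _ = envs.step(action.numpy())
        done = np.logical_or(terminated, truncated)

        obs_buf.append(obs_t)
        act_buf.append(action)
        logp_buf.append(log_prob)
        rew_buf.append(torch.as_tensor(reward, dtype=torch.float32))
        val_buf.append(value.squeeze(-1))
        done_buf.append(torch.as_tensor(done, dtype=torch.float32))
        obs = next_obs

    return {
        'obs': torch.stack(obs_buf),
        'actions': torch.stack(act_buf),
        'log_probs': torch.stack(logp_buf),
        'rewards': torch.stack(rew_buf),
        'values': torch.stack(val_buf),
        'dones': torch.stack(done_buf),
    }
```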
| Hyperparameter | Recommended Value | Notes |
|---|---|---|
| Learning rate | 3e-4 | Adam optimizer |
| Clip epsilon | 0.2 | PPO clipping range |
| GAE lambda | 0.95 | Balance bias-variance |
| Discount (γ) | 0.99 | Standard for robotics |
| Rollout length | 2048 | Steps per update |
| Batch size | 64 | Mini-batch size |
| PPO epochs | 10 | Updates per rollout |
| Entropy coefficient | 0.01 | Encourage exploration |
| Value loss coefficient | 0.5 | Balance policy and value |
| Max gradient norm | 0.5 | Gradient clipping |
| Parallel envs | 8 | Vectorized environments |
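For the "Parallel envs: 8" row, one way to build a vectorized environment is Gymnasium's SyncVectorEnv. This is a sketch; Pendulum-v1 is a placeholder task, and AsyncVectorEnv is a drop-in alternative for true process-level parallelism:

```python
import gymnasium as gym

NUM_ENVS = 8
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("Pendulum-v1") for _ in range(NUM_ENVS)]
)
obs, info = envs.reset(seed=0)
print(obs.shape)  # (8, obs_dim): one row per parallel environment
```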
Easy (Start here):
- HalfCheetahBulletEnv-v0 - 2D locomotion
- AntBulletEnv-v0 - 4-legged robot
- HumanoidBulletEnv-v0 - Full humanoid

Medium:
- MinitaurBulletEnv-v0 - Quadruped robot with balance challenge
- RacecarBulletEnv-v0 - Car navigation

Hard:
- HalfCheetah-v4 (MuJoCo)
- Ant-v4 (MuJoCo)
- Humanoid-v4 (MuJoCo)

Note: Recent MuJoCo releases (the `mujoco` package) are free and open source; no separate license is required.
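Creating one of the Bullet tasks depends on your installed packages. As a sketch, the classic pybullet_envs package registers the *BulletEnv-v0 IDs with gym on import (the pybullet_envs_gymnasium fork does the same for Gymnasium):

```python
import gym
import pybullet_envs  # noqa: F401  side effect: registers the *BulletEnv-v0 IDs

env = gym.make("HalfCheetahBulletEnv-v0")
print(env.observation_space.shape, env.action_space.shape)
```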
Your agent will be evaluated on:
Success Rate (25%):
Sample Efficiency (25%):
Robustness (20%):
Benchmark Comparison (15%):
Code and Documentation (15%):
Reference implementations:
- benelot/pybullet-gym
- DLR-RM/stable-baselines3
- vwxyzjn/cleanrl
- openai/spinningup

Code Repository (GitHub)
Trained Models
Benchmark Report (3-4 pages)
Demo Video (3-5 minutes)
Submission Link: [Google Form/LMS Upload Link]
Deadline: 2 weeks from project start
❌ Forgetting observation normalization: Causes unstable training
❌ Too large clipping epsilon: Reduces to vanilla policy gradient
❌ Too small rollout buffer: Not enough data per update
❌ Not using GAE: Reduces sample efficiency significantly
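To avoid the observation-normalization pitfall above, recent Gymnasium versions ship running-statistics wrappers. A sketch (check the wrapper names available in your installed version):

```python
import gymnasium as gym
from gymnasium.wrappers import NormalizeObservation, NormalizeReward

env = gym.make("Pendulum-v1")
env = NormalizeObservation(env)  # running mean/std normalization of observations
env = NormalizeReward(env)       # scales rewards using a running return estimate
```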
Demo Website/GitHub:
LinkedIn/Resume:
"Developed autonomous navigation system using Proximal Policy Optimization (PPO), achieving 92% success rate across varied terrains. Benchmarked against baseline algorithms, demonstrating superior sample efficiency and robustness."
After completing this project:
Good luck! PPO is one of the most widely used algorithms in real-world robotics and RL today. This project will give you the skills to deploy RL in demanding, safety-critical applications.
Related Projects: