Practice and reinforce the concepts from Lesson 6
In this activity, you'll use Stable-Baselines3 to train a PPO agent on a challenging continuous-control robotics task. Along the way you'll practice production-style RL workflows, tune hyperparameters, and learn how to push PPO to strong, stable performance.
By completing this activity, you will:
Download the activity template from the Templates folder:
AI25-Template-activity-06-proximal-policy-optimization.zip (located at Templates/AI25-Template-activity-06-proximal-policy-optimization.zip)
Unzip it and upload activity-06-proximal-policy-optimization.ipynb to Google Colab
Execute the first few cells to:
BipedalWalker-v3:
Challenge: High-dimensional continuous control, complex physics, requires stable learning!
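Before writing any training code, it can help to sanity-check the observation and action spaces you are dealing with. A minimal sketch (assuming gymnasium is installed, as Stable-Baselines3 requires):

import gymnasium as gym

check_env = gym.make("BipedalWalker-v3")
print(check_env.observation_space)  # 24-dimensional continuous state (hull, joints, lidar)
print(check_env.action_space)       # 4 continuous joint torques, each in [-1, 1]
check_env.close()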
TODO 1: Create vectorized environment
Use make_vec_env to create parallel environments:
from stable_baselines3.common.env_util import make_vec_env
# TODO 1: Create vectorized environment
env = make_vec_env(
# Your code here: env_id, n_envs
)
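One possible completion of TODO 1 (the choice of 16 parallel environments is illustrative, not required):

from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("BipedalWalker-v3", n_envs=16)  # 16 parallel copies for faster rollouts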
TODO 2: Configure PPO model
from stable_baselines3 import PPO
# TODO 2: Create PPO model
model = PPO(
policy="MlpPolicy",
env=env,
learning_rate=# TODO: Set learning rate,
n_steps=# TODO: Steps per environment before update,
batch_size=# TODO: Minibatch size,
n_epochs=# TODO: PPO epochs per update,
gamma=# TODO: Discount factor,
gae_lambda=# TODO: GAE lambda,
clip_range=# TODO: Clipping epsilon,
verbose=1,
tensorboard_log="./ppo_tensorboard/"
)
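One hedged completion of TODO 2, reusing the "Default (Good Starting Point)" hyperparameters listed later in this activity:

from stable_baselines3 import PPO

model = PPO(
    policy="MlpPolicy",
    env=env,                      # vectorized env from TODO 1
    learning_rate=3e-4,
    n_steps=2048,                 # rollout steps per environment before each update
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    verbose=1,
    tensorboard_log="./ppo_tensorboard/"
)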
TODO 3: Implement checkpoint callback
from stable_baselines3.common.callbacks import CheckpointCallback
# TODO 3: Create checkpoint callback
checkpoint_callback = CheckpointCallback(
save_freq=# TODO: Save frequency,
save_path=# TODO: Save directory,
name_prefix=# TODO: Model name prefix
)
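A possible completion of TODO 3 (the frequency and paths are assumptions). Note that save_freq counts calls to the vectorized environment's step(), so with n parallel environments a checkpoint is written every save_freq * n_envs timesteps:

from stable_baselines3.common.callbacks import CheckpointCallback

checkpoint_callback = CheckpointCallback(
    save_freq=50_000,                  # illustrative value
    save_path="./checkpoints/",
    name_prefix="ppo_bipedalwalker"
)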
TODO 4: Implement evaluation callback
from stable_baselines3.common.callbacks import EvalCallback
# TODO 4: Create evaluation callback
eval_callback = EvalCallback(
eval_env, # Separate evaluation environment
best_model_save_path=# TODO: Path to save best model,
log_path=# TODO: Log directory,
eval_freq=# TODO: Evaluation frequency,
n_eval_episodes=# TODO: Episodes per evaluation,
deterministic=True
)
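A possible completion of TODO 4, including creation of the separate evaluation environment and the learn() call that wires both callbacks together (paths, frequencies, and the 1M-step budget are assumptions):

import gymnasium as gym
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.monitor import Monitor

# Separate, Monitor-wrapped environment so episode statistics are recorded
eval_env = Monitor(gym.make("BipedalWalker-v3"))

eval_callback = EvalCallback(
    eval_env,
    best_model_save_path="./best_model/",
    log_path="./eval_logs/",
    eval_freq=10_000,
    n_eval_episodes=10,
    deterministic=True
)

# Pass both callbacks to training
model.learn(total_timesteps=1_000_000, callback=[checkpoint_callback, eval_callback])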
TODO 5: Experiment with different hyperparameter sets
# TODO 5: Define hyperparameter configurations to test
configs = [
{
"name": "baseline",
"learning_rate": 3e-4,
"n_steps": 2048,
"batch_size": 64,
"n_epochs": 10,
"clip_range": 0.2,
},
{
"name": "aggressive",
# TODO: Your config
},
{
"name": "conservative",
# TODO: Your config
}
]
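Once the remaining configs are filled in, one way to run the sweep is to train a fresh model per configuration and give each run its own TensorBoard name (the 300k-step budget and n_envs=16 are assumptions chosen to keep the sweep short):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

results = {}
for cfg in configs:
    name = cfg["name"]
    params = {k: v for k, v in cfg.items() if k != "name"}  # everything except the label
    model = PPO(
        "MlpPolicy",
        make_vec_env("BipedalWalker-v3", n_envs=16),
        verbose=0,
        tensorboard_log="./ppo_tensorboard/",
        **params
    )
    model.learn(total_timesteps=300_000, tb_log_name=name)
    results[name] = model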
TODO 6 (Optional): Define custom actor-critic architecture
policy_kwargs = dict(
net_arch=[
# TODO: Define network architecture
# Example: dict(pi=[256, 256], vf=[256, 256])
]
)
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, ...)
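A hedged completion of TODO 6 with separate 256-256 networks for the policy (pi) and value function (vf); the sizes are an illustrative choice. In recent Stable-Baselines3 versions net_arch takes the dict directly, while older versions expected it wrapped in a list as in the template above:

from stable_baselines3 import PPO

policy_kwargs = dict(net_arch=dict(pi=[256, 256], vf=[256, 256]))

model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=policy_kwargs,
    verbose=1,
    tensorboard_log="./ppo_tensorboard/"
)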
Typical training progression:
Steps 0-100K: Random walking, frequent falls, reward ~ -100 to 0
Steps 100K-500K: Learning basic locomotion, reward reaches 50-150
Steps 500K-1M: Stable walking gait, reward reaches 200-280
Steps 1M+: Optimal walking, reward > 300 (SOLVED!)
BipedalWalker-v3 is considered "solved" when:
Average reward ≥ 300 over 100 consecutive episodes
PPO typically solves this in 1-2M steps (with good hyperparameters).
Your implementation is complete when:
Default (Good Starting Point):
learning_rate = 3e-4
n_steps = 2048
batch_size = 64
n_epochs = 10
gamma = 0.99
gae_lambda = 0.95
clip_range = 0.2
For Faster Learning (More Aggressive):
learning_rate = 5e-4 # Higher learning rate
n_steps = 4096 # More data per update
batch_size = 128 # Larger batches
n_epochs = 15 # More training epochs
clip_range = 0.3 # Allow larger policy changes
For More Stable Learning (Conservative):
learning_rate = 1e-4 # Lower learning rate
n_steps = 1024 # Less data per update
batch_size = 32 # Smaller batches
n_epochs = 5 # Fewer epochs
clip_range = 0.1 # Smaller policy changes
Key metrics to watch in TensorBoard:
# Performance
rollout/ep_rew_mean # Average episode reward (should increase)
rollout/ep_len_mean # Average episode length
# Policy metrics
train/policy_loss # Policy loss (should decrease then stabilize)
train/approx_kl # KL divergence (<0.05 is good)
train/clip_fraction # Fraction clipped (0.1-0.3 is good)
train/entropy_loss # Negative policy entropy (the policy's entropy should decay slowly)
# Value function
train/value_loss # Critic loss (should decrease)
train/explained_variance # How well critic predicts returns (>0.5 good)
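To view these metrics from Colab, TensorBoard can be started inline with the notebook magics (run them in their own cell, pointing at the tensorboard_log directory used above):

%load_ext tensorboard
%tensorboard --logdir ./ppo_tensorboard/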
Problem: Agent doesn't learn (reward stays negative)
Problem: Training unstable (reward oscillates)
Problem: Agent learns then forgets
Problem: Slow training
Add intermediate rewards to guide learning:
import numpy as np
from gymnasium import Wrapper

class RewardShapingWrapper(Wrapper):
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Add a bonus for forward velocity ('forward_velocity' is an illustrative
        # info key; .get() falls back to 0 if the environment doesn't provide it)
        forward_velocity = info.get('forward_velocity', 0)
        reward += 0.1 * forward_velocity
        # Penalize excessive joint torque (energy efficiency)
        torque_penalty = np.sum(np.abs(action))
        reward -= 0.01 * torque_penalty
        return obs, reward, terminated, truncated, info
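Usage sketch: the wrapper can be applied to each parallel environment through make_vec_env's wrapper_class argument (or by wrapping a single gym.make call directly):

from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("BipedalWalker-v3", n_envs=16, wrapper_class=RewardShapingWrapper)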
Gradually increase task difficulty:
import gymnasium as gym

# Start with the easier variant (no hardcore obstacles)
env = gym.make('BipedalWalker-v3', hardcore=False)

# After the agent learns, switch to hardcore
if mean_reward > 250:
    env = gym.make('BipedalWalker-v3', hardcore=True)
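A sketch of the switch in practice, assuming model, make_vec_env, and Monitor are already set up as in the earlier steps; the 250-reward threshold and the extra 1M-step budget are assumptions:

from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

mean_reward, _ = evaluate_policy(model, Monitor(gym.make('BipedalWalker-v3')), n_eval_episodes=10)
if mean_reward > 250:
    # Swap in the hardcore variant and continue training without resetting the step counter
    model.set_env(make_vec_env('BipedalWalker-v3', n_envs=16, env_kwargs={'hardcore': True}))
    model.learn(total_timesteps=1_000_000, reset_num_timesteps=False)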
Use Optuna for automated hyperparameter tuning:
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial):
    # Sample hyperparameters
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical('n_steps', [1024, 2048, 4096])
    # ... more hyperparameters

    # Train model
    model = PPO("MlpPolicy", env, learning_rate=learning_rate, n_steps=n_steps)
    model.learn(total_timesteps=100000)

    # Evaluate (evaluate_policy returns a (mean, std) tuple)
    mean_reward, _ = evaluate_policy(model, eval_env, n_eval_episodes=10)
    return mean_reward
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
Train on multiple environments simultaneously:
env_ids = ['BipedalWalker-v3', 'LunarLanderContinuous-v2', 'Pendulum-v1']
# Create and train on each
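A minimal sketch of that loop (one model per task; the 500k-step budget per environment and n_envs=8 are assumptions):

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

models = {}
for env_id in env_ids:
    vec_env = make_vec_env(env_id, n_envs=8)
    model = PPO("MlpPolicy", vec_env, verbose=0, tensorboard_log="./ppo_tensorboard/")
    model.learn(total_timesteps=500_000, tb_log_name=env_id)
    model.save(f"ppo_{env_id}")
    models[env_id] = model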
Completed Notebook: activity-06-proximal-policy-optimization.ipynb
Performance Report: Brief summary including:
TensorBoard Logs: Export training curves
Trained Model: Save best checkpoint
best_model.zip from the evaluation callback

After completing this activity:
You've now mastered the RL module! Next, we'll explore a different paradigm: multi-armed bandits.
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy training state-of-the-art PPO agents! 🤖🚶