Practice and reinforce the concepts from Lesson 4
In this activity, you'll implement the REINFORCE algorithm (Monte Carlo Policy Gradient) from scratch and train an agent to land a spacecraft in LunarLander-v2. You'll learn to directly optimize policies and handle high-variance gradients with baselines!
By completing this activity, you will:
Implement a policy network that maps states to action probabilities
Compute discounted Monte Carlo returns
Implement the REINFORCE policy gradient update
Reduce gradient variance with a baseline
Download the activity template from the Templates folder:
AI25-Template-activity-04-policy-gradient-methods.zip (Templates/AI25-Template-activity-04-policy-gradient-methods.zip)
Upload activity-04-policy-gradient-methods.ipynb to Google Colab.
Execute the first few cells to set up the notebook.
LunarLander-v2:
State: 8-dimensional vector (lander position, velocity, angle, angular velocity, and leg-contact flags)
Actions: 4 discrete actions (do nothing, fire left engine, fire main engine, fire right engine)
Challenge: Higher-dimensional state space, sparse rewards, physics simulation
TODO 1: Complete the policy network architecture
class PolicyNetwork(nn.Module):
def __init__(self, state_dim, action_dim):
super().__init__()
self.fc1 = nn.Linear(state_dim, 128)
self.fc2 = nn.Linear(128, 128)
self.fc3 = nn.Linear(128, action_dim)
def forward(self, state):
# TODO 1: Complete forward pass
# Return logits (not probabilities yet!)
pass
def select_action(self, state):
# TODO 1: Sample action from policy
# 1. Get logits from forward()
# 2. Convert to probabilities (softmax)
# 3. Create Categorical distribution
# 4. Sample action
# 5. Return action and log_prob
pass
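For reference, one possible completion of TODO 1 is sketched below. It is not the official solution; it assumes a discrete action space and that `state` arrives as a NumPy array, as returned by the environment.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                        # raw logits, no softmax yet

    def select_action(self, state):
        state = torch.as_tensor(state, dtype=torch.float32)
        logits = self.forward(state)              # shape: (action_dim,)
        dist = Categorical(logits=logits)         # softmax is applied internally
        action = dist.sample()                    # scalar tensor
        return action.item(), dist.log_prob(action)

Passing logits directly to Categorical is equivalent to applying softmax and passing probabilities, but slightly more numerically stable.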
TODO 2: Implement discounted return calculation
def compute_returns(rewards, gamma=0.99):
# TODO 2: Compute discounted returns
# Hint: Work backwards, G_t = r_t + gamma * G_{t+1}
returns = []
G = 0
for r in reversed(rewards):
# Your code here
pass
return returns
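For reference, one way the loop body might look (a sketch: prepending keeps the returns aligned with time steps), together with a tiny sanity check:

def compute_returns(rewards, gamma=0.99):
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G            # G_t = r_t + gamma * G_{t+1}
        returns.insert(0, G)         # prepend so returns[0] corresponds to t = 0
    return returns

# Sanity check: with gamma = 0.5, rewards [1, 2, 3] give returns [2.75, 3.5, 3.0]
print(compute_returns([1, 2, 3], gamma=0.5))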
TODO 3: Implement the policy gradient loss
def reinforce_update(policy, optimizer, log_probs, returns):
# TODO 3: Compute policy gradient loss
# 1. Normalize returns
# 2. Compute loss = -Σ log_prob * return
# 3. Backpropagate and update
pass
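A sketch of one possible completion, assuming `log_probs` is a list of scalar tensors returned by `select_action` and `returns` is the list produced in TODO 2:

import torch

def reinforce_update(policy, optimizer, log_probs, returns):
    returns = torch.as_tensor(returns, dtype=torch.float32)
    # 1. Normalize returns (the small epsilon guards against division by zero)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # 2. Policy gradient loss: negative of log_prob * return
    log_probs = torch.stack(log_probs)
    loss = -(log_probs * returns).mean()
    # 3. Backpropagate and update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()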
TODO 4: Add baseline for variance reduction
class Baseline:
def __init__(self):
self.returns = []
def get_baseline(self):
# TODO 4: Compute baseline (mean of recent returns)
pass
def update(self, return_value):
# TODO 4: Add new return to history
pass
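One possible completion, assuming "recent" means a fixed-length window (the max_history size below is an assumption, not part of the template):

class Baseline:
    def __init__(self, max_history=100):
        self.returns = []
        self.max_history = max_history           # assumed window size

    def get_baseline(self):
        # Mean of recent episode returns; 0 before any episodes have finished
        return sum(self.returns) / len(self.returns) if self.returns else 0.0

    def update(self, return_value):
        self.returns.append(return_value)
        if len(self.returns) > self.max_history:
            self.returns.pop(0)                  # drop the oldest return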
The main training loop is provided; you'll complete the four TODOs above inside it.
A typical training run progresses roughly as follows:
Episodes 0-100: Random exploration, negative rewards (-200 to 0)
Episodes 100-300: Learning lift-off and hovering, rewards improve to ~0-100
Episodes 300-600: Learning to land softly, rewards reach 100-150
Episodes 600-1000: Consistent landings, rewards > 200 (SOLVED!)
LunarLander-v2 is considered "solved" when:
Average reward ≥ 200 over 100 consecutive episodes
REINFORCE typically solves this in 600-1200 episodes.
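One simple way to check this criterion during training (a sketch; `episode_returns` is assumed to be the list of per-episode total rewards you are already logging):

import numpy as np

def is_solved(episode_returns, threshold=200.0, window=100):
    # True once the average return over the last `window` episodes reaches the threshold
    if len(episode_returns) < window:
        return False
    return np.mean(episode_returns[-window:]) >= threshold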
Your implementation is complete when all four TODOs are filled in and your agent reaches the "solved" criterion above.
Key Insight:
# We want to increase the probability of actions that led to high returns
loss = -(log_probs * returns).mean()
# Gradient descent on this loss = gradient ascent on expected return
Why log probabilities? The policy gradient theorem relies on the log-derivative trick: the gradient of the expected return can be written as an expectation of the return times the gradient of log π(a|s), so log-probabilities of sampled actions give an unbiased gradient estimate you can compute from rollouts. Working in log space is also more numerically stable than multiplying raw probabilities.
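As a quick sanity check, the log-probability returned by Categorical.log_prob is exactly the log of the softmax probability of the chosen action (the logits below are made up for illustration):

import torch
from torch.distributions import Categorical

logits = torch.tensor([2.0, 0.5, -1.0, 0.0])   # illustrative raw outputs for 4 actions
probs = torch.softmax(logits, dim=-1)

dist = Categorical(logits=logits)
action = torch.tensor(1)

print(dist.log_prob(action))      # log pi(a|s) from the distribution object
print(torch.log(probs[action]))   # identical up to floating-point error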
# Network
learning_rate = 0.001 # Policy gradients need lower LR than DQN
hidden_size = 128
# Training
gamma = 0.99 # Discount factor
batch_size = 1 # REINFORCE uses full episodes
# Variance reduction
use_baseline = True
normalize_returns = True
# Entropy regularization (optional)
entropy_coef = 0.01 # Encourages exploration
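A minimal sketch of how these pieces fit together, assuming the PolicyNetwork class above. Depending on your installed packages you may need `import gym` instead of `gymnasium`, or a different environment ID (newer Gymnasium releases use LunarLander-v3):

import gymnasium as gym
import torch.optim as optim

env = gym.make("LunarLander-v2")
state_dim = env.observation_space.shape[0]    # 8 for LunarLander
action_dim = env.action_space.n               # 4 discrete actions

policy = PolicyNetwork(state_dim, action_dim)
optimizer = optim.Adam(policy.parameters(), lr=learning_rate)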
Problem: High variance, unstable learning
Fix: Turn on the baseline and return normalization (use_baseline, normalize_returns), and try a lower learning rate.
Problem: Agent doesn't explore enough
Fix: Add the entropy bonus (entropy_coef > 0) and make sure you sample from the distribution rather than taking the argmax.
Problem: Agent learns then forgets (catastrophic forgetting)
Fix: Lower the learning rate and consider gradient clipping; a single large policy update can undo earlier progress.
Problem: Rewards not improving
Fix: Check the sign of the loss (it must be the negative of log_prob × return) and verify that returns are computed backwards through the episode with discounting.
Compare training with and without baseline:
# Without baseline
loss = -(log_probs * returns).mean()
# With baseline
advantages = returns - baseline
loss = -(log_probs * advantages).mean()
Expected: Baseline reduces variance by ~30-50%, faster convergence.
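To quantify the comparison, one option (a sketch; `returns_with` and `returns_without` are assumed to be your logged episode returns for the two runs) is to compare rolling statistics:

import numpy as np

def rolling_stats(episode_returns, window=50):
    # Rolling mean and standard deviation of episode returns
    r = np.asarray(episode_returns, dtype=np.float64)
    means = np.array([r[i - window:i].mean() for i in range(window, len(r) + 1)])
    stds = np.array([r[i - window:i].std() for i in range(window, len(r) + 1)])
    return means, stds

# e.g. compare the std curves for the run with a baseline vs. the run without one
# means_b, stds_b = rolling_stats(returns_with)
# means_n, stds_n = rolling_stats(returns_without)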
Add entropy bonus to encourage exploration:
entropy = -(probs * torch.log(probs)).sum()
loss = policy_loss - entropy_coef * entropy
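If you already build the action distribution with torch.distributions.Categorical (as in TODO 1), its built-in entropy() method gives the same quantity and avoids log(0) issues when a probability underflows to zero:

from torch.distributions import Categorical

dist = Categorical(logits=logits)    # logits from the policy network
entropy = dist.entropy().mean()      # equals -(probs * log probs).sum(-1), averaged
loss = policy_loss - entropy_coef * entropy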
Implement GAE(λ) for better bias-variance trade-off:
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # `values` must hold one estimate per state plus a bootstrap value for the final
    # state, i.e. len(values) == len(rewards) + 1 (use 0.0 if the episode terminated)
    advantages = []
    gae = 0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages
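A sketch of how compute_gae might be called. It assumes a separate value network (value_net below is hypothetical and not part of the base activity) that estimates V(s) for each state, with `states` and `rewards` being the lists collected during one episode:

import torch

values = [value_net(torch.as_tensor(s, dtype=torch.float32)).item() for s in states]
values.append(0.0)   # bootstrap value for the terminal state (0 because the episode ended)

advantages = compute_gae(rewards, values, gamma=0.99, lam=0.95)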
Extend to continuous control (LunarLanderContinuous-v2):
class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_size=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        self.mean_layer = nn.Linear(hidden_size, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # learnable parameter

    def forward(self, state):
        features = self.body(state)
        mean = self.mean_layer(features)
        return mean, torch.exp(self.log_std)

    def select_action(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum()   # sum over action dimensions
        return action, log_prob
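When stepping LunarLanderContinuous-v2, keep sampled actions inside the environment's [-1, 1] action bounds; one simple option is to clip them. The snippet below assumes `state` and `env` from your training loop and the Gymnasium five-value step API (adjust if you are on classic gym):

action, log_prob = policy.select_action(torch.as_tensor(state, dtype=torch.float32))
clipped = action.clamp(-1.0, 1.0).numpy()       # stay inside the Box(-1, 1) action space
next_state, reward, terminated, truncated, info = env.step(clipped)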
Collect N episodes before each update (more stable gradients):
for update in range(num_updates):
    batch = []
    # Collect N episodes
    for _ in range(episodes_per_update):
        episode_data = collect_episode(env, policy)
        batch.append(episode_data)
    # Update on the collected batch
    update_policy(policy, batch)
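collect_episode and update_policy are placeholders in the loop above. A minimal sketch of what collect_episode could look like, assuming the Gymnasium reset/step API and the select_action method from TODO 1:

def collect_episode(env, policy):
    # Run one episode and return the log-probs and rewards it produced
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        action, log_prob = policy.select_action(state)
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        log_probs.append(log_prob)
        rewards.append(reward)
    return log_probs, rewards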
Completed Notebook: activity-04-policy-gradient-methods.ipynb
Performance Report: Brief summary including:
Visualizations:
After completing this activity:
In the next lesson, you'll combine policy gradients with value functions to get the best of both worlds!
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy your first policy gradient agent! 🌙🚀