Practice and reinforce the concepts from Lesson 4
In this activity, you'll implement the REINFORCE algorithm (Monte Carlo Policy Gradient) from scratch and train an agent to land a spacecraft in LunarLander-v2. You'll learn to directly optimize policies and handle high-variance gradients with baselines!
By completing this activity, you will:
Implement a policy network that maps states to action probabilities
Compute discounted Monte Carlo returns
Implement the REINFORCE policy gradient update
Reduce gradient variance with a baseline
Download the activity template from the Templates folder:
AI25-Template-activity-04-policy-gradient-methods.zip (Templates/AI25-Template-activity-04-policy-gradient-methods.zip)
Upload activity-04-policy-gradient-methods.ipynb to Google Colab.
Execute the first few cells to set up the notebook.
LunarLander-v2:
State: 8-dimensional vector (lander position, velocity, angle, angular velocity, and leg-contact flags)
Actions: 4 discrete actions (do nothing, fire left engine, fire main engine, fire right engine)
Challenge: Higher-dimensional state space, sparse rewards, physics simulation
TODO 1: Complete the policy network architecture
class PolicyNetwork(nn.Module):
def __init__(self, state_dim, action_dim):
super().__init__()
self.fc1 = nn.Linear(state_dim, 128)
self.fc2 = nn.Linear(128, 128)
self.fc3 = nn.Linear(128, action_dim)
def forward(self, state):
# TODO 1: Complete forward pass
# Return logits (not probabilities yet!)
pass
def select_action(self, state):
# TODO 1: Sample action from policy
# 1. Get logits from forward()
# 2. Convert to probabilities (softmax)
# 3. Create Categorical distribution
# 4. Sample action
# 5. Return action and log_prob
pass
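For reference, one possible completion of TODO 1 is sketched below. It is not the official solution; it assumes a discrete action space and that `state` arrives as a NumPy array, as returned by the environment.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                        # raw logits, no softmax yet

    def select_action(self, state):
        state = torch.as_tensor(state, dtype=torch.float32)
        logits = self.forward(state)              # shape: (action_dim,)
        dist = Categorical(logits=logits)         # softmax is applied internally
        action = dist.sample()                    # scalar tensor
        return action.item(), dist.log_prob(action)

Passing logits directly to Categorical is equivalent to applying softmax and passing probabilities, but slightly more numerically stable.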
TODO 2: Implement discounted return calculation
def compute_returns(rewards, gamma=0.99):
# TODO 2: Compute discounted returns
# Hint: Work backwards, G_t = r_t + gamma * G_{t+1}
returns = []
G = 0
for r in reversed(rewards):
# Your code here
pass
return returns
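For reference, one way the loop body might look (a sketch: prepending keeps the returns aligned with time steps), together with a tiny sanity check:

def compute_returns(rewards, gamma=0.99):
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G            # G_t = r_t + gamma * G_{t+1}
        returns.insert(0, G)         # prepend so returns[0] corresponds to t = 0
    return returns

# Sanity check: with gamma = 0.5, rewards [1, 2, 3] give returns [2.75, 3.5, 3.0]
print(compute_returns([1, 2, 3], gamma=0.5))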
TODO 3: Implement the policy gradient loss
def reinforce_update(policy, optimizer, log_probs, returns):
# TODO 3: Compute policy gradient loss
# 1. Normalize returns
# 2. Compute loss = -Σ log_prob * return
# 3. Backpropagate and update
pass
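A sketch of one possible completion, assuming `log_probs` is a list of scalar tensors returned by `select_action` and `returns` is the list produced in TODO 2:

import torch

def reinforce_update(policy, optimizer, log_probs, returns):
    returns = torch.as_tensor(returns, dtype=torch.float32)
    # 1. Normalize returns (the small epsilon guards against division by zero)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # 2. Policy gradient loss: negative of log_prob * return
    log_probs = torch.stack(log_probs)
    loss = -(log_probs * returns).mean()
    # 3. Backpropagate and update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()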
TODO 4: Add baseline for variance reduction
class Baseline:
def __init__(self):
self.returns = []
def get_baseline(self):
# TODO 4: Compute baseline (mean of recent returns)
pass
def update(self, return_value):
# TODO 4: Add new return to history
pass
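One possible completion, assuming "recent" means a fixed-length window (the max_history size below is an assumption, not part of the template):

class Baseline:
    def __init__(self, max_history=100):
        self.returns = []
        self.max_history = max_history           # assumed window size

    def get_baseline(self):
        # Mean of recent episode returns; 0 before any episodes have finished
        return sum(self.returns) / len(self.returns) if self.returns else 0.0

    def update(self, return_value):
        self.returns.append(return_value)
        if len(self.returns) > self.max_history:
            self.returns.pop(0)                  # drop the oldest return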
The main training loop is provided; you'll complete the four TODOs above inside it.
A typical training run progresses roughly as follows:
Episodes 0-100: Random exploration, negative rewards (-200 to 0)
Episodes 100-300: Learning lift-off and hovering, rewards improve to ~0-100
Episodes 300-600: Learning to land softly, rewards reach 100-150
Episodes 600-1000: Consistent landings, rewards > 200 (SOLVED!)
LunarLander-v2 is considered "solved" when:
Average reward ≥ 200 over 100 consecutive episodes
REINFORCE typically solves this in 600-1200 episodes.
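One simple way to check this criterion during training (a sketch; `episode_returns` is assumed to be the list of per-episode total rewards you are already logging):

import numpy as np

def is_solved(episode_returns, threshold=200.0, window=100):
    # True once the average return over the last `window` episodes reaches the threshold
    if len(episode_returns) < window:
        return False
    return np.mean(episode_returns[-window:]) >= threshold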
Your implementation is complete when all four TODOs are filled in and your agent reaches the "solved" criterion above.
Key Insight:
# We want to increase the probability of actions that led to high returns
loss = -(log_probs * returns).mean()
# Gradient descent on this loss = gradient ascent on expected return
Why log probabilities? The policy gradient theorem relies on the log-derivative trick: the gradient of the expected return can be written as an expectation of the return times the gradient of log π(a|s), so log-probabilities of sampled actions give an unbiased gradient estimate you can compute from rollouts. Working in log space is also more numerically stable than multiplying raw probabilities.
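As a quick sanity check, the log-probability returned by Categorical.log_prob is exactly the log of the softmax probability of the chosen action (the logits below are made up for illustration):

import torch
from torch.distributions import Categorical

logits = torch.tensor([2.0, 0.5, -1.0, 0.0])   # illustrative raw outputs for 4 actions
probs = torch.softmax(logits, dim=-1)

dist = Categorical(logits=logits)
action = torch.tensor(1)

print(dist.log_prob(action))      # log pi(a|s) from the distribution object
print(torch.log(probs[action]))   # identical up to floating-point error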
# Network
learning_rate = 0.001 # Policy gradients need lower LR than DQN
hidden_size = 128
# Training
gamma = 0.99 # Discount factor
batch_size = 1 # REINFORCE uses full episodes
# Variance reduction
use_baseline = True
normalize_returns = True
# Entropy regularization (optional)
entropy_coef = 0.01 # Encourages exploration
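A minimal sketch of how these pieces fit together, assuming the PolicyNetwork class above. Depending on your installed packages you may need `import gym` instead of `gymnasium`, or a different environment ID (newer Gymnasium releases use LunarLander-v3):

import gymnasium as gym
import torch.optim as optim

env = gym.make("LunarLander-v2")
state_dim = env.observation_space.shape[0]    # 8 for LunarLander
action_dim = env.action_space.n               # 4 discrete actions

policy = PolicyNetwork(state_dim, action_dim)
optimizer = optim.Adam(policy.parameters(), lr=learning_rate)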
Problem: High variance, unstable learning
Fix: Turn on the baseline and return normalization (use_baseline, normalize_returns), and try a lower learning rate.
Problem: Agent doesn't explore enough
Fix: Add the entropy bonus (entropy_coef > 0) and make sure you sample from the distribution rather than taking the argmax.
Problem: Agent learns then forgets (catastrophic forgetting)
Fix: Lower the learning rate and consider gradient clipping; a single large policy update can undo earlier progress.
Problem: Rewards not improving
Fix: Check the sign of the loss (it must be the negative of log_prob × return) and verify that returns are computed backwards through the episode with discounting.
Compare training with and without baseline:
# Without baseline
loss = -(log_probs * returns).mean()
# With baseline
advantages = returns - baseline
loss = -(log_probs * advantages).mean()
Expected: Baseline reduces variance by ~30-50%, faster convergence.
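To quantify the comparison, one option (a sketch; `returns_with` and `returns_without` are assumed to be your logged episode returns for the two runs) is to compare rolling statistics:

import numpy as np

def rolling_stats(episode_returns, window=50):
    # Rolling mean and standard deviation of episode returns
    r = np.asarray(episode_returns, dtype=np.float64)
    means = np.array([r[i - window:i].mean() for i in range(window, len(r) + 1)])
    stds = np.array([r[i - window:i].std() for i in range(window, len(r) + 1)])
    return means, stds

# e.g. compare the std curves for the run with a baseline vs. the run without one
# means_b, stds_b = rolling_stats(returns_with)
# means_n, stds_n = rolling_stats(returns_without)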
Add entropy bonus to encourage exploration:
entropy = -(probs * torch.log(probs)).sum()
loss = policy_loss - entropy_coef * entropy
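If you already build the action distribution with torch.distributions.Categorical (as in TODO 1), its built-in entropy() method gives the same quantity and avoids log(0) issues when a probability underflows to zero:

from torch.distributions import Categorical

dist = Categorical(logits=logits)    # logits from the policy network
entropy = dist.entropy().mean()      # equals -(probs * log probs).sum(-1), averaged
loss = policy_loss - entropy_coef * entropy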
Implement GAE(λ) for better bias-variance trade-off:
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # `values` must hold one estimate per state plus a bootstrap value for the final
    # state, i.e. len(values) == len(rewards) + 1 (use 0.0 if the episode terminated)
    advantages = []
    gae = 0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages
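A sketch of how compute_gae might be called. It assumes a separate value network (value_net below is hypothetical and not part of the base activity) that estimates V(s) for each state, with `states` and `rewards` being the lists collected during one episode:

import torch

values = [value_net(torch.as_tensor(s, dtype=torch.float32)).item() for s in states]
values.append(0.0)   # bootstrap value for the terminal state (0 because the episode ended)

advantages = compute_gae(rewards, values, gamma=0.99, lam=0.95)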
Extend to continuous control (LunarLanderContinuous-v2):
class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_size=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_size), nn.ReLU(),
            nn.Linear(hidden_size, hidden_size), nn.ReLU(),
        )
        self.mean_layer = nn.Linear(hidden_size, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))   # learnable parameter

    def forward(self, state):
        features = self.body(state)
        mean = self.mean_layer(features)
        return mean, torch.exp(self.log_std)

    def select_action(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum()   # sum over action dimensions
        return action, log_prob
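When stepping LunarLanderContinuous-v2, keep sampled actions inside the environment's [-1, 1] action bounds; one simple option is to clip them. The snippet below assumes `state` and `env` from your training loop and the Gymnasium five-value step API (adjust if you are on classic gym):

action, log_prob = policy.select_action(torch.as_tensor(state, dtype=torch.float32))
clipped = action.clamp(-1.0, 1.0).numpy()       # stay inside the Box(-1, 1) action space
next_state, reward, terminated, truncated, info = env.step(clipped)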
Collect N episodes before each update (more stable gradients):
for update in range(num_updates):
    batch = []
    # Collect N episodes
    for _ in range(episodes_per_update):
        episode_data = collect_episode(env, policy)
        batch.append(episode_data)
    # Update on the collected batch
    update_policy(policy, batch)
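collect_episode and update_policy are placeholders in the loop above. A minimal sketch of what collect_episode could look like, assuming the Gymnasium reset/step API and the select_action method from TODO 1:

def collect_episode(env, policy):
    # Run one episode and return the log-probs and rewards it produced
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        action, log_prob = policy.select_action(state)
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        log_probs.append(log_prob)
        rewards.append(reward)
    return log_probs, rewards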
Completed Notebook: activity-04-policy-gradient-methods.ipynb
Performance Report: Brief summary including:
Visualizations:
After completing this activity:
In the next lesson, you'll combine policy gradients with value functions to get the best of both worlds!
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy your first policy gradient agent! 🌙🚀