By completing this activity, you will:
Runtime -> Run all (or press Ctrl+F9)

Expected First Run Time: ~90 seconds
The template comes with 65% working code:
Location: Section 4 - "Policy Network"
Current State: Network structure is defined but forward pass is incomplete
Your Task: Complete the forward pass to convert states to action probabilities
Requirements:
Starter Code Provided:
```python
class PolicyNetwork(nn.Module):
    def forward(self, state):
        # TODO: Implement forward pass
        # Hint: x = F.relu(self.fc1(state))
        # Hint: logits = self.fc2(x)
        # Hint: action_probs = F.softmax(logits, dim=-1)
        pass
```
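For reference, the hints above assemble into something like the sketch below. The layer sizes are assumptions for CartPole (4-dimensional state, 2 actions) with a hidden width of 128, not values taken from the template:

```python
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetworkSketch(nn.Module):
    """Hypothetical completed network -- layer sizes are assumed, not from the template."""
    def __init__(self, state_dim=4, n_actions=2, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)  # state -> hidden features
        self.fc2 = nn.Linear(hidden, n_actions)  # hidden features -> action logits

    def forward(self, state):
        x = F.relu(self.fc1(state))              # nonlinear hidden layer
        logits = self.fc2(x)                     # unnormalized action scores
        return F.softmax(logits, dim=-1)         # action probabilities (sum to 1)
```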
Success Criteria:
Location: Section 6 - "Training Algorithm"
Current State: Episode collection works but policy updates are missing
Your Task: Implement the REINFORCE update rule using policy gradient theorem
The Policy Gradient Theorem:
∇J(θ) = E[∇log π(a|s) × G_t]
Where:
- π(a|s) = policy probability of action a in state s
- G_t = return (discounted sum of future rewards)
- ∇log π(a|s) = score function (gradient of log probability)

PyTorch Implementation:
```python
# For each (state, action, return) in episode:
# 1. Get action probabilities: probs = policy_network(state)
# 2. Calculate log probability: log_prob = torch.log(probs[action])
# 3. Calculate loss: -log_prob × return
# 4. Accumulate loss over episode
# 5. Backpropagate and update
```
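A tiny runnable illustration of steps 1-3 with made-up numbers; the `probs` tensor stands in for the network output and is not the actual `policy_network`:

```python
import torch

probs = torch.tensor([0.7, 0.3], requires_grad=True)  # pretend output of policy_network(state)
action, G = 0, 5.0                                     # sampled action and its return
log_prob = torch.log(probs[action])                    # step 2: log π(a|s)
loss = -log_prob * G                                   # step 3: negative, because optimizers minimize
loss.backward()                                        # steps 4-5 happen once per episode in practice
print(probs.grad)                                      # negative grad for action 0 -> gradient descent raises its probability
```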
Starter Code Provided:
```python
def reinforce_update(policy_net, optimizer, episode_data, gamma=0.99):
    # TODO: Implement REINFORCE algorithm
    # 1. Calculate returns from rewards
    # 2. For each step, compute log probability
    # 3. Compute policy gradient loss
    # 4. Backpropagate and update policy
    pass
```
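For orientation, a minimal sketch of one way the pieces fit together, assuming `episode_data` is a list of `(state_tensor, action_index, reward)` tuples; that format (and the function name) is an assumption, not taken from the template:

```python
import torch

def reinforce_update_sketch(policy_net, optimizer, episode_data, gamma=0.99):
    """Hypothetical REINFORCE update; episode_data = [(state, action, reward), ...]."""
    # 1. Discounted returns, computed backwards: G_t = r_t + gamma * G_{t+1}
    returns, G = [], 0.0
    for _, _, reward in reversed(episode_data):
        G = reward + gamma * G
        returns.insert(0, G)

    # 2-3. Accumulate the per-step losses: -log π(a_t|s_t) × G_t
    loss = 0.0
    for (state, action, _), G_t in zip(episode_data, returns):
        probs = policy_net(state)
        log_prob = torch.log(probs[action] + 1e-8)  # epsilon guards against log(0)
        loss = loss - log_prob * G_t

    # 4. Backpropagate and update the policy parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```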
Success Criteria:
Key Insight:
Location: Section 7 - "Variance Reduction"
Current State: REINFORCE works but has high variance
Your Task: Reduce gradient variance using a baseline
Why Baselines Help:

Subtracting a baseline that does not depend on the chosen action leaves the expected gradient unchanged but shrinks its variance, so updates are less noisy and learning is more stable.
Modified Update Rule:
∇J(θ) = E[∇log π(a|s) × (G_t - b)]
Where b = average return over episode
Requirements:
- b = mean(all_returns_in_episode)
- A_t = G_t - b for each timestep

Starter Code Provided:
```python
def reinforce_with_baseline(policy_net, optimizer, episode_data, gamma=0.99):
    # TODO: Implement baseline subtraction
    # 1. Calculate returns (same as TODO 2)
    # 2. Calculate baseline: baseline = mean(returns)
    # 3. Calculate advantages: advantages = returns - baseline
    # 4. Use advantages in loss instead of returns
    pass
```
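The only new piece relative to TODO 2 is the advantage computation; a small sketch with made-up return values:

```python
import torch

returns = torch.tensor([10.0, 8.0, 5.0, 1.0])  # illustrative G_t values for one episode
baseline = returns.mean()                       # b = average return over the episode
advantages = returns - baseline                 # A_t = G_t - b

# In the loss, weight each -log π(a_t|s_t) by advantages[t] instead of returns[t].
```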
Success Criteria:
Once you've completed all TODOs, try these advanced challenges:
Implement Actor-Critic instead of pure REINFORCE:
Implement GAE(λ) for better bias-variance tradeoff:
A^GAE_t = Σ_k (γλ)^k δ_{t+k}, where δ_t = R + γV(s') - V(s)

Add entropy bonus to encourage exploration:
H(π) = -Σ_a π(a|s) log π(a|s)

Per-step loss with entropy bonus: -log_prob × advantage - β × entropy (see the sketch after this list)

Test your REINFORCE implementation on other environments:
Implement simplified Proximal Policy Optimization:
r = π_new(a|s) / π_old(a|s)

Clipped objective: min(r × A, clip(r, 1-ε, 1+ε) × A)
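As referenced in the entropy-bonus challenge above, a small sketch of how the entropy term enters the loss; the tensor values and β = 0.01 are arbitrary choices for illustration:

```python
import torch

probs = torch.tensor([0.6, 0.4])                        # illustrative π(a|s) for one state
log_probs = torch.log(probs + 1e-8)
entropy = -(probs * log_probs).sum()                    # H(π) = -Σ π(a|s) log π(a|s)

beta, advantage, action = 0.01, 2.0, 0                  # assumed coefficient and sampled step
loss = -log_probs[action] * advantage - beta * entropy  # higher entropy -> lower loss -> more exploration
```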
Minimum Requirements (for passing):

Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution:
- Use torch.log(), not torch.log10()
- Use torch.log(probs + 1e-8) to avoid log(0)

Solution:

- Use -log_prob × return (note the negative sign!)

Solution:

- Normalize returns: returns = (returns - mean) / (std + 1e-8)

Solution: CartPole is lightweight, so this shouldn't happen. If it does:

- device = torch.device('cpu')
- torch.cuda.empty_cache()
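Two of these fixes in one runnable snippet (the tensor values are made up):

```python
import torch

# Safe log: add a small epsilon so log(0) never appears
probs = torch.tensor([1.0, 0.0])
safe_log = torch.log(probs + 1e-8)

# Normalize returns to reduce variance across episodes
returns = torch.tensor([9.0, 7.5, 5.0, 2.0])
normalized = (returns - returns.mean()) / (returns.std() + 1e-8)
```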
| Aspect | Value-Based (Q-Learning) | Policy-Based (REINFORCE) |
|---|---|---|
| What it learns | Q(s,a) values | π(a|s) probabilities |
| Action selection | argmax Q(s,a) | Sample from π(a|s) |
| Storage | Q-table or Q-network | Policy network |
| Exploration | Epsilon-greedy | Stochastic policy |
| Continuous actions | Hard (must discretize) | Natural (can output continuous distributions) |
| Convergence | Can diverge with function approximation | More stable with neural networks |
| Sample efficiency | More efficient (TD learning) | Less efficient (Monte Carlo) |
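To make the "Action selection" row concrete, a small sketch contrasting the two approaches (the tensors are illustrative, not real network outputs):

```python
import torch

q_values = torch.tensor([1.2, 0.7])           # value-based: act greedily on Q(s, a)
greedy_action = torch.argmax(q_values).item()

probs = torch.tensor([0.8, 0.2])              # policy-based: sample from π(a|s)
sampled_action = torch.multinomial(probs, num_samples=1).item()
```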
Goal: Maximize expected return J(θ) = E[G_t]
Problem: Can't differentiate expectation over stochastic policy
Solution: Policy gradient theorem transforms it to:
∇J(θ) = E[∇log π(a|s) × G_t]
Why it works:
- ∇log π(a|s) tells us which direction in parameter space increases the probability of action a
- Weighting by G_t pushes up the probability of high-return actions more strongly
- PyTorch automatically computes ∇log π(a|s) via backpropagation!
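The step that makes this work is the log-derivative trick. For a one-step (bandit-style) objective, where the return G(a) depends only on the sampled action, the gradient moves inside the expectation:

∇_θ E[G] = ∇_θ Σ_a π_θ(a|s) × G(a) = Σ_a ∇_θ π_θ(a|s) × G(a)

Using ∇_θ π_θ(a|s) = π_θ(a|s) × ∇_θ log π_θ(a|s), this becomes

Σ_a π_θ(a|s) × ∇_θ log π_θ(a|s) × G(a) = E[∇_θ log π_θ(a|s) × G(a)]

The same identity applied to whole trajectories gives the theorem above, and sampling episodes gives an unbiased estimate of this expectation.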
Submission Checklist:
- activity-04-[YourName].ipynb

After completing this activity:
Key Insight:
You've now learned BOTH major families of RL algorithms:
In Activity 05, you'll combine them into Actor-Critic, getting the best of both worlds!
Comparison to Activity 02:
Good luck! Policy gradients are the foundation of modern deep RL (PPO, TRPO, SAC). Master this, and you'll understand how robots learn to walk and game AIs reach championship level! 🚀